Hi team,
ClawBench is already referenced in your list (paper_by_key/paper_benchmark.md, paper_by_key/paper_long-horizon_tasks.md, paper_by_env/paper_web.md), so I'm opening this as an informational/structural suggestion rather than an addition request.
Some context on the project, in case it's useful:
- 153 everyday tasks on 144 live production websites (15 life categories)
- Unlike WebArena/VisualWebArena, it runs directly on real sites via a submission-interception layer (CDP + a Chrome extension) that blocks only the final write request (a rough sketch of the idea follows this list)
- 7 frontier models evaluated; the best passes 33.3% of tasks
- Paper: https://arxiv.org/abs/2604.08523
- Repo: https://github.com/reacher-z/ClawBench
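
For readers unfamiliar with the live-site execution model, here is a minimal, illustrative sketch of what "block only the final write request" means in practice. This is not ClawBench's implementation (which uses CDP plus a Chrome extension); the Playwright routing, the `is_final_write()` predicate, and the `"**/*"` pattern below are placeholders chosen for brevity.

```python
# Illustrative only: intercept browser traffic and abort just the final
# state-changing submission, so nothing is ever committed on the live site.
# ClawBench's real layer is CDP + a Chrome extension; this uses Playwright
# (which drives Chromium over CDP) purely to keep the sketch short.
from playwright.sync_api import sync_playwright, Route, Request

WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def is_final_write(request: Request) -> bool:
    # Placeholder predicate: a real harness would match the task's specific
    # submission endpoint (checkout, booking, form post, ...), not every
    # state-changing request.
    return request.method in WRITE_METHODS

def handle(route: Route, request: Request) -> None:
    if is_final_write(request):
        # Block only the final write; the agent's action is observed but never lands.
        route.abort("blockedbyclient")
    else:
        # Reads, navigations, and static assets pass through untouched.
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.route("**/*", handle)  # inspect every request before it leaves the browser
    page.goto("https://example.com")
    # ... agent drives the real site here; only the submission is intercepted ...
    browser.close()
```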
Would you consider creating a dedicated 'live-site benchmarks' subsection (ClawBench, Online-Mind2Web, and similar) to separate them from sandboxed benchmarks (WebArena/VisualWebArena)? The execution-model differences are substantive enough that readers looking specifically for production-site evals currently have to cross-reference multiple entries.
Happy to help assemble the subsection if useful.
Disclosure: I'm affiliated with ClawBench.