Hi team,
ClawBench is already referenced in your list (paper_by_key/paper_benchmark.md, paper_by_key/paper_long-horizon_tasks.md, paper_by_env/paper_web.md), so I'm opening this as an informational/structural suggestion rather than an addition request.
Some context on the project, in case it's useful:
- 153 everyday tasks on 144 live production websites (15 life categories)
- Unlike WebArena/VisualWebArena, it runs directly on real sites via a submission-interception layer (CDP + a Chrome extension) that blocks only the final write request (a rough sketch of the idea follows this list)
- 7 frontier models evaluated; the best passes 33.3% of tasks
- Paper: https://arxiv.org/abs/2604.08523
- Repo: https://github.com/reacher-z/ClawBench
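
For readers unfamiliar with the live-site execution model, here is a minimal, illustrative sketch of what "block only the final write request" means in practice. This is not ClawBench's implementation (which uses CDP plus a Chrome extension); the Playwright routing, the `is_final_write()` predicate, and the `"**/*"` pattern below are placeholders chosen for brevity.

```python
# Illustrative only: intercept browser traffic and abort just the final
# state-changing submission, so nothing is ever committed on the live site.
# ClawBench's real layer is CDP + a Chrome extension; this uses Playwright
# (which drives Chromium over CDP) purely to keep the sketch short.
from playwright.sync_api import sync_playwright, Route, Request

WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def is_final_write(request: Request) -> bool:
    # Placeholder predicate: a real harness would match the task's specific
    # submission endpoint (checkout, booking, form post, ...), not every
    # state-changing request.
    return request.method in WRITE_METHODS

def handle(route: Route, request: Request) -> None:
    if is_final_write(request):
        # Block only the final write; the agent's action is observed but never lands.
        route.abort("blockedbyclient")
    else:
        # Reads, navigations, and static assets pass through untouched.
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.route("**/*", handle)  # inspect every request before it leaves the browser
    page.goto("https://example.com")
    # ... agent drives the real site here; only the submission is intercepted ...
    browser.close()
```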
Would you consider creating a dedicated 'live-site benchmarks' subsection (ClawBench, Online-Mind2Web, and similar) to separate them from sandboxed benchmarks (WebArena/VisualWebArena)? The execution-model differences are substantive enough that readers looking specifically for production-site evals currently have to cross-reference multiple entries.
Happy to help assemble the subsection if useful.
Disclosure: I'm affiliated with ClawBench.