
Comparison benchmark: ClawBench (live-site browser agent eval) #110

@reacher-z

Description

Hi team,

ClawBench is already referenced in your list (paper_by_key/paper_benchmark.md, paper_by_key/paper_long-horizon_tasks.md, paper_by_env/paper_web.md), so I'm opening this as a structural suggestion rather than an addition request.

Context on the project, in case it's useful:

  • 153 everyday tasks on 144 live production websites (15 life categories)
  • Unlike WebArena/VisualWebArena, it runs directly on real production sites via a submission-interception layer (Chrome DevTools Protocol plus a Chrome extension) that blocks only the final write request (a minimal sketch follows this list)
  • 7 frontier models evaluated; the best passes 33.3% of tasks
  • Paper: https://arxiv.org/abs/2604.08523
  • Repo: https://github.com/reacher-z/ClawBench
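
Mechanically, that kind of interception can be approximated with CDP's Fetch domain. The sketch below is not ClawBench's actual code (it omits the extension half, and the FINAL_WRITE endpoint and target URL are hypothetical placeholders); it only illustrates the pattern of pausing requests and stubbing the single state-changing one, here via Playwright's CDP session:

```python
# Minimal sketch (not ClawBench's implementation) of CDP-based
# submission interception: the agent browses a live site normally,
# but the one final write request is answered synthetically so the
# real backend never receives it.
from playwright.sync_api import sync_playwright

FINAL_WRITE = "/checkout/submit"  # hypothetical: the single write to block

def on_request_paused(client, params):
    req = params["request"]
    if req["method"] == "POST" and FINAL_WRITE in req["url"]:
        # Reply with a synthetic 200 so the page's JS sees "success"
        # while nothing is actually written server-side.
        client.send("Fetch.fulfillRequest", {
            "requestId": params["requestId"],
            "responseCode": 200,
            "body": "",  # base64-encoded body; empty here
        })
    else:
        # Every other request passes through untouched.
        client.send("Fetch.continueRequest", {"requestId": params["requestId"]})

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    client = page.context.new_cdp_session(page)
    client.on("Fetch.requestPaused",
              lambda params: on_request_paused(client, params))
    client.send("Fetch.enable", {"patterns": [{"requestStage": "Request"}]})
    page.goto("https://example.com")  # placeholder for a live task site
    # ... agent drives the page; only the final write is intercepted ...
    browser.close()
```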

Since the paper is already indexed, the concrete ask is organizational: would you consider a dedicated 'live-site benchmarks' subsection (ClawBench, Online-Mind2Web, and similar), separate from the sandboxed benchmarks (WebArena/VisualWebArena)? The execution-model differences are substantive enough that readers looking specifically for production-site evals currently have to cross-reference multiple entries; a possible layout is sketched below.
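
For concreteness, one possible shape for the subsection (the grouping mirrors the live-site vs. sandbox distinction above; heading style and any entry text beyond the benchmark names are placeholders to adapt to your list's conventions):

```markdown
### Live-site benchmarks (real production websites)
- ClawBench: 153 tasks on 144 live sites; final write request intercepted
- Online-Mind2Web: ...

### Sandboxed benchmarks (self-hosted site replicas)
- WebArena: ...
- VisualWebArena: ...
```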

Happy to help assemble the subsection if useful.

Disclosure: I'm affiliated with ClawBench.
