An in-the-wild benchmark for AI agents in the OpenClaw Environment.
Updated May 19, 2026 - Python
A SnitchBench-style benchmark inverted for the dark-forest problem: does a listener AI alert humans about an alien signal when alerting may doom humanity?
A pipeline for investigating structured reasoning and instruction adherence in multimodal LLMs.
Measure AI agents’ performance with standardized tests across 314 tasks, 33 domains, and 4 difficulty levels for clear, reproducible comparison.