agentic-evaluation

Here are 4 public repositories matching this topic...

InternLM / WildClawBench

An in-the-wild benchmark for AI agents in the OpenClaw Environment.

benchmarks agents agentic-ai openclaw agentic-evaluation

Updated May 19, 2026
Python

SammyReifel / alienbench

A SnitchBench-style benchmark inverted for the dark-forest problem: does a listener AI alert humans about an alien signal when alerting may doom humanity?

ai-safety tool-use dark-forest llm-evaluation ai-benchmark agentic-evaluation snitchbench

Updated Apr 27, 2026
TypeScript

contactvaibhavi / GVR-Bench

A pipeline for investigating structured reasoning and instruction adherence in multimodal LLMs.

benchmark robustness grounding out-of-distribution neuro-symbolic robustness-verification instruction-following trustworthy-ai large-language-models faithfulness hallucination-detection agentic-ai llm-alignment agentic-evaluation agentic-reasoning deterministic-eval

Updated May 28, 2026
Python

wild-balthazar224 / claw-bench

Measure AI agents’ performance with standardized tests across 314 tasks, 33 domains, and 4 difficulty levels for clear, reproducible comparison.

agent benchmark ai discovery openai benchmarks codex claude ai4science llm ai-scientist clawdbot openclaw agentic-evaluation auto-research research-claw

Updated May 29, 2026
Python
