A forkable template repository with example Auto-Judge implementations for building custom judges.
New here? See the Developer How-To for a step-by-step guide from fork to submission.
Using Claude Code? Type "Setup checklist" to get an interactive walkthrough of all setup steps in your new repository.
This repository contains the code and example approaches used for the TREC Auto-Judge shared tasks, including the evaluation harness that judges the Auto-Judge implementations.
We are developing a step-by-step guide on how to submit at documentation/README.md.
TREC Auto-Judge offers the first rigorous, cross-task benchmark for Large-Language-Model judges.
Large-Language-Model judges have emerged as a pragmatic solution when manual relevance assessment is costly or infeasible. However, recent studies reveal wide variation in accuracy across tasks, prompts, and model sizes.
Currently, shared task organizers choose an LLM judge per track ad hoc, risking inconsistent baselines and hidden biases.
Auto-Judge provides a testbed for comparing different LLM-judge approaches across several tasks and correlating their results against manually created relevance judgments. It also supports studying emerging evaluation approaches, vulnerabilities of LLM judges, and the efficacy of safeguards against those vulnerabilities.
This Auto-Judge evaluation script standardizes data handling and evaluation across multiple shared tasks/TREC tracks that rely on LLM judging and provides a centralized, comparative evaluation of LLM judges under realistic conditions.
This project provides a means to evaluate AutoJudge approaches and produce a system ranking / leaderboard.
It will be used by TREC AutoJudge coordinators to score submissions. We encourage prospective participants to run this locally for method development.
This code will handle obtaining data sets (akin to ir_datasets), input/output and format conversions, and evaluation measures.
- Fork this repository
- Clone
- Create and activate venv
uv venv
source .venv/bin/activate # re-run this in each new shell session
- Minimal install via uv pip (plain pip should also work)
uv pip install -e .
- Optional: installation with all extra tools (includes auto-judge-evaluate)
uv pip install -e ".[all]"
This installs all of the extras listed below.
If you want to be selective about which tools to install:
- Auto-Judge meta-evaluation tools:
uv pip install -e ".[evaluate]"
- Lightweight batteries-included LLM client (used by TinyJudge):
uv pip install -e ".[minima-llm]"
- PyTerrier retrieval (used by the PyTerrier retrieval judge):
uv pip install -e ".[pyterrier]"
- Pytest unit-test infrastructure:
uv pip install -e ".[test]"
Add your own dependencies in pyproject.toml under [project] > dependencies.
After modifying pyproject.toml, re-fetch the dependencies (replace all with your selected extras, and add --refresh to avoid stale package caches):
uv pip install -e ".[all]" --refresh
When installed with [evaluate], the Auto-Judge meta-evaluation package provides CLI commands for:
- leaderboard correlation:
auto-judge-evaluate meta-evaluate --help
- inter-annotator agreement:
auto-judge-evaluate qrel-evaluate --help
- format conversion:
auto-judge-evaluate eval-result --help

See the autojudge-evaluate README and the built-in --help for details.
A judge is any class with a judge() method:
from autojudge_base import Leaderboard, LeaderboardBuilder, LeaderboardSpec, MeasureSpec

MY_SPEC = LeaderboardSpec(measures=(MeasureSpec("MY_SCORE"),))

class MyJudge:
    def judge(self, rag_responses, rag_topics, llm_config, **kwargs) -> Leaderboard:
        builder = LeaderboardBuilder(MY_SPEC)
        for response in rag_responses:
            score = evaluate_response(response)  # your logic here
            builder.add(
                run_id=response.metadata.run_id,
                topic_id=response.metadata.topic_id,
                values={"MY_SCORE": score},
            )
        topic_ids = [t.request_id for t in rag_topics]
        return builder.build(expected_topic_ids=topic_ids, on_missing="fix_aggregate")

Register in workflow.yml:

judge_class: "judges.myjudge.my_judge:MyJudge"

For data class documentation (Report, Request, Leaderboard, etc.), see autojudge-base.
For a full example with LLM calls, see judges/tinyjudge/.
auto-judge run \
--workflow judges/tinyjudge/workflow.yml \
--rag-responses /path/to/responses/ \
--rag-topics /path/to/topics.jsonl \
--out-dir ./output/
# See all options
auto-judge run --workflow judges/tinyjudge/workflow.yml --help

For variants, parameter sweeps, and advanced configurations, see the workflow documentation.
Important: Your judge must use the llm_config parameter passed to judge(). Do not hardcode endpoints or API keys.
The llm_config object (LlmConfigBase) provides basic fields (model, base_url, cache_dir) and keeps the full YAML config in .raw, so you can pass additional parameters through to your LLM backend (here MinimaLlmConfig):
import asyncio

from minima_llm import MinimaLlmConfig, MinimaLlmRequest, OpenAIMinimaLlm

def judge(self, rag_responses, rag_topics, llm_config, **kwargs) -> Leaderboard:
    # Convert to full config for backend features (batching, retry, etc.)
    full_config = MinimaLlmConfig.from_dict(llm_config.raw)
    backend = OpenAIMinimaLlm(full_config)
    # ... your judge logic
    response = asyncio.run(backend.generate(MinimaLlmRequest(
        request_id="example",
        messages=[{"role": "user", "content": "Is this answer relevant? Reply 1 or 0."}],
    )))
    score = float(response.text.strip())

The llm_config object is automatically populated from environment variables and optional config files.
This example uses MinimaLlm, but you can use any LLM backend you prefer (including litellm).
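For orientation, here is a rough sketch of the same call routed through litellm instead. This is an illustrative helper, not part of the starter kit; it assumes an OpenAI-compatible endpoint and reads the API key from the environment:

import os

from litellm import completion

def litellm_relevance_call(llm_config, prompt: str) -> float:
    # Hypothetical helper: route the same relevance prompt through litellm.
    # For non-OpenAI endpoints, litellm may need a provider prefix, e.g. "openai/<model>".
    response = completion(
        model=llm_config.model,                    # e.g. "gpt-4o-mini"
        api_base=llm_config.base_url,              # e.g. "https://api.openai.com/v1"
        api_key=os.environ.get("OPENAI_API_KEY"),
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())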
Set these before running:
export OPENAI_BASE_URL="https://api.openai.com/v1" # or your endpoint
export OPENAI_MODEL="gpt-4o-mini"
export OPENAI_API_KEY="sk-..."
export CACHE_DIR="./cache" # optional, enables prompt caching

Create llm-config.yml:
base_url: "http://localhost:8000/v1"
model: "llama-3.3-70b-instruct"
cache_dir: "./cache"Then pass it via CLI:
auto-judge run --llm-config llm-config.yml --workflow ...

Configuration layers: env → yaml → cli (each layer overrides the previous).
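Conceptually, the precedence works like the following sketch (an illustration only, not the starter kit's actual merge code):

import os

# Later layers win; unset values do not override earlier layers.
env_layer = {"base_url": os.environ.get("OPENAI_BASE_URL"), "model": os.environ.get("OPENAI_MODEL")}
yaml_layer = {"base_url": "http://localhost:8000/v1", "model": "llama-3.3-70b-instruct"}
cli_layer = {"model": "llama-3.3-70b-instruct"}  # e.g. values supplied on the command line

effective = {}
for layer in (env_layer, yaml_layer, cli_layer):
    effective.update({k: v for k, v in layer.items() if v is not None})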
A fully-documented example demonstrating all three protocols:
- ExampleNuggetCreator: Creates nugget questions for topics
- ExampleQrelsCreator: Creates relevance judgments
- ExampleLeaderboardJudge: Scores responses and produces a leaderboard
No LLM calls - all logic is deterministic. Use this as a reference for building judges that use nuggets and qrels.
A simple baseline judge that scores based on:
- Response text length
- Deterministic random score (for baseline comparison)
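For orientation, here is a minimal sketch of such a baseline built with the LeaderboardBuilder API shown above. The measure names, the response.text accessor, and the seeding scheme are illustrative assumptions, not the actual code in judges/naive/:

import random

from autojudge_base import Leaderboard, LeaderboardBuilder, LeaderboardSpec, MeasureSpec

NAIVE_SPEC = LeaderboardSpec(measures=(MeasureSpec("LENGTH"), MeasureSpec("RANDOM")))

class NaiveLikeJudge:
    def judge(self, rag_responses, rag_topics, llm_config, **kwargs) -> Leaderboard:
        builder = LeaderboardBuilder(NAIVE_SPEC)
        for response in rag_responses:
            text = response.text  # hypothetical accessor for the response text
            # Seed on run and topic so the "random" baseline is reproducible.
            rng = random.Random(f"{response.metadata.run_id}-{response.metadata.topic_id}")
            builder.add(
                run_id=response.metadata.run_id,
                topic_id=response.metadata.topic_id,
                values={"LENGTH": float(len(text)), "RANDOM": rng.random()},
            )
        expected = [t.request_id for t in rag_topics]
        return builder.build(expected_topic_ids=expected, on_missing="fix_aggregate")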
Uses PyTerrier retrieval models to score responses:
- Indexes responses per topic
- Runs multiple weighting models (BM25, TF-IDF, etc.)
- Ranks responses by retrieval score
Requires the pyterrier optional dependency.
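The actual pipeline lives in judges/pyterrier_retrieval/; the following is only a rough sketch of the underlying PyTerrier pattern (index one topic's responses, then score them with several weighting models), with hypothetical document contents:

import os

import pyterrier as pt

if not pt.started():
    pt.init()

# Index one topic's responses as documents (docno = run id, text = response text).
docs = [
    {"docno": "run1", "text": "response text produced by run1 ..."},
    {"docno": "run2", "text": "response text produced by run2 ..."},
]
index_ref = pt.IterDictIndexer(os.path.abspath("./pt-index-topic1")).index(iter(docs))

# Score the indexed responses against the topic's query with different weighting models.
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")
tf_idf = pt.BatchRetrieve(index_ref, wmodel="TF_IDF")
print(bm25.search("example topic query"))  # DataFrame with docno, score, rank columns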
A small synthetic dataset for development and testing:
- 5 topics with simple queries
- 4 runs of varying quality
- Useful for validating workflow configurations and quick iteration
# Run your judge against kiddie
auto-judge run \
--workflow judges/naive/workflow.yml \
--rag-responses data/kiddie/runs/repgen/ \
--rag-topics data/kiddie/topics/kiddie-topics.jsonl \
--out-dir ./output-kiddie/

Or run the included smoke test script, which also does meta-evaluation: bash run_kiddie.sh
Use run_all_datasets.py to run a workflow against multiple datasets configured in a YAML file:
python run_all_datasets.py --workflow judges/naive/workflow.yml --datasets data/datasets.yml

To run on more than just kiddie, add entries to datasets.yml:
datasets:
  - name: kiddie
    responses: data/kiddie/runs/repgen/
    topics: data/kiddie/topics/kiddie-topics.jsonl
    prio1_runs: # Used with --runs prio1
      - run1
      - run2
    assessed_topics: # Used with --topics assessed
      - leaf
      - cloud
      - bee

The data/kiddie/eval/ directory contains a synthetic ground-truth leaderboard for testing meta-evaluation:
auto-judge-evaluate meta-evaluate \
--truth-leaderboard data/kiddie/eval/kiddie_fake.eval.ir_measures.txt \
--truth-format ir_measures --truth-header \
--eval-format tot \
--on-missing default \
output-kiddie/*.eval.txt

For real evaluation, obtain official TREC datasets separately.
- Copy an example: Start from judges/complete_example/ or judges/naive/
- Implement the protocol: Your judge class needs:

from pathlib import Path

from autojudge_base import AutoJudge, Leaderboard, NuggetBanks, Report, Request

class MyJudge(AutoJudge):
    nugget_banks_type = NuggetBanks

    def create_nuggets(self, rag_responses, rag_topics, llm_config,
                       filebase: str = "default", outdir: Path = Path("."), **kwargs):
        # Optional: create nugget questions
        return None

    def create_qrels(self, rag_responses, rag_topics, llm_config,
                     filebase: str = "default", outdir: Path = Path("."), **kwargs):
        # Optional: create relevance judgments
        return None

    def judge(self, rag_responses, rag_topics, llm_config,
              filebase: str = "default", outdir: Path = Path("."), **kwargs):
        # Required: produce leaderboard
        return leaderboard

- Configure workflow.yml: Set lifecycle flags, settings, variants
- Run your judge:

auto-judge run --workflow judges/myjudge/workflow.yml ...
auto-judge-starterkit/
├── pyproject.toml # Dependencies and package config
├── README.md # This file
├── run_kiddie.sh # End-to-end smoke test on kiddie
├── judges/
│ ├── complete_example/ # Full protocol example (nuggets, qrels, leaderboard)
│ ├── naive/ # Simple baseline judge
│ ├── tinyjudge/ # Minimal LLM judge example
│ └── pyterrier_retrieval/ # PyTerrier retrieval judge
├── data/
│ └── kiddie/ # Synthetic test dataset
├── documentation/ # Submission guide
└── tests/
└── test_examples.py # Smoke tests
- See judges/complete_example/README.md for detailed protocol documentation
- See autojudge-base for core data classes (Report, Request, Leaderboard, NuggetBanks, etc.) and protocol definitions
MIT
