structured_outputs: add LLM-as-a-judge semantic verification by schauhan54 · Pull Request #1334 · NVIDIA-NeMo/Gym

schauhan54 · 2026-05-15T17:07:40Z

Summary

Extends the structured_outputs resource server to support per-task semantic evaluation via LLM judge calls, running independently alongside the existing syntax (schema validation) reward
Tasks can include a semantic_verifier_config with llmaaj criteria, each carrying custom rubric text and a weight (major/minor)
Follows the same judge wiring pattern as multichallenge and math_with_judge (ModelServerRef, parallel asyncio.gather calls, [[PASS]]/[[FAIL]] verdict parsing)
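
The verdict parsing and parallel fan-out mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `parse_verdict`, `fake_judge`, and `evaluate_all` are hypothetical names, the judge is a local stand-in rather than a real /v1/responses call, and the rule that the last label in the reply wins (with a missing label counted as a failure) is an assumption.

```python
import asyncio

PASS_LABEL = "[[PASS]]"
FAIL_LABEL = "[[FAIL]]"


def parse_verdict(judge_text: str, pass_label: str = PASS_LABEL,
                  fail_label: str = FAIL_LABEL) -> bool:
    """Return True only if the last verdict label in the reply is the pass label.

    Assumption: a reply with no label at all counts as a failure.
    """
    pass_pos = judge_text.rfind(pass_label)
    fail_pos = judge_text.rfind(fail_label)
    # rfind returns -1 for a missing label, so an absent pass label never wins.
    return pass_pos > fail_pos


async def fake_judge(rubric: str, model_output: str) -> str:
    """Hypothetical stand-in for a judge call: passes if the rubric keyword appears."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    label = PASS_LABEL if rubric.lower() in model_output.lower() else FAIL_LABEL
    return f"Reasoning omitted. {label}"


async def evaluate_all(rubrics: list[str], model_output: str) -> list[bool]:
    # Fan out one judge call per criterion; gather preserves input order.
    replies = await asyncio.gather(*(fake_judge(r, model_output) for r in rubrics))
    return [parse_verdict(reply) for reply in replies]


verdicts = asyncio.run(evaluate_all(["invoice", "refund"], '{"type": "invoice"}'))
```

Because `asyncio.gather` returns results in the order the awaitables were passed, each verdict can be matched back to its criterion by index.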

Changes

Config: judge_model_server, judge_responses_create_params, judge_prompt_template, pass_label/fail_label, parallel_evaluation, reward_mode
Request: semantic_verifier_config field (per-task criteria from task JSONL)
Response: semantic_reward (0.0-1.0 weighted aggregate), semantic_results (per-criterion detail)
verify(): runs semantic evaluation regardless of syntax result -- both rewards returned independently
_evaluate_criterion(): builds judge prompt from per-task rubric, calls /v1/responses, parses verdict labels
_aggregate_semantic(): weighted score aggregation (major=2x weight, minor=1x)
reward_mode: combined (default) multiplies syntax × semantic; independent returns syntax only with semantic reported separately
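
As a rough illustration of the aggregation and the two reward modes listed above, the sketch below assumes the semantic score is the passed weight divided by the total weight, and that a missing semantic score leaves the syntax reward untouched. The function names and the `(weight, passed)` tuple shape are hypothetical and may differ from `_aggregate_semantic()` in the PR.

```python
from typing import Optional

# Weights as described in the PR: major counts twice as much as minor.
WEIGHTS = {"major": 2.0, "minor": 1.0}


def aggregate_semantic(results: list[tuple[str, bool]]) -> Optional[float]:
    """Weighted pass fraction in [0.0, 1.0], or None when there are no criteria.

    Assumption: score = (sum of weights of passed criteria) / (sum of all weights).
    """
    if not results:
        return None
    total = sum(WEIGHTS[weight] for weight, _ in results)
    earned = sum(WEIGHTS[weight] for weight, passed in results if passed)
    return earned / total


def final_reward(syntax: float, semantic: Optional[float],
                 reward_mode: str = "combined") -> float:
    """Combine syntax and semantic rewards according to reward_mode."""
    if reward_mode not in ("combined", "independent"):
        raise ValueError(f"unknown reward_mode: {reward_mode!r}")
    # Assumption: with no semantic score, the syntax reward is returned as-is.
    if reward_mode == "independent" or semantic is None:
        return syntax
    return syntax * semantic


results = [("major", True), ("minor", False), ("minor", True)]
semantic = aggregate_semantic(results)          # (2 + 1) / (2 + 1 + 1)
combined = final_reward(1.0, semantic, "combined")
independent = final_reward(1.0, semantic, "independent")
```

In this example one major and one minor criterion pass out of a total weight of 4, so the semantic score is 0.75; combined mode scales the syntax reward by that value, while independent mode leaves it at 1.0.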

Backward compatibility

All new fields are optional with defaults of None/False -- existing tasks without semantic_verifier_config are unaffected
When judge_model_server is not configured, semantic evaluation is skipped (returns None)
The existing reward field (syntax) is unchanged

Example rollouts

examples/example_tasks.jsonl contains 5 GPT-5.4 rollouts with LLMaaJ semantic criteria (3 syntax fail, 2 syntax pass). A reference Gym config is at configs/structured_outputs_semantic.yaml (not tested directly; it shows how to wire up the judge in a real deployment).

How to validate

# 1. Unit tests (no API key needed)
cd resources_servers/structured_outputs
pytest tests/test_app.py -v

# 2. Verify gold outputs pass syntax (no API key needed)
python examples/test_verify_tasks.py --tasks examples/example_tasks.jsonl

# 3. Replay stored rollouts through verify() (no API key needed)
python examples/run_evaluation.py \
  --gym-path . \
  --tasks examples/example_tasks.jsonl \
  --gold-only

# 4. Fresh model rollout with judge (needs OPENAI_API_KEY)
python examples/run_evaluation.py \
  --gym-path . \
  --api-key $OPENAI_API_KEY \
  --model gpt-4o \
  --judge-model gpt-4o \
  --tasks examples/example_tasks.jsonl \
  --reward-mode combined

# 5. Compare reward modes (same command, swap flag)
python examples/run_evaluation.py \
  --gym-path . \
  --api-key $OPENAI_API_KEY \
  --model gpt-4o \
  --judge-model gpt-4o \
  --tasks examples/example_tasks.jsonl \
  --reward-mode independent

# 6. Run on your own tasks directory
python examples/run_evaluation.py \
  --gym-path . \
  --api-key $OPENAI_API_KEY \
  --model gpt-4o \
  --judge-model gpt-4o \
  --tasks /path/to/tasks/ \
  --max-tasks 10 \
  --reward-mode combined

# 7. E2E test with real OpenAI judge (needs OPENAI_API_KEY + TASKS_DIR)
OPENAI_API_KEY=sk-... TASKS_DIR=/path/to/tasks MAX_TASKS=5 \
  python test_e2e_openai_judge.py

Config knobs

| Knob | Description |
| --- | --- |
| `reward_mode` | `combined` (default): reward = syntax × semantic. `independent`: reward = syntax only, semantic reported separately |
| `judge_model_server` | Which model server to use for judge calls |
| `judge_prompt_template` | Customizable judge prompt (format vars: `{model_output}`, `{rubric}`) |
| `judge_system_message` | System message for judge calls |
| `pass_label` / `fail_label` | Verdict markers (default `[[PASS]]`/`[[FAIL]]`) |
| `parallel_evaluation` | Run criteria in parallel (default true) |
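
To illustrate how the `{model_output}` and `{rubric}` format variables might be filled in, here is a minimal sketch. The template text and the `build_judge_prompt` helper are hypothetical, not the defaults shipped in this PR, and it assumes the template is expanded with Python's `str.format`.

```python
# Hypothetical template; the real default in the PR may be worded differently.
JUDGE_PROMPT_TEMPLATE = (
    "Evaluate the model output against the rubric.\n\n"
    "Rubric:\n{rubric}\n\n"
    "Model output:\n{model_output}\n\n"
    "End your reply with [[PASS]] or [[FAIL]]."
)


def build_judge_prompt(template: str, rubric: str, model_output: str) -> str:
    """Fill the two format variables. Substituted values are not re-parsed,
    so braces inside a JSON model output are safe."""
    return template.format(rubric=rubric, model_output=model_output)


prompt = build_judge_prompt(
    JUDGE_PROMPT_TEMPLATE,
    rubric="The summary field must mention the invoice total.",
    model_output='{"summary": "Invoice total is $42."}',
)
```

One caveat under this assumption: literal braces in the template itself (for example, an inline JSON example) would need to be doubled as `{{` and `}}`, otherwise `str.format` raises an error.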

Test plan

21 unit tests (9 new mocked-judge tests covering both reward modes, weight aggregation, judge errors, edge cases)
Gold outputs pass syntax: 5/5
Gold-only replay (no judge): 5/5 reward=1.00 (semantic defaults to 1.0)
Fresh model rollout + combined mode: syntax × semantic = combined reward
Fresh model rollout + independent mode: reward = syntax only, semantic reported separately
E2E with real OpenAI judge: 10/10 LLMaaJ criteria pass
No regression: tasks without semantic_verifier_config return semantic_reward: null
Parallel evaluation: multiple criteria fan out via asyncio.gather

copy-pr-bot · 2026-05-15T17:07:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jkyi-nvidia · 2026-05-16T01:21:52Z

Can we add example rollouts here with the semantic reward fields filled in?

Extends the structured_outputs resource server to support per-task semantic evaluation via LLM judge calls, running independently alongside the existing syntax (schema validation) reward.

Changes:
- Config: judge_model_server, judge_responses_create_params, judge_prompt_template, pass_label/fail_label, parallel_evaluation, reward_mode
- Request: semantic_verifier_config field (per-task criteria from JSONL)
- Response: semantic_reward (0.0-1.0 weighted aggregate), semantic_results (per-criterion detail)
- verify(): runs semantic evaluation regardless of syntax result
- reward_mode: combined (default) multiplies syntax x semantic; independent returns syntax only with semantic reported separately

Bug fixes:
- strictify_schema(): skip conditional/composition keywords (if/then/else, oneOf/anyOf/allOf, $defs/definitions) during recursion and recurse into list values
- coerce_xml_types(): handle nullable union types like ["integer", "null"] by extracting the non-null type

Backward compatibility:
- All new fields optional -- existing tasks are unaffected
- The existing reward field (syntax) is unchanged

Co-Authored-By: Claude Opus 4.6 <[email protected]>

schauhan54 force-pushed the structured-outputs-llmaaj branch from d6a6a33 to 8d34dd3 Compare May 15, 2026 18:10

jkyi-nvidia requested changes May 16, 2026

View reviewed changes

schauhan54 force-pushed the structured-outputs-llmaaj branch 3 times, most recently from c80bd3b to a8d7192 Compare May 18, 2026 20:26

schauhan54 requested a review from jkyi-nvidia May 18, 2026 20:37

schauhan54 force-pushed the structured-outputs-llmaaj branch from a8d7192 to a8c6e83 Compare May 18, 2026 21:51

schauhan54 force-pushed the structured-outputs-llmaaj branch from 0ff25d9 to d09dc84 Compare May 20, 2026 12:36

schauhan54 wants to merge 1 commit into NVIDIA-NeMo:main from schauhan54:structured-outputs-llmaaj
