Skip to content

structured_outputs: add LLM-as-a-judge semantic verification #1334

Open
schauhan54 wants to merge 1 commit into NVIDIA-NeMo:main from schauhan54:structured-outputs-llmaaj

schauhan54 commented May 15, 2026

Summary

  • Extends the structured_outputs resource server to support per-task semantic evaluation via LLM judge calls, running independently alongside the existing syntax (schema validation) reward
  • Tasks can include a semantic_verifier_config with llmaaj criteria, each carrying custom rubric text and a weight (major/minor)
  • Follows the same judge wiring pattern as multichallenge and math_with_judge (ModelServerRef, parallel asyncio.gather calls, [[PASS]]/[[FAIL]] verdict parsing)

Changes

  • Config: judge_model_server, judge_responses_create_params, judge_prompt_template, pass_label/fail_label, parallel_evaluation, reward_mode
  • Request: semantic_verifier_config field (per-task criteria from task JSONL)
  • Response: semantic_reward (0.0-1.0 weighted aggregate), semantic_results (per-criterion detail)
  • verify(): runs semantic evaluation regardless of syntax result -- both rewards returned independently
  • _evaluate_criterion(): builds judge prompt from per-task rubric, calls /v1/responses, parses verdict labels
  • _aggregate_semantic(): weighted score aggregation (major=2x weight, minor=1x)
  • reward_mode: combined (default) multiplies syntax × semantic; independent returns syntax only with semantic reported separately
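The aggregation and reward-mode rules listed above can be sketched as follows. This is a minimal illustration, not the PR's code: the function names `aggregate_semantic` and `final_reward`, the per-criterion dict shape, and the choice of a weighted pass rate are assumptions; only the major=2x / minor=1x weights and the combined/independent semantics come from the description. Treating a missing semantic score as neutral in combined mode is also an assumption.

```python
from typing import Optional

# Weights from the PR description: a major criterion counts twice a minor one.
WEIGHTS = {"major": 2.0, "minor": 1.0}


def aggregate_semantic(results: list[dict]) -> Optional[float]:
    """Weighted pass rate in [0.0, 1.0]; None when no criteria were evaluated.

    Each result is assumed (hypothetically) to look like
    {"weight": "major" | "minor", "passed": bool}.
    """
    if not results:
        return None
    total = sum(WEIGHTS[r["weight"]] for r in results)
    passed = sum(WEIGHTS[r["weight"]] for r in results if r["passed"])
    return passed / total


def final_reward(syntax: float, semantic: Optional[float], reward_mode: str = "combined") -> float:
    """Combine the syntax and semantic rewards according to reward_mode."""
    if reward_mode not in ("combined", "independent"):
        raise ValueError(f"unknown reward_mode: {reward_mode!r}")
    # Assumption: a skipped semantic evaluation (None) leaves the syntax reward as-is.
    if reward_mode == "independent" or semantic is None:
        return syntax
    return syntax * semantic
```

With one passing major criterion, one failing minor, and one passing minor, the weighted score is (2 + 1) / (2 + 1 + 1) = 0.75, so a syntax reward of 1.0 yields 0.75 in combined mode and 1.0 in independent mode.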

Backward compatibility

  • All new fields are optional with defaults of None/False -- existing tasks without semantic_verifier_config are unaffected
  • When judge_model_server is not configured, semantic evaluation is skipped (returns None)
  • The existing reward field (syntax) is unchanged

Example rollouts

examples/example_tasks.jsonl contains 5 GPT-5.4 rollouts with LLMaaJ semantic criteria (3 syntax fail, 2 syntax pass). A reference Gym config is at configs/structured_outputs_semantic.yaml (not tested directly; it shows how to wire up the judge in a real deployment).

How to validate

# 1. Unit tests (no API key needed)
cd resources_servers/structured_outputs
pytest tests/test_app.py -v

# 2. Verify gold outputs pass syntax (no API key needed)
python examples/test_verify_tasks.py --tasks examples/example_tasks.jsonl

# 3. Replay stored rollouts through verify() (no API key needed)
python examples/run_evaluation.py \
  --gym-path . \
  --tasks examples/example_tasks.jsonl \
  --gold-only

# 4. Fresh model rollout with judge (needs OPENAI_API_KEY)
python examples/run_evaluation.py \
  --gym-path . \
  --api-key $OPENAI_API_KEY \
  --model gpt-4o \
  --judge-model gpt-4o \
  --tasks examples/example_tasks.jsonl \
  --reward-mode combined

# 5. Compare reward modes (same command, swap flag)
python examples/run_evaluation.py \
  --gym-path . \
  --api-key $OPENAI_API_KEY \
  --model gpt-4o \
  --judge-model gpt-4o \
  --tasks examples/example_tasks.jsonl \
  --reward-mode independent

# 6. Run on your own tasks directory
python examples/run_evaluation.py \
  --gym-path . \
  --api-key $OPENAI_API_KEY \
  --model gpt-4o \
  --judge-model gpt-4o \
  --tasks /path/to/tasks/ \
  --max-tasks 10 \
  --reward-mode combined

# 7. E2E test with real OpenAI judge (needs OPENAI_API_KEY + TASKS_DIR)
OPENAI_API_KEY=sk-... TASKS_DIR=/path/to/tasks MAX_TASKS=5 \
  python test_e2e_openai_judge.py

Config knobs

| Knob | Description |
| --- | --- |
| reward_mode | combined (default): reward = syntax × semantic. independent: reward = syntax only, semantic reported separately |
| judge_model_server | Which model server to use for judge calls |
| judge_prompt_template | Customizable judge prompt (format vars: {model_output}, {rubric}) |
| judge_system_message | System message for judge calls |
| pass_label / fail_label | Verdict markers (default [[PASS]]/[[FAIL]]) |
| parallel_evaluation | Run criteria in parallel (default true) |
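To make the prompt and label knobs concrete, here is a rough sketch of how a template with the `{model_output}` and `{rubric}` format variables might be filled and how a verdict might be read back. The template wording and the helper names `build_judge_prompt` and `parse_verdict` are hypothetical, and so is the rule that the last label wins and that a reply with no label yields `None`; the PR may handle these cases differently.

```python
from typing import Optional

# Hypothetical template text; only the two format variables and the
# default labels are taken from the PR description.
JUDGE_PROMPT_TEMPLATE = (
    "Evaluate the model output against the rubric.\n\n"
    "Rubric:\n{rubric}\n\n"
    "Model output:\n{model_output}\n\n"
    "End your answer with [[PASS]] or [[FAIL]]."
)


def build_judge_prompt(model_output: str, rubric: str,
                       template: str = JUDGE_PROMPT_TEMPLATE) -> str:
    """Fill the template; braces inside the substituted values are left untouched."""
    return template.format(model_output=model_output, rubric=rubric)


def parse_verdict(judge_text: str, pass_label: str = "[[PASS]]",
                  fail_label: str = "[[FAIL]]") -> Optional[bool]:
    """Return True/False for the last label found, or None if neither appears."""
    pass_pos = judge_text.rfind(pass_label)
    fail_pos = judge_text.rfind(fail_label)
    if pass_pos == -1 and fail_pos == -1:
        return None  # assumption: the caller decides how to score a missing verdict
    return pass_pos > fail_pos
```

Reading the last label rather than the first is a design choice that tolerates judges which mention both labels while reasoning before committing to one.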

Test plan

  • 21 unit tests (9 new mocked-judge tests covering both reward modes, weight aggregation, judge errors, edge cases)
  • Gold outputs pass syntax: 5/5
  • Gold-only replay (no judge): 5/5 reward=1.00 (semantic defaults to 1.0)
  • Fresh model rollout + combined mode: syntax × semantic = combined reward
  • Fresh model rollout + independent mode: reward = syntax only, semantic reported separately
  • E2E with real OpenAI judge: 10/10 LLMaaJ criteria pass
  • No regression: tasks without semantic_verifier_config return semantic_reward: null
  • Parallel evaluation: multiple criteria fan out via asyncio.gather
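The fan-out in the last item can be illustrated with a stub judge. This is a sketch under stated assumptions: `fake_judge` stands in for the real /v1/responses call, and `evaluate_all` and the criterion dict shape are hypothetical names, not the PR's API. It relies only on the documented behavior that `asyncio.gather` returns results in the order of its arguments.

```python
import asyncio


async def fake_judge(rubric: str, model_output: str) -> bool:
    """Stand-in for a judge call: passes when the rubric keyword appears in the output."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return rubric in model_output


async def evaluate_all(criteria: list[dict], model_output: str,
                       parallel: bool = True) -> list[dict]:
    """Evaluate every criterion, concurrently or one at a time."""
    if parallel:
        # gather() preserves argument order, so verdicts line up with criteria.
        verdicts = await asyncio.gather(
            *(fake_judge(c["rubric"], model_output) for c in criteria)
        )
    else:
        verdicts = [await fake_judge(c["rubric"], model_output) for c in criteria]
    return [{**c, "passed": v} for c, v in zip(criteria, verdicts)]


if __name__ == "__main__":
    criteria = [
        {"rubric": "name", "weight": "major"},
        {"rubric": "email", "weight": "minor"},
    ]
    print(asyncio.run(evaluate_all(criteria, '{"name": "Ada"}')))
```

Both paths return the same per-criterion results; the parallel path only changes how long the judge calls take in aggregate.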

copy-pr-bot Bot commented May 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

schauhan54 force-pushed the structured-outputs-llmaaj branch from d6a6a33 to 8d34dd3 May 15, 2026 18:10
Comment thread resources_servers/structured_outputs/app.py
Can we add example rollouts here with the semantic reward fields filled in?

schauhan54 force-pushed the structured-outputs-llmaaj branch 3 times, most recently from c80bd3b to a8d7192 May 18, 2026 20:26
schauhan54 requested a review from jkyi-nvidia May 18, 2026 20:37
schauhan54 force-pushed the structured-outputs-llmaaj branch from a8d7192 to a8c6e83 May 18, 2026 21:51
Extends the structured_outputs resource server to support per-task
semantic evaluation via LLM judge calls, running independently
alongside the existing syntax (schema validation) reward.

Changes:
- Config: judge_model_server, judge_responses_create_params,
  judge_prompt_template, pass_label/fail_label, parallel_evaluation,
  reward_mode
- Request: semantic_verifier_config field (per-task criteria from JSONL)
- Response: semantic_reward (0.0-1.0 weighted aggregate),
  semantic_results (per-criterion detail)
- verify(): runs semantic evaluation regardless of syntax result
- reward_mode: combined (default) multiplies syntax x semantic;
  independent returns syntax only with semantic reported separately

Bug fixes:
- strictify_schema(): skip conditional/composition keywords
  (if/then/else, oneOf/anyOf/allOf, $defs/definitions) during recursion
  and recurse into list values
- coerce_xml_types(): handle nullable union types like
  ["integer", "null"] by extracting the non-null type
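As an illustration of the nullable-union fix described above, a minimal sketch might look like the following. The helper names `non_null_type` and `coerce_value` are hypothetical and the coercion rules are assumptions; the only behavior taken from the commit message is that a union such as `["integer", "null"]` resolves to its non-null type.

```python
from typing import Union

SchemaType = Union[str, list[str]]


def non_null_type(schema_type: SchemaType) -> SchemaType:
    """Collapse a nullable union such as ["integer", "null"] to "integer".

    Unions with more than one non-null member are returned unchanged,
    since there is no single type to coerce to.
    """
    if isinstance(schema_type, list):
        non_null = [t for t in schema_type if t != "null"]
        if len(non_null) == 1:
            return non_null[0]
    return schema_type


def coerce_value(raw: str, schema_type: SchemaType):
    """Coerce XML text to the Python value suggested by the JSON Schema type."""
    resolved = non_null_type(schema_type)
    if resolved == "integer":
        return int(raw)
    if resolved == "number":
        return float(raw)
    if resolved == "boolean":
        return raw.strip().lower() == "true"
    return raw  # strings and unresolved unions are passed through as text
```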

Backward compatibility:
- All new fields optional -- existing tasks are unaffected
- The existing reward field (syntax) is unchanged

Co-Authored-By: Claude Opus 4.6 <[email protected]>
schauhan54 force-pushed the structured-outputs-llmaaj branch from 0ff25d9 to d09dc84 May 20, 2026 12:36
2 participants