structured_outputs: add LLM-as-a-judge semantic verification #1334
Open
schauhan54 wants to merge 1 commit into
jkyi-nvidia
requested changes
May 16, 2026
Contributor
Can we add example rollouts here with the semantic reward fields filled in?
Extends the structured_outputs resource server to support per-task semantic evaluation via LLM judge calls, running independently alongside the existing syntax (schema validation) reward.

Changes:
- Config: judge_model_server, judge_responses_create_params, judge_prompt_template, pass_label/fail_label, parallel_evaluation, reward_mode
- Request: semantic_verifier_config field (per-task criteria from JSONL)
- Response: semantic_reward (0.0-1.0 weighted aggregate), semantic_results (per-criterion detail)
- verify(): runs semantic evaluation regardless of syntax result
- reward_mode: combined (default) multiplies syntax x semantic; independent returns syntax only with semantic reported separately

Bug fixes:
- strictify_schema(): skip conditional/composition keywords (if/then/else, oneOf/anyOf/allOf, $defs/definitions) during recursion and recurse into list values
- coerce_xml_types(): handle nullable union types like ["integer", "null"] by extracting the non-null type

Backward compatibility:
- All new fields optional -- existing tasks are unaffected
- The existing reward field (syntax) is unchanged

Co-Authored-By: Claude Opus 4.6 <[email protected]>
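To make the nullable-union bug fix above concrete, here is a minimal sketch of extracting the non-null type from a JSON Schema `type` value before coercing a text value. The helper names `non_null_type` and `coerce_text` are hypothetical and are not the PR's `coerce_xml_types()` implementation; treating an empty string as null is also an assumption made for illustration.

```python
from typing import Any, Optional, Union


def non_null_type(schema_type: Union[str, list, None]) -> Optional[str]:
    """Return the single non-null type name, or None if it is ambiguous."""
    if isinstance(schema_type, str):
        return schema_type
    if isinstance(schema_type, list):
        candidates = [t for t in schema_type if t != "null"]
        if len(candidates) == 1:
            return candidates[0]
    return None  # missing type, or a union of several non-null types


def coerce_text(text: str, schema_type: Union[str, list, None]) -> Any:
    """Coerce a text value to the Python type implied by schema_type."""
    nullable = isinstance(schema_type, list) and "null" in schema_type
    # Assumption: an empty value under a nullable type means null.
    if nullable and text.strip() == "":
        return None
    target = non_null_type(schema_type)
    if target == "integer":
        return int(text)
    if target == "number":
        return float(text)
    if target == "boolean":
        return text.strip().lower() == "true"
    return text  # strings and ambiguous unions are left untouched
```

With this sketch, `coerce_text("42", ["integer", "null"])` yields the integer 42 rather than leaving the value as a string because the `type` was a list.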
Summary
- semantic_verifier_config with llmaaj criteria, each carrying custom rubric text and a weight (major/minor)
- multichallenge and math_with_judge (ModelServerRef, parallel asyncio.gather calls, [[PASS]]/[[FAIL]] verdict parsing)

Changes
- Config: judge_model_server, judge_responses_create_params, judge_prompt_template, pass_label/fail_label, parallel_evaluation, reward_mode
- Request: semantic_verifier_config field (per-task criteria from task JSONL)
- Response: semantic_reward (0.0-1.0 weighted aggregate), semantic_results (per-criterion detail)
- /v1/responses, parses verdict labels
- reward_mode: combined (default) multiplies syntax × semantic; independent returns syntax only with semantic reported separately

Backward compatibility
- All new fields default to None/False -- existing tasks without semantic_verifier_config are unaffected
- If judge_model_server is not configured, semantic evaluation is skipped (returns None)
- The existing reward field (syntax) is unchanged

Example rollouts
examples/example_tasks.jsonl contains 5 GPT-5.4 rollouts with LLMaaJ semantic criteria (3 syntax fail, 2 syntax pass). Reference Gym config at configs/structured_outputs_semantic.yaml (not tested directly; it shows how to wire up the judge in a real deployment).

How to validate
Config knobs
- reward_mode: combined (default): reward = syntax × semantic. independent: reward = syntax only, semantic reported separately
- judge_model_server
- judge_prompt_template ({model_output}, {rubric})
- judge_system_message
- pass_label / fail_label ([[PASS]] / [[FAIL]])
- parallel_evaluation

Test plan
- Tasks without semantic_verifier_config return semantic_reward: null
- asyncio.gather
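As a rough illustration of how the pieces described in this PR could fit together (verdict-label parsing, parallel judge calls with `asyncio.gather`, a weighted aggregate, and the two `reward_mode` settings), here is a self-contained sketch. It is not the PR's code: `stub_judge` stands in for a real call to the judge model server, the numeric values in `WEIGHTS` are assumed (the description only names major/minor), and returning the syntax reward when no semantic score exists is an assumption based on the backward-compatibility notes.

```python
import asyncio
from typing import Optional

PASS_LABEL = "[[PASS]]"
FAIL_LABEL = "[[FAIL]]"

# Assumption: illustrative numeric weights for the major/minor levels.
WEIGHTS = {"major": 2.0, "minor": 1.0}


def parse_verdict(judge_text: str) -> bool:
    """Pass only when the pass label appears and the fail label does not."""
    return PASS_LABEL in judge_text and FAIL_LABEL not in judge_text


async def stub_judge(rubric: str, model_output: str) -> str:
    """Hypothetical judge: passes when the rubric keyword occurs in the output."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return PASS_LABEL if rubric.lower() in model_output.lower() else FAIL_LABEL


async def evaluate(criteria: list, model_output: str) -> list:
    """Judge all criteria concurrently; gather keeps results in input order."""
    verdicts = await asyncio.gather(
        *(stub_judge(c["rubric"], model_output) for c in criteria)
    )
    return [(c["weight"], parse_verdict(v)) for c, v in zip(criteria, verdicts)]


def semantic_reward(results: list) -> Optional[float]:
    """Weighted fraction of passed criteria in [0.0, 1.0]; None if no criteria."""
    total = sum(WEIGHTS[w] for w, _ in results)
    if total == 0:
        return None
    return sum(WEIGHTS[w] for w, ok in results if ok) / total


def final_reward(syntax: float, semantic: Optional[float],
                 reward_mode: str = "combined") -> float:
    """combined multiplies syntax by semantic; independent returns syntax only."""
    if reward_mode not in ("combined", "independent"):
        raise ValueError(f"unknown reward_mode: {reward_mode}")
    if semantic is None or reward_mode == "independent":
        return syntax
    return syntax * semantic


if __name__ == "__main__":
    criteria = [
        {"rubric": "paris", "weight": "major"},
        {"rubric": "berlin", "weight": "minor"},
    ]
    results = asyncio.run(evaluate(criteria, "The capital is Paris."))
    score = semantic_reward(results)
    print(results, score, final_reward(1.0, score))
```

In this sketch the major criterion passes and the minor one fails, so the semantic score is 2/3 under the assumed weights; with `reward_mode="independent"` the reward would stay at the syntax value while the semantic score is reported alongside it.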