feat(stirrup_agent): per-task timeout + walltime-resilient failure routing #1367
Merged
7dca2ee to
d9bb4a8
agronskiy
added a commit
that referenced
this pull request
May 19, 2026
… env STIRRUP_PER_TASK_TIMEOUT_S) Signed-off-by: Alex Gronskiy <[email protected]>
d9bb4a8 to
52388fa
agronskiy
added a commit
that referenced
this pull request
May 19, 2026
… env STIRRUP_PER_TASK_TIMEOUT_S) Signed-off-by: Alex Gronskiy <[email protected]>
52388fa to
ca6bb3b
agronskiy
added a commit
that referenced
this pull request
May 19, 2026
…lient failure routing) Signed-off-by: Alex Gronskiy <[email protected]>
Kh4L
previously approved these changes
May 19, 2026
ca6bb3b to
2606328
…uting
Adds a default per-task timeout of 3h30m (env STIRRUP_PER_TASK_TIMEOUT_S), classifies
every dispatched-task failure at the two _build_failed_run_payload callsites,
and routes it through new sentinels read by the rollout dispatcher:
- kill_shaped → no row written anywhere; resume's set-difference on the
main rollouts jsonl re-dispatches naturally (bounded per-attempt by the
timeout above)
- timeout_exceeded → 1 sidecar entry, _ng_failure_terminal=True, never retried
- skipped → 1 sidecar entry, terminal=True
- transient (5xx, conn err, etc.) → sidecar entry per attempt; retried up to
NEMO_GYM_MAX_ROLLOUT_ATTEMPTS (default 3) on resume
- legitimate → sidecar entry per attempt; retried up to max_attempts
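The routing above can be read as a small policy table. The following is a minimal sketch, not the PR's code: `FailurePolicy`, `POLICIES`, and `route_failure` are hypothetical names, and returning `None` stands in for the `_ng_no_persist` sentinel. Only the class names and the `_ng_failure_class` / `_ng_failure_terminal` fields come from the PR text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailurePolicy:
    """How one failure class is persisted (hypothetical structure)."""
    persist: bool   # write a sidecar row for this attempt?
    terminal: bool  # if True, the row is never retried on resume


# Class names come from the commit message; the table layout is an assumption.
POLICIES = {
    "kill_shaped": FailurePolicy(persist=False, terminal=False),
    "timeout_exceeded": FailurePolicy(persist=True, terminal=True),
    "skipped": FailurePolicy(persist=True, terminal=True),
    "transient": FailurePolicy(persist=True, terminal=False),
    "legitimate": FailurePolicy(persist=True, terminal=False),
}


def route_failure(failure_class: str, row: dict) -> Optional[dict]:
    """Return the sidecar row to write, or None when nothing is persisted."""
    policy = POLICIES[failure_class]
    if not policy.persist:
        # kill_shaped: absence from the main jsonl is what triggers the retry.
        return None
    return {
        **row,
        "_ng_failure_class": failure_class,
        "_ng_failure_terminal": policy.terminal,
    }
```

Non-terminal rows still count toward the attempt limit when the sidecar is read back on resume.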
Rollout-side detection (app.py:_classify_rollout_failure) uses Ray's
actor-died classes (RayActorError, WorkerCrashedError, NodeDiedError,
OutOfMemoryError, LocalRayletDiedError) plus a user-code-in-traceback
fallback for RayTaskError. The fallback distinguishes a real user
exception (frames under responses_api_agents/ or stirrup/) from Ray's
internal post-mortem (e.g. summary-builder hitting a vanished worker log
after Slurm's epilogue scrubbed /tmp/ray) — the latter is the
walltime / SIGTERM signature and routes to kill_shaped.
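The kill-shaped check described above might look roughly like the sketch below. It is an illustration under assumptions, not the code in `app.py`: exception classes are matched by name so the snippet runs without Ray installed (the real function presumably uses `isinstance` against `ray.exceptions`), and "user code in traceback" is approximated by searching the formatted exception text for the two path fragments named in the PR. It covers only the `kill_shaped` vs `legitimate` decision.

```python
import traceback

# Names of the Ray exception classes the PR treats as "actor died".
KILL_SHAPED_NAMES = {
    "RayActorError", "WorkerCrashedError", "NodeDiedError",
    "OutOfMemoryError", "LocalRayletDiedError",
}
# Path fragments that mark user code, as listed in the PR.
USER_CODE_MARKERS = ("responses_api_agents/", "stirrup/")


def mentions_user_code(exc: BaseException) -> bool:
    """True if the formatted exception (message + traceback) names user code."""
    text = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return any(marker in text for marker in USER_CODE_MARKERS)


def classify_rollout_failure(exc: BaseException) -> str:
    """Hypothetical stand-in for _classify_rollout_failure."""
    names = {cls.__name__ for cls in type(exc).__mro__}
    if names & KILL_SHAPED_NAMES:
        return "kill_shaped"
    if "RayTaskError" in names and not mentions_user_code(exc):
        # Ray-internal post-mortem with no user frames: walltime / SIGTERM shape.
        return "kill_shaped"
    # Anything unrecognised falls back to the bounded-retry class.
    return "legitimate"
```

Defaulting to `legitimate` means an unrecognised exception gets bounded retries rather than unbounded re-dispatch.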
Verify-side detection (app.py:_classify_verify_failure) treats aiohttp 5xx,
ClientConnectionError and asyncio.TimeoutError as transient; everything
else as legitimate.
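A comparable sketch for the verify side, again under assumptions: it avoids importing aiohttp by matching `ClientConnectionError` by class name and by reading a `status` attribute, which is the attribute `aiohttp.ClientResponseError` exposes. The function name is a hypothetical stand-in for `_classify_verify_failure`.

```python
import asyncio


def classify_verify_failure(exc: BaseException) -> str:
    """Hypothetical stand-in for _classify_verify_failure."""
    if isinstance(exc, asyncio.TimeoutError):
        return "transient"
    names = {cls.__name__ for cls in type(exc).__mro__}
    if "ClientConnectionError" in names:
        return "transient"
    # Duck-typed HTTP status check; only 5xx counts as transient.
    status = getattr(exc, "status", None)
    if isinstance(status, int) and 500 <= status < 600:
        return "transient"
    return "legitimate"
```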
Dispatcher (nemo_gym/rollout_collection.py):
- Successes still written to <output_jsonl_fpath>.
- Failures written to <stem>_failures.jsonl (one row per attempt).
- _ng_no_persist rows skipped on both files.
- _load_from_cache reads BOTH files: main jsonl as the success ledger,
sidecar to count attempts and identify terminal rows; the retry set
is materialized_inputs minus (successes ∪ terminal ∪ maxed_out).
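The set arithmetic in the last bullet can be sketched as follows. This is an illustrative reconstruction, not the body of `_load_from_cache`: the helper names, the JSON field names `task_index` / `rollout_index`, and the assumption that every row in the main jsonl is a success are all hypothetical.

```python
import json
from collections import Counter
from pathlib import Path

Key = tuple[int, int]  # (task_index, rollout_index)


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSONL file, returning [] if it does not exist yet."""
    if not path.exists():
        return []
    with path.open() as fh:
        return [json.loads(line) for line in fh if line.strip()]


def row_key(row: dict) -> Key:
    return (row["task_index"], row["rollout_index"])


def retry_set(materialized_inputs: set[Key], main_jsonl: Path,
              sidecar_jsonl: Path, max_attempts: int = 3) -> set[Key]:
    """materialized_inputs minus (successes | terminal | maxed_out)."""
    successes = {row_key(r) for r in read_jsonl(main_jsonl)}
    failures = read_jsonl(sidecar_jsonl)
    terminal = {row_key(r) for r in failures if r.get("_ng_failure_terminal")}
    # One sidecar row per attempt, so counting rows counts attempts.
    attempts = Counter(row_key(r) for r in failures)
    maxed_out = {k for k, n in attempts.items() if n >= max_attempts}
    return materialized_inputs - (successes | terminal | maxed_out)
```

A kill-shaped rollout appears in neither file, so it stays in the retry set without any special handling.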
Without this, walltime-killed in-flight rollouts permanently disappear on
chain-hop 2 (the old _load_from_cache dedup keyed on (task_index,
rollout_index) regardless of -failed status, so synthetic -failed rows
from SIGTERM looked already-done to the resumer).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>
2606328 to
81a1cb5
Kh4L
approved these changes
May 20, 2026
Summary
Two coupled changes that together fix walltime-induced rollout loss on multi-node GDPVal runs.
1. Per-task timeout: default 3h30m, env `STIRRUP_PER_TASK_TIMEOUT_S` overrides. Wraps `await future` in `asyncio.wait_for`; on timeout, `ray.cancel(future, force=True)` + raises `TaskPerAttemptTimeoutError`. Logs once per process at first dispatch.
2. Failure classification + sidecar routing: at the two `_build_failed_run_payload` callsites, every failure is classified into one of five classes and persisted accordingly:
   - `kill_shaped` (Ray actor died, SIGTERM, OOM, node failure): no row written anywhere; re-dispatched on resume by the set-difference on the main jsonl.
   - `timeout_exceeded`: 1 sidecar entry, `_ng_failure_terminal=True`, never retried.
   - `skipped` (`TaskSampleSkipError`): 1 sidecar entry, terminal.
   - `transient` (verify-side 5xx, `ConnectionError`, `asyncio.TimeoutError`): sidecar entry per attempt; retried up to `NEMO_GYM_MAX_ROLLOUT_ATTEMPTS` (default 3).
   - `legitimate` (real Python exception with user code in traceback): sidecar entry per attempt; retried up to max attempts.

Successes still write to `<output_jsonl_fpath>`; failures write to `<stem>_failures.jsonl`. `_load_from_cache` reads both: main jsonl is the success ledger, sidecar tracks attempts + terminal flags. The retry set is `materialized_inputs − (successes ∪ terminal ∪ maxed_out)`.

Why this matters
Without change 1, a single pathological task that exceeds Slurm walltime can permanently consume every chain-hop's compute and never complete.
Without change 2, walltime-killed in-flight rollouts permanently disappear on chain-hop 2. The old `_load_from_cache` dedup keyed on `(task_index, rollout_index)` regardless of `-failed` status, so synthetic `-failed` rows written during the SIGTERM grace window looked already-done to the resumer. Worse, under harsh kills (SIGKILL, OOM) the `-failed` row write itself is non-atomic, so we couldn't depend on it being there to filter. Using "row absent from main jsonl" as the canonical "needs retry" signal sidesteps both problems: `kill_shaped` writes nothing at all, so the disk state survives arbitrary kill timing.

See the debug write-up for the production incident and design rationale: https://gitlab-master.nvidia.com/agronskiy/idea/-/blob/main/reports/debug/20260519T1011-gdpval-missing-histories.md
Kill-shaped detection
`_classify_rollout_failure` uses Ray's actor-died classes (`RayActorError`, `WorkerCrashedError`, `NodeDiedError`, `OutOfMemoryError`, `LocalRayletDiedError`) plus a user-code-in-traceback fallback for `RayTaskError`. The fallback distinguishes a real user exception (frames under `responses_api_agents/` or `stirrup/`) from Ray's internal post-mortem (e.g. summary-builder hitting a vanished worker log after Slurm's epilogue scrubbed `/tmp/ray`); the latter is the walltime / SIGTERM signature and routes to `kill_shaped`. Detection fails open: if Ray's exception surface drifts, everything goes to bounded-retry `legitimate` instead of unbounded-retry `kill_shaped`, which is safe, not catastrophic.

Knobs
- `STIRRUP_PER_TASK_TIMEOUT_S`: per-attempt timeout (default 12600 s = 3h30m).
- `NEMO_GYM_MAX_ROLLOUT_ATTEMPTS`: max retries per `(task_index, rollout_index)` (default 3).

Test plan
- With `STIRRUP_PER_TASK_TIMEOUT_S=60`, confirm the log line is emitted once and that long-running rollouts get cancelled cleanly with `_ng_failure_class=timeout_exceeded` in `<stem>_failures.jsonl`.
- After a SIGTERM kill (`scancel -s TERM`), verify `_failures.jsonl` is unchanged (`kill_shaped` writes nothing) and that on chain-hop 2 the killed rollouts re-dispatch.
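
The per-task timeout exercised by the first test-plan item can be sketched as below. This is a simplified illustration under assumptions, not the PR's implementation: `await_with_timeout`, `per_task_timeout_s`, and the `on_timeout` callback are hypothetical names, the exception class is a local stand-in, and the callback takes the place of `ray.cancel(future, force=True)` so the snippet runs without Ray.

```python
import asyncio
import logging
import os
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)
DEFAULT_TIMEOUT_S = 12600.0  # 3h30m, the default stated in the PR
_logged = False  # ensures the timeout is logged once per process


class TaskPerAttemptTimeoutError(Exception):
    """Local stand-in for the exception named in the PR."""


def per_task_timeout_s() -> float:
    """Read the env override, falling back to the default."""
    return float(os.environ.get("STIRRUP_PER_TASK_TIMEOUT_S", DEFAULT_TIMEOUT_S))


async def await_with_timeout(awaitable: Awaitable[T],
                             on_timeout: Callable[[], None]) -> T:
    """Await with a per-attempt deadline; cancel and re-raise on expiry."""
    global _logged
    timeout_s = per_task_timeout_s()
    if not _logged:
        logger.info("per-task timeout: %.0f s", timeout_s)
        _logged = True
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        on_timeout()  # the PR calls ray.cancel(future, force=True) here
        raise TaskPerAttemptTimeoutError(f"exceeded {timeout_s:.0f} s") from exc
```

Raising a dedicated exception type lets the failure classifier map the timeout to `timeout_exceeded` without inspecting messages.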