feat(stirrup_agent): per-task timeout + walltime-resilient failure routing by agronskiy · Pull Request #1367 · NVIDIA-NeMo/Gym

agronskiy · 2026-05-19T12:22:33Z

Summary

Two coupled changes that together fix walltime-induced rollout loss on multi-node GDPVal runs.

1. Per-task timeout — default 3h30m, env STIRRUP_PER_TASK_TIMEOUT_S overrides. Wraps await future in asyncio.wait_for; on timeout, ray.cancel(future, force=True) + raises TaskPerAttemptTimeoutError. Logs once per process at first dispatch.

2. Failure classification + sidecar routing — at the two _build_failed_run_payload callsites, every failure is classified into one of five classes and persisted accordingly:

class	persist	retry on chain-hop 2
`kill_shaped` (Ray actor died, SIGTERM, OOM, node failure)	NO row anywhere	yes, unbounded (per-attempt timeout bounds wallclock)
`timeout_exceeded`	1 sidecar entry, `_ng_failure_terminal=True`	no
`skipped` (TaskSampleSkipError)	1 sidecar entry, terminal	no
`transient` (verify-side 5xx, ConnectionError, asyncio.TimeoutError)	sidecar entry per attempt	yes, up to `NEMO_GYM_MAX_ROLLOUT_ATTEMPTS` (default 3)
`legitimate` (real Python exception with user code in traceback)	sidecar entry per attempt	yes, up to max_attempts

Successes still write to <output_jsonl_fpath>; failures write to <stem>_failures.jsonl. _load_from_cache reads both: main jsonl is the success ledger, sidecar tracks attempts + terminal flags. The retry set is materialized_inputs − (successes ∪ terminal ∪ maxed_out).

Why this matters

Without #1, a single pathological task that exceeds Slurm walltime can permanently consume every chain-hop's compute and never complete.

Without #2, walltime-killed in-flight rollouts permanently disappear on chain-hop 2. The old _load_from_cache dedup keyed on (task_index, rollout_index) regardless of -failed status, so synthetic -failed rows written during the SIGTERM grace window looked already-done to the resumer. Worse, under harsh kills (SIGKILL, OOM) the -failed row write itself is non-atomic — we couldn't depend on it being there to filter. Using "row absent from main jsonl" as the canonical "needs retry" signal sidesteps both problems: kill_shaped writes nothing at all, so the disk state survives arbitrary kill timing.

See the debug write-up for the production incident and design rationale: https://gitlab-master.nvidia.com/agronskiy/idea/-/blob/main/reports/debug/20260519T1011-gdpval-missing-histories.md

Kill-shaped detection

_classify_rollout_failure uses Ray's actor-died classes (RayActorError, WorkerCrashedError, NodeDiedError, OutOfMemoryError, LocalRayletDiedError) plus a user-code-in-traceback fallback for RayTaskError. The fallback distinguishes a real user exception (frames under `responses_api_agents/` or `stirrup/`) from Ray's internal post-mortem (e.g. summary-builder hitting a vanished worker log after Slurm's epilogue scrubbed /tmp/ray) — the latter is the walltime / SIGTERM signature and routes to kill_shaped. Detection fails open: if Ray's exception surface drifts, everything goes to bounded-retry legitimate instead of unbounded-retry kill_shaped — safe, not catastrophic.

Knobs

STIRRUP_PER_TASK_TIMEOUT_S — per-attempt timeout (default 12600 s = 3h30m).
NEMO_GYM_MAX_ROLLOUT_ATTEMPTS — max retries per (task_index, rollout_index) (default 3).

Test plan

`python -m py_compile` clean for both edited files.
ruff format clean.
Suggested smoke: run a GDPVal eval with STIRRUP_PER_TASK_TIMEOUT_S=60, confirm the log line is emitted once and that long-running rollouts get cancelled cleanly with _ng_failure_class=timeout_exceeded in <stem>_failures.jsonl.
Suggested integration: deliberately kill the deployment srun mid-run (scancel -s TERM), verify _failures.jsonl is unchanged (kill_shaped writes nothing) and that on chain-hop 2 the killed rollouts re-dispatch.

… env STIRRUP_PER_TASK_TIMEOUT_S) Signed-off-by: Alex Gronskiy <[email protected]>

…lient failure routing) Signed-off-by: Alex Gronskiy <[email protected]>

…uting Default per-task timeout 3h30m (env STIRRUP_PER_TASK_TIMEOUT_S), classifies every dispatched-task failure at the two _build_failed_run_payload callsites and routes it through new sentinels read by the rollout dispatcher: kill_shaped → no row written anywhere; resume's set-difference on the main rollouts jsonl re-dispatches naturally (bounded per-attempt by the timeout above) timeout_exceeded → 1 sidecar entry, _ng_failure_terminal=True, never retried skipped → 1 sidecar entry, terminal=True transient (5xx, → sidecar entry per attempt; retried up to conn err, etc.) NEMO_GYM_MAX_ROLLOUT_ATTEMPTS (default 3) on resume legitimate → sidecar entry per attempt; retried up to max_attempts Rollout-side detection (app.py:_classify_rollout_failure) uses Ray's actor-died classes (RayActorError, WorkerCrashedError, NodeDiedError, OutOfMemoryError, LocalRayletDiedError) plus a user-code-in-traceback fallback for RayTaskError. The fallback distinguishes a real user exception (frames under responses_api_agents/ or stirrup/) from Ray's internal post-mortem (e.g. summary-builder hitting a vanished worker log after Slurm's epilogue scrubbed /tmp/ray) — the latter is the walltime / SIGTERM signature and routes to kill_shaped. Verify-side detection (app.py:_classify_verify_failure) treats aiohttp 5xx, ClientConnectionError and asyncio.TimeoutError as transient; everything else as legitimate. Dispatcher (nemo_gym/rollout_collection.py): - Successes still written to <output_jsonl_fpath>. - Failures written to <stem>_failures.jsonl (one row per attempt). - _ng_no_persist rows skipped on both files. - _load_from_cache reads BOTH files: main jsonl as the success ledger, sidecar to count attempts and identify terminal rows; the retry set is materialized_inputs minus (successes ∪ terminal ∪ maxed_out). Without this, walltime-killed in-flight rollouts permanently disappear on chain-hop 2 (the old _load_from_cache dedup keyed on (task_index, rollout_index) regardless of -failed status, so synthetic -failed rows from SIGTERM looked already-done to the resumer). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> Signed-off-by: Alex Gronskiy <[email protected]>

Signed-off-by: Alex Gronskiy <[email protected]>

copy-pr-bot Bot temporarily deployed to public May 19, 2026 12:23 Inactive

copy-pr-bot Bot temporarily deployed to public May 19, 2026 12:24 Inactive

agronskiy force-pushed the agronskiy/feat/gdpval-stirrup-per-task-timeout branch from 7dca2ee to d9bb4a8 Compare May 19, 2026 13:00

agronskiy changed the title ~~feat(stirrup_agent): per-task timeout (default 3h30m, env STIRRUP_PER_TASK_TIMEOUT)~~ feat(stirrup_agent): per-task timeout (default 3h30m, env STIRRUP_PER_TASK_TIMEOUT_S) May 19, 2026

agronskiy added a commit that referenced this pull request May 19, 2026

Merge PR #1367 (feat(stirrup_agent): per-task timeout, default 3h30m,…

71377ca

… env STIRRUP_PER_TASK_TIMEOUT_S) Signed-off-by: Alex Gronskiy <[email protected]>

agronskiy force-pushed the agronskiy/feat/gdpval-stirrup-per-task-timeout branch from d9bb4a8 to 52388fa Compare May 19, 2026 13:10

agronskiy added a commit that referenced this pull request May 19, 2026

Merge PR #1367 (feat(stirrup_agent): per-task timeout, default 3h30m,…

8fdbb11

… env STIRRUP_PER_TASK_TIMEOUT_S) Signed-off-by: Alex Gronskiy <[email protected]>

agronskiy force-pushed the agronskiy/feat/gdpval-stirrup-per-task-timeout branch from 52388fa to ca6bb3b Compare May 19, 2026 14:03

agronskiy changed the title ~~feat(stirrup_agent): per-task timeout (default 3h30m, env STIRRUP_PER_TASK_TIMEOUT_S)~~ feat(stirrup_agent): per-task timeout + walltime-resilient failure routing May 19, 2026

agronskiy added a commit that referenced this pull request May 19, 2026

Merge PR #1367 (feat(stirrup_agent): per-task timeout + walltime-resi…

27e290b

…lient failure routing) Signed-off-by: Alex Gronskiy <[email protected]>

Kh4L previously approved these changes May 19, 2026

View reviewed changes

agronskiy dismissed Kh4L’s stale review via 2606328 May 20, 2026 15:31

agronskiy force-pushed the agronskiy/feat/gdpval-stirrup-per-task-timeout branch from ca6bb3b to 2606328 Compare May 20, 2026 15:31

agronskiy and others added 2 commits May 20, 2026 17:52

style: sort imports in stirrup_agent/app.py (ruff I001)

81a1cb5

Signed-off-by: Alex Gronskiy <[email protected]>

agronskiy force-pushed the agronskiy/feat/gdpval-stirrup-per-task-timeout branch from 2606328 to 81a1cb5 Compare May 20, 2026 15:53

agronskiy requested a review from Kh4L May 20, 2026 16:16

Kh4L approved these changes May 20, 2026

View reviewed changes

agronskiy merged commit ce2f093 into main May 20, 2026
5 checks passed

agronskiy deleted the agronskiy/feat/gdpval-stirrup-per-task-timeout branch May 20, 2026 18:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(stirrup_agent): per-task timeout + walltime-resilient failure routing#1367

feat(stirrup_agent): per-task timeout + walltime-resilient failure routing#1367
agronskiy merged 2 commits into
mainfrom
agronskiy/feat/gdpval-stirrup-per-task-timeout

agronskiy commented May 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

agronskiy commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why this matters

Kill-shaped detection

Knobs

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

agronskiy commented May 19, 2026 •

edited

Loading