Skip to content

feat(stirrup_agent): per-task timeout + walltime-resilient failure routing#1367

Merged
agronskiy merged 2 commits into
mainfrom
agronskiy/feat/gdpval-stirrup-per-task-timeout
May 20, 2026
Merged

feat(stirrup_agent): per-task timeout + walltime-resilient failure routing#1367
agronskiy merged 2 commits into
mainfrom
agronskiy/feat/gdpval-stirrup-per-task-timeout

Conversation

@agronskiy
Copy link
Copy Markdown
Contributor

@agronskiy agronskiy commented May 19, 2026

Summary

Two coupled changes that together fix walltime-induced rollout loss on multi-node GDPVal runs.

1. Per-task timeout — default 3h30m, env STIRRUP_PER_TASK_TIMEOUT_S overrides. Wraps await future in asyncio.wait_for; on timeout, ray.cancel(future, force=True) + raises TaskPerAttemptTimeoutError. Logs once per process at first dispatch.

2. Failure classification + sidecar routing — at the two _build_failed_run_payload callsites, every failure is classified into one of five classes and persisted accordingly:

class persist retry on chain-hop 2
kill_shaped (Ray actor died, SIGTERM, OOM, node failure) NO row anywhere yes, unbounded (per-attempt timeout bounds wallclock)
timeout_exceeded 1 sidecar entry, _ng_failure_terminal=True no
skipped (TaskSampleSkipError) 1 sidecar entry, terminal no
transient (verify-side 5xx, ConnectionError, asyncio.TimeoutError) sidecar entry per attempt yes, up to NEMO_GYM_MAX_ROLLOUT_ATTEMPTS (default 3)
legitimate (real Python exception with user code in traceback) sidecar entry per attempt yes, up to max_attempts

Successes still write to <output_jsonl_fpath>; failures write to <stem>_failures.jsonl. _load_from_cache reads both: main jsonl is the success ledger, sidecar tracks attempts + terminal flags. The retry set is materialized_inputs − (successes ∪ terminal ∪ maxed_out).

Why this matters

Without #1, a single pathological task that exceeds Slurm walltime can permanently consume every chain-hop's compute and never complete.

Without #2, walltime-killed in-flight rollouts permanently disappear on chain-hop 2. The old _load_from_cache dedup keyed on (task_index, rollout_index) regardless of -failed status, so synthetic -failed rows written during the SIGTERM grace window looked already-done to the resumer. Worse, under harsh kills (SIGKILL, OOM) the -failed row write itself is non-atomic — we couldn't depend on it being there to filter. Using "row absent from main jsonl" as the canonical "needs retry" signal sidesteps both problems: kill_shaped writes nothing at all, so the disk state survives arbitrary kill timing.

See the debug write-up for the production incident and design rationale: https://gitlab-master.nvidia.com/agronskiy/idea/-/blob/main/reports/debug/20260519T1011-gdpval-missing-histories.md

Kill-shaped detection

_classify_rollout_failure uses Ray's actor-died classes (RayActorError, WorkerCrashedError, NodeDiedError, OutOfMemoryError, LocalRayletDiedError) plus a user-code-in-traceback fallback for RayTaskError. The fallback distinguishes a real user exception (frames under `responses_api_agents/` or `stirrup/`) from Ray's internal post-mortem (e.g. summary-builder hitting a vanished worker log after Slurm's epilogue scrubbed /tmp/ray) — the latter is the walltime / SIGTERM signature and routes to kill_shaped. Detection fails open: if Ray's exception surface drifts, everything goes to bounded-retry legitimate instead of unbounded-retry kill_shaped — safe, not catastrophic.

Knobs

  • STIRRUP_PER_TASK_TIMEOUT_S — per-attempt timeout (default 12600 s = 3h30m).
  • NEMO_GYM_MAX_ROLLOUT_ATTEMPTS — max retries per (task_index, rollout_index) (default 3).

Test plan

  • `python -m py_compile` clean for both edited files.
  • ruff format clean.
  • Suggested smoke: run a GDPVal eval with STIRRUP_PER_TASK_TIMEOUT_S=60, confirm the log line is emitted once and that long-running rollouts get cancelled cleanly with _ng_failure_class=timeout_exceeded in <stem>_failures.jsonl.
  • Suggested integration: deliberately kill the deployment srun mid-run (scancel -s TERM), verify _failures.jsonl is unchanged (kill_shaped writes nothing) and that on chain-hop 2 the killed rollouts re-dispatch.

@agronskiy agronskiy force-pushed the agronskiy/feat/gdpval-stirrup-per-task-timeout branch from 7dca2ee to d9bb4a8 Compare May 19, 2026 13:00
@agronskiy agronskiy changed the title feat(stirrup_agent): per-task timeout (default 3h30m, env STIRRUP_PER_TASK_TIMEOUT) feat(stirrup_agent): per-task timeout (default 3h30m, env STIRRUP_PER_TASK_TIMEOUT_S) May 19, 2026
agronskiy added a commit that referenced this pull request May 19, 2026
… env STIRRUP_PER_TASK_TIMEOUT_S)

Signed-off-by: Alex Gronskiy <[email protected]>
@agronskiy agronskiy force-pushed the agronskiy/feat/gdpval-stirrup-per-task-timeout branch from d9bb4a8 to 52388fa Compare May 19, 2026 13:10
agronskiy added a commit that referenced this pull request May 19, 2026
… env STIRRUP_PER_TASK_TIMEOUT_S)

Signed-off-by: Alex Gronskiy <[email protected]>
@agronskiy agronskiy force-pushed the agronskiy/feat/gdpval-stirrup-per-task-timeout branch from 52388fa to ca6bb3b Compare May 19, 2026 14:03
@agronskiy agronskiy changed the title feat(stirrup_agent): per-task timeout (default 3h30m, env STIRRUP_PER_TASK_TIMEOUT_S) feat(stirrup_agent): per-task timeout + walltime-resilient failure routing May 19, 2026
agronskiy added a commit that referenced this pull request May 19, 2026
…lient failure routing)

Signed-off-by: Alex Gronskiy <[email protected]>
Kh4L
Kh4L previously approved these changes May 19, 2026
@agronskiy agronskiy force-pushed the agronskiy/feat/gdpval-stirrup-per-task-timeout branch from ca6bb3b to 2606328 Compare May 20, 2026 15:31
agronskiy and others added 2 commits May 20, 2026 17:52
…uting

Default per-task timeout 3h30m (env STIRRUP_PER_TASK_TIMEOUT_S), classifies
every dispatched-task failure at the two _build_failed_run_payload callsites
and routes it through new sentinels read by the rollout dispatcher:

  kill_shaped       → no row written anywhere; resume's set-difference on the
                      main rollouts jsonl re-dispatches naturally (bounded
                      per-attempt by the timeout above)
  timeout_exceeded  → 1 sidecar entry, _ng_failure_terminal=True, never retried
  skipped           → 1 sidecar entry, terminal=True
  transient (5xx,   → sidecar entry per attempt; retried up to
   conn err, etc.)    NEMO_GYM_MAX_ROLLOUT_ATTEMPTS (default 3) on resume
  legitimate        → sidecar entry per attempt; retried up to max_attempts

Rollout-side detection (app.py:_classify_rollout_failure) uses Ray's
actor-died classes (RayActorError, WorkerCrashedError, NodeDiedError,
OutOfMemoryError, LocalRayletDiedError) plus a user-code-in-traceback
fallback for RayTaskError. The fallback distinguishes a real user
exception (frames under responses_api_agents/ or stirrup/) from Ray's
internal post-mortem (e.g. summary-builder hitting a vanished worker log
after Slurm's epilogue scrubbed /tmp/ray) — the latter is the
walltime / SIGTERM signature and routes to kill_shaped.

Verify-side detection (app.py:_classify_verify_failure) treats aiohttp 5xx,
ClientConnectionError and asyncio.TimeoutError as transient; everything
else as legitimate.

Dispatcher (nemo_gym/rollout_collection.py):
- Successes still written to <output_jsonl_fpath>.
- Failures written to <stem>_failures.jsonl (one row per attempt).
- _ng_no_persist rows skipped on both files.
- _load_from_cache reads BOTH files: main jsonl as the success ledger,
  sidecar to count attempts and identify terminal rows; the retry set
  is materialized_inputs minus (successes ∪ terminal ∪ maxed_out).

Without this, walltime-killed in-flight rollouts permanently disappear on
chain-hop 2 (the old _load_from_cache dedup keyed on (task_index,
rollout_index) regardless of -failed status, so synthetic -failed rows
from SIGTERM looked already-done to the resumer).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>
@agronskiy agronskiy force-pushed the agronskiy/feat/gdpval-stirrup-per-task-timeout branch from 2606328 to 81a1cb5 Compare May 20, 2026 15:53
@agronskiy agronskiy requested a review from Kh4L May 20, 2026 16:16
@agronskiy agronskiy merged commit ce2f093 into main May 20, 2026
5 checks passed
@agronskiy agronskiy deleted the agronskiy/feat/gdpval-stirrup-per-task-timeout branch May 20, 2026 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants