Symptom
When the Hermes gateway process dies without a clean shutdown (SIGKILL, OOM, crash), the --no-viewer bridge process becomes orphaned:
- stdin gets EOF from the OS (parent pipe FDs close on process exit)
- activeStdio.done resolves
- Bridge calls core.shutdown() → flush() → L2/L3 LLM calls
- On a large DB under lock contention, this blocks indefinitely
- With the parent process gone, nobody remains to send SIGKILL
Over ~36 hours, 19 orphaned bridges accumulated, each consuming 14-29% CPU and fighting for the SQLite write lock. Every turn.end sync timed out at 30s. The DB grew ~150MB from duplicate traces (each zombie bridge independently wrote the same turns — one turn ID had 6,572 copies).
Root Cause
core.shutdown() has no timeout. The flush() pipeline chains through capture drain → reward drain → L2 drain (LLM call, ~5s each) → L3 drain → skills/feedback flush. On a large DB under lock contention, this blocks indefinitely.
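The stall can be illustrated with a minimal sketch. The stage list and the flushSequentially function below are hypothetical stand-ins, not the real core code, and the choice of L2 as the stuck stage is an assumption made only for illustration. The sketch shows that when stages are awaited one after another with no timeout, a single stage that never settles prevents every later stage, and the caller, from finishing.

```typescript
// Hypothetical stand-ins for the drain stages named above; the real
// implementations are not shown in this report.
type Stage = { name: string; run: () => Promise<void> };

// Awaits each stage in order and records the ones that finished.
async function flushSequentially(
  stages: Stage[],
  completed: string[],
): Promise<void> {
  for (const stage of stages) {
    await stage.run(); // no timeout: a stuck stage stalls everything after it
    completed.push(stage.name);
  }
}

// A promise that never settles, standing in for a call stuck behind a lock.
const hang = (): Promise<void> => new Promise<void>(() => {});
// A stage that finishes immediately.
const ok = async (): Promise<void> => {};

// Assumption for illustration: the L2 drain is the stage that hangs.
const stages: Stage[] = [
  { name: "capture", run: ok },
  { name: "reward", run: ok },
  { name: "L2", run: hang },
  { name: "L3", run: ok },
  { name: "skills/feedback", run: ok },
];
```

Calling flushSequentially(stages, completed) returns a promise that never resolves, while completed stops growing after "capture" and "reward".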
Two sites are affected:
- bridge/stdio.ts waitForShutdown(), called by the SIGTERM handler (line 473)
- bridge.cts headless bridge exit path (line 503), the --no-viewer bridge after stdin EOF
The SIGTERM handler has the same structural issue. It is partially mitigated by the Python-side SIGKILL escalation (commit 97dc73e7 in the Hermes agent), but only while the Python process is still alive.
Fix
Add a 20-second Promise.race timeout around core.shutdown() at both sites:
// bridge/stdio.ts waitForShutdown()
await Promise.race([
  core.shutdown(),
  new Promise<void>((r) => setTimeout(r, 20_000)),
]);

// bridge.cts headless exit
await Promise.race([
  core.shutdown(),
  new Promise<void>((r) => setTimeout(r, 20_000)),
]);
This ensures the bridge exits after 20 seconds even if the flush pipeline is stuck, preventing orphan accumulation. The daemon SIGTERM handler has the same pattern but is less critical — the daemon's parent gateway persists across sessions.
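Since the same race appears at both sites, it could be factored into one helper. The sketch below is an assumption, not part of the patch: shutdownWithTimeout and ShutdownOutcome are hypothetical names. Compared with the inline version, it clears the timer once the race settles, so a fast shutdown does not leave a 20-second timer pending, and it reports which side won so the caller can log a timed-out flush.

```typescript
// Hypothetical helper, not part of the original patch.
type ShutdownOutcome = "completed" | "timed_out";

async function shutdownWithTimeout(
  shutdown: () => Promise<void>,
  timeoutMs: number,
): Promise<ShutdownOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Resolves with "timed_out" once the deadline passes.
  const timeout = new Promise<ShutdownOutcome>((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), timeoutMs);
  });

  try {
    // Whichever settles first decides the outcome. A rejected shutdown
    // propagates to the caller as a rejection.
    return await Promise.race([
      shutdown().then((): ShutdownOutcome => "completed"),
      timeout,
    ]);
  } finally {
    // Clear the timer so it cannot keep the event loop alive after a
    // fast shutdown. clearTimeout accepts undefined.
    clearTimeout(timer);
  }
}
```

Each site would then call something like await shutdownWithTimeout(() => core.shutdown(), 20_000) and proceed to exit regardless of the outcome. Note that the race only stops waiting on the flush; it does not cancel it, so the process still has to exit explicitly afterwards.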
Evidence
- 19 orphaned --no-viewer bridges over 36 hours, 299% aggregate CPU
- Gateway journal: "bridge process did not exit after stdin close, sending SIGTERM"
- Gateway journal: "sync_turn turn.end failed — [timeout] turn.end did not respond within 30.0s"
- Duplicate traces: single turn ID written 6,572 times (19 bridges × multiple sync cycles)
- Gateway process held 57 dead pipe FDs from un-killable bridges
- Python-side SIGKILL escalation (97dc73e) already handles normal session end