bridge: core.shutdown() hangs indefinitely when gateway dies abnormally (orphaned bridge) #1798

@chiefmojo

Symptom

When the Hermes gateway process dies without clean shutdown (SIGKILL, OOM, crash), the --no-viewer bridge process becomes orphaned:

  • stdin gets EOF from the OS (parent pipe FDs close on process exit)
  • activeStdio.done resolves
  • Bridge calls core.shutdown() → flush() → L2/L3 LLM calls
  • On a large DB under lock contention, this blocks indefinitely
  • With the parent process gone, nobody remains to send SIGKILL

Over ~36 hours, 19 orphaned bridges accumulated, each consuming 14-29% CPU and fighting for the SQLite write lock. Every turn.end sync timed out at 30s. The DB grew ~150MB from duplicate traces (each zombie bridge independently wrote the same turns — one turn ID had 6,572 copies).

Root Cause

core.shutdown() has no timeout. The flush() pipeline chains through capture drain → reward drain → L2 drain (LLM call, ~5s each) → L3 drain → skills/feedback flush. On a large DB under lock contention, this blocks indefinitely.
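
To illustrate why one stalled drain blocks everything behind it, here is a minimal sketch of a strictly sequential pipeline. The `Stage`, `Progress`, and `runStages` names are hypothetical and are not taken from the bridge code; the real flush() internals are not shown in this issue. The sketch only assumes that each drain is awaited before the next one starts.

```typescript
// Hypothetical types and names; the real flush() internals are not shown in this issue.
type Stage = { name: string; run: () => Promise<void> };
type Progress = { current: string | null; done: string[] };

// Awaits each stage in order. A stage that never settles blocks every stage
// after it, and `progress.current` keeps naming the stalled stage.
async function runStages(stages: Stage[], progress: Progress): Promise<void> {
  for (const stage of stages) {
    progress.current = stage.name;
    await stage.run();
    progress.done.push(stage.name);
  }
  progress.current = null;
}
```

Tracking the current stage this way could also make a shutdown timeout log more useful, since it would show which drain was pending when the deadline fired.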

Two sites are affected:

  1. bridge/stdio.ts waitForShutdown() — called by the SIGTERM handler (line 473)
  2. bridge.cts headless bridge exit path (line 503) — --no-viewer bridge after stdin EOF

The SIGTERM handler has the same structural issue. It is partially mitigated by the Python-side SIGKILL escalation (commit 97dc73e7 in the Hermes agent), but only while the Python process is still alive.

Fix

Add a 20-second Promise.race timeout around core.shutdown() at both sites:

// bridge/stdio.ts waitForShutdown()
await Promise.race([
  core.shutdown(),
  new Promise<void>((r) => setTimeout(r, 20_000)),
]);

// bridge.cts headless exit
await Promise.race([
  core.shutdown(),
  new Promise<void>((r) => setTimeout(r, 20_000)),
]);

This ensures the bridge exits after 20 seconds even if the flush pipeline is stuck, preventing orphan accumulation. The daemon SIGTERM handler has the same pattern but is less critical — the daemon's parent gateway persists across sessions.
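
One possible refinement, sketched here as an assumption rather than as part of the proposed patch: a small helper that clears the timer once shutdown settles and reports which side won the race. `raceWithTimeout` and `RaceOutcome` are hypothetical names. With a bare Promise.race, the pending 20-second timer stays scheduled after a fast shutdown and can keep the Node event loop alive unless the process exits explicitly.

```typescript
// Hypothetical helper; not part of the bridge codebase.
type RaceOutcome = "completed" | "timed_out";

// Resolves with "completed" if `work` fulfils first, or "timed_out" if `ms`
// elapses first. A rejection from `work` is propagated to the caller.
function raceWithTimeout(work: Promise<unknown>, ms: number): Promise<RaceOutcome> {
  return new Promise<RaceOutcome>((resolve, reject) => {
    const timer = setTimeout(() => resolve("timed_out"), ms);
    work.then(
      () => {
        clearTimeout(timer); // do not leave the timer pending after a fast shutdown
        resolve("completed");
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

At either site this would be called as `await raceWithTimeout(core.shutdown(), 20_000)`, and the returned outcome could be logged before exiting. A timer-based race only helps while the event loop keeps turning: if the stall happens inside a synchronous call, the timeout callback cannot run.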

Evidence

  • 19 orphaned --no-viewer bridges over 36 hours, 299% aggregate CPU
  • Gateway journal: "bridge process did not exit after stdin close, sending SIGTERM"
  • Gateway journal: "sync_turn turn.end failed — [timeout] turn.end did not respond within 30.0s"
  • Duplicate traces: single turn ID written 6,572 times (19 bridges × multiple sync cycles)
  • Gateway process held 57 dead pipe FDs from un-killable bridges
  • Python-side SIGKILL escalation (97dc73e) already handles normal session end

Metadata

Labels: plugin (Plugin/adapter/bridge layer, apps/ directory)
