bridge: core.shutdown() hangs indefinitely when gateway dies abnormally (orphaned bridge) #1798

## Symptom

When the Hermes gateway process dies without a clean shutdown (SIGKILL, OOM, crash), the `--no-viewer` bridge process becomes orphaned:
- stdin gets EOF from the OS (parent pipe FDs close on process exit)
- `activeStdio.done` resolves
- Bridge calls `core.shutdown()` → `flush()` → L2/L3 LLM calls
- On a large DB under lock contention, this blocks indefinitely
- With the parent process gone, nobody remains to send SIGKILL

Over ~36 hours, 19 orphaned bridges accumulated, each consuming 14-29% CPU and contending for the SQLite write lock. Every `turn.end` sync timed out at 30s. The DB grew by ~150MB from duplicate traces: each orphaned bridge independently wrote the same turns, and one turn ID had 6,572 copies.

## Root Cause

`core.shutdown()` has no timeout. The `flush()` pipeline chains through capture drain → reward drain → L2 drain (LLM call, ~5s each) → L3 drain → skills/feedback flush. On a large DB under lock contention, this blocks indefinitely.
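To make the failure mode concrete, here is a minimal model of a sequential drain chain. It is not the actual `flush()` implementation: the `Stage` type, `flushSequentially`, and the stand-in stage bodies are all hypothetical, and only the stage order is taken from the description above. It shows that one stage that never settles leaves every later stage unreached and the outer promise pending forever.

```typescript
// Illustrative model only: stage names mirror the issue text, the bodies are stand-ins.
type Stage = { name: string; run: () => Promise<void> };

// Awaits each stage in order; a stage that never settles stalls every stage after it.
async function flushSequentially(stages: Stage[], completed: string[]): Promise<void> {
  for (const stage of stages) {
    await stage.run();
    completed.push(stage.name);
  }
}

// Models a drain that is stuck, e.g. waiting on a contended write lock.
const never = (): Promise<void> => new Promise<void>(() => {});
// Models a drain that finishes immediately.
const quick = (): Promise<void> => Promise.resolve();

const stages: Stage[] = [
  { name: "capture", run: quick },
  { name: "reward", run: quick },
  { name: "l2", run: never },
  { name: "l3", run: quick },
  { name: "skills_feedback", run: quick },
];

const completed: string[] = [];
// The returned promise never settles, so nothing awaiting it can ever continue.
void flushSequentially(stages, completed);
```

Once pending microtasks have run, `completed` holds only `capture` and `reward`. Any caller that awaits the chain without a deadline waits as long as the stuck stage does.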

Two sites are affected:
1. `bridge/stdio.ts` `waitForShutdown()` — called by the SIGTERM handler (line 473)
2. `bridge.cts` headless bridge exit path (line 503) — `--no-viewer` bridge after stdin EOF

The SIGTERM handler has the same structural issue. It is partially mitigated by the Python-side SIGKILL escalation (commit `97dc73e7` in the Hermes agent), but only while the Python process is still alive.

## Fix

Add a 20-second `Promise.race` timeout around `core.shutdown()` at both sites:

```typescript
// bridge/stdio.ts waitForShutdown()
await Promise.race([
  core.shutdown(),
  new Promise<void>((r) => setTimeout(r, 20_000)),
]);

// bridge.cts headless exit
await Promise.race([
  core.shutdown(),
  new Promise<void>((r) => setTimeout(r, 20_000)),
]);
```

This bounds the shutdown wait at 20 seconds even if the flush pipeline is stuck, so the bridge can exit instead of accumulating as an orphan. The daemon SIGTERM handler has the same pattern but is less critical, because the daemon's parent gateway persists across sessions.
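One possible refinement, sketched here under assumptions: a bare `Promise.race` leaves the 20-second timer pending after a fast shutdown, and it rejects if `core.shutdown()` rejects. The helper below is hypothetical (`shutdownWithDeadline` and `ShutdownOutcome` are not names from the code base). It clears the timer and reports which branch won, so the caller can log a timed-out flush. It does not cancel the stuck work, so the exit path presumably still needs an explicit `process.exit()` afterwards if the stuck flush keeps handles open.

```typescript
// Hypothetical helper, not part of the bridge code base.
type ShutdownOutcome = "completed" | "timed_out" | "failed";

async function shutdownWithDeadline(
  shutdown: () => Promise<void>,
  timeoutMs: number,
): Promise<ShutdownOutcome> {
  // Map a rejected shutdown to "failed" so the race itself never rejects.
  const attempt: Promise<ShutdownOutcome> = shutdown().then(
    (): ShutdownOutcome => "completed",
    (): ShutdownOutcome => "failed",
  );

  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<ShutdownOutcome>((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), timeoutMs);
  });

  try {
    // Whichever settles first decides the outcome; the loser is ignored.
    return await Promise.race([attempt, deadline]);
  } finally {
    // Clear the timer so a fast shutdown does not keep the event loop
    // alive for the full timeout.
    clearTimeout(timer);
  }
}
```

At either call site this could be used as `const outcome = await shutdownWithDeadline(() => core.shutdown(), 20_000);`, followed by logging `outcome` and exiting.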

## Evidence

- 19 orphaned `--no-viewer` bridges over 36 hours, 299% aggregate CPU
- Gateway journal: "bridge process did not exit after stdin close, sending SIGTERM"
- Gateway journal: "sync_turn turn.end failed — [timeout] turn.end did not respond within 30.0s"
- Duplicate traces: single turn ID written 6,572 times (19 bridges × multiple sync cycles)
- Gateway process held 57 dead pipe FDs from un-killable bridges
- Python-side SIGKILL escalation (`97dc73e7`) already handles normal session end
