Skip to content

Commit 9dbc3f4

Browse files
committed
fix(alarms): exclude systemd-coredump@* transient units from FailedUnits count
systemd-coredump@<uid>-<pid>-<n>.service units are one-shot transient units that systemd spawns to handle a coredump and then leaves in 'failed' state after exit. They are not real service failures but they inflate the FailedUnits metric and cause loki-failed-units to fire on any box that has recently dumped a core. Patch the documented health-check command to grep them out. (Live fix also applied to /usr/local/bin/loki-health-check.sh on the current instance.)
1 parent 124c966 commit 9dbc3f4

1 file changed

Lines changed: 1 addition & 1 deletion

File tree

bootstraps/essential/BOOTSTRAP-ALARMS.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -163,7 +163,7 @@ Pushes all Tier 3 custom metrics in a single `put-metric-data` call (batched).
163163
**What it checks:**
164164
1. **OpenClaw instances:** `pgrep -f openclaw-gatewa` — OpenClaw gateway process alive
165165
**Hermes instances:** `pgrep -f hermes` — Hermes agent process alive
166-
2. `systemctl list-units --failed --no-legend | wc -l` — Failed unit count
166+
2. `systemctl list-units --failed --no-legend | grep -v 'systemd-coredump@' | wc -l` — Failed unit count (excludes transient coredump handler units, which linger in `failed` state after handling any crash)
167167
4. `df --output=pcent / | tail -1` — Root disk percent
168168
5. `free | awk '/Mem/ {printf "%.0f", $3/$2*100}'` — Memory percent
169169
6. Quick Bedrock `InvokeModel` with tiny payload (1 embedding, cached model) — API reachable

0 commit comments

Comments
 (0)