fix: align redundancy NUT visibility with grace (#46)
Move the project and package guidance to AGENTS.md while keeping Claude Code compatibility shims that point to the new source of truth. Co-authored-by: Codex <[email protected]>
Fix issue #4 by making redundancy health consume live connection context instead of relying only on snapshot age. Runtime stale/lost NUT visibility now contributes DEGRADED while the member is inside the configured connection grace window, including slow in-flight upsc polls that leave the last snapshot stale, and becomes UNKNOWN once grace expires or the monitor reports FAILED. Add rate-limited slow upsc log visibility and sustained-slowness notification gating so operators get early diagnostics without one-off latency alerts. Add unit coverage for transient stale, in-flight slow polls, grace expiry, slow-poll logging, and sustained notifications. Add redundancy E2E regression cases for brief runtime NUT loss recovering inside grace and persistent loss firing after grace. Bump to 5.3.0-rc2 and document the operator impact and no-config-change migration note. Co-authored-by: Codex <[email protected]>
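The DEGRADED-inside-grace versus UNKNOWN-after-grace decision described above can be sketched as follows. This is a hypothetical condensed version; the real `assess_health()` takes a full snapshot plus staleness thresholds, and the function and state names here are illustrative:

```python
import time


def assess(connection_state, connection_lost_time, grace_window, now=None):
    """Condensed sketch of the grace-aware health decision."""
    if now is None:
        now = time.time()
    if connection_state == "FAILED":
        # Monitor gave up on the connection: fail safe.
        return "UNKNOWN"
    if connection_state == "GRACE_PERIOD":
        # connection_lost_time == 0.0 is the "no live grace timer" sentinel;
        # a fresh grace snapshot without it stays DEGRADED (back-compat path).
        if connection_lost_time and now - connection_lost_time > grace_window:
            return "UNKNOWN"   # grace expired
        return "DEGRADED"      # transient visibility loss inside grace
    return "HEALTHY"
```

Note the ordering: FAILED short-circuits before the grace check, so a monitor that has already given up never reports a misleading DEGRADED.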
📝 Walkthrough
This PR refactors repository documentation by promoting package-level guidance from CLAUDE.md files into AGENTS.md as the single source of truth.
Changes
Documentation Consolidation & Reference Updates
Health Assessment with Connection Grace Period & NUT Latency Visibility
Sequence Diagram

```mermaid
sequenceDiagram
    participant Monitor as UPSGroupMonitor
    participant LatencyRecorder as _record_upsc_latency
    participant HealthModel as assess_health()
    participant Redundancy as RedundancyGroupEvaluator
    participant State as MonitorState
    Monitor->>Monitor: _get_all_ups_data() (full poll)
    Monitor->>Monitor: _run_upsc() measures elapsed time
    alt Elapsed > SLOW_NUT_LOG_THRESHOLD
        Monitor->>LatencyRecorder: _record_upsc_latency(elapsed, is_full_poll)
        alt Last log was >300s ago
            LatencyRecorder->>Monitor: Rate-limited "Slow NUT response" log
        end
        alt Consecutive slow full polls >= 3
            LatencyRecorder->>Monitor: Enqueue "Sustained slow NUT responses" notification
        end
    else Elapsed normal
        LatencyRecorder->>Monitor: Reset slow-poll streak
    end
    Monitor->>State: Update stale_data_count, connection_lost_time
    Monitor->>HealthModel: assess_health(snapshot, max_stale_data_tolerance, connection_grace_enabled, connection_grace_duration)
    alt connection_state == "FAILED"
        HealthModel->>HealthModel: Return UNKNOWN
    else connection_state == "GRACE_PERIOD" & now > grace_expiry
        HealthModel->>HealthModel: Return UNKNOWN
    else transient_visibility_loss (within pre-grace or grace window)
        HealthModel->>HealthModel: Return DEGRADED (not UNKNOWN)
    else normal stale/age checks
        HealthModel->>HealthModel: Return HEALTHY/DEGRADED/CRITICAL
    end
    Redundancy->>Redundancy: Collect member health states (DEGRADED/UNKNOWN/HEALTHY)
    alt Quorum still met
        Redundancy->>Redundancy: Continue operation
    else Quorum lost (too many UNKNOWN)
        Redundancy->>Redundancy: Shutdown redundancy group
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

The PR spans multiple interconnected domains: documentation reorganization (low complexity but pervasive), a health assessment logic refactor with grace-period windowing and stale-retry bounds (dense algorithmic changes requiring careful validation), monitor-level latency tracking with rate-limiting and state management (moderate complexity), redundancy integration wiring (moderate), and extensive test coverage across multiple test suites and E2E scenarios (breadth). The heterogeneity of the changes, combining documentation consolidation, core health logic, monitoring infrastructure, and comprehensive test expansion, demands separate reasoning for each layer. The grace-period logic in particular involves subtle timing semantics and multiple state variables that require careful review.
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Codecov Report❌ Patch coverage is
Additional details and impacted files:

```text
@@           Coverage Diff            @@
##             main      #46    +/-  ##
=========================================
+ Coverage   79.59%   79.75%   +0.16%
=========================================
  Files          25       25
  Lines        4616     4683     +67
  Branches      861      869      +8
=========================================
+ Hits         3674     3735     +61
- Misses        737      743      +6
  Partials      205      205
```
Close StatsStore instances that were opened only to exercise schema migration or empty database setup paths. The one-line open calls left sqlite3 connections for garbage collection, which surfaced as ResourceWarning noise during pytest coverage runs. This keeps the issue #4 PR review signal clean before upstream AI review while preserving the same migration and deferred-delivery behavior under test. Co-authored-by: Codex <[email protected]>
The new redundancy R1/R2 regression tests on this branch shell out to `pkill` inside the nut-dummy container to selectively stop the UPS1 and UPS2 dummy drivers and force a runtime NUT visibility loss. The image is debian:bookworm-slim with only nut-server / nut-client / inotify-tools installed -- pkill ships with procps and was missing, so the kill was a silent no-op. R1 happened to pass anyway because its second `restart_redundancy_nut_server` brings NUT down via container restart; R2 has no second restart, so the monitors never lost visibility, never entered grace, and the test asserted on a "Grace period started" log line that never appeared. Adding procps unblocks R2 without touching test logic. Co-Authored-By: Claude Opus 4.7 <[email protected]>
Surface the load-bearing invariants in the new redundancy/grace path so
future readers don't accidentally "fix" what looks like a bug:
- health_model: hoist the previously-magic `5` retry multiplier into
``STALE_RETRY_TOLERANCE_MULTIPLIER`` and split the GRACE_PERIOD branch
into an explicit if/else. The back-compat fallback grace_age is
intentionally allowed to go negative so a fresh GRACE_PERIOD snapshot
with no ``connection_lost_time`` stays DEGRADED even when the
evaluator was constructed with grace_window == 0; the comment now
spells that out.
- state: document the dual-sentinel meaning of
``connection_lost_time == 0.0`` ("never lost" vs "explicitly cleared
after recovery / failsafe") so readers treat both cases as
"no live grace timer to consult".
- monitor: note that the FSB failsafe intentionally does NOT reset
``stale_data_count`` -- once connection_state is FAILED, the count
is irrelevant and the next good poll resets it anyway. Explain that
the slow-NUT notification gate fires on N *consecutive* slow polls,
not sustained slowness over a window: a single fast poll resets
both the streak counter and the "already notified" gate.
Adds two unit tests covering the back-compat GRACE_PERIOD fallback
(fresh snapshot stays DEGRADED, aged snapshot fails safe to UNKNOWN).
Co-Authored-By: Claude Opus 4.7 <[email protected]>
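The consecutive-slow-poll gate described above (a notification only after N consecutive slow polls, with a single fast poll resetting both the streak counter and the already-notified gate, plus a rate limit on log lines) can be sketched like this. Class and constant names are hypothetical, not eneru's actual implementation:

```python
import time

SLOW_LOG_RATE_LIMIT = 300.0   # seconds between "slow NUT" log lines
SLOW_STREAK_THRESHOLD = 3     # consecutive slow full polls before notifying


class SlowPollTracker:
    def __init__(self):
        self._last_log = float("-inf")
        self._streak = 0
        self._notified = False

    def record(self, elapsed, threshold, now=None):
        """Return (should_log, should_notify) for one completed poll."""
        if now is None:
            now = time.time()
        if elapsed <= threshold:
            # One fast poll resets both the streak and the notify gate,
            # so this fires on consecutive slowness, not a sliding window.
            self._streak = 0
            self._notified = False
            return (False, False)
        should_log = now - self._last_log >= SLOW_LOG_RATE_LIMIT
        if should_log:
            self._last_log = now
        self._streak += 1
        should_notify = self._streak >= SLOW_STREAK_THRESHOLD and not self._notified
        if should_notify:
            self._notified = True
        return (should_log, should_notify)
```

The design choice documented in the commit falls straight out of the reset branch: sustained slowness over a window would need a deque of timestamps, whereas a plain streak counter is cheaper and resets eagerly.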
@coderabbitai review
@cubic-dev-ai review |
✅ Actions performed: Review triggered.
@m4r1k I have started the AI code review. It will take a few minutes to complete.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/test_stats.py (1)
Lines 43-49: ⚡ Quick win: Add type hints to the new helper function.

Line 43 introduces `open_and_close_store` without annotations. Please type it as `path: Path` and `-> None` to stay consistent with the repo-wide typing standard.

Suggested diff:

```diff
-def open_and_close_store(path):
+def open_and_close_store(path: Path) -> None:
     """Exercise the open lifecycle without leaking the SQLite handle."""
     s = StatsStore(path)
     try:
         s.open()
     finally:
         s.close()
```

As per coding guidelines: `**/*.py`: Python version 3.9+ with type hints throughout codebase.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_stats.py` around lines 43 - 49, Add type annotations to the helper function open_and_close_store: annotate the parameter as path: Path and the return type as -> None. Update the function signature for open_and_close_store(path) to open_and_close_store(path: Path) -> None and ensure you import or reference Path from pathlib if not already present; keep the body unchanged and retain usage of StatsStore.open() and StatsStore.close().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 403e9b47-c881-40b2-a17c-8d34d91615a7
📒 Files selected for processing (24):
AGENTS.md, CLAUDE.md, codecov.yml, docs/architecture.md, docs/changelog.md, docs/testing.md, src/eneru/AGENTS.md, src/eneru/CLAUDE.md, src/eneru/health_model.py, src/eneru/lifecycle.py, src/eneru/monitor.py, src/eneru/redundancy.py, src/eneru/state.py, src/eneru/stats.py, src/eneru/version.py, tests/e2e/config-e2e-redundancy-short-grace.yaml, tests/e2e/groups/redundancy.sh, tests/e2e/nut-dummy/Dockerfile, tests/test_deferred_delivery.py, tests/test_health_model.py, tests/test_monitor_core.py, tests/test_packaging.py, tests/test_redundancy.py, tests/test_stats.py
```text
src/eneru/                      # Main package
  AGENTS.md                     # Module map + mixin pattern (agent context)
  __init__.py                   # Public API exports
  __main__.py                   # CLI entry point (python -m eneru)
  version.py                    # Version string (single source of truth)
  config.py                     # Configuration dataclasses + ConfigLoader
  state.py                      # MonitorState dataclass
  logger.py                     # TimezoneFormatter + UPSLogger
  notifications.py              # NotificationWorker (Apprise integration)
  utils.py                      # Helper functions (run_command, etc.)
  actions.py                    # REMOTE_ACTIONS templates
  monitor.py                    # UPSGroupMonitor core: init, polling, orchestration, main loop
  multi_ups.py                  # MultiUPSCoordinator (thread-per-group)
  cli.py                        # CLI argument parsing + main()
  shutdown/                     # Per-phase shutdown mixins
    vms.py                      # VMShutdownMixin (libvirt)
    containers.py               # ContainerShutdownMixin (docker/podman + compose)
    filesystems.py              # FilesystemShutdownMixin (sync + unmount)
    remote.py                   # RemoteShutdownMixin (SSH-based remote servers)
  health/                       # Health-monitoring mixins
    voltage.py                  # VoltageMonitorMixin (thresholds, AVR, bypass, overload)
    battery.py                  # BatteryMonitorMixin (depletion rate, anomaly detection)

tests/                          # pytest tests
  conftest.py                   # Shared fixtures
  test_constants.py             # Shared test constants (sample webhook URLs, etc.)
  test_config_loading.py        # Config: defaults + YAML file parse
  test_config_notifications.py  # Config: legacy Discord, avatar handling
  test_config_filesystems.py    # Config: mount path parsing
  test_config_vm_containers.py  # Config: compose files, container runtime
  test_config_remote.py         # Config: remote servers, ordering, safety margin
  test_config_validation.py     # Config: cross-field validation, edge cases
  test_*.py                     # Unit/integration tests for non-config modules
  e2e/                          # End-to-end tests
    docker-compose.yml          # E2E test environment
    config-e2e*.yaml            # E2E test configs
    nut-dummy/Dockerfile        # NUT server simulator
    ssh-target/Dockerfile       # SSH target container

docs/                           # MkDocs documentation (ReadTheDocs)
  index.md                      # Homepage
  getting-started.md            # Installation guide
  configuration.md              # Config reference
  triggers.md                   # Shutdown triggers
  notifications.md              # Apprise setup
  remote-servers.md             # SSH configuration
  testing.md                    # CI/CD strategy
  troubleshooting.md            # Debug guide
  changelog.md                  # Changelog (comprehensive, single source of truth)

.github/
  workflows/
    validate.yml                # Lint + unit tests
    integration.yml             # Package install tests
    e2e.yml                     # End-to-end tests
    release.yml                 # Build .deb/.rpm packages
    pypi.yml                    # Publish to PyPI
  ISSUE_TEMPLATE/               # Bug/feature templates
  PULL_REQUEST_TEMPLATE.md      # PR template

examples/                       # Example configs
  config-reference.yaml         # Comprehensive reference (every feature flag)
  config-minimal.yaml           # Minimal single-UPS setup
  config-homelab.yaml           # Homelab: VMs, containers, NAS
  config-enterprise.yaml        # Multi-server enterprise setup
  config-dual-ups.yaml          # Multi-UPS setup

packaging/
  eneru-wrapper.py              # Package entry point wrapper
  eneru.service                 # Systemd service file
  scripts/                      # Package lifecycle scripts

pyproject.toml                  # PEP 517/518 packaging
pytest.ini                      # pytest configuration
mkdocs.yml                      # MkDocs configuration
nfpm.yaml                       # .deb/.rpm package config
.readthedocs.yaml               # RTD build config
requirements.txt                # Runtime dependencies
requirements-dev.txt            # Dev dependencies
CONTRIBUTING.md                 # Contribution guidelines
README.md                       # Project overview
```
Add language tags to these fenced blocks.
These unlabeled fences trip markdownlint MD040. Using text for tree/output blocks and bash or markdown for command/example blocks will keep the new source-of-truth doc lint-clean.
Also applies to: 214-228, 233-240, 261-274, 357-365
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 47-47: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@AGENTS.md` around lines 47 - 129, The fenced code blocks in AGENTS.md (the
repository tree and example/command blocks shown in the diff) are unlabeled and
trigger markdownlint MD040; update each triple-backtick fence to include an
appropriate language tag (use text for file-tree/output blocks and bash or
markdown for command/example blocks) for the blocks shown (the repository tree
block and the other unlabeled fences noted in the comment) so the document
passes MD040.
```python
        # Live connection context for redundancy: lets the evaluator tell
        # "stale but still inside connection grace" from a dead monitor.
        # ``connection_lost_time`` uses 0.0 as a dual-purpose sentinel for
        # "never lost" and "explicitly cleared after recovery / failsafe";
        # readers must treat both cases as "no live grace timer to consult".
        "stale_data_count",      # consecutive failed/stale polls since last success
        "connection_lost_time",  # ``time.time()`` when connection grace started
    ],
)
# Back-compat for tests / third-party code still constructing the old
# 10-field HealthSnapshot shape directly.
HealthSnapshot.__new__.__defaults__ = (0, 0.0)
```
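The `__defaults__` assignment in the diff above is the standard namedtuple back-compat trick: defaults apply right-to-left across the field list, so callers still constructing the old, shorter shape get the sentinel values for the new trailing fields. A minimal standalone reproduction with an abbreviated field list (the real HealthSnapshot has more fields):

```python
from collections import namedtuple

# Abbreviated stand-in for eneru's HealthSnapshot.
HealthSnapshot = namedtuple(
    "HealthSnapshot",
    ["battery_charge", "connection_state", "stale_data_count", "connection_lost_time"],
)
# Right-to-left defaults: the two NEW trailing fields fall back to their
# sentinels, so old positional constructions keep working unchanged.
HealthSnapshot.__new__.__defaults__ = (0, 0.0)

old_style = HealthSnapshot(100, "CONNECTED")          # old 2-field shape
full = HealthSnapshot(50, "GRACE_PERIOD", 2, 123.0)   # new 4-field shape
```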
Publish these new snapshot fields under the same lock used for reads.
stale_data_count and connection_lost_time are now part of the redundancy snapshot, but the corresponding writes in UPSGroupMonitor still happen lock-free. That means the evaluator can observe mixed state around failure/recovery transitions, e.g. GRACE_PERIOD with a cleared or stale connection_lost_time, and misclassify the member right at the grace boundary. Please move every mutation of snapshot-published health fields behind self.state._lock before relying on this snapshot as an atomic view.
Also applies to: 150-151
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/eneru/state.py` around lines 26 - 37, The new HealthSnapshot fields
stale_data_count and connection_lost_time are written lock-free in
UPSGroupMonitor, causing racey reads; wrap every mutation that updates these
snapshot-published health fields inside the same lock used for reads
(self.state._lock) so updates become atomic with snapshot publication. Locate
assignments to stale_data_count and connection_lost_time in UPSGroupMonitor (and
the nearby writes referred to around lines 150-151) and move them into the
critical section guarded by self.state._lock (or acquire the lock before
updating and release after), ensuring snapshot creation/assignment and any
related state mutations occur while holding self.state._lock. Ensure no other
code publishes the snapshot fields outside that lock.
1 issue found across 24 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/eneru/health_model.py">
<violation number="1" location="src/eneru/health_model.py:157">
P3: Use the already-computed `stale_threshold` variable instead of re-expanding `STALE_INTERVAL_MULTIPLIER * interval`. This avoids drift if one is updated without the other.</violation>
</file>
Partial review: this PR has more than 50 files, so cubic reviewed the highest-priority files first.
The first restart_redundancy_nut_server in R1 has been getting SIGTERMed ~43s in with no visible failure: the script stops after "Waiting for redundancy NUT sources (2/30)" and never prints iterations 3-30, never prints "FAIL: redundancy NUT sources did not recover", and never reaches the next dbg-able boundary. Without instrumentation we cannot tell whether wait_for_redundancy_nut hangs in upsc, whether the docker compose call wedges, or whether something in the R1 sequence itself deadlocks before its first sleep.

Add minimum-friction observability so the next failed run is self-diagnosing:

- dbg() prints UTC-timestamped step markers, sprinkled at every R1/R2 phase boundary plus inside the helper functions. The last printed marker before SIGTERM pinpoints the wedge.
- Each upsc poll inside wait_for_redundancy_nut is now bounded by `timeout 5s` so a wedged libupsclient read cannot eat the entire 30-iteration budget on a single iteration.
- dump_redundancy_nut_state() prints `docker compose ps`, the dummy-ups/upsd processes inside the container, and a per-UPS upsc probe. Called after every restart and on any R1/R2 failure path.
- stop_redundancy_nut_drivers verifies the post-condition (no UPS1/UPS2 driver processes remain) and bounds its docker exec with `timeout 10s`.
- R1/R2 failure handlers now `cat` the full eneru log (was tail -80/-100) and dump container state, so the runner log is sufficient to debug without re-running with extra diagnostics.

No semantic changes to the test assertions or sleep timings; this is purely additive observability.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
Root cause of the consistent R1 SIGTERM: ``pkill -f 'dummy-ups.*-a UPS1'`` runs inside ``sh -c "pkill -f 'dummy-ups.*-a UPS1' || true; pkill -f 'dummy-ups.*-a UPS2' || true"``. The pkill regex matches the wrapper sh's own command line (which literally contains ``dummy-ups.*-a UPS1``), so pkill kills the wrapper before the second pkill ever runs. ``docker compose exec`` is then left holding a half-dead exec stream that never returns -- the runner SIGTERMs the whole step ~15s later (visible in CI as the consistent 4-minute "Process completed with exit code 143" we were chasing).

Apply the standard ``[d]ummy-ups`` bracket idiom so the regex still matches real ``/usr/lib/nut/dummy-ups -a UPS1`` cmdlines but NOT the literal ``[d]ummy-ups`` text in pkill's own argv. Same trick for the post-condition ``ps -ef | grep`` so it doesn't need ``grep -v grep`` either. Add ``--kill-after=5s`` to the bounding ``timeout`` so a future docker-compose-exec wedge can't sit on SIGTERM forever.

Verified by re-reading the freshly-instrumented CI log: the missing dbg marker after "pkill UPS1+UPS2 dummy-ups in container" is exactly where the wrapper sh dies, which is now the only diagnosis the data supports.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
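The bracket idiom works because the regex `[d]ummy-ups` still matches the literal text `dummy-ups` in a real driver cmdline, but not the literal text `[d]ummy-ups` that appears in pkill's own argv. A quick Python check of the same regex semantics (the cmdline strings below are illustrative, not captured from the container):

```python
import re

# Illustrative cmdlines, not captured from the actual container.
driver = "/usr/lib/nut/dummy-ups -a UPS1 -u root"
wrapper_naive = "sh -c pkill -f 'dummy-ups.*-a UPS1' || true"
wrapper_bracketed = "sh -c pkill -f '[d]ummy-ups.*-a UPS1' || true"

naive = re.compile(r"dummy-ups.*-a UPS1")
bracketed = re.compile(r"[d]ummy-ups.*-a UPS1")

# The naive pattern matches the wrapper sh's own argv too, so
# `pkill -f` kills the wrapper before the second pkill can run.
assert naive.search(driver) and naive.search(wrapper_naive)

# The bracketed pattern still matches the real driver cmdline...
assert bracketed.search(driver)
# ...but not the literal "[d]ummy-ups" text in the wrapper's argv,
# because the character class [d] only ever matches a bare "d".
assert bracketed.search(wrapper_bracketed) is None
```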
The R1 transient-NUT-loss regression was firing a real shutdown on slow
GHA runners because the per-UPS connection grace was 15s but a full
``docker compose restart nut-server`` takes ~10s of graceful stop +
entrypoint startup + driver settle, which on these runners is reliably
>15s end-to-end. The recovery thus landed AFTER grace expired, the
member flipped to UNKNOWN, ``unknown_counts_as: critical`` lost the
quorum and fired ``REDUNDANCY GROUP SHUTDOWN`` -- exactly what R1 is
supposed to assert never happens.
This is a test-infra timing problem, not a production-code bug:
production grace defaults are 60s and operators don't restart NUT in
that window. The minimal fix is to widen the test grace and the R2
hold-loss sleep so both regressions still exercise their respective
branches with comfortable headroom on the slowest runners we've seen
(tests 21-27 ran 2.3x slower than baseline on the same job).
Changes:
- tests/e2e/config-e2e-redundancy-short-grace.yaml: grace 15s → 40s on
both members, with a comment explaining the budget.
- tests/e2e/groups/redundancy.sh:
- R1 eneru ``timeout 48s`` → ``90s`` (covers restart + sleep 13 +
stop + sleep 7 + restart + sleep 10 = ~60s on slow runners with
headroom).
- R2 ``sleep 28`` → ``sleep 55`` (must exceed new 40s grace).
- R2 eneru ``timeout 58s`` → ``105s`` accordingly.
Co-Authored-By: Claude Opus 4.7 <[email protected]>
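As a sanity check on the new 90s budget, the phase arithmetic from the commit message can be spelled out. The restart and driver-stop durations below are assumed worst-case figures, not measured values; only the sleep values come from the test script:

```python
# Worst-case seconds per R1 phase (restart/stop figures are assumptions).
nut_restart = 15   # docker compose restart nut-server on a slow runner
driver_stop = 2    # pkill of the dummy drivers

phases = [nut_restart, 13, driver_stop, 7, nut_restart, 10]
total = sum(phases)   # ~60s end-to-end on the slowest runners seen

assert total <= 90    # the widened R1 eneru timeout leaves headroom
```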
cubic (P3, src/eneru/health_model.py:157): the late stale-threshold check re-expanded ``STALE_INTERVAL_MULTIPLIER * interval`` instead of reusing the already-computed ``stale_threshold`` local. Replace with ``stale_threshold`` so the two sites can never drift if the multiplier is later renamed or recomputed.

CodeRabbit (Major, src/eneru/state.py:37, applied to monitor.py): the new ``connection_lost_time`` snapshot field is part of a 2-field state machine paired with ``connection_state``. Reads via ``MonitorState.snapshot()`` already take ``self._lock``, but the writes in UPSGroupMonitor were lock-free. The redundancy evaluator could observe a torn pair (e.g. ``GRACE_PERIOD`` with ``connection_lost_time == 0.0``) right at the grace boundary; the back-compat fallback in health_model masks the worst of it, but the contract should be tight, not papered over. Wrap each ``connection_state`` + ``connection_lost_time`` paired write in ``with self.state._lock:`` so snapshot() always sees a consistent pair. Single-field writes to ``stale_data_count`` are left lock-free (CPython attribute assignment is atomic and the field has no paired invariant).

Five paired sites updated:

- _handle_connection_failure: enter GRACE_PERIOD (set both fields).
- _handle_connection_failure: GRACE_PERIOD → FAILED transition.
- _main_loop: failsafe (FSB) path.
- _main_loop: recovery from GRACE_PERIOD.
- _main_loop: recovery from FAILED.

No behavior change; tightens the snapshot read/write contract.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
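A minimal sketch of the paired-write pattern described above, assuming a simplified MonitorState (not eneru's real class): readers take one snapshot under the lock, and every write that touches both fields happens under the same lock, so no reader can see a half-updated pair.

```python
import threading
import time


class MonitorState:
    """Simplified stand-in: snapshot() reads both fields under _lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.connection_state = "CONNECTED"
        self.connection_lost_time = 0.0   # 0.0 = "no live grace timer"

    def snapshot(self):
        with self._lock:
            return (self.connection_state, self.connection_lost_time)


def enter_grace(state):
    # Paired write under the same lock snapshot() takes, so a reader can
    # never observe GRACE_PERIOD together with a cleared (0.0) lost-time.
    with state._lock:
        state.connection_state = "GRACE_PERIOD"
        state.connection_lost_time = time.time()


def recover(state):
    # Clearing both fields is equally a paired write.
    with state._lock:
        state.connection_state = "CONNECTED"
        state.connection_lost_time = 0.0
```

Because `snapshot()` and the writers contend on one lock, the invariant "GRACE_PERIOD implies a nonzero lost-time" holds at every observable point, which is exactly what the back-compat fallback no longer has to paper over.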
Adds a Buy Me a Coffee link (https://buymeacoffee.com/m4r1k) so operators who want to chip in toward UPS hardware and NUT testing have a clear, optional way to do so.

- README.md: badge in the existing badge row + dedicated "Support the project" section above the License section.
- docs/index.md: matching "Support the project" section at the end of the index page so the link surfaces on Read the Docs too.

Wording is intentionally low-pressure -- Eneru stays free and MIT regardless.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
Summary
Fixes issue #4 by aligning redundancy health with the per-UPS connection-loss grace path.
- Members report `DEGRADED` while still inside the configured connection grace window.
- Members report `UNKNOWN` once grace expires, when the monitor reports `FAILED`, or when no successful poll has ever been published after startup grace.
- Grace covers slow in-flight `upsc` polls so the redundancy evaluator does not fire before the monitor can enter grace.
- Rate-limited slow `upsc` logs and sustained-slowness notifications.
- Version bumped to `5.3.0-rc2`.

Tests
- `/tmp/eneru-venv/bin/pytest tests/test_health_model.py tests/test_redundancy.py tests/test_monitor_core.py -m unit -q`
- `/tmp/eneru-venv/bin/pytest -m unit`
- `bash -n tests/e2e/groups/redundancy.sh`
- `/tmp/eneru-venv/bin/python -m eneru validate --config tests/e2e/config-e2e-redundancy-short-grace.yaml`

Docker/E2E was not run locally by design. The new redundancy runtime cases are for upstream CI.
Notes
- `unknown_counts_as: critical` remains fail-safe after connection grace expires.
- .claude/local metadata is intentionally not included.

Summary by cubic
Aligns redundancy health with per‑UPS connection‑loss grace to prevent false UNKNOWN and accidental shutdowns during brief NUT flaps. Adds slow NUT latency visibility with rate‑limited logs and sustained‑slowness notifications, plus tighter snapshot consistency at the grace boundary.
Bug Fixes
- Cover slow in-flight `upsc` so redundancy waits for grace; add rate-limited slow-poll logs and sustained-slowness notifications; clarify grace fallback and sentinel behavior with tests.
- Write `connection_state` and `connection_lost_time` under a lock to avoid torn reads at grace edges; widen E2E grace to 40s and extend R1/R2 sleeps/timeouts; add timestamped markers, bound `upsc` with `timeout --kill-after`, dump container state, install `procps` for `pkill`, and fix the `pkill` regex; close SQLite stores in tests to silence ResourceWarnings.

Migration
- `unknown_counts_as: critical` remains fail-safe after connection grace.

Written for commit 53da695.
Summary by CodeRabbit
Release Notes for 5.3.0-rc2
Bug Fixes
New Features
Tests
Note: No configuration changes required; updated behavior only prevents transient connection loss from bypassing the existing grace window.