
fix: align redundancy NUT visibility with grace#46

Merged
m4r1k merged 11 commits into main from fix/startup-trigger-lockout
May 4, 2026

Conversation

@m4r1k
Owner

@m4r1k m4r1k commented May 4, 2026

Summary

Fixes issue #4 by aligning redundancy health with the per-UPS connection-loss grace path.

  • Treat runtime stale/lost NUT visibility as DEGRADED while the member is still inside configured connection grace.
  • Treat members as UNKNOWN once grace expires, when the monitor reports FAILED, or when no successful poll has ever been published after startup grace.
  • Add bounded handling for slow in-flight upsc polls so the redundancy evaluator does not fire before the monitor can enter grace.
  • Add rate-limited slow upsc logs and sustained-slowness notifications.
  • Bump version to 5.3.0-rc2.
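
The DEGRADED/UNKNOWN precedence described in the bullets above can be pictured as a small decision function. This is a hedged sketch only: the real assess_health() in src/eneru/health_model.py takes more inputs (snapshot age, stale-retry bounds, back-compat fallbacks) that this illustration omits, and classify_visibility is a made-up name.

```python
import time

# Illustrative sketch of the precedence described above; not the real
# assess_health() signature or full logic.
def classify_visibility(connection_state, connection_lost_time,
                        grace_enabled, grace_duration, now=None):
    """Map live connection context to a member health contribution."""
    now = time.time() if now is None else now
    if connection_state == "FAILED":
        return "UNKNOWN"  # monitor itself gave up: fail safe immediately
    if connection_state == "GRACE_PERIOD":
        if grace_enabled and connection_lost_time:
            if now - connection_lost_time <= grace_duration:
                return "DEGRADED"  # still inside grace: degraded, not unknown
        return "UNKNOWN"  # grace expired (or no timer to consult): fail safe
    return "HEALTHY"
```
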

Tests

  • /tmp/eneru-venv/bin/pytest tests/test_health_model.py tests/test_redundancy.py tests/test_monitor_core.py -m unit -q
  • /tmp/eneru-venv/bin/pytest -m unit
  • bash -n tests/e2e/groups/redundancy.sh
  • /tmp/eneru-venv/bin/python -m eneru validate --config tests/e2e/config-e2e-redundancy-short-grace.yaml

Docker/E2E tests were deliberately not run locally; the new redundancy runtime cases are exercised by upstream CI.

Notes

  • No YAML change is required for users.
  • unknown_counts_as: critical remains fail-safe after connection grace expires.
  • .claude/ local metadata is intentionally not included.

Summary by cubic

Aligns redundancy health with per‑UPS connection‑loss grace to prevent false UNKNOWN and accidental shutdowns during brief NUT flaps. Adds slow NUT latency visibility with rate‑limited logs and sustained‑slowness notifications, plus tighter snapshot consistency at the grace boundary.

  • Bug Fixes

    • Use live connection context: stale/lost data is DEGRADED during grace; becomes UNKNOWN only after grace expires, the monitor reports FAILED, or no successful post‑startup poll ever occurred.
    • Bound slow in‑flight upsc so redundancy waits for grace; add rate‑limited slow‑poll logs and sustained‑slowness notifications; clarify grace fallback and sentinel behavior with tests.
    • Harden runtime and tests: update connection_state and connection_lost_time under a lock to avoid torn reads at grace edges; widen E2E grace to 40s and extend R1/R2 sleeps/timeouts; add timestamped markers, bound upsc with timeout --kill-after, dump container state, install procps for pkill, and fix the pkill regex; close SQLite stores in tests to silence ResourceWarnings.
  • Migration

    • No config changes required.
    • unknown_counts_as: critical remains fail‑safe after connection grace.

Written for commit 53da695. Summary will update on new commits.

Summary by CodeRabbit

Release Notes for 5.3.0-rc2

  • Bug Fixes

    • Fixed redundancy runtime issue where transient UPS connection loss could prematurely trigger quorum loss; grace window now properly prevents shutdown during brief connection flaps.
  • New Features

    • Added slow UPS connection monitoring with rate-limited logging and stricter notification handling only after sustained slowness.
  • Tests

    • Enhanced redundancy grace period and stale connection handling test coverage.

Note: No configuration changes required; updated behavior only prevents transient connection loss from bypassing the existing grace window.

m4r1k and others added 2 commits May 4, 2026 20:36
Move the project and package guidance to AGENTS.md while keeping Claude Code compatibility shims that point to the new source of truth.

Co-authored-by: Codex <[email protected]>
Fix issue #4 by making redundancy health consume live connection context instead of relying only on snapshot age. Runtime stale/lost NUT visibility now contributes DEGRADED while the member is inside the configured connection grace window, including slow in-flight upsc polls that leave the last snapshot stale, and becomes UNKNOWN once grace expires or the monitor reports FAILED.

Add rate-limited slow upsc log visibility and sustained-slowness notification gating so operators get early diagnostics without one-off latency alerts. Add unit coverage for transient stale, in-flight slow polls, grace expiry, slow-poll logging, and sustained notifications.

Add redundancy E2E regression cases for brief runtime NUT loss recovering inside grace and persistent loss firing after grace. Bump to 5.3.0-rc2 and document the operator impact and no-config-change migration note.

Co-authored-by: Codex <[email protected]>
@coderabbitai

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

This PR refactors repository documentation by promoting package-level guidance from CLAUDE.md files into comprehensive AGENTS.md files at the root and src/eneru/ levels, implements connection grace period logic in health assessment to prevent transient NUT visibility loss from triggering redundancy shutdown, and adds slow NUT response monitoring with rate-limited logging and sustained-slowness notifications.

Changes

Documentation Consolidation & Reference Updates

Layer / File(s) Summary
Root Documentation
AGENTS.md, CLAUDE.md
New comprehensive AGENTS.md defines development setup (uv-only enforcement), command reference, project structure, code style, testing/packaging/schema evolution rules, code review workflow, Git practices, and dependencies. CLAUDE.md reduced to a brief bridge pointing to AGENTS.md.
Package-Level Documentation
src/eneru/AGENTS.md, src/eneru/CLAUDE.md
New src/eneru/AGENTS.md documents the module map, mixin pattern for shutdown/health phases, build/test integration, SQLite schema evolution rules with migration patterns, and package-specific emoji/convention guidance. Package CLAUDE.md reduced to a bridge referring to AGENTS.md.
Reference Updates
codecov.yml, docs/architecture.md, src/eneru/lifecycle.py, src/eneru/stats.py, tests/test_packaging.py
Updated inline documentation comments and links to reference the new AGENTS.md files instead of CLAUDE.md.

Health Assessment with Connection Grace Period & NUT Latency Visibility

Layer / File(s) Summary
Data Shape
src/eneru/state.py
HealthSnapshot namedtuple extended with stale_data_count and connection_lost_time fields; __new__.__defaults__ added for back-compat. MonitorState.snapshot() populates these new fields.
Core Health Logic
src/eneru/health_model.py
assess_health() signature extended with keyword-only parameters max_stale_data_tolerance, connection_grace_enabled, connection_grace_duration. Logic reworked to compute bounded pre-grace stale retry and grace windows; connection_state="FAILED" immediately yields UNKNOWN; GRACE_PERIOD yields DEGRADED until expiry then UNKNOWN; transient visibility loss (aged snapshots within grace/retry bounds) remains DEGRADED rather than UNKNOWN. Added STALE_RETRY_TOLERANCE_MULTIPLIER constant.
Monitor Latency Tracking
src/eneru/monitor.py
Added module-level slow-NUT thresholds (SLOW_NUT_LOG_THRESHOLD_SECONDS=2.0, SLOW_NUT_NOTIFY_THRESHOLD_SECONDS=10.0, SLOW_NUT_NOTIFY_CONSECUTIVE_POLLS=3, SLOW_NUT_LOG_RATE_LIMIT_SECONDS=300.0) and per-monitor state for tracking slow-poll streaks. Introduced _run_upsc() wrapper that measures elapsed time and delegates to _record_upsc_latency() for rate-limited logging and notification control. Updated _get_ups_var() and _get_all_ups_data() to use the wrapper; preserved stale_data_count when transitioning to FAILED on battery.
Redundancy Integration
src/eneru/redundancy.py
Updated RedundancyGroupEvaluator.evaluate_once() to pass monitor's max_stale_data_tolerance and connection-grace configuration to assess_health().
E2E Configuration & Tests
tests/e2e/config-e2e-redundancy-short-grace.yaml, tests/e2e/groups/redundancy.sh, tests/e2e/nut-dummy/Dockerfile
Added E2E redundancy config with 15-second connection grace and dual-UPS setup. Added helper functions (wait_for_redundancy_nut, restart_redundancy_nut_server, stop_redundancy_nut_drivers) and regression tests (R1: transient loss within grace stays healthy; R2: loss past grace triggers shutdown). Extended NUT dummy container with procps for process management.
Unit Test Coverage
tests/test_health_model.py, tests/test_monitor_core.py, tests/test_redundancy.py, tests/test_deferred_delivery.py, tests/test_stats.py
_snap() fixtures updated with new HealthSnapshot fields. TestAssessHealthStaleness expanded with 11 new cases covering grace-period degraded/unknown transitions, back-compat fallback, in-flight slow polling, transient stale data, and retry bounds. Added TestNUTLatencyVisibility validating rate-limited logging and sustained-slowness notifications. Redundancy test fixtures and TestEvaluatorCounting extended to verify time-based stale/grace quorum behavior. Test helpers simplified with open_and_close_store() utility.
Changelog & Version
docs/changelog.md, src/eneru/version.py
Version bumped to 5.3.0-rc2. Changelog documents redundancy grace-period fix, slow-NUT visibility feature, regression test coverage, and migration notes (no YAML changes required).
Documentation Updates
docs/testing.md
Reorganized "Test areas" and E2E workflow groups to highlight "Redundancy runtime" as a distinct section; updated test inventory to document R1/R2 regression cases.
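
The _run_upsc() wrapper described in the walkthrough is essentially a timing shim around the poll command. The sketch below is an illustrative stand-in, not the real wrapper's signature; only the 2.0-second threshold constant matches a value named above.

```python
import subprocess
import time

SLOW_NUT_LOG_THRESHOLD_SECONDS = 2.0  # matches the constant listed in the walkthrough

def run_upsc_timed(argv):
    """Run a command and return (completed_process, elapsed_seconds, was_slow)."""
    start = time.monotonic()
    result = subprocess.run(argv, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    return result, elapsed, elapsed > SLOW_NUT_LOG_THRESHOLD_SECONDS
```

In the real code the elapsed time is then handed to _record_upsc_latency() for rate-limited logging and notification control.
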

Sequence Diagram

sequenceDiagram
    participant Monitor as UPSGroupMonitor
    participant LatencyRecorder as _record_upsc_latency
    participant HealthModel as assess_health()
    participant Redundancy as RedundancyGroupEvaluator
    participant State as MonitorState

    Monitor->>Monitor: _get_all_ups_data() (full poll)
    Monitor->>Monitor: _run_upsc() measures elapsed time
    alt Elapsed > SLOW_NUT_LOG_THRESHOLD
        Monitor->>LatencyRecorder: _record_upsc_latency(elapsed, is_full_poll)
        alt Last log was >300s ago
            LatencyRecorder->>Monitor: Rate-limited "Slow NUT response" log
        end
        alt Consecutive slow full polls >= 3
            LatencyRecorder->>Monitor: Enqueue "Sustained slow NUT responses" notification
        end
    else Elapsed normal
        LatencyRecorder->>Monitor: Reset slow-poll streak
    end
    Monitor->>State: Update stale_data_count, connection_lost_time
    Monitor->>HealthModel: assess_health(snapshot, max_stale_data_tolerance, connection_grace_enabled, connection_grace_duration)
    alt connection_state == "FAILED"
        HealthModel->>HealthModel: Return UNKNOWN
    else connection_state == "GRACE_PERIOD" & now > grace_expiry
        HealthModel->>HealthModel: Return UNKNOWN
    else transient_visibility_loss (within pre-grace or grace window)
        HealthModel->>HealthModel: Return DEGRADED (not UNKNOWN)
    else normal stale/age checks
        HealthModel->>HealthModel: Return HEALTHY/DEGRADED/CRITICAL
    end
    Redundancy->>Redundancy: Collect member health states (DEGRADED/UNKNOWN/HEALTHY)
    alt Quorum still met
        Redundancy->>Redundancy: Continue operation
    else Quorum lost (too many UNKNOWN)
        Redundancy->>Redundancy: Shutdown redundancy group
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The PR spans multiple interconnected domains: documentation reorganization (low complexity but pervasive), health assessment logic refactor with grace-period windowing and stale-retry bounds (dense algorithmic changes requiring careful validation), monitor-level latency tracking with rate-limiting and state management (moderate complexity), redundancy integration wiring (moderate), and extensive test coverage across multiple test suites and E2E scenarios (breadth). The heterogeneity of changes—combining documentation consolidation, core health logic, monitoring infrastructure, and comprehensive test expansion—demands separate reasoning for each layer. The grace-period logic in particular involves subtle timing semantics and multiple state variables that require careful review.

Possibly related PRs

  • m4r1k/Eneru#26: Directly related—both modify core health/redundancy/monitor/state modules and implement interdependent changes to health assessment and snapshot state management.
  • m4r1k/Eneru#43: Related at code level—both update src/eneru/version.py for version bumps in the 5.3.0 release cycle.
  • m4r1k/Eneru#33: Related—both modify src/eneru/redundancy.py and alter redundancy health evaluation and notification behavior.

Poem

🐰 Hopping through grace windows and latency lanes,
The bunny docs leap to AGENTS' domains,
Stale data now waits in its mercy-full grace,
While slow NUT responses know their own space,
Redundancy stands firm—no premature flight,
The warren is safer, the quorum is tight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 20.59%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Title check (✅ Passed): the title 'fix: align redundancy NUT visibility with grace' is clear and specific, summarizing the grace-period alignment fix for redundancy NUT visibility.
  • Linked Issues check (✅ Passed): skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): skipped because no linked issues were found for this pull request.
  • Description check (✅ Passed): skipped; CodeRabbit’s high-level summary is enabled.



@codecov

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 88.09524% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.75%. Comparing base (f854fce) to head (53da695).

Files with missing lines Patch % Lines
src/eneru/monitor.py 88.23% 6 Missing ⚠️
src/eneru/health_model.py 85.71% 4 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #46      +/-   ##
==========================================
+ Coverage   79.59%   79.75%   +0.16%     
==========================================
  Files          25       25              
  Lines        4616     4683      +67     
  Branches      861      869       +8     
==========================================
+ Hits         3674     3735      +61     
- Misses        737      743       +6     
  Partials      205      205              
Files with missing lines Coverage Δ
src/eneru/lifecycle.py 92.90% <ø> (ø)
src/eneru/redundancy.py 92.04% <100.00%> (+0.09%) ⬆️
src/eneru/state.py 100.00% <100.00%> (ø)
src/eneru/stats.py 82.30% <ø> (ø)
src/eneru/version.py 100.00% <100.00%> (ø)
src/eneru/health_model.py 92.45% <85.71%> (-7.55%) ⬇️
src/eneru/monitor.py 63.69% <88.23%> (+1.92%) ⬆️

m4r1k and others added 4 commits May 4, 2026 21:27
Close StatsStore instances that were opened only to exercise schema migration or empty database setup paths. The one-line open calls left sqlite3 connections for garbage collection, which surfaced as ResourceWarning noise during pytest coverage runs.

This keeps the issue #4 PR review signal clean before upstream AI review while preserving the same migration and deferred-delivery behavior under test.

Co-authored-by: Codex <[email protected]>
The new redundancy R1/R2 regression tests on this branch shell out to
`pkill` inside the nut-dummy container to selectively stop the UPS1 and
UPS2 dummy drivers and force a runtime NUT visibility loss. The image is
debian:bookworm-slim with only nut-server / nut-client / inotify-tools
installed -- pkill ships with procps and was missing, so the kill was a
silent no-op. R1 happened to pass anyway because its second
`restart_redundancy_nut_server` brings NUT down via container restart;
R2 has no second restart, so the monitors never lost visibility, never
entered grace, and the test asserted on a "Grace period started" log
line that never appeared.

Adding procps unblocks R2 without touching test logic.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
Surface the load-bearing invariants in the new redundancy/grace path so
future readers don't accidentally "fix" what looks like a bug:

- health_model: hoist the previously-magic `5` retry multiplier into
  ``STALE_RETRY_TOLERANCE_MULTIPLIER`` and split the GRACE_PERIOD branch
  into an explicit if/else. The back-compat fallback grace_age is
  intentionally allowed to go negative so a fresh GRACE_PERIOD snapshot
  with no ``connection_lost_time`` stays DEGRADED even when the
  evaluator was constructed with grace_window == 0; the comment now
  spells that out.
- state: document the dual-sentinel meaning of
  ``connection_lost_time == 0.0`` ("never lost" vs "explicitly cleared
  after recovery / failsafe") so readers treat both cases as
  "no live grace timer to consult".
- monitor: note that the FSB failsafe intentionally does NOT reset
  ``stale_data_count`` -- once connection_state is FAILED, the count
  is irrelevant and the next good poll resets it anyway. Explain that
  the slow-NUT notification gate fires on N *consecutive* slow polls,
  not sustained slowness over a window: a single fast poll resets
  both the streak counter and the "already notified" gate.

Adds two unit tests covering the back-compat GRACE_PERIOD fallback
(fresh snapshot stays DEGRADED, aged snapshot fails safe to UNKNOWN).

Co-Authored-By: Claude Opus 4.7 <[email protected]>
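The "N consecutive slow polls, single fast poll re-arms" gate documented in this commit can be sketched as a small state machine. The class below is illustrative (SlowPollGate is a made-up name); only the two constants mirror values named elsewhere in this PR.

```python
# Sketch of the slow-poll gating described above: rate-limited logging plus
# a notification that fires only after N *consecutive* slow polls. A single
# fast poll resets both the streak and the "already notified" latch.
SLOW_NUT_NOTIFY_CONSECUTIVE_POLLS = 3
SLOW_NUT_LOG_RATE_LIMIT_SECONDS = 300.0

class SlowPollGate:
    def __init__(self):
        self.streak = 0        # consecutive slow polls so far
        self.notified = False  # notification latch for the current streak
        self.last_log = None   # monotonic time of the last emitted log

    def record(self, slow, now):
        """Return (should_log, should_notify) for one poll result."""
        if not slow:
            self.streak = 0
            self.notified = False  # fast poll re-arms the notification gate
            return False, False
        self.streak += 1
        should_log = (self.last_log is None
                      or now - self.last_log >= SLOW_NUT_LOG_RATE_LIMIT_SECONDS)
        if should_log:
            self.last_log = now
        should_notify = (self.streak >= SLOW_NUT_NOTIFY_CONSECUTIVE_POLLS
                         and not self.notified)
        if should_notify:
            self.notified = True
        return should_log, should_notify
```
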
@m4r1k
Owner Author

m4r1k commented May 4, 2026

@coderabbitai review

@m4r1k
Owner Author

m4r1k commented May 4, 2026

@cubic-dev-ai review

@coderabbitai

coderabbitai Bot commented May 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cubic-dev-ai

cubic-dev-ai Bot commented May 4, 2026

@cubic-dev-ai review

@m4r1k I have started the AI code review. It will take a few minutes to complete.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/test_stats.py (1)

43-49: ⚡ Quick win

Add type hints to the new helper function.

Line 43 introduces open_and_close_store without annotations. Please type it as path: Path and -> None to stay consistent with the repo-wide typing standard.

Suggested diff
-def open_and_close_store(path):
+def open_and_close_store(path: Path) -> None:
     """Exercise the open lifecycle without leaking the SQLite handle."""
     s = StatsStore(path)
     try:
         s.open()
     finally:
         s.close()

As per coding guidelines: **/*.py: Python version 3.9+ with type hints throughout codebase.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_stats.py` around lines 43 - 49, Add type annotations to the helper
function open_and_close_store: annotate the parameter as path: Path and the
return type as -> None. Update the function signature for
open_and_close_store(path) to open_and_close_store(path: Path) -> None and
ensure you import or reference Path from pathlib if not already present; keep
the body unchanged and retain usage of StatsStore.open() and StatsStore.close().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@AGENTS.md`:
- Around line 47-129: The fenced code blocks in AGENTS.md (the repository tree
and example/command blocks shown in the diff) are unlabeled and trigger
markdownlint MD040; update each triple-backtick fence to include an appropriate
language tag (use text for file-tree/output blocks and bash or markdown for
command/example blocks) for the blocks shown (the repository tree block and the
other unlabeled fences noted in the comment) so the document passes MD040.

In `@src/eneru/state.py`:
- Around line 26-37: The new HealthSnapshot fields stale_data_count and
connection_lost_time are written lock-free in UPSGroupMonitor, causing racey
reads; wrap every mutation that updates these snapshot-published health fields
inside the same lock used for reads (self.state._lock) so updates become atomic
with snapshot publication. Locate assignments to stale_data_count and
connection_lost_time in UPSGroupMonitor (and the nearby writes referred to
around lines 150-151) and move them into the critical section guarded by
self.state._lock (or acquire the lock before updating and release after),
ensuring snapshot creation/assignment and any related state mutations occur
while holding self.state._lock. Ensure no other code publishes the snapshot
fields outside that lock.

---

Nitpick comments:
In `@tests/test_stats.py`:
- Around line 43-49: Add type annotations to the helper function
open_and_close_store: annotate the parameter as path: Path and the return type
as -> None. Update the function signature for open_and_close_store(path) to
open_and_close_store(path: Path) -> None and ensure you import or reference Path
from pathlib if not already present; keep the body unchanged and retain usage of
StatsStore.open() and StatsStore.close().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 403e9b47-c881-40b2-a17c-8d34d91615a7

📥 Commits

Reviewing files that changed from the base of the PR and between f854fce and 397522a.

📒 Files selected for processing (24)
  • AGENTS.md
  • CLAUDE.md
  • codecov.yml
  • docs/architecture.md
  • docs/changelog.md
  • docs/testing.md
  • src/eneru/AGENTS.md
  • src/eneru/CLAUDE.md
  • src/eneru/health_model.py
  • src/eneru/lifecycle.py
  • src/eneru/monitor.py
  • src/eneru/redundancy.py
  • src/eneru/state.py
  • src/eneru/stats.py
  • src/eneru/version.py
  • tests/e2e/config-e2e-redundancy-short-grace.yaml
  • tests/e2e/groups/redundancy.sh
  • tests/e2e/nut-dummy/Dockerfile
  • tests/test_deferred_delivery.py
  • tests/test_health_model.py
  • tests/test_monitor_core.py
  • tests/test_packaging.py
  • tests/test_redundancy.py
  • tests/test_stats.py

Comment thread AGENTS.md
Comment on lines +47 to +129
```
src/eneru/ # Main package
  AGENTS.md # Module map + mixin pattern (agent context)
  __init__.py # Public API exports
  __main__.py # CLI entry point (python -m eneru)
  version.py # Version string (single source of truth)
  config.py # Configuration dataclasses + ConfigLoader
  state.py # MonitorState dataclass
  logger.py # TimezoneFormatter + UPSLogger
  notifications.py # NotificationWorker (Apprise integration)
  utils.py # Helper functions (run_command, etc.)
  actions.py # REMOTE_ACTIONS templates
  monitor.py # UPSGroupMonitor core: init, polling, orchestration, main loop
  multi_ups.py # MultiUPSCoordinator (thread-per-group)
  cli.py # CLI argument parsing + main()
  shutdown/ # Per-phase shutdown mixins
    vms.py # VMShutdownMixin (libvirt)
    containers.py # ContainerShutdownMixin (docker/podman + compose)
    filesystems.py # FilesystemShutdownMixin (sync + unmount)
    remote.py # RemoteShutdownMixin (SSH-based remote servers)
  health/ # Health-monitoring mixins
    voltage.py # VoltageMonitorMixin (thresholds, AVR, bypass, overload)
    battery.py # BatteryMonitorMixin (depletion rate, anomaly detection)

tests/ # pytest tests
  conftest.py # Shared fixtures
  test_constants.py # Shared test constants (sample webhook URLs, etc.)
  test_config_loading.py # Config: defaults + YAML file parse
  test_config_notifications.py # Config: legacy Discord, avatar handling
  test_config_filesystems.py # Config: mount path parsing
  test_config_vm_containers.py # Config: compose files, container runtime
  test_config_remote.py # Config: remote servers, ordering, safety margin
  test_config_validation.py # Config: cross-field validation, edge cases
  test_*.py # Unit/integration tests for non-config modules
  e2e/ # End-to-end tests
    docker-compose.yml # E2E test environment
    config-e2e*.yaml # E2E test configs
    nut-dummy/Dockerfile # NUT server simulator
    ssh-target/Dockerfile # SSH target container

docs/ # MkDocs documentation (ReadTheDocs)
  index.md # Homepage
  getting-started.md # Installation guide
  configuration.md # Config reference
  triggers.md # Shutdown triggers
  notifications.md # Apprise setup
  remote-servers.md # SSH configuration
  testing.md # CI/CD strategy
  troubleshooting.md # Debug guide
  changelog.md # Changelog (comprehensive, single source of truth)

.github/
  workflows/
    validate.yml # Lint + unit tests
    integration.yml # Package install tests
    e2e.yml # End-to-end tests
    release.yml # Build .deb/.rpm packages
    pypi.yml # Publish to PyPI
  ISSUE_TEMPLATE/ # Bug/feature templates
  PULL_REQUEST_TEMPLATE.md # PR template

examples/ # Example configs
  config-reference.yaml # Comprehensive reference (every feature flag)
  config-minimal.yaml # Minimal single-UPS setup
  config-homelab.yaml # Homelab: VMs, containers, NAS
  config-enterprise.yaml # Multi-server enterprise setup
  config-dual-ups.yaml # Multi-UPS setup

packaging/
  eneru-wrapper.py # Package entry point wrapper
  eneru.service # Systemd service file
  scripts/ # Package lifecycle scripts

pyproject.toml # PEP 517/518 packaging
pytest.ini # pytest configuration
mkdocs.yml # MkDocs configuration
nfpm.yaml # .deb/.rpm package config
.readthedocs.yaml # RTD build config
requirements.txt # Runtime dependencies
requirements-dev.txt # Dev dependencies
CONTRIBUTING.md # Contribution guidelines
README.md # Project overview
```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language tags to these fenced blocks.

These unlabeled fences trip markdownlint MD040. Using text for tree/output blocks and bash or markdown for command/example blocks will keep the new source-of-truth doc lint-clean.

Also applies to: 214-228, 233-240, 261-274, 357-365

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 47-47: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@AGENTS.md` around lines 47 - 129, The fenced code blocks in AGENTS.md (the
repository tree and example/command blocks shown in the diff) are unlabeled and
trigger markdownlint MD040; update each triple-backtick fence to include an
appropriate language tag (use text for file-tree/output blocks and bash or
markdown for command/example blocks) for the blocks shown (the repository tree
block and the other unlabeled fences noted in the comment) so the document
passes MD040.

Comment thread src/eneru/state.py
Comment on lines +26 to +37
# Live connection context for redundancy: lets the evaluator tell
# "stale but still inside connection grace" from a dead monitor.
# ``connection_lost_time`` uses 0.0 as a dual-purpose sentinel for
# "never lost" and "explicitly cleared after recovery / failsafe";
# readers must treat both cases as "no live grace timer to consult".
"stale_data_count", # consecutive failed/stale polls since last success
"connection_lost_time", # ``time.time()`` when connection grace started
],
)
# Back-compat for tests / third-party code still constructing the old
# 10-field HealthSnapshot shape directly.
HealthSnapshot.__new__.__defaults__ = (0, 0.0)
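
The `__new__.__defaults__` assignment quoted above is a standard namedtuple back-compat pattern; defaults bind to the trailing fields, so old call sites that omit the new fields keep working. A standalone demo (Snap and its three fields are placeholders, not the real HealthSnapshot shape):

```python
from collections import namedtuple

# Placeholder type demonstrating the back-compat pattern quoted above.
Snap = namedtuple("Snap", ["battery_charge", "stale_data_count", "connection_lost_time"])
# Defaults apply right-to-left, so only the two new trailing fields get them:
Snap.__new__.__defaults__ = (0, 0.0)

old_shape = Snap(80)  # "old shape" call sites that omit the new fields still work
assert old_shape.stale_data_count == 0
assert old_shape.connection_lost_time == 0.0
```
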


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Publish these new snapshot fields under the same lock used for reads.

stale_data_count and connection_lost_time are now part of the redundancy snapshot, but the corresponding writes in UPSGroupMonitor still happen lock-free. That means the evaluator can observe mixed state around failure/recovery transitions, e.g. GRACE_PERIOD with a cleared or stale connection_lost_time, and misclassify the member right at the grace boundary. Please move every mutation of snapshot-published health fields behind self.state._lock before relying on this snapshot as an atomic view.

Also applies to: 150-151

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/eneru/state.py` around lines 26 - 37, The new HealthSnapshot fields
stale_data_count and connection_lost_time are written lock-free in
UPSGroupMonitor, causing racey reads; wrap every mutation that updates these
snapshot-published health fields inside the same lock used for reads
(self.state._lock) so updates become atomic with snapshot publication. Locate
assignments to stale_data_count and connection_lost_time in UPSGroupMonitor (and
the nearby writes referred to around lines 150-151) and move them into the
critical section guarded by self.state._lock (or acquire the lock before
updating and release after), ensuring snapshot creation/assignment and any
related state mutations occur while holding self.state._lock. Ensure no other
code publishes the snapshot fields outside that lock.
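
The locking discipline the review asks for can be sketched as follows. These classes are stand-ins, not the real MonitorState/UPSGroupMonitor code: the point is only that writes to snapshot-published fields and the snapshot read share one lock, so a reader never observes a torn pair at a grace boundary.

```python
import threading
import time

# Illustrative sketch: mutate snapshot-published fields only while holding
# the same lock that the snapshot read takes.
class GuardedState:
    def __init__(self):
        self._lock = threading.Lock()
        self.stale_data_count = 0
        self.connection_lost_time = 0.0

    def mark_connection_lost(self, now=None):
        now = time.time() if now is None else now
        with self._lock:  # both fields change atomically w.r.t. snapshot()
            self.stale_data_count += 1
            self.connection_lost_time = now

    def snapshot(self):
        with self._lock:  # readers always see a consistent pair
            return (self.stale_data_count, self.connection_lost_time)
```
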


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 24 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/eneru/health_model.py">

<violation number="1" location="src/eneru/health_model.py:157">
P3: Use the already-computed `stale_threshold` variable instead of re-expanding `STALE_INTERVAL_MULTIPLIER * interval`. This avoids drift if one is updated without the other.</violation>
</file>

Partial review: this PR has more than 50 files, so cubic reviewed the highest-priority files first.

m4r1k and others added 5 commits May 4, 2026 22:18
The first restart_redundancy_nut_server in R1 has been getting SIGTERMed
~43s in with no visible failure: the script stops after "Waiting for
redundancy NUT sources (2/30)" and never prints iterations 3-30, never
prints "FAIL: redundancy NUT sources did not recover", and never reaches
the next dbg-able boundary. Without instrumentation we cannot tell
whether wait_for_redundancy_nut hangs in upsc, whether the docker
compose call wedges, or whether something in the R1 sequence itself
deadlocks before its first sleep.

Add minimum-friction observability so the next failed run is
self-diagnosing:

- dbg() prints UTC-timestamped step markers; sprinkled at every R1/R2
  phase boundary plus inside the helper functions. The last printed
  marker before SIGTERM pinpoints the wedge.
- Each upsc poll inside wait_for_redundancy_nut is now bounded by
  `timeout 5s` so a wedged libupsclient read cannot eat the entire
  30-iteration budget on a single iteration.
- dump_redundancy_nut_state() prints `docker compose ps`, the
  dummy-ups/upsd processes inside the container, and a per-UPS upsc
  probe. Called after every restart and on any R1/R2 failure path.
- stop_redundancy_nut_drivers verifies the post-condition (no UPS1/UPS2
  driver processes remain) and bounds its docker exec with `timeout 10s`.
- R1/R2 failure handlers now `cat` the full eneru log (was tail -80/-100)
  and dump container state, so the runner log is sufficient to debug
  without re-running with extra diagnostics.

No semantic changes to the test assertions or sleep timings; this is
purely additive observability.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
Root cause of the consistent R1 SIGTERM:
``pkill -f 'dummy-ups.*-a UPS1'`` runs inside ``sh -c "pkill -f
'dummy-ups.*-a UPS1' || true; pkill -f 'dummy-ups.*-a UPS2' || true"``.
The pkill regex matches the wrapper sh's own command line (which
literally contains ``dummy-ups.*-a UPS1``), so pkill kills the wrapper
before the second pkill ever runs. ``docker compose exec`` is then left
holding a half-dead exec stream that never returns -- the runner
SIGTERMs the whole step ~15s later (visible in CI as the consistent
4-minute "Process completed with exit code 143" we were chasing).

Apply the standard ``[d]ummy-ups`` bracket idiom so the regex still
matches real ``/usr/lib/nut/dummy-ups -a UPS1`` cmdlines but NOT the
literal ``[d]ummy-ups`` text in pkill's own argv. Same trick for the
post-condition ``ps -ef | grep`` so it doesn't need ``grep -v grep``
either. Add ``--kill-after=5s`` to the bounding ``timeout`` so a future
docker-compose-exec wedge can't sit on SIGTERM forever.

Verified by re-reading the freshly-instrumented CI log: the missing dbg
marker after "pkill UPS1+UPS2 dummy-ups in container" is exactly where
the wrapper sh dies, which is now the only diagnosis the data supports.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
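The self-match failure mode and the bracket idiom can be demonstrated in isolation. A sketch, with the driver cmdline and patterns taken from the commit message above:

```shell
# Real driver cmdline vs. the sh -c wrapper that carries the pkill itself.
real_cmdline='/usr/lib/nut/dummy-ups -a UPS1'
plain_wrapper="sh -c pkill -f 'dummy-ups.*-a UPS1'"
safe_wrapper="sh -c pkill -f '[d]ummy-ups.*-a UPS1'"

plain='dummy-ups.*-a UPS1'
safe='[d]ummy-ups.*-a UPS1'

# The plain pattern matches the wrapper's own argv -- this is the bug:
echo "$plain_wrapper" | grep -qE "$plain" && echo "plain pattern self-matches the wrapper"

# The bracketed pattern still matches the real driver cmdline...
echo "$real_cmdline" | grep -qE "$safe" && echo "bracketed pattern matches the real driver"

# ...but the literal text '[d]ummy-ups' in the wrapper's argv does not
# contain the substring 'dummy-ups', so the bracketed regex cannot
# self-match ([d] is a one-character class, not literal brackets):
echo "$safe_wrapper" | grep -qE "$safe" || echo "bracketed pattern does not self-match"
```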
The R1 transient-NUT-loss regression was firing a real shutdown on slow
GHA runners because the per-UPS connection grace was 15s but a full
``docker compose restart nut-server`` takes ~10s of graceful stop +
entrypoint startup + driver settle, which on these runners is reliably
>15s end-to-end. The recovery thus landed AFTER grace expired, the
member flipped to UNKNOWN, ``unknown_counts_as: critical`` lost the
quorum and fired ``REDUNDANCY GROUP SHUTDOWN`` -- exactly what R1 is
supposed to assert never happens.

This is a test-infra timing problem, not a production-code bug:
production grace defaults are 60s and operators don't restart NUT in
that window. The minimal fix is to widen the test grace and the R2
hold-loss sleep so both regressions still exercise their respective
branches with comfortable headroom on the slowest runners we've seen
(tests 21-27 ran 2.3x slower than baseline on the same job).

Changes:
- tests/e2e/config-e2e-redundancy-short-grace.yaml: grace 15s → 40s on
  both members, with a comment explaining the budget.
- tests/e2e/groups/redundancy.sh:
  - R1 eneru ``timeout 48s`` → ``90s`` (covers restart + sleep 13 +
    stop + sleep 7 + restart + sleep 10 = ~60s on slow runners with
    headroom).
  - R2 ``sleep 28`` → ``sleep 55`` (must exceed new 40s grace).
  - R2 eneru ``timeout 58s`` → ``105s`` accordingly.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
cubic (P3, src/eneru/health_model.py:157): the late stale-threshold
check re-expanded ``STALE_INTERVAL_MULTIPLIER * interval`` instead of
reusing the already-computed ``stale_threshold`` local. Replace with
``stale_threshold`` so the two sites can never drift if the multiplier
is later renamed or recomputed.

CodeRabbit (Major, src/eneru/state.py:37 - applied to monitor.py): the
new ``connection_lost_time`` snapshot field is part of a 2-field state
machine paired with ``connection_state``. Reads via ``MonitorState.snapshot()``
already take ``self._lock``, but the writes in UPSGroupMonitor were
lock-free. The redundancy evaluator could observe a torn pair (e.g.
``GRACE_PERIOD`` with ``connection_lost_time == 0.0``) right at the
grace boundary; the back-compat fallback in health_model masks the
worst of it but the contract should be tight, not papered over.

Wrap each ``connection_state`` + ``connection_lost_time`` paired write
in ``with self.state._lock:`` so snapshot() always sees a consistent
pair. Single-field writes to ``stale_data_count`` are left lock-free
(CPython attribute assignment is atomic and the field has no paired
invariant). Five paired sites updated:

- _handle_connection_failure: enter GRACE_PERIOD (set both fields).
- _handle_connection_failure: GRACE_PERIOD → FAILED transition.
- _main_loop: failsafe (FSB) path.
- _main_loop: recovery from GRACE_PERIOD.
- _main_loop: recovery from FAILED.

No behavior change; tightens the snapshot read/write contract.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
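The paired-write discipline described in this commit can be sketched as follows; the class and field names are assumed from the commit message, and this is a minimal model rather than the actual eneru code:

```python
import threading
import time

class MonitorState:
    """Two-field state machine whose reader and writers share one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.connection_state = "OK"
        self.connection_lost_time = 0.0

    def snapshot(self):
        # Reads already took the lock; the fix makes paired writes do the same.
        with self._lock:
            return (self.connection_state, self.connection_lost_time)

class UPSGroupMonitor:
    def __init__(self, state):
        self.state = state

    def handle_connection_failure(self, now):
        # Paired write: both fields change together under the lock, so a
        # concurrent snapshot() observes either the old pair or the new
        # pair, never a torn GRACE_PERIOD / connection_lost_time == 0.0 mix.
        with self.state._lock:
            self.state.connection_state = "GRACE_PERIOD"
            self.state.connection_lost_time = now

state = MonitorState()
UPSGroupMonitor(state).handle_connection_failure(time.time())
conn_state, lost_time = state.snapshot()
assert conn_state == "GRACE_PERIOD" and lost_time > 0.0
```

Single-field writes (like `stale_data_count`) can stay lock-free under CPython's atomic attribute assignment, as the commit notes; the lock is only required where two fields form one invariant.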
Adds a Buy Me a Coffee link (https://buymeacoffee.com/m4r1k) so
operators who want to chip in toward UPS hardware and NUT testing have
a clear, optional way to do so.

- README.md: badge in the existing badge row + dedicated "Support the
  project" section above the License section.
- docs/index.md: matching "Support the project" section at the end of
  the index page so the link surfaces on Read the Docs too.

Wording is intentionally low-pressure -- Eneru stays free and MIT
regardless.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@m4r1k m4r1k merged commit c57e62e into main May 4, 2026
40 checks passed
@m4r1k m4r1k deleted the fix/startup-trigger-lockout branch May 4, 2026 20:50