
fix: align redundancy NUT visibility with grace#46

Merged
m4r1k merged 11 commits into main from fix/startup-trigger-lockout
May 4, 2026

Conversation

@m4r1k
Owner

@m4r1k m4r1k commented May 4, 2026

Summary

Fixes issue #4 by aligning redundancy health with the per-UPS connection-loss grace path.

  • Treat runtime stale/lost NUT visibility as DEGRADED while the member is still inside configured connection grace.
  • Treat members as UNKNOWN once grace expires, when the monitor reports FAILED, or when no successful poll has ever been published after startup grace.
  • Add bounded handling for slow in-flight upsc polls so the redundancy evaluator does not fire before the monitor can enter grace.
  • Add rate-limited slow upsc logs and sustained-slowness notifications.
  • Bump version to 5.3.0-rc2.
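
The DEGRADED/UNKNOWN precedence described in the bullets above can be pictured as a small decision function. This is a hedged sketch only: the real assess_health() in src/eneru/health_model.py takes more inputs (snapshot age, stale-retry bounds, back-compat fallbacks) that this illustration omits, and classify_visibility is a made-up name.

```python
import time

# Illustrative sketch of the precedence described above; not the real
# assess_health() signature or full logic.
def classify_visibility(connection_state, connection_lost_time,
                        grace_enabled, grace_duration, now=None):
    """Map live connection context to a member health contribution."""
    now = time.time() if now is None else now
    if connection_state == "FAILED":
        return "UNKNOWN"  # monitor itself gave up: fail safe immediately
    if connection_state == "GRACE_PERIOD":
        if grace_enabled and connection_lost_time:
            if now - connection_lost_time <= grace_duration:
                return "DEGRADED"  # still inside grace: degraded, not unknown
        return "UNKNOWN"  # grace expired (or no timer to consult): fail safe
    return "HEALTHY"
```
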

Tests

  • /tmp/eneru-venv/bin/pytest tests/test_health_model.py tests/test_redundancy.py tests/test_monitor_core.py -m unit -q
  • /tmp/eneru-venv/bin/pytest -m unit
  • bash -n tests/e2e/groups/redundancy.sh
  • /tmp/eneru-venv/bin/python -m eneru validate --config tests/e2e/config-e2e-redundancy-short-grace.yaml

Docker/E2E tests were deliberately not run locally; the new redundancy runtime cases are exercised by upstream CI.

Notes

  • No YAML change is required for users.
  • unknown_counts_as: critical remains fail-safe after connection grace expires.
  • .claude/ local metadata is intentionally not included.

Summary by cubic

Aligns redundancy health with per‑UPS connection‑loss grace to prevent false UNKNOWN and accidental shutdowns during brief NUT flaps. Adds slow NUT latency visibility with rate‑limited logs and sustained‑slowness notifications, plus tighter snapshot consistency at the grace boundary.

  • Bug Fixes

    • Use live connection context: stale/lost data is DEGRADED during grace; becomes UNKNOWN only after grace expires, the monitor reports FAILED, or no successful post‑startup poll ever occurred.
    • Bound slow in‑flight upsc so redundancy waits for grace; add rate‑limited slow‑poll logs and sustained‑slowness notifications; clarify grace fallback and sentinel behavior with tests.
    • Harden runtime and tests: update connection_state and connection_lost_time under a lock to avoid torn reads at grace edges; widen E2E grace to 40s and extend R1/R2 sleeps/timeouts; add timestamped markers, bound upsc with timeout --kill-after, dump container state, install procps for pkill, and fix the pkill regex; close SQLite stores in tests to silence ResourceWarnings.
  • Migration

    • No config changes required.
    • unknown_counts_as: critical remains fail‑safe after connection grace.

Written for commit 53da695. Summary will update on new commits.

Summary by CodeRabbit

Release Notes for 5.3.0-rc2

  • Bug Fixes

    • Fixed redundancy runtime issue where transient UPS connection loss could prematurely trigger quorum loss; grace window now properly prevents shutdown during brief connection flaps.
  • New Features

    • Added slow UPS connection monitoring with rate-limited logging and stricter notification handling only after sustained slowness.
  • Tests

    • Enhanced redundancy grace period and stale connection handling test coverage.

Note: No configuration changes required; updated behavior only prevents transient connection loss from bypassing the existing grace window.

m4r1k and others added 2 commits May 4, 2026 20:36
Move the project and package guidance to AGENTS.md while keeping Claude Code compatibility shims that point to the new source of truth.

Co-authored-by: Codex <[email protected]>
Fix issue #4 by making redundancy health consume live connection context instead of relying only on snapshot age. Runtime stale/lost NUT visibility now contributes DEGRADED while the member is inside the configured connection grace window, including slow in-flight upsc polls that leave the last snapshot stale, and becomes UNKNOWN once grace expires or the monitor reports FAILED.

Add rate-limited slow upsc log visibility and sustained-slowness notification gating so operators get early diagnostics without one-off latency alerts. Add unit coverage for transient stale, in-flight slow polls, grace expiry, slow-poll logging, and sustained notifications.

Add redundancy E2E regression cases for brief runtime NUT loss recovering inside grace and persistent loss firing after grace. Bump to 5.3.0-rc2 and document the operator impact and no-config-change migration note.

Co-authored-by: Codex <[email protected]>
@coderabbitai

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

This PR refactors repository documentation by promoting package-level guidance from CLAUDE.md files into comprehensive AGENTS.md files at the root and src/eneru/ levels, implements connection grace period logic in health assessment to prevent transient NUT visibility loss from triggering redundancy shutdown, and adds slow NUT response monitoring with rate-limited logging and sustained-slowness notifications.

Changes

Documentation Consolidation & Reference Updates

Layer / File(s) Summary
Root Documentation
AGENTS.md, CLAUDE.md
New comprehensive AGENTS.md defines development setup (uv-only enforcement), command reference, project structure, code style, testing/packaging/schema evolution rules, code review workflow, Git practices, and dependencies. CLAUDE.md reduced to a brief bridge pointing to AGENTS.md.
Package-Level Documentation
src/eneru/AGENTS.md, src/eneru/CLAUDE.md
New src/eneru/AGENTS.md documents the module map, mixin pattern for shutdown/health phases, build/test integration, SQLite schema evolution rules with migration patterns, and package-specific emoji/convention guidance. Package CLAUDE.md reduced to a bridge referring to AGENTS.md.
Reference Updates
codecov.yml, docs/architecture.md, src/eneru/lifecycle.py, src/eneru/stats.py, tests/test_packaging.py
Updated inline documentation comments and links to reference the new AGENTS.md files instead of CLAUDE.md.

Health Assessment with Connection Grace Period & NUT Latency Visibility

Layer / File(s) Summary
Data Shape
src/eneru/state.py
HealthSnapshot namedtuple extended with stale_data_count and connection_lost_time fields; __new__.__defaults__ added for back-compat. MonitorState.snapshot() populates these new fields.
Core Health Logic
src/eneru/health_model.py
assess_health() signature extended with keyword-only parameters max_stale_data_tolerance, connection_grace_enabled, connection_grace_duration. Logic reworked to compute bounded pre-grace stale retry and grace windows; connection_state="FAILED" immediately yields UNKNOWN; GRACE_PERIOD yields DEGRADED until expiry then UNKNOWN; transient visibility loss (aged snapshots within grace/retry bounds) remains DEGRADED rather than UNKNOWN. Added STALE_RETRY_TOLERANCE_MULTIPLIER constant.
Monitor Latency Tracking
src/eneru/monitor.py
Added module-level slow-NUT thresholds (SLOW_NUT_LOG_THRESHOLD_SECONDS=2.0, SLOW_NUT_NOTIFY_THRESHOLD_SECONDS=10.0, SLOW_NUT_NOTIFY_CONSECUTIVE_POLLS=3, SLOW_NUT_LOG_RATE_LIMIT_SECONDS=300.0) and per-monitor state for tracking slow-poll streaks. Introduced _run_upsc() wrapper that measures elapsed time and delegates to _record_upsc_latency() for rate-limited logging and notification control. Updated _get_ups_var() and _get_all_ups_data() to use the wrapper; preserved stale_data_count when transitioning to FAILED on battery.
Redundancy Integration
src/eneru/redundancy.py
Updated RedundancyGroupEvaluator.evaluate_once() to pass monitor's max_stale_data_tolerance and connection-grace configuration to assess_health().
E2E Configuration & Tests
tests/e2e/config-e2e-redundancy-short-grace.yaml, tests/e2e/groups/redundancy.sh, tests/e2e/nut-dummy/Dockerfile
Added E2E redundancy config with 15-second connection grace and dual-UPS setup. Added helper functions (wait_for_redundancy_nut, restart_redundancy_nut_server, stop_redundancy_nut_drivers) and regression tests (R1: transient loss within grace stays healthy; R2: loss past grace triggers shutdown). Extended NUT dummy container with procps for process management.
Unit Test Coverage
tests/test_health_model.py, tests/test_monitor_core.py, tests/test_redundancy.py, tests/test_deferred_delivery.py, tests/test_stats.py
_snap() fixtures updated with new HealthSnapshot fields. TestAssessHealthStaleness expanded with 11 new cases covering grace-period degraded/unknown transitions, back-compat fallback, in-flight slow polling, transient stale data, and retry bounds. Added TestNUTLatencyVisibility validating rate-limited logging and sustained-slowness notifications. Redundancy test fixtures and TestEvaluatorCounting extended to verify time-based stale/grace quorum behavior. Test helpers simplified with open_and_close_store() utility.
Changelog & Version
docs/changelog.md, src/eneru/version.py
Version bumped to 5.3.0-rc2. Changelog documents redundancy grace-period fix, slow-NUT visibility feature, regression test coverage, and migration notes (no YAML changes required).
Documentation Updates
docs/testing.md
Reorganized "Test areas" and E2E workflow groups to highlight "Redundancy runtime" as a distinct section; updated test inventory to document R1/R2 regression cases.
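
The _run_upsc() wrapper described in the walkthrough is essentially a timing shim around the poll command. The sketch below is an illustrative stand-in, not the real wrapper's signature; only the 2.0-second threshold constant matches a value named above.

```python
import subprocess
import time

SLOW_NUT_LOG_THRESHOLD_SECONDS = 2.0  # matches the constant listed in the walkthrough

def run_upsc_timed(argv):
    """Run a command and return (completed_process, elapsed_seconds, was_slow)."""
    start = time.monotonic()
    result = subprocess.run(argv, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    return result, elapsed, elapsed > SLOW_NUT_LOG_THRESHOLD_SECONDS
```

In the real code the elapsed time is then handed to _record_upsc_latency() for rate-limited logging and notification control.
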

Sequence Diagram

sequenceDiagram
    participant Monitor as UPSGroupMonitor
    participant LatencyRecorder as _record_upsc_latency
    participant HealthModel as assess_health()
    participant Redundancy as RedundancyGroupEvaluator
    participant State as MonitorState

    Monitor->>Monitor: _get_all_ups_data() (full poll)
    Monitor->>Monitor: _run_upsc() measures elapsed time
    alt Elapsed > SLOW_NUT_LOG_THRESHOLD
        Monitor->>LatencyRecorder: _record_upsc_latency(elapsed, is_full_poll)
        alt Last log was >300s ago
            LatencyRecorder->>Monitor: Rate-limited "Slow NUT response" log
        end
        alt Consecutive slow full polls >= 3
            LatencyRecorder->>Monitor: Enqueue "Sustained slow NUT responses" notification
        end
    else Elapsed normal
        LatencyRecorder->>Monitor: Reset slow-poll streak
    end
    Monitor->>State: Update stale_data_count, connection_lost_time
    Monitor->>HealthModel: assess_health(snapshot, max_stale_data_tolerance, connection_grace_enabled, connection_grace_duration)
    alt connection_state == "FAILED"
        HealthModel->>HealthModel: Return UNKNOWN
    else connection_state == "GRACE_PERIOD" & now > grace_expiry
        HealthModel->>HealthModel: Return UNKNOWN
    else transient_visibility_loss (within pre-grace or grace window)
        HealthModel->>HealthModel: Return DEGRADED (not UNKNOWN)
    else normal stale/age checks
        HealthModel->>HealthModel: Return HEALTHY/DEGRADED/CRITICAL
    end
    Redundancy->>Redundancy: Collect member health states (DEGRADED/UNKNOWN/HEALTHY)
    alt Quorum still met
        Redundancy->>Redundancy: Continue operation
    else Quorum lost (too many UNKNOWN)
        Redundancy->>Redundancy: Shutdown redundancy group
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The PR spans multiple interconnected domains: documentation reorganization (low complexity but pervasive), health assessment logic refactor with grace-period windowing and stale-retry bounds (dense algorithmic changes requiring careful validation), monitor-level latency tracking with rate-limiting and state management (moderate complexity), redundancy integration wiring (moderate), and extensive test coverage across multiple test suites and E2E scenarios (breadth). The heterogeneity of changes—combining documentation consolidation, core health logic, monitoring infrastructure, and comprehensive test expansion—demands separate reasoning for each layer. The grace-period logic in particular involves subtle timing semantics and multiple state variables that require careful review.

Possibly related PRs

  • m4r1k/Eneru#26: Directly related—both modify core health/redundancy/monitor/state modules and implement interdependent changes to health assessment and snapshot state management.
  • m4r1k/Eneru#43: Related at code level—both update src/eneru/version.py for version bumps in the 5.3.0 release cycle.
  • m4r1k/Eneru#33: Related—both modify src/eneru/redundancy.py and alter redundancy health evaluation and notification behavior.

Poem

🐰 Hopping through grace windows and latency lanes,
The bunny docs leap to AGENTS' domains,
Stale data now waits in its mercy-full grace,
While slow NUT responses know their own space,
Redundancy stands firm—no premature flight,
The warren is safer, the quorum is tight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 20.59%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Title check (✅ Passed): the title 'fix: align redundancy NUT visibility with grace' is clear and specific, summarizing the grace-period alignment fix for redundancy NUT visibility.
  • Linked Issues check (✅ Passed): skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): skipped because no linked issues were found for this pull request.
  • Description check (✅ Passed): skipped; CodeRabbit’s high-level summary is enabled.



@codecov

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 88.09524% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.75%. Comparing base (f854fce) to head (53da695).

Files with missing lines Patch % Lines
src/eneru/monitor.py 88.23% 6 Missing ⚠️
src/eneru/health_model.py 85.71% 4 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #46      +/-   ##
==========================================
+ Coverage   79.59%   79.75%   +0.16%     
==========================================
  Files          25       25              
  Lines        4616     4683      +67     
  Branches      861      869       +8     
==========================================
+ Hits         3674     3735      +61     
- Misses        737      743       +6     
  Partials      205      205              
Files with missing lines Coverage Δ
src/eneru/lifecycle.py 92.90% <ø> (ø)
src/eneru/redundancy.py 92.04% <100.00%> (+0.09%) ⬆️
src/eneru/state.py 100.00% <100.00%> (ø)
src/eneru/stats.py 82.30% <ø> (ø)
src/eneru/version.py 100.00% <100.00%> (ø)
src/eneru/health_model.py 92.45% <85.71%> (-7.55%) ⬇️
src/eneru/monitor.py 63.69% <88.23%> (+1.92%) ⬆️

m4r1k and others added 4 commits May 4, 2026 21:27
Close StatsStore instances that were opened only to exercise schema migration or empty database setup paths. The one-line open calls left sqlite3 connections for garbage collection, which surfaced as ResourceWarning noise during pytest coverage runs.

This keeps the issue #4 PR review signal clean before upstream AI review while preserving the same migration and deferred-delivery behavior under test.

Co-authored-by: Codex <[email protected]>
The new redundancy R1/R2 regression tests on this branch shell out to
`pkill` inside the nut-dummy container to selectively stop the UPS1 and
UPS2 dummy drivers and force a runtime NUT visibility loss. The image is
debian:bookworm-slim with only nut-server / nut-client / inotify-tools
installed -- pkill ships with procps and was missing, so the kill was a
silent no-op. R1 happened to pass anyway because its second
`restart_redundancy_nut_server` brings NUT down via container restart;
R2 has no second restart, so the monitors never lost visibility, never
entered grace, and the test asserted on a "Grace period started" log
line that never appeared.

Adding procps unblocks R2 without touching test logic.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
Surface the load-bearing invariants in the new redundancy/grace path so
future readers don't accidentally "fix" what looks like a bug:

- health_model: hoist the previously-magic `5` retry multiplier into
  ``STALE_RETRY_TOLERANCE_MULTIPLIER`` and split the GRACE_PERIOD branch
  into an explicit if/else. The back-compat fallback grace_age is
  intentionally allowed to go negative so a fresh GRACE_PERIOD snapshot
  with no ``connection_lost_time`` stays DEGRADED even when the
  evaluator was constructed with grace_window == 0; the comment now
  spells that out.
- state: document the dual-sentinel meaning of
  ``connection_lost_time == 0.0`` ("never lost" vs "explicitly cleared
  after recovery / failsafe") so readers treat both cases as
  "no live grace timer to consult".
- monitor: note that the FSB failsafe intentionally does NOT reset
  ``stale_data_count`` -- once connection_state is FAILED, the count
  is irrelevant and the next good poll resets it anyway. Explain that
  the slow-NUT notification gate fires on N *consecutive* slow polls,
  not sustained slowness over a window: a single fast poll resets
  both the streak counter and the "already notified" gate.

Adds two unit tests covering the back-compat GRACE_PERIOD fallback
(fresh snapshot stays DEGRADED, aged snapshot fails safe to UNKNOWN).

Co-Authored-By: Claude Opus 4.7 <[email protected]>
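The "N consecutive slow polls, single fast poll re-arms" gate documented in this commit can be sketched as a small state machine. The class below is illustrative (SlowPollGate is a made-up name); only the two constants mirror values named elsewhere in this PR.

```python
# Sketch of the slow-poll gating described above: rate-limited logging plus
# a notification that fires only after N *consecutive* slow polls. A single
# fast poll resets both the streak and the "already notified" latch.
SLOW_NUT_NOTIFY_CONSECUTIVE_POLLS = 3
SLOW_NUT_LOG_RATE_LIMIT_SECONDS = 300.0

class SlowPollGate:
    def __init__(self):
        self.streak = 0        # consecutive slow polls so far
        self.notified = False  # notification latch for the current streak
        self.last_log = None   # monotonic time of the last emitted log

    def record(self, slow, now):
        """Return (should_log, should_notify) for one poll result."""
        if not slow:
            self.streak = 0
            self.notified = False  # fast poll re-arms the notification gate
            return False, False
        self.streak += 1
        should_log = (self.last_log is None
                      or now - self.last_log >= SLOW_NUT_LOG_RATE_LIMIT_SECONDS)
        if should_log:
            self.last_log = now
        should_notify = (self.streak >= SLOW_NUT_NOTIFY_CONSECUTIVE_POLLS
                         and not self.notified)
        if should_notify:
            self.notified = True
        return should_log, should_notify
```
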
@m4r1k
Owner Author

m4r1k commented May 4, 2026

@coderabbitai review

@m4r1k
Owner Author

m4r1k commented May 4, 2026

@cubic-dev-ai review

@coderabbitai

coderabbitai Bot commented May 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cubic-dev-ai

cubic-dev-ai Bot commented May 4, 2026

@cubic-dev-ai review

@m4r1k I have started the AI code review. It will take a few minutes to complete.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/test_stats.py (1)

43-49: ⚡ Quick win

Add type hints to the new helper function.

Line 43 introduces open_and_close_store without annotations. Please type it as path: Path and -> None to stay consistent with the repo-wide typing standard.

Suggested diff
-def open_and_close_store(path):
+def open_and_close_store(path: Path) -> None:
     """Exercise the open lifecycle without leaking the SQLite handle."""
     s = StatsStore(path)
     try:
         s.open()
     finally:
         s.close()

As per coding guidelines: **/*.py: Python version 3.9+ with type hints throughout codebase.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_stats.py` around lines 43 - 49, Add type annotations to the helper
function open_and_close_store: annotate the parameter as path: Path and the
return type as -> None. Update the function signature for
open_and_close_store(path) to open_and_close_store(path: Path) -> None and
ensure you import or reference Path from pathlib if not already present; keep
the body unchanged and retain usage of StatsStore.open() and StatsStore.close().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@AGENTS.md`:
- Around line 47-129: The fenced code blocks in AGENTS.md (the repository tree
and example/command blocks shown in the diff) are unlabeled and trigger
markdownlint MD040; update each triple-backtick fence to include an appropriate
language tag (use text for file-tree/output blocks and bash or markdown for
command/example blocks) for the blocks shown (the repository tree block and the
other unlabeled fences noted in the comment) so the document passes MD040.

In `@src/eneru/state.py`:
- Around line 26-37: The new HealthSnapshot fields stale_data_count and
connection_lost_time are written lock-free in UPSGroupMonitor, causing racey
reads; wrap every mutation that updates these snapshot-published health fields
inside the same lock used for reads (self.state._lock) so updates become atomic
with snapshot publication. Locate assignments to stale_data_count and
connection_lost_time in UPSGroupMonitor (and the nearby writes referred to
around lines 150-151) and move them into the critical section guarded by
self.state._lock (or acquire the lock before updating and release after),
ensuring snapshot creation/assignment and any related state mutations occur
while holding self.state._lock. Ensure no other code publishes the snapshot
fields outside that lock.

---

Nitpick comments:
In `@tests/test_stats.py`:
- Around line 43-49: Add type annotations to the helper function
open_and_close_store: annotate the parameter as path: Path and the return type
as -> None. Update the function signature for open_and_close_store(path) to
open_and_close_store(path: Path) -> None and ensure you import or reference Path
from pathlib if not already present; keep the body unchanged and retain usage of
StatsStore.open() and StatsStore.close().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 403e9b47-c881-40b2-a17c-8d34d91615a7

📥 Commits

Reviewing files that changed from the base of the PR and between f854fce and 397522a.

📒 Files selected for processing (24)
  • AGENTS.md
  • CLAUDE.md
  • codecov.yml
  • docs/architecture.md
  • docs/changelog.md
  • docs/testing.md
  • src/eneru/AGENTS.md
  • src/eneru/CLAUDE.md
  • src/eneru/health_model.py
  • src/eneru/lifecycle.py
  • src/eneru/monitor.py
  • src/eneru/redundancy.py
  • src/eneru/state.py
  • src/eneru/stats.py
  • src/eneru/version.py
  • tests/e2e/config-e2e-redundancy-short-grace.yaml
  • tests/e2e/groups/redundancy.sh
  • tests/e2e/nut-dummy/Dockerfile
  • tests/test_deferred_delivery.py
  • tests/test_health_model.py
  • tests/test_monitor_core.py
  • tests/test_packaging.py
  • tests/test_redundancy.py
  • tests/test_stats.py

Comment thread AGENTS.md
Comment on lines +47 to +129
```
src/eneru/ # Main package
  AGENTS.md # Module map + mixin pattern (agent context)
  __init__.py # Public API exports
  __main__.py # CLI entry point (python -m eneru)
  version.py # Version string (single source of truth)
  config.py # Configuration dataclasses + ConfigLoader
  state.py # MonitorState dataclass
  logger.py # TimezoneFormatter + UPSLogger
  notifications.py # NotificationWorker (Apprise integration)
  utils.py # Helper functions (run_command, etc.)
  actions.py # REMOTE_ACTIONS templates
  monitor.py # UPSGroupMonitor core: init, polling, orchestration, main loop
  multi_ups.py # MultiUPSCoordinator (thread-per-group)
  cli.py # CLI argument parsing + main()
  shutdown/ # Per-phase shutdown mixins
    vms.py # VMShutdownMixin (libvirt)
    containers.py # ContainerShutdownMixin (docker/podman + compose)
    filesystems.py # FilesystemShutdownMixin (sync + unmount)
    remote.py # RemoteShutdownMixin (SSH-based remote servers)
  health/ # Health-monitoring mixins
    voltage.py # VoltageMonitorMixin (thresholds, AVR, bypass, overload)
    battery.py # BatteryMonitorMixin (depletion rate, anomaly detection)

tests/ # pytest tests
  conftest.py # Shared fixtures
  test_constants.py # Shared test constants (sample webhook URLs, etc.)
  test_config_loading.py # Config: defaults + YAML file parse
  test_config_notifications.py # Config: legacy Discord, avatar handling
  test_config_filesystems.py # Config: mount path parsing
  test_config_vm_containers.py # Config: compose files, container runtime
  test_config_remote.py # Config: remote servers, ordering, safety margin
  test_config_validation.py # Config: cross-field validation, edge cases
  test_*.py # Unit/integration tests for non-config modules
  e2e/ # End-to-end tests
    docker-compose.yml # E2E test environment
    config-e2e*.yaml # E2E test configs
    nut-dummy/Dockerfile # NUT server simulator
    ssh-target/Dockerfile # SSH target container

docs/ # MkDocs documentation (ReadTheDocs)
  index.md # Homepage
  getting-started.md # Installation guide
  configuration.md # Config reference
  triggers.md # Shutdown triggers
  notifications.md # Apprise setup
  remote-servers.md # SSH configuration
  testing.md # CI/CD strategy
  troubleshooting.md # Debug guide
  changelog.md # Changelog (comprehensive, single source of truth)

.github/
  workflows/
    validate.yml # Lint + unit tests
    integration.yml # Package install tests
    e2e.yml # End-to-end tests
    release.yml # Build .deb/.rpm packages
    pypi.yml # Publish to PyPI
  ISSUE_TEMPLATE/ # Bug/feature templates
  PULL_REQUEST_TEMPLATE.md # PR template

examples/ # Example configs
  config-reference.yaml # Comprehensive reference (every feature flag)
  config-minimal.yaml # Minimal single-UPS setup
  config-homelab.yaml # Homelab: VMs, containers, NAS
  config-enterprise.yaml # Multi-server enterprise setup
  config-dual-ups.yaml # Multi-UPS setup

packaging/
  eneru-wrapper.py # Package entry point wrapper
  eneru.service # Systemd service file
  scripts/ # Package lifecycle scripts

pyproject.toml # PEP 517/518 packaging
pytest.ini # pytest configuration
mkdocs.yml # MkDocs configuration
nfpm.yaml # .deb/.rpm package config
.readthedocs.yaml # RTD build config
requirements.txt # Runtime dependencies
requirements-dev.txt # Dev dependencies
CONTRIBUTING.md # Contribution guidelines
README.md # Project overview
```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language tags to these fenced blocks.

These unlabeled fences trip markdownlint MD040. Using text for tree/output blocks and bash or markdown for command/example blocks will keep the new source-of-truth doc lint-clean.

Also applies to: 214-228, 233-240, 261-274, 357-365

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 47-47: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@AGENTS.md` around lines 47 - 129, The fenced code blocks in AGENTS.md (the
repository tree and example/command blocks shown in the diff) are unlabeled and
trigger markdownlint MD040; update each triple-backtick fence to include an
appropriate language tag (use text for file-tree/output blocks and bash or
markdown for command/example blocks) for the blocks shown (the repository tree
block and the other unlabeled fences noted in the comment) so the document
passes MD040.

Comment thread src/eneru/state.py
Comment on lines +26 to +37
# Live connection context for redundancy: lets the evaluator tell
# "stale but still inside connection grace" from a dead monitor.
# ``connection_lost_time`` uses 0.0 as a dual-purpose sentinel for
# "never lost" and "explicitly cleared after recovery / failsafe";
# readers must treat both cases as "no live grace timer to consult".
"stale_data_count", # consecutive failed/stale polls since last success
"connection_lost_time", # ``time.time()`` when connection grace started
],
)
# Back-compat for tests / third-party code still constructing the old
# 10-field HealthSnapshot shape directly.
HealthSnapshot.__new__.__defaults__ = (0, 0.0)
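
The `__new__.__defaults__` assignment quoted above is a standard namedtuple back-compat pattern; defaults bind to the trailing fields, so old call sites that omit the new fields keep working. A standalone demo (Snap and its three fields are placeholders, not the real HealthSnapshot shape):

```python
from collections import namedtuple

# Placeholder type demonstrating the back-compat pattern quoted above.
Snap = namedtuple("Snap", ["battery_charge", "stale_data_count", "connection_lost_time"])
# Defaults apply right-to-left, so only the two new trailing fields get them:
Snap.__new__.__defaults__ = (0, 0.0)

old_shape = Snap(80)  # "old shape" call sites that omit the new fields still work
assert old_shape.stale_data_count == 0
assert old_shape.connection_lost_time == 0.0
```
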


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Publish these new snapshot fields under the same lock used for reads.

stale_data_count and connection_lost_time are now part of the redundancy snapshot, but the corresponding writes in UPSGroupMonitor still happen lock-free. That means the evaluator can observe mixed state around failure/recovery transitions, e.g. GRACE_PERIOD with a cleared or stale connection_lost_time, and misclassify the member right at the grace boundary. Please move every mutation of snapshot-published health fields behind self.state._lock before relying on this snapshot as an atomic view.

Also applies to: 150-151

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/eneru/state.py` around lines 26 - 37, The new HealthSnapshot fields
stale_data_count and connection_lost_time are written lock-free in
UPSGroupMonitor, causing racey reads; wrap every mutation that updates these
snapshot-published health fields inside the same lock used for reads
(self.state._lock) so updates become atomic with snapshot publication. Locate
assignments to stale_data_count and connection_lost_time in UPSGroupMonitor (and
the nearby writes referred to around lines 150-151) and move them into the
critical section guarded by self.state._lock (or acquire the lock before
updating and release after), ensuring snapshot creation/assignment and any
related state mutations occur while holding self.state._lock. Ensure no other
code publishes the snapshot fields outside that lock.
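
The locking discipline the review asks for can be sketched as follows. These classes are stand-ins, not the real MonitorState/UPSGroupMonitor code: the point is only that writes to snapshot-published fields and the snapshot read share one lock, so a reader never observes a torn pair at a grace boundary.

```python
import threading
import time

# Illustrative sketch: mutate snapshot-published fields only while holding
# the same lock that the snapshot read takes.
class GuardedState:
    def __init__(self):
        self._lock = threading.Lock()
        self.stale_data_count = 0
        self.connection_lost_time = 0.0

    def mark_connection_lost(self, now=None):
        now = time.time() if now is None else now
        with self._lock:  # both fields change atomically w.r.t. snapshot()
            self.stale_data_count += 1
            self.connection_lost_time = now

    def snapshot(self):
        with self._lock:  # readers always see a consistent pair
            return (self.stale_data_count, self.connection_lost_time)
```
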


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 24 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/eneru/health_model.py">

<violation number="1" location="src/eneru/health_model.py:157">
P3: Use the already-computed `stale_threshold` variable instead of re-expanding `STALE_INTERVAL_MULTIPLIER * interval`. This avoids drift if one is updated without the other.</violation>
</file>

Partial review: this PR has more than 50 files, so cubic reviewed the highest-priority files first.

m4r1k and others added 5 commits May 4, 2026 22:18
The first restart_redundancy_nut_server in R1 has been getting SIGTERMed
~43s in with no visible failure: the script stops after "Waiting for
redundancy NUT sources (2/30)" and never prints iterations 3-30, never
prints "FAIL: redundancy NUT sources did not recover", and never reaches
the next dbg-able boundary. Without instrumentation we cannot tell
whether wait_for_redundancy_nut hangs in upsc, whether the docker
compose call wedges, or whether something in the R1 sequence itself
deadlocks before its first sleep.

Add minimum-friction observability so the next failed run is
self-diagnosing:

- dbg() prints UTC-timestamped step markers; sprinkled at every R1/R2
  phase boundary plus inside the helper functions. The last printed
  marker before SIGTERM pinpoints the wedge.
- Each upsc poll inside wait_for_redundancy_nut is now bounded by
  `timeout 5s` so a wedged libupsclient read cannot eat the entire
  30-iteration budget on a single iteration.
- dump_redundancy_nut_state() prints `docker compose ps`, the
  dummy-ups/upsd processes inside the container, and a per-UPS upsc
  probe. Called after every restart and on any R1/R2 failure path.
- stop_redundancy_nut_drivers verifies the post-condition (no UPS1/UPS2
  driver processes remain) and bounds its docker exec with `timeout 10s`.
- R1/R2 failure handlers now `cat` the full eneru log (was tail -80/-100)
  and dump container state, so the runner log is sufficient to debug
  without re-running with extra diagnostics.

No semantic changes to the test assertions or sleep timings; this is
purely additive observability.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
Root cause of the consistent R1 SIGTERM:
``pkill -f 'dummy-ups.*-a UPS1'`` runs inside ``sh -c "pkill -f
'dummy-ups.*-a UPS1' || true; pkill -f 'dummy-ups.*-a UPS2' || true"``.
The pkill regex matches the wrapper sh's own command line (which
literally contains ``dummy-ups.*-a UPS1``), so pkill kills the wrapper
before the second pkill ever runs. ``docker compose exec`` is then left
holding a half-dead exec stream that never returns -- the runner
SIGTERMs the whole step ~15s later (visible in CI as the consistent
4-minute "Process completed with exit code 143" we were chasing).

Apply the standard ``[d]ummy-ups`` bracket idiom so the regex still
matches real ``/usr/lib/nut/dummy-ups -a UPS1`` cmdlines but NOT the
literal ``[d]ummy-ups`` text in pkill's own argv. Same trick for the
post-condition ``ps -ef | grep`` so it doesn't need ``grep -v grep``
either. Add ``--kill-after=5s`` to the bounding ``timeout`` so a future
docker-compose-exec wedge can't sit on SIGTERM forever.

Verified by re-reading the freshly-instrumented CI log: the missing dbg
marker after "pkill UPS1+UPS2 dummy-ups in container" is exactly where
the wrapper sh dies, which is now the only diagnosis the data supports.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
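The self-match failure mode and the bracket idiom can be demonstrated in isolation. A sketch, with the driver cmdline and patterns taken from the commit message above:

```shell
# Real driver cmdline vs. the sh -c wrapper that carries the pkill itself.
real_cmdline='/usr/lib/nut/dummy-ups -a UPS1'
plain_wrapper="sh -c pkill -f 'dummy-ups.*-a UPS1'"
safe_wrapper="sh -c pkill -f '[d]ummy-ups.*-a UPS1'"

plain='dummy-ups.*-a UPS1'
safe='[d]ummy-ups.*-a UPS1'

# The plain pattern matches the wrapper's own argv -- this is the bug:
echo "$plain_wrapper" | grep -qE "$plain" && echo "plain pattern self-matches the wrapper"

# The bracketed pattern still matches the real driver cmdline...
echo "$real_cmdline" | grep -qE "$safe" && echo "bracketed pattern matches the real driver"

# ...but the literal text '[d]ummy-ups' in the wrapper's argv does not
# contain the substring 'dummy-ups', so the bracketed regex cannot
# self-match ([d] is a one-character class, not literal brackets):
echo "$safe_wrapper" | grep -qE "$safe" || echo "bracketed pattern does not self-match"
```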
The R1 transient-NUT-loss regression was firing a real shutdown on slow
GHA runners because the per-UPS connection grace was 15s but a full
``docker compose restart nut-server`` takes ~10s of graceful stop +
entrypoint startup + driver settle, which on these runners is reliably
>15s end-to-end. The recovery thus landed AFTER grace expired, the
member flipped to UNKNOWN, ``unknown_counts_as: critical`` lost the
quorum and fired ``REDUNDANCY GROUP SHUTDOWN`` -- exactly what R1 is
supposed to assert never happens.

This is a test-infra timing problem, not a production-code bug:
production grace defaults are 60s and operators don't restart NUT in
that window. The minimal fix is to widen the test grace and the R2
hold-loss sleep so both regressions still exercise their respective
branches with comfortable headroom on the slowest runners we've seen
(tests 21-27 ran 2.3x slower than baseline on the same job).

Changes:
- tests/e2e/config-e2e-redundancy-short-grace.yaml: grace 15s → 40s on
  both members, with a comment explaining the budget.
- tests/e2e/groups/redundancy.sh:
  - R1 eneru ``timeout 48s`` → ``90s`` (covers restart + sleep 13 +
    stop + sleep 7 + restart + sleep 10 = ~60s on slow runners with
    headroom).
  - R2 ``sleep 28`` → ``sleep 55`` (must exceed new 40s grace).
  - R2 eneru ``timeout 58s`` → ``105s`` accordingly.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
cubic (P3, src/eneru/health_model.py:157): the late stale-threshold
check re-expanded ``STALE_INTERVAL_MULTIPLIER * interval`` instead of
reusing the already-computed ``stale_threshold`` local. Replace with
``stale_threshold`` so the two sites can never drift if the multiplier
is later renamed or recomputed.

CodeRabbit (Major, src/eneru/state.py:37 - applied to monitor.py): the
new ``connection_lost_time`` snapshot field is part of a 2-field state
machine paired with ``connection_state``. Reads via ``MonitorState.snapshot()``
already take ``self._lock``, but the writes in UPSGroupMonitor were
lock-free. The redundancy evaluator could observe a torn pair (e.g.
``GRACE_PERIOD`` with ``connection_lost_time == 0.0``) right at the
grace boundary; the back-compat fallback in health_model masks the
worst of it but the contract should be tight, not papered over.

Wrap each ``connection_state`` + ``connection_lost_time`` paired write
in ``with self.state._lock:`` so snapshot() always sees a consistent
pair. Single-field writes to ``stale_data_count`` are left lock-free
(CPython attribute assignment is atomic and the field has no paired
invariant). Five paired sites updated:

- _handle_connection_failure: enter GRACE_PERIOD (set both fields).
- _handle_connection_failure: GRACE_PERIOD → FAILED transition.
- _main_loop: failsafe (FSB) path.
- _main_loop: recovery from GRACE_PERIOD.
- _main_loop: recovery from FAILED.

No behavior change; tightens the snapshot read/write contract.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
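The paired-write discipline described in this commit can be sketched as follows; the class and field names are assumed from the commit message, and this is a minimal model rather than the actual eneru code:

```python
import threading
import time

class MonitorState:
    """Two-field state machine whose reader and writers share one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.connection_state = "OK"
        self.connection_lost_time = 0.0

    def snapshot(self):
        # Reads already took the lock; the fix makes paired writes do the same.
        with self._lock:
            return (self.connection_state, self.connection_lost_time)

class UPSGroupMonitor:
    def __init__(self, state):
        self.state = state

    def handle_connection_failure(self, now):
        # Paired write: both fields change together under the lock, so a
        # concurrent snapshot() observes either the old pair or the new
        # pair, never a torn GRACE_PERIOD / connection_lost_time == 0.0 mix.
        with self.state._lock:
            self.state.connection_state = "GRACE_PERIOD"
            self.state.connection_lost_time = now

state = MonitorState()
UPSGroupMonitor(state).handle_connection_failure(time.time())
conn_state, lost_time = state.snapshot()
assert conn_state == "GRACE_PERIOD" and lost_time > 0.0
```

Single-field writes (like `stale_data_count`) can stay lock-free under CPython's atomic attribute assignment, as the commit notes; the lock is only required where two fields form one invariant.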
Adds a Buy Me a Coffee link (https://buymeacoffee.com/m4r1k) so
operators who want to chip in toward UPS hardware and NUT testing have
a clear, optional way to do so.

- README.md: badge in the existing badge row + dedicated "Support the
  project" section above the License section.
- docs/index.md: matching "Support the project" section at the end of
  the index page so the link surfaces on Read the Docs too.

Wording is intentionally low-pressure -- Eneru stays free and MIT
regardless.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@m4r1k m4r1k merged commit c57e62e into main May 4, 2026
40 checks passed
@m4r1k m4r1k deleted the fix/startup-trigger-lockout branch May 4, 2026 20:50