feat(apply): magic rollback#133
Conversation
a186653 to
42e7674
Compare
42e7674 to
9f80b24
Compare
e59057e to
6188280
Compare
6e5bcda to
6b0541f
Compare
7f996e0 to
c97de8f
Compare
43a2e20 to
6ec21fc
Compare
5435cca to
f7c89e8
Compare
7b49dc6 to
d73488e
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
📝 WalkthroughWalkthroughImplements magic rollback for remote system switches via an activation supervisor script and infrastructure. Adds systemd-duration-based rollback timeout configuration, CLI flags ( Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.4)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/systemd/time.go (1)
16-94:⚠️ Potential issue | 🟠 MajorHarden
DurationFromTimeSpanagainst malformed components.Inputs like
1currently panic when the unit-skip loop walks past the end of the string, and oversized numeric components are silently accepted because theParseInterror is ignored. This parser now backs both flag parsing and config unmarshalling, so bad user input should return an error instead of crashing or producing a nonsense duration. A regression case for1and an overflow-sized component would help lock this down.Possible hardening
func DurationFromTimeSpan(span string) (SystemdDuration, error) { @@ - num, _ := strconv.ParseInt(span[numStart:i], 10, 64) + num, err := strconv.ParseInt(span[numStart:i], 10, 64) + if err != nil { + return 0, fmt.Errorf("invalid duration component %q: %w", span[numStart:i], err) + } @@ - for unicode.IsSpace(rune(span[i])) { + for i < spanLen && unicode.IsSpace(rune(span[i])) { i += 1 } + if i >= spanLen { + return 0, fmt.Errorf("span components must have units") + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/systemd/time.go` around lines 16 - 94, DurationFromTimeSpan can panic or accept invalid values: check and handle strconv.ParseInt errors, ensure bounds after skipping spaces (i < len(span)) before reading unit and return a clear error if unit is missing or empty (fix the loop that skips spaces so it doesn't advance past end), and guard against numeric overflow when converting num*durationUnit (return an error on overflow); update the parsing in DurationFromTimeSpan to validate ParseInt's err, verify unitStart < spanLen and unit contains letters, and detect overflow when computing totalDuration; add regression tests for inputs like "1 " and an extremely large numeric component.internal/generation/generation.go (1)
180-184:⚠️ Potential issue | 🟠 MajorUse filesystem abstraction for
ReadLinkto maintain consistency.Line 181 uses
os.Readlinkdirectly while the rest of the function usess.FS()methods for filesystem operations. This breaks the abstraction pattern and will fail for remote systems. TheFilesysteminterface includes aReadLinkmethod that should be used instead.Fix: Use filesystem abstraction
- currentGenerationLink, err := os.Readlink(currentGenerationDirname) + currentGenerationLink, err := s.FS().ReadLink(currentGenerationDirname) if err != nil { log.Warnf("unable to determine current generation: %v", err) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/generation/generation.go` around lines 180 - 184, The code uses os.Readlink directly for currentGenerationDirname (assigning to currentGenerationLink) which breaks the filesystem abstraction; replace os.Readlink with the Filesystem abstraction method by calling s.FS().ReadLink(currentGenerationDirname), propagate/handle the returned value and error exactly as done elsewhere in this function, and remove the direct os dependency so remote or mocked filesystems work consistently with the rest of the code.
🧹 Nitpick comments (5)
internal/logger/replay.go (1)
6-7: Remove unusedlevelstate to avoid drift/confusion.
ReplayLogger.levelis initialized but never read; either use it for internal filtering or remove it.♻️ Optional cleanup
type ReplayLogger struct { - level LogLevel entries []logEntry out Logger } func NewReplayLogger(out Logger) *ReplayLogger { return &ReplayLogger{ - level: LogLevelInfo, entries: make([]logEntry, 0, 64), out: out, } }Also applies to: 38-39
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/logger/replay.go` around lines 6 - 7, ReplayLogger has an unused field ReplayLogger.level that should be removed to avoid state drift; remove the level field from the struct definition and delete any code that initializes or assigns to ReplayLogger.level (e.g., in the NewReplayLogger constructor or any setters) and ensure only entries ([]logEntry) remains, updating any methods that referenced level to use entries or external filtering instead; also remove corresponding unused imports or tests that reference the removed field.internal/activation/supervisor.sh (1)
74-100: Consider handling case whereprevToplevelis empty.If
readlink /run/current-systemfails (line 76),prevToplevelwill be empty. Therollback()function would then construct an invalid path like/bin/switch-to-configurationand fail with a confusing error.♻️ Proposed fix: Validate prevToplevel before rollback
prevToplevel=$(readlink /run/current-system) +if [ -z "$prevToplevel" ]; then + log "warning: could not determine previous system; rollback may fail" +fi rollback() { + if [ -z "$prevToplevel" ]; then + log "cannot rollback: previous system toplevel is unknown" + exit 1 + fi + # $ROLLBACK_PROFILE_ON_FAILURE will not be set when switching🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/activation/supervisor.sh` around lines 74 - 100, The rollback() function uses prevToplevel (set from readlink /run/current-system) without validating it, which can produce an invalid path; update rollback() to check if prevToplevel is non-empty and points to an existing directory before constructing prevSwitchToConfigurationScript and invoking it (verify the resolved path exists and is executable); if prevToplevel is empty or invalid, log a clear error via the same logging mechanism and exit with a non-zero status instead of attempting to call "/bin/switch-to-configuration"; keep the existing behavior for PREVIOUS_SPECIALISATION when prevToplevel is valid.cmd/apply/apply.go (1)
702-749: VerifynewGenerationCreatedhandles edge cases.The logic compares symlink targets before and after profile creation to detect new generations. Consider these edge cases:
- If
ReadLinkfails onprevLinkbut succeeds onafterLink,newGenerationCreatedstays false (line 744 requires both non-empty)- If both links resolve to the same path (no new generation created), rollback is correctly skipped
This seems correct - when we can't determine the previous state, we conservatively don't rollback. However, the warning at line 716-717 might benefit from noting that rollback won't be attempted if the previous link couldn't be read.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/apply/apply.go` around lines 702 - 749, The ReadLink failure for prevLink (when calling targetHost.FS().ReadLink(activeProfileLink)) should log that because prevLink couldn't be determined a rollback will be skipped; update the warning emitted where prevLink is read (referencing activeProfileLink, prevLink, and ReadLink) to include a note that rollback of the system profile will not be attempted when the previous link is unknown, and keep newGenerationCreated logic (which compares prevLink and afterLink) unchanged.internal/activation/activation.go (2)
595-598: Use%wfor error wrapping consistency.Line 597 uses
%vwhile line 525 uses%w. Using%wenables proper error unwrapping witherrors.Is()anderrors.As().♻️ Proposed fix
activationErr := <-activationComplete if activationErr != nil { - return fmt.Errorf("activation supervisor exited with error: %v", activationErr) + return fmt.Errorf("activation supervisor exited with error: %w", activationErr) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/activation/activation.go` around lines 595 - 598, Change the fmt.Errorf call that reports errors read from the activationComplete channel to use %w so the error is wrapped for unwrapping; specifically, where activationErr := <-activationComplete and you currently return fmt.Errorf("activation supervisor exited with error: %v", activationErr), replace the format verb to return fmt.Errorf("activation supervisor exited with error: %w", activationErr) so tools like errors.Is()/errors.As() can inspect the original activationErr.
338-357: World-writable directory is a security consideration worth tracking.The world-writable trigger directory (
0o777) with sticky bit is documented as intentional to enable non-root activation. While the sticky bit prevents users from deleting each other's files, any local user can create trigger files, potentially allowing them to acknowledge activations they shouldn't control.Consider opening an issue to track this for future hardening, perhaps using group-based permissions or a more restrictive mechanism when security is a concern.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/activation/activation.go` around lines 338 - 357, The triggerDirectory in EnsureActivationDirectoryExists is intentionally created world-writable (0o777|os.ModeSticky), which is a security concern to track; open a tracking issue describing the risk and proposed hardening (e.g., group-based permissions or configurable restrictiveness), then add a concise TODO comment above the EnsureActivationDirectoryExists function referencing the new issue number and summarizing why the directory is world-writable today and what should be changed in future; optionally, also add a runtime warning log in EnsureActivationDirectoryExists (mentioning triggerDirectory) that points to the issue so operators are aware.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/activation/activation.go`:
- Around line 546-568: The goroutine can leak an SSH connection because
reconnectCh is created buffered (reconnectCh := make(chan *system.SSHSystem, 1))
so the send of s2 from s.(*system.SSHSystem).Clone() never blocks and s2 can
remain in the channel if the receiver times out; change reconnectCh to an
unbuffered channel (make(chan *system.SSHSystem)) so the send will block and the
goroutine's timeout branch (select with time.After(opts.AckTimeout)) will run
and call s2.Close(), or alternatively ensure the send path closes s2 when the
receiver times out by adding a coordinating select/cancel; update uses of
reconnectCh, Clone(), s2 and s2.Close() accordingly.
In `@internal/logger/replay.go`:
- Around line 173-175: ReplayLogger.SetLogLevel is currently a no-op so calls
like logger.SetLogLevel(logger.LogLevelDebug) (e.g., in PreRun) have no effect
when a ReplayLogger is active; implement SetLogLevel on type ReplayLogger to
actually honor the requested level by storing/applying it to the logger instance
(or forwarding it to its underlying sink/parent logger) so verbose/debug flags
take effect — update the method on ReplayLogger (SetLogLevel) to set an internal
field or call the underlying sink's SetLogLevel/Configure method (matching your
existing LogLevel type and any sink methods) and ensure any checks that use the
level (e.g., ReplayLogger.log or level checks) read that stored/applied level.
In `@internal/settings/settings.go`:
- Around line 520-534: The code currently only populates cfg.setDeprecatedFields
inside parse( k *koanf.Koanf ), so dynamic updates via (*Settings).SetValue
don't mark deprecated keys as explicitly set; add a single helper on Settings
(e.g., markDeprecatedFieldSet(key string)) that records keys into
setDeprecatedFields and use it from the parse loop (where you currently set
cfg.setDeprecatedFields[key]) and also call it from (*Settings).SetValue
immediately after a successful assignment; update tests to include a regression
case where SetValue("auto_rollback", "false") is applied and Validate() respects
the explicit false.
In `@nix/tests/magic-rollback.test.nix`:
- Around line 148-149: The test references
nodes.target.system.build.networkConfig via targetNetworkJSON but the target
node only defines targetConfig under system.build, causing evaluation failure;
either add a networkConfig attribute to the target node's system.build (e.g.,
populate nodes.target.system.build.networkConfig with the intended network data)
or remove/stop using targetNetworkJSON so the test no longer reads
nodes.target.system.build.networkConfig—update the test to consistently use
either targetConfig or a newly added networkConfig on the target node.
---
Outside diff comments:
In `@internal/generation/generation.go`:
- Around line 180-184: The code uses os.Readlink directly for
currentGenerationDirname (assigning to currentGenerationLink) which breaks the
filesystem abstraction; replace os.Readlink with the Filesystem abstraction
method by calling s.FS().ReadLink(currentGenerationDirname), propagate/handle
the returned value and error exactly as done elsewhere in this function, and
remove the direct os dependency so remote or mocked filesystems work
consistently with the rest of the code.
In `@internal/systemd/time.go`:
- Around line 16-94: DurationFromTimeSpan can panic or accept invalid values:
check and handle strconv.ParseInt errors, ensure bounds after skipping spaces (i
< len(span)) before reading unit and return a clear error if unit is missing or
empty (fix the loop that skips spaces so it doesn't advance past end), and guard
against numeric overflow when converting num*durationUnit (return an error on
overflow); update the parsing in DurationFromTimeSpan to validate ParseInt's
err, verify unitStart < spanLen and unit contains letters, and detect overflow
when computing totalDuration; add regression tests for inputs like "1 " and an
extremely large numeric component.
---
Nitpick comments:
In `@cmd/apply/apply.go`:
- Around line 702-749: The ReadLink failure for prevLink (when calling
targetHost.FS().ReadLink(activeProfileLink)) should log that because prevLink
couldn't be determined a rollback will be skipped; update the warning emitted
where prevLink is read (referencing activeProfileLink, prevLink, and ReadLink)
to include a note that rollback of the system profile will not be attempted when
the previous link is unknown, and keep newGenerationCreated logic (which
compares prevLink and afterLink) unchanged.
In `@internal/activation/activation.go`:
- Around line 595-598: Change the fmt.Errorf call that reports errors read from
the activationComplete channel to use %w so the error is wrapped for unwrapping;
specifically, where activationErr := <-activationComplete and you currently
return fmt.Errorf("activation supervisor exited with error: %v", activationErr),
replace the format verb to return fmt.Errorf("activation supervisor exited with
error: %w", activationErr) so tools like errors.Is()/errors.As() can inspect the
original activationErr.
- Around line 338-357: The triggerDirectory in EnsureActivationDirectoryExists
is intentionally created world-writable (0o777|os.ModeSticky), which is a
security concern to track; open a tracking issue describing the risk and
proposed hardening (e.g., group-based permissions or configurable
restrictiveness), then add a concise TODO comment above the
EnsureActivationDirectoryExists function referencing the new issue number and
summarizing why the directory is world-writable today and what should be changed
in future; optionally, also add a runtime warning log in
EnsureActivationDirectoryExists (mentioning triggerDirectory) that points to the
issue so operators are aware.
In `@internal/activation/supervisor.sh`:
- Around line 74-100: The rollback() function uses prevToplevel (set from
readlink /run/current-system) without validating it, which can produce an
invalid path; update rollback() to check if prevToplevel is non-empty and points
to an existing directory before constructing prevSwitchToConfigurationScript and
invoking it (verify the resolved path exists and is executable); if prevToplevel
is empty or invalid, log a clear error via the same logging mechanism and exit
with a non-zero status instead of attempting to call
"/bin/switch-to-configuration"; keep the existing behavior for
PREVIOUS_SPECIALISATION when prevToplevel is valid.
In `@internal/logger/replay.go`:
- Around line 6-7: ReplayLogger has an unused field ReplayLogger.level that
should be removed to avoid state drift; remove the level field from the struct
definition and delete any code that initializes or assigns to ReplayLogger.level
(e.g., in the NewReplayLogger constructor or any setters) and ensure only
entries ([]logEntry) remains, updating any methods that referenced level to use
entries or external filtering instead; also remove corresponding unused imports
or tests that reference the removed field.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 12b1226b-9172-41ad-8d87-5a819b4cb14d
📒 Files selected for processing (35)
cmd/activate/activate.gocmd/activate/run.gocmd/apply/apply.gocmd/generation/delete/delete.gocmd/generation/delete/resolver.gocmd/generation/delete/resolver_test.gocmd/generation/rollback/rollback.gocmd/generation/shared/utils.gocmd/generation/switch/switch.gocmd/info/info.gocmd/root/root.godoc/man/nixos-cli-apply.1.scddoc/man/nixos-cli-env.5.scdinternal/activation/activation.gointernal/activation/supervisor.shinternal/cmd/opts/opts.gointernal/cmd/utils/utils.gointernal/constants/constants.gointernal/generation/completion.gointernal/generation/generation.gointernal/generation/specialisations.gointernal/logger/replay.gointernal/settings/settings.gointernal/settings/settings_test.gointernal/system/elevator.gointernal/system/fs.gointernal/system/local_fs.gointernal/system/sftp_fs.gointernal/system/ssh.gointernal/systemd/time.gointernal/systemd/time_test.gonix/tests/default.nixnix/tests/example.test.nixnix/tests/magic-rollback.test.nixnix/tests/resources/ssh-keys.nix
💤 Files with no reviewable changes (1)
- cmd/root/root.go
Before, time spans were stored as a string, and parsed whenever a direct time.Duration was needed. This is strange and requires manual conversions, as compared to just using a new type for this and implementing serialization/deserialization for the type directly so that it can be used by both cobra and koanf as flag/config types. This will also enable usage of systemd.time(7) spans in koanf configs, with stronger validation as well.
shlex.Quote() is much more strict in escaping than my much more lax utils.Quote() string. Using manual quoting allows passing multi-line strings such as the supervisor script properly.
d73488e to
35a306c
Compare
Description
In case changes in configuration settings for connections (i.e. SSH, internet interfaces, etc.) restart and cause loss of access to the machine in question, this PR implements a magic rollback mechanism that runs on the target.
It offloads a ton of the switch logic into a
supervisorbash script, which gets ran as asystemd-runtransient unit on the host. What makes this hugely different fromdeploy-rsis that it does not require modifying the destination closure whatsoever, so magic rollback can be used on systems withoutnixos-cliWHATSOEVER, including on a completely base NixOS system with no extra options.Closes #119.
Summary by CodeRabbit
Release Notes
New Features
--rollback-timeoutand--no-rollbackCLI flags for apply command to control automatic system rollback behavior.rollback.enableandrollback.timeout(replacesauto_rollback).Documentation
Tests