fix: cancel evicted group when it was swapped #5235
Conversation
📝 Walkthrough

On successful CompareAndSwap in Dispatcher.groupAlert, the dispatcher now calls cancel() on the replaced aggregation group so its goroutine stops immediately; a unit test verifies the canceled context.

Changes: Aggregation Group Lifecycle Fix
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Signed-off-by: Christoph Maser <[email protected]>
🧹 Nitpick comments (1)
dispatch/dispatch.go (1)
526-528: ⚡ Quick win — Verify the fix prevents goroutine leaks with production monitoring.

The fix correctly addresses the memory leak by calling `cancel()` on the evicted group immediately after `CompareAndSwap` succeeds. The old group becomes unreachable in the map and `doMaintenance()` won't find it, so this explicit cancellation ensures its goroutine exits via the `<-ag.ctx.Done()` case in `run()` rather than running indefinitely.

The `alertmanager_dispatcher_aggregation_groups` metric already exists and decreases properly when groups are destroyed in `doMaintenance()` (line 332). Consider:

- Monitoring this metric in production post-deployment to confirm destroyed groups are properly cleaned up
- Extending the existing `TestGroupAlert_RecoversWhenCASFails` test to assert on goroutine count before and after group replacement, or adding a dedicated test that monitors goroutine lifecycle under group contention

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dispatch/dispatch.go` around lines 526 - 528, Add production monitoring and tests to ensure the cancel() call on evicted aggrGroup instances prevents goroutine leaks: instrument and monitor the existing alertmanager_dispatcher_aggregation_groups metric in production after deploy to confirm counts decrease on group destruction, and extend TestGroupAlert_RecoversWhenCASFails (or add a new test) to capture goroutine counts before and after a CompareAndSwap replacement of an aggrGroup and assert the old group's goroutine terminates (i.e., that run() exits via <-ag.ctx.Done()); make the test provoke CAS contention and validate doMaintenance() behavior as part of the lifecycle check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: adfb6b7c-a7e8-4a12-a08f-f7898cf32086
📒 Files selected for processing (1)
dispatch/dispatch.go
🧹 Nitpick comments (1)
dispatch/dispatch_test.go (1)
828-868: ⚡ Quick win — Consider verifying the new group was created and contains the alert.
The test correctly verifies that the destroyed group's context is canceled after CAS swap. To strengthen confidence that the fix is complete and the alert wasn't lost, consider adding assertions to verify:
- A new (non-destroyed) aggregation group now occupies the fingerprint
- The new group contains the alert
Proposed enhancement
```diff
 	dispatcher.groupAlert(context.Background(), alert, route)
+
+	// Verify a new group was created and stored
+	el, ok := dispatcher.routeGroupsSlice[0].groups.Load(destroyedAg.fingerprint())
+	require.True(t, ok, "a new group should exist at the fingerprint")
+	newAg := el.(*aggrGroup)
+	require.False(t, newAg.destroyed(), "new group should not be destroyed")
+	require.NotSame(t, destroyedAg, newAg, "new group should be different from destroyed group")
+	require.Len(t, newAg.alerts.List(), 1, "new group should contain the alert")
+	require.ErrorIs(t, destroyedAg.ctx.Err(), context.Canceled, "destroyed group's context must be cancelled after being CAS-swapped out")
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dispatch/dispatch_test.go` around lines 828 - 868, Add assertions after calling dispatcher.groupAlert to ensure the new aggregation group replaced the destroyed one and that it contains the alert: load the stored value from dispatcher.routeGroupsSlice[0].groups using destroyedAg.fingerprint(), assert the loaded value is an *aggrGroup and that aggr.destroyed() is false, then check the aggr's alerts map (or aggr.alerts.Lookup / equivalent API used elsewhere) contains the fingerprint for the alert you created; reference TestGroupAlert_CancelsDestroyedGroupOnCASSwap, dispatcher.groupAlert, destroyedAg.fingerprint(), and the routeAggrGroups groups store when locating code to modify.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b5d5351b-e8d7-4075-b6ae-51e8c47fbed3
📒 Files selected for processing (2)
dispatch/dispatch.go
dispatch/dispatch_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- dispatch/dispatch.go
Pull Request Checklist
Please check all the applicable boxes.
Which user-facing changes does this PR introduce?
Summary by CodeRabbit

Bug Fixes
- The dispatcher now cancels the replaced aggregation group on a successful CompareAndSwap, so its goroutine stops immediately instead of leaking.

Tests
- Added a unit test verifying the evicted group's context is canceled after the swap.