fix: cancel evicted group when it was swaped #5235

Open
TheMeier wants to merge 1 commit into prometheus:main from TheMeier:issues/5182

Conversation

@TheMeier
Contributor

@TheMeier TheMeier commented May 12, 2026

Pull Request Checklist

Please check all the applicable boxes.

Which user-facing changes does this PR introduce?

[FIX] DISPATCH: goroutines not cancelled

Summary by CodeRabbit

  • Bug Fixes

    • Fixed resource management so replaced aggregation groups are stopped immediately, preventing stray background work and improving stability and performance.
  • Tests

    • Added a unit test to verify replaced aggregation groups are properly canceled and do not continue running.

@TheMeier TheMeier requested a review from a team as a code owner May 12, 2026 19:41
@coderabbitai

coderabbitai Bot commented May 12, 2026

📝 Walkthrough

On successful CompareAndSwap in Dispatcher.groupAlert, the dispatcher now calls cancel() on the replaced aggregation group so its goroutine stops immediately; a unit test verifies the canceled context.
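
The swap-then-cancel pattern described above can be sketched in isolation. This is a minimal illustration, not the dispatcher's actual code: the `group` type, `newGroup`, `replaceGroup`, and the `"fp"` key are hypothetical stand-ins, and only `sync.Map.CompareAndSwap` and `context.WithCancel` are real standard-library calls.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// group is a hypothetical stand-in for an aggregation group: it carries
// only the context and cancel function that control its goroutine.
type group struct {
	ctx    context.Context
	cancel context.CancelFunc
}

func newGroup() *group {
	ctx, cancel := context.WithCancel(context.Background())
	return &group{ctx: ctx, cancel: cancel}
}

// replaceGroup swaps old for fresh under key. Only when the swap succeeds
// does it cancel old, because at that point the map no longer references it.
// It reports whether the swap happened.
func replaceGroup(groups *sync.Map, key string, old, fresh *group) bool {
	if !groups.CompareAndSwap(key, old, fresh) {
		// The entry changed underneath us; this sketch leaves old untouched.
		return false
	}
	old.cancel()
	return true
}

// swapCancelsOld reports whether a successful swap leaves the old group's
// context canceled and the new group's context still live.
func swapCancelsOld() bool {
	var groups sync.Map
	old, fresh := newGroup(), newGroup()
	defer fresh.cancel()
	groups.Store("fp", old)

	swapped := replaceGroup(&groups, "fp", old, fresh)
	return swapped && old.ctx.Err() == context.Canceled && fresh.ctx.Err() == nil
}

// failedSwapKeepsOld reports whether a failed swap (the map holds a
// different group than expected) leaves the expected group's context live.
func failedSwapKeepsOld() bool {
	var groups sync.Map
	stored, expected, fresh := newGroup(), newGroup(), newGroup()
	defer stored.cancel()
	defer expected.cancel()
	defer fresh.cancel()
	groups.Store("fp", stored)

	swapped := replaceGroup(&groups, "fp", expected, fresh)
	return !swapped && expected.ctx.Err() == nil
}

func main() {
	fmt.Println("old group canceled after swap:", swapCancelsOld())
	fmt.Println("failed swap leaves group live:", failedSwapKeepsOld())
}
```

Cancelling only on a successful swap keeps the sketch from stopping a group that another goroutine may still have stored in the map.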

Changes

Aggregation Group Lifecycle Fix

| Layer / File(s) | Summary |
| --- | --- |
| Cancel unreachable aggregation group on CAS success: dispatch/dispatch.go, dispatch/dispatch_test.go | After a successful CompareAndSwap replacing an aggregation group, the dispatcher calls cancel() on the previously referenced aggrGroup to terminate its run loop; the added test TestGroupAlert_CancelsDestroyedGroupOnCASSwap verifies the canceled context. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

component/dispatch

Suggested reviewers

  • siavashs
  • SoloJacobs
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix: cancel evicted group when it was swaped' directly describes the main change: adding cancellation logic for evicted aggregation groups during CompareAndSwap operations. |
| Description check | ✅ Passed | The description includes the linked issue (#5182), release notes entry, and required checklist items (signed-off commits, contribution guidelines acknowledgment), though it lacks explicit test coverage documentation. |
| Linked Issues check | ✅ Passed | The PR fixes the memory leak reported in #5182 by canceling goroutines when aggregation groups are evicted during CompareAndSwap, preventing unreachable groups from persisting until maintenance cleanup. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the memory leak: the cancel() call in dispatch.go and a test verifying the cancellation behavior in dispatch_test.go. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
dispatch/dispatch.go (1)

526-528: ⚡ Quick win

Verify the fix prevents goroutine leaks with production monitoring.

The fix addresses the memory leak by calling cancel() on the evicted group immediately after CompareAndSwap succeeds. Once replaced, the old group is no longer reachable through the map, so doMaintenance() never finds it. The explicit cancellation makes its goroutine exit via the <-ag.ctx.Done() case in run() instead of running indefinitely.
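
To illustrate the exit path through the context's Done channel, here is a simplified sketch of a run loop. It is not the real aggrGroup.run(): `runLoop` and `exitsAfterCancel` are hypothetical names, and the ticker stands in for whatever timers the real loop selects on.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runLoop is a hypothetical, simplified run loop. It wakes on every tick
// and returns as soon as ctx is canceled; done is closed on exit so that
// callers can observe termination.
func runLoop(ctx context.Context, interval time.Duration, done chan<- struct{}) {
	defer close(done)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// A real group would flush its alerts here.
		case <-ctx.Done():
			return
		}
	}
}

// exitsAfterCancel starts runLoop with a long tick interval, cancels its
// context, and reports whether the loop terminated within two seconds.
func exitsAfterCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go runLoop(ctx, time.Hour, done)

	cancel()
	select {
	case <-done:
		return true
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("loop exited after cancel:", exitsAfterCancel())
}
```

Without the cancel() call, nothing in this loop would ever return, which mirrors the leak: a goroutine that is unreachable from the map but still blocked in select.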

The alertmanager_dispatcher_aggregation_groups metric already exists and decreases properly when groups are destroyed in doMaintenance() (line 332). Consider:

  • Monitoring this metric in production post-deployment to confirm destroyed groups are properly cleaned up
  • Extending the existing TestGroupAlert_RecoversWhenCASFails test to assert on the goroutine count before and after group replacement, or adding a dedicated test that monitors the goroutine lifecycle under group contention
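
A goroutine-count assertion of the kind suggested above could look roughly like the following. This is a sketch under stated assumptions rather than a test for the dispatcher itself: `worker`, `settlesAt`, and `leakCheck` are hypothetical helpers, and the polling loop is there because a canceled goroutine exits asynchronously, so a single immediate read of runtime.NumGoroutine may still include it.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// worker is a hypothetical stand-in for a group's goroutine: it blocks
// until its context is canceled.
func worker(ctx context.Context) {
	<-ctx.Done()
}

// settlesAt polls runtime.NumGoroutine until it is at most want or the
// timeout expires.
func settlesAt(want int, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if runtime.NumGoroutine() <= want {
			return true
		}
		time.Sleep(5 * time.Millisecond)
	}
	return runtime.NumGoroutine() <= want
}

// leakCheck starts n workers, cancels them, and reports whether the
// goroutine count returns to its starting value within two seconds.
func leakCheck(n int) bool {
	before := runtime.NumGoroutine()
	ctx, cancel := context.WithCancel(context.Background())
	for i := 0; i < n; i++ {
		go worker(ctx)
	}
	cancel()
	return settlesAt(before, 2*time.Second)
}

func main() {
	fmt.Println("goroutine count settled:", leakCheck(10))
}
```

Counting goroutines is sensitive to unrelated background goroutines, so in a real test suite the comparison would need to tolerate or exclude them.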
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dispatch/dispatch.go` around lines 526 - 528, Add production monitoring and
tests to ensure the cancel() call on evicted aggrGroup instances prevents
goroutine leaks: instrument and monitor the existing
alertmanager_dispatcher_aggregation_groups metric in production after deploy to
confirm counts decrease on group destruction, and extend
TestGroupAlert_RecoversWhenCASFails (or add a new test) to capture goroutine
counts before and after a CompareAndSwap replacement of an aggrGroup and assert
the old group's goroutine terminates (i.e., that run() exits via
<-ag.ctx.Done()); make the test provoke CAS contention and validate
doMaintenance() behavior as part of the lifecycle check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: adfb6b7c-a7e8-4a12-a08f-f7898cf32086

📥 Commits

Reviewing files that changed from the base of the PR and between 7806d6e and 0165314.

📒 Files selected for processing (1)
  • dispatch/dispatch.go

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
dispatch/dispatch_test.go (1)

828-868: ⚡ Quick win

Consider verifying the new group was created and contains the alert.

The test correctly verifies that the destroyed group's context is canceled after CAS swap. To strengthen confidence that the fix is complete and the alert wasn't lost, consider adding assertions to verify:

  • A new (non-destroyed) aggregation group now occupies the fingerprint
  • The new group contains the alert
Proposed enhancement
 	dispatcher.groupAlert(context.Background(), alert, route)
 
+	// Verify a new group was created and stored
+	el, ok := dispatcher.routeGroupsSlice[0].groups.Load(destroyedAg.fingerprint())
+	require.True(t, ok, "a new group should exist at the fingerprint")
+	newAg := el.(*aggrGroup)
+	require.False(t, newAg.destroyed(), "new group should not be destroyed")
+	require.NotSame(t, destroyedAg, newAg, "new group should be different from destroyed group")
+	require.Len(t, newAg.alerts.List(), 1, "new group should contain the alert")
+
 	require.ErrorIs(t, destroyedAg.ctx.Err(), context.Canceled,
 		"destroyed group's context must be cancelled after being CAS-swapped out")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dispatch/dispatch_test.go` around lines 828 - 868, Add assertions after
calling dispatcher.groupAlert to ensure the new aggregation group replaced the
destroyed one and that it contains the alert: load the stored value from
dispatcher.routeGroupsSlice[0].groups using destroyedAg.fingerprint(), assert
the loaded value is an *aggrGroup and that aggr.destroyed() is false, then check
the aggr's alerts map (or aggr.alerts.Lookup / equivalent API used elsewhere)
contains the fingerprint for the alert you created; reference
TestGroupAlert_CancelsDestroyedGroupOnCASSwap, dispatcher.groupAlert,
destroyedAg.fingerprint(), and the routeAggrGroups groups store when locating
code to modify.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b5d5351b-e8d7-4075-b6ae-51e8c47fbed3

📥 Commits

Reviewing files that changed from the base of the PR and between 0165314 and 3c3b808.

📒 Files selected for processing (2)
  • dispatch/dispatch.go
  • dispatch/dispatch_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • dispatch/dispatch.go

@TheMeier TheMeier requested a review from ultrotter May 13, 2026 08:54

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory leak after upgrade to 0.32.0

1 participant