fix: cancel evicted group when it was swaped by TheMeier · Pull Request #5235 · prometheus/alertmanager

TheMeier · 2026-05-12T19:41:56Z

Pull Request Checklist

Please check all the applicable boxes.

Please list all open issue(s) discussed with maintainers related to this change
- Fixes Memory leak after upgrade to 0.32.0 #5182
Is this a new Receiver integration?
- I have already tried to use the Webhook Receiver Integration and 3rd party integrations before adding this new Receiver Integration
Is this a bugfix?
- I have added tests that can reproduce the bug which pass with this bugfix applied
I have signed-off my commits
I will follow best practices for contributing to this project

Which user-facing changes does this PR introduce?

[FIX] DISPATCH: goroutines not cancelled

Summary by CodeRabbit

Bug Fixes
- Fixed resource management so replaced aggregation groups are stopped immediately, preventing stray background work and improving stability and performance.
Tests
- Added a unit test to verify replaced aggregation groups are properly canceled and do not continue running.

coderabbitai · 2026-05-12T19:42:08Z

Walkthrough

On successful CompareAndSwap in Dispatcher.groupAlert, the dispatcher now calls cancel() on the replaced aggregation group so its goroutine stops immediately; a unit test verifies the canceled context.

Changes

Aggregation Group Lifecycle Fix

| Layer / File(s) | Summary |
|---|---|
| Cancel unreachable aggregation group on CAS success (`dispatch/dispatch.go`, `dispatch/dispatch_test.go`) | After a successful `CompareAndSwap` replacing an aggregation group, the dispatcher calls `cancel()` on the previously-referenced `aggrGroup` to terminate its run loop; added test `TestGroupAlert_CancelsDestroyedGroupOnCASSwap` verifies the canceled context. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

prometheus/alertmanager#5179: Also modifies Dispatcher.groupAlert CAS handling for aggregation-group swaps.

Suggested labels

component/dispatch

Suggested reviewers

siavashs
SoloJacobs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'fix: cancel evicted group when it was swaped' directly describes the main change: adding cancellation logic for evicted aggregation groups during CompareAndSwap operations. |
| Description check | ✅ Passed | The description includes the linked issue (`#5182`), release notes entry, and required checklist items (signed-off commits, contribution guidelines acknowledgment), though it lacks explicit test coverage documentation. |
| Linked Issues check | ✅ Passed | The PR fixes the memory leak reported in `#5182` by canceling goroutines when aggregation groups are evicted during CompareAndSwap, preventing unreachable groups from persisting until maintenance cleanup. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the memory leak: cancel() call in dispatch.go and a test verifying the cancellation behavior in dispatch_test.go. |


Signed-off-by: Christoph Maser <[email protected]>

coderabbitai

🧹 Nitpick comments (1)

dispatch/dispatch.go (1)
526-528: ⚡ Quick win

Verify the fix prevents goroutine leaks with production monitoring.

The fix correctly addresses the memory leak by calling cancel() on the evicted group immediately after CompareAndSwap succeeds. Once swapped out, the old group is unreachable through the map, so doMaintenance() never finds it; the explicit cancellation makes its goroutine exit via the <-ag.ctx.Done() case in run() instead of running indefinitely.

The alertmanager_dispatcher_aggregation_groups metric already exists and decreases when groups are destroyed in doMaintenance() (line 332). Consider:

- Monitoring this metric in production after deployment to confirm that destroyed groups are cleaned up.
- Extending the existing TestGroupAlert_RecoversWhenCASFails test to assert on the goroutine count before and after group replacement, or adding a dedicated test that monitors the goroutine lifecycle under group contention.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dispatch/dispatch.go` around lines 526 - 528, Add production monitoring and
tests to ensure the cancel() call on evicted aggrGroup instances prevents
goroutine leaks: instrument and monitor the existing
alertmanager_dispatcher_aggregation_groups metric in production after deploy to
confirm counts decrease on group destruction, and extend
TestGroupAlert_RecoversWhenCASFails (or add a new test) to capture goroutine
counts before and after a CompareAndSwap replacement of an aggrGroup and assert
the old group's goroutine terminates (i.e., that run() exits via
<-ag.ctx.Done()); make the test provoke CAS contention and validate
doMaintenance() behavior as part of the lifecycle check.



📥 Commits

Reviewing files that changed from the base of the PR and between 7806d6e and 0165314.

📒 Files selected for processing (1)

dispatch/dispatch.go

coderabbitai

🧹 Nitpick comments (1)

dispatch/dispatch_test.go (1)

828-868: ⚡ Quick win

Consider verifying the new group was created and contains the alert.

The test correctly verifies that the destroyed group's context is canceled after CAS swap. To strengthen confidence that the fix is complete and the alert wasn't lost, consider adding assertions to verify:

- A new (non-destroyed) aggregation group now occupies the fingerprint.
- The new group contains the alert.

Proposed enhancement

 	dispatcher.groupAlert(context.Background(), alert, route)
 
+	// Verify a new group was created and stored
+	el, ok := dispatcher.routeGroupsSlice[0].groups.Load(destroyedAg.fingerprint())
+	require.True(t, ok, "a new group should exist at the fingerprint")
+	newAg := el.(*aggrGroup)
+	require.False(t, newAg.destroyed(), "new group should not be destroyed")
+	require.NotSame(t, destroyedAg, newAg, "new group should be different from destroyed group")
+	require.Len(t, newAg.alerts.List(), 1, "new group should contain the alert")
+
 	require.ErrorIs(t, destroyedAg.ctx.Err(), context.Canceled,
 		"destroyed group's context must be cancelled after being CAS-swapped out")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dispatch/dispatch_test.go` around lines 828 - 868, Add assertions after
calling dispatcher.groupAlert to ensure the new aggregation group replaced the
destroyed one and that it contains the alert: load the stored value from
dispatcher.routeGroupsSlice[0].groups using destroyedAg.fingerprint(), assert
the loaded value is an *aggrGroup and that aggr.destroyed() is false, then check
the aggr's alerts map (or aggr.alerts.Lookup / equivalent API used elsewhere)
contains the fingerprint for the alert you created; reference
TestGroupAlert_CancelsDestroyedGroupOnCASSwap, dispatcher.groupAlert,
destroyedAg.fingerprint(), and the routeAggrGroups groups store when locating
code to modify.



📥 Commits

Reviewing files that changed from the base of the PR and between 0165314 and 3c3b808.

📒 Files selected for processing (2)

dispatch/dispatch.go
dispatch/dispatch_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

dispatch/dispatch.go

TheMeier requested a review from a team as a code owner May 12, 2026 19:41

fix: cancel evicted group when it was swaped

3c3b808

Signed-off-by: Christoph Maser <[email protected]>

TheMeier force-pushed the issues/5182 branch from 0165314 to 3c3b808 on May 12, 2026 19:45

coderabbitai Bot reviewed May 12, 2026


TheMeier requested a review from ultrotter May 13, 2026 08:54

TheMeier wants to merge 1 commit into prometheus:main from TheMeier:issues/5182
