Skip to content

fix: remove stuck ROSAMachinePool finalizers during namespace cleanup #578

Open
bsquizz wants to merge 2 commits into main from fix/rosa-machinepool-stuck-finalizer


@bsquizz bsquizz commented Apr 29, 2026

Summary

  • During ROSA HCP cluster deletion, the ROSAMachinePool controller can get permanently stuck in a reconcile loop when the OCM API returns a 400 error (e.g. "Attribute 'aws_node_pool.root_volume.size' is not allowed"), so its finalizer is never removed
  • This blocks the full deletion chain (ROSAMachinePool → MachinePool → Cluster) and leaves the namespace stuck in the deleting state indefinitely
  • Adds RemoveStuckROSAMachinePoolFinalizers to the CAPI cleanup helper. On each requeue pass it detects ROSAMachinePool resources that have a deletionTimestamp and still carry the controller finalizer, and patches that finalizer away
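The detection step in the last bullet can be sketched roughly as follows. This is a minimal, standard-library-only illustration under stated assumptions, not the code from this PR: the `object` type stands in for Kubernetes object metadata, and the finalizer string, `newDeleting`, `isStuck`, and `withoutFinalizer` are hypothetical names chosen for the example. The real helper would presumably list ROSAMachinePool resources through a Kubernetes client and send a patch instead of mutating a local slice.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical finalizer name used only for this sketch; the PR text does
// not state the real finalizer string.
const rosaMachinePoolFinalizer = "example.invalid/rosamachinepool"

// object is a simplified stand-in for the metadata of a Kubernetes resource.
type object struct {
	Name              string
	DeletionTimestamp *time.Time // nil while the object is not being deleted
	Finalizers        []string
}

// newDeleting builds an object that is already marked for deletion.
func newDeleting(name string, finalizers ...string) object {
	now := time.Now()
	return object{Name: name, DeletionTimestamp: &now, Finalizers: finalizers}
}

// isStuck reports whether o is marked for deletion but still carries finalizer.
func isStuck(o object, finalizer string) bool {
	if o.DeletionTimestamp == nil {
		return false
	}
	for _, f := range o.Finalizers {
		if f == finalizer {
			return true
		}
	}
	return false
}

// withoutFinalizer returns a new slice with every occurrence of finalizer removed.
func withoutFinalizer(finalizers []string, finalizer string) []string {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	pools := []object{
		newDeleting("workers-0", rosaMachinePoolFinalizer),
		{Name: "workers-1", Finalizers: []string{rosaMachinePoolFinalizer}}, // not being deleted
	}
	for i := range pools {
		if isStuck(pools[i], rosaMachinePoolFinalizer) {
			pools[i].Finalizers = withoutFinalizer(pools[i].Finalizers, rosaMachinePoolFinalizer)
			fmt.Printf("removed finalizer from %s\n", pools[i].Name)
		}
	}
}
```

Requiring both the deletionTimestamp and the finalizer keeps the sketch from touching pools that are still in normal use.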

Root cause

When a NamespaceReservation is deleted, the operator issues a delete on the Cluster, which cascades to the MachinePool and ROSAMachinePool. Before running its own cleanup, the ROSA infrastructure controller first tries to reconcile the ROSAMachinePool (e.g. to drain nodes). If the OCM API rejects that update, the controller error-loops indefinitely and never removes its finalizer, which leaves the entire chain frozen.
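To illustrate why one stuck finalizer freezes everything above it, here is a toy model of the chain. It assumes a simplified rule (a resource disappears only once all of its children are gone and its own finalizer is not stuck) and uses hypothetical names (`resource`, `FinalizerStuck`, `gone`). It does not reproduce the actual controller logic.

```go
package main

import "fmt"

// resource is a simplified stand-in for one object in the deletion chain.
type resource struct {
	Kind           string
	FinalizerStuck bool // true if the owning controller can never remove its finalizer
	Children       []*resource
}

// gone reports whether r can finish deleting under the simplified rule:
// every child must be gone first, and r's own finalizer must not be stuck.
func gone(r *resource) bool {
	for _, c := range r.Children {
		if !gone(c) {
			return false
		}
	}
	return !r.FinalizerStuck
}

func main() {
	rosaPool := &resource{Kind: "ROSAMachinePool", FinalizerStuck: true}
	machinePool := &resource{Kind: "MachinePool", Children: []*resource{rosaPool}}
	cluster := &resource{Kind: "Cluster", Children: []*resource{machinePool}}

	fmt.Println("cluster deletable while pool is stuck:", gone(cluster))

	// Removing the stuck finalizer at the leaf unblocks the whole chain.
	rosaPool.FinalizerStuck = false
	fmt.Println("cluster deletable after finalizer removal:", gone(cluster))
}
```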

Test plan

  • Verify that go vet ./... and unit tests pass
  • Deploy to a test environment with a ROSA HCP cluster whose ROSAMachinePool has a stuck finalizer and confirm the namespace is released after the next reconcile

🤖 Generated with Claude Code

When a ROSA HCP cluster is deleted, the ROSAMachinePool controller can
get stuck in a reconciliation failure loop (e.g. OCM API returning 400
for 'aws_node_pool.root_volume.size' not allowed), preventing the
finalizer from being removed and blocking the entire Cluster deletion
chain: ROSAMachinePool → MachinePool → Cluster.

During namespace cleanup, after issuing the Cluster delete,
RemoveStuckROSAMachinePoolFinalizers now finds any ROSAMachinePool
with a deletionTimestamp and its controller finalizer still present and
patches the finalizer away, allowing the deletion chain to proceed.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
bsquizz force-pushed the fix/rosa-machinepool-stuck-finalizer branch from 3314b14 to cd08aa5 on April 29, 2026 13:48

Only remove the finalizer when both conditions match the specific failure
we troubleshot: the ROSAMachinePool's RosaMachinePoolReady condition
shows ReconciliationFailed with the exact OCM error message
"Attribute 'aws_node_pool.root_volume.size' is not allowed", AND the
owning Cluster's Deleting condition has reason WaitingForWorkersDeletion.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
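
The narrowed guard from the second commit could look something like the sketch below. It is an illustration under assumptions rather than the PR's implementation: `condition` is a simplified stand-in for the real condition types, `findCondition` and `shouldRemoveFinalizer` are hypothetical names, ReconciliationFailed is assumed to be the condition's reason, and the OCM error text is matched as a substring of the condition message.

```go
package main

import (
	"fmt"
	"strings"
)

// OCM error text quoted in the commit message.
const ocmRootVolumeError = "Attribute 'aws_node_pool.root_volume.size' is not allowed"

// condition is a simplified stand-in for a Kubernetes status condition.
type condition struct {
	Type    string
	Reason  string
	Message string
}

// findCondition returns the first condition with the given type, if any.
func findCondition(conds []condition, condType string) (condition, bool) {
	for _, c := range conds {
		if c.Type == condType {
			return c, true
		}
	}
	return condition{}, false
}

// shouldRemoveFinalizer is true only when both signals of the known failure
// are present: the pool reports the specific OCM rejection, and the owning
// Cluster is waiting for its workers to be deleted.
func shouldRemoveFinalizer(poolConds, clusterConds []condition) bool {
	ready, ok := findCondition(poolConds, "RosaMachinePoolReady")
	if !ok || ready.Reason != "ReconciliationFailed" {
		return false
	}
	// Assumption: the OCM text appears somewhere inside the condition message.
	if !strings.Contains(ready.Message, ocmRootVolumeError) {
		return false
	}
	deleting, ok := findCondition(clusterConds, "Deleting")
	return ok && deleting.Reason == "WaitingForWorkersDeletion"
}

func main() {
	pool := []condition{{
		Type:    "RosaMachinePoolReady",
		Reason:  "ReconciliationFailed",
		Message: "status 400: " + ocmRootVolumeError,
	}}
	cluster := []condition{{Type: "Deleting", Reason: "WaitingForWorkersDeletion"}}
	fmt.Println("remove finalizer:", shouldRemoveFinalizer(pool, cluster))
}
```

Gating on both conditions means any other failure mode leaves the finalizer in place, so a pool that is failing for an unrelated reason is not force-released.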