fix: remove stuck ROSAMachinePool finalizers during namespace cleanup #578
Open
When a ROSA HCP cluster is deleted, the ROSAMachinePool controller can get stuck in a reconciliation failure loop (e.g. the OCM API returning 400 because 'aws_node_pool.root_volume.size' is not allowed). The finalizer is then never removed, which blocks the entire Cluster deletion chain: ROSAMachinePool → MachinePool → Cluster.

During namespace cleanup, after issuing the Cluster delete, RemoveStuckROSAMachinePoolFinalizers now finds any ROSAMachinePool that has a deletionTimestamp and still carries its controller finalizer, and patches the finalizer away so the deletion chain can proceed.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
3314b14 to cd08aa5
Only remove the finalizer when both conditions match the specific failure we troubleshot: the ROSAMachinePool's RosaMachinePoolReady condition shows ReconciliationFailed with the exact OCM error message "Attribute 'aws_node_pool.root_volume.size' is not allowed", AND the owning Cluster's Deleting condition has reason WaitingForWorkersDeletion.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
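The two-part guard in this commit can be expressed as a small predicate. The sketch below is illustrative only: the `condition` struct and the `findCondition` and `shouldRemoveFinalizer` helpers are hypothetical, it assumes ReconciliationFailed is the condition's reason, and it assumes the OCM message is matched as a substring, since the PR text does not say whether the comparison is exact equality.

```go
package main

import (
	"fmt"
	"strings"
)

// condition is a simplified stand-in for a Cluster API status condition.
type condition struct {
	Type    string
	Reason  string
	Message string
}

// The OCM error text quoted in the commit message.
const ocmRootVolumeError = "Attribute 'aws_node_pool.root_volume.size' is not allowed"

// findCondition returns the first condition of the given type, if any.
func findCondition(conds []condition, condType string) (condition, bool) {
	for _, c := range conds {
		if c.Type == condType {
			return c, true
		}
	}
	return condition{}, false
}

// shouldRemoveFinalizer reports whether both guards hold: the pool failed to
// reconcile with the known OCM error, and the owning Cluster is waiting for
// its workers to be deleted.
func shouldRemoveFinalizer(poolConds, clusterConds []condition) bool {
	ready, ok := findCondition(poolConds, "RosaMachinePoolReady")
	if !ok || ready.Reason != "ReconciliationFailed" {
		return false
	}
	// Assumption: substring match on the error message.
	if !strings.Contains(ready.Message, ocmRootVolumeError) {
		return false
	}
	deleting, ok := findCondition(clusterConds, "Deleting")
	return ok && deleting.Reason == "WaitingForWorkersDeletion"
}

func main() {
	pool := []condition{{
		Type:    "RosaMachinePoolReady",
		Reason:  "ReconciliationFailed",
		Message: "status is 400: " + ocmRootVolumeError,
	}}
	cluster := []condition{{Type: "Deleting", Reason: "WaitingForWorkersDeletion"}}
	fmt.Println(shouldRemoveFinalizer(pool, cluster))
}
```

Requiring both conditions means a pool that is failing for some other reason, or a Cluster that is not yet waiting on its workers, keeps its finalizer.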
Summary
- The ROSAMachinePool controller can get permanently stuck in a reconcile loop when the OCM API returns a 400 error (e.g. "Attribute 'aws_node_pool.root_volume.size' is not allowed"), preventing its finalizer from being removed
- This blocks the deletion chain ROSAMachinePool → MachinePool → Cluster, leaving the namespace stuck in deleting state indefinitely
- Adds RemoveStuckROSAMachinePoolFinalizers to the CAPI cleanup helper, which detects ROSAMachinePool resources that have a deletionTimestamp and a stuck controller finalizer and patches the finalizer away on each requeue pass

Root cause
When a NamespaceReservation is deleted, the operator issues a delete on the Cluster, which cascades to MachinePool and ROSAMachinePool. The ROSA infrastructure controller then tries to reconcile the ROSAMachinePool (e.g. to drain nodes) before running its cleanup, but if the OCM API rejects the update, the controller error-loops indefinitely and never removes its finalizer, leaving the entire chain frozen.

Test plan
- go vet ./... and unit tests pass
- Reproduce a case where a ROSAMachinePool has a stuck finalizer and confirm the namespace is released after the next reconcile

🤖 Generated with Claude Code