Remove CAPI cleanup timeout to prevent credential secret deletion race #577
Merged
The 1-hour timeout in handleCAPICleanup force-released the ENO's finalizer, allowing the NamespaceReservation (and namespace) to be deleted while CAPI resources still had active finalizers. During namespace termination, rosa-creds-secret was deleted before the CAPA controller could use it to authenticate to OCM, permanently deadlocking the namespace in Terminating state.

Remove the timeout so the ENO waits indefinitely for CAPI resources to fully finalize. Set Status.State to "deleting" during the CAPI cleanup phase so stuck reservations are easily identifiable.

Metrics for detecting stuck namespaces are tracked in ENGPROD-9908.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
JuanmaBM approved these changes Apr 28, 2026
Summary
- Remove the 1-hour timeout in handleCAPICleanup that force-released the ENO's finalizer while CAPI resources still had active finalizers
- This let Kubernetes delete rosa-creds-secret before the CAPA controller could authenticate to OCM, permanently deadlocking namespaces in Terminating state
- Set Status.State to "deleting" during the CAPI cleanup phase so stuck reservations are easily identifiable via kubectl get namespacereservation

Root cause
When the timeout fired, the ENO released its capi-cleanup.cloud.redhat.com finalizer on the NamespaceReservation. This allowed the NR to be fully deleted, which cascade-deleted the namespace. During namespace termination, Kubernetes deleted rosa-creds-secret (no finalizer) while the CAPI resources (Cluster, ROSAControlPlane, ROSAMachinePool) still had active finalizers. The CAPA controller then could not authenticate to OCM to confirm cluster deletion, so the finalizers were never removed and the namespace was stuck forever.

Follow-up
ENGPROD-9908 tracks adding a Prometheus metric (eno_capi_cleanup_duration_seconds) to detect namespaces stuck in the CAPI cleanup phase for an extended period.

Test plan
- ... deleting state during CAPI cleanup

🤖 Generated with Claude Code