
Remove CAPI cleanup timeout to prevent credential secret deletion race#577

Merged
bsquizz merged 1 commit into main from remove-capi-cleanup-timeout
Apr 28, 2026

Contributor

@bsquizz bsquizz commented Apr 28, 2026

Summary

  • Removes the 1-hour timeout from handleCAPICleanup that force-released the ENO's finalizer while CAPI resources still had active finalizers
  • The timeout caused a race condition: namespace termination deleted rosa-creds-secret before the CAPA controller could authenticate to OCM, permanently deadlocking namespaces in Terminating state
  • The ENO now waits indefinitely for CAPI resources to fully finalize before releasing the namespace
  • Sets the reservation's Status.State to "deleting" during the CAPI cleanup phase so that stuck reservations are easy to identify via kubectl get namespacereservation
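
The behavior described in the bullets above can be sketched roughly as follows. This is a simplified illustration, not the actual handleCAPICleanup code: the Reservation type, cleanupStep, and removeString are hypothetical stand-ins, and the count of remaining CAPI resources is passed in rather than queried from the API server. Only the finalizer name and the "deleting" state come from this PR.

```go
package main

// Finalizer name taken from the PR description.
const capiCleanupFinalizer = "capi-cleanup.cloud.redhat.com"

// Reservation is a hypothetical, simplified stand-in for a NamespaceReservation.
type Reservation struct {
	Finalizers []string
	State      string
}

// cleanupStep models one reconcile pass over a reservation that is being deleted.
// remainingCAPI is the number of CAPI resources (Cluster, ROSAControlPlane,
// ROSAMachinePool) that still exist. It returns true when the caller should
// requeue and check again later. There is deliberately no deadline: the
// finalizer is held for as long as any CAPI resource remains.
func cleanupStep(r *Reservation, remainingCAPI int) (requeue bool) {
	if remainingCAPI > 0 {
		// Surface the phase so stuck reservations are visible in kubectl output.
		r.State = "deleting"
		return true
	}
	// All CAPI resources are gone, so it is now safe to let the namespace go.
	r.Finalizers = removeString(r.Finalizers, capiCleanupFinalizer)
	return false
}

// removeString returns a copy of in without any element equal to s.
func removeString(in []string, s string) []string {
	out := make([]string, 0, len(in))
	for _, v := range in {
		if v != s {
			out = append(out, v)
		}
	}
	return out
}
```

In this sketch, calling cleanupStep repeatedly while remainingCAPI is non-zero keeps the finalizer in place no matter how much time has passed. The trade-off is that a reservation can now wait indefinitely instead of deadlocking its namespace.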

Root cause

When the timeout fired, the ENO released its capi-cleanup.cloud.redhat.com finalizer on the NamespaceReservation. This allowed the NR to be fully deleted, which cascade-deleted the namespace. During namespace termination, Kubernetes deleted rosa-creds-secret (no finalizer) while the CAPI resources (Cluster, ROSAControlPlane, ROSAMachinePool) still had active finalizers. The CAPA controller then could not authenticate to OCM to confirm cluster deletion, so the finalizers were never removed and the namespace was stuck forever.
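
The ordering problem can be illustrated with a toy model. Everything here is hypothetical and greatly simplified (the namespace type, terminate, reconcileCAPA, and teardown are not real ENO or CAPA code); it only encodes the two facts stated above: namespace termination removes the finalizer-less secret immediately, and the CAPA controller needs that secret to clear the CAPI finalizers.

```go
package main

// namespace is a toy model of the state that matters for the race.
type namespace struct {
	hasCredsSecret bool // stands in for rosa-creds-secret
	capiFinalizers int  // CAPI resources that still carry finalizers
}

// terminate models namespace termination: objects without finalizers,
// such as the credentials secret, are deleted right away.
func (n *namespace) terminate() {
	n.hasCredsSecret = false
}

// reconcileCAPA models the CAPA controller confirming cluster deletion with
// OCM. It can only clear finalizers while the credentials secret exists.
// It reports whether all CAPI finalizers are gone afterwards.
func (n *namespace) reconcileCAPA() bool {
	if n.capiFinalizers == 0 {
		return true
	}
	if !n.hasCredsSecret {
		return false // cannot authenticate, so the finalizers stay forever
	}
	n.capiFinalizers = 0
	return true
}

// teardown reports whether the namespace ends up stuck.
// releaseEarly=true models the old timeout firing before CAPI cleanup is done;
// releaseEarly=false models waiting for CAPI cleanup before releasing.
func teardown(releaseEarly bool) (stuck bool) {
	ns := &namespace{hasCredsSecret: true, capiFinalizers: 3}
	if releaseEarly {
		ns.terminate()
		return !ns.reconcileCAPA()
	}
	finalized := ns.reconcileCAPA()
	ns.terminate()
	return !finalized
}
```

In this model, teardown(true) leaves the namespace stuck and teardown(false) does not, which mirrors why the fix is to hold the ENO finalizer rather than to retry after the secret is gone.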

Follow-up

ENGPROD-9908 tracks adding a Prometheus metric (eno_capi_cleanup_duration_seconds) to detect namespaces stuck in the CAPI cleanup phase for an extended period.
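
As a rough idea of what such a metric might report, the sketch below derives a seconds value and a "stuck" flag from the time a reservation has spent in CAPI cleanup. This is an assumption about the follow-up work, not something defined in ENGPROD-9908 or in this PR: defaultStuckThreshold, cleanupSeconds, and isStuck are hypothetical names, and the two-hour threshold is an arbitrary placeholder. A real implementation would presumably export the value through a Prometheus client library rather than return it.

```go
package main

import "time"

// defaultStuckThreshold is a hypothetical alerting threshold, chosen only
// for illustration.
const defaultStuckThreshold = 2 * time.Hour

// cleanupSeconds converts the time spent in CAPI cleanup into the value a
// gauge such as eno_capi_cleanup_duration_seconds might report.
// Negative durations (for example from clock skew) are clamped to zero.
func cleanupSeconds(elapsed time.Duration) float64 {
	if elapsed < 0 {
		return 0
	}
	return elapsed.Seconds()
}

// isStuck reports whether cleanup has run longer than the given threshold.
func isStuck(elapsed, threshold time.Duration) bool {
	return elapsed > threshold
}
```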

Test plan

  • All 57 existing unit tests pass
  • Deploy to staging and verify that ROSA namespace teardown completes without deadlock
  • Verify that reservations show deleting state during CAPI cleanup
  • Verify that the ENO correctly holds the finalizer until CAPI resources are fully gone

🤖 Generated with Claude Code

@bsquizz bsquizz force-pushed the remove-capi-cleanup-timeout branch from d18423a to b83dd79 Compare April 28, 2026 15:58
The 1-hour timeout in handleCAPICleanup force-released the ENO's
finalizer, allowing the NamespaceReservation (and namespace) to be
deleted while CAPI resources still had active finalizers. During
namespace termination, rosa-creds-secret was deleted before the CAPA
controller could use it to authenticate to OCM, permanently
deadlocking the namespace in Terminating state.

Remove the timeout so the ENO waits indefinitely for CAPI resources to
fully finalize. Set Status.State to "deleting" during the CAPI cleanup
phase so stuck reservations are easily identifiable. Metrics for
detecting stuck namespaces are tracked in ENGPROD-9908.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@bsquizz bsquizz force-pushed the remove-capi-cleanup-timeout branch from b83dd79 to 790a062 Compare April 28, 2026 15:58
@bsquizz bsquizz merged commit 63e9597 into main Apr 28, 2026
5 of 6 checks passed
