fix: re-fetch namespace on conflict retry in UpdateAnnotations #568
Merged
…n UpdateAnnotations

Namespaces in no-ClowdEnvironment pools (e.g. rosa) were stuck in 'creating' forever due to two bugs in UpdateAnnotations:

1. GetNamespace was called once before the retry loop, so every retry attempt used the same stale resource version. When OpenShift controllers (Tekton, OLM, etc.) updated the namespace between the operator's initial write and the ready-annotation write, a conflict error was returned and all retries failed identically.
2. The function always returned nil regardless of whether RetryOnConflict succeeded or failed, silently swallowing the error. UpdateNamespaceResources then logged "successfully created" and returned nil, leaving the namespace stuck with no recovery path.

Fix: move GetNamespace inside the retry callback so each attempt gets a fresh resource version, and return the result of RetryOnConflict so callers can handle failures correctly.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
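The two bugs and the fix described in the commit message can be illustrated with a small, standard-library-only simulation. This is a sketch under stated assumptions, not the operator's actual code: `fakeStore`, `namespace`, `errConflict`, `retryOnConflict`, `updateAnnotationsBuggy`, and `updateAnnotationsFixed` are hypothetical names, and `retryOnConflict` is a hand-rolled stand-in for the retry helper the operator uses (no backoff, fixed attempt count). The store bumps the resource version right after a read to mimic another controller writing to the namespace.

```go
package main

import "errors"

// errConflict is a hypothetical stand-in for a Kubernetes 409 Conflict error.
var errConflict = errors.New("conflict: resource version is stale")

// namespace is a minimal, hypothetical stand-in for a Namespace object.
type namespace struct {
	resourceVersion int
	annotations     map[string]string
}

// fakeStore simulates optimistic concurrency. racingWrites is the number of
// competing writes that land immediately after one of our reads.
type fakeStore struct {
	current      namespace
	racingWrites int
}

func newFakeStore(racingWrites int) *fakeStore {
	return &fakeStore{current: namespace{resourceVersion: 1}, racingWrites: racingWrites}
}

// get returns a copy of the stored object, then lets one racing write land.
func (s *fakeStore) get() namespace {
	snapshot := namespace{
		resourceVersion: s.current.resourceVersion,
		annotations:     map[string]string{},
	}
	for k, v := range s.current.annotations {
		snapshot.annotations[k] = v
	}
	if s.racingWrites > 0 {
		s.racingWrites--
		s.current.resourceVersion++ // another controller updated the object
	}
	return snapshot
}

// update rejects writes that carry a stale resource version.
func (s *fakeStore) update(ns namespace) error {
	if ns.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	ns.resourceVersion++
	s.current = ns
	return nil
}

// retryOnConflict retries fn only while it returns errConflict.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = fn()
		if !errors.Is(err, errConflict) {
			return err // nil on success, or a non-conflict error
		}
	}
	return err
}

// updateAnnotationsBuggy reproduces both bugs: a single fetch outside the
// retry loop, and a discarded retry result.
func updateAnnotationsBuggy(s *fakeStore, annotations map[string]string) error {
	ns := s.get() // fetched once, so every retry reuses this version
	for k, v := range annotations {
		ns.annotations[k] = v
	}
	_ = retryOnConflict(5, func() error { return s.update(ns) })
	return nil // failure is swallowed
}

// updateAnnotationsFixed re-fetches inside the callback and returns the result.
func updateAnnotationsFixed(s *fakeStore, annotations map[string]string) error {
	return retryOnConflict(5, func() error {
		ns := s.get() // fresh resource version on every attempt
		for k, v := range annotations {
			ns.annotations[k] = v
		}
		return s.update(ns)
	})
}
```

With one racing write, the buggy variant returns nil and leaves the annotation unset, while the fixed variant succeeds on its second attempt. If racing writes outlast the attempt budget, the fixed variant returns the conflict error to the caller instead of hiding it.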
Summary
- UpdateAnnotations fetched the namespace object once before the RetryOnConflict loop, so every retry attempt used the same stale resource version, and all retries failed with the same conflict error.
- The function always returned nil, silently swallowing the failure and causing UpdateNamespaceResources to log "successfully created" even when the ready annotation was never applied.

Impact
Namespaces in pools with no ClowdEnvironment (e.g. the rosa pool) get stuck in the 'creating' state permanently. Because the pool counts stuck 'creating' namespaces toward its size quota, it never provisions replacements either, so the pool stays perpetually empty.

Pools with a ClowdEnvironment were not affected: the ClowdenvironmentReconciler requeues the entire reconciliation on error, re-fetching fresh state each time, so conflicts were retried correctly at that level.

Root cause (confirmed from operator logs)
At namespace creation time, OpenShift controllers (Tekton, OLM, workload-monitoring, pod-security admission) all write to the new namespace within milliseconds. By the time the operator calls UpdateAnnotations to set env-status: ready, the resource version is already stale → conflict → silent failure.

Fix
Move GetNamespace inside the retry callback so each attempt works with a fresh resource version, and return the result of RetryOnConflict so callers receive real errors.

Test plan
- make fmt vet: clean
- make test: 93/93 tests pass (53 helper + 40 controller)
- ephemeral-askzvl in rosa pool confirmed stuck due to this exact bug

🤖 Generated with Claude Code