fix: re-fetch namespace on conflict retry in UpdateAnnotations #568

Merged
bsquizz merged 1 commit into main from fix/update-annotations-conflict-retry
Apr 21, 2026

@bsquizz bsquizz commented Apr 21, 2026

Summary

  • UpdateAnnotations fetched the namespace object once before the RetryOnConflict loop, so every retry attempt used the same stale resource version — all retries would fail with the same conflict error
  • The function also unconditionally returned nil, silently swallowing the failure and causing UpdateNamespaceResources to log "successfully created" even when the ready annotation was never applied

Impact

Namespaces in pools with no ClowdEnvironment (e.g. the rosa pool) get stuck in the creating state permanently. Because the pool counts namespaces that are stuck in creating toward its size quota, it never provisions replacements either, so the pool never has a ready namespace to hand out.
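The quota effect can be illustrated with a small sketch. The pool's real sizing logic is not part of this PR, so the `pool` type and `toProvision` function below are hypothetical; they only assume that the number of namespaces to create is the pool size minus the ready and creating counts.

```go
package main

import "fmt"

// pool is a hypothetical summary of a namespace pool's state.
type pool struct {
	Size     int // desired number of namespaces in the pool
	Ready    int // namespaces with the ready annotation applied
	Creating int // namespaces still in creating, including stuck ones
}

// toProvision assumes creating namespaces count toward the size quota,
// as described in the Impact section. This is not the operator's code.
func toProvision(p pool) int {
	n := p.Size - (p.Ready + p.Creating)
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	// Every slot is held by a namespace that never leaves creating,
	// so nothing new is provisioned and nothing ever becomes ready.
	stuck := pool{Size: 3, Ready: 0, Creating: 3}
	fmt.Println("namespaces to provision:", toProvision(stuck))
}
```

Under that assumption, a pool whose slots are all held by stuck namespaces asks for zero new ones, which matches the observed behaviour.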

Pools with a ClowdEnvironment were not affected: the ClowdenvironmentReconciler requeues the entire reconciliation on error, re-fetching fresh state each time, so conflicts were retried correctly at that level.

Root cause (confirmed from operator logs)

At namespace creation time, OpenShift controllers (Tekton, OLM, workload-monitoring, pod-security admission) all write to the new namespace within milliseconds. By the time the operator calls UpdateAnnotations to set env-status: ready, the resource version it holds is already stale, so the update is rejected with a conflict and the failure is silently swallowed.
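The failure mode can be reproduced without a cluster. The sketch below is not the operator's code: `namespace`, `store`, and `staleAttempts` are hypothetical stand-ins that mimic the API server's optimistic concurrency check, where a write is rejected if its resource version no longer matches the stored one.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's conflict response.
var errConflict = errors.New("conflict: resource version is stale")

// namespace is a pared-down, hypothetical stand-in for a Kubernetes Namespace.
type namespace struct {
	ResourceVersion int
	Annotations     map[string]string
}

// store mimics the API server's optimistic concurrency check.
type store struct{ current namespace }

// get returns a copy of the stored object, as a client read would.
func (s *store) get() namespace {
	c := namespace{ResourceVersion: s.current.ResourceVersion, Annotations: map[string]string{}}
	for k, v := range s.current.Annotations {
		c.Annotations[k] = v
	}
	return c
}

// update rejects writes whose resource version does not match the stored one.
func (s *store) update(ns namespace) error {
	if ns.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	ns.ResourceVersion++
	s.current = ns
	return nil
}

// staleAttempts reads once, lets another writer in, then retries the
// same stale copy and counts how many attempts end in a conflict.
func staleAttempts(s *store, retries int) (failures int) {
	ns := s.get() // fetched once, before any retry

	// Another controller writes first and bumps the resource version.
	other := s.get()
	other.Annotations["other-controller"] = "touched"
	_ = s.update(other)

	ns.Annotations["env-status"] = "ready"
	for i := 0; i < retries; i++ {
		if err := s.update(ns); errors.Is(err, errConflict) {
			failures++
		}
	}
	return failures
}

func main() {
	s := &store{current: namespace{ResourceVersion: 1, Annotations: map[string]string{}}}
	fmt.Println("failed attempts:", staleAttempts(s, 5))
}
```

Because the copy held by the caller never changes, retrying cannot help: every attempt carries the same outdated resource version.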

Fix

Move GetNamespace inside the retry callback so each attempt works with a fresh resource version, and return the result of RetryOnConflict so callers receive real errors.
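A minimal sketch of the corrected shape, under stated assumptions: `fakeAPI`, `retryOnConflict`, and `updateAnnotations` are hypothetical names, and `retryOnConflict` is a simplified stand-in (no backoff) for `retry.RetryOnConflict` from `k8s.io/client-go/util/retry`. The operator's actual diff is not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: resource version is stale")

type namespace struct {
	ResourceVersion int
	Annotations     map[string]string
}

// fakeAPI is a hypothetical in-memory API server. racingWrites simulates
// other controllers bumping the version between our read and our write.
type fakeAPI struct {
	current      namespace
	racingWrites int
}

func (a *fakeAPI) get() namespace {
	c := namespace{ResourceVersion: a.current.ResourceVersion, Annotations: map[string]string{}}
	for k, v := range a.current.Annotations {
		c.Annotations[k] = v
	}
	return c
}

func (a *fakeAPI) update(ns namespace) error {
	if a.racingWrites > 0 {
		// A competing controller wins the race and bumps the version first.
		a.racingWrites--
		a.current.ResourceVersion++
	}
	if ns.ResourceVersion != a.current.ResourceVersion {
		return errConflict
	}
	ns.ResourceVersion++
	a.current = ns
	return nil
}

// retryOnConflict re-runs fn while it returns a conflict, up to steps times,
// and returns the last error so callers see a real failure.
func retryOnConflict(steps int, fn func() error) error {
	var err error
	for i := 0; i < steps; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// updateAnnotations re-reads inside the callback and returns the retry result.
func updateAnnotations(a *fakeAPI, annotations map[string]string) error {
	return retryOnConflict(5, func() error {
		ns := a.get() // fresh resource version on every attempt
		for k, v := range annotations {
			ns.Annotations[k] = v
		}
		return a.update(ns)
	})
}

func main() {
	a := &fakeAPI{
		current:      namespace{ResourceVersion: 1, Annotations: map[string]string{}},
		racingWrites: 2,
	}
	err := updateAnnotations(a, map[string]string{"env-status": "ready"})
	fmt.Println("error:", err, "env-status:", a.current.Annotations["env-status"])
}
```

With two racing writes the third attempt succeeds, because each attempt starts from the current resource version. If the races outlast the retry budget, the conflict error is returned to the caller instead of being replaced by nil.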

Test plan

  • make fmt vet — clean
  • make test — 93/93 tests pass (53 helper + 40 controller)
  • Verified against live cluster: ephemeral-askzvl in rosa pool confirmed stuck due to this exact bug

🤖 Generated with Claude Code

…n UpdateAnnotations

Namespaces in no-ClowdEnvironment pools (e.g. rosa) were stuck in
'creating' forever due to two bugs in UpdateAnnotations:

1. GetNamespace was called once before the retry loop, so every retry
   attempt used the same stale resource version. When OpenShift
   controllers (Tekton, OLM, etc.) updated the namespace between the
   operator's initial write and the ready-annotation write, a conflict
   error was returned and all retries failed identically.

2. The function always returned nil regardless of whether
   RetryOnConflict succeeded or failed, silently swallowing the error.
   UpdateNamespaceResources then logged "successfully created" and
   returned nil, leaving the namespace stuck with no recovery path.

Fix: move GetNamespace inside the retry callback so each attempt gets a
fresh resource version, and return the result of RetryOnConflict so
callers can handle failures correctly.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@bsquizz bsquizz merged commit c37d1d8 into main Apr 21, 2026
5 of 6 checks passed
@bsquizz bsquizz deleted the fix/update-annotations-conflict-retry branch April 21, 2026 19:07