fix: re-fetch namespace on conflict retry in UpdateAnnotations by bsquizz · Pull Request #568 · RedHatInsights/ephemeral-namespace-operator

bsquizz · 2026-04-21T18:47:37Z

Summary

UpdateAnnotations fetched the namespace object once, before the RetryOnConflict loop, so every retry attempt submitted the same stale resource version and failed with the same conflict error.
The function also unconditionally returned nil, silently swallowing the failure and causing UpdateNamespaceResources to log "successfully created" even when the ready annotation was never applied.
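
The two problems above can be sketched in isolation. This is a minimal, stdlib-only illustration and not the operator's code: retryOnConflict is a simplified stand-in for client-go's retry.RetryOnConflict (which also takes a backoff argument), and errConflict, updateWithStaleVersion, buggyUpdate and propagatingUpdate are hypothetical names introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the conflict error the API server returns
// when an update carries an outdated resource version.
var errConflict = errors.New("conflict: resource version is stale")

// retryOnConflict is a simplified stand-in for retry.RetryOnConflict:
// it re-runs fn while fn reports a conflict, up to `attempts` times.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// updateWithStaleVersion returns a callback that submits the same
// resource version on every call, which is what happens when the
// object is fetched once, outside the retry loop.
func updateWithStaleVersion(staleVersion, serverVersion int) func() error {
	return func() error {
		if staleVersion != serverVersion {
			return errConflict
		}
		return nil
	}
}

// buggyUpdate mirrors the described behaviour: the retry result is
// discarded and nil is returned unconditionally.
func buggyUpdate(staleVersion, serverVersion int) error {
	_ = retryOnConflict(5, updateWithStaleVersion(staleVersion, serverVersion))
	return nil
}

// propagatingUpdate returns the retry result so the caller sees the failure.
func propagatingUpdate(staleVersion, serverVersion int) error {
	return retryOnConflict(5, updateWithStaleVersion(staleVersion, serverVersion))
}

func main() {
	fmt.Println("buggy:", buggyUpdate(1, 4))
	fmt.Println("propagating:", propagatingUpdate(1, 4))
}
```

With a stale version, every one of the five attempts hits the same conflict; only the second variant lets the caller find out.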

Impact

Namespaces in pools with no ClowdEnvironment (e.g. the rosa pool) get stuck in creating state permanently. Because the pool counts stuck creating namespaces toward its size quota, it never provisions replacements either — the pool stays perpetually empty.

Pools with a ClowdEnvironment were not affected: the ClowdenvironmentReconciler requeues the entire reconciliation on error, re-fetching fresh state each time, so conflicts were retried correctly at that level.
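
The quota effect described above can be illustrated with a little arithmetic. This is only a sketch under the assumption that the pool computes its shortfall as size minus (ready + creating); poolState and namespacesToProvision are hypothetical names, not the operator's actual types or functions.

```go
package main

import "fmt"

// poolState is a hypothetical summary of one pool's namespaces.
type poolState struct {
	size     int // desired number of namespaces in the pool
	ready    int // namespaces whose env-status is ready
	creating int // namespaces still in the creating state
}

// namespacesToProvision counts creating namespaces toward the quota,
// as the description above says the pool does (assumed formula).
func namespacesToProvision(p poolState) int {
	needed := p.size - (p.ready + p.creating)
	if needed < 0 {
		return 0
	}
	return needed
}

func main() {
	// Every slot is occupied by a namespace stuck in creating.
	stuck := poolState{size: 3, ready: 0, creating: 3}
	fmt.Println("to provision:", namespacesToProvision(stuck))
}
```

Under that assumption, a pool whose slots are all held by stuck namespaces computes a shortfall of zero, so nothing new is created even though nothing is ready.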

Root cause (confirmed from operator logs)

At namespace creation time, OpenShift controllers (Tekton, OLM, workload-monitoring, pod-security admission) all write to the new namespace within milliseconds. By the time the operator calls UpdateAnnotations to set env-status: ready, the resource version is already stale → conflict → silent failure.
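
The race described above is ordinary optimistic concurrency. The following toy model is a sketch only: apiServer, namespace and simulateCreationRace are hypothetical stand-ins rather than Kubernetes types, and the example/ annotation keys written by the other controllers are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: the object has been modified")

// namespace is a pared-down, hypothetical stand-in for a Namespace object.
type namespace struct {
	resourceVersion int
	annotations     map[string]string
}

// clone returns a deep copy so callers never share the stored map.
func (n namespace) clone() namespace {
	c := namespace{resourceVersion: n.resourceVersion, annotations: map[string]string{}}
	for k, v := range n.annotations {
		c.annotations[k] = v
	}
	return c
}

// apiServer is a toy in-memory store enforcing optimistic concurrency.
type apiServer struct{ stored namespace }

func (s *apiServer) get() namespace { return s.stored.clone() }

// update rejects writes whose resource version is not the current one.
func (s *apiServer) update(ns namespace) error {
	if ns.resourceVersion != s.stored.resourceVersion {
		return errConflict
	}
	next := ns.clone()
	next.resourceVersion++
	s.stored = next
	return nil
}

// simulateCreationRace lets four other writers update the namespace
// between the operator's read and its write of env-status: ready.
func simulateCreationRace() (int, error) {
	server := &apiServer{stored: namespace{resourceVersion: 1, annotations: map[string]string{}}}
	operatorCopy := server.get() // the operator reads first
	for _, writer := range []string{"tekton", "olm", "workload-monitoring", "pod-security"} {
		ns := server.get()
		ns.annotations["example/"+writer] = "touched"
		if err := server.update(ns); err != nil {
			return server.stored.resourceVersion, err
		}
	}
	operatorCopy.annotations["env-status"] = "ready"
	err := server.update(operatorCopy) // still carries version 1
	return server.stored.resourceVersion, err
}

func main() {
	version, err := simulateCreationRace()
	fmt.Println("server version:", version, "operator update:", err)
}
```

Each of the other writers reads immediately before writing and succeeds; the operator's copy was read earlier, so its update is rejected.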

Fix

Move GetNamespace inside the retry callback so each attempt works with a fresh resource version, and return the result of RetryOnConflict so callers receive real errors.
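
The shape of the fix can be sketched as follows. This is an illustration under stated assumptions, not the merged diff: store, retryOnConflict and this updateAnnotations are hypothetical stand-ins, the retry helper omits the backoff that client-go's retry.RetryOnConflict takes, and pendingForeignWrites simulates other controllers writing between the operator's read and its write.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: the object has been modified")

// store is a toy stand-in for the API server holding one namespace.
type store struct {
	version              int
	annotations          map[string]string
	pendingForeignWrites int // foreign writes that will land before our next update
}

// get returns the current version and a copy of the annotations.
func (s *store) get() (int, map[string]string) {
	copied := map[string]string{}
	for k, v := range s.annotations {
		copied[k] = v
	}
	return s.version, copied
}

// update applies the write only if the caller's version is current.
func (s *store) update(version int, annotations map[string]string) error {
	if s.pendingForeignWrites > 0 {
		s.pendingForeignWrites--
		s.version++ // another controller got in first
	}
	if version != s.version {
		return errConflict
	}
	s.version++
	s.annotations = annotations
	return nil
}

// retryOnConflict re-runs fn while it reports a conflict.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// updateAnnotations re-reads the object inside the callback and
// returns the retry result instead of nil.
func updateAnnotations(s *store, extra map[string]string) error {
	return retryOnConflict(5, func() error {
		version, annotations := s.get() // fresh on every attempt
		for k, v := range extra {
			annotations[k] = v
		}
		return s.update(version, annotations)
	})
}

func main() {
	s := &store{version: 1, annotations: map[string]string{}, pendingForeignWrites: 2}
	err := updateAnnotations(s, map[string]string{"env-status": "ready"})
	fmt.Println("err:", err, "env-status:", s.annotations["env-status"])
}
```

In this model the first two attempts lose the race and the third succeeds because it reads the version the other writers left behind. If conflicts outlast the retry budget, the error now reaches the caller instead of being dropped.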

Test plan

make fmt vet — clean
make test — 93/93 tests pass (53 helper + 40 controller)
Verified against live cluster: ephemeral-askzvl in rosa pool confirmed stuck due to this exact bug

🤖 Generated with Claude Code

…n UpdateAnnotations

Namespaces in no-ClowdEnvironment pools (e.g. rosa) were stuck in 'creating' forever due to two bugs in UpdateAnnotations:

1. GetNamespace was called once before the retry loop, so every retry attempt used the same stale resource version. When OpenShift controllers (Tekton, OLM, etc.) updated the namespace between the operator's initial write and the ready-annotation write, a conflict error was returned and all retries failed identically.

2. The function always returned nil regardless of whether RetryOnConflict succeeded or failed, silently swallowing the error. UpdateNamespaceResources then logged "successfully created" and returned nil, leaving the namespace stuck with no recovery path.

Fix: move GetNamespace inside the retry callback so each attempt gets a fresh resource version, and return the result of RetryOnConflict so callers can handle failures correctly.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

bsquizz merged commit c37d1d8 into main Apr 21, 2026
5 of 6 checks passed

bsquizz deleted the fix/update-annotations-conflict-retry branch April 21, 2026 19:07

bsquizz mentioned this pull request Apr 21, 2026

Fix/update annotations conflict retry #570

Merged
