fix(controller): requeue while source cluster is still bootstrapping by IvanHunters · Pull Request #17 · cozystack/kilo-clustermesh-operator

IvanHunters · 2026-05-28T08:14:57Z

Summary

Source-side companion to #15. The reconciler watches only ClusterMesh CRs, not the Nodes of remote clusters. If the first reconcile runs while a source cluster is still bootstrapping (apiserver answers but kubelet hasn't joined yet, or kubeconfig secret hasn't landed in the registry yet), reconcile completes without error and produces no useful effect: ensureNodeEndpoints sees zero nodes, pushPeersToTargets emits an empty desired set. controller-runtime then has nothing to requeue on, and the next attempt waits for an external event on the CR — which can take many minutes.

Observed in the wild: a freshly recreated tenant's mesh-up was delayed 17 minutes because mesh1's VM came up ~3 minutes after the operator's first reconcile, and the next reconcile didn't fire until the Cozystack Package controller re-applied the CR for unrelated reasons.

Fix

Track an incomplete flag in reconcileAllClusters, set when either:

r.Registry.Client(srcEntry.Name) returned !ok (the kubeconfig secret hasn't been merged into the registry yet), or
listNodes(srcClient) returned an empty slice (apiserver up, no nodes joined).

At the Reconcile boundary, when err == nil && incomplete, return ctrl.Result{RequeueAfter: 30 * time.Second}. The error path is unchanged: when Reconcile returns a non-nil error, controller-runtime ignores the Result and requeues with its own rate-limited exponential backoff, so setting RequeueAfter alongside an error would have no effect.
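The decision at the Reconcile boundary can be sketched as follows. This is a minimal illustration and not the code from the PR: Result is a local stand-in for ctrl.Result so the example builds with the standard library alone, and resultFor and errListNodes are hypothetical names introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// bootstrapRequeueAfter mirrors the 30-second constant named in the PR description.
const bootstrapRequeueAfter = 30 * time.Second

// Result is a local stand-in for controller-runtime's ctrl.Result
// (assumption: only the RequeueAfter field matters for this sketch).
type Result struct {
	RequeueAfter time.Duration
}

// errListNodes is a hypothetical error used only to exercise the error path.
var errListNodes = errors.New("list nodes: connection refused")

// resultFor is a hypothetical helper that maps the outcome of one reconcile
// pass onto the value returned from Reconcile.
func resultFor(incomplete bool, err error) (Result, error) {
	if err != nil {
		// Error path: return the error alone and leave backoff to the framework.
		return Result{}, err
	}
	if incomplete {
		// At least one source cluster is still bootstrapping: arm the timer.
		return Result{RequeueAfter: bootstrapRequeueAfter}, nil
	}
	// Steady state: no timer, wait for the next watch event on the CR.
	return Result{}, nil
}

func main() {
	res, err := resultFor(true, nil)
	fmt.Println(res.RequeueAfter, err)

	res, err = resultFor(true, errListNodes)
	fmt.Println(res.RequeueAfter, err)

	res, err = resultFor(false, nil)
	fmt.Println(res.RequeueAfter, err)
}
```

Keeping the mapping in one small function makes the three cases (error, incomplete, steady state) easy to cover with a table-driven unit test.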

The constant bootstrapRequeueAfter = 30 * time.Second is a deliberate trade-off:

short enough that freshly bootstrapped tenants converge in tens of seconds rather than tens of minutes,
long enough that a quiescent fleet (steady-state) doesn't get spurious reconcile churn — once all sources have nodes, incomplete is false and the timer is not armed.

This is the source-side dual of #15 (target-side NoMatchError on the Peer CRD via mapper Reset). Together they close both halves of the freshly-bootstrapped-tenant race.

Test plan

go build ./... clean.
go vet ./... clean.
go test ./... (excluding integration) passes.
golangci-lint run zero issues.
Verify on a Cozystack hosting cluster against a recreated KubernetesSwitchcloud tenant: mesh-up should converge within ~1 minute of the tenant's first node becoming Ready (versus 17 minutes observed without this patch).

The reconciler watches only ClusterMesh CRs, not the Nodes of remote clusters. If the first reconcile runs against a source cluster whose apiserver answers but whose kubelet has not joined yet (or whose kubeconfig secret is not in the registry at all), reconcile completes without error: ensureNodeEndpoints sees zero nodes, pushPeersToTargets emits an empty desired peer set, and no useful state changes. controller-runtime then has nothing to requeue on, and the next attempt waits for an external event on the CR, which can take many minutes (in the wild we saw a 17-minute mesh-up gap on a freshly bootstrapped tenant whose VM landed a few minutes after the operator's first reconcile).

Flag the run as "incomplete" when any source is missing from the registry or returned zero nodes, and translate that into a RequeueAfter at the Reconcile boundary. The interval is a short constant (30s) so freshly bootstrapped tenants converge promptly without waiting on a stray watch event. Errors are unaffected: controller-runtime keeps its own exponential backoff for those.

This is the source-side dual of the PR #15 fix (which addressed the target-side NoMatchError on the Peer CRD via mapper Reset). Together they close the two halves of the freshly-bootstrapped-tenant race.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

coderabbitai · 2026-05-28T08:15:09Z

Warning

Review limit reached

📥 Commits

Reviewing files that changed from the base of the PR and between 2f76730 and d2f1a56.

📒 Files selected for processing (1)

internal/controller/clustermesh_controller.go

gemini-code-assist

Code Review

This pull request introduces a periodic requeue mechanism (every 30 seconds) in the ClusterMesh controller when at least one source cluster is still bootstrapping (missing from the registry or has no nodes yet). This prevents the operator from stalling when remote clusters are not yet ready. The feedback suggests two key improvements: first, updating the mesh status to reflect a bootstrapping state rather than unconditionally marking it as ready when reconciliation is incomplete; second, optimizing the reconciliation loop by skipping further steps and continuing early when a cluster has zero nodes.

gemini-code-assist · 2026-05-28T08:16:53Z

 	r.cleanupOrphanMeshPeers(ctx, log, mesh.Namespace)

-	return r.updateStatus(ctx, mesh, clusterStatuses)
+	return incomplete, r.updateStatus(ctx, mesh, clusterStatuses)


When incomplete is true, the mesh is still bootstrapping (e.g., some clusters have no nodes or are missing from the registry). However, updateStatus unconditionally sets the Ready condition to True (Reconciled). This can mislead users or external controllers into thinking the mesh is fully established when it is not.

Consider updating updateStatus to accept the incomplete boolean and set the Ready condition to False (with a reason like Bootstrapping) when incomplete is true.
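One way to read this suggestion is sketched below. It is an assumption-laden illustration, not the operator's code: Condition is a local stand-in for the Kubernetes metav1.Condition type, readyCondition is a hypothetical helper, and the message strings are invented. Only the Ready type, the Reconciled reason, and the Bootstrapping reason come from the comment above.

```go
package main

import "fmt"

// Condition is a local stand-in for metav1.Condition, reduced to the
// fields this sketch needs (assumption, not the operator's actual type).
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// readyCondition is a hypothetical helper that derives the Ready condition
// from the incomplete flag instead of always reporting True.
func readyCondition(incomplete bool) Condition {
	if incomplete {
		return Condition{
			Type:    "Ready",
			Status:  "False",
			Reason:  "Bootstrapping",
			Message: "at least one source cluster has no nodes or is missing from the registry",
		}
	}
	return Condition{
		Type:    "Ready",
		Status:  "True",
		Reason:  "Reconciled",
		Message: "all source clusters reconciled",
	}
}

func main() {
	for _, incomplete := range []bool{true, false} {
		c := readyCondition(incomplete)
		fmt.Printf("incomplete=%t -> %s=%s (%s)\n", incomplete, c.Type, c.Status, c.Reason)
	}
}
```

With this shape, updateStatus would only need the extra boolean parameter, and external controllers watching the Ready condition would not see True until every source cluster has contributed nodes.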

gemini-code-assist · 2026-05-28T08:16:53Z

+		if len(nodes) == 0 {
+			log.Info("source cluster has no nodes yet; will requeue",
+				slog.String("cluster", srcEntry.Name),
+				slog.Duration("after", bootstrapRequeueAfter),
+			)
+
+			incomplete = true
 		}


If a source cluster has 0 nodes, we can skip the rest of the reconciliation steps for this cluster (such as filtering nodes and pushing peers to other target clusters) and continue early. This avoids performing unnecessary List and reconciliation API calls to all other clusters in the mesh, which is especially important since we are requeuing every 30 seconds during bootstrap.

Suggested change

 		if len(nodes) == 0 {
 			log.Info("source cluster has no nodes yet; will requeue",
 				slog.String("cluster", srcEntry.Name),
 				slog.Duration("after", bootstrapRequeueAfter),
 			)

 			incomplete = true
+			statuses = append(statuses, v1alpha1.ClusterStatus{Name: srcEntry.Name})
+			continue
 		}

…covery is plausible (#18)

* fix(controller): also requeue when source nodes are awaiting kilo daemon annotations

PR #17 added a periodic requeue when the source cluster's listNodes returned an empty slice, on the theory that bootstrap-in-progress means "kubelet has not joined yet". That theory missed the next stage of the same race: kubelet has joined and the node is Ready, but the kilo daemon has not yet written kilo.squat.ai/wireguard-ip (or public-key, or endpoint) onto the node. validateNode rejects the node on NodeNoWireguardIP, filterNodes drops it, pushPeersToTargets emits an empty desired set, reconcile returns nil with len(nodes)>0, and the incomplete flag stays false, so the requeue timer never arms. The operator does not watch Nodes in remote clusters, so the annotation landing produces no event, and the tenant's peer is never pushed into the target cluster.

Observed in the wild on a freshly recreated mesh3 tenant: control plane came up, node went Ready in ~2 minutes, kilo daemon annotated the node ~30 seconds later, but the operator had already finished its only reconcile pass and went silent. A manual `kubectl annotate` on the ClusterMesh CR was required to bump resourceVersion and kick the operator into a successful peer push.

Move the incomplete check from "empty listNodes" to "empty validNodes after filtering". This covers both shapes in one predicate:

- len(nodes) == 0: apiserver up but no kubelet joined yet
- len(nodes) > 0 but all skipped: kilo daemon still annotating

Steady-state remote clusters with always-skipped nodes (e.g. ceph's location-non-leader nodes that never receive a wireguard-ip annotation by kilo's per-location granularity design) still have at least one valid node (the location leader), so validNodes > 0 and the requeue does not fire. The check is safely scoped to true bootstrap stalls.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

* fix(validation): classify NodeSkipReason transient vs permanent

The previous patch treated every "validNodes==0" as bootstrap-in-progress and requeued every 30s indefinitely. That works for transient reasons (kilo daemon still annotating the node) but burns reconciles forever on permanent configuration errors (PodCIDROutOfRange, WGIPDuplicate, malformed annotations).

Introduce validation.IsTransient(reason) classifying each NodeSkipReason:

- transient (NodeNoPodCIDR, NodeNoWireguardIP, NodeNoPublicKey, NodeNoEndpoint): kubelet / kilo daemon still finishing setup; retry will pick up the new annotation on the next tick.
- permanent (PodCIDROutOfRange, WGIPInvalid, WGIPOutOfRange, WGIPDuplicate, EndpointInvalid): operator intervention required; retry without it cannot change the outcome.

The exhaustive switch failing closed on unknown reasons means new NodeSkipReason values must be added to the transient list intentionally; accidental silent retry is impossible.

filterNodes now returns (validNodes, skipped, transientSkipped). reconcileAllClusters arms the periodic requeue only when there is a reason to expect recovery (len(nodes)==0 OR transientSkipped>0). The all-permanent-skips case is logged at WARN level (with explicit "mesh will not converge without intervention") and the controller goes idle until the operator's spec actually changes, which is a much better signal to ops than a silent retry loop.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

---------

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
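The transient-versus-permanent classification described in the #18 commit message could look roughly like the sketch below. The reason names are taken from that message; everything else is an assumption made for illustration: NodeSkipReason is modeled as a string type, and countTransient and shouldRequeue are hypothetical helpers rather than the repository's filterNodes or reconcileAllClusters.

```go
package main

import "fmt"

// NodeSkipReason is modeled as a string here (assumption: the real type
// in the validation package may be defined differently).
type NodeSkipReason string

const (
	NodeNoPodCIDR     NodeSkipReason = "NodeNoPodCIDR"
	NodeNoWireguardIP NodeSkipReason = "NodeNoWireguardIP"
	NodeNoPublicKey   NodeSkipReason = "NodeNoPublicKey"
	NodeNoEndpoint    NodeSkipReason = "NodeNoEndpoint"
	PodCIDROutOfRange NodeSkipReason = "PodCIDROutOfRange"
	WGIPInvalid       NodeSkipReason = "WGIPInvalid"
	WGIPOutOfRange    NodeSkipReason = "WGIPOutOfRange"
	WGIPDuplicate     NodeSkipReason = "WGIPDuplicate"
	EndpointInvalid   NodeSkipReason = "EndpointInvalid"
)

// IsTransient reports whether a retry can plausibly change the outcome.
// Unknown reasons fall through to the default branch and are treated as
// permanent, so a new reason never triggers silent retries by accident.
func IsTransient(reason NodeSkipReason) bool {
	switch reason {
	case NodeNoPodCIDR, NodeNoWireguardIP, NodeNoPublicKey, NodeNoEndpoint:
		return true
	case PodCIDROutOfRange, WGIPInvalid, WGIPOutOfRange, WGIPDuplicate, EndpointInvalid:
		return false
	default:
		return false
	}
}

// countTransient is a hypothetical helper that counts skipped nodes
// whose skip reason is transient.
func countTransient(reasons []NodeSkipReason) int {
	n := 0
	for _, r := range reasons {
		if IsTransient(r) {
			n++
		}
	}
	return n
}

// shouldRequeue mirrors the predicate from the commit message:
// len(nodes)==0 OR transientSkipped>0.
func shouldRequeue(totalNodes, transientSkipped int) bool {
	return totalNodes == 0 || transientSkipped > 0
}

func main() {
	skipped := []NodeSkipReason{NodeNoWireguardIP, WGIPDuplicate}
	transient := countTransient(skipped)
	fmt.Println("transient skips:", transient)
	fmt.Println("requeue:", shouldRequeue(len(skipped), transient))
	fmt.Println("requeue when all skips are permanent:", shouldRequeue(1, 0))
}
```

Failing closed in the default branch trades a possible missed retry for a guarantee that permanent misconfigurations never spin the reconciler every 30 seconds.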

IvanHunters marked this pull request as ready for review May 28, 2026 08:15

gemini-code-assist Bot reviewed May 28, 2026

Arsolitt (Arsolitt) approved these changes May 28, 2026

IvanHunters merged commit 8c8b97e into main May 28, 2026
9 checks passed

IvanHunters deleted the fix/requeue-while-source-bootstrapping branch May 28, 2026 08:20

IvanHunters mentioned this pull request May 28, 2026

fix(controller,validation): classify node skips, requeue only when recovery is plausible #18

Merged

5 tasks
