
fix(controller): requeue while source cluster is still bootstrapping#17

Merged
IvanHunters merged 1 commit into main from fix/requeue-while-source-bootstrapping
May 28, 2026

@IvanHunters
Collaborator

Summary

Source-side companion to #15. The reconciler watches only ClusterMesh CRs, not the Nodes of remote clusters. If the first reconcile runs while a source cluster is still bootstrapping (apiserver answers but kubelet hasn't joined yet, or kubeconfig secret hasn't landed in the registry yet), reconcile completes without error and produces no useful effect: ensureNodeEndpoints sees zero nodes, pushPeersToTargets emits an empty desired set. controller-runtime then has nothing to requeue on, and the next attempt waits for an external event on the CR — which can take many minutes.

Observed in the wild: a freshly recreated tenant's mesh-up was delayed 17 minutes because mesh1's VM came up ~3 minutes after the operator's first reconcile, and the next reconcile didn't fire until the Cozystack Package controller re-applied the CR for unrelated reasons.

Fix

Track an incomplete flag in reconcileAllClusters. It is set when, for any source cluster, either:

  • r.Registry.Client(srcEntry.Name) returned !ok — kubeconfig secret hasn't been merged into the registry yet,
  • or listNodes(srcClient) returned an empty slice — apiserver up, no nodes joined.

At the Reconcile boundary, when err == nil && incomplete, return ctrl.Result{RequeueAfter: 30 * time.Second}. When err != nil, controller-runtime ignores the Result and requeues with its own exponential backoff, so the error path is unchanged.

The constant bootstrapRequeueAfter = 30 * time.Second is a deliberate trade-off:

  • short enough that freshly bootstrapped tenants converge in tens of seconds rather than tens of minutes,
  • long enough that a quiescent fleet (steady-state) doesn't get spurious reconcile churn — once all sources have nodes, incomplete is false and the timer is not armed.

This is the source-side dual of #15 (target-side NoMatchError on the Peer CRD via mapper Reset). Together they close both halves of the freshly-bootstrapped-tenant race.

Test plan

  • go build ./... clean.
  • go vet ./... clean.
  • go test ./... (excluding integration) passes.
  • golangci-lint run zero issues.
  • Verify on a Cozystack hosting cluster against a recreated KubernetesSwitchcloud tenant: mesh-up should converge within ~1 minute of the tenant's first node becoming Ready (versus 17 minutes observed without this patch).

The reconciler watches only ClusterMesh CRs, not the Nodes of remote
clusters. If the first reconcile runs against a source cluster whose
apiserver answers but whose kubelet has not joined yet (or whose
kubeconfig secret is not in the registry at all), reconcile completes
without error: ensureNodeEndpoints sees zero nodes, pushPeersToTargets
emits an empty desired peer set, and no useful state changes.
controller-runtime then has nothing to requeue on, and the next attempt waits for
an external event on the CR — which can take many minutes (in the
wild we saw a 17-minute mesh-up gap on a freshly bootstrapped tenant
whose VM landed a few minutes after the operator's first reconcile).

Flag the run as "incomplete" when any source is missing from the
registry or returned zero nodes, and translate that into a
RequeueAfter at the Reconcile boundary. The interval is a short
constant (30s) so freshly bootstrapped tenants converge promptly
without waiting on a stray watch event. Errors are unaffected:
controller-runtime keeps its own exponential backoff for those.

This is the source-side dual of the PR #15 fix (which addressed the
target-side NoMatchError on the Peer CRD via mapper Reset). Together
they close the two halves of the freshly-bootstrapped-tenant race.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
@IvanHunters IvanHunters marked this pull request as ready for review May 28, 2026 08:15
@coderabbitai

coderabbitai Bot commented May 28, 2026

Warning

Review limit reached

@IvanHunters, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 16 minutes and 51 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

📥 Commits

Reviewing files that changed from the base of the PR and between 2f76730 and d2f1a56.

📒 Files selected for processing (1)
  • internal/controller/clustermesh_controller.go

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a periodic requeue mechanism (every 30 seconds) in the ClusterMesh controller when at least one source cluster is still bootstrapping (missing from the registry or has no nodes yet). This prevents the operator from stalling when remote clusters are not yet ready. The feedback suggests two key improvements: first, updating the mesh status to reflect a bootstrapping state rather than unconditionally marking it as ready when reconciliation is incomplete; second, optimizing the reconciliation loop by skipping further steps and continuing early when a cluster has zero nodes.

  r.cleanupOrphanMeshPeers(ctx, log, mesh.Namespace)

- return r.updateStatus(ctx, mesh, clusterStatuses)
+ return incomplete, r.updateStatus(ctx, mesh, clusterStatuses)

medium

When incomplete is true, the mesh is still bootstrapping (e.g., some clusters have no nodes or are missing from the registry). However, updateStatus unconditionally sets the Ready condition to True (Reconciled). This can mislead users or external controllers into thinking the mesh is fully established when it is not.

Consider updating updateStatus to accept the incomplete boolean and set the Ready condition to False (with a reason like Bootstrapping) when incomplete is true.
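One possible shape for this suggestion is sketched below. It is an illustration only: `Condition` is a local stand-in for `metav1.Condition`, `readyCondition` is a hypothetical helper, and the message strings are invented. Only the `Ready` type and the `Reconciled` and `Bootstrapping` reasons come from the comment above.

```go
package main

import "fmt"

// Condition is a local stand-in for metav1.Condition, reduced to the
// fields this sketch needs.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// readyCondition is a hypothetical helper that updateStatus could call
// once it receives the incomplete flag.
func readyCondition(incomplete bool) Condition {
	if incomplete {
		return Condition{
			Type:    "Ready",
			Status:  "False",
			Reason:  "Bootstrapping",
			Message: "at least one source cluster has no nodes or is missing from the registry",
		}
	}
	return Condition{
		Type:    "Ready",
		Status:  "True",
		Reason:  "Reconciled",
		Message: "all source clusters reconciled",
	}
}

func main() {
	fmt.Printf("%+v\n", readyCondition(true))
	fmt.Printf("%+v\n", readyCondition(false))
}
```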

Comment on lines +476 to 483
if len(nodes) == 0 {
	log.Info("source cluster has no nodes yet; will requeue",
		slog.String("cluster", srcEntry.Name),
		slog.Duration("after", bootstrapRequeueAfter),
	)

	incomplete = true
}

medium

If a source cluster has 0 nodes, we can skip the rest of the reconciliation steps for this cluster (such as filtering nodes and pushing peers to other target clusters) and continue early. This avoids performing unnecessary List and reconciliation API calls to all other clusters in the mesh, which is especially important since we are requeuing every 30 seconds during bootstrap.

Suggested change
- if len(nodes) == 0 {
- 	log.Info("source cluster has no nodes yet; will requeue",
- 		slog.String("cluster", srcEntry.Name),
- 		slog.Duration("after", bootstrapRequeueAfter),
- 	)
- 	incomplete = true
- }
+ if len(nodes) == 0 {
+ 	log.Info("source cluster has no nodes yet; will requeue",
+ 		slog.String("cluster", srcEntry.Name),
+ 		slog.Duration("after", bootstrapRequeueAfter),
+ 	)
+ 	incomplete = true
+ 	statuses = append(statuses, v1alpha1.ClusterStatus{Name: srcEntry.Name})
+ 	continue
+ }

@IvanHunters IvanHunters merged commit 8c8b97e into main May 28, 2026
9 checks passed
@IvanHunters IvanHunters deleted the fix/requeue-while-source-bootstrapping branch May 28, 2026 08:20
IvanHunters added a commit that referenced this pull request May 28, 2026
…covery is plausible (#18)

* fix(controller): also requeue when source nodes are awaiting kilo daemon annotations

PR #17 added a periodic requeue when the source cluster's listNodes
returned an empty slice, on the theory that bootstrap-in-progress means
"kubelet has not joined yet". That theory missed the next stage of the
same race: kubelet has joined and the node is Ready, but the kilo
daemon has not yet written kilo.squat.ai/wireguard-ip (or public-key,
or endpoint) onto the node. validateNode rejects the node on
NodeNoWireguardIP, filterNodes drops it, pushPeersToTargets emits an
empty desired set, reconcile returns nil with len(nodes)>0 — and the
incomplete flag stays false, so the requeue timer never arms. The
operator does not watch Nodes in remote clusters, so the annotation
landing produces no event, and the tenant's peer is never pushed into
the target cluster.

Observed in the wild on a freshly recreated mesh3 tenant: control plane
came up, node went Ready in ~2 minutes, kilo daemon annotated the node
~30 seconds later, but the operator had already finished its only
reconcile pass and went silent. A manual `kubectl annotate` on the
ClusterMesh CR was required to bump resourceVersion and kick the
operator into a successful peer push.

Move the incomplete check from "empty listNodes" to "empty validNodes
after filtering". This covers both shapes in one predicate:

- len(nodes) == 0 — apiserver up but no kubelet joined yet
- len(nodes) > 0 but all skipped — kilo daemon still annotating

Steady-state remote clusters with always-skipped nodes (e.g. ceph's
location-non-leader nodes that never receive a wireguard-ip annotation
by kilo's per-location granularity design) still have at least one
valid node (the location leader), so validNodes > 0 and the requeue
does not fire. The check is safely scoped to true bootstrap stalls.
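A minimal sketch of the predicate described above, under simplifying assumptions: the `node` struct is hypothetical, and validation is reduced to "has a wireguard-ip value", whereas the real validateNode checks more than that.

```go
package main

import "fmt"

// node is a hypothetical, reduced view of a remote Node.
type node struct {
	name        string
	wireguardIP string // value of kilo.squat.ai/wireguard-ip; empty if not yet annotated
}

// filterNodes keeps only nodes that pass the simplified validation.
func filterNodes(nodes []node) []node {
	var valid []node
	for _, n := range nodes {
		if n.wireguardIP != "" {
			valid = append(valid, n)
		}
	}
	return valid
}

// sourceIncomplete covers both shapes with one predicate: no nodes at all,
// or nodes present but none valid yet.
func sourceIncomplete(nodes []node) bool {
	return len(filterNodes(nodes)) == 0
}

func main() {
	fmt.Println(sourceIncomplete(nil))                                            // no kubelet joined yet
	fmt.Println(sourceIncomplete([]node{{name: "n1"}}))                           // joined, not annotated
	fmt.Println(sourceIncomplete([]node{{name: "n1", wireguardIP: "10.4.0.1"}})) // annotated
}
```

A cluster with one annotated leader and several never-annotated followers still yields a non-empty valid set, so the predicate stays false there.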

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

* fix(validation): classify NodeSkipReason transient vs permanent

The previous patch treated every "validNodes==0" as bootstrap-in-
progress and requeued every 30s indefinitely. That works for transient
reasons (kilo daemon still annotating the node) but burns reconciles
forever on permanent configuration errors (PodCIDROutOfRange,
WGIPDuplicate, malformed annotations).

Introduce validation.IsTransient(reason) classifying each
NodeSkipReason:

- transient (NodeNoPodCIDR, NodeNoWireguardIP, NodeNoPublicKey,
  NodeNoEndpoint) — kubelet / kilo daemon still finishing setup;
  retry will pick up the new annotation on the next tick.
- permanent (PodCIDROutOfRange, WGIPInvalid, WGIPOutOfRange,
  WGIPDuplicate, EndpointInvalid) — operator intervention required;
  retry without it cannot change the outcome.

The exhaustive switch failing closed on unknown reasons means new
NodeSkipReason values must be added to the transient list
intentionally — accidental silent retry is impossible.
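The classification above could look roughly like the sketch below. The reason names are taken from the commit message. The underlying string type, the constant values, and the exact switch layout are assumptions, not the package's actual definitions.

```go
package main

import "fmt"

// NodeSkipReason is assumed here to be a string type; the real
// definition may differ.
type NodeSkipReason string

const (
	NodeNoPodCIDR     NodeSkipReason = "NodeNoPodCIDR"
	NodeNoWireguardIP NodeSkipReason = "NodeNoWireguardIP"
	NodeNoPublicKey   NodeSkipReason = "NodeNoPublicKey"
	NodeNoEndpoint    NodeSkipReason = "NodeNoEndpoint"
	PodCIDROutOfRange NodeSkipReason = "PodCIDROutOfRange"
	WGIPInvalid       NodeSkipReason = "WGIPInvalid"
	WGIPOutOfRange    NodeSkipReason = "WGIPOutOfRange"
	WGIPDuplicate     NodeSkipReason = "WGIPDuplicate"
	EndpointInvalid   NodeSkipReason = "EndpointInvalid"
)

// IsTransient reports whether a skip reason is expected to clear on its
// own. Unknown reasons fail closed and are treated as permanent.
func IsTransient(reason NodeSkipReason) bool {
	switch reason {
	case NodeNoPodCIDR, NodeNoWireguardIP, NodeNoPublicKey, NodeNoEndpoint:
		// kubelet or kilo daemon is still finishing setup.
		return true
	case PodCIDROutOfRange, WGIPInvalid, WGIPOutOfRange, WGIPDuplicate, EndpointInvalid:
		// Configuration error; retrying cannot change the outcome.
		return false
	default:
		// A new reason must be added to the transient list on purpose.
		return false
	}
}

func main() {
	fmt.Println(IsTransient(NodeNoWireguardIP))
	fmt.Println(IsTransient(WGIPDuplicate))
	fmt.Println(IsTransient(NodeSkipReason("SomethingNew")))
}
```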

filterNodes now returns (validNodes, skipped, transientSkipped).
reconcileAllClusters arms the periodic requeue only when there is a
reason to expect recovery (len(nodes)==0 OR transientSkipped>0). The
all-permanent-skips case is logged at WARN level (with explicit
"mesh will not converge without intervention") and the controller goes
idle until the operator's spec actually changes — much better signal
to ops than a silent retry loop.
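The resulting decision can be sketched as below. The three-value return of filterNodes follows the commit message. Everything else is hypothetical: the `node` fields, the `decision` type, and in particular the assumption that the requeue check is only consulted when no valid nodes remain, which the message does not spell out.

```go
package main

import "fmt"

// node is a hypothetical, pre-validated view of a remote Node.
type node struct {
	name      string
	skipped   bool // validation rejected the node
	transient bool // only meaningful when skipped: the reason may clear on its own
}

// filterNodes returns the valid nodes plus counts of skipped nodes and of
// skipped nodes whose reason is transient.
func filterNodes(nodes []node) (validNodes []node, skipped, transientSkipped int) {
	for _, n := range nodes {
		if !n.skipped {
			validNodes = append(validNodes, n)
			continue
		}
		skipped++
		if n.transient {
			transientSkipped++
		}
	}
	return validNodes, skipped, transientSkipped
}

type decision int

const (
	proceed           decision = iota // at least one valid node: push peers
	requeue                           // recovery is plausible: arm the periodic requeue
	needsIntervention                 // all skips permanent: log at WARN and go idle
)

// decide assumes the requeue check only matters when nothing is valid.
func decide(nodes []node) decision {
	valid, _, transientSkipped := filterNodes(nodes)
	switch {
	case len(valid) > 0:
		return proceed
	case len(nodes) == 0 || transientSkipped > 0:
		return requeue
	default:
		return needsIntervention
	}
}

func main() {
	fmt.Println(decide(nil) == requeue)
	fmt.Println(decide([]node{{name: "n1", skipped: true}}) == needsIntervention)
}
```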

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

---------

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>