fix(controller): requeue while source cluster is still bootstrapping#17
Conversation
The reconciler watches only ClusterMesh CRs, not the Nodes of remote clusters. If the first reconcile runs against a source cluster whose apiserver answers but whose kubelet has not joined yet (or whose kubeconfig secret is not in the registry at all), reconcile completes without error: ensureNodeEndpoints sees zero nodes, pushPeersToTargets emits an empty desired peer set, no useful state changes. controller runtime then has nothing to requeue on, and the next attempt waits for an external event on the CR — which can take many minutes (in the wild we saw a 17-minute mesh-up gap on a freshly bootstrapped tenant whose VM landed a few minutes after the operator's first reconcile). Flag the run as "incomplete" when any source is missing from the registry or returned zero nodes, and translate that into a RequeueAfter at the Reconcile boundary. The interval is a short constant (30s) so freshly bootstrapped tenants converge promptly without waiting on a stray watch event. Errors are unaffected: controller-runtime keeps its own exponential backoff for those. This is the source-side dual of the PR #15 fix (which addressed the target-side NoMatchError on the Peer CRD via mapper Reset). Together they close the two halves of the freshly-bootstrapped-tenant race. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
|
Warning Review limit reached
More reviews will be available in 16 minutes and 51 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a periodic requeue mechanism (every 30 seconds) in the ClusterMesh controller when at least one source cluster is still bootstrapping (missing from the registry or has no nodes yet). This prevents the operator from stalling when remote clusters are not yet ready. The feedback suggests two key improvements: first, updating the mesh status to reflect a bootstrapping state rather than unconditionally marking it as ready when reconciliation is incomplete; second, optimizing the reconciliation loop by skipping further steps and continuing early when a cluster has zero nodes.
| r.cleanupOrphanMeshPeers(ctx, log, mesh.Namespace) | ||
|
|
||
| return r.updateStatus(ctx, mesh, clusterStatuses) | ||
| return incomplete, r.updateStatus(ctx, mesh, clusterStatuses) |
There was a problem hiding this comment.
When incomplete is true, the mesh is still bootstrapping (e.g., some clusters have no nodes or are missing from the registry). However, updateStatus unconditionally sets the Ready condition to True (Reconciled). This can mislead users or external controllers into thinking the mesh is fully established when it is not.
Consider updating updateStatus to accept the incomplete boolean and set the Ready condition to False (with a reason like Bootstrapping) when incomplete is true.
| if len(nodes) == 0 { | ||
| log.Info("source cluster has no nodes yet; will requeue", | ||
| slog.String("cluster", srcEntry.Name), | ||
| slog.Duration("after", bootstrapRequeueAfter), | ||
| ) | ||
|
|
||
| incomplete = true | ||
| } |
There was a problem hiding this comment.
If a source cluster has 0 nodes, we can skip the rest of the reconciliation steps for this cluster (such as filtering nodes and pushing peers to other target clusters) and continue early. This avoids performing unnecessary List and reconciliation API calls to all other clusters in the mesh, which is especially important since we are requeuing every 30 seconds during bootstrap.
| if len(nodes) == 0 { | |
| log.Info("source cluster has no nodes yet; will requeue", | |
| slog.String("cluster", srcEntry.Name), | |
| slog.Duration("after", bootstrapRequeueAfter), | |
| ) | |
| incomplete = true | |
| } | |
| if len(nodes) == 0 { | |
| log.Info("source cluster has no nodes yet; will requeue", | |
| slog.String("cluster", srcEntry.Name), | |
| slog.Duration("after", bootstrapRequeueAfter), | |
| ) | |
| incomplete = true | |
| statuses = append(statuses, v1alpha1.ClusterStatus{Name: srcEntry.Name}) | |
| continue | |
| } |
…covery is plausible (#18) * fix(controller): also requeue when source nodes are awaiting kilo daemon annotations PR #17 added a periodic requeue when the source cluster's listNodes returned an empty slice, on the theory that bootstrap-in-progress means "kubelet has not joined yet". That theory missed the next stage of the same race: kubelet has joined and the node is Ready, but the kilo daemon has not yet written kilo.squat.ai/wireguard-ip (or public-key, or endpoint) onto the node. validateNode rejects the node on NodeNoWireguardIP, filterNodes drops it, pushPeersToTargets emits an empty desired set, reconcile returns nil with len(nodes)>0 — and the incomplete flag stays false, so the requeue timer never arms. The operator does not watch Nodes in remote clusters, so the annotation landing produces no event, and the tenant's peer is never pushed into the target cluster. Observed in the wild on a freshly recreated mesh3 tenant: control plane came up, node went Ready in ~2 minutes, kilo daemon annotated the node ~30 seconds later, but the operator had already finished its only reconcile pass and went silent. A manual `kubectl annotate` on the ClusterMesh CR was required to bump resourceVersion and kick the operator into a successful peer push. Move the incomplete check from "empty listNodes" to "empty validNodes after filtering". This covers both shapes in one predicate: - len(nodes) == 0 — apiserver up but no kubelet joined yet - len(nodes) > 0 but all skipped — kilo daemon still annotating Steady-state remote clusters with always-skipped nodes (e.g. ceph's location-non-leader nodes that never receive a wireguard-ip annotation by kilo's per-location granularity design) still have at least one valid node (the location leader), so validNodes > 0 and the requeue does not fire. The check is safely scoped to true bootstrap stalls. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com> * fix(validation): classify NodeSkipReason transient vs permanent The previous patch treated every "validNodes==0" as bootstrap-in- progress and requeued every 30s indefinitely. That works for transient reasons (kilo daemon still annotating the node) but burns reconciles forever on permanent configuration errors (PodCIDROutOfRange, WGIPDuplicate, malformed annotations). Introduce validation.IsTransient(reason) classifying each NodeSkipReason: - transient (NodeNoPodCIDR, NodeNoWireguardIP, NodeNoPublicKey, NodeNoEndpoint) — kubelet / kilo daemon still finishing setup; retry will pick up the new annotation on the next tick. - permanent (PodCIDROutOfRange, WGIPInvalid, WGIPOutOfRange, WGIPDuplicate, EndpointInvalid) — operator intervention required; retry without it cannot change the outcome. The exhaustive switch failing closed on unknown reasons means new NodeSkipReason values must be added to the transient list intentionally — accidental silent retry is impossible. filterNodes now returns (validNodes, skipped, transientSkipped). reconcileAllClusters arms the periodic requeue only when there is a reason to expect recovery (len(nodes)==0 OR transientSkipped>0). The all-permanent-skips case is logged at WARN level (with explicit "mesh will not converge without intervention") and the controller goes idle until the operator's spec actually changes — much better signal to ops than a silent retry loop. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com> --------- Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
Summary
Source-side companion to #15. The reconciler watches only
ClusterMeshCRs, not the Nodes of remote clusters. If the first reconcile runs while a source cluster is still bootstrapping (apiserver answers but kubelet hasn't joined yet, or kubeconfig secret hasn't landed in the registry yet), reconcile completes without error and produces no useful effect:ensureNodeEndpointssees zero nodes,pushPeersToTargetsemits an empty desired set. controller-runtime then has nothing to requeue on, and the next attempt waits for an external event on the CR — which can take many minutes.Observed in the wild: a freshly recreated tenant's mesh-up was delayed 17 minutes because mesh1's VM came up ~3 minutes after the operator's first reconcile, and the next reconcile didn't fire until the Cozystack Package controller re-applied the CR for unrelated reasons.
Fix
Track an
incompleteflag inreconcileAllClusters:r.Registry.Client(srcEntry.Name)returned!ok— kubeconfig secret hasn't been merged into the registry yet,listNodes(srcClient)returned an empty slice — apiserver up, no nodes joined.At the
Reconcileboundary, whenerr == nil && incomplete, returnctrl.Result{RequeueAfter: 30 * time.Second}. MixingRequeueAfterwith an error would defeat controller-runtime's exponential backoff, so the error path is unchanged.The constant
bootstrapRequeueAfter = 30 * time.Secondis a deliberate trade-off:incompleteis false and the timer is not armed.This is the source-side dual of #15 (target-side
NoMatchErroron thePeerCRD via mapperReset). Together they close both halves of the freshly-bootstrapped-tenant race.Test plan
go build ./...clean.go vet ./...clean.go test ./...(excluding integration) passes.golangci-lint runzero issues.KubernetesSwitchcloudtenant: mesh-up should converge within ~1 minute of the tenant's first node becoming Ready (versus 17 minutes observed without this patch).