design-proposal: migrate kubernetes workers to Talos and split into kubernetes-nodes#8
Conversation
Propose extracting node pools from the kubernetes application into a sibling kubernetes-nodes application, modelled on the vm-instance/vm-disk split. Add a backend abstraction that supports the existing KubeVirt+kubeadm flow alongside new Talos backends: KubeVirt+Talos via clastix/talos-csr-signer, and cloud-talos for Hetzner and Azure without Cluster API. Co-Authored-By: Claude <[email protected]> Signed-off-by: Andrei Kvapil <[email protected]>
There was a problem hiding this comment.
Code Review
This pull request proposes a significant architectural change to split the monolithic kubernetes package into separate control-plane and node-pool components, enabling independent lifecycles and multi-backend support for KubeVirt and cloud-native Talos workers. The review feedback identifies a technical error regarding the gRPC protocol (TCP vs UDP), suggests consolidating the node-lifecycle-controller and Talos tokens at the cluster level for better efficiency and simpler configuration, and requests further implementation details for the dependency discovery mechanism in the proposed admission webhook.
|
|
||
| Renders the same CAPI/CAPK objects as above, but with a `TalosConfigTemplate` (from `cluster-api-bootstrap-provider-talos`) replacing `KubeadmConfigTemplate`. Worker VMs boot from a Talos image. Bootstrap fetches the Talos machineconfig from CAPI and joins the cluster via standard Talos PKI. | ||
|
|
||
| The tenant's `KamajiControlPlane` carries an `additionalContainers` entry running `clastix/talos-csr-signer` listening on UDP/50001, exposed alongside `:6443` on the tenant API LoadBalancer. This is what allows `talosctl` to operate against worker nodes whose control-plane is Kamaji rather than Talos. |
|
|
||
| When `cluster-autoscaler` scales a `cloud-talos-*` pool down, it deletes the cloud VM. The tenant's apiserver still has a `Node` object that will linger until something deletes it. CAPI was previously the agent doing this; without CAPI, we need an equivalent. | ||
|
|
||
| The `node-lifecycle-controller` from `cozystack/local-ccm` is a good fit for this role. The `kubernetes-nodes` chart for `cloud-talos-*` backends renders an NLC Deployment that runs in the management cluster but uses a kubeconfig pointing to the **tenant** apiserver. It watches Node objects with the `ToBeDeletedByClusterAutoscaler:NoSchedule` taint and removes them after a configurable grace period and unreachability check. |
There was a problem hiding this comment.
Deploying a separate node-lifecycle-controller (NLC) for each kubernetes-nodes release leads to redundant processes watching the same tenant API server. A more efficient approach would be to manage a single NLC instance per tenant cluster (e.g., within the kubernetes control-plane chart) that handles node cleanup for all associated node pools.
| ## Failure and edge cases | ||
|
|
||
| - **`kubernetes-nodes` HelmRelease created before its parent `kubernetes` HelmRelease** → chart `fail`s the render with a clear error message identifying the missing parent. No partial CAPI/autoscaler resources created. | ||
| - **Parent `kubernetes` HelmRelease deleted while children exist** → all `kubernetes-nodes` HelmReleases for that cluster fail subsequent reconciles. An admission webhook on `kubernetes` HelmRelease delete blocks the operation if any `kubernetes-nodes` references it. |
There was a problem hiding this comment.
The proposed admission webhook for blocking kubernetes HelmRelease deletion requires a mechanism to discover name-based dependencies. Clarifying where this logic will reside and how it will perform discovery would strengthen the proposal, especially since the linkage doesn't use standard Kubernetes owner references.
| **System layer (chart-managed, not exposed to user):** | ||
|
|
||
| - Cluster CA, machine CA, apiserver endpoint — read at template time via `lookup` from the tenant's `KamajiControlPlane`. | ||
| - Talos token — generated once per pool, stored alongside the machineconfig. |
There was a problem hiding this comment.
Generating a TALOS_TOKEN per pool while running a single talos-csr-signer sidecar per cluster (as discussed in Open Question #3 on line 253) introduces a configuration challenge for the signer, as it must be aware of all active tokens. Using a single TALOS_TOKEN per tenant cluster would simplify the sidecar configuration and the kubernetes-nodes chart logic while maintaining sufficient isolation between tenants.
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Adjust the proposal to reflect that the controller will be developed as an independent project under the kilo-io organization, per confirmed interest from Kilo maintainer @squat. Generalize the CRD from a tenant-specific TenantMeshLink to a tenant-agnostic ClusterMesh that references peer clusters through a map of kubeconfig Secrets. Move all tenant semantics into a dedicated Cozystack integration section that also accounts for the kubernetes-nodes split (PR cozystack#8) so a single ClusterMesh covers multi-location, multi-backend tenants. Co-Authored-By: Claude <[email protected]> Signed-off-by: Andrei Kvapil <[email protected]>
Andrei Kvapil (kvaps)
left a comment
There was a problem hiding this comment.
Action items from an internal meeting. Capturing here so the design can be revised before this lands.
1. Zone / location ownership model is undefined
The proposal does not specify whether a "location" (zone) is configured by the platform admin or by the tenant. This shapes the entire kubernetes-nodes API:
- Admin-owned location: we need a separate platform-level resource that declares per-zone settings (provider creds, base image, template, network config). Node pools reference it by name and the user never sees cloud-init contents or instance template details.
- User-owned location: the tenant gets more freedom to pick instance types, templates, disks — closer to the existing KubeVirt path.
These have very different trust and UX implications. The proposal must pick one (or describe an explicit hybrid) and bake it into the values shape.
2. Base image / template source for autoscaler-managed nodes
For cloud-talos-* backends the proposal describes machineconfig injection but is silent on where the base Talos image for the cloud comes from: pre-baked snapshot/VHD, Talos Image Factory schematic computed on the fly, or user-supplied image reference. This needs an explicit field with documented defaults and an operator workflow for adding new images.
3. Credentials when the user brings their own cloud
If we want to support tenants connecting their own cloud accounts (a direction raised in the meeting), the design must cover:
- Where per-tenant cloud credentials live (tenant-namespace Secret references in the
kubernetes-nodesHelmRelease). - How the management-cluster
cluster-autoscalerreads tenant-specific credentials without privilege escalation. - Trust boundary: a compromised tenant must not be able to leak or repurpose those credentials.
The current proposal implicitly assumes platform-supplied credentials. Add an explicit section on tenant-supplied credentials, or push to an open question with a clear statement.
4. Scheduling class semantics break for external backends
tenant.spec.scheduling (allowed and default scheduling classes) is enforced when KubeVirt VMs are scheduled in the management cluster. For cloud-talos-* backends there is no management-cluster pod, so the existing scheduling-class machinery has nothing to act on.
The proposal should:
- State explicitly that scheduling classes only apply to
kubevirt-*backends. - Decide whether equivalent constraints (per-tenant restrictions on which
kubernetes-nodesconfigurations are allowed) should be lifted intokubernetes-nodesitself rather than relying on scheduling classes.
5. Backend abstraction: should it be a real contract, not just an enum?
Today backend.type is a string switch and the chart branches internally. As we add backends (KubeVirt-Talos, AWS, GCP, on-prem KVM), this becomes hard to maintain. Worth considering:
- A minimal contract any backend must implement (CRs produced, credentials interface, autoscaler integration shape).
- Documenting it so external contributors can add a backend without reading the entire chart.
- Possibly per-backend sub-charts.
Not blocking, but should be an explicit open question.
6. UI form generation per backend
The dashboard needs a different form per backend.type. The proposal should mention that the values schema is structured so the frontend can render per-backend forms cleanly (tagged-union shape, not a flat values bag with conditional fields).
7. Tangential: management cluster on non-KubeVirt
The same kubernetes-nodes model could plausibly apply to the management cluster's own node pools (running it on Azure VMs, not just KubeVirt). The existing autoscaling docs already show the infra-level pattern. The proposal should:
- Explicitly say "out of scope — management cluster node management remains separate", or
- Note that the design is intentionally compatible with that future direction.
Will revise the proposal to address these points.
Drop the multi-backend design (cloud-talos-hetzner, cloud-talos-azure, LocationProfile, NLC, etc.) and rewrite around two phases of internal restructuring: - Phase 1: replace Ubuntu+kubeadm worker bootstrap with Talos via CABPT, inside the existing monolithic chart, with no user-facing API change. Patch needed in cluster-api-control-plane-provider-kamaji to render the talos-csr-signer sidecar in TenantControlPlane. - Phase 2: once workers are uniformly Talos, split the chart into kubernetes (control-plane) + kubernetes-nodes (per-pool). Single backend, no backend.type field. Hybrid / external-cloud clusters are deferred to Phase 3, tracked separately as a follow-up draft proposal. Co-Authored-By: Claude <[email protected]> Signed-off-by: Andrei Kvapil <[email protected]>
Co-Authored-By: Claude <[email protected]> Signed-off-by: Andrei Kvapil <[email protected]>
Summary
A two-phase reshape of the
kubernetesapplication:cluster-api-bootstrap-provider-talos, inside the existing monolithic chart. No user-facing API change. Seamless migration via standard CAPI MachineDeployment rolling update. Requires a patch toclastix/cluster-api-control-plane-provider-kamajito render thetalos-csr-signersidecar in theTenantControlPlane.kubernetes(control-plane only) +kubernetes-nodes(one per pool), modelled on thevm-instance/vm-diskprecedent. Single backend (kubevirt-talos), nobackend.typefield yet.Hybrid clusters (workers in external clouds, BYO, bare-metal) are deferred as Phase 3 to a separate placeholder draft: #9.
Supersedes an earlier scope that tried to land multiple backends and the split in one shot — see commit history for the rescoping rationale.
Test plan
Implementation testing is scoped in the proposal:
kubernetes-nodeschart renderingkubernetesHelmRelease withnodeGroupsmigrated to split shape