Stale /var/lib/calico/nodename causes duplicate Pod IPs via incorrect IPAM bookkeeping and subsequent leak GC #12257

@seriousgong

Description

Expected Behavior

When a node is (re)provisioned, the CNI plugin should use the current node's identity for all IPAM allocations and WEP creation. Even if /var/lib/calico/nodename contains a stale value from a previous node, the system should not:

  1. Book IPAM allocations under the wrong node name
  2. Garbage-collect active IP allocations that are still bound to running Pods
  3. Re-assign an IP that is actively in use, resulting in duplicate Pod IPs

Current Behavior

When a node boots with a stale /var/lib/calico/nodename left over from a previous node identity (e.g., node reimaged from a VM template), the following chain of events occurs:

  1. install-cni init container completes and makes the CNI plugin available
  2. kubelet immediately triggers CmdAddK8s for pending DaemonSet Pods
  3. The CNI plugin reads the stale nodename from /var/lib/calico/nodename (via DetermineNodename()) and uses it for:
    • WorkloadEndpoint.Spec.Node
    • IPAM allocation attrs["node"]
    • IPAM AutoAssignArgs.Hostname
  4. calico-node starts after the first CNI ADD calls and overwrites the nodename file with the correct value — subsequent Pods get the right identity
  5. ~15 minutes later, calico-kube-controllers runs allocationIsValid() and compares Pod.Spec.NodeName (correct, e.g., 10-199-0-105) against allocation.attrs.node (stale, e.g., 10-199-0-21)
  6. The controller concludes "Pod rescheduled on new node. Allocation no longer valid" and GCs the allocation
  7. The IP is returned to the pool while the original Pod still uses it on its network interface
  8. A new Pod on another node gets assigned the same IP → duplicate Pod IP

Evidence from the CNI log on node 10-199-0-105 — the first Pod is booked under the stale nodename:

2026-03-23 10:42:13.497 [INFO] k8s.go 77: Extracted identifiers for CmdAddK8s
  ContainerID="fc3a04..." Pod="csi-node-driver-rlfpx"
  WorkloadEndpoint="10--199--0--21-k8s-csi--node--driver--rlfpx-eth0"

2026-03-23 10:42:13.531 [INFO] ipam_plugin.go 270: Auto assigning IP
  Attrs:{"node":"10-199-0-21", "pod":"csi-node-driver-rlfpx", ...}
  Hostname:"10-199-0-21"

2026-03-23 10:42:13.732 [INFO] ipam.go 1216: Successfully claimed IPs: [10.200.129.198/26]

Six seconds later, the same node uses the correct identity for the next Pod:

2026-03-23 10:42:19.137 [INFO] k8s.go 77: Extracted identifiers for CmdAddK8s
  ContainerID="97062860..." Pod="node-problem-detector-774b4"
  WorkloadEndpoint="10--199--0--105-k8s-node--problem--detector--774b4-eth0"

  Attrs:{"node":"10-199-0-105", ...}
  Hostname:"10-199-0-105"

Controller log showing incorrect GC of the active allocation:

Pod rescheduled on new node. Allocation no longer valid  old=10-199-0-21 new=10-199-0-105
Candidate IP leak  ip=10.200.129.198
Confirmed IP leak after 15m0s  ip=10.200.129.198
Garbage collecting leaked IP address  ip=10.200.129.198

Resulting duplicate IP — two Pods on different nodes holding the same IP:

$ calicoctl get wep -A -o wide | grep '10.200.129.198'
calico-system  10--199--1--92-k8s-csi--node--driver--6bhc7-eth0   10-199-1-92   10.200.129.198/32
calico-system  10--199--0--105-k8s-csi--node--driver--rlfpx-eth0  10-199-0-105  10.200.129.198/32

This was reproduced on multiple nodes (10-199-0-105, 10-199-1-92) in the same cluster, all with the same stale nodename 10-199-0-21.

Possible Solution

There are two contributing issues that could each be addressed:

1. CNI plugin: DetermineNodename() trusts stale file without validation

In cni-plugin/internal/pkg/utils/utils.go, DetermineNodename() reads /var/lib/calico/nodename and trusts its content unconditionally:

func DetermineNodename(conf types.NetConf) (nodename string) {
    if conf.Nodename != "" {
        nodename = conf.Nodename
    } else if nff := nodenameFromFile(conf.NodenameFile); nff != "" {
        nodename = nff                    // ← reads stale file without validation
    } else if conf.Hostname != "" {
        nodename = conf.Hostname
    } else {
        nodename, _ = names.Hostname()
    }
    return
}

Suggested fix: Cross-validate the nodename file content against KUBERNETES_NODE_NAME (available from CNI args / Pod downward API environment). If they differ, prefer KUBERNETES_NODE_NAME or return an error. Alternatively, ensure calico-node writes the nodename file before install-cni signals CNI readiness.

2. kube-controllers: allocationIsValid() treats node mismatch as definitive evidence of rescheduling

In kube-controllers/pkg/controllers/node/ipam.go:

// TODO: Do we need this check?
if p.Spec.NodeName != "" && a.knode != "" && p.Spec.NodeName != a.knode {
    logc.WithFields(fields).Info("Pod rescheduled on new node. Allocation no longer valid")
    return false
}

Note the existing // TODO: Do we need this check? comment.

This check assumes that a node mismatch means the Pod was rescheduled. But in this scenario, the Pod never moved — the allocation was simply recorded under the wrong node by CNI. The Pod is Running, its status.podIP matches the allocation, and it is actively using the IP.

Suggested fix: Before concluding the allocation is invalid, additionally verify whether Pod.Status.PodIP matches the allocated IP. If the Pod is Running on the "new" node with the exact same IP, the allocation is likely a bookkeeping error rather than a genuine reschedule — it should not be GC'd.

Steps to Reproduce (for bugs)

  1. Set up a Calico cluster using KDD (Kubernetes datastore)
  2. Provision a node from a VM image/template that retains /var/lib/calico/nodename from a different node (e.g., node B has nodename file containing node A's name)
  3. Start the node — kubelet will schedule DaemonSet Pods immediately
  4. Observe the startup ordering:
    • install-cni completes → CNI becomes available
    • First CmdAddK8s calls use stale nodename from the file (within ~1-3 seconds)
    • calico-node starts and corrects the nodename file (~3-6 seconds after install-cni)
    • Subsequent CmdAddK8s calls use the correct nodename
  5. Wait ~15 minutes (default leakGracePeriod)
  6. calico-kube-controllers logs Garbage collecting leaked IP address for the affected IPs
  7. New Pods scheduled elsewhere may now receive the same IP → duplicate Pod IP

Context

We operate a large Kubernetes cluster and frequently batch-add ~30 nodes at a time. Nodes are provisioned from VM templates that may retain /var/lib/calico/ data from a previous node identity. After each batch expansion, we consistently observe duplicate Pod IPs caused by this race condition.

The impact is severe:

  • Silent traffic misrouting — two Pods on different nodes hold the same IP, causing unpredictable network behavior
  • No error surfaced — the duplicate is only discovered through manual inspection or when applications fail
  • Scales with cluster growth — the more nodes provisioned in parallel, the more Pods are affected

Current workarounds:

  • Deleting /var/lib/calico/nodename before node joins (requires changes to provisioning pipeline)
  • Increasing leakGracePeriod (reduces probability but does not eliminate the root cause)
  • Manually deleting affected Pods after detection

Your Environment

  • Calico version: v3.28.1
  • Calico dataplane: iptables
  • Orchestrator version: Kubernetes v1.30.5
  • Operating System and version: Ubuntu 22.04 LTS (kernel 5.15.0-94-generic)
  • Container runtime: containerd 1.7.23
  • IPAM config: 2 workload IPPools (10.196.0.0/15, 10.195.128.0/17), blockSize 26, ipipMode Always, strictAffinity false
