Skip to content

fix: Reprogram NCs missing from CNS for Succeeded CRs after restart#4357

Merged
rbtr merged 6 commits intomasterfrom
jostupka/fix-stale-nc-reconciliation
Apr 20, 2026
Merged

fix: Reprogram NCs missing from CNS for Succeeded CRs after restart#4357
rbtr merged 6 commits intomasterfrom
jostupka/fix-stale-nc-reconciliation

Conversation

@hunter32292
Copy link
Copy Markdown
Collaborator

When CNS restarts or loses persisted state, NetworkContainers may be lost from the in-memory ContainerIDByOrchestratorContext map while the corresponding MultiTenantNetworkContainer CRs remain in Succeeded state.

Previously, the reconciler skipped all CRs not in Initialized state, meaning Succeeded CRs with missing NCs were never reprogrammed. This caused permanent CNI ADD failures (Code 18: UnknownContainerID) with no self-healing path.

Now, the reconciler allows Succeeded CRs through to the NC existence check. If the NC exists in CNS, reconciliation is skipped as before (no behavior change for the happy path). If the NC is missing, it is reprogrammed from the CR's
status fields.

Transient CNS errors are not masked — only UnknownContainerID triggers reprogramming, matching the existing behavior for Initialized CRs.

Reason for Change: Fix permanent CNI ADD failures after CNS restart when NCs are lost from memory but CRs remain in Succeeded state.

Issue Fixed:

Requirements:

  • uses conventional commit messages
  • includes documentation
  • adds unit tests
  • relevant PR labels added

Notes:

  • No behavior change for the happy path (Succeeded CRs with NCs present in CNS are still skipped)
  • Only UnknownContainerID triggers reprogramming; transient errors are surfaced as before
  • Three new test cases cover: missing NC reprogramming, existing NC skip, and transient error handling

When CNS restarts or loses persisted state, NetworkContainers may be
lost from the in-memory ContainerIDByOrchestratorContext map while the
corresponding MultiTenantNetworkContainer CRs remain in Succeeded state.

Previously, the reconciler skipped all CRs not in Initialized state,
meaning Succeeded CRs with missing NCs were never reprogrammed. This
caused permanent CNI ADD failures (Code 18: UnknownContainerID) with
no self-healing path.

Now, the reconciler allows Succeeded CRs through to the NC existence
check. If the NC exists in CNS, reconciliation is skipped as before
(no behavior change for the happy path). If the NC is missing, it is
reprogrammed from the CR's status fields.

Transient CNS errors are not masked — only UnknownContainerID triggers
reprogramming, matching the existing behavior for Initialized CRs.

Co-authored-by: Copilot <[email protected]>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the multi-tenant NetworkContainer CR reconciler so that Succeeded CRs are no longer skipped before verifying NC existence in CNS, enabling self-healing reprogramming after CNS restarts that lose in-memory NC state.

Changes:

  • Allow Succeeded CRs to proceed to the CNS NC existence check (previously only Initialized CRs were reconciled).
  • If CNS reports UnknownContainerID for a Succeeded CR, reprogram the NC from CR status; otherwise surface transient CNS errors.
  • Add unit tests covering succeeded+missing NC reprogramming, succeeded+existing NC skip, and transient error propagation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
cns/multitenantcontroller/multitenantoperator/multitenantcrdreconciler.go Adjusts reconcile gating to include Succeeded CRs and adds a warning log when reprogramming a missing NC.
cns/multitenantcontroller/multitenantoperator/multitenantcrdreconciler_test.go Adds targeted tests for the new Succeeded-state reconciliation behavior and error handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cns/multitenantcontroller/multitenantoperator/multitenantcrdreconciler.go Outdated
@rbtr rbtr requested a review from Copilot April 16, 2026 19:57
@rbtr
Copy link
Copy Markdown
Collaborator

rbtr commented Apr 16, 2026

/azp run Azure Container Networking PR

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@rbtr rbtr enabled auto-merge April 16, 2026 20:28
@rbtr
Copy link
Copy Markdown
Collaborator

rbtr commented Apr 16, 2026

/azp run Azure Container Networking PR

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copilot finished work on behalf of rbtr April 16, 2026 20:33
@rbtr
Copy link
Copy Markdown
Collaborator

rbtr commented Apr 17, 2026

/azp run Azure Container Networking PR

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr rbtr requested a review from thatmattlong April 17, 2026 15:37
@rbtr rbtr assigned rbtr and unassigned rbtr Apr 17, 2026
@hunter32292 hunter32292 requested review from a team as code owners April 20, 2026 20:03
Comment thread .pipelines/build/dockerfiles/azure-iptables-monitor.Dockerfile
@rbtr
Copy link
Copy Markdown
Collaborator

rbtr commented Apr 20, 2026

/azp run Azure Container Networking PR

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

@QxBytes QxBytes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

signing off for dockerfile changes

@rbtr rbtr added this pull request to the merge queue Apr 20, 2026
Merged via the queue into master with commit 754f60b Apr 20, 2026
33 checks passed
@rbtr rbtr deleted the jostupka/fix-stale-nc-reconciliation branch April 20, 2026 23:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants