fix: Reprogram NCs missing from CNS for Succeeded CRs after restart #4357
When CNS restarts or loses persisted state, NetworkContainers may be lost from the in-memory ContainerIDByOrchestratorContext map while the corresponding MultiTenantNetworkContainer CRs remain in Succeeded state.

Previously, the reconciler skipped all CRs not in Initialized state, meaning Succeeded CRs with missing NCs were never reprogrammed. This caused permanent CNI ADD failures (Code 18: UnknownContainerID) with no self-healing path.

Now, the reconciler allows Succeeded CRs through to the NC existence check. If the NC exists in CNS, reconciliation is skipped as before (no behavior change for the happy path). If the NC is missing, it is reprogrammed from the CR's status fields.

Transient CNS errors are not masked: only UnknownContainerID triggers reprogramming, matching the existing behavior for Initialized CRs.

Co-authored-by: Copilot <[email protected]>
Pull request overview
Updates the multi-tenant NetworkContainer CR reconciler so that Succeeded CRs are no longer skipped before verifying NC existence in CNS, enabling self-healing reprogramming after CNS restarts that lose in-memory NC state.
Changes:
- Allow Succeeded CRs to proceed to the CNS NC existence check (previously only Initialized CRs were reconciled).
- If CNS reports UnknownContainerID for a Succeeded CR, reprogram the NC from CR status; otherwise surface transient CNS errors.
- Add unit tests covering succeeded+missing NC reprogramming, succeeded+existing NC skip, and transient error propagation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| cns/multitenantcontroller/multitenantoperator/multitenantcrdreconciler.go | Adjusts reconcile gating to include Succeeded CRs and adds a warning log when reprogramming a missing NC. |
| cns/multitenantcontroller/multitenantoperator/multitenantcrdreconciler_test.go | Adds targeted tests for the new Succeeded-state reconciliation behavior and error handling. |
/azp run Azure Container Networking PR
Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
Agent-Logs-Url: https://github.com/Azure/azure-container-networking/sessions/808be914-b861-4e6d-bfbb-db52b24297d2 Co-authored-by: rbtr <[email protected]>
…locally to update the base images.
QxBytes left a comment:
Signing off for Dockerfile changes.
Reason for Change: Fix permanent CNI ADD failures after CNS restart when NCs are lost from memory but CRs remain in Succeeded state.