bug: AIC broken until restart after failover test #2734
Open
Labels: bug (Something isn't working)
Description
Current Behavior
We did a K8s failover test where some nodes of the cluster were shut down to simulate a partial outage and some (but not all) apisix routes did not function afterwards.
Further investigation turned up error messages in the ingress-controller logs; after restarting the pod, the routes started working again.
Expected Behavior
After the system has settled, the ingress-controller should be able to resume normal operation.
Error Logs
These are the logs filtered for "error", newest lines on top (the failover test started around 15:05 UTC):
# the next 3 lines repeated each minute until restart
2026-03-19T15:15:57.300Z DEBUG provider apisix/provider.go:306 handled ADC execution errors {"status_record": null, "status_update": {}}
2026-03-19T15:15:57.300Z INFO provider.client client/client.go:182 no GatewayProxy configs provided
2026-03-19T15:15:57.300Z INFO provider.client client/client.go:177 syncing all resources
# repeated in bursts until restart
2026-03-19T15:12:57.575Z ERROR controllers.GatewayProxy controller/utils.go:1278 failed to list resource {"error": "Index with name field:serviceRefs does not exist"}
# next two lines repeated until restart
2026-03-19T15:12:57.517Z ERROR controller-runtime controller/controller.go:347 Reconciler error {"controller": "gatewayproxy", "controllerGroup": "apisix.apache.org", "controllerKind": "GatewayProxy", "GatewayProxy": {"name":"apisix-config","namespace":"apisix"}, "namespace": "apisix", "name": "apisix-config", "reconcileID": "d3b4ea4f-c502-42b2-8b75-da07b9a7ab62", "error": "Index with name field:ingressClassParametersRef does not exist"}
2026-03-19T15:12:57.517Z ERROR controllers.GatewayProxy controller/gatewayproxy_controller.go:172 failed to list IngressClassList {"error": "Index with name field:ingressClassParametersRef does not exist"}
# repeated 10s of times per millisecond until 2026-03-19T15:12:57.518Z
2026-03-19T15:12:57.509Z ERROR controllers.GatewayProxy controller/utils.go:1278 failed to list resource {"error": "Index with name field:secretRefs does not exist"}
# repeated 10s of times per millisecond, note the different field name (service vs secret)
2026-03-19T15:12:57.410Z ERROR controllers.GatewayProxy controller/utils.go:1278 failed to list resource {"error": "Index with name field:serviceRefs does not exist"}
# this line is from the apisix pod instead of ingress-controller, repeated until 15:12:53
2026/03/19 15:12:33 [error] 51#51: *257113185 recv() failed (111: Connection refused), context: ngx.timer, client: 10.62.14.1, server: 0.0.0.0:9080
# repeated 7x until 15:12:20.830650
E0319 15:11:52.704232 1 leaderelection.go:436] error retrieving resource lock apisix/apisix-ingress-controller-leader: Get "https://10.63.0.1:443/apis/coordination.k8s.io/v1/namespaces/apisix/leases/apisix-ingress-controller-leader?timeout=10s": dial tcp 10.63.0.1:443: connect: connection refused
2026-03-19T15:10:41.225Z INFO setup manager/run.go:283 failed to get Kubernetes server version {"error": "Get \"https://10.63.0.1:443/version?timeout=32s\": dial tcp 10.63.0.1:443: connect: no route to host"}
# message repeated many times for different kinds
2026-03-19T15:09:42.798Z INFO controller-runtime.api-detection utils/k8s.go:65 group/version not available in cluster {"kind": "Ingress", "group": "networking.k8s.io", "version": "v1", "groupVersion": "networking.k8s.io/v1", "error": "Get \"https://10.63.0.1:443/apis/networking.k8s.io/v1?timeout=32s\": dial tcp 10.63.0.1:443: connect: no route to host"}
2026-03-19T15:09:39.725Z INFO setup manager/run.go:283 failed to get Kubernetes server version {"error": "Get \"https://10.63.0.1:443/version?timeout=32s\": dial tcp 10.63.0.1:443: connect: no route to host"}
2026-03-19T15:09:28.772Z INFO controller-runtime.api-detection utils/k8s.go:65 group/version not available in cluster {"kind": "ApisixRoute", "group": "apisix.apache.org", "version": "v2", "groupVersion": "apisix.apache.org/v2", "error": "Get \"https://10.63.0.1:443/apis/apisix.apache.org/v2?timeout=32s\": context deadline exceeded - error from a previous attempt: http2: client connection lost"}
2026-03-19T15:08:56.770Z INFO controller-runtime.api-detection utils/k8s.go:65 group/version not available in cluster {"kind": "Ingress", "group": "networking.k8s.io", "version": "v1", "groupVersion": "networking.k8s.io/v1", "error": "Get \"https://10.63.0.1:443/apis/networking.k8s.io/v1?timeout=32s\": context deadline exceeded"}
Error: leader election lost
Error: leader election lost
E0319 15:08:24.254648 1 leaderelection.go:436] error retrieving resource lock apisix/apisix-ingress-controller-leader: Get "https://10.63.0.1:443/apis/coordination.k8s.io/v1/namespaces/apisix/leases/apisix-ingress-controller-leader?timeout=10s": context deadline exceeded
2026-03-19T15:07:32.132Z DEBUG provider apisix/provider.go:306 handled ADC execution errors
2026-03-19T15:06:33.149Z DEBUG provider apisix/provider.go:306 handled ADC execution errors {"status_record": {}, "status_update": {}}
2026-03-19T15:06:32.131Z ERROR provider apisix/provider.go:282 failed to sync {"error": "failed to sync 1 configs: GatewayProxy/apisix/apisix-config"}
2026-03-19T15:06:32.131Z DEBUG provider apisix/provider.go:306 handled ADC execution errors {"status_record": {"GatewayProxy/apisix/apisix-config":{"Errors":[{"Name":"GatewayProxy/apisix/apisix-config","FailedErrors":[{"Err":"socket hang up","ServerAddr":"http://10.62.17.169:9180","FailedStatuses":[{"event":{"resourceType":"","type":"","resourceId":"","resourceName":""},"failed_at":"2026-03-19T15:06:32.13Z","synced_at":"0001-01-01T00:00:00Z","reason":"socket hang up","response":{"status":0,"headers":null}}]}]}]}}, "status_update": {"ApisixGlobalRule/apisix/opentelemetry":["ServerAddr: http://10.62.17.169:9180, Err: socket hang up"], ... <redacted many more>}}
2026-03-19T15:06:32.131Z ERROR provider.client client/client.go:210 failed to sync resources {"name": "GatewayProxy/apisix/apisix-config", "error": "ADC execution errors: [ADC execution error for GatewayProxy/apisix/apisix-config: [ServerAddr: http://10.62.17.169:9180, Err: socket hang up]]"}
2026-03-19T15:06:32.131Z ERROR provider.client client/client.go:269 failed to execute adc command {"config": {"name":"GatewayProxy/apisix/apisix-config","serverAddrs":["http://10.62.17.169:9180"],"tlsVerify":false}, "error": "ADC execution error for GatewayProxy/apisix/apisix-config: [ServerAddr: http://10.62.17.169:9180, Err: socket hang up]"}
2026-03-19T15:06:32.131Z ERROR provider.executor client/executor.go:142 failed to run http sync for server {"server": "http://10.62.17.169:9180", "error": "ServerAddr: http://10.62.17.169:9180, Err: socket hang up"}
2026-03-19T15:06:32.131Z ERROR provider.executor client/executor.go:328 ADC Server sync failed {"result": {"status":"all_failed","total_resources":1,"success_count":0,"failed_count":1,"success":[],"failed":[{"event":{"resourceType":"","type":"","resourceId":"","resourceName":""},"failed_at":"2026-03-19T15:06:32.13Z","synced_at":"0001-01-01T00:00:00Z","reason":"socket hang up","response":{"status":0,"headers":null}}]}, "error": "ADC Server sync failed: socket hang up"}
2026-03-19T15:05:32.132Z DEBUG provider apisix/provider.go:306 handled ADC execution errors {"status_record": {}, "status_update": {}}
Steps to Reproduce
We don't know yet what exactly triggered this. This is our best guess so far:
- install apisix helm chart on k8s
- shut down some cluster nodes (including some control plane nodes?)
- ???
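The guessed steps above can be sketched as a script. This is only a hypothetical reproduction outline: the node names, chart values, and the power-off step are placeholders, not details from our incident.

```shell
#!/bin/sh
# Hypothetical failover-drill sketch; all names below are placeholders.

main() {
  # Bail out gracefully when there is no reachable cluster.
  if ! kubectl cluster-info >/dev/null 2>&1; then
    echo "no reachable cluster, skipping"
    return 0
  fi

  # 1. Install the APISIX helm chart with the ingress controller enabled.
  helm repo add apisix https://charts.apiseven.com
  helm repo update
  helm upgrade --install apisix apisix/apisix \
    --namespace apisix --create-namespace \
    --set ingress-controller.enabled=true

  # 2. Simulate a partial outage: cordon and power off a subset of nodes,
  #    possibly including control-plane nodes (unclear which subset matters).
  for node in worker-1 control-plane-2; do
    kubectl cordon "$node"
    # Power-off is environment specific (cloud API, IPMI, ...); placeholder:
    echo "power off $node via your infrastructure tooling here"
  done

  # 3. Bring the nodes back after a few minutes, uncordon them,
  #    and test whether the apisix routes still respond.
}

main
```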
Environment
- APISIX Ingress controller version (run apisix-ingress-controller version --long): apache/apisix-ingress-controller:2.0.1 (docker image)
- Kubernetes cluster version (run kubectl version): v1.29.15