
bug: AIC broken until restart after failover test #2734

@martin-schulze-e2m

Description


Current Behavior

We ran a K8s failover test in which some nodes of the cluster were shut down to simulate a partial outage; afterwards, some (but not all) APISIX routes no longer worked.
Further investigation turned up error messages in the ingress-controller logs, and after restarting the pod the routes started working again.
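For reference, the workaround was a plain pod restart. A hypothetical sketch of what we ran (the deployment name is an assumption based on the default Helm chart, not our exact setup):

```shell
# Restart the ingress-controller pod(s) and wait for rollout to complete
# (deployment name assumed; adjust to your release)
kubectl -n apisix rollout restart deployment apisix-ingress-controller
kubectl -n apisix rollout status deployment apisix-ingress-controller
```

After the restart, the previously broken routes served traffic again without further intervention.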

Expected Behavior

After the system has settled, the ingress-controller should resume normal operation without a manual restart.

Error Logs

These are the logs filtered for "error", newest lines on top (the failover test started around 15:05 UTC):

# the next 3 lines repeated each minute until restart
2026-03-19T15:15:57.300Z	DEBUG	provider	apisix/provider.go:306	handled ADC execution errors	{"status_record": null, "status_update": {}}
2026-03-19T15:15:57.300Z	INFO	provider.client	client/client.go:182	no GatewayProxy configs provided
2026-03-19T15:15:57.300Z	INFO	provider.client	client/client.go:177	syncing all resources
# repeated in bursts until restart
2026-03-19T15:12:57.575Z	ERROR	controllers.GatewayProxy	controller/utils.go:1278	failed to list resource	{"error": "Index with name field:serviceRefs does not exist"}
# next two lines repeated until restart
2026-03-19T15:12:57.517Z	ERROR	controller-runtime	controller/controller.go:347	Reconciler error	{"controller": "gatewayproxy", "controllerGroup": "apisix.apache.org", "controllerKind": "GatewayProxy", "GatewayProxy": {"name":"apisix-config","namespace":"apisix"}, "namespace": "apisix", "name": "apisix-config", "reconcileID": "d3b4ea4f-c502-42b2-8b75-da07b9a7ab62", "error": "Index with name field:ingressClassParametersRef does not exist"}
2026-03-19T15:12:57.517Z	ERROR	controllers.GatewayProxy	controller/gatewayproxy_controller.go:172	failed to list IngressClassList	{"error": "Index with name field:ingressClassParametersRef does not exist"}
# repeated tens of times per millisecond until 2026-03-19T15:12:57.518Z
2026-03-19T15:12:57.509Z	ERROR	controllers.GatewayProxy	controller/utils.go:1278	failed to list resource	{"error": "Index with name field:secretRefs does not exist"}
# repeated tens of times per millisecond; note the different field name (service vs secret)
2026-03-19T15:12:57.410Z	ERROR	controllers.GatewayProxy	controller/utils.go:1278	failed to list resource	{"error": "Index with name field:serviceRefs does not exist"}
# this line is from the apisix pod instead of the ingress-controller, repeated until 15:12:53
2026/03/19 15:12:33 [error] 51#51: *257113185 recv() failed (111: Connection refused), context: ngx.timer, client: 10.62.14.1, server: 0.0.0.0:9080
# repeated 7x until 15:12:20.830650
E0319 15:11:52.704232       1 leaderelection.go:436] error retrieving resource lock apisix/apisix-ingress-controller-leader: Get "https://10.63.0.1:443/apis/coordination.k8s.io/v1/namespaces/apisix/leases/apisix-ingress-controller-leader?timeout=10s": dial tcp 10.63.0.1:443: connect: connection refused
2026-03-19T15:10:41.225Z	INFO	setup	manager/run.go:283	failed to get Kubernetes server version	{"error": "Get \"https://10.63.0.1:443/version?timeout=32s\": dial tcp 10.63.0.1:443: connect: no route to host"}
# message repeated many times for different kinds
2026-03-19T15:09:42.798Z	INFO	controller-runtime.api-detection	utils/k8s.go:65	group/version not available in cluster	{"kind": "Ingress", "group": "networking.k8s.io", "version": "v1", "groupVersion": "networking.k8s.io/v1", "error": "Get \"https://10.63.0.1:443/apis/networking.k8s.io/v1?timeout=32s\": dial tcp 10.63.0.1:443: connect: no route to host"}
2026-03-19T15:09:39.725Z	INFO	setup	manager/run.go:283	failed to get Kubernetes server version	{"error": "Get \"https://10.63.0.1:443/version?timeout=32s\": dial tcp 10.63.0.1:443: connect: no route to host"}
2026-03-19T15:09:28.772Z	INFO	controller-runtime.api-detection	utils/k8s.go:65	group/version not available in cluster	{"kind": "ApisixRoute", "group": "apisix.apache.org", "version": "v2", "groupVersion": "apisix.apache.org/v2", "error": "Get \"https://10.63.0.1:443/apis/apisix.apache.org/v2?timeout=32s\": context deadline exceeded - error from a previous attempt: http2: client connection lost"}
2026-03-19T15:08:56.770Z	INFO	controller-runtime.api-detection	utils/k8s.go:65	group/version not available in cluster	{"kind": "Ingress", "group": "networking.k8s.io", "version": "v1", "groupVersion": "networking.k8s.io/v1", "error": "Get \"https://10.63.0.1:443/apis/networking.k8s.io/v1?timeout=32s\": context deadline exceeded"}
Error: leader election lost
Error: leader election lost
E0319 15:08:24.254648       1 leaderelection.go:436] error retrieving resource lock apisix/apisix-ingress-controller-leader: Get "https://10.63.0.1:443/apis/coordination.k8s.io/v1/namespaces/apisix/leases/apisix-ingress-controller-leader?timeout=10s": context deadline exceeded
2026-03-19T15:07:32.132Z	DEBUG	provider	apisix/provider.go:306	handled ADC execution errors
2026-03-19T15:06:33.149Z	DEBUG	provider	apisix/provider.go:306	handled ADC execution errors	{"status_record": {}, "status_update": {}}
2026-03-19T15:06:32.131Z	ERROR	provider	apisix/provider.go:282	failed to sync	{"error": "failed to sync 1 configs: GatewayProxy/apisix/apisix-config"}
2026-03-19T15:06:32.131Z	DEBUG	provider	apisix/provider.go:306	handled ADC execution errors	{"status_record": {"GatewayProxy/apisix/apisix-config":{"Errors":[{"Name":"GatewayProxy/apisix/apisix-config","FailedErrors":[{"Err":"socket hang up","ServerAddr":"http://10.62.17.169:9180","FailedStatuses":[{"event":{"resourceType":"","type":"","resourceId":"","resourceName":""},"failed_at":"2026-03-19T15:06:32.13Z","synced_at":"0001-01-01T00:00:00Z","reason":"socket hang up","response":{"status":0,"headers":null}}]}]}]}}, "status_update": {"ApisixGlobalRule/apisix/opentelemetry":["ServerAddr: http://10.62.17.169:9180, Err: socket hang up"], ... <redacted many more>}}
2026-03-19T15:06:32.131Z	ERROR	provider.client	client/client.go:210	failed to sync resources	{"name": "GatewayProxy/apisix/apisix-config", "error": "ADC execution errors: [ADC execution error for GatewayProxy/apisix/apisix-config: [ServerAddr: http://10.62.17.169:9180, Err: socket hang up]]"}
2026-03-19T15:06:32.131Z	ERROR	provider.client	client/client.go:269	failed to execute adc command	{"config": {"name":"GatewayProxy/apisix/apisix-config","serverAddrs":["http://10.62.17.169:9180"],"tlsVerify":false}, "error": "ADC execution error for GatewayProxy/apisix/apisix-config: [ServerAddr: http://10.62.17.169:9180, Err: socket hang up]"}
2026-03-19T15:06:32.131Z	ERROR	provider.executor	client/executor.go:142	failed to run http sync for server	{"server": "http://10.62.17.169:9180", "error": "ServerAddr: http://10.62.17.169:9180, Err: socket hang up"}
2026-03-19T15:06:32.131Z	ERROR	provider.executor	client/executor.go:328	ADC Server sync failed	{"result": {"status":"all_failed","total_resources":1,"success_count":0,"failed_count":1,"success":[],"failed":[{"event":{"resourceType":"","type":"","resourceId":"","resourceName":""},"failed_at":"2026-03-19T15:06:32.13Z","synced_at":"0001-01-01T00:00:00Z","reason":"socket hang up","response":{"status":0,"headers":null}}]}, "error": "ADC Server sync failed: socket hang up"}
2026-03-19T15:05:32.132Z	DEBUG	provider	apisix/provider.go:306	handled ADC execution errors	{"status_record": {}, "status_update": {}}

Steps to Reproduce

We don't yet know exactly what triggered this. This is our best guess so far:

  1. Install the APISIX Helm chart on K8s.
  2. Shut down some cluster nodes (including some control-plane nodes?).
  3. ???
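The steps above might be sketched as follows (repo URL, release name, and outage method are assumptions, not our exact setup):

```shell
# 1. Install the APISIX Helm chart with the ingress controller enabled
#    (chart values assumed; see the chart's own documentation)
helm repo add apisix https://charts.apiseven.com
helm repo update
helm install apisix apisix/apisix -n apisix --create-namespace \
  --set ingress-controller.enabled=true

# 2. Simulate a partial outage: stop a subset of nodes, possibly including
#    control-plane nodes, e.g. via the cloud provider console or:
#    ssh <node> sudo systemctl stop kubelet
```

Step 3 remains unknown; the logs suggest the controller lost its API-server connection and leader lease during the outage and never rebuilt its cache indexes afterwards, but we have not confirmed that.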

Environment

  • APISIX Ingress controller version (run apisix-ingress-controller version --long): apache/apisix-ingress-controller:2.0.1 (docker image)
  • Kubernetes cluster version (run kubectl version): v1.29.15
