
fix(shield): add startupProbe to cluster-shield deployment#2589

Closed
juan-rodriguez-gracia wants to merge 1 commit into sysdiglabs:main from juan-rodriguez-gracia:fix/shield-cluster-startupProbe

Conversation


@juan-rodriguez-gracia juan-rodriguez-gracia commented Apr 20, 2026

Description

Cluster-shield currently has only a liveness probe with a ~50s budget (initialDelaySeconds=5, periodSeconds=5, failureThreshold=9). On slow or noisy clusters the /healthz endpoint can take longer than 50s to become responsive during CVM warmup, causing the kubelet to kill the container with exit code 1067 and restart it.

This PR adds a startupProbe with a ~5 minute budget so the liveness probe is suspended during warmup and only takes over once the container has reported healthy at least once. Steady-state liveness/readiness behaviour is unchanged.
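Rendered, the probe pair would look roughly like this (a sketch, not the exact template output; the port name `monitoring` is an assumption, and the liveness values are the existing defaults described above):

```yaml
# Sketch of the rendered container probes. The startup probe must succeed
# once before the kubelet begins running the liveness probe.
startupProbe:
  httpGet:
    path: /healthz
    port: monitoring        # assumed port name
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 30      # 15s + 30 * 10s ≈ 5 min warmup budget
livenessProbe:
  httpGet:
    path: /healthz
    port: monitoring        # assumed port name
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 9       # unchanged steady-state budget (~50s)
```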

Motivation

Observed on a QA build (GKE, Windows node pool). Kubelet events from the failing pod:

Warning Unhealthy  Liveness probe failed: Get ".../healthz": context deadline exceeded
Warning Unhealthy  Readiness probe failed: Get ".../healthz": context deadline exceeded
Warning Unhealthy  Liveness probe failed: HTTP probe failed with statuscode: 500
Normal  Killing    Container cluster-shield failed liveness probe, will be restarted

Pod lastState.terminated.exitCode: 1067, reason Error. The pod recovers on restart — this is startup latency, not a steady-state bug.

What changed

  • charts/shield/values.yaml: new cluster.probes.startup section with a default budget of ~5 min (initialDelaySeconds=15, periodSeconds=10, failureThreshold=30, timeoutSeconds=3).
  • charts/shield/templates/cluster/deployment.yaml: render startupProbe before livenessProbe when cluster.probes.startup is set. The if guard lets users disable the startup probe by setting cluster.probes.startup: null.
  • charts/shield/tests/cluster/deployment_test.yaml: three new cases — default startup probe, custom overrides, and explicit disable.
  • charts/shield/Chart.yaml: bump 1.34.5 → 1.34.6.
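The values shape and template guard could look like the following (a sketch of the approach, not the PR's exact diff; the key names under cluster.probes.startup and the probe endpoint are assumptions):

```yaml
# charts/shield/values.yaml — hypothetical shape of the new section
cluster:
  probes:
    startup:
      initialDelaySeconds: 15
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 30
```

```yaml
# charts/shield/templates/cluster/deployment.yaml — the `if` guard skips
# the whole block when cluster.probes.startup is null (the opt-out path)
{{- if .Values.cluster.probes.startup }}
startupProbe:
  httpGet:
    path: /healthz
    port: monitoring          # assumed port name
  initialDelaySeconds: {{ .Values.cluster.probes.startup.initialDelaySeconds }}
  periodSeconds: {{ .Values.cluster.probes.startup.periodSeconds }}
  timeoutSeconds: {{ .Values.cluster.probes.startup.timeoutSeconds }}
  failureThreshold: {{ .Values.cluster.probes.startup.failureThreshold }}
{{- end }}
```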

Test plan

  • helm unittest charts/shield → 79/79 pass locally
  • helm lint charts/shield -f charts/shield/ci/test-values.yaml clean
  • helm template ... -f ci/test-values.yaml renders the expected startupProbe block
  • After merge: rerun qa/QA-shield/cluster/gke-windows-2019-x86_64 to confirm no cluster-shield restarts
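For reference, the "explicit disable" case could be written along these lines in helm-unittest (test name and JSON path are illustrative, not the PR's exact test file):

```yaml
# charts/shield/tests/cluster/deployment_test.yaml — sketch of the
# explicit-disable case
- it: does not render startupProbe when cluster.probes.startup is null
  set:
    cluster:
      probes:
        startup: null
  asserts:
    - notExists:
        path: spec.template.spec.containers[0].startupProbe
```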

Notes / Trade-offs

  • Chose a long startup budget (5 min) because the GKE Windows reproducer takes ~80s between container start and /healthz stability; a shorter budget would still risk flakes on noisy clusters.
  • Opt-out via cluster.probes.startup: null restores the current behaviour for users who prefer it.
  • Alternative fix would be to bump failureThreshold/initialDelaySeconds on the liveness probe itself, but that weakens steady-state liveness semantics. A dedicated startupProbe is the minimal-blast-radius option.

Related

  • QA-reported failure: test_cluster_shield_pods_no_restarts
  • Product investigation Jira: will link after ticket is filed

Cluster-shield currently has only a liveness probe with a ~50s budget
(initialDelaySeconds=5, periodSeconds=5, failureThreshold=9). On slow
or noisy clusters (observed on GKE with Windows node pools) the
/healthz endpoint can take longer than 50s to become responsive during
CVM warmup, causing the kubelet to kill the container with exit 1067
and restart it. The test that asserts zero restarts then fails.

Add a startupProbe with a ~5 minute budget (failureThreshold=30,
periodSeconds=10) so the liveness probe is suspended during warmup and
only takes over once the container has reported healthy at least once.
Steady-state liveness/readiness behaviour is unchanged.
@github-actions
Contributor

Hi @juan-rodriguez-gracia. Thanks for your PR.

After inspecting your changes someone with write access to this repo needs
to approve and run the workflow.

