
fix(shield): add startupProbe to cluster-shield deployment#2589

Closed
juan-rodriguez-gracia wants to merge 1 commit into sysdiglabs:main from juan-rodriguez-gracia:fix/shield-cluster-startupProbe

Conversation


@juan-rodriguez-gracia juan-rodriguez-gracia commented Apr 20, 2026

Description

Cluster-shield currently has only a liveness probe with a ~50s budget (initialDelaySeconds=5, periodSeconds=5, failureThreshold=9). On slow or noisy clusters the /healthz endpoint can take longer than 50s to become responsive during CVM warmup, causing the kubelet to kill the container with exit code 1067 and restart it.

This PR adds a startupProbe with a ~5 minute budget so the liveness probe is suspended during warmup and only takes over once the container has reported healthy at least once. Steady-state liveness/readiness behaviour is unchanged.
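Rendered, the probe pair would look roughly like this (a sketch, not the exact template output; the port name `monitoring` is an assumption, and the liveness values are the existing defaults described above):

```yaml
# Sketch of the rendered container probes. The startup probe must succeed
# once before the kubelet begins running the liveness probe.
startupProbe:
  httpGet:
    path: /healthz
    port: monitoring        # assumed port name
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 30      # 15s + 30 * 10s ≈ 5 min warmup budget
livenessProbe:
  httpGet:
    path: /healthz
    port: monitoring        # assumed port name
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 9       # unchanged steady-state budget (~50s)
```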

Motivation

Observed on a QA build (GKE, Windows node pool). Kubelet events from the failing pod:

Warning Unhealthy  Liveness probe failed: Get ".../healthz": context deadline exceeded
Warning Unhealthy  Readiness probe failed: Get ".../healthz": context deadline exceeded
Warning Unhealthy  Liveness probe failed: HTTP probe failed with statuscode: 500
Normal  Killing    Container cluster-shield failed liveness probe, will be restarted

Pod lastState.terminated.exitCode: 1067, reason Error. The pod recovers on restart — this is startup latency, not a steady-state bug.

What changed

  • charts/shield/values.yaml: new cluster.probes.startup section with a default budget of ~5 min (initialDelaySeconds=15, periodSeconds=10, failureThreshold=30, timeoutSeconds=3).
  • charts/shield/templates/cluster/deployment.yaml: render startupProbe before livenessProbe when cluster.probes.startup is set. The if guard lets users disable the startup probe by setting cluster.probes.startup: null.
  • charts/shield/tests/cluster/deployment_test.yaml: three new cases — default startup probe, custom overrides, and explicit disable.
  • charts/shield/Chart.yaml: bump 1.34.5 → 1.34.6.
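The values shape and template guard could look like the following (a sketch of the approach, not the PR's exact diff; the key names under cluster.probes.startup and the probe endpoint are assumptions):

```yaml
# charts/shield/values.yaml — hypothetical shape of the new section
cluster:
  probes:
    startup:
      initialDelaySeconds: 15
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 30
```

```yaml
# charts/shield/templates/cluster/deployment.yaml — the `if` guard skips
# the whole block when cluster.probes.startup is null (the opt-out path)
{{- if .Values.cluster.probes.startup }}
startupProbe:
  httpGet:
    path: /healthz
    port: monitoring          # assumed port name
  initialDelaySeconds: {{ .Values.cluster.probes.startup.initialDelaySeconds }}
  periodSeconds: {{ .Values.cluster.probes.startup.periodSeconds }}
  timeoutSeconds: {{ .Values.cluster.probes.startup.timeoutSeconds }}
  failureThreshold: {{ .Values.cluster.probes.startup.failureThreshold }}
{{- end }}
```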

Test plan

  • helm unittest charts/shield → 79/79 pass locally
  • helm lint charts/shield -f charts/shield/ci/test-values.yaml clean
  • helm template ... -f ci/test-values.yaml renders the expected startupProbe block
  • After merge: rerun qa/QA-shield/cluster/gke-windows-2019-x86_64 to confirm no cluster-shield restarts
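For reference, the "explicit disable" case could be written along these lines in helm-unittest (test name and JSON path are illustrative, not the PR's exact test file):

```yaml
# charts/shield/tests/cluster/deployment_test.yaml — sketch of the
# explicit-disable case
- it: does not render startupProbe when cluster.probes.startup is null
  set:
    cluster:
      probes:
        startup: null
  asserts:
    - notExists:
        path: spec.template.spec.containers[0].startupProbe
```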

Notes / Trade-offs

  • Chose a long startup budget (5 min) because the GKE Windows reproducer takes ~80s between container start and /healthz stability; a shorter budget would still risk flakes on noisy clusters.
  • Opt-out via cluster.probes.startup: null restores the current behaviour for users who prefer it.
  • Alternative fix would be to bump failureThreshold/initialDelaySeconds on the liveness probe itself, but that weakens steady-state liveness semantics. A dedicated startupProbe is the minimal-blast-radius option.

Related

  • QA-reported failure: test_cluster_shield_pods_no_restarts
  • Product investigation Jira: will link after ticket is filed

Cluster-shield currently has only a liveness probe with a ~50s budget
(initialDelaySeconds=5, periodSeconds=5, failureThreshold=9). On slow
or noisy clusters (observed on GKE with Windows node pools) the
/healthz endpoint can take longer than 50s to become responsive during
CVM warmup, causing the kubelet to kill the container with exit 1067
and restart it. The test that asserts zero restarts then fails.

Add a startupProbe with a ~5 minute budget (failureThreshold=30,
periodSeconds=10) so the liveness probe is suspended during warmup and
only takes over once the container has reported healthy at least once.
Steady-state liveness/readiness behaviour is unchanged.
@github-actions
Contributor

Hi @juan-rodriguez-gracia. Thanks for your PR.

After inspecting your changes someone with write access to this repo needs
to approve and run the workflow.

