
fix(controller): widen shutdown grace window for in-flight submissions #2934

Open
a7i wants to merge 1 commit into kubeflow:master from a7i:fix/controller-shutdown-grace-window

Conversation

@a7i commented May 8, 2026

Purpose of this PR

Shrink the window in which a controller restart can lose the status write that follows a successful spark-submit. This is the same failure mode that adoption addresses in #2932; this PR attacks it from the other direction, by giving in-flight reconciles enough time to finish and by making spark-submit cancellable.

When SIGTERM fires mid-submission, two timeouts decide whether the post-submit status write reaches the apiserver:

  • controller-runtime Manager.GracefulShutdownTimeout (default 30s): once it elapses, Manager.Start returns and the process exits, killing any goroutine still in Reconcile.
  • Pod terminationGracePeriodSeconds (kubelet default 30s): once it elapses, kubelet sends SIGKILL.

Both defaults are tight, and runSparkSubmit uses exec.Command rather than exec.CommandContext, so context cancellation does not propagate to the child.
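
For illustration, a minimal sketch of the cancellable shape (assuming spark-submit is on PATH and the argument list is already built; the real function resolves the binary under SPARK_HOME and differs in detail):

```go
package sparkapplication

import (
	"context"
	"fmt"
	"os/exec"
)

// Sketch only, not the exact code in this PR; it exists to show the
// exec.Command -> exec.CommandContext swap.
func runSparkSubmit(ctx context.Context, args []string) error {
	// exec.CommandContext kills the child process when ctx is cancelled, so a
	// reconcile torn down during shutdown no longer orphans the spark-submit JVM.
	cmd := exec.CommandContext(ctx, "spark-submit", args...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("failed to run spark-submit: %w: %s", err, out)
	}
	return nil
}
```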

Proposed changes:

  • Add --graceful-shutdown-timeout flag (default 60s) on the controller binary and pass it to ctrl.NewManager as GracefulShutdownTimeout (a wiring sketch follows this list).
  • Helm chart: add controller.gracefulShutdownTimeout (default 60s) and controller.terminationGracePeriodSeconds (default 90).
  • runSparkSubmit: use exec.CommandContext(ctx, ...) and thread the reconcile context through Submit.
  • Helm unit tests for the new chart values.
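
A rough sketch of the flag and manager wiring, using the standard flag package for brevity (an assumption for readability; the actual start.go registers the option on its own flag set and configures many other manager options):

```go
package controller

import (
	"flag"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

var gracefulShutdownTimeout time.Duration

func init() {
	flag.DurationVar(&gracefulShutdownTimeout, "graceful-shutdown-timeout", 60*time.Second,
		"How long in-flight reconciles may keep running after SIGTERM before Manager.Start returns.")
}

func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Leaving this nil falls back to controller-runtime's 30s default;
		// pointing it at the flag value widens the window for the
		// post-submit status write.
		GracefulShutdownTimeout: &gracefulShutdownTimeout,
	})
}
```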

How the defaults were chosen

  • A typical spark-submit invocation in kubernetes mode with submission-wait-app-completion=false takes ~5-10s end to end (JVM cold start, build pod manifest, one POST /pods through admission, return). Worst case on a busy cluster: ~15-20s. The post-submit status PATCH is <1s.
  • The controller defaults to --controller-threads=10, so up to 10 concurrent submissions can be in flight when SIGTERM fires. They serialize on the apiserver and on JVM startup cost. Realistic worst case to drain everything: ~30-60s.
  • --graceful-shutdown-timeout=60s covers that worst case with margin and is well below the practical ceiling.
  • terminationGracePeriodSeconds must be strictly greater than gracefulShutdownTimeout, with room left over for process teardown after the inner timeout fires (leader lock release, log flush, server shutdowns). About 30s of slack is comfortable; a value like 61s would be unsafe, because the worst case is precisely the one in which the inner timeout is fully consumed. 90s keeps that slack without slowing operator rolling updates excessively (the sketch after this list makes the budget explicit).
  • Both values are configurable; clusters with simpler workloads can dial them back to roughly 30s / 60s.
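
To spell out the budget rule, here is a hypothetical check (not shipped in this PR; the 20-second minimum slack is an assumption standing in for the "about 30s of slack" above):

```go
package main

import (
	"fmt"
	"time"
)

// validateShutdownBudget is hypothetical: the operator does not perform this
// check, it only documents the relationship between the two knobs.
func validateShutdownBudget(gracefulShutdownTimeout time.Duration, terminationGracePeriodSeconds int64) error {
	// Assumed minimum headroom for leader lock release, log flush, and
	// metrics/health server shutdown after the inner timeout fires.
	const teardownSlack = 20 * time.Second
	podBudget := time.Duration(terminationGracePeriodSeconds) * time.Second
	if podBudget < gracefulShutdownTimeout+teardownSlack {
		return fmt.Errorf("terminationGracePeriodSeconds (%s) leaves less than %s after graceful-shutdown-timeout (%s)",
			podBudget, teardownSlack, gracefulShutdownTimeout)
	}
	return nil
}

func main() {
	// Chart defaults: 60s inner timeout, 90s pod grace period -> 30s of slack.
	fmt.Println(validateShutdownBudget(60*time.Second, 90))
	// Dialed-back pair from the last bullet: 30s / 60s still passes.
	fmt.Println(validateShutdownBudget(30*time.Second, 60))
}
```

With the chart defaults (60s / 90) this leaves 30s of headroom; the dialed-back 30s / 60s pair still satisfies it.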

Change Category

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Rationale

Refs #2788. Complementary to #2932: adoption is the safety net for races no amount of grace can avoid (hard kills, panics, OOMKill); this PR makes those races rarer by widening the normal-shutdown window and propagating cancellation into spark-submit.

Checklist

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Additional Notes

Locally validated:

  • go test ./internal/controller/sparkapplication/... -timeout 120s
  • helm unittest charts/spark-operator-chart -f 'tests/controller/deployment_test.yaml'
  • helm template charts/spark-operator-chart shows --graceful-shutdown-timeout=60s and terminationGracePeriodSeconds: 90 rendered by default.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chenyi015 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


github-actions Bot commented May 8, 2026

🎉 Welcome to the Kubeflow Spark Operator! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

The commit message:

When the controller leader receives SIGTERM mid-submission, the in-flight
reconcile may not get to write status before either controller-runtime's
GracefulShutdownTimeout (30s default) returns from Manager.Start or the
kubelet's terminationGracePeriodSeconds (30s default) sends SIGKILL. Both
defaults are tight, and spark-submit ignores the reconcile context, so a
brand-new submission started seconds before SIGTERM can create a driver
pod whose status update is then lost.

Make the shutdown window wider and explicitly bounded:

- Add --graceful-shutdown-timeout flag (default 90s) and pass it as
  Manager.GracefulShutdownTimeout, so in-flight reconciles have time to
  finish writing status after SIGTERM.
- Set the controller pod's terminationGracePeriodSeconds to 120s in the
  Helm chart (configurable), so the kubelet does not SIGKILL the manager
  before the inner timeout elapses.
- Thread the reconcile context into runSparkSubmit and use
  exec.CommandContext so a cancelled reconcile actually terminates the
  spark-submit child instead of orphaning it.

Adoption (PR kubeflow#2932) still covers the residual race; these changes shrink
the window so adoption fires far less often in practice.

Refs kubeflow#2788

Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
@a7i force-pushed the fix/controller-shutdown-grace-window branch from ae91caf to 71f74fc on May 8, 2026 01:45
@a7i marked this pull request as ready for review on May 8, 2026 01:50
Copilot AI review requested due to automatic review settings on May 8, 2026 01:50

Copilot AI left a comment


Pull request overview

This PR reduces the likelihood of losing the post-spark-submit status update during controller shutdown/restarts by (1) increasing the controller-runtime graceful shutdown window and (2) propagating reconcile cancellation into the spark-submit subprocess.

Changes:

  • Add a --graceful-shutdown-timeout CLI flag (default 60s) and wire it to controller-runtime GracefulShutdownTimeout.
  • Make spark-submit execution cancellable by using exec.CommandContext(ctx, ...) and threading the reconcile context into submission.
  • Extend the Helm chart to configure controller.gracefulShutdownTimeout (default 60s) and controller.terminationGracePeriodSeconds (default 90), with unit tests and README updates.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:

  • internal/controller/sparkapplication/submission.go: Pass reconcile context into spark-submit via exec.CommandContext to support cancellation.
  • cmd/operator/controller/start.go: Add --graceful-shutdown-timeout flag and configure controller-runtime manager shutdown timeout.
  • charts/spark-operator-chart/values.yaml: Introduce default values for graceful shutdown timeout and pod termination grace period.
  • charts/spark-operator-chart/templates/controller/deployment.yaml: Render the new controller arg and set terminationGracePeriodSeconds on the controller Pod spec.
  • charts/spark-operator-chart/tests/controller/deployment_test.yaml: Add Helm unit tests for the new arg and pod termination grace period settings.
  • charts/spark-operator-chart/README.md: Document the new Helm values.


Comment on lines +64 to +65
// cancellation) terminates the child process instead of letting it run to completion
// and create a driver pod whose status update will never be persisted.
logger.Info("Running spark-submit", "arguments", args)
- if err := runSparkSubmit(args); err != nil {
+ if err := runSparkSubmit(ctx, args); err != nil {
return fmt.Errorf("failed to run spark-submit: %v", err)
a7i (Author) replied:


ignoring, this wasn't part of my changes

