
fix(controller): widen shutdown grace window for in-flight submissions #2934

Open
a7i wants to merge 1 commit into kubeflow:master from a7i:fix/controller-shutdown-grace-window

Conversation

@a7i commented May 8, 2026

Purpose of this PR

Shrink the window in which a controller restart can lose the status write that follows a successful spark-submit. This is the same failure mode that adoption addresses in #2932; this PR attacks it from the other direction, by giving in-flight reconciles enough time to finish and by making spark-submit cancellable.

When SIGTERM fires mid-submission, two timeouts decide whether the post-submit status write reaches the apiserver:

  • controller-runtime Manager.GracefulShutdownTimeout (default 30s): once it elapses, Manager.Start returns and the process exits, killing any goroutine still in Reconcile.
  • Pod terminationGracePeriodSeconds (kubelet default 30s): once it elapses, kubelet sends SIGKILL.

Both defaults are tight, and runSparkSubmit uses exec.Command rather than exec.CommandContext, so context cancellation does not propagate to the child.
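
For illustration, a minimal sketch of the cancellable shape (assuming spark-submit is on PATH and the argument list is already built; the real function resolves the binary under SPARK_HOME and differs in detail):

```go
package sparkapplication

import (
	"context"
	"fmt"
	"os/exec"
)

// Sketch only, not the exact code in this PR; it exists to show the
// exec.Command -> exec.CommandContext swap.
func runSparkSubmit(ctx context.Context, args []string) error {
	// exec.CommandContext kills the child process when ctx is cancelled, so a
	// reconcile torn down during shutdown no longer orphans the spark-submit JVM.
	cmd := exec.CommandContext(ctx, "spark-submit", args...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("failed to run spark-submit: %w: %s", err, out)
	}
	return nil
}
```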

Proposed changes:

  • Add --graceful-shutdown-timeout flag (default 60s) on the controller binary and pass it to ctrl.NewManager as GracefulShutdownTimeout (a wiring sketch follows this list).
  • Helm chart: add controller.gracefulShutdownTimeout (default 60s) and controller.terminationGracePeriodSeconds (default 90).
  • runSparkSubmit: use exec.CommandContext(ctx, ...) and thread the reconcile context through Submit.
  • Helm unit tests for the new chart values.
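
A rough sketch of the flag and manager wiring, using the standard flag package for brevity (an assumption for readability; the actual start.go registers the option on its own flag set and configures many other manager options):

```go
package controller

import (
	"flag"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

var gracefulShutdownTimeout time.Duration

func init() {
	flag.DurationVar(&gracefulShutdownTimeout, "graceful-shutdown-timeout", 60*time.Second,
		"How long in-flight reconciles may keep running after SIGTERM before Manager.Start returns.")
}

func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Leaving this nil falls back to controller-runtime's 30s default;
		// pointing it at the flag value widens the window for the
		// post-submit status write.
		GracefulShutdownTimeout: &gracefulShutdownTimeout,
	})
}
```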

How the defaults were chosen

  • A typical spark-submit invocation in kubernetes mode with submission-wait-app-completion=false takes ~5-10s end to end (JVM cold start, build pod manifest, one POST /pods through admission, return). Worst case on a busy cluster: ~15-20s. The post-submit status PATCH is <1s.
  • The controller defaults to --controller-threads=10, so up to 10 concurrent submissions can be in flight when SIGTERM fires. They serialize on the apiserver and on JVM startup cost. Realistic worst case to drain everything: ~30-60s.
  • --graceful-shutdown-timeout=60s covers that worst case with margin and is well below the practical ceiling.
  • terminationGracePeriodSeconds must be strictly greater than gracefulShutdownTimeout, with room left over for process teardown after the inner timeout fires (leader lock release, log flush, server shutdowns). About 30s of slack is comfortable; a value like 61s would be unsafe, because the worst case is precisely the one in which the inner timeout is fully consumed. 90s keeps that slack without slowing operator rolling updates excessively (the sketch after this list makes the budget explicit).
  • Both values are configurable; clusters with simpler workloads can dial them back to roughly 30s / 60s.
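
To spell out the budget rule, here is a hypothetical check (not shipped in this PR; the 20-second minimum slack is an assumption standing in for the "about 30s of slack" above):

```go
package main

import (
	"fmt"
	"time"
)

// validateShutdownBudget is hypothetical: the operator does not perform this
// check, it only documents the relationship between the two knobs.
func validateShutdownBudget(gracefulShutdownTimeout time.Duration, terminationGracePeriodSeconds int64) error {
	// Assumed minimum headroom for leader lock release, log flush, and
	// metrics/health server shutdown after the inner timeout fires.
	const teardownSlack = 20 * time.Second
	podBudget := time.Duration(terminationGracePeriodSeconds) * time.Second
	if podBudget < gracefulShutdownTimeout+teardownSlack {
		return fmt.Errorf("terminationGracePeriodSeconds (%s) leaves less than %s after graceful-shutdown-timeout (%s)",
			podBudget, teardownSlack, gracefulShutdownTimeout)
	}
	return nil
}

func main() {
	// Chart defaults: 60s inner timeout, 90s pod grace period -> 30s of slack.
	fmt.Println(validateShutdownBudget(60*time.Second, 90))
	// Dialed-back pair from the last bullet: 30s / 60s still passes.
	fmt.Println(validateShutdownBudget(30*time.Second, 60))
}
```

With the chart defaults (60s / 90) this leaves 30s of headroom; the dialed-back 30s / 60s pair still satisfies it.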

Change Category

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Rationale

Refs #2788. Complementary to #2932: adoption is the safety net for races no amount of grace can avoid (hard kills, panics, OOMKill); this PR makes those races rarer by widening the normal-shutdown window and propagating cancellation into spark-submit.

Checklist

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Additional Notes

Locally validated:

  • go test ./internal/controller/sparkapplication/... -timeout 120s
  • helm unittest charts/spark-operator-chart -f 'tests/controller/deployment_test.yaml'
  • helm template charts/spark-operator-chart shows --graceful-shutdown-timeout=60s and terminationGracePeriodSeconds: 90 rendered by default.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chenyi015 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


github-actions Bot commented May 8, 2026

🎉 Welcome to the Kubeflow Spark Operator! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

The commit message:

When the controller leader receives SIGTERM mid-submission, the in-flight
reconcile may not get to write status before either controller-runtime's
GracefulShutdownTimeout (30s default) returns from Manager.Start or the
kubelet's terminationGracePeriodSeconds (30s default) sends SIGKILL. Both
defaults are tight, and spark-submit ignores the reconcile context, so a
brand-new submission started seconds before SIGTERM can create a driver
pod whose status update is then lost.

Make the shutdown window wider and explicitly bounded:

- Add --graceful-shutdown-timeout flag (default 90s) and pass it as
  Manager.GracefulShutdownTimeout, so in-flight reconciles have time to
  finish writing status after SIGTERM.
- Set the controller pod's terminationGracePeriodSeconds to 120s in the
  Helm chart (configurable), so the kubelet does not SIGKILL the manager
  before the inner timeout elapses.
- Thread the reconcile context into runSparkSubmit and use
  exec.CommandContext so a cancelled reconcile actually terminates the
  spark-submit child instead of orphaning it.

Adoption (PR kubeflow#2932) still covers the residual race; these changes shrink
the window so adoption fires far less often in practice.

Refs kubeflow#2788

Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
@a7i force-pushed the fix/controller-shutdown-grace-window branch from ae91caf to 71f74fc on May 8, 2026 01:45
@a7i marked this pull request as ready for review on May 8, 2026 01:50
Copilot AI review requested due to automatic review settings on May 8, 2026 01:50

Copilot AI left a comment


Pull request overview

This PR reduces the likelihood of losing the post-spark-submit status update during controller shutdown/restarts by (1) increasing the controller-runtime graceful shutdown window and (2) propagating reconcile cancellation into the spark-submit subprocess.

Changes:

  • Add a --graceful-shutdown-timeout CLI flag (default 60s) and wire it to controller-runtime GracefulShutdownTimeout.
  • Make spark-submit execution cancellable by using exec.CommandContext(ctx, ...) and threading the reconcile context into submission.
  • Extend the Helm chart to configure controller.gracefulShutdownTimeout (default 60s) and controller.terminationGracePeriodSeconds (default 90), with unit tests and README updates.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:

  • internal/controller/sparkapplication/submission.go: Pass reconcile context into spark-submit via exec.CommandContext to support cancellation.
  • cmd/operator/controller/start.go: Add --graceful-shutdown-timeout flag and configure controller-runtime manager shutdown timeout.
  • charts/spark-operator-chart/values.yaml: Introduce default values for graceful shutdown timeout and pod termination grace period.
  • charts/spark-operator-chart/templates/controller/deployment.yaml: Render the new controller arg and set terminationGracePeriodSeconds on the controller Pod spec.
  • charts/spark-operator-chart/tests/controller/deployment_test.yaml: Add Helm unit tests for the new arg and pod termination grace period settings.
  • charts/spark-operator-chart/README.md: Document the new Helm values.


Comment on lines +64 to +65
// cancellation) terminates the child process instead of letting it run to completion
// and create a driver pod whose status update will never be persisted.
logger.Info("Running spark-submit", "arguments", args)
- if err := runSparkSubmit(args); err != nil {
+ if err := runSparkSubmit(ctx, args); err != nil {
return fmt.Errorf("failed to run spark-submit: %v", err)
a7i (Author) replied:


ignoring, this wasn't part of my changes

