fix(controller): widen shutdown grace window for in-flight submissions #2934
a7i wants to merge 1 commit into kubeflow:master
Conversation
When the controller leader receives SIGTERM mid-submission, the in-flight reconcile may not get to write status before either controller-runtime's GracefulShutdownTimeout (30s default) returns from Manager.Start or the kubelet's terminationGracePeriodSeconds (30s default) sends SIGKILL. Both defaults are tight, and spark-submit ignores the reconcile context, so a brand-new submission started seconds before SIGTERM can create a driver pod whose status update is then lost.

Make the shutdown window wider and explicitly bounded:

- Add a --graceful-shutdown-timeout flag (default 60s) and pass it as Manager.GracefulShutdownTimeout, so in-flight reconciles have time to finish writing status after SIGTERM.
- Set the controller pod's terminationGracePeriodSeconds to 90 in the Helm chart (configurable), so the kubelet does not SIGKILL the manager before the inner timeout elapses.
- Thread the reconcile context into runSparkSubmit and use exec.CommandContext so a cancelled reconcile actually terminates the spark-submit child instead of orphaning it.

Adoption (PR kubeflow#2932) still covers the residual race; these changes shrink the window so adoption fires far less often in practice.

Refs kubeflow#2788

Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
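A minimal sketch of the flag wiring described above, assuming controller-runtime's `manager.Options` (the flag name matches the PR; the surrounding `main` and variable names are illustrative, not the operator's actual `cmd/operator/controller/start.go`):

```go
package main

import (
	"flag"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// New flag from this PR: how long in-flight reconciles get to finish
	// their post-submit status writes after SIGTERM.
	gracefulShutdownTimeout := flag.Duration(
		"graceful-shutdown-timeout", 60*time.Second,
		"grace period for in-flight reconciles after SIGTERM")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// After SIGTERM, the manager waits up to this long for running
		// Reconcile calls to return before Manager.Start gives up.
		GracefulShutdownTimeout: gracefulShutdownTimeout,
	})
	if err != nil {
		panic(err)
	}
	// ... register the SparkApplication controller with mgr here ...
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```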
Force-pushed from ae91caf to 71f74fc
Pull request overview
This PR reduces the likelihood of losing the post-spark-submit status update during controller shutdown/restarts by (1) increasing the controller-runtime graceful shutdown window and (2) propagating reconcile cancellation into the spark-submit subprocess.
Changes:
- Add a `--graceful-shutdown-timeout` CLI flag (default `60s`) and wire it to controller-runtime `GracefulShutdownTimeout`.
- Make `spark-submit` execution cancellable by using `exec.CommandContext(ctx, ...)` and threading the reconcile context into submission.
- Extend the Helm chart to configure `controller.gracefulShutdownTimeout` (default `60s`) and `controller.terminationGracePeriodSeconds` (default `90`), with unit tests and README updates.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Summary per file:
| File | Description |
|---|---|
| internal/controller/sparkapplication/submission.go | Pass reconcile context into spark-submit via exec.CommandContext to support cancellation. |
| cmd/operator/controller/start.go | Add --graceful-shutdown-timeout flag and configure controller-runtime manager shutdown timeout. |
| charts/spark-operator-chart/values.yaml | Introduce default values for graceful shutdown timeout and pod termination grace period. |
| charts/spark-operator-chart/templates/controller/deployment.yaml | Render the new controller arg and set terminationGracePeriodSeconds on the controller Pod spec. |
| charts/spark-operator-chart/tests/controller/deployment_test.yaml | Add Helm unit tests for the new arg and pod termination grace period settings. |
| charts/spark-operator-chart/README.md | Document the new Helm values. |
```diff
+// cancellation) terminates the child process instead of letting it run to completion
+// and create a driver pod whose status update will never be persisted.
 logger.Info("Running spark-submit", "arguments", args)
-if err := runSparkSubmit(args); err != nil {
+if err := runSparkSubmit(ctx, args); err != nil {
 	return fmt.Errorf("failed to run spark-submit: %v", err)
```
Ignoring; this wasn't part of my changes.
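For context on the call-site change in the diff above, a sketch of what the cancellable callee might look like (assumed shape only; the real function lives in `internal/controller/sparkapplication/submission.go`, and the `SPARK_HOME` lookup and error handling here are illustrative):

```go
package sparkapplication

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// runSparkSubmit launches spark-submit bound to the reconcile context:
// when ctx is cancelled (manager shutdown, reconcile timeout), exec.CommandContext
// kills the child instead of letting it finish and orphan a driver pod.
func runSparkSubmit(ctx context.Context, args []string) error {
	sparkHome, ok := os.LookupEnv("SPARK_HOME")
	if !ok {
		return fmt.Errorf("SPARK_HOME is not set")
	}
	cmd := exec.CommandContext(ctx, filepath.Join(sparkHome, "bin", "spark-submit"), args...)
	output, err := cmd.CombinedOutput()
	if err != nil {
		// Distinguish cancellation from a genuine submission failure.
		if ctx.Err() != nil {
			return fmt.Errorf("spark-submit cancelled: %w", ctx.Err())
		}
		return fmt.Errorf("spark-submit failed: %v: %s", err, output)
	}
	return nil
}
```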
Purpose of this PR
Shrink the window in which a controller restart can lose the status write that follows a successful `spark-submit`. This is the same failure mode addressed by adoption in #2932; this PR addresses it from the other direction, by giving in-flight reconciles enough time to finish and by making `spark-submit` cancellable.

When SIGTERM fires mid-submission, two timeouts decide whether the post-submit status write reaches the apiserver:

- `Manager.GracefulShutdownTimeout` (default 30s): once it elapses, `Manager.Start` returns and the process exits, killing any goroutine still in `Reconcile`.
- `terminationGracePeriodSeconds` (kubelet default 30s): once it elapses, the kubelet sends SIGKILL.

Both defaults are tight, and `runSparkSubmit` uses `exec.Command` rather than `exec.CommandContext`, so context cancellation does not propagate to the child.
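A minimal, self-contained demonstration of that difference (illustrative only; `sleep` stands in for `spark-submit`):

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// exec.Command ignores ctx entirely: this child would run its full 10s
	// even after the context is cancelled -- the orphaning bug.
	detached := exec.Command("sleep", "10")
	if err := detached.Start(); err != nil {
		panic(err)
	}

	// exec.CommandContext kills the child when ctx is done, so a cancelled
	// reconcile cannot leave a half-finished submission running.
	bound := exec.CommandContext(ctx, "sleep", "10")
	err := bound.Run() // returns after ~1s with "signal: killed"
	fmt.Println("bound child exited with:", err)

	_ = detached.Process.Kill() // clean up the demo's detached child
}
```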
Proposed changes:

- Add a `--graceful-shutdown-timeout` flag (default `60s`) on the controller binary and pass it to `ctrl.NewManager` as `GracefulShutdownTimeout`.
- Expose `controller.gracefulShutdownTimeout` (default `60s`) and `controller.terminationGracePeriodSeconds` (default `90`) in the Helm chart.
- Make `runSparkSubmit` cancellable: use `exec.CommandContext(ctx, ...)` and thread the reconcile context through `Submit`.
How the defaults were chosen

- A `spark-submit` invocation in `kubernetes` mode with `submission-wait-app-completion=false` takes ~5-10s end to end (JVM cold start, build the pod manifest, one `POST /pods` through admission, return). Worst case on a busy cluster: ~15-20s. The post-submit status `PATCH` is <1s.
- The controller runs with `--controller-threads=10` by default, so up to 10 concurrent submissions can be in flight when SIGTERM fires. They serialize on the apiserver and on JVM startup cost. Realistic worst case to drain everything: ~30-60s.
- `--graceful-shutdown-timeout=60s` covers that worst case with margin and is well below the practical ceiling.
- `terminationGracePeriodSeconds` must be strictly greater than `gracefulShutdownTimeout` to leave room for process teardown after the inner timeout fires (leader lock release, log flush, server shutdowns). About 30s of slack is comfortable; `60s + 1s` would be unsafe, since the worst-case branch is exactly when the inner timeout is fully consumed. `90s` keeps that slack without slowing operator rolling updates excessively (sketched below).
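That strict ordering could even be checked at startup; a hypothetical sketch, not part of this PR (the pod's termination grace period would have to be mirrored into the binary via a flag or the Downward API):

```go
package main

import (
	"fmt"
	"time"
)

// validateShutdownBudget fails fast when the pod's termination grace period
// leaves no teardown slack after the inner graceful-shutdown timeout.
// Hypothetical helper, not part of this PR.
func validateShutdownBudget(gracefulShutdown, terminationGrace time.Duration) error {
	const minSlack = 10 * time.Second // room for leader lock release, log flush, server shutdown
	if terminationGrace-gracefulShutdown < minSlack {
		return fmt.Errorf(
			"terminationGracePeriodSeconds (%s) must exceed --graceful-shutdown-timeout (%s) by at least %s",
			terminationGrace, gracefulShutdown, minSlack)
	}
	return nil
}

func main() {
	// The PR defaults: 60s inner timeout, 90s outer deadline, 30s slack.
	if err := validateShutdownBudget(60*time.Second, 90*time.Second); err != nil {
		panic(err)
	}
	fmt.Println("shutdown budget OK")
}
```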
Change Category

Rationale
Refs #2788. Complementary to #2932: adoption is the safety net for races no amount of grace can avoid (hard kills, panics, OOMKill); this PR makes those races rarer by widening the normal-shutdown window and propagating cancellation into spark-submit.
Checklist
Additional Notes
Locally validated:
- `go test ./internal/controller/sparkapplication/... -timeout 120s`
- `helm unittest charts/spark-operator-chart -f 'tests/controller/deployment_test.yaml'`
- `helm template charts/spark-operator-chart` shows `--graceful-shutdown-timeout=60s` and `terminationGracePeriodSeconds: 90` rendered by default.