fix(controller): adopt existing driver pod when status update is lost after submit#2932
Draft
a7i wants to merge 1 commit into
Conversation
When the controller restarts after spark-submit created the driver pod but before status was persisted, reconcile must not call spark-submit again. Adopt the pod when it is owned by this SparkApplication, restoring submission id from pod labels and aligning attempts with a successful submit path. Refs kubeflow#2788 Signed-off-by: Amir Alavi <[email protected]>
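The adoption condition described above (the driver pod already exists and is owned by this SparkApplication) can be sketched as follows. This is a minimal, self-contained illustration: `ownerRef`, `pod`, and `isOwnedBy` are local stand-ins for the Kubernetes `metav1.OwnerReference` and `corev1.Pod` types the real controller uses, and the label key is shown only as an example.

```go
// Minimal sketch of the adoption check: adopt only when the pod carries an
// owner reference whose UID matches this SparkApplication. Local stub types
// stand in for the Kubernetes API types.
package main

import "fmt"

// ownerRef mirrors the fields of metav1.OwnerReference that matter here.
type ownerRef struct {
	Kind string
	UID  string
}

// pod stubs the driver pod: owner references plus operator labels.
type pod struct {
	Owners []ownerRef
	Labels map[string]string
}

// isOwnedBy reports whether the pod has an owner reference pointing at this
// SparkApplication's UID. Matching by UID (not just name) avoids adopting a
// recycled same-named pod left over from a deleted application.
func isOwnedBy(p pod, appUID string) bool {
	for _, o := range p.Owners {
		if o.Kind == "SparkApplication" && o.UID == appUID {
			return true
		}
	}
	return false
}

func main() {
	driver := pod{
		Owners: []ownerRef{{Kind: "SparkApplication", UID: "uid-123"}},
		Labels: map[string]string{"sparkoperator.k8s.io/submission-id": "sub-1"},
	}
	fmt.Println(isOwnedBy(driver, "uid-123")) // same application: adopt
	fmt.Println(isOwnedBy(driver, "uid-999")) // different UID: do not adopt
}
```

The UID comparison is the load-bearing part: name equality alone would not distinguish a freshly recycled pod from the one this application actually submitted.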
a7i force-pushed from 62bff2e to 76f61ce
a7i added a commit to a7i/spark-operator that referenced this pull request on May 8, 2026:
When the controller leader receives SIGTERM mid-submission, the in-flight reconcile may not get to write status before either controller-runtime's GracefulShutdownTimeout (30s default) returns from Manager.Start or the kubelet's terminationGracePeriodSeconds (30s default) sends SIGKILL. Both defaults are tight, and spark-submit ignores the reconcile context, so a brand-new submission started seconds before SIGTERM can create a driver pod whose status update is then lost.

Make the shutdown window wider and explicitly bounded:

- Add a --graceful-shutdown-timeout flag (default 90s) and pass it as Manager.GracefulShutdownTimeout, so in-flight reconciles have time to finish writing status after SIGTERM.
- Set the controller pod's terminationGracePeriodSeconds to 120s in the Helm chart (configurable), so the kubelet does not SIGKILL the manager before the inner timeout elapses.
- Thread the reconcile context into runSparkSubmit and use exec.CommandContext so a cancelled reconcile actually terminates the spark-submit child instead of orphaning it.

Adoption (PR kubeflow#2932) still covers the residual race; these changes shrink the window so adoption fires far less often in practice.

Refs kubeflow#2788

Signed-off-by: Amir Alavi <[email protected]>
Purpose of this PR
Make `SparkApplication` submission idempotent when the controller loses the post-submit status write (for example after a leader restart). If the driver pod for this application already exists and is owned by this `SparkApplication` (matched by `OwnerReference` UID), reconciliation skips `spark-submit` and recovers enough status to continue the normal lifecycle.

This addresses duplicate submission and the failure mode where `spark-submit` returns "driver pod already exist" while `.status` never left `New`. Requiring an owner reference in addition to the operator's driver labels avoids adopting an unrelated or recycled same-named pod.

Proposed changes:

- Add `tryAdoptExistingDriverPod` and call it from `reconcileNewSparkApplication` before `submitSparkApplication`.
- Restore `driverInfo.podName` and `submissionID` from the driver pod label when present (keeps pod event enqueue behavior in `event_handler` aligned).
- Set `lastSubmissionAttemptTime`, bump submission and execution attempts when they were zero, and set `appState` to `Submitted`.

Change Category
Rationale
Fixes #2788.
Downstream reconcile (`reconcileSubmittedSparkApplication`, driver and executor state updates) fills in `sparkApplicationID`, terminal state, and related fields from the driver pod, consistent with the path where status was never lost.

Checklist
Additional Notes
Tested with `go test ./internal/controller/sparkapplication/... -timeout 120s`.