fix(controller): adopt existing driver pod when status update is lost after submit #2932

Draft
a7i wants to merge 1 commit into kubeflow:master from a7i:fix/idempotent-spark-application-submission

a7i commented May 6, 2026

Purpose of this PR

Make SparkApplication submission idempotent when the controller loses the post-submit status write (for example after a leader restart). If the driver pod for this application already exists and is owned by this SparkApplication (matched by OwnerReference UID), reconciliation skips spark-submit and recovers enough status to continue the normal lifecycle.

This addresses duplicate submission and the failure mode where spark-submit fails with a "driver pod already exists" error while .status never leaves New. Requiring an OwnerReference match in addition to the operator's driver labels avoids adopting an unrelated or recycled pod that happens to have the same name.

Proposed changes:

  • Add tryAdoptExistingDriverPod and call it from reconcileNewSparkApplication before submitSparkApplication.
  • On adoption, set driverInfo.podName, submissionID (from the driver pod label when present, which keeps pod event enqueueing in event_handler aligned), lastSubmissionAttemptTime, the submission and execution attempt counts (only when they are zero), and appState (to Submitted).
  • Add envtest coverage for adoption when the pod is owned by the app, and for non-adoption when the pod is unrelated.

Change Category

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Rationale

Fixes #2788.

Downstream reconcile (reconcileSubmittedSparkApplication, driver and executor state updates) fills in sparkApplicationID, terminal state, and related fields from the driver pod, consistent with the path where status was never lost.

Checklist

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Additional Notes

Tested with go test ./internal/controller/sparkapplication/... -timeout 120s.

@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chenyi015 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot requested review from ImpSy and nabuskey May 6, 2026 14:43
@github-actions

github-actions Bot commented May 6, 2026

🎉 Welcome to the Kubeflow Spark Operator! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

When the controller restarts after spark-submit created the driver pod but before
status was persisted, reconcile must not call spark-submit again. Adopt the pod
when it is owned by this SparkApplication, restoring submission id from pod
labels and aligning attempts with a successful submit path.

Refs kubeflow#2788

Signed-off-by: Amir Alavi <[email protected]>
@a7i a7i force-pushed the fix/idempotent-spark-application-submission branch from 62bff2e to 76f61ce Compare May 7, 2026 20:30
@a7i a7i changed the title from "fix(controller): adopt existing driver pod on submission to prevent duplicate runs" to "fix(controller): adopt existing driver pod when status update is lost after submit" May 7, 2026
a7i added a commit to a7i/spark-operator that referenced this pull request May 8, 2026
When the controller leader receives SIGTERM mid-submission, the in-flight
reconcile may not get to write status before either controller-runtime's
GracefulShutdownTimeout (30s default) returns from Manager.Start or the
kubelet's terminationGracePeriodSeconds (30s default) sends SIGKILL. Both
defaults are tight, and spark-submit ignores the reconcile context, so a
brand-new submission started seconds before SIGTERM can create a driver
pod whose status update is then lost.

Make the shutdown window wider and explicitly bounded:

- Add --graceful-shutdown-timeout flag (default 90s) and pass it as
  Manager.GracefulShutdownTimeout, so in-flight reconciles have time to
  finish writing status after SIGTERM.
- Set the controller pod's terminationGracePeriodSeconds to 120s in the
  Helm chart (configurable), so the kubelet does not SIGKILL the manager
  before the inner timeout elapses.
- Thread the reconcile context into runSparkSubmit and use
  exec.CommandContext so a cancelled reconcile actually terminates the
  spark-submit child instead of orphaning it.

Adoption (PR kubeflow#2932) still covers the residual race; these changes shrink
the window so adoption fires far less often in practice.

Refs kubeflow#2788

Signed-off-by: Amir Alavi <[email protected]>
Development

Successfully merging this pull request may close these issues.

Restart of Spark Operator may Result in Duplicated Submission
