fix(controller): adopt existing driver pod when status update is lost after submit by a7i · Pull Request #2932 · kubeflow/spark-operator

a7i · 2026-05-06T14:43:01Z

Purpose of this PR

Make SparkApplication submission idempotent when the controller loses the post-submit status write (for example after a leader restart). If the driver pod for this application already exists and is owned by this SparkApplication (matched by OwnerReference UID), reconciliation skips spark-submit and recovers enough status to continue the normal lifecycle.

This addresses duplicate submission and the failure mode where spark-submit fails with a "driver pod already exist" error while .status never leaves New. Requiring an owner reference in addition to the operator's driver labels avoids adopting an unrelated or recycled pod that happens to have the same name.

Proposed changes:

Add tryAdoptExistingDriverPod and call it from reconcileNewSparkApplication before submitSparkApplication.
Adoption sets driverInfo.podName; submissionID, taken from the driver pod label when present (this keeps the pod event enqueue behavior in event_handler aligned); lastSubmissionAttemptTime; the submission and execution attempt counters, when they were zero; and appState, which becomes Submitted.
Add envtest coverage for adoption when the pod is owned by the app, and for non-adoption when the pod is unrelated.

Change Category

Bugfix (non-breaking change which fixes an issue)
Feature (non-breaking change which adds functionality)
Breaking change (fix or feature that could affect existing functionality)
Documentation update

Rationale

Fixes #2788.

Downstream reconcile (reconcileSubmittedSparkApplication, driver and executor state updates) fills in sparkApplicationID, terminal state, and related fields from the driver pod, consistent with the path where status was never lost.

Checklist

I have conducted a self-review of my own code.
I have updated documentation accordingly.
I have added tests that prove my changes are effective or that my feature works.
Existing unit tests pass locally with my changes.

Additional Notes

Tested with go test ./internal/controller/sparkapplication/... -timeout 120s.

google-oss-prow · 2026-05-06T14:43:08Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chenyi015 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-05-06T14:43:13Z

🎉 Welcome to the Kubeflow Spark Operator! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
Our team will review your PR soon!

Join the community:

Slack: Join our #kubeflow-spark-on-kubernetes Slack channel.

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

When the controller restarts after spark-submit created the driver pod but before status was persisted, reconcile must not call spark-submit again. Adopt the pod when it is owned by this SparkApplication, restoring submission id from pod labels and aligning attempts with a successful submit path.

Refs kubeflow#2788

Signed-off-by: Amir Alavi <[email protected]>

When the controller leader receives SIGTERM mid-submission, the in-flight reconcile may not get to write status before either controller-runtime's GracefulShutdownTimeout (30s default) returns from Manager.Start or the kubelet's terminationGracePeriodSeconds (30s default) sends SIGKILL. Both defaults are tight, and spark-submit ignores the reconcile context, so a brand-new submission started seconds before SIGTERM can create a driver pod whose status update is then lost.

Make the shutdown window wider and explicitly bounded:

- Add --graceful-shutdown-timeout flag (default 90s) and pass it as Manager.GracefulShutdownTimeout, so in-flight reconciles have time to finish writing status after SIGTERM.
- Set the controller pod's terminationGracePeriodSeconds to 120s in the Helm chart (configurable), so the kubelet does not SIGKILL the manager before the inner timeout elapses.
- Thread the reconcile context into runSparkSubmit and use exec.CommandContext so a cancelled reconcile actually terminates the spark-submit child instead of orphaning it.

Adoption (PR kubeflow#2932) still covers the residual race; these changes shrink the window so adoption fires far less often in practice.

Refs kubeflow#2788

Signed-off-by: Amir Alavi <[email protected]>

google-oss-prow Bot added the do-not-merge/work-in-progress label May 6, 2026

google-oss-prow Bot requested review from ImpSy and nabuskey May 6, 2026 14:43

google-oss-prow Bot added the size/L label May 6, 2026

a7i mentioned this pull request May 7, 2026

Restart of Spark Operator may Result in Duplicated Submission #2788

Open

1 task

a7i force-pushed the fix/idempotent-spark-application-submission branch from 62bff2e to 76f61ce Compare May 7, 2026 20:30

a7i changed the title ~~fix(controller): adopt existing driver pod on submission to prevent duplicate runs~~ fix(controller): adopt existing driver pod when status update is lost after submit May 7, 2026

a7i mentioned this pull request May 8, 2026

fix(controller): widen shutdown grace window for in-flight submissions #2934

Open

8 tasks

a7i wants to merge 1 commit into kubeflow:master from a7i:fix/idempotent-spark-application-submission
