fix: requeue pending rerun while resources exist #2906

Open

skaushik15 wants to merge 4 commits into kubeflow:master from skaushik15:skaushik_fixPendingRerunState

Conversation


@skaushik15 skaushik15 commented Apr 8, 2026

Please refer to #2905 for more details.

Summary

  • avoid prematurely finalizing PENDING_RERUN when the previous application resources still exist
  • keep reconciliation in PENDING_RERUN by returning a requeue result until old resources are cleaned up
  • add controller tests to verify the requeue behavior and guard against regressions

Test plan

  • unit tests for pending rerun path added/updated
  • validate controller behavior in PENDING_RERUN when resources exist

Copilot AI review requested due to automatic review settings April 8, 2026 10:11
@google-oss-prow google-oss-prow Bot requested review from ImpSy and nabuskey April 8, 2026 10:11
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a bug in the SparkApplication controller where it would prematurely finalize PENDING_RERUN applications when resources from the previous run still exist. The fix ensures that the controller keeps reconciling in the PENDING_RERUN state by returning a requeue result until old resources are cleaned up.

Changes:

  • Added a constant for the requeue delay (5 seconds) when waiting for resources to be cleaned up
  • Modified the reconcilePendingRerunSparkApplication function to retry deletion of remaining resources and requeue for a fixed delay instead of immediately finalizing
  • Added comprehensive unit tests to verify the requeue behavior and prevent regressions

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

internal/controller/sparkapplication/controller.go: Added requeue delay constant and modified pending rerun reconciliation to retry resource deletion and requeue instead of immediately finalizing.
internal/controller/sparkapplication/controller_test.go: Added test case verifying that pending rerun applications requeue while resources are still being deleted.


When transitioning to PENDING_RERUN, the reconciler returned ctrl.Result{},
nil if the driver pod or UI service was still terminating, with no requeue
scheduled. If the pod Delete event was missed or dropped, the SparkApplication
would remain stuck in PENDING_RERUN indefinitely.

Fix by retrying deleteSparkResources (to handle partial deletion failures) and
scheduling a requeue after 5s so the reconciler re-checks until all resources
are fully gone before submitting the next run.

Signed-off-by: sushant kaushik <[email protected]>
Made-with: Cursor

AI-Session-Id: 58c13a82-a063-4650-a1c9-0a6ec5821b35
AI-Tool: claude-code
AI-Model: unknown
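
For context, here is a minimal sketch (not the actual patch) of the requeue pattern the commit message above describes. deleteSparkResources appears elsewhere in this PR, while sparkResourcesExist and submitNextRun are hypothetical stand-ins, and the import paths are illustrative:

    // Sketch only: keep the app in PENDING_RERUN by scheduling a requeue
    // instead of returning an empty Result while old resources still exist.
    package sparkapplication

    import (
        "context"
        "time"

        ctrl "sigs.k8s.io/controller-runtime"

        "github.com/kubeflow/spark-operator/api/v1beta2"
    )

    // Assumed requeue delay while waiting for old resources to be cleaned up.
    const cleanupRequeueInterval = 5 * time.Second

    func (r *Reconciler) reconcilePendingRerun(ctx context.Context, app *v1beta2.SparkApplication) (ctrl.Result, error) {
        exists, err := r.sparkResourcesExist(ctx, app) // hypothetical helper
        if err != nil {
            return ctrl.Result{}, err
        }
        if exists {
            // Retry deletion in case the previous attempt was only partially
            // successful (e.g. driver pod gone, UI service still terminating).
            if err := r.deleteSparkResources(ctx, app); err != nil {
                return ctrl.Result{}, err
            }
            // Requeue explicitly; a missed or dropped Delete event can no
            // longer leave the application stuck in PENDING_RERUN.
            return ctrl.Result{RequeueAfter: cleanupRequeueInterval}, nil
        }
        // Old resources are gone; safe to submit the next run.
        return ctrl.Result{}, r.submitNextRun(ctx, app) // hypothetical helper
    }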
@skaushik15 skaushik15 force-pushed the skaushik_fixPendingRerunState branch from 18956cc to 46654e6 on April 8, 2026 10:17
@skaushik15 skaushik15 changed the title fix(sparkapplication): requeue pending rerun while resources exist fix: requeue pending rerun while resources exist Apr 8, 2026
@google-oss-prow
Contributor

@vizsmurali: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment thread internal/controller/sparkapplication/controller.go Outdated
Prevent unbounded 5s requeues in PENDING_RERUN by limiting cleanup wait to 30s (aligned with default pod termination grace). If resources are still present after the window, transition the application to FAILED with an explicit error.

Signed-off-by: sushant kaushik <[email protected]>
Made-with: Cursor

AI-Session-Id: ca1834a5-26b0-4af3-8439-ad7a091e985b
AI-Tool: claude-code
AI-Model: unknown
Comment thread internal/controller/sparkapplication/controller.go
Comment thread internal/controller/sparkapplication/controller.go Outdated
@google-oss-prow
Contributor

@dineshkumar181094: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@nabuskey nabuskey left a comment


The change for this problem should be very similar to what we do in reconcileFailedSubmissionSparkApplication.

If GC hasn't completed yet, we should update the result object with TimeUntilNextRetryDue, then re-queue.
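
For reference, a sketch of the shape of that suggestion, using util.TimeUntilNextRetryDue as it appears in the diff quoted further down; surrounding code and state handling are omitted:

    // If GC has not finished yet, schedule the next reconcile explicitly
    // rather than returning an empty Result.
    timeUntilNextRetryDue, err := util.TimeUntilNextRetryDue(app)
    if err != nil {
        return ctrl.Result{}, err
    }
    result.RequeueAfter = timeUntilNextRetryDue
    return result, nil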

Comment thread internal/controller/sparkapplication/controller.go Outdated
Signed-off-by: sushant kaushik <[email protected]>
Made-with: Cursor

AI-Session-Id: ca1834a5-26b0-4af3-8439-ad7a091e985b
AI-Tool: claude-code
AI-Model: unknown
@skaushik15
Author

The change for this problem should be very similar to what we do in reconcileFailedSubmissionSparkApplication.

If GC hasn't completed yet, we should update the result object with TimeUntilNextRetryDue, then re-queue.

Made the change in line with reconcileFailedSubmissionSparkApplication

@tariq-hasan
Member

The change for this problem should be very similar to what we do in reconcileFailedSubmissionSparkApplication.
If GC hasn't completed yet, we should update the result object with TimeUntilNextRetryDue, then re-queue.

Made the change in line with reconcileFailedSubmissionSparkApplication

I am wondering if we might be conflating the semantics of PENDING_RERUN somewhat with these changes. As per this description, if PENDING_RERUN is just a placeholder for garbage collection and if its parent states - FAILING and SUCCEEDING - already encapsulate the semantics for restart policy, linear backoff and retry logic then would we want to reuse that logic again in PENDING_RERUN?

If PENDING_RERUN is just a polling step to validate garbage collection I am wondering if we want to have a clean reusable abstraction for just that logic - which is basically just this portion. This would still leave open the question of the polling interval but that would be a different consideration than the retry semantics that was already run through in the prior state.

Also, the SUSPENDED state suffers from the same issue as PENDING_RERUN: https://github.com/kubeflow/spark-operator/blob/master/internal/controller/sparkapplication/controller.go#L815. I am wondering if we want to make the same changes there as well.

@skaushik15
Author

The change for this problem should be very similar to what we do in reconcileFailedSubmissionSparkApplication.
If GC hasn't completed yet, we should update the result object with TimeUntilNextRetryDue, then re-queue.

Made the change in line with reconcileFailedSubmissionSparkApplication

I am wondering if we might be conflating the semantics of PENDING_RERUN somewhat with these changes. As per this description, if PENDING_RERUN is just a placeholder for garbage collection and if its parent states - FAILING and SUCCEEDING - already encapsulate the semantics for restart policy, linear backoff and retry logic then would we want to reuse that logic again in PENDING_RERUN?

If PENDING_RERUN is just a polling step to validate garbage collection I am wondering if we want to have a clean reusable abstraction for just that logic - which is basically just this portion. This would still leave open the question of the polling interval but that would be a different consideration than the retry semantics that was already run through in the prior state.

Also, the SUSPENDED state suffers from the same issue as PENDING_RERUN: https://github.com/kubeflow/spark-operator/blob/master/internal/controller/sparkapplication/controller.go#L815. I am wondering if we want to make the same changes there as well.

Agree @tariq-hasan, other states have similar issues and we should fix them.
A Spark application shouldn't be stuck in a non-terminal state forever; it should fail gracefully so that failure strategies can be applied.

Comment on lines +523 to +551
    if app.Status.TerminationTime.IsZero() {
        app.Status.TerminationTime = metav1.Now()
    }
    if util.PendingRerunCleanupRetryBudgetExceeded(app) {
        app.Status.TerminationTime = metav1.Now()
        app.Status.AppState.State = v1beta2.ApplicationStateFailed
        app.Status.AppState.ErrorMessage = "failed to delete resources before rerun after exhausting cleanup retries"
        r.recordSparkApplicationEvent(app)
    } else {
        timeUntilNextRetryDue, err := util.TimeUntilNextRetryDue(app)
        if err != nil {
            app.Status.TerminationTime = metav1.Now()
            app.Status.AppState.State = v1beta2.ApplicationStateFailed
            app.Status.AppState.ErrorMessage = fmt.Sprintf("failed to determine pending rerun cleanup retry: %v", err)
            r.recordSparkApplicationEvent(app)
        } else if timeUntilNextRetryDue <= 0 {
            // Retry cleanup in case deletion is only partially complete, for example if the driver pod is gone
            // but the UI Service/executors are still terminating.
            logger.Info("Resources associated with SparkApplication still exist, retrying deletion")
            if err := r.deleteSparkResources(ctx, app); err != nil {
                logger.Error(err, "failed to delete spark resources")
            }
            result.Requeue = true
        } else {
            // Keep polling until Kubernetes GC has finished removing all resources associated with the app.
            logger.Info("Resources associated with SparkApplication still exist, waiting for GC to complete")
            result.RequeueAfter = timeUntilNextRetryDue
        }
    }
Member


Suggested change (replacing the block above):
logger.Info("Resources associated with SparkApplication still exist, retrying deletion")
if err := r.deleteSparkResources(ctx, app); err != nil {
logger.Error(err, "Failed to delete resources associated with SparkApplication")
}
result.RequeueAfter = r.options.ResourceCleanupPollInterval

In addition, I believe we ought to have something like this for the resource cleanup, where the poll interval is a configurable option, since the retry was already taken care of in the prior step.
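
As a rough illustration only, a configurable poll interval could be wired through the controller options along these lines; the Options field name comes from the suggestion above, while the flag name and default are assumptions:

    // Hypothetical option on the controller's Options struct.
    type Options struct {
        // ResourceCleanupPollInterval controls how often the controller
        // re-checks whether resources from the previous run are gone.
        ResourceCleanupPollInterval time.Duration
        // ... other existing options ...
    }

    // Illustrative flag wiring in the operator's entrypoint.
    flag.DurationVar(&opts.ResourceCleanupPollInterval, "resource-cleanup-poll-interval",
        5*time.Second, "How often to poll for deletion of old Spark resources before a rerun.")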

Contributor


This is in line with what I had in mind. Make use of spec.RetryInterval with a default value and backoff, just to avoid adding yet another option to the spec / flags. Currently the field is not used at all. Perhaps it makes sense to rename the field or update its description to avoid confusion. What do you think?
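
A sketch of what reusing spec.retryInterval with linear backoff might look like; the helper name, the 5s fallback, and the assumption that spec.retryInterval is an optional value in seconds are illustrative, not code from this PR:

    // Hypothetical helper: next delay for cleanup polling, backed off linearly
    // from the start of cleanup using spec.retryInterval (seconds).
    func nextCleanupPollDelay(app *v1beta2.SparkApplication, cleanupStart time.Time) time.Duration {
        base := 5 * time.Second // assumed fallback when spec.retryInterval is unset
        if app.Spec.RetryInterval != nil {
            base = time.Duration(*app.Spec.RetryInterval) * time.Second
        }
        // Linear backoff: attempt n becomes due n*base after cleanup started.
        attemptsDone := int64(time.Since(cleanupStart)/base) + 1
        due := cleanupStart.Add(time.Duration(attemptsDone) * base)
        return time.Until(due)
    }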

Member


That makes sense to me. Probably ideal to use this field than to add a new one in the spec or manifests.

Contributor

@nabuskey nabuskey Apr 17, 2026


@skaushik15 Does this make sense?

We should be able to set the default value here: api/v1beta2/defaults.go. 5 seconds default is good since it's the default value for other similar intervals.
Feel free to ping on Slack too.
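
A possible shape for that default, assuming a small defaulting helper in api/v1beta2/defaults.go (the function name is illustrative):

    // Hypothetical defaulting helper for spec.retryInterval.
    func setRetryIntervalDefault(app *SparkApplication) {
        if app.Spec.RetryInterval == nil {
            // 5 seconds, matching the default used for similar intervals.
            interval := int64(5)
            app.Spec.RetryInterval = &interval
        }
    }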

Author


Hey,
it does make sense, but is there a plan to move it to a terminal state instead of just retrying with a backoff interval? Would moving it to a failed state after a certain number of retries make more sense?

Author


Made the changes as per suggestion

Member

@tariq-hasan tariq-hasan Apr 21, 2026


My understanding is that we can rephrase the question in this way: "Does K8s guarantee garbage collection (in this case for the Spark driver pod, Spark Web UI Ingress and Spark Web UI Service)? If so should we rely on that guarantee and avoid any custom logic for app-level failures? On the other hand if there is no guarantee then should we do app-level failure to compensate for that lack of guarantee?"

The original issue for the stuck state happened not because we did not fail the app. Rather it happened because we did not requeue in the first place. So the two aspects are fundamentally different and we are already taking care of requeueing.

In so far as garbage collection is concerned K8s does not guarantee deletion in a bounded time since finalizers, webhooks, offline nodes, etc. can block it, but this sounds more like a rare edge case than what this current issue is dealing with.

Contributor

@nabuskey nabuskey left a comment


@tariq-hasan's suggestion is great. The core of the issue here is that we return an empty result object instead of setting the requeue field. By not setting the field, reconciliation only happens when the global reconcile timer is hit.

For this to happen, we need to set retry interval and backoff logic similar to how it's handled in other states.


Reuse spec.retryInterval (default 5s) for PendingRerun resource cleanup polling with linear backoff timing, and remove custom cleanup retry budget logic. Update defaults, API field description, and tests to reflect the new polling behavior.

Signed-off-by: sushant kaushik <[email protected]>
Made-with: Cursor
    - // RetryInterval is the unit of intervals in seconds between submission retries.
    + // RetryInterval is the unit of intervals in seconds between retry/poll attempts
    + // in controller-managed loops (for example submission retries and pending rerun cleanup polling).
      // +optional
Member


I think the submission retries phrasing should be removed since this is not a legitimate use case for RetryInterval.

    if app.Status.TerminationTime.IsZero() {
        app.Status.TerminationTime = metav1.Now()
    }
    timeUntilNextRetryDue, err := util.TimeUntilNextRetryDue(app)
Member


The termination time is used for app failure tracking and therefore should not be used to calculate backoff for cleanup polling. Otherwise this will conflate the use case for this field.

As a practical example, if the app transitions from PENDING_RERUN to FAILED, the TTL calculation in reconcileTerminatedSparkApplication would use the cleanup start timestamp rather than the actual failure timestamp.
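
To make the conflation concrete, a TTL check of the rough form below (names simplified, not the repository's exact code) would then measure from the start of cleanup rather than from the actual failure:

    // If TerminationTime was overwritten when cleanup started, this expires
    // relative to cleanup start, not relative to when the app actually failed.
    expired := time.Since(app.Status.TerminationTime.Time) > ttl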

                attemptsDone = int32(elapsed/baseInterval) + 1
            }
        }
    }
Member


Looking at the original code for TimeUntilNextRetryDue there is an implicit assumption that this is a helper function that enables restarts to occur after app failures.

This change therefore intermingles two different concerns - app failure retries and cleanup polling - within the same block of code. It would be better to move the logic for cleanup polling to a different helper function to avoid confusion and maintain a clean separation of concerns.
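
For example, cleanup polling could live in its own small helper, kept separate from the failure-retry helpers; the name and the fixed-cadence behavior below are illustrative only:

    // Hypothetical helper dedicated to cleanup polling; it deliberately shares
    // no state or semantics with the app-failure retry logic.
    func timeUntilNextCleanupPollDue(cleanupStart time.Time, pollInterval time.Duration) time.Duration {
        if cleanupStart.IsZero() {
            return 0 // first check happens immediately
        }
        // Poll on a fixed cadence measured from the start of cleanup.
        elapsed := time.Since(cleanupStart)
        return pollInterval - elapsed%pollInterval
    }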

