fix: requeue pending rerun while resources exist #2906

Open

skaushik15 wants to merge 4 commits into kubeflow:master from skaushik15:skaushik_fixPendingRerunState

Conversation


@skaushik15 skaushik15 commented Apr 8, 2026

Please refer to #2905 for more details.

Summary

  • avoid prematurely finalizing PENDING_RERUN when the previous application resources still exist
  • keep reconciliation in PENDING_RERUN by returning a requeue result until old resources are cleaned up
  • add controller tests to verify the requeue behavior and guard against regressions

Test plan

  • unit tests for pending rerun path added/updated
  • validate controller behavior in PENDING_RERUN when resources exist

Copilot AI review requested due to automatic review settings April 8, 2026 10:11
@google-oss-prow google-oss-prow Bot requested review from ImpSy and nabuskey April 8, 2026 10:11
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a bug in the SparkApplication controller where it would prematurely finalize PENDING_RERUN applications when resources from the previous run still exist. The fix ensures that the controller keeps reconciling in the PENDING_RERUN state by returning a requeue result until old resources are cleaned up.

Changes:

  • Added a constant for the requeue delay (5 seconds) when waiting for resources to be cleaned up
  • Modified the reconcilePendingRerunSparkApplication function to retry deletion of remaining resources and requeue for a fixed delay instead of immediately finalizing
  • Added comprehensive unit tests to verify the requeue behavior and prevent regressions

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

internal/controller/sparkapplication/controller.go: Added requeue delay constant and modified pending rerun reconciliation to retry resource deletion and requeue instead of immediately finalizing.
internal/controller/sparkapplication/controller_test.go: Added test case verifying that pending rerun applications requeue while resources are still being deleted.


When transitioning to PENDING_RERUN, the reconciler returned ctrl.Result{},
nil if the driver pod or UI service was still terminating, with no requeue
scheduled. If the pod Delete event was missed or dropped, the SparkApplication
would remain stuck in PENDING_RERUN indefinitely.

Fix by retrying deleteSparkResources (to handle partial deletion failures) and
scheduling a requeue after 5s so the reconciler re-checks until all resources
are fully gone before submitting the next run.

Signed-off-by: sushant kaushik <[email protected]>
Made-with: Cursor

AI-Session-Id: 58c13a82-a063-4650-a1c9-0a6ec5821b35
AI-Tool: claude-code
AI-Model: unknown
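
For context, here is a minimal sketch (not the actual patch) of the requeue pattern the commit message above describes. deleteSparkResources appears elsewhere in this PR, while sparkResourcesExist and submitNextRun are hypothetical stand-ins, and the import paths are illustrative:

    // Sketch only: keep the app in PENDING_RERUN by scheduling a requeue
    // instead of returning an empty Result while old resources still exist.
    package sparkapplication

    import (
        "context"
        "time"

        ctrl "sigs.k8s.io/controller-runtime"

        "github.com/kubeflow/spark-operator/api/v1beta2"
    )

    // Assumed requeue delay while waiting for old resources to be cleaned up.
    const cleanupRequeueInterval = 5 * time.Second

    func (r *Reconciler) reconcilePendingRerun(ctx context.Context, app *v1beta2.SparkApplication) (ctrl.Result, error) {
        exists, err := r.sparkResourcesExist(ctx, app) // hypothetical helper
        if err != nil {
            return ctrl.Result{}, err
        }
        if exists {
            // Retry deletion in case the previous attempt was only partially
            // successful (e.g. driver pod gone, UI service still terminating).
            if err := r.deleteSparkResources(ctx, app); err != nil {
                return ctrl.Result{}, err
            }
            // Requeue explicitly; a missed or dropped Delete event can no
            // longer leave the application stuck in PENDING_RERUN.
            return ctrl.Result{RequeueAfter: cleanupRequeueInterval}, nil
        }
        // Old resources are gone; safe to submit the next run.
        return ctrl.Result{}, r.submitNextRun(ctx, app) // hypothetical helper
    }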
@skaushik15 skaushik15 force-pushed the skaushik_fixPendingRerunState branch from 18956cc to 46654e6 on April 8, 2026 10:17
@skaushik15 skaushik15 changed the title fix(sparkapplication): requeue pending rerun while resources exist fix: requeue pending rerun while resources exist Apr 8, 2026
@google-oss-prow
Contributor

@vizsmurali: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment thread internal/controller/sparkapplication/controller.go Outdated
Prevent unbounded 5s requeues in PENDING_RERUN by limiting cleanup wait to 30s (aligned with default pod termination grace). If resources are still present after the window, transition the application to FAILED with an explicit error.

Signed-off-by: sushant kaushik <[email protected]>
Made-with: Cursor

AI-Session-Id: ca1834a5-26b0-4af3-8439-ad7a091e985b
AI-Tool: claude-code
AI-Model: unknown
Comment thread internal/controller/sparkapplication/controller.go
Comment thread internal/controller/sparkapplication/controller.go Outdated
@google-oss-prow
Contributor

@dineshkumar181094: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@nabuskey nabuskey left a comment


The change for this problem should be very similar to what we do in reconcileFailedSubmissionSparkApplication.

If GC hasn't completed yet, we should update the result object with TimeUntilNextRetryDue, then re-queue.
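
For reference, a sketch of the shape of that suggestion, using util.TimeUntilNextRetryDue as it appears in the diff quoted further down; surrounding code and state handling are omitted:

    // If GC has not finished yet, schedule the next reconcile explicitly
    // rather than returning an empty Result.
    timeUntilNextRetryDue, err := util.TimeUntilNextRetryDue(app)
    if err != nil {
        return ctrl.Result{}, err
    }
    result.RequeueAfter = timeUntilNextRetryDue
    return result, nil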

Comment thread internal/controller/sparkapplication/controller.go Outdated
Signed-off-by: sushant kaushik <[email protected]>
Made-with: Cursor

AI-Session-Id: ca1834a5-26b0-4af3-8439-ad7a091e985b
AI-Tool: claude-code
AI-Model: unknown
@skaushik15
Author

The change for this problem should be very similar to what we do in reconcileFailedSubmissionSparkApplication.

If GC hasn't completed yet, we should update the result object with TimeUntilNextRetryDue, then re-queue.

Made the change in line with reconcileFailedSubmissionSparkApplication

@tariq-hasan
Member

The change for this problem should be very similar to what we do in reconcileFailedSubmissionSparkApplication.
If GC hasn't completed yet, we should update the result object with TimeUntilNextRetryDue, then re-queue.

Made the change in line with reconcileFailedSubmissionSparkApplication

I am wondering if we might be conflating the semantics of PENDING_RERUN somewhat with these changes. As per this description, if PENDING_RERUN is just a placeholder for garbage collection and if its parent states - FAILING and SUCCEEDING - already encapsulate the semantics for restart policy, linear backoff and retry logic then would we want to reuse that logic again in PENDING_RERUN?

If PENDING_RERUN is just a polling step to validate garbage collection I am wondering if we want to have a clean reusable abstraction for just that logic - which is basically just this portion. This would still leave open the question of the polling interval but that would be a different consideration than the retry semantics that was already run through in the prior state.

Also, the SUSPENDED state suffers from the same issue as PENDING_RERUN: https://github.com/kubeflow/spark-operator/blob/master/internal/controller/sparkapplication/controller.go#L815. I am wondering if we want to make the same changes there as well.

@skaushik15
Author

The change for this problem should be very similar to what we do in reconcileFailedSubmissionSparkApplication.
If GC hasn't completed yet, we should update the result object with TimeUntilNextRetryDue, then re-queue.

Made the change in line with reconcileFailedSubmissionSparkApplication

I am wondering if we might be conflating the semantics of PENDING_RERUN somewhat with these changes. As per this description, if PENDING_RERUN is just a placeholder for garbage collection and if its parent states - FAILING and SUCCEEDING - already encapsulate the semantics for restart policy, linear backoff and retry logic then would we want to reuse that logic again in PENDING_RERUN?

If PENDING_RERUN is just a polling step to validate garbage collection I am wondering if we want to have a clean reusable abstraction for just that logic - which is basically just this portion. This would still leave open the question of the polling interval but that would be a different consideration than the retry semantics that was already run through in the prior state.

Also, the SUSPENDED state suffers from the same issue as PENDING_RERUN: https://github.com/kubeflow/spark-operator/blob/master/internal/controller/sparkapplication/controller.go#L815. I am wondering if we want to make the same changes there as well.

Agree @tariq-hasan, other states have similar issues and we should fix them.
A Spark application shouldn't be stuck in a non-terminal state forever; it should fail gracefully so that failure strategies can be applied.

Comment on lines +523 to +551
    if app.Status.TerminationTime.IsZero() {
        app.Status.TerminationTime = metav1.Now()
    }
    if util.PendingRerunCleanupRetryBudgetExceeded(app) {
        app.Status.TerminationTime = metav1.Now()
        app.Status.AppState.State = v1beta2.ApplicationStateFailed
        app.Status.AppState.ErrorMessage = "failed to delete resources before rerun after exhausting cleanup retries"
        r.recordSparkApplicationEvent(app)
    } else {
        timeUntilNextRetryDue, err := util.TimeUntilNextRetryDue(app)
        if err != nil {
            app.Status.TerminationTime = metav1.Now()
            app.Status.AppState.State = v1beta2.ApplicationStateFailed
            app.Status.AppState.ErrorMessage = fmt.Sprintf("failed to determine pending rerun cleanup retry: %v", err)
            r.recordSparkApplicationEvent(app)
        } else if timeUntilNextRetryDue <= 0 {
            // Retry cleanup in case deletion is only partially complete, for example if the driver pod is gone
            // but the UI Service/executors are still terminating.
            logger.Info("Resources associated with SparkApplication still exist, retrying deletion")
            if err := r.deleteSparkResources(ctx, app); err != nil {
                logger.Error(err, "failed to delete spark resources")
            }
            result.Requeue = true
        } else {
            // Keep polling until Kubernetes GC has finished removing all resources associated with the app.
            logger.Info("Resources associated with SparkApplication still exist, waiting for GC to complete")
            result.RequeueAfter = timeUntilNextRetryDue
        }
    }
Member


Suggested change (replacing the block above):
logger.Info("Resources associated with SparkApplication still exist, retrying deletion")
if err := r.deleteSparkResources(ctx, app); err != nil {
logger.Error(err, "Failed to delete resources associated with SparkApplication")
}
result.RequeueAfter = r.options.ResourceCleanupPollInterval

In addition, I believe we ought to have something like this for the resource cleanup, where the poll interval is a configurable option, since the retry was already taken care of in the prior step.
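
As a rough illustration only, a configurable poll interval could be wired through the controller options along these lines; the Options field name comes from the suggestion above, while the flag name and default are assumptions:

    // Hypothetical option on the controller's Options struct.
    type Options struct {
        // ResourceCleanupPollInterval controls how often the controller
        // re-checks whether resources from the previous run are gone.
        ResourceCleanupPollInterval time.Duration
        // ... other existing options ...
    }

    // Illustrative flag wiring in the operator's entrypoint.
    flag.DurationVar(&opts.ResourceCleanupPollInterval, "resource-cleanup-poll-interval",
        5*time.Second, "How often to poll for deletion of old Spark resources before a rerun.")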

Contributor


This is in line with what I had in mind. Make use of spec.RetryInterval with a default value and backoff, just to avoid adding yet another option to the spec / flags. Currently the field is not used at all. Perhaps it makes sense to rename the field or update its description to avoid confusion. What do you think?
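
A sketch of what reusing spec.retryInterval with linear backoff might look like; the helper name, the 5s fallback, and the assumption that spec.retryInterval is an optional value in seconds are illustrative, not code from this PR:

    // Hypothetical helper: next delay for cleanup polling, backed off linearly
    // from the start of cleanup using spec.retryInterval (seconds).
    func nextCleanupPollDelay(app *v1beta2.SparkApplication, cleanupStart time.Time) time.Duration {
        base := 5 * time.Second // assumed fallback when spec.retryInterval is unset
        if app.Spec.RetryInterval != nil {
            base = time.Duration(*app.Spec.RetryInterval) * time.Second
        }
        // Linear backoff: attempt n becomes due n*base after cleanup started.
        attemptsDone := int64(time.Since(cleanupStart)/base) + 1
        due := cleanupStart.Add(time.Duration(attemptsDone) * base)
        return time.Until(due)
    }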

Member


That makes sense to me. Probably ideal to use this field than to add a new one in the spec or manifests.

Contributor

@nabuskey nabuskey Apr 17, 2026


@skaushik15 Does this make sense?

We should be able to set the default value here: api/v1beta2/defaults.go. 5 seconds default is good since it's the default value for other similar intervals.
Feel free to ping on Slack too.
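
A possible shape for that default, assuming a small defaulting helper in api/v1beta2/defaults.go (the function name is illustrative):

    // Hypothetical defaulting helper for spec.retryInterval.
    func setRetryIntervalDefault(app *SparkApplication) {
        if app.Spec.RetryInterval == nil {
            // 5 seconds, matching the default used for similar intervals.
            interval := int64(5)
            app.Spec.RetryInterval = &interval
        }
    }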

Author


Hey,
it does make sense, but is there a plan to move it to a terminal state instead of just retrying with a backoff interval? Would moving it to a failed state after a certain number of retries make more sense?

Author


Made the changes as per suggestion

Member

@tariq-hasan tariq-hasan Apr 21, 2026


My understanding is that we can rephrase the question in this way: "Does K8s guarantee garbage collection (in this case for the Spark driver pod, Spark Web UI Ingress and Spark Web UI Service)? If so should we rely on that guarantee and avoid any custom logic for app-level failures? On the other hand if there is no guarantee then should we do app-level failure to compensate for that lack of guarantee?"

The original issue for the stuck state happened not because we did not fail the app. Rather it happened because we did not requeue in the first place. So the two aspects are fundamentally different and we are already taking care of requeueing.

In so far as garbage collection is concerned K8s does not guarantee deletion in a bounded time since finalizers, webhooks, offline nodes, etc. can block it, but this sounds more like a rare edge case than what this current issue is dealing with.

Contributor

@nabuskey nabuskey left a comment


@tariq-hasan's suggestion is great. The core of the issue here is that we return an empty result object instead of setting the requeue field. By not setting the field, reconciliation only happens when the global reconcile timer is hit.

For this to happen, we need to set retry interval and backoff logic similar to how it's handled in other states.


Reuse spec.retryInterval (default 5s) for PendingRerun resource cleanup polling with linear backoff timing, and remove custom cleanup retry budget logic. Update defaults, API field description, and tests to reflect the new polling behavior.

Signed-off-by: sushant kaushik <[email protected]>
Made-with: Cursor
    - // RetryInterval is the unit of intervals in seconds between submission retries.
    + // RetryInterval is the unit of intervals in seconds between retry/poll attempts
    + // in controller-managed loops (for example submission retries and pending rerun cleanup polling).
      // +optional
Member


I think the submission retries phrasing should be removed since this is not a legitimate use case for RetryInterval.

    if app.Status.TerminationTime.IsZero() {
        app.Status.TerminationTime = metav1.Now()
    }
    timeUntilNextRetryDue, err := util.TimeUntilNextRetryDue(app)
Member


The termination time is used for app failure tracking and therefore should not be used to calculate backoff for cleanup polling. Otherwise this will conflate the use case for this field.

As a practical example, if the app transitions from PENDING_RERUN to FAILED, the TTL calculation in reconcileTerminatedSparkApplication would use the cleanup start timestamp rather than the actual failure timestamp.
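
To make the conflation concrete, a TTL check of the rough form below (names simplified, not the repository's exact code) would then measure from the start of cleanup rather than from the actual failure:

    // If TerminationTime was overwritten when cleanup started, this expires
    // relative to cleanup start, not relative to when the app actually failed.
    expired := time.Since(app.Status.TerminationTime.Time) > ttl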

                attemptsDone = int32(elapsed/baseInterval) + 1
            }
        }
    }
Member


Looking at the original code for TimeUntilNextRetryDue there is an implicit assumption that this is a helper function that enables restarts to occur after app failures.

This change therefore intermingles two different concerns - app failure retries and cleanup polling - within the same block of code. It would be better to move the logic for cleanup polling to a different helper function to avoid confusion and maintain a clean separation of concerns.
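
For example, cleanup polling could live in its own small helper, kept separate from the failure-retry helpers; the name and the fixed-cadence behavior below are illustrative only:

    // Hypothetical helper dedicated to cleanup polling; it deliberately shares
    // no state or semantics with the app-failure retry logic.
    func timeUntilNextCleanupPollDue(cleanupStart time.Time, pollInterval time.Duration) time.Duration {
        if cleanupStart.IsZero() {
            return 0 // first check happens immediately
        }
        // Poll on a fixed cadence measured from the start of cleanup.
        elapsed := time.Since(cleanupStart)
        return pollInterval - elapsed%pollInterval
    }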

