fix: requeue pending rerun while resources exist #2906
Conversation
Pull request overview
This PR fixes a bug in the SparkApplication controller where it would prematurely finalize PENDING_RERUN applications when resources from the previous run still exist. The fix ensures that the controller keeps reconciling in the PENDING_RERUN state by returning a requeue result until old resources are cleaned up.
Changes:
- Added a constant for the requeue delay (5 seconds) when waiting for resources to be cleaned up
- Modified the `reconcilePendingRerunSparkApplication` function to retry deletion of remaining resources and requeue after a fixed delay instead of immediately finalizing
- Added comprehensive unit tests to verify the requeue behavior and prevent regressions
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| internal/controller/sparkapplication/controller.go | Added requeue delay constant and modified pending rerun reconciliation to retry resource deletion and requeue instead of immediately finalizing |
| internal/controller/sparkapplication/controller_test.go | Added test case verifying that pending rerun applications requeue while resources are still being deleted |
When transitioning to PENDING_RERUN, the reconciler returned ctrl.Result{},
nil if the driver pod or UI service was still terminating, with no requeue
scheduled. If the pod Delete event was missed or dropped, the SparkApplication
would remain stuck in PENDING_RERUN indefinitely.
Fix by retrying deleteSparkResources (to handle partial deletion failures) and
scheduling a requeue after 5s so the reconciler re-checks until all resources
are fully gone before submitting the next run.
Signed-off-by: sushant kaushik <[email protected]>
Made-with: Cursor
AI-Session-Id: 58c13a82-a063-4650-a1c9-0a6ec5821b35
AI-Tool: claude-code
AI-Model: unknown
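For readers skimming the thread, here is a minimal sketch of the behavioural change this commit describes. The constant, the helper name `pendingRerunResult`, and the `resourcesStillExist` flag are illustrative stand-ins; the actual change lives inside `reconcilePendingRerunSparkApplication`.

```go
import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Illustrative constant mirroring the 5s requeue delay added by this commit.
const requeueDelayWhileResourcesExist = 5 * time.Second

// Before the fix the pending-rerun path returned an empty ctrl.Result while
// old resources still existed, so no requeue was scheduled and a missed
// Delete event left the app stuck; after the fix it keeps re-checking.
func pendingRerunResult(resourcesStillExist bool) (ctrl.Result, error) {
	if resourcesStillExist {
		// Old behaviour (the bug): `return ctrl.Result{}, nil` here.
		return ctrl.Result{RequeueAfter: requeueDelayWhileResourcesExist}, nil
	}
	// Everything is gone: safe to submit the next run.
	return ctrl.Result{}, nil
}
```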
Force-pushed from 18956cc to 46654e6.
@vizsmurali: changing LGTM is restricted to collaborators.
Prevent unbounded 5s requeues in PENDING_RERUN by limiting the cleanup wait to 30s (aligned with the default pod termination grace period). If resources are still present after the window, transition the application to FAILED with an explicit error.

Signed-off-by: sushant kaushik <[email protected]>
Made-with: Cursor
AI-Session-Id: ca1834a5-26b0-4af3-8439-ad7a091e985b
AI-Tool: claude-code
AI-Model: unknown
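A rough sketch of the bounded wait described in this revision; `maxCleanupWait`, `cleanupStartedAt`, and the fixed 5s requeue below are assumed names/values for illustration, not the commit's exact code.

```go
// Stop requeueing once the cleanup window expires and fail the app instead.
const maxCleanupWait = 30 * time.Second // aligned with the default pod termination grace period

if resourcesStillExist {
	if time.Since(cleanupStartedAt) > maxCleanupWait {
		app.Status.AppState.State = v1beta2.ApplicationStateFailed
		app.Status.AppState.ErrorMessage = "resources from the previous run were not cleaned up in time"
		return ctrl.Result{}, nil
	}
	return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
}
```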
@dineshkumar181094: changing LGTM is restricted to collaborators.
nabuskey left a comment
The change for this problem should be very similar to what we do in reconcileFailedSubmissionSparkApplication.
If GC hasn't completed yet, we should update the result object with TimeUntilNextRetryDue, then re-queue.
Signed-off-by: sushant kaushik <[email protected]>
Made-with: Cursor
AI-Session-Id: ca1834a5-26b0-4af3-8439-ad7a091e985b
AI-Tool: claude-code
AI-Model: unknown
Made the change in line with reconcileFailedSubmissionSparkApplication.
I am wondering if we might be conflating the semantics of … If … Also, the …
Agree @tariq-hasan, other states have similar issues and we should fix them.
```go
if app.Status.TerminationTime.IsZero() {
	app.Status.TerminationTime = metav1.Now()
}
if util.PendingRerunCleanupRetryBudgetExceeded(app) {
	app.Status.TerminationTime = metav1.Now()
	app.Status.AppState.State = v1beta2.ApplicationStateFailed
	app.Status.AppState.ErrorMessage = "failed to delete resources before rerun after exhausting cleanup retries"
	r.recordSparkApplicationEvent(app)
} else {
	timeUntilNextRetryDue, err := util.TimeUntilNextRetryDue(app)
	if err != nil {
		app.Status.TerminationTime = metav1.Now()
		app.Status.AppState.State = v1beta2.ApplicationStateFailed
		app.Status.AppState.ErrorMessage = fmt.Sprintf("failed to determine pending rerun cleanup retry: %v", err)
		r.recordSparkApplicationEvent(app)
	} else if timeUntilNextRetryDue <= 0 {
		// Retry cleanup in case deletion is only partially complete, for example if the driver pod is gone
		// but the UI Service/executors are still terminating.
		logger.Info("Resources associated with SparkApplication still exist, retrying deletion")
		if err := r.deleteSparkResources(ctx, app); err != nil {
			logger.Error(err, "failed to delete spark resources")
		}
		result.Requeue = true
	} else {
		// Keep polling until Kubernetes GC has finished removing all resources associated with the app.
		logger.Info("Resources associated with SparkApplication still exist, waiting for GC to complete")
		result.RequeueAfter = timeUntilNextRetryDue
	}
}
```
Suggested change:

```go
logger.Info("Resources associated with SparkApplication still exist, retrying deletion")
if err := r.deleteSparkResources(ctx, app); err != nil {
	logger.Error(err, "Failed to delete resources associated with SparkApplication")
}
result.RequeueAfter = r.options.ResourceCleanupPollInterval
```
Add to that I believe we ought to have something like this for the resource cleanup where we set up the poll interval as a configurable option since the retry was already taken care of in the prior step.
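A sketch of what such a configurable option could look like, assuming a flag-backed options struct on the reconciler; the `ResourceCleanupPollInterval` name comes from the suggestion above, while the flag name and `BindFlags` helper are hypothetical.

```go
import (
	"flag"
	"time"
)

// Options holds reconciler settings; ResourceCleanupPollInterval matches the
// field referenced in the suggested change above.
type Options struct {
	ResourceCleanupPollInterval time.Duration
}

// BindFlags wires the interval to a command-line flag with a 5s default.
func (o *Options) BindFlags(fs *flag.FlagSet) {
	fs.DurationVar(&o.ResourceCleanupPollInterval, "resource-cleanup-poll-interval",
		5*time.Second, "How often to re-check for leftover resources before a rerun.")
}
```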
This is in line with what I had in mind. Make use of spec.RetryInterval with a default value and backoff, just to avoid adding yet another option to the spec / flags. Currently the field is not used at all. Perhaps it makes sense to rename or update the description of the field to avoid confusion. What do you think?
That makes sense to me. Probably ideal to use this field rather than add a new one in the spec or manifests.
@skaushik15 Does this make sense?
We should be able to set the default value here: api/v1beta2/defaults.go. A 5-second default is good since it's the default value for other similar intervals.
Feel free to ping on Slack too.
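A sketch of the defaulting, assuming spec.retryInterval is an optional *int64 of seconds (as its description suggests) and following the existing pattern in api/v1beta2/defaults.go; the helper name is hypothetical.

```go
// setRetryIntervalDefault applies the 5-second default discussed above when
// the user has not set spec.retryInterval explicitly.
func setRetryIntervalDefault(app *SparkApplication) {
	if app == nil || app.Spec.RetryInterval != nil {
		return
	}
	defaultRetryIntervalSeconds := int64(5)
	app.Spec.RetryInterval = &defaultRetryIntervalSeconds
}
```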
Hey, it does make sense, but is there a plan to move it to a terminal state instead of just retrying with a backoff interval? Would moving it to a failed state after a certain number of retries make more sense?
Made the changes as per suggestion
My understanding is that we can rephrase the question in this way: "Does K8s guarantee garbage collection (in this case for the Spark driver pod, Spark Web UI Ingress, and Spark Web UI Service)? If so, should we rely on that guarantee and avoid any custom logic for app-level failures? On the other hand, if there is no guarantee, should we use app-level failure to compensate for that lack of guarantee?"
The original issue for the stuck state happened not because we did not fail the app. Rather, it happened because we did not requeue in the first place. So the two aspects are fundamentally different, and we are already taking care of requeueing.
Insofar as garbage collection is concerned, K8s does not guarantee deletion within a bounded time, since finalizers, webhooks, offline nodes, etc. can block it, but this sounds more like a rare edge case than what the current issue is dealing with.
nabuskey left a comment
@tariq-hasan's suggestion is great. The core of the issue here is that we return an empty result object instead of setting the requeue field. By not setting the field, reconciliation only happens when the global reconcile timer is hit.
For this to happen, we need to set retry interval and backoff logic similar to how it's handled in other states.
Reuse spec.retryInterval (default 5s) for PendingRerun resource cleanup polling with linear backoff timing, and remove custom cleanup retry budget logic. Update defaults, API field description, and tests to reflect the new polling behavior.

Signed-off-by: sushant kaushik <[email protected]>
Made-with: Cursor
```diff
-// RetryInterval is the unit of intervals in seconds between submission retries.
+// RetryInterval is the unit of intervals in seconds between retry/poll attempts
+// in controller-managed loops (for example submission retries and pending rerun cleanup polling).
 // +optional
```
I think the submission retries phrasing should be removed since this is not a legitimate use case for RetryInterval.
```go
if app.Status.TerminationTime.IsZero() {
	app.Status.TerminationTime = metav1.Now()
}
timeUntilNextRetryDue, err := util.TimeUntilNextRetryDue(app)
```
The termination time is used for app failure tracking and therefore should not be used to calculate backoff for cleanup polling. Otherwise this will conflate the use case for this field.
As a practical example, if the app transitions from PENDING_RERUN to FAILED, the TTL calculation in reconcileTerminatedSparkApplication would use the cleanup start timestamp rather than the actual failure timestamp.
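To make the conflation concrete, here is a sketch of a TTL check measured from TerminationTime, assuming the spec's TimeToLiveSeconds field drives it (the helper name is illustrative): if TerminationTime is stamped when cleanup starts, the app looks older than it actually is once it later fails.

```go
// Sketch only: TTL measured from TerminationTime, so reusing that field as a
// "cleanup started" marker skews the expiry of an app that later fails.
func ttlExpired(app *v1beta2.SparkApplication, now time.Time) bool {
	if app.Spec.TimeToLiveSeconds == nil || app.Status.TerminationTime.IsZero() {
		return false
	}
	ttl := time.Duration(*app.Spec.TimeToLiveSeconds) * time.Second
	return now.Sub(app.Status.TerminationTime.Time) > ttl
}
```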
```go
			attemptsDone = int32(elapsed/baseInterval) + 1
		}
	}
}
```
Looking at the original code for TimeUntilNextRetryDue, there is an implicit assumption that this is a helper function that enables restarts to occur after app failures.
This change therefore intermingles two different concerns - app failure retries and cleanup polling - within the same block of code. It would be better to move the logic for cleanup polling to a different helper function to avoid confusion and maintain a clean separation of concerns.
Please refer to #2905 for more details.
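One possible shape for such a separate helper, keeping cleanup polling out of TimeUntilNextRetryDue; the function name and arguments are assumptions, and this simplified version polls on a fixed spec.retryInterval cadence rather than the linear backoff timing the commit describes.

```go
// timeUntilNextCleanupPollDue returns how long to wait before re-checking for
// leftover resources, measured from when cleanup started.
func timeUntilNextCleanupPollDue(retryIntervalSeconds int64, cleanupStartTime, now time.Time) time.Duration {
	baseInterval := time.Duration(retryIntervalSeconds) * time.Second
	if baseInterval <= 0 {
		baseInterval = 5 * time.Second // default discussed in this thread
	}
	elapsed := now.Sub(cleanupStartTime)
	attemptsDone := int64(elapsed/baseInterval) + 1
	nextDue := cleanupStartTime.Add(time.Duration(attemptsDone) * baseInterval)
	return nextDue.Sub(now)
}
```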
Summary
- Fixes the SparkApplication controller prematurely finalizing applications in PENDING_RERUN when the previous application resources still exist
- Keeps reconciling in PENDING_RERUN by returning a requeue result until old resources are cleaned up

Test plan
- Added a unit test verifying that pending rerun applications requeue in PENDING_RERUN when resources exist