Add retry policy with engine, CRUD API, scheduler integration, and OnPodError matcher #4804

dejanzele wants to merge 5 commits
Conversation
Greptile Summary

This PR introduces RetryPolicy as a first-class Armada concept, allowing operators to define per-queue rules that decide whether a failed job should be retried or permanently failed. The feature is fully gated behind the `application.retryPolicy.enabled` flag, which is disabled by default.
Confidence Score: 5/5 — safe to merge. The feature is fully gated behind a disabled-by-default flag, and all previously flagged issues (global-cap zero-value bug, empty-pattern validation, action validation, ConditionAppError constant, migration sequence, write-time policy validation, FailureCategory preservation, preemption count) have been addressed in this revision. The one remaining finding is a niche edge case requiring an unusual LeaseReturned policy configuration; it only affects the reason string on intermediate API events, with no impact on scheduling correctness: in internal/server/event/conversion/conversions.go, the PodLeaseReturned error type has no explicit case and produces a JobFailedEvent with an empty reason for policies matching the LeaseReturned condition.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Job Run Fails] --> B{retryPolicy.Enabled?}
B -- No --> C[Legacy maxAttemptedRuns path]
B -- Yes --> D{failFast?}
D -- Yes --> E[Terminal failure]
D -- No --> F{Queue has policy?}
F -- No --> C
F -- Yes --> G{Policy in cache?}
G -- No --> C
G -- Yes --> H[Engine.Evaluate policy runError failureCount]
H --> I{globalMaxRetries exceeded?}
I -- Yes --> E
I -- No --> J{Rule matched?}
J -- No --> K{DefaultAction = Fail?}
K -- Yes --> E
K -- No --> L{RetryLimit exceeded?}
J -- Yes --> M{Rule.Action = Fail?}
M -- Yes --> E
M -- No --> L
L -- Yes --> E
L -- No --> N[Emit non-terminal JobErrors retryable=true]
N --> O[Requeue job, bump QueuedVersion]
O --> P[WithNewRun index=len runsById]
E --> Q[Emit terminal MaxRunsExceeded with FailureCategory preserved]
C --> R{lastRun.Returned and attempts < max?}
R -- Yes --> O
R -- No --> E
```
Reviews (30). Last reviewed commit: "Add retry policy testsuite case".
Introduces a policy-based retry engine that evaluates Error protos against configurable rules to decide whether a failed job run should be retried. This is dead code - nothing calls the engine yet.

- types.go: Policy, Rule, Result, Action types with regex pre-compilation
- extract.go: extracts condition, exit code, termination message, and failure_category/failure_subcategory from armadaevents.Error. Exit code and termination message come from the first non-empty ContainerError in the PodError; categories come from the top-level Error fields.
- matcher.go: AND-logic rule matching, first-match-wins rule-list evaluation
- engine.go: Evaluate() with global cap, policy retry limit, and rule matching
- configuration: RetryPolicyConfig type, wired into SchedulingConfig
- config.yaml: default retryPolicy (disabled, globalMaxRetries=20)

Signed-off-by: Dejan Zele Pejchev <[email protected]>
Introduce RetryPolicy as a first-class API resource with full CRUD operations. This is pure infrastructure with no scheduling behavior changes.

- Proto: Add RetryPolicy, RetryRule, RetryExitCodeMatcher messages and a RetryPolicyService gRPC service with REST gateway bindings on the Submit service. Add retry_policy field to the Queue message.
- Server: Add retrypolicy package with PostgresRetryPolicyRepository (stores serialized proto in the retry_policy table) and a Server handler with authorization checks. Wire into server startup and register the gRPC service. Add CreateRetryPolicy/DeleteRetryPolicy permissions.
- Client: Add pkg/client/retrypolicy with Create/Update/Delete/Get/GetAll functions matching the queue client pattern.
- CLI: Add armadactl commands for create/update/delete/get retry-policy and get retry-policies, all using file-based input for create/update. Add --retry-policy flag to queue create and update commands. Add RetryPolicy as a valid ResourceKind for file-based creation.

Signed-off-by: Dejan Zele Pejchev <[email protected]>
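To illustrate the friendly-name unmarshalling this commit adds, a rule fragment might be written as follows. This is a sketch: the field names (`rules`, `onExitCodes`, `operator`, `values`) and the exit-code values are assumptions based on the matchers described in this PR, not copied from the diff.

```yaml
# Sketch: `In` and `Retry` are the friendly YAML spellings; without the
# custom unmarshalling, the canonical RETRY_ACTION_*-style enum strings
# would be required instead.
rules:
  - onExitCodes:
      operator: In       # friendly name rather than the canonical enum string
      values: [1, 137]   # hypothetical exit codes
    action: Retry        # friendly name for the retry action
```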
What this PR does
Adds RetryPolicy as a first-class concept. Operators define policies via armadactl, attach them to queues, and the scheduler evaluates them when a job run fails to decide whether to retry or terminally fail.
Everything is behind `application.retryPolicy.enabled` (default off). With the flag off there is no behaviour change: the engine is never invoked and the existing pod-name format is preserved byte-for-byte.
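For orientation, a sketch of the flag's config stanza. The key names come from this PR (`application.retryPolicy.enabled`, `globalMaxRetries`, defaulting to 20 in config.yaml), but the exact nesting in the shipped file is an assumption:

```yaml
# Sketch, not the shipped file: the PR's config.yaml defaults this to
# disabled, with globalMaxRetries set to 20.
application:
  retryPolicy:
    enabled: false        # flip on to let the scheduler consult queue policies
    globalMaxRetries: 20  # scheduler-wide safety cap, counted in retries
```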
A policy is a list of rules. Each rule has matchers (conditions, exit codes, termination message, category/subcategory) and an action (Retry or Fail). First match wins. There's a per-policy `retryLimit` (counted in retries, not failures: `retryLimit: 3` means up to 3 retries after the initial failure, 4 total attempts) and a scheduler-wide `globalMaxRetries` safety cap with the same semantics. Categories and subcategories come from the executor classifier introduced in #4891, so policies can match on the same buckets the UI already surfaces.

Pod naming changes for retried runs: when the flag is on, each retry attempt gets its own pod (`armada-<jobId>-0-<runIndex>`) so the executor doesn't have to mutate or delete the previous pod. The old `armada-<jobId>-0` format is kept exactly when the flag is off.

Commits
- Retry engine: `Policy`, `Rule`, `Action`, and `Result` types, regex pre-compilation at policy-load time, AND-within-a-rule plus first-match-wins across rules, and the global-cap and per-policy retry-limit checks. Pure Go with no callers; the engine is dead code at this point.
- CRUD API: armadactl commands (`create/update/delete/get retry-policy`), a Go client library, and a `--retry-policy` flag on `armadactl create/update queue`. Custom enum unmarshalling so YAML can use friendly names (`Fail`/`Retry`, `In`/`NotIn`) instead of the canonical `RETRY_ACTION_*` strings.
- Per-attempt pod naming: a `run_index` column on the `runs` table, an `optional uint32 job_run_index` field on `JobRunLease`, and a branch in the executor's pod naming so the suffix becomes `armada-<id>-0-<runIndex>` when the field is set. Field absence reproduces today's exact pod name, which keeps the feature flag's zero-impact promise true even at the wire level.
- Scheduler integration: non-terminal `JobErrors` on retries so the api event stream surfaces `JobFailedEvent.retryable = true` for the intermediate failures. Gangs are explicitly skipped here.
- Testsuite case: checks for `retryable: true` Failed events. The gang-retry analysis lives in `notes/retry-policy/gang-retry.md`.

Each commit builds, tests, and lints clean on its own. Most of the diff volume sits in regenerated proto/swagger files; the hand-written changes are in `*.proto`, the non-`pb.go` Go files, the SQL migration, and the sqlc queries.

How to validate
The testsuite case in commit 5 covers the full flow in CI. To exercise it against the goreman localdev stack:
1. `_local/scheduler/config.yaml` already has the retry-policy flag enabled, so no config edits are needed.
2. Write a policy file. `retryLimit` is the number of retries allowed (3 means up to 3 retries, 4 total attempts). `defaultAction` is what happens when no rule matches. Matchers within a rule are ANDed; rules are first-match-wins. Available matchers: `onConditions`, `onExitCodes`, `onTerminationMessagePattern`, `onCategory` (with optional `onSubcategory`). A sketch of such a file follows this list.
3. `armadactl create -f policy.yaml`
4. `armadactl create queue my-queue --retry-policy my-policy` (or update an existing queue with `armadactl update queue --retry-policy ...`).
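A minimal `policy.yaml` sketch, assuming the YAML fields mirror the matcher and action names listed above; the top-level layout (kind, name, rules) and the example matcher values are assumptions rather than something this PR's description spells out:

```yaml
# Sketch only. Matcher/action names come from this PR's description; the
# top-level layout and the example values are assumptions.
kind: RetryPolicy        # RetryPolicy is a valid ResourceKind for `armadactl create -f`
name: my-policy
retryLimit: 3            # up to 3 retries after the initial failure, 4 attempts total
defaultAction: Fail      # applied when no rule matches
rules:
  - onExitCodes:
      operator: In       # friendly enum names (In/NotIn) work in YAML
      values: [137]      # hypothetical exit code
    action: Retry
  - onTerminationMessagePattern: ".*transient.*"   # hypothetical pattern
    action: Retry
  - onCategory: some-category        # hypothetical values; the real buckets come
    onSubcategory: some-subcategory  # from the executor classifier in #4891; ANDed
    action: Fail
```

With a file along these lines, steps 3 and 4 above can be run verbatim.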
You should see one pod per attempt (`armada-<id>-0-0`, `armada-<id>-0-1`, ...) until either a Retry rule stops matching, the policy's `retryLimit` is hit, or the scheduler's `globalMaxRetries` cap kicks in. The scheduler logs a "retry decision" line per failure with the matched rule and the resulting action.
The api event stream now carries `JobFailedEvent.retryable = true` for intermediate retries so clients and the testsuite can distinguish them from terminal failures.

To run the testsuite case directly against the same stack:
```bash
go run cmd/testsuite/main.go test \
  --tests testsuite/testcases/retries/single_pod_retry.yaml \
  --config e2e/config/armadactl_config.yaml
```

What's deferred
Gang retries: gang jobs keep the legacy `maxAttemptedRuns` behaviour when the executor returns the lease. Design notes in `notes/retry-policy/gang-retry.md`.

Issue
Part of #4683.