Add retry policy with engine, CRUD API, scheduler integration, and OnPodError matcher #4804

Open
dejanzele wants to merge 5 commits into armadaproject:master from dejanzele:retry-engine-config

Conversation

dejanzele (Member) commented Mar 30, 2026

What this PR does

Adds RetryPolicy as a first-class concept. Operators define policies via armadactl, attach them to queues, and the scheduler evaluates them when a job run fails to decide whether to retry or terminally fail.

Everything is behind application.retryPolicy.enabled (default off). With the flag off there is no behaviour change: the engine is never invoked and the existing pod-name format is preserved byte-for-byte.

A policy is a list of rules. Each rule has matchers (conditions, exit codes, termination message, category/subcategory) and an action (Retry or Fail). First match wins. There's a per-policy retryLimit (counted in retries, not failures: retryLimit: 3 means up to 3 retries after the initial failure, 4 total attempts) and a scheduler-wide globalMaxRetries safety cap with the same semantics. Categories and subcategories come from the executor classifier introduced in #4891, so policies can match on the same buckets the UI already surfaces.
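To make those semantics concrete, here is a minimal Go sketch of the decision order, under assumed type and field names (the PR's actual identifiers may differ):

```go
package retry

// Illustrative types for the semantics described above; the PR's actual
// identifiers may differ.
type Action int

const (
	ActionFail Action = iota
	ActionRetry
)

type Failure struct {
	Condition string
	ExitCode  int32
}

type Rule struct {
	Action  Action
	Matches func(f Failure) bool // all matchers within a rule are ANDed
}

type Policy struct {
	Rules         []Rule
	DefaultAction Action
	RetryLimit    int // retries, not failures: 3 allows 4 total attempts
}

// ShouldRetry applies the order described above: the scheduler-wide cap
// short-circuits everything, then the first matching rule (or the default
// action) decides, and the per-policy limit is consulted only for Retry.
func ShouldRetry(p Policy, f Failure, retriesSoFar, globalMaxRetries int) bool {
	if retriesSoFar >= globalMaxRetries {
		return false
	}
	action := p.DefaultAction
	for _, r := range p.Rules {
		if r.Matches(f) { // first match wins
			action = r.Action
			break
		}
	}
	return action == ActionRetry && retriesSoFar < p.RetryLimit
}
```

Note the ordering: the global cap trumps any rule, and the per-policy limit only matters once a Retry action has been chosen.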

Pod naming changes for retried runs: when the flag is on, each retry attempt gets its own pod (armada-<jobId>-0-<runIndex>) so the executor doesn't have to mutate or delete the previous pod. The old armada-<jobId>-0 format is kept exactly when the flag is off.
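A sketch of that naming branch, assuming a hypothetical helper (the real change sits in the executor's pod-naming code):

```go
package util

import "fmt"

// podName sketches the conditional suffix. runIndex is a pointer because
// absence of the field on the lease must reproduce today's name exactly.
func podName(jobId string, runIndex *uint32) string {
	if runIndex == nil {
		return fmt.Sprintf("armada-%s-0", jobId) // legacy format, flag off
	}
	return fmt.Sprintf("armada-%s-0-%d", jobId, *runIndex)
}
```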

Commits

  1. Retry engine and matching primitives. Introduces the decision logic: Policy, Rule, Action, and Result types; regex pre-compilation at policy-load time; AND-within-a-rule plus first-match-wins across rules; and the global cap and per-policy retry-limit checks (pattern compilation is sketched just after this list). Pure Go with no callers; the engine is dead code at this point.
  2. CRUD API for retry policies. Makes RetryPolicy a first-class API resource: proto messages, gRPC service with REST gateway, postgres-backed repository, authorisation hooks, an armadactl surface (create/update/delete/get retry-policy), a Go client library, and a --retry-policy flag on armadactl create/update queue. Custom enum unmarshalling so YAML can use friendly names (Fail/Retry, In/NotIn) instead of the canonical RETRY_ACTION_* strings.
  3. Run-index plumbing. Each retry attempt should be a fresh pod, so the scheduler now tracks an attempt counter per run. Adds a run_index column to the runs table, an optional uint32 job_run_index field on JobRunLease, and a branch in the executor's pod-naming so the suffix becomes armada-<id>-0-<runIndex> when the field is set. Field absence reproduces today's exact pod name, which keeps the feature flag's zero-impact promise true even at the wire level.
  4. Wire the engine into the scheduler. The actual behaviour change. Adds a policy cache that periodically refreshes from the api, a proto-to-engine converter, and the call site that looks up the queue's policy on each run failure and asks the engine for a decision. Emits a non-terminal JobErrors on retries so the api event stream surfaces JobFailedEvent.retryable = true for the intermediate failures. Gangs are explicitly skipped here.
  5. End-to-end test plus design docs. A testsuite case that submits a pod which fails the first attempt and succeeds the second, reading the run-index from inside the container via the downward API so the only way the second attempt succeeds is if the engine actually retried and the executor stamped the new index on the new lease. Testsuite validator and watcher updates to assert and tolerate retryable: true Failed events. The gang-retry analysis lives in notes/retry-policy/gang-retry.md.
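As referenced in commit 1 above, a minimal sketch of load-time pattern compilation; the function name and error messages are illustrative assumptions, not the PR's actual code:

```go
package retry

import (
	"fmt"
	"regexp"
)

// compilePattern is a hypothetical stand-in for the pattern handling in
// the PR's CompileRules: compile once at policy-load time so per-failure
// matching never re-parses regexes, and reject empty patterns up front
// rather than letting them silently match everything.
func compilePattern(pattern string) (*regexp.Regexp, error) {
	if pattern == "" {
		return nil, fmt.Errorf("onTerminationMessagePattern must not be empty")
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid onTerminationMessagePattern %q: %w", pattern, err)
	}
	return re, nil
}
```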

Each commit builds, tests, and lints clean on its own. Most of the diff volume sits in regenerated proto/swagger files; the hand-written changes are in *.proto, the non-pb.go Go files, the SQL migration, and the sqlc queries.

How to validate

The testsuite case in commit 5 covers the full flow in CI. To exercise it against the goreman localdev stack:

  1. Bring up the dependencies and components. _local/scheduler/config.yaml already has the retry-policy flag enabled, so no config edits are needed:
    docker compose -f _local/docker-compose-deps.yaml up -d
    scripts/localdev-init.sh
    goreman -f _local/procfiles/no-auth.Procfile start
  2. Write a policy YAML:
    apiVersion: armadaproject.io/v1beta1
    kind: RetryPolicy
    name: my-policy
    retryLimit: 3
    defaultAction: Fail
    rules:
      - action: Retry
        onConditions: ["AppError"]
      - action: Retry
        onExitCodes:
          operator: In
          values: [42, 137]
    retryLimit is the number of retries allowed (3 here means up to 3 retries, 4 total attempts). defaultAction is what happens when no rule matches. Matchers within a rule are ANDed; rules are first-match-wins. Available matchers: onConditions, onExitCodes, onTerminationMessagePattern, onCategory (with optional onSubcategory).
  3. armadactl create -f policy.yaml
  4. armadactl create queue my-queue --retry-policy my-policy (or update an existing queue with armadactl update queue --retry-policy ...).
  5. Submit a job that fails on first attempt.

You should see one pod per attempt (armada-<id>-0-0, armada-<id>-0-1, ...) until a Fail rule (or the Fail default action) wins the match, the policy's retryLimit is hit, or the scheduler's globalMaxRetries cap kicks in. The scheduler logs a retry-decision line per failure with the matched rule and the resulting action.

The api event stream now carries JobFailedEvent.retryable = true for intermediate retries so clients and the testsuite can distinguish them from terminal failures.
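A hedged sketch of how a client might branch on that flag; the struct below merely mirrors the fields described and stands in for the real API message:

```go
package events

import "log"

// JobFailedEvent stands in for the real API message; the field names here
// are assumptions based on the description above.
type JobFailedEvent struct {
	JobId     string
	Retryable bool
}

func onJobFailed(ev JobFailedEvent) {
	if ev.Retryable {
		// Intermediate failure: the scheduler will retry, so don't treat
		// the job as terminally failed.
		log.Printf("job %s failed attempt, retry pending", ev.JobId)
		return
	}
	log.Printf("job %s failed terminally", ev.JobId)
}
```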

To run the testsuite case directly against the same stack:

go run cmd/testsuite/main.go test \
  --tests testsuite/testcases/retries/single_pod_retry.yaml \
  --config e2e/config/armadactl_config.yaml

What's deferred

  • Gangs are out of scope. Coordinating retries across gang members (atomic failure detection, run-index sync, killing healthy pods, retry state at gang level) is enough complexity to warrant its own PR. Gang jobs fall through to the existing behaviour: terminal failure for pod errors, lease-return retry up to maxAttemptedRuns when the executor returns the lease. Design notes in notes/retry-policy/gang-retry.md.
  • Backoff between attempts. Retries currently fire on the next scheduler cycle.
  • Removing the legacy lease-return retry code path. The new engine coexists with it for now; cleanup after this lands.
  • Lookout UI for run-index. Database has the data; UI work is a follow-up.

Issue

Part of #4683.

greptile-apps bot (Contributor) commented Mar 30, 2026

Greptile Summary

This PR introduces RetryPolicy as a first-class Armada concept, allowing operators to define per-queue rules that decide whether a failed job should be retried or permanently failed. The feature is fully gated behind application.retryPolicy.enabled (default off), preserving byte-identical behaviour for existing deployments.

  • Retry engine and CRUD API: Adds Policy, Rule, and Action types with AND-within-rule / first-match-wins semantics, validated at load time via CompileRules/ValidatePolicy. A gRPC/REST service backed by a Postgres repository exposes Create/Update/Delete/Get for retry policies, with full validation at write time so operators get immediate errors rather than silent cache-drop failures.
  • Run-index plumbing: Each retry attempt receives a zero-based run_index suffix on its pod name (armada-<id>-0-<runIndex>), propagated through a new job_run_index DB column, a JobRunLeased.run_index proto field, and a conditional branch in the executor's pod-naming path. When the flag is off the suffix is absent and the legacy format is preserved exactly.
  • Scheduler integration: A policy cache (periodic gRPC fetch, atomic.Pointer swap) feeds the engine at failure time; gang jobs skip the engine entirely. Policy-driven retries emit a non-terminal JobErrors so API consumers see JobFailedEvent.retryable=true for intermediate failures; terminal policy failures preserve the original FailureCategory/FailureSubcategory.
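The cache in the last bullet is a standard lock-free pointer swap. A minimal sketch under assumed names (the real implementation is internal/scheduler/retry/cache.go):

```go
package retry

import "sync/atomic"

type Policy struct{ Name string } // stand-in for the compiled policy type

// PolicyCache sketches the pattern: a background goroutine periodically
// fetches all policies over gRPC and swaps the whole map atomically, so
// readers on the scheduling hot path never take a lock.
type PolicyCache struct {
	policies atomic.Pointer[map[string]*Policy]
}

func (c *PolicyCache) Refresh(fetched map[string]*Policy) {
	c.policies.Store(&fetched)
}

// Get fails open: before the first successful fetch the pointer is nil and
// no policy is found, so failures fall through to the legacy retry path.
func (c *PolicyCache) Get(name string) (*Policy, bool) {
	m := c.policies.Load()
	if m == nil {
		return nil, false
	}
	p, ok := (*m)[name]
	return p, ok
}
```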

Confidence Score: 5/5

Safe to merge; the feature is fully gated behind a disabled-by-default flag and all previously flagged issues have been resolved.

All previously flagged issues (global-cap zero-value bug, empty-pattern validation, action validation, ConditionAppError constant, migration sequence, write-time policy validation, FailureCategory preservation, preemption count) have been addressed in this revision. The one remaining finding is a niche edge case requiring an unusual LeaseReturned policy configuration that only affects the reason string on intermediate API events, with no impact on scheduling correctness.

internal/server/event/conversion/conversions.go: the PodLeaseReturned error type has no explicit case and produces a JobFailedEvent with an empty reason for policies matching the LeaseReturned condition.

Important Files Changed

| Filename | Overview |
| --- | --- |
| internal/scheduler/retry/engine.go | Core retry decision logic; global-cap zero-value guard and per-policy limit checks are correctly implemented. |
| internal/scheduler/retry/matcher.go | Rule matching with an isContainerFailure guard that correctly prevents OnTerminationMessage from matching non-pod errors. |
| internal/scheduler/retry/types.go | CompileRules and validateRule cover action validation, non-empty matcher requirements, exit-code operator/values checks, and empty-pattern rejection. |
| internal/scheduler/retry/cache.go | atomic.Pointer swap for the policy map is safe; fail-open design (nil map on first fetch failure) is correct. |
| internal/scheduler/scheduler.go | Retry engine integration correctly overrides the legacy path; FailureCategory is preserved on terminal failures; preempted runs no longer count against the global cap. |
| internal/server/retrypolicy/service.go | Calls ConvertPolicy at write time for both Create and Update, giving operators immediate InvalidArgument errors for malformed policies. |
| internal/server/event/conversion/conversions.go | Surfaces non-terminal JobErrors with retryable=true; PodLeaseReturned has no explicit case and falls through to the default, producing a JobFailedEvent with an empty Reason for policies matching LeaseReturned. |
| internal/scheduler/retry/convert.go | Proto-to-engine conversion validates actions, operators, and empty values; deep-copies slices so cached policies don't retain API-response references. |
| internal/executor/util/kubernetes_object.go | Pod name suffix added correctly behind a JobRunIndex presence check; legacy format preserved when the field is nil. |
| internal/scheduler/jobdb/job.go | WithNewRun derives the run index from len(runsById), ensuring monotonically increasing, unique pod suffixes across preempted and retried runs. |
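The jobdb invariant in the last row is compact enough to sketch directly, assuming runs are never removed from runsById (simplified types):

```go
package jobdb

// nextRunIndex sketches the derivation: runs are never removed from
// runsById, so its size is always the next free index, monotonically
// increasing across preempted, returned, and retried runs.
func nextRunIndex(runsById map[string]struct{}) uint32 {
	return uint32(len(runsById))
}
```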

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Job Run Fails] --> B{retryPolicy.Enabled?}
    B -- No --> C[Legacy maxAttemptedRuns path]
    B -- Yes --> D{failFast?}
    D -- Yes --> E[Terminal failure]
    D -- No --> F{Queue has policy?}
    F -- No --> C
    F -- Yes --> G{Policy in cache?}
    G -- No --> C
    G -- Yes --> H[Engine.Evaluate policy runError failureCount]
    H --> I{globalMaxRetries exceeded?}
    I -- Yes --> E
    I -- No --> J{Rule matched?}
    J -- No --> K{DefaultAction = Fail?}
    K -- Yes --> E
    K -- No --> L{RetryLimit exceeded?}
    J -- Yes --> M{Rule.Action = Fail?}
    M -- Yes --> E
    M -- No --> L
    L -- Yes --> E
    L -- No --> N[Emit non-terminal JobErrors retryable=true]
    N --> O[Requeue job, bump QueuedVersion]
    O --> P[WithNewRun index=len runsById]
    E --> Q[Emit terminal MaxRunsExceeded with FailureCategory preserved]
    C --> R{lastRun.Returned and attempts < max?}
    R -- Yes --> O
    R -- No --> E


dejanzele added 5 commits May 11, 2026

Introduces a policy-based retry engine that evaluates Error protos against
configurable rules to decide whether a failed job run should be retried.
This is dead code - nothing calls the engine yet.

- types.go: Policy, Rule, Result, Action types with regex pre-compilation
- extract.go: extract condition, exit code, termination message, and
  failure_category/failure_subcategory from armadaevents.Error. Exit code
  and termination message come from the first non-empty ContainerError in
  the PodError; categories come from the top-level Error fields.
- matcher.go: AND-logic rule matching, first-match-wins rule list evaluation
- engine.go: Evaluate() with global cap, policy retry limit, and rule matching
- configuration: RetryPolicyConfig type, wired into SchedulingConfig
- config.yaml: default retryPolicy (disabled, globalMaxRetries=20)

Signed-off-by: Dejan Zele Pejchev <[email protected]>

Introduce RetryPolicy as a first-class API resource with full CRUD
operations. This is pure infrastructure with no scheduling behavior
changes.

Proto: Add RetryPolicy, RetryRule, RetryExitCodeMatcher messages and
RetryPolicyService gRPC service with REST gateway bindings on the
Submit service. Add retry_policy field to Queue message.

Server: Add retrypolicy package with PostgresRetryPolicyRepository
(stores serialized proto in retry_policy table) and Server handler
with authorization checks. Wire into server startup and register
the gRPC service. Add CreateRetryPolicy/DeleteRetryPolicy permissions.

Client: Add pkg/client/retrypolicy with Create/Update/Delete/Get/GetAll
functions matching the queue client pattern.

CLI: Add armadactl commands for create/update/delete/get retry-policy
and get retry-policies, all using file-based input for create/update.
Add --retry-policy flag to queue create and update commands.
Add RetryPolicy as a valid ResourceKind for file-based creation.

Signed-off-by: Dejan Zele Pejchev <[email protected]>