Skip to content

Add onPodError matcher to categorize pre-startup pod failures#4891

Merged
dejanzele merged 3 commits intoarmadaproject:masterfrom
dejanzele:categorizer-on-pod-error
May 8, 2026
Merged

Add onPodError matcher to categorize pre-startup pod failures#4891
dejanzele merged 3 commits intoarmadaproject:masterfrom
dejanzele:categorizer-on-pod-error

Conversation

@dejanzele
Copy link
Copy Markdown
Member

@dejanzele dejanzele commented Apr 30, 2026

Summary

This PR extends the failure categorizer with two operator-facing additions:

  1. onPodError matcher - matches pod-level error text, covering pre-startup kubelet/runtime errors (image pull, missing volume, missing ConfigMap/Secret) and Armada-detected pod-level failures (stuck terminating, active deadline exceeded, externally deleted) that produce no useful container terminationMessage. These end up with empty failure_category / failure_subcategory in lookoutdb today.
  2. hint field on rules - operator-supplied user-facing copy describing the failure mode. When set, it is appended to the failure message that lands in lookoutdb.job_run.error, so end users see actionable guidance alongside the raw runtime error.

The two commits are independently useful; together they let operators both classify and explain previously-opaque pod-level failures.

Approach

onPodError

  • New rule field on CategoryRule. Matches a regex against the issue's pod-level error message. ContainerName scoping is ignored (pod-level text has no container attribution).
  • onTerminationMessage is unchanged - still matches container Terminated.Message, still honors ContainerName. Non-overlapping data source from OnPodError by design.
  • Classify is split into ClassifyContainerError(pod) and ClassifyPodError(pod, podErrorMessage). The pod-error variant is needed because kubelet rotates Waiting.Reason from ErrImagePull to ImagePullBackOff within seconds, replacing Waiting.Message with a generic backoff string, so by the time Armada classifies the pod the runtime error is no longer in pod.Status and must be passed in by the caller.
  • Config validation: a rule must specify exactly one of the four matchers; regex matchers compile-check at startup so invalid patterns fail fast.

hint

  • Optional Hint string on CategoryRule. Empty by default - no behavior change for existing configs.
  • The classifier returns the matched rule's hint alongside category/subcategory in ClassifyResult.
  • pod_issue_handler and reporter/event.go append the hint to the user-facing message with two newlines: "<original message>\n\n<hint>". Appended (not prepended) so the raw runtime error stays the lede.
  • Hints flow through the existing PodError -> Pulsar -> lookout-ingester pipeline into lookoutdb.job_run.error.

Validation

To reproduce on local dev:

1. Add a rule with both onPodError and hint (_local/executor/config.yaml under application:):

application:
  errorCategories:
    defaultCategory: "uncategorized"
    defaultSubcategory: "unknown"
    categories:
      - name: infrastructure
        rules:
          - onPodError:
              pattern: "no match for platform in manifest"
            subcategory: "platform_mismatch"
            hint: "Build the image for the cluster's CPU architecture (typically x64/arm64 mismatch)."

Categorization is opt-in: Armada ships no default rules.

2. Submit a wrong-arch job (example/platform-mismatch.yaml):

queue: test
jobSetId: platform-mismatch-repro
jobs:
  - namespace: default
    priority: 0
    podSpec:
      terminationGracePeriodSeconds: 0
      restartPolicy: Never
      containers:
        - name: wrong-arch
          image: amd64/busybox:latest
          command:
            - sh
            - -c
            - echo should-never-run
          resources:
            requests:
              memory: 64Mi
              cpu: "0.1"
            limits:
              memory: 64Mi
              cpu: "0.1"
armadactl create queue test
armadactl submit example/platform-mismatch.yaml

3. Wait for the kubelet event-based fail check to fire (typically 1-5 minutes).

4. Verify the categorization and hint landed:

docker exec postgres psql -U postgres -d lookout -c \
  "SELECT job_id, run_id, finished, failure_category, failure_subcategory,
          convert_from(decompress(error), 'UTF8') AS error_text
     FROM job_run
    ORDER BY finished DESC NULLS LAST
    LIMIT 1;"

Expected:

 failure_category | failure_subcategory
------------------+---------------------
 infrastructure   | platform_mismatch

 error_text
------------
 Pod of the job failed
 ...
 Failed to pull image "amd64/busybox:latest": no match for platform in manifest
 ...

 Build the image for the cluster's CPU architecture (typically x64/arm64 mismatch).

The hint appears on its own paragraph after the raw kubelet error.

Live-validated end-to-end on macOS arm64 (M3) against a k3d cluster.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR extends the executor's failure categorizer with two new operator-facing capabilities: an onPodError matcher for pod-level error messages (covering image pull, missing volume, and Armada-detected pre-terminal failures that have no useful container terminationMessage), and an optional hint field on rules that appends operator-supplied guidance to the user-facing failure message in lookoutdb.job_run.error.

  • onPodError matcher: Matched in ClassifyPodError (renamed from Classify) against a pod error message string passed by the caller; the pre-existing container-level matchers (onConditions, onExitCodes, onTerminationMessage) are unaffected and continue to work as before. ContainerName scoping is intentionally ignored for onPodError because the pod-level error has no container attribution.
  • hint field: Stored on the matched ClassifyResult and appended via the new AppendHint helper at both call sites (pod_issue_handler for non-retryable issues, event.go for terminated-pod state reporting), preventing duplicate logic divergence.
  • Caller split: ClassifyContainerError (pod state reporter) and ClassifyPodError (pod issue handler) delegate to a private classify(pod, podErrorMessage) method; passing an empty string from ClassifyContainerError ensures onPodError rules never fire on the terminated-pod path.

Confidence Score: 5/5

Safe to merge; the changes are additive and backward-compatible — operators who ship no errorCategories config see no behavior change.

The onPodError matcher and hint field are opt-in additions with no mutation of existing behavior. Both call sites that emit failure events were updated consistently, and the private classify helper ensures the two public methods stay in sync. The guard podErrorMessage != "" in ruleMatches prevents onPodError rules from accidentally matching on the ClassifyContainerError path. Test coverage reaches from unit-level regex compilation through integration-level event emission, including ordering assertions that prevent hint-prepend regressions.

No files require special attention.

Important Files Changed

Filename Overview
internal/executor/categorizer/classifier.go Core classifier logic: splits Classify into ClassifyContainerError/ClassifyPodError, adds onPodError regex field, centralizes hint appending in AppendHint. Logic is clean with proper nil/empty guards.
internal/executor/categorizer/types.go Adds OnPodError and Hint fields to CategoryRule with clear YAML tags and documentation on the ContainerName scoping semantics.
internal/executor/service/pod_issue_handler.go Switches handleNonRetryableJobIssue from Classify to ClassifyPodError(pod, podIssue.Message) and appends hint before constructing the failed event. Non-retryable path is the only path that produces a failure event, so classification is correctly scoped.
internal/executor/reporter/event.go Calls AppendHint on the extracted pod failure reason before building the JobRunErrors event for the PodFailed state-change path. The two hint-appending call sites use the same helper, preventing format drift.
internal/executor/service/job_state_reporter.go Mechanical rename from Classify to ClassifyContainerError; no behavior change in the state-reporter path.
internal/executor/categorizer/classifier_test.go Adds thorough test cases: onPodError match/no-match, ContainerName ignored, onTerminationMessage isolation, AppendHint unit tests, and validation-error tests for onPodError invalid regex.
internal/executor/service/pod_issue_handler_test.go Integration tests cover the full onPodError + hint pipeline from HandlePodIssues() through to the emitted event message, including ordering assertion (hint comes after raw error).
internal/executor/reporter/event_test.go Adds a hint-appending regression test for CreateEventForCurrentState on the terminated-pod path, asserting raw error appears before the hint in the final event message.
internal/executor/categorizer/doc.go Documentation updated: removes incorrectly listed AppError from OnConditions, adds OnPodError and Hint examples, and updates the API usage snippet to the renamed methods.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Pod Event / State Change] --> B{Pod Phase?}
    B -->|PodFailed| C[JobStateReporter.reportCurrentStatus]
    B -->|Pending / Unknown| D[PodIssueHandler.detectPodIssues]
    B -->|Running - deadline exceeded| D
    B -->|Stuck Terminating| D
    C --> E[ClassifyContainerError pod]
    E --> F[classify pod, podErrorMessage='']
    F --> G{Rule type?}
    G -->|onConditions / onExitCodes / onTerminationMessage| H[Match against pod state]
    G -->|onPodError| I[podErrorMessage == '' - no match]
    H --> J[ClassifyResult + Hint]
    I --> J
    J --> K[AppendHint ExtractPodFailedReason pod]
    K --> L[CreateEventForCurrentState - Pulsar]
    D --> M{Retryable?}
    M -->|Yes| N[handleRetryableJobIssue - ReturnLease]
    M -->|No| O[handleNonRetryableJobIssue]
    O --> P[ClassifyPodError pod, podIssue.Message]
    P --> Q[classify pod, podErrorMessage=podIssue.Message]
    Q --> R{Rule type?}
    R -->|onConditions / onExitCodes / onTerminationMessage| S[Match against pod state]
    R -->|onPodError| T[Match regex against podErrorMessage]
    S --> U[ClassifyResult + Hint]
    T --> U
    U --> V[AppendHint podIssue.Message]
    V --> W[CreateJobFailedEvent - Pulsar]
Loading

Reviews (16): Last reviewed commit: "Merge branch 'master' into categorizer-o..." | Re-trigger Greptile

@dejanzele dejanzele force-pushed the categorizer-on-pod-error branch 6 times, most recently from e36ca03 to d1690ff Compare April 30, 2026 13:10
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

mauriceyap
mauriceyap previously approved these changes Apr 30, 2026
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

@dejanzele dejanzele force-pushed the categorizer-on-pod-error branch 4 times, most recently from 9202ede to 7c651a1 Compare May 6, 2026 12:07
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

@dejanzele dejanzele force-pushed the categorizer-on-pod-error branch from 7c651a1 to 2c63c74 Compare May 8, 2026 10:03
@dejanzele dejanzele enabled auto-merge (squash) May 8, 2026 13:23
@dejanzele dejanzele merged commit 1244021 into armadaproject:master May 8, 2026
27 of 29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants