Add onPodError matcher to categorize pre-startup pod failures #4891
dejanzele merged 3 commits into armadaproject:master
Greptile Summary
This PR extends the executor's failure categorizer with two new operator-facing capabilities: an

Confidence Score: 5/5
Safe to merge; the changes are additive and backward-compatible: operators who ship no `errorCategories` config see no behavior change.

The `onPodError` matcher and `hint` field are opt-in additions with no mutation of existing behavior. Both call sites that emit failure events were updated consistently, and the private `classify` helper ensures the two public methods stay in sync. The guard `podErrorMessage != ""` in `ruleMatches` prevents `onPodError` rules from accidentally matching on the `ClassifyContainerError` path. Test coverage reaches from unit-level regex compilation through integration-level event emission, including ordering assertions that prevent hint-prepend regressions.

No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Pod Event / State Change] --> B{Pod Phase?}
B -->|PodFailed| C[JobStateReporter.reportCurrentStatus]
B -->|Pending / Unknown| D[PodIssueHandler.detectPodIssues]
B -->|Running - deadline exceeded| D
B -->|Stuck Terminating| D
C --> E[ClassifyContainerError pod]
E --> F[classify pod, podErrorMessage='']
F --> G{Rule type?}
G -->|onConditions / onExitCodes / onTerminationMessage| H[Match against pod state]
G -->|onPodError| I[podErrorMessage == '' - no match]
H --> J[ClassifyResult + Hint]
I --> J
J --> K[AppendHint ExtractPodFailedReason pod]
K --> L[CreateEventForCurrentState - Pulsar]
D --> M{Retryable?}
M -->|Yes| N[handleRetryableJobIssue - ReturnLease]
M -->|No| O[handleNonRetryableJobIssue]
O --> P[ClassifyPodError pod, podIssue.Message]
P --> Q[classify pod, podErrorMessage=podIssue.Message]
Q --> R{Rule type?}
R -->|onConditions / onExitCodes / onTerminationMessage| S[Match against pod state]
R -->|onPodError| T[Match regex against podErrorMessage]
S --> U[ClassifyResult + Hint]
T --> U
U --> V[AppendHint podIssue.Message]
V --> W[CreateJobFailedEvent - Pulsar]
Signed-off-by: Dejan Zele Pejchev <[email protected]>
Summary
This PR extends the failure categorizer with two operator-facing additions:
- `onPodError` matcher - matches pod-level error text, covering pre-startup kubelet/runtime errors (image pull, missing volume, missing ConfigMap/Secret) and Armada-detected pod-level failures (stuck terminating, active deadline exceeded, externally deleted) that produce no useful container `terminationMessage`. These end up with empty `failure_category`/`failure_subcategory` in lookoutdb today.
- `hint` field on rules - operator-supplied user-facing copy describing the failure mode. When set, it is appended to the failure message that lands in `lookoutdb.job_run.error`, so end users see actionable guidance alongside the raw runtime error.

The two commits are independently useful; together they let operators both classify and explain previously-opaque pod-level failures.
Approach
- `onPodError` matcher on `CategoryRule`. Matches a regex against the issue's pod-level error message. `ContainerName` scoping is ignored (pod-level text has no container attribution).
- `onTerminationMessage` is unchanged - still matches container `Terminated.Message`, still honors `ContainerName`. Non-overlapping data source from `OnPodError` by design.
- `Classify` is split into `ClassifyContainerError(pod)` and `ClassifyPodError(pod, podErrorMessage)`. The pod-error variant is needed because kubelet rotates `Waiting.Reason` from `ErrImagePull` to `ImagePullBackOff` within seconds, replacing `Waiting.Message` with a generic backoff string, so by the time Armada classifies the pod the runtime error is no longer in `pod.Status` and must be passed in by the caller.
- `hint`: `Hint string` on `CategoryRule`. Empty by default - no behavior change for existing configs. The hint is returned alongside `ClassifyResult`.
- `pod_issue_handler` and `reporter/event.go` append the hint to the user-facing message with two newlines: `"<original message>\n\n<hint>"`. Appended (not prepended) so the raw runtime error stays the lede.
- The message then flows through the `PodError` -> Pulsar -> lookout-ingester pipeline into `lookoutdb.job_run.error`.

Validation
To reproduce on local dev:
1. Add a rule with both `onPodError` and `hint` (`_local/executor/config.yaml` under `application:`). Categorization is opt-in: Armada ships no default rules.
2. Submit a wrong-arch job (`example/platform-mismatch.yaml`): run `armadactl create queue test`, then `armadactl submit example/platform-mismatch.yaml`.
3. Wait for the kubelet event-based fail check to fire (typically 1-5 minutes).
4. Verify the categorization and hint landed.

Expected: the hint appears as its own paragraph after the raw kubelet error.

Live-validated end-to-end on macOS arm64 (M3) against a k3d cluster.
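
The append format described under Approach (`"<original message>\n\n<hint>"`) can be sketched in Go as below. The helper name `appendHint` and its signature are assumptions for illustration; the PR's actual helper is not shown here. The sample error text and hint are invented, not real kubelet output.

```go
package main

import "fmt"

// appendHint is a hypothetical helper: it returns the message unchanged when
// no hint is configured, otherwise it appends the hint after a blank line so
// the raw runtime error stays first.
func appendHint(message, hint string) string {
	if hint == "" {
		return message // empty hint: no behavior change
	}
	return message + "\n\n" + hint
}

func main() {
	// Illustrative strings only.
	raw := "Failed to pull image: no matching manifest for this platform"
	hint := "Rebuild the image for the cluster's CPU architecture."

	fmt.Println(appendHint(raw, hint))
}
```

Because the hint is appended rather than prepended, anything that truncates the stored message from the end would drop the hint before the raw error.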