Summary
As a platform engineer
I want to be able to alert only when the last job didn't succeed
So that I reduce alert fatigue in my team
Context
What I'm after is essentially alerting only when the last job for a schedule did not pass.
I'd also like the alert to be per-schedule, so that it can carry details such as the schedule's name, namespace, etc.
I was thinking of a k8up_schedule_last_job_succeded gauge, with a value of 1 when the last job succeeded and 0 when it failed.
Out of Scope
No response
Further links
No response
Acceptance Criteria
- A metric exists with enough labels to allow a user to alert on:
- last job failure - meaning if a backup succeeded an hour ago and failed 23h ago, I get no alert
- specific namespace
- specific schedule name
Implementation Ideas
It'd be another metric - let me know if that sounds good and fits the project well. I'd also be happy to contribute it.