[hma] Split /status into /livez and /readyz#1974
[hma] Split /status into /livez and /readyz#1974aokolish wants to merge 3 commits intofacebook:mainfrom
Conversation
Kubernetes HTTP probes only inspect the response status code, not the body. The combined /status endpoint returned 503 for both INDEX-NOT-LOADED (cold start) and INDEX-STALE, so pointing both livenessProbe and readinessProbe at it was effectively the same probe - and the cold-start 503 caused CrashLoopBackOff before the index could finish loading. Add /livez (always 200 if Flask can serve) and /readyz (mirrors the existing /status checks) as separate endpoints. /status is left untouched for backwards compatibility and marked deprecated in its OpenAPI description. Closes facebook#1905.
Drop k8s-specific framing in the OpenAPI descriptions and test comment. The endpoints work the same way under any health-check client that acts on HTTP status codes (k8s, ALB/NLB, GCP LB, Consul, Envoy, etc.); k8s is just one prominent example. Status-code-only is the lowest common denominator the split is designed for, but the body strings remain useful for body-aware clients (HAProxy http-check expect, Docker HEALTHCHECK shell scripts, Azure Application Gateway match conditions).
Dcallies
left a comment
There was a problem hiding this comment.
Since the ready endpoint is the same as status, let's keep using that instead of creating a new endpoint, then you don't need to go through the steps of deprecated it. Other comments inline.
| return "I-AM-ALIVE", 200 | ||
|
|
||
| @app.get( | ||
| "/livez", |
There was a problem hiding this comment.
Can we nest this under status?
/status/live
| return "OK", 200 | ||
|
|
||
| @app.get( | ||
| "/readyz", |
There was a problem hiding this comment.
Can we nest this under status?
/status/ready
| "Liveness check. Returns 200 if the process can serve HTTP. Does" | ||
| " not check index state - this is intentional, so that" | ||
| " health-check clients acting on status codes alone do not" | ||
| " restart the instance during cold start before the index has" | ||
| " loaded. Failing this should imply the process is wedged and" | ||
| " needs to be restarted." |
There was a problem hiding this comment.
Since this appears to be motivated by kubernetes needs, I would have expected to see that mentioned here.
| }, | ||
| summary="Health check", | ||
| description="Liveness/readiness check", | ||
| summary="Health check (deprecated)", |
There was a problem hiding this comment.
| summary="Health check (deprecated)", | |
| summary="Health check", |
| " the database)." | ||
| ), | ||
| ) | ||
| def readyz(): |
There was a problem hiding this comment.
blocking: Wait, this seems to be the same implementation as the original status. Why don't we leave the old one so you don't need to deprecate it?
| " clients that act on HTTP status codes alone (most load balancers" | ||
| " and orchestrators) cannot distinguish the failure modes this" | ||
| " endpoint reports, so using it as a liveness check causes restart" | ||
| " loops during cold start while the index is still loading." |
There was a problem hiding this comment.
| " loops during cold start while the index is still loading." | |
| "Returns 200 if the server is healthy and should serve traffic. If you need a separate check that the server is alive, check for a non-empty return from this endpoint or use a 200 return from /status/live" |
Per Dcallies' review on facebook#1974: - Drop /readyz: it duplicates /status. Keep /status as the readiness check (un-deprecated) with the reviewer's suggested description that points body-aware clients at /status and status-code-only liveness clients at /status/live. - Rename /livez to /status/live to nest under /status. - Mention Kubernetes (and other orchestrators / load balancers that act on status codes alone) in the /status/live description, since that's the motivating case. - Return "I-AM-ALIVE" from /status/live so its body matches /status, letting body-aware clients use either endpoint identically.
Summary
Splits the combined
/statushealth check into separate/livez(liveness) and/readyz(readiness) endpoints. Closes #1905.Why the previous fix didn't work
#1912 disambiguated the cold-start case (
INDEX-NOT-LOADED) from staleness (INDEX-STALE) in the response body, but most health-check clients only inspect the response status code. From their perspective both still return503, so pointing both a liveness and a readiness check at/statusis effectively the same check, and during index reloading those503responses can cause a restart loop before the index can finish loading.What this PR does
/livez— minimal liveness. Failing this should mean the instance needs to be restarted./readyz— readiness. Mirrors the existing/statuschecks/status— left untouched for backwards compatibility.Migration
Example (Kubernetes):
Test plan