[hma] Split /status into /livez and /readyz#1974

Open
aokolish wants to merge 3 commits into facebook:main from aokolish:split-health-probes

Collaborator

@aokolish aokolish commented Apr 30, 2026

Summary

Splits the combined /status health check into separate /livez (liveness) and /readyz (readiness) endpoints. Closes #1905.

Why the previous fix didn't work

#1912 disambiguated the cold-start case (INDEX-NOT-LOADED) from staleness (INDEX-STALE) in the response body, but most health-check clients only inspect the response status code. From their perspective both cases still return 503, so a liveness check and a readiness check that both point at /status are effectively the same check. While the index is loading, those 503 responses can trigger a restart loop before loading ever completes.
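
To make the failure mode concrete, here is a minimal sketch of how a status-code-only client sees the combined endpoint. It is a model, not HMA code: `combined_status`, `liveness_decision`, the order of the two checks, and the failure threshold of 3 are all assumptions for illustration.

```python
from typing import List, Tuple

RESTART_AFTER_FAILURES = 3  # assumed client threshold, not taken from the PR


def combined_status(index_loaded: bool, index_stale: bool) -> Tuple[str, int]:
    """Hypothetical model of the combined /status endpoint (body, status code)."""
    if not index_loaded:
        return "INDEX-NOT-LOADED", 503  # cold start
    if index_stale:
        return "INDEX-STALE", 503
    return "I-AM-ALIVE", 200


def liveness_decision(responses: List[Tuple[str, int]]) -> str:
    """Model of a client that ignores the body and counts consecutive non-2xx codes."""
    consecutive = 0
    for _body, code in responses:
        consecutive = 0 if 200 <= code < 300 else consecutive + 1
        if consecutive >= RESTART_AFTER_FAILURES:
            return "restart"
    return "keep"


# Cold start and staleness differ only in the body, which this client never reads.
cold_start = [combined_status(index_loaded=False, index_stale=False)] * 3
print(liveness_decision(cold_start))
```

With the threshold assumed here, three cold-start responses in a row are enough for the client to restart an instance that is only still loading.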

What this PR does

  • /livez — minimal liveness. Failing this should mean the instance needs to be restarted.
  • /readyz — readiness. Mirrors the existing /status checks
  • /status — left untouched for backwards compatibility.
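
A rough sketch of the split described in the list above, written as plain functions that return Flask-style `(body, status)` tuples so it runs without Flask installed. `IndexState` and its fields are hypothetical stand-ins for whatever the real checks consult, and the 200 body of `readyz` is assumed to match /status; in the actual app these functions would be registered as routes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class IndexState:
    """Hypothetical stand-in for the index health that /status inspects."""
    loaded: bool = False
    stale: bool = False


def livez() -> Tuple[str, int]:
    # Liveness: deliberately no index checks. If this code runs, the process is alive.
    return "OK", 200


def readyz(index: IndexState) -> Tuple[str, int]:
    # Readiness: same outcomes as the existing /status checks (assumed ordering).
    if not index.loaded:
        return "INDEX-NOT-LOADED", 503
    if index.stale:
        return "INDEX-STALE", 503
    return "I-AM-ALIVE", 200


# During cold start the instance is alive but not ready, so a status-code-only
# client keeps it running and simply withholds traffic.
cold = IndexState(loaded=False)
print(livez()[1], readyz(cold)[1])
```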

Migration

Example (Kubernetes):

livenessProbe:
  httpGet:
    path: /livez
    port: 5100
readinessProbe:
  httpGet:
    path: /readyz
    port: 5100

Test plan

  • added unit tests
  • built dev container and ran curl against all 3 status endpoints

Kubernetes HTTP probes only inspect the response status code, not the
body. The combined /status endpoint returned 503 for both
INDEX-NOT-LOADED (cold start) and INDEX-STALE, so pointing both
livenessProbe and readinessProbe at it was effectively the same probe -
and the cold-start 503 caused CrashLoopBackOff before the index could
finish loading.

Add /livez (always 200 if Flask can serve) and /readyz (mirrors the
existing /status checks) as separate endpoints. /status is left
untouched for backwards compatibility and marked deprecated in its
OpenAPI description.

Closes facebook#1905.
@meta-cla meta-cla Bot added the CLA Signed label Apr 30, 2026
@github-actions github-actions Bot added the hma Items related to the hasher-matcher-actioner system label Apr 30, 2026
Drop k8s-specific framing in the OpenAPI descriptions and test comment.
The endpoints work the same way under any health-check client that acts
on HTTP status codes (k8s, ALB/NLB, GCP LB, Consul, Envoy, etc.); k8s
is just one prominent example. Status-code-only is the lowest common
denominator the split is designed for, but the body strings remain
useful for body-aware clients (HAProxy http-check expect, Docker
HEALTHCHECK shell scripts, Azure Application Gateway match conditions).
@aokolish aokolish marked this pull request as ready for review April 30, 2026 04:55
Contributor

@Dcallies Dcallies left a comment

Since the ready endpoint is the same as status, let's keep using that instead of creating a new endpoint; then you don't need to go through the steps of deprecating it. Other comments inline.

return "I-AM-ALIVE", 200

@app.get(
"/livez",
Contributor

Can we nest this under status?

/status/live

return "OK", 200

@app.get(
"/readyz",
Contributor

Can we nest this under status?

/status/ready

Comment on lines +295 to +300
"Liveness check. Returns 200 if the process can serve HTTP. Does"
" not check index state - this is intentional, so that"
" health-check clients acting on status codes alone do not"
" restart the instance during cold start before the index has"
" loaded. Failing this should imply the process is wedged and"
" needs to be restarted."
Contributor

Since this appears to be motivated by Kubernetes needs, I would have expected to see that mentioned here.

},
summary="Health check",
description="Liveness/readiness check",
summary="Health check (deprecated)",
Contributor

Suggested change
summary="Health check (deprecated)",
summary="Health check",

" the database)."
),
)
def readyz():
Contributor

@Dcallies Dcallies May 4, 2026

blocking: Wait, this seems to be the same implementation as the original status. Why don't we leave the old one so you don't need to deprecate it?

" clients that act on HTTP status codes alone (most load balancers"
" and orchestrators) cannot distinguish the failure modes this"
" endpoint reports, so using it as a liveness check causes restart"
" loops during cold start while the index is still loading."
Contributor

Suggested change
" loops during cold start while the index is still loading."
"Returns 200 if the server is healthy and should serve traffic. If you need a separate check that the server is alive, check for a non-empty return from this endpoint or use a 200 return from /status/live"

Per Dcallies' review on facebook#1974:

- Drop /readyz: it duplicates /status. Keep /status as the readiness
  check (un-deprecated) with the reviewer's suggested description that
  points body-aware clients at /status and status-code-only liveness
  clients at /status/live.
- Rename /livez to /status/live to nest under /status.
- Mention Kubernetes (and other orchestrators / load balancers that act
  on status codes alone) in the /status/live description, since that's
  the motivating case.
- Return "I-AM-ALIVE" from /status/live so its body matches /status,
  letting body-aware clients use either endpoint identically.
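
For clients, the revised layout boils down to three checks. The sketch below is illustrative only: the function names are hypothetical, and `None` is used here to model "no response" (timeout or refused connection).

```python
from typing import Optional


def alive_via_status(body: Optional[str]) -> bool:
    # Body-aware liveness: any non-empty body from /status means the process
    # answered, even if the body is INDEX-NOT-LOADED or INDEX-STALE.
    return bool(body)


def alive_via_status_live(code: Optional[int]) -> bool:
    # Status-code-only liveness: a 200 from /status/live.
    return code == 200


def ready_via_status(code: Optional[int]) -> bool:
    # Readiness: a 200 from /status means the server should receive traffic.
    return code == 200


# Cold start: /status answers 503 with a body, /status/live answers 200.
print(alive_via_status("INDEX-NOT-LOADED"), alive_via_status_live(200), ready_via_status(503))
```

Either liveness check keeps a cold-starting instance running, while the readiness check keeps traffic away until /status returns 200.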
Contributor

@ThisIsMissEm ThisIsMissEm left a comment

LGTM

Labels

CLA Signed, hma (Items related to the hasher-matcher-actioner system)

Development

Successfully merging this pull request may close these issues.

[hma] Split health checkpoint into liveness and readiness

4 participants