
[codex] Document node renderer health probes#3241

Open
justin808 wants to merge 78 commits into main from codex/node-renderer-probe-docs

Conversation


@justin808 justin808 commented May 3, 2026

Summary

Adds node renderer readiness and liveness probe guidance to the OSS node renderer JavaScript configuration docs.

The new section covers:

  • using /health or /info for shallow HTTP probes
  • the difference between readiness and liveness checks
  • avoiding external dependency checks in the renderer liveness probe
  • Control Plane deployment shapes for localhost, same-workload container, and separate workload setups

Validation

  • pnpm --dir /Users/justin/codex/react_on_rails exec prettier --check /Users/justin/codex/react_on_rails-pr-probes/docs/oss/building-features/node-renderer/js-configuration.md
  • pre-commit hook: trailing newline check, Markdown link check, Prettier
  • pre-push hook: branch lint, online Markdown link check

Note

Low Risk
Low risk documentation-only change; no runtime code paths are modified. Main risk is users misconfiguring probes/host binding based on the new guidance.

Overview
Adds detailed guidance for configuring Node Renderer startup/readiness/liveness probes in container environments, including recommended defaults (h2c-aware exec readiness via curl --http2-prior-knowledge, tcpSocket liveness) and timing/timeout advice (e.g., --max-time buffers, timeoutSeconds: 1 for TCP probes).

Documents the unauthenticated built-in GET /info endpoint (version exposure) vs a custom /health route, and expands deployment-shape guidance for Control Plane (same Rails container, sidecar container, separate renderer workload) with corresponding renderer_url, host binding, and probe targeting recommendations. Updates Compose/Kubernetes examples accordingly.
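The recommended probe shapes can be sketched as a Kubernetes container-spec fragment. This is a hedged illustration assembled from the recommendations in this PR (h2c-aware exec readiness via `curl --http2-prior-knowledge`, `tcpSocket` liveness, `--max-time` buffers, `timeoutSeconds: 1` for TCP probes); the port and timing values are the documented starting points, not values verified against the final docs:

```yaml
# Sketch: probe fragment for a node-renderer container.
readinessProbe:
  exec:
    command:
      # curl must be built with HTTP/2 support; httpGet (HTTP/1.1) probes
      # cannot talk to the renderer's cleartext-HTTP/2 (h2c) listener.
      - curl
      - -sf
      - --max-time
      - "4"
      - --http2-prior-knowledge
      - http://localhost:3800/info
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  # Plain TCP check: only asks whether the port is accepting connections,
  # so it stays valid regardless of the listener's HTTP protocol.
  tcpSocket:
    port: 3800
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 5
```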

Reviewed by Cursor Bugbot for commit d03f3df.

Summary by CodeRabbit

  • Documentation
    • Updated Kubernetes probe guidance for the Node Renderer: recommended startup/readiness/liveness semantics and clarified omitting initialDelaySeconds for readiness (and liveness).
    • Tightened example probe commands to bound probes (curl --max-time 2/3) and use h2c-capable checks.
    • Clarified exec readiness probes require curl with HTTP/2 support; httpGet (HTTP/1.1) won’t work against h2c.
    • Documented built-in, unauthenticated /info endpoint and advised richer /health readiness logic.


coderabbitai Bot commented May 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Refines Node Renderer health-check docs: documents an unauthenticated GET /info endpoint, clarifies Fastify health-check guidance, and updates Docker Compose and Kubernetes probe examples to use h2c-aware curl probes with explicit --max-time timeouts and probe-timing notes.

Changes

Node Renderer Health Checks & Kubernetes Probe Documentation

| Layer / File(s) | Summary |
| --- | --- |
| Built-in Endpoint (`docs/oss/building-features/node-renderer/js-configuration.md`) | Adds a "Built-in Endpoints" section documenting an unauthenticated `GET /info` returning `node_version` and `renderer_version`, and notes it remains accessible when `password` is configured. |
| Health-check guidance (`docs/oss/building-features/node-renderer/js-configuration.md`) | Clarifies that the sample `/health` handler is intentionally shallow and recommends implementing readiness/warm-up gating logic inside the handler; documents exec-probe semantics (non-zero curl exit = failure). |
| Probe configuration guidance (`docs/oss/building-features/node-renderer/js-configuration.md`) | Adds "Configuring Startup, Readiness, and Liveness Probes" describing when to use `tcpSocket` vs `exec` vs HTTP/1.1 probes, h2c implications for Kubernetes/Control Plane, recommended probe timings, a warning against external checks in liveness, and deployment-shape guidance for probe targets and renderer binding. |
| ERR_STREAM_PREMATURE_CLOSE remediation (`docs/oss/building-features/node-renderer/container-deployment.md`) | Updates remediation examples: the readiness-probe exec uses `curl -sf --max-time 4 --http2-prior-knowledge http://localhost:3800/info` (or `/health`), updates the JS config link reference, and changes curl HTTP/2 verification to `curl --version`. |
| Container examples (`docs/oss/building-features/node-renderer/container-deployment.md`) | Adds `--max-time 2` to the Docker Compose renderer service healthcheck curl; updates the Kubernetes sidecar manifest readiness-probe exec to include `--max-time 4` for the h2c readiness check. |
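As a sketch of the Compose-side change, a healthcheck with the documented `--max-time 2` bound might look like the following. The service and image names are illustrative, and the `--http2-prior-knowledge` flag is an assumption carried over from the Kubernetes readiness example, since the renderer listener is h2c:

```yaml
services:
  node-renderer:
    image: my-app/node-renderer:latest # illustrative name
    healthcheck:
      # Bounded probe: curl exits non-zero on failure or after 2 seconds.
      test: ['CMD', 'curl', '-sf', '--max-time', '2', '--http2-prior-knowledge', 'http://localhost:3800/info']
      interval: 10s
      timeout: 3s
      retries: 3
```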

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit nibbling docs by night,
Notes /info and curl set right.
H2 probes with timeouts snug and neat,
Containers healthy, heartbeats meet. 🐇✨

🚥 Pre-merge checks: ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[codex] Document node renderer health probes' clearly and specifically describes the main change: adding documentation about health probes for the node renderer. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |



claude Bot commented May 3, 2026

Code Review: Documentation-only PR

Overall: Good, practical addition. The readiness/liveness probe guidance is accurate and the Control Plane deployment table is genuinely useful. A few minor issues below.

Issues

/info built-in endpoint is undocumented
Line 148 says "The built-in /info route can also serve as a shallow process check." This is the first mention of /info anywhere in the docs. If it is truly built-in to the node renderer, it needs to be documented (what does it return? is it always present?). If it's not guaranteed to be there, calling it "built-in" and recommending it as a probe path is misleading and could cause broken probes in the field.

GVC acronym not explained
Line 179 uses `<GVC>` in the example URL without any explanation. Users unfamiliar with Control Plane won't know this stands for "Global Virtual Cloud." It should be expanded or linked.

Formatting inconsistency: $PORT not in backticks
Line 166: 3800 is in backticks but $PORT is not. Should be consistent.

Failure threshold for liveness is vague
Line 170: "higher for cautious liveness" doesn't give operators a useful starting point. Even a soft suggestion like `5`-`10` would be more actionable than "higher."

Minor Suggestions

  • The phrase "cheap endpoint" (line 148) is informal but fine for technical docs — no change needed.
  • "Control Plane Flow's default rails template" (line 181): first occurrence of "Control Plane Flow" as a distinct proper noun — worth a brief note or link for readers who only know "Control Plane."
  • The health check code example above this section returns { status: 'ok' }. Mentioning the expected response shape (e.g., "returns 200 OK") in the probe table would help operators verify their setup.

| Field | Value |
| ----------------- | ----------------------------------------------- |
| Scheme | `HTTP` |
| Path | `/health` or `/info` |
| Port | The renderer port, usually `3800` or `$PORT` |

Formatting inconsistency: 3800 is wrapped in backticks but $PORT is not.

Suggested change

| Port | The renderer port, usually `3800` or $PORT |
| Port | The renderer port, usually `3800` or `$PORT` |

i.e. `$PORT` instead of plain $PORT.


### Configuring Readiness and Liveness Probes

Use a cheap endpoint such as the `/health` route above for readiness and liveness probes. The built-in `/info`

The built-in /info route is referenced here but never documented elsewhere in this file or the broader docs. Before recommending it as a probe endpoint operators should know:

  • Is it always present, or only when a custom Fastify setup is used?
  • What does its response body look like?
  • Is it affected by auth (password config)?

If it's truly always built-in, please add a note (or a link to wherever it's defined) so operators can verify it before pointing a liveness probe at it.

| Initial delay | `5`-`10` seconds, longer if startup is slow |
| Period | `10` seconds |
| Timeout | `1`-`5` seconds |
| Failure threshold | `3` for readiness, higher for cautious liveness |

"higher for cautious liveness" is vague. Operators setting this for the first time benefit from a concrete starting point. Consider something like:

Suggested change
| Failure threshold | `3` for readiness, higher for cautious liveness |
| Failure threshold | `3` for readiness; `5`-`10` for liveness (restart only as a last resort) |

| Deployment shape | Rails `renderer_url` | Renderer `host` | Probe target |
| --------------------------------------- | ----------------------------------------------------------------- | --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Same Rails container/process supervisor | `http://localhost:3800` | Default `localhost` is fine | Probe the `rails` container's Rails health endpoint, such as `/up` on port `3000`. Add a Rails readiness check for the renderer if SSR is required. |
| Separate container in the same workload | `http://node-renderer:3800`, using the renderer container name | Usually `0.0.0.0` | Add HTTP readiness and liveness probes to the `node-renderer` container on port `3800`. The renderer port does not need to be the public workload port. |
| Separate node-renderer workload | `http://node-renderer.<GVC>.cpln.local:3800` or your internal URL | `0.0.0.0` | Add HTTP readiness and liveness probes to the node-renderer workload container on port `3800`. Expose the port internally, not publicly, unless required. |

`<GVC>` is unexplained here. Control Plane users will recognise it, but anyone reading these docs for the first time won't. Consider expanding it inline:

Suggested change
| Separate node-renderer workload | `http://node-renderer.<GVC>.cpln.local:3800` or your internal URL | `0.0.0.0` | Add HTTP readiness and liveness probes to the node-renderer workload container on port `3800`. Expose the port internally, not publicly, unless required. |
| Separate node-renderer workload | `http://node-renderer.<GVC>.cpln.local:3800` or your internal URL (`<GVC>` is your Control Plane Global Virtual Cloud name) | `0.0.0.0` | Add HTTP readiness and liveness probes to the node-renderer workload container on port `3800`. Expose the port internally, not publicly, unless required. |

Or add a short footnote/sentence after the table.


claude Bot commented May 4, 2026

Review: Node Renderer Health Probes Documentation

This is a clean, well-structured documentation addition. The readiness vs liveness distinction is correct and the operational advice (don't couple liveness to external dependencies) is sound. A few issues worth addressing before merging:

Issues

$PORT vs $RENDERER_PORT in the probe table
The probe table's Port row says "usually 3800 or $PORT". In the renderer config docs directly above this section, the port defaults to process.env.RENDERER_PORT || 3800, and $PORT is described as the Heroku/Control Plane public port — not the renderer-specific env var. In a Control Plane sidecar or separate workload, $PORT is often the externally-exposed workload port and may differ from the port the renderer is actually listening on. The cell should distinguish $RENDERER_PORT (renderer-specific) from $PORT (platform public port) to avoid misconfiguration of probes.

Kubernetes users are left without guidance
The "Adding a Health Check Endpoint" section two levels above this one explicitly mentions "Docker or Kubernetes" as target environments, but the new probes section only covers Control Plane deployment shapes. A reader on Kubernetes will get the conceptual guidance (readiness vs liveness, recommended thresholds) but no example of the Kubernetes livenessProbe/readinessProbe spec. Even a short YAML snippet or a pointer to the Kubernetes docs would close this gap.

"Combined health endpoint" advice is actionable but under-specified
The final paragraph advises "use one combined health endpoint if you need to check both processes" when Rails and the renderer share a container, but gives no guidance on what that endpoint should do — e.g., Rails makes an HTTP call to the renderer's /health and returns a 503 if it fails. Without at least a brief description of the pattern (or a link to a Rails up controller example), users won't know how to implement it.

/info unauthenticated access in production
The section states /info "does not require the renderer password." Worth clarifying whether this applies even in production (where password is required to start). If /info is always open regardless of environment, that's a security-relevant detail worth noting — it exposes node_version and renderer_version publicly to anyone who can reach the renderer port.

Minor

  • The Control Plane table's internal URL (http://node-renderer.<GVC>.cpln.local:3800) would benefit from a link to the Control Plane internal DNS docs so users know where to find their GVC name and verify the format.
  • Failure threshold values are mixed into a shared table. Consider a note clarifying that the readiness/liveness split on that row is intentional so it doesn't look like a typo.


| Field | Value |
| ----------------- | -------------------------------------------- |
| Scheme | `HTTP` |

The env var for the renderer port is RENDERER_PORT (documented above as process.env.RENDERER_PORT || 3800), while $PORT is described in the config docs as the Heroku / Control Plane public port — these are not the same thing. In a sidecar or separate-workload setup, $PORT is the externally-exposed workload port and may differ from the renderer's internal listening port.

Suggest clarifying:

Suggested change
| Scheme | `HTTP` |
| Port | The renderer port, usually `3800` (`$RENDERER_PORT`). On Heroku or Control Plane, use `$PORT` only if the renderer is configured to listen on that variable. |

Use a cheap endpoint such as the `/health` route above for readiness and liveness probes. The health check route should
return `200 OK` when the process can accept probe traffic. The built-in `/info` route can also serve as a shallow process
check if you do not need a custom route; it is always registered by the renderer, does not require the renderer password,
and returns `node_version` and `renderer_version`.

Worth clarifying whether /info remains unauthenticated in all environments, including production (where password is required to start the renderer). If /info is always open regardless of the password config, that's a security-relevant detail: it exposes node_version and renderer_version to anyone who can reach the renderer port. Either note that explicitly, or add a security callout similar to the host: 0.0.0.0 warning already in this file.


Control Plane configures probes per container. When Rails and the renderer share one container, use one combined health
endpoint if you need to check both processes. When the renderer has its own container or workload, put the renderer probes
on that container.

The advice to "use one combined health endpoint" is correct but under-specified. What should the combined endpoint do? A common pattern is to have the Rails /up (or a custom controller) make a short HTTP call to the renderer's /health or /info and return 503 if it fails. Without at least a brief description of this pattern, users sharing a container won't know how to implement it. Consider adding a sentence like: "For example, add a Rails health check controller that calls http://localhost:3800/health and returns 503 if the renderer is unreachable."

| Separate container in the same workload | `http://node-renderer:3800`, using the renderer container name | Usually `0.0.0.0` | Add HTTP readiness and liveness probes to the `node-renderer` container on port `3800`. The renderer port does not need to be the public workload port. |
| Separate node-renderer workload | `http://node-renderer.<GVC>.cpln.local:3800` or your internal URL (`<GVC>` is your Control Plane Global Virtual Cloud name) | `0.0.0.0` | Add HTTP readiness and liveness probes to the node-renderer workload container on port `3800`. Expose the port internally, not publicly, unless required. |

[Control Plane Flow](https://github.com/shakacode/control-plane-flow)'s default `rails` template models Rails as a

The `<GVC>.cpln.local` internal DNS format is Control Plane-specific and non-obvious. Consider linking to the relevant Control Plane documentation (e.g., the internal DNS or service discovery reference) so users can verify the format and find their GVC name without leaving the renderer docs to hunt for it.


claude Bot commented May 4, 2026

Code Review

This is a documentation-only PR adding a "Readiness and Liveness Probes" section to docs/oss/building-features/node-renderer/js-configuration.md. The conceptual guidance (separating readiness vs. liveness semantics, avoiding external dependency checks in the liveness probe) is well-written and accurate. However, there are a few issues worth addressing before merging.

Critical: HTTP/2 incompatibility not mentioned

The renderer uses cleartext HTTP/2 (h2c), which means standard Kubernetes httpGet probes (HTTP/1.1) will not work — they'll get a protocol error. container-deployment.md already documents this explicitly (line 391):

"Because the renderer uses cleartext HTTP/2, Kubernetes httpGet probes (HTTP/1.1) are incompatible with this listener. Use a TCP probe, an exec probe (for example with curl --http2-prior-knowledge…)"

The new probe table lists Scheme: HTTP without any caveat, which will lead users to configure broken httpGet probes. The table should instead recommend tcpSocket probes as the safe cross-platform option and reference exec + curl --http2-prior-knowledge for application-level checks, consistent with what container-deployment.md already says.

Duplication with container-deployment.md

container-deployment.md lines 391–428 already cover startup, readiness, and liveness probes with full YAML examples and the HTTP/2 caveat. The new section duplicates this at a lower accuracy level. Consider whether this section should:

  • Simply cross-reference container-deployment.md#startup-errors-err_stream_premature_close instead of duplicating.
  • Or at minimum note that full YAML examples and the HTTP/2 exec-probe pattern are documented there.

Networking inconsistency for the sidecar case (Control Plane)

The new Control Plane table says containers in the same workload use renderer_url: http://node-renderer:3800 and host: 0.0.0.0. But container-deployment.md documents that sidecar containers in the same pod/workload share a network namespace, so http://localhost:3800 and the default localhost binding work — the same as a single container. Using 0.0.0.0 + container-name DNS is the pattern for separate workloads, not same-workload sidecars. See container-deployment.md line 87 ("Rails still connects via localhost").

If Control Plane workloads behave differently from Kubernetes pods in this regard, that's worth a brief callout to avoid confusion.

Missing startup probe

The recommended values table covers readiness and liveness thresholds but omits the startup probe, which is particularly important for the node renderer (workers take time to initialize). container-deployment.md includes a startup probe template with initialDelaySeconds: 10. Worth adding at least a row or note.
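A startup probe along these lines might look like the following sketch. The `initialDelaySeconds: 10` matches the container-deployment.md template cited above; the failure threshold is an illustrative assumption that gives slow workers extra headroom before liveness checks begin:

```yaml
startupProbe:
  # TCP check is safe against the h2c listener for startup purposes.
  tcpSocket:
    port: 3800
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 12 # assumption: up to ~60s of additional startup slack
```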

Minor

  • The Initial delay: 5-10 seconds row in the table applies only to readiness probes; liveness probes typically omit initial delay when a startup probe is in use.
  • The table title says "Recommended HTTP probe values" — "HTTP" in the title reinforces the incorrect assumption that httpGet is the right probe type.


| Field | Value |
| ----------------- | -------------------------------------------- |
| Scheme | `HTTP` |

Breaking guidance for Kubernetes users: The renderer's Fastify server listens on HTTP/2 (h2c — cleartext HTTP/2). Standard Kubernetes httpGet probes speak HTTP/1.1 and will receive a protocol error, causing the probe to fail permanently.

container-deployment.md (line 391) already documents this and recommends either an exec probe with curl --http2-prior-knowledge or a tcpSocket probe as the fallback. This row should be corrected — or at minimum carry a note — to avoid sending users down a broken path:

Suggested change
| Scheme | `HTTP` |
| Probe type | `tcpSocket` (safe default) or `exec` with `curl --http2-prior-knowledge` (application-level) — **not** `httpGet`; the renderer speaks HTTP/2 (h2c) which is incompatible with HTTP/1.1 httpGet probes |

| Deployment shape | Rails `renderer_url` | Renderer `host` | Probe target |
| --------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Same Rails container/process supervisor | `http://localhost:3800` | Default `localhost` is fine | Probe the `rails` container's Rails health endpoint, such as `/up` on port `3000`. Add a Rails readiness check for the renderer if SSR is required. |
| Separate container in the same workload | `http://node-renderer:3800`, using the renderer container name | Usually `0.0.0.0` | Add HTTP readiness and liveness probes to the `node-renderer` container on port `3800`. The renderer port does not need to be the public workload port. |

The networking shown here conflicts with container-deployment.md.

Containers in the same Control Plane workload share a network namespace (same as Kubernetes pod sidecars), so Rails connects via http://localhost:3800 and the renderer can keep its default localhost binding — just like the single-container case. container-deployment.md line 87 confirms: "Rails still connects via localhost."

Using http://node-renderer:3800 (container-name DNS) and binding 0.0.0.0 is the pattern for separate workloads communicating across a network boundary, not for same-workload sidecars.

If Control Plane's same-workload networking genuinely differs from Kubernetes pods here, a brief clarification callout would prevent confusion — but the current values look like a copy-paste from the separate-workload row.

Comment on lines +165 to +176
The renderer listens with cleartext HTTP/2 (h2c). Kubernetes `httpGet` probes and other HTTP/1.1-only probes are not
compatible with that listener. Use one of these probe styles instead:

| Probe style | When to use it |
| ------------ | ---------------------------------------------------------------------------------------------------------------------------- |
| `tcpSocket` | Safe default for startup and liveness probes when you only need to know that the renderer port is accepting traffic. |
| `exec` probe | Application-level readiness check with an h2c-aware client, for example `curl --http2-prior-knowledge`. |
| HTTP/1.1 | Only if you probe Rails, a separate HTTP/1.1 health sidecar/port, or another endpoint that is not the renderer h2c listener. |

Recommended starting values:

| Probe | Starting point |

Critical: h2c (HTTP/2 cleartext) incompatibility not mentioned

The renderer serves all its routes — including /health and /info — over cleartext HTTP/2 (h2c). Standard httpGet probes used by Kubernetes (and Control Plane, which uses the same format) speak HTTP/1.1 and are incompatible with h2c listeners. This is explicitly documented in container-deployment.md:

"Because the renderer uses cleartext HTTP/2, Kubernetes httpGet probes (HTTP/1.1) are incompatible with this listener. Use a TCP probe, an exec probe (for example with curl --http2-prior-knowledge) …"

The Scheme: HTTP row in this table will mislead users into configuring httpGet probes that silently fail at the protocol level. The table needs either:

  1. A note that httpGet probes don't work — use exec with curl --http2-prior-knowledge for application-level checks, or tcpSocket as a lightweight fallback.
  2. A direct cross-reference to container-deployment.md#startup-errors-err_stream_premature_close where the correct YAML examples already live.
Suggested change
The renderer listens with cleartext HTTP/2 (h2c). Kubernetes `httpGet` probes and other HTTP/1.1-only probes are not
compatible with that listener. Use one of these probe styles instead:
| Probe style | When to use it |
| ------------ | ---------------------------------------------------------------------------------------------------------------------------- |
| `tcpSocket` | Safe default for startup and liveness probes when you only need to know that the renderer port is accepting traffic. |
| `exec` probe | Application-level readiness check with an h2c-aware client, for example `curl --http2-prior-knowledge`. |
| HTTP/1.1 | Only if you probe Rails, a separate HTTP/1.1 health sidecar/port, or another endpoint that is not the renderer h2c listener. |
Recommended starting values:
| Probe | Starting point |

with:

Recommended probe approach:
> **Note:** The renderer serves all routes over cleartext HTTP/2 (h2c). Standard `httpGet` probes speak HTTP/1.1 and will be rejected by the renderer. Use an `exec` probe with `curl --http2-prior-knowledge` for application-level checks, or `tcpSocket` as a minimal fallback. See [Container Deployment: Probe Configuration](./container-deployment.md#startup-errors-err_stream_premature_close) for YAML examples.
| Field | Value |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Probe type | `exec` (`curl -sf --http2-prior-knowledge http://localhost:3800/info`) for readiness; `tcpSocket` port `3800` for liveness |
| Path | `/health` or `/info` (for `exec` probes only) |
| Port | The renderer port, usually `3800` (`$RENDERER_PORT`). On Heroku or Control Plane, use `$PORT` only if the renderer is configured to listen on that variable. |
| Initial delay | `5`-`10` seconds, longer if startup is slow |
| Period | `5` seconds for readiness; `10` seconds for liveness |
| Timeout | `1`-`5` seconds |
| Failure threshold | `3` for readiness; `5`-`10` for liveness |
| Success threshold | `1` |

- **Liveness** answers whether the renderer is stuck badly enough that restarting the container is safer.

Do not put Rails, database, Redis, or other external dependency checks in the node-renderer's liveness probe. A
temporary dependency outage should not restart every renderer replica. If SSR must be available before Rails receives

Missing: startup probe recommendation

container-deployment.md explicitly recommends a startup probe alongside readiness and liveness probes, noting it's the primary way to handle ERR_STREAM_PREMATURE_CLOSE errors during slow container startup. This section only mentions two probe types — worth adding a brief note:

Also configure a startup probe (separate from the liveness probe) with a generous initialDelaySeconds to handle slow renderer startup without triggering premature liveness restarts. See Container Deployment for a YAML example.

Comment on lines +180 to +184
| Liveness | `tcpSocket` on the renderer port, `periodSeconds: 10`, and `failureThreshold` set to `5`-`10`. |

See [Node Renderer: Container Deployment](./container-deployment.md#startup-errors-err_stream_premature_close) for full
Kubernetes YAML examples, including startup, readiness, and liveness probes.


The "Probe target" column recommends HTTP probes on port 3800 for the separate-container and separate-workload rows, but doesn't mention the h2c constraint. A reader following this table will likely configure httpGet probes that fail silently.

Suggest adding a footnote or inline note like: "Use exec + curl --http2-prior-knowledge or tcpSocket — standard httpGet probes are HTTP/1.1 and are rejected by the h2c listener."


claude Bot commented May 4, 2026

Code Review

Overview: Documentation-only PR adding a "Configuring Readiness and Liveness Probes" section to docs/oss/building-features/node-renderer/js-configuration.md. The advice on probe semantics (readiness vs. liveness separation, no external-dependency checks in liveness) is sound. However, there is one critical correctness issue and a few gaps relative to existing docs.


🔴 Critical — h2c (HTTP/2 cleartext) incompatibility

The renderer serves all its routes — including the custom /health and the built-in /info — over cleartext HTTP/2 (h2c). Standard httpGet probes in Kubernetes and Control Plane speak HTTP/1.1 and are rejected at the protocol level. This is already documented clearly in container-deployment.md:

"Because the renderer uses cleartext HTTP/2, Kubernetes httpGet probes (HTTP/1.1) are incompatible with this listener. Use a TCP probe, an exec probe (for example with curl --http2-prior-knowledge) …"

The new probe table's Scheme: HTTP row directly contradicts this. A user following it will configure httpGet probes that fail silently, which is worse than having no documentation. The fix is either:

  • Replace the table row with the correct probe types (exec + curl --http2-prior-knowledge for readiness, tcpSocket for liveness).
  • Or collapse the table entirely and point to the YAML examples already in container-deployment.md#startup-errors-err_stream_premature_close.

The Control Plane deployment table has the same gap — the "Probe target" cells recommend HTTP probes without mentioning h2c.


🟡 Significant — startup probe not mentioned

container-deployment.md explicitly recommends a startup probe as the primary mitigation for ERR_STREAM_PREMATURE_CLOSE errors during slow renderer startup. This PR documents readiness and liveness but omits startup probes. For renderers with large bundles or many workers, the startup probe is the most important one — without it, a generous initialDelaySeconds on the liveness probe becomes the only safety net.


🟡 Significant — duplication without cross-reference

container-deployment.md already has detailed probe guidance with correct YAML, h2c warnings, and startup/readiness/liveness examples. The new section covers the same ground but less completely and with a protocol error. A lighter touch — a paragraph + cross-reference — would reduce the surface area for divergence as the renderer evolves. If the intent is to make js-configuration.md self-contained, that's fine, but then it needs to be kept in sync with container-deployment.md.


🟢 Minor

  • PR description leaks a local path in the validation commands (/Users/justin/codex/react_on_rails-pr-probes/…). Not a doc issue but worth cleaning up before the PR is un-drafted.
  • Period value inconsistency: The table recommends 10s for all probes; container-deployment.md uses 5s for readiness and 10s for liveness. Using the same period for both is fine, but it's worth aligning intentionally.

What's good

  • Clear readiness vs. liveness distinction and the external-dependency warning are exactly right and valuable.
  • The /info unauthenticated-access security callout is important and well-placed.
  • The Control Plane deployment-shape table is a useful addition not present in container-deployment.md.
  • The prose is concise and the placement (immediately after the health check example) is logical.

The Control Plane table and the probe-semantics prose are worth keeping — they just need the h2c caveat added and a cross-reference to the YAML examples in container-deployment.md.

@claude

claude Bot commented May 4, 2026

Code Review

This is a well-structured documentation addition. The guidance is technically accurate overall, the table layout is clear, and the placement within ## Custom Fastify Configuration is logical. A couple of issues are worth fixing before merge.

Issues

Liveness probe failureThreshold conflicts with container-deployment.md

container-deployment.md documents failureThreshold: 3 for the liveness probe (in both the prose YAML snippet at line ~417 and the sidecar manifest at line ~527). The new section recommends `5`-`10`. These values are now contradictory across the same docs set. Pick one value or range and align both files.

Readiness probe is missing the curl HTTP/2 image prerequisite

container-deployment.md explicitly warns: "The exec probe requires curl with HTTP/2 support in your image. Verify with curl --version | grep HTTP2. If curl is unavailable, use tcpSocket as a fallback." The new summary table omits this entirely. A user reading only js-configuration.md will copy the curl --http2-prior-knowledge command without knowing their base image may not ship a curl build with HTTP/2 support, and the probe will silently fail.

Minor observations

  • The paragraph immediately before the Control Plane table ("For separate Control Plane workloads, the internal URL format is `http://WORKLOAD_NAME.GVC_NAME.cpln.local:PORT`.") repeats what the table's third row already shows. It can be removed or folded into the table caption.
  • The Control Plane table's "Separate node-renderer workload" cell under renderer_url is quite wide. Consider moving the GVC parenthetical to a footnote or prose below the table to keep it scannable.

Recommended starting values:

| Probe | Starting point |
| --------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |

The failureThreshold range here (`5`-`10`) conflicts with container-deployment.md, which documents failureThreshold: 3 for the liveness probe in both the prose YAML snippet (line ~417) and the sidecar manifest (line ~527). Having two different values in the same docs set will confuse readers.

Suggestion: align to the single value used in container-deployment.md (currently 3), or update that file at the same time and explain in prose why the liveness threshold should be more generous than the readiness threshold.

Suggested change
| --------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Liveness | `tcpSocket` on the renderer port, `periodSeconds: 10`, and `failureThreshold: 3`. This matches the example in [Container Deployment](./container-deployment.md#startup-errors-err_stream_premature_close). Increase only if your environment has slow storage or frequent transient pauses. |
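Expressed as a manifest fragment, that suggested row would look roughly like this (a sketch; port 3800 is assumed):

```yaml
livenessProbe:
  tcpSocket:
    port: 3800          # assumed renderer port
  timeoutSeconds: 1
  periodSeconds: 10
  failureThreshold: 3   # aligned with the container-deployment.md example
```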


Recommended starting values:

| Probe | Starting point |

The curl --http2-prior-knowledge command requires a curl build compiled with HTTP/2 support, which is not present in all base images. container-deployment.md surfaces this with: "The exec probe requires curl with HTTP/2 support in your image. Verify with curl --version | grep HTTP2. If curl is unavailable, use tcpSocket as a fallback."

Without that note here, a reader who copies this table value verbatim may get a silently broken readiness probe.

Suggested change
| Probe | Starting point |
| Readiness | `exec` with `curl -sf --http2-prior-knowledge http://localhost:3800/info`, `timeoutSeconds: 5`, `periodSeconds: 5`, and `failureThreshold: 3`. Requires curl with HTTP/2 support (`curl --version \| grep HTTP2`); use `tcpSocket` as a fallback if unavailable. |

@claude

claude Bot commented May 4, 2026

Code Review: Documentation-only PR

Summary: Adds a "Configuring Readiness and Liveness Probes" section to the node-renderer JS configuration docs. The content is accurate, well-structured, and fills a real gap. A few issues need to be addressed before merging.


Bugs / Inconsistencies

Liveness probe failureThreshold conflicts with the canonical example

The new table recommends a `failureThreshold` of `5`-`10` for liveness, but the authoritative YAML in container-deployment.md (Kubernetes Sidecar Manifest section) uses failureThreshold: 3. A reader who reads both docs will see two different values for the same setting with no explanation. One of them needs to change (or the prose needs to explain the intentional tradeoff), and the two files should agree.

Missing curl HTTP/2 support caveat

The readiness probe table recommends curl -sf --http2-prior-knowledge http://localhost:3800/info, but container-deployment.md already carries a note that this requires curl built with HTTP/2 support and instructs users to verify with curl --version | grep HTTP2. That caveat is missing here, so a reader who lands on this page first will not know their probe may silently fail if their image ships a curl without HTTP/2.


Design / Structure

Section lives under Custom Fastify Configuration but probes are a deployment concern

The new H3 is nested inside ## Custom Fastify Configuration, which covers Fastify plugin/hook setup. Probe configuration is an orchestration concern, not a Fastify-specific one; users who need no custom Fastify setup will still need probe guidance. Consider promoting this to a standalone H2 (e.g., ## Container Health Probes) placed after the Custom Fastify section, or at minimum add an introductory sentence clarifying why it appears here (e.g., "Because the probe endpoint is registered via configureFastify(), probe setup is documented here.").

Risk of docs drift from duplicated values

The probe recommendations here (startup failureThreshold: 6, readiness timeoutSeconds: 5, etc.) already appear as YAML examples in container-deployment.md. Having identical numbers in two places means both files must be updated in lockstep. Consider replacing the recommended-values table with a forward reference to the canonical YAML examples there. The prose about h2c incompatibility, probe semantics, and Control Plane deployment shapes adds unique value and should stay.


Minor

  • The Control Plane internal URL is given two ways: http://node-renderer.<GVC>.cpln.local:3800 in the table and http://WORKLOAD_NAME.GVC_NAME.cpln.local:PORT in the closing paragraph. Using the same placeholder format in both would reduce confusion.
  • The liveness probe row omits initialDelaySeconds. When a startup probe is also configured this is fine (Kubernetes gates liveness on startup success), but a one-line note would help readers who configure liveness without a startup probe.

| --------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Startup | `tcpSocket` on the renderer port, usually `3800` or `$RENDERER_PORT`; use `$PORT` only if the renderer listens there. Use `initialDelaySeconds: 10`, `periodSeconds: 5`, and `failureThreshold: 6` as a starting point. |
| Readiness | `exec` with `curl -sf --http2-prior-knowledge http://localhost:3800/info`, `timeoutSeconds: 5`, `periodSeconds: 5`, and `failureThreshold: 3`. |
| Liveness | `tcpSocket` on the renderer port, `periodSeconds: 10`, and `failureThreshold` set to `5`-`10`. |

The `failureThreshold` range of `5`-`10` here conflicts with container-deployment.md, where the Kubernetes Sidecar Manifest YAML example uses failureThreshold: 3 for liveness. A reader who consults both docs will see two different values with no explanation. Please align the two files — either update the YAML in container-deployment.md or change this recommendation to 3, and add a note if the wider range is intentional (e.g., "increase from 3 if the renderer recovers from transient hangs").

> timeoutSeconds: 1
> periodSeconds: 10
> failureThreshold: 3
> ```

The liveness fallback block is placed after the manifest's closing fence, making it easy to scroll past after copying the YAML. Users who don't read the callout and fallback will get the exec probe (which fails silently if curl lacks HTTP/2 support). Consider whether the fallback block would be better positioned immediately above the manifest — alongside the [!WARNING] — so it's seen before copying rather than after.


| Probe style | When to use it |
| ------------ | ------------------------------------------------------------------------------------------------------------------------------ |
| `tcpSocket` | Startup checks and fallback readiness or liveness checks when curl with HTTP/2 support is unavailable. |

The prose throughout the section recommends exec first and tcpSocket as a fallback, but the table puts tcpSocket first. Reordering to match the recommendation order (exec, then tcpSocket, then HTTP/1.1) would be less confusing for readers who skim tables before prose.

Suggested change
| `tcpSocket` | Startup checks and fallback readiness or liveness checks when curl with HTTP/2 support is unavailable. |
| Probe style | When to use it |
| ------------ | ------------------------------------------------------------------------------------------------------------------------------ |
| `exec` probe | Application-level readiness and stricter liveness checks with an h2c-aware client, for example `curl --http2-prior-knowledge`. |
| `tcpSocket` | Startup checks and fallback readiness or liveness checks when curl with HTTP/2 support is unavailable. |
| HTTP/1.1 | Only if you probe Rails, a separate HTTP/1.1 health sidecar/port, or another endpoint that is not the renderer h2c listener. |

@claude

claude Bot commented May 5, 2026

Code Review

Overview

This PR adds comprehensive documentation for configuring Node Renderer health probes (startup, readiness, liveness) in Docker Compose, Kubernetes, and Control Plane environments. It also documents the built-in unauthenticated /info endpoint and cleans up the Fastify health check handler example. Documentation-only change.


What Works Well

  • Technically accurate: The h2c ↔ httpGet incompatibility, tcpSocket host binding requirement (0.0.0.0 for Kubernetes probes vs. localhost for exec probes), and the startup-probe deferral of readiness/liveness (Kubernetes 1.20+) are all correct.
  • --max-time discipline: Consistently threading --max-time shorter than timeoutSeconds through all examples (--max-time 2 / timeout: 3s in Compose, --max-time 3 / timeoutSeconds: 5 in Kubernetes) is good hygiene and the rationale is explained.
  • Fastify handler simplification (return { status: 'ok' } vs reply.send(...)) is idiomatic Fastify and correct.
  • Security callout for the unauthenticated /info endpoint is appropriately placed.
  • Upgrade warning before switching liveness from tcpSocket to exec is prominently placed before the copy-paste manifest — the right place for it.

Issues to Address

1. Missing link between -f flag and HTTP-error probe failures (medium)

js-configuration.md around the "not-ready" guidance:

To signal not-ready... call reply.code(503).send(...). Kubernetes exec probes treat any non-zero curl exit code as a failure.

This is true, but the mechanism is the -f flag in curl -sf. Without -f, curl exits 0 even on a 503 response. A reader who copies the probe command but drops the -f (e.g. while debugging) would have a silently broken probe. Worth one sentence: "The -f flag in -sf causes curl to exit non-zero for HTTP 4xx/5xx responses, so a 503 from this handler correctly fails the probe."

2. HealthController TCP probe checks localhost, not the configured host (minor)

container-deployment.md, `HealthController#show`:

renderer_port = URI.parse(ReactOnRailsPro.configuration.renderer_url).port
Socket.tcp("localhost", renderer_port, connect_timeout: 1) {}

This hardcodes "localhost" for the TCP target regardless of the host in renderer_url. For the single-container case (renderer_url = "http://localhost:3800") this is correct. But if someone uses a hostnamed URL (e.g. they copied the Docker Compose URL http://renderer:3800 into the Rails container env), the port is extracted correctly but the TCP check still hits localhost, not the actual renderer host.

Since this section is explicitly scoped to "Same Rails Container", the hardcoded localhost is probably intentional — but a one-line comment clarifying that assumption (# renderer and Rails share a network namespace; localhost is always correct here) would prevent confusion.

3. Startup probe math note is slightly misleading (minor)

js-configuration.md:

the sixth and final failure fires at 10 + ((6 - 1) * 5) = 35 s after container start

The formula gives the timestamp of the 6th attempt, not when Kubernetes actually acts on the failure (there's an additional processing cycle). This is a nit and the formula is a reasonable approximation, but calling it "fires at" is clearer than "container restarts at 35 s" — the current wording is fine but worth confirming the intent.

4. PR description/CodeRabbit walkthrough says --max-time 4 but actual code uses 3 (very minor)

The auto-generated CodeRabbit walkthrough says:

updates Kubernetes sidecar manifest readiness-probe exec to include --max-time 4

The actual diffs use '3' throughout. Not a code bug, but worth a quick correction to the PR description to avoid reader confusion.


Suggestions (Non-blocking)

  • The > [!NOTE] callout placed between **Mitigation:** and the numbered list causes the list to visually float away from its heading in some renderers. Consider moving the callout after the numbered list, or making it a collapsed <details> block.
  • The curl HTTP/2 verification command (curl --version | grep -i http2) appears ~6 times across the two files; the harmonization to -i (case-insensitive) is good. The centralized "Probe command notes" callout in container-deployment.md is the right anchor — the inline repetitions are mostly for scannability, which is acceptable for docs.
  • The configureFastify handler comment # Return a Promise or use async/await if warm-up checks involve async operations. could note that async functions work too (not just explicit Promise returns), but this is very minor.

Verdict

The content is technically sound and fills a real documentation gap. The issues above are minor (1 medium, 3 minor/nits) and none are blockers. Recommend addressing item 1 (the -f flag note) before merging since it's the kind of thing operators copy-paste and misapply.

also passes `request` and `reply` to handlers if you need to inspect headers, set status codes, or customize the
response. Add warm-up or readiness-gate logic inside this handler if readiness should wait for renderer-specific
initialization. To signal not-ready, add `reply` to the handler parameters and call
`reply.code(503).send({ status: 'warming_up' })`. Kubernetes exec probes treat any non-zero curl exit code as a failure;

The statement that "exec probes treat any non-zero curl exit code as a failure" is correct, but the mechanism that makes a 503 response actually produce a non-zero exit code is the -f flag in curl -sf. Without -f, curl exits 0 regardless of the HTTP status code.

Consider adding one sentence to close the loop for readers who might not know this:

Suggested change
`reply.code(503).send({ status: 'warming_up' })`. Kubernetes exec probes treat any non-zero curl exit code as a failure;
`reply.code(503).send({ status: 'warming_up' })`. The `-f` flag in `curl -sf` causes curl to exit non-zero for HTTP 4xx/5xx responses (exit 22), so a `503` from this handler correctly fails the probe. Kubernetes exec probes treat any non-zero curl exit code as a failure;

# cluster workers are ready. Pair with the startup probe to shield liveness.
# connect_timeout is supported by the Ruby versions in this guide's prerequisites.
renderer_port = URI.parse(ReactOnRailsPro.configuration.renderer_url).port
Socket.tcp("localhost", renderer_port, connect_timeout: 1) {}

The TCP target is hardcoded to "localhost" rather than extracted from renderer_url. For the "Same Rails Container" scenario this is always correct (both processes share a network namespace), but a reader who copy-pastes this controller into a different topology (e.g. after having set renderer_url to a Docker Compose service name like http://renderer:3800) would silently probe the wrong host — they'd get :ok when the local port happens to be listening for something else.

A short inline comment would prevent this confusion:

Suggested change
Socket.tcp("localhost", renderer_port, connect_timeout: 1) {}
Socket.tcp("localhost", renderer_port, connect_timeout: 1) {} # Same container: localhost is always correct here.

Comment on lines +478 to +489
> [!NOTE]
> **Probe command notes:** `exec` probes require curl with HTTP/2 support in your image. Verify with
> `curl --version | grep -i http2`; if unavailable, use `tcpSocket` as a fallback. Set curl `--max-time` shorter than the
> orchestrator timeout so curl returns a clean non-zero exit code before Kubernetes terminates the probe process. These
> examples use `--max-time 3` with `timeoutSeconds: 5`, leaving a 2-second buffer. Readiness and liveness omit
> `initialDelaySeconds` because Kubernetes 1.20+ (startup probe GA) defers
> them until the startup probe succeeds. If you skip the startup probe or run an older cluster without startup probe
> support, add an appropriate `initialDelaySeconds`.

> **Security:** `/info` is unauthenticated even when `password` is configured. Keep the renderer on `localhost` or
> private networking if exposing node and renderer version details is a concern; see
> [Built-in Endpoints](./js-configuration.md#built-in-endpoints).

Placing the > [!NOTE] callout and the security blockquote between **Mitigation:** and the numbered list visually disconnects the list from its heading in some Markdown renderers (and in raw-text reading). The numbered items look like they follow the Note's content rather than the "Mitigation" section.

Consider moving these two callouts after item 4 (or at the end of the section) so the numbered list flows directly under "Mitigation:". This is non-blocking but would improve scannability for operators copy-pasting the probe YAML.

# In this same-container topology, Rails and the renderer share a network namespace.
# Probe localhost even if other deployment shapes use a service host.
# connect_timeout is supported by the Ruby versions in this guide's prerequisites.
renderer_port = URI.parse(ReactOnRailsPro.configuration.renderer_url).port

Silent wrong-port bug when renderer_url lacks an explicit port.

URI.parse("http://localhost").port returns 80 (the URI default for HTTP), not nil. If a user configures renderer_url as http://localhost (no :3800), the TCP check silently probes port 80 instead of the renderer — returning 200 if anything listens on port 80, 503 if nothing does. Neither URI::Error nor TypeError is raised, so the rescue doesn't catch this case.

Consider either validating the port or falling back explicitly:

Suggested change
renderer_port = URI.parse(ReactOnRailsPro.configuration.renderer_url).port
renderer_port = URI.parse(ReactOnRailsPro.configuration.renderer_url).port
raise URI::InvalidURIError, "renderer_url must include an explicit port (got: #{ReactOnRailsPro.configuration.renderer_url})" if renderer_port == URI::DEFAULT_PARSER.send(:default_port, "http")

Or more pragmatically:

renderer_uri  = URI.parse(ReactOnRailsPro.configuration.renderer_url)
renderer_port = renderer_uri.port if renderer_uri.to_s.match?(/:\d+/)
renderer_port ||= 3800  # fall back to renderer default

At minimum, a prose note alongside this snippet warning that renderer_url must include the port would prevent silent misconfiguration.

also passes `request` and `reply` to handlers if you need to inspect headers, set status codes, or customize the
response. Add warm-up or readiness-gate logic inside this handler if readiness should wait for renderer-specific
initialization. To signal not-ready, add `reply` to the handler parameters and call
`reply.code(503).send({ status: 'warming_up' })`. The `-f` flag in `curl -sf` causes curl to exit non-zero for HTTP

Mixing bare-return and reply.send() patterns in the same handler needs a clarification.

The example handler uses return { status: 'ok' } (Fastify auto-serializes), but the prose immediately after says "add reply to the handler parameters and call reply.code(503).send(...) ". If a reader adds that call while keeping return { status: 'ok' }, Fastify will warn about a double-send (the returned object is still serialized after reply.send() returns).

Suggest clarifying that switching to reply requires dropping the bare return:

Suggested change
`reply.code(503).send({ status: 'warming_up' })`. The `-f` flag in `curl -sf` causes curl to exit non-zero for HTTP
`reply.code(503).send({ status: 'warming_up' })` and removing the bare `return` (or switching to `return reply.code(503).send(...)`). The `-f` flag in `curl -sf` causes curl to exit non-zero for HTTP

Or add a complete conditional example:

app.get('/health', async (_request, reply) => {
  if (!isWarmedUp()) {
    return reply.code(503).send({ status: 'warming_up' });
  }
  return { status: 'ok' };
});

`config/routes.rb`:

```ruby
# Override Rails 7.1+'s built-in /up route to add the renderer TCP check.

/up route override may silently break an existing custom health endpoint.

If a Rails app already defines a custom GET /up action (common for Rails < 7.1 or apps that added their own), this get "up", to: "health#show" in routes.rb will override it without warning. Consider noting that the user should check for an existing /up route before adding this, or suggest using a distinct path (e.g., /health or /healthz) to avoid the collision:

Suggested change
# Override Rails 7.1+'s built-in /up route to add the renderer TCP check.
# Override Rails 7.1+'s built-in /up route to add the renderer TCP check.
# If you already have a custom /up or are on Rails < 7.1, use a distinct
# path such as /healthz to avoid silently replacing existing health logic.
get "up", to: "health#show"

> The 10-second initial delay is a conservative starting point for images that take a moment to boot before opening the
> port; reduce it, or omit it, if your renderer consistently starts in under 5 seconds.

Readiness and liveness omit `initialDelaySeconds` here because Kubernetes 1.20+ (startup probe GA) defers them until

initialDelaySeconds: 10 note is slightly misleading for the startup probe.

The note says "reduce it, or omit it, if your renderer consistently starts in under 5 seconds" — but this is the startup probe's own initialDelaySeconds. Setting it to 0 (or omitting it, which defaults to 0 in Kubernetes) means checks begin immediately and the 30-second grace window still exists via failureThreshold: 6 × periodSeconds: 5. The note could be clearer that removing initialDelaySeconds doesn't reduce the total window, just shifts when the first check fires:

The 10-second initial delay means the first check fires at 10 s. Omitting it (Kubernetes defaults to 0) starts checks immediately — the total allowed startup window stays the same (failureThreshold × periodSeconds = 30 s). Reduce initialDelaySeconds if your renderer reliably opens the port within 1–2 seconds; keep it to avoid noisy early-failure log entries.
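To make the suggested wording concrete, here is a hypothetical fragment annotated with the window math (port 3800 and the probe values are assumptions matching the starting points quoted above):

```yaml
startupProbe:
  tcpSocket:
    port: 3800             # assumed renderer port
  # Omitting initialDelaySeconds (Kubernetes defaults it to 0) only moves the
  # first check earlier; the allowed startup window stays
  # failureThreshold * periodSeconds = 30 s either way.
  periodSeconds: 5
  failureThreshold: 6
```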

@claude

claude Bot commented May 5, 2026

Code Review — PR #3241: Document node renderer health probes

Overall: This is a well-structured, thorough documentation addition. The technical content is accurate, the probe-timing rationale (--max-time vs timeoutSeconds buffers) is clearly explained, and the upgrade warnings around the `tcpSocket` → `exec` liveness probe change are appropriately prominent. The new Control Plane Deployment Shapes section is a useful addition for the target audience.

Four issues worth addressing before merge:


1. Silent wrong-port in HealthController (correctness bug in code sample)

URI.parse("http://localhost").port returns 80 — the HTTP default — not nil. If renderer_url is configured without an explicit port (e.g. http://localhost instead of http://localhost:3800), the TCP check silently probes port 80. Neither URI::Error nor TypeError is raised, so the rescue doesn't catch this. The health endpoint returns 200 if anything else listens on 80, 503 otherwise — both wrong.

At minimum, add a prose note that renderer_url must include the explicit port, or add a guard in the example before calling Socket.tcp. Inline comment added on line 181 of container-deployment.md.


2. Mixed Fastify handler patterns (docs + reply.send() confusion)

The /health handler example uses return { status: 'ok' } (Fastify auto-serializes). The immediately following paragraph says "add reply to the handler parameters and call reply.code(503).send(…)" — but mixing that into the same handler as a bare return triggers a Fastify double-send warning. A complete conditional example would be cleaner than the current mixed description. Inline comment on js-configuration.md line 172.


3. /up route override may conflict with existing health logic

get "up", to: "health#show" silently overrides any existing /up route (including customized Rails 7.1 health endpoints or pre-7.1 custom routes). A brief note recommending users check for conflicts, or suggesting an alternative path like /healthz, would prevent silent breakage. Inline comment on container-deployment.md line 160.


4. initialDelaySeconds note for startup probe is slightly misleading

The note suggests reducing or omitting initialDelaySeconds "if startup is under 5 seconds", but omitting it doesn't reduce the total allowed startup window — failureThreshold × periodSeconds still gives 30 s. The only effect is when the first check fires, not how long the window is. Inline comment on js-configuration.md line 295.


Minor observations (no action required)

  • Consistency: curl --version | grep -i http2 (case-insensitive) is now uniform across the PR — good improvement over the previous mixed grep HTTP2.
  • --max-time rationale: The explanation that Docker Compose uses --max-time 2 / timeout: 3s for "fast local feedback" while Kubernetes uses --max-time 3 / timeoutSeconds: 5 for "scheduler jitter" is clear and well-motivated.
  • Anchor validity: [Kubernetes Sidecar Manifest](#kubernetes-sidecar-manifest) referenced from the Control Plane section resolves correctly to line 601.
  • Startup probe timeoutSeconds: 1: Adding this explicitly (rather than relying on the Kubernetes default) is good practice for copy-paste safety.
  • Content duplication between files: The probe command notes appear in both container-deployment.md and js-configuration.md. This is intentional for standalone readability but creates a future drift risk; the cross-file links mitigate this reasonably.

@claude

claude Bot commented May 5, 2026

Code Review: PR #3241 — Document node renderer health probes

Documentation-only PR that adds solid, actionable guidance. The content is technically accurate and fills a real gap. A few issues worth addressing before merge.


What this PR does well

  • Correctly explains the h2c incompatibility with httpGet probes — this is a non-obvious constraint that trips people up.
  • --max-time / timeoutSeconds relationship is documented explicitly with the math spelled out (e.g. "leaves a 2-second buffer"). This is exactly the kind of operational detail that saves incident postmortems.
  • tcpSocket fallback is provided consistently wherever the exec probe is introduced.
  • Control Plane topology section is well-structured; the three shapes (same container, sidecar, separate workload) cover the real decision space.
  • /info security note (unauthenticated even with password configured) is important and correctly placed.
  • Fastify handler simplification (return { status: 'ok' } instead of reply.send(...)) is idiomatic and correct.

Issues to address

1. HealthController: NoMethodError escapes the rescue clause (see inline comment on line 189)

If ReactOnRailsPro.configuration.renderer_url returns nil, renderer_url.match?(...) raises NoMethodError, which is a subclass of NameError — not caught by SocketError, SystemCallError, URI::Error, or TypeError. The health endpoint would return a 500 instead of the intended 503.

2. HealthController — port extraction regex adds unnecessary complexity (see inline comment on line 186)

The regex /:\d+(?:[\/?#]|$)/ re-parses the URL string to detect whether an explicit port was given. URI already provides this: renderer_uri.port returns 80 when the URL is http://localhost (default HTTP port). A direct check against the default port is simpler and less error-prone.

3. Minor — repeated curl HTTP/2 verification note

curl --version | grep -i http2 appears in at least five separate callout boxes. While making each section self-contained is a good instinct, the repetition adds noise. Consider a single canonical note (already written well in the [!NOTE] block at the top of the mitigation steps) and link to it from later callouts rather than re-stating the full text.

4. Minor — [!NOTE] block appears before the numbered mitigation list but is not part of it

The [!NOTE] block for "Probe command notes" sits between the "Mitigation:" heading and the 1. list item. Readers scanning the numbered steps may miss it. Consider integrating the note into step 1, or placing it after the full numbered list.


No issues with

  • The overall document structure and cross-references between the two files.
  • The probe timing recommendations (values are reasonable starting points and the caveats are correct).
  • The Fastify handler signature change (() vs (request, reply)).
  • The Docker Compose --max-time 2 vs Kubernetes --max-time 3 difference (documented and intentional).

```ruby
renderer_port = renderer_url.match?(/:\d+(?:[\/?#]|$)/) ? renderer_uri.port : 3800
Socket.tcp("localhost", renderer_port, connect_timeout: 1) {}
head :ok
rescue SocketError, SystemCallError, URI::Error, TypeError
```

The rescue clause misses NoMethodError. If ReactOnRailsPro.configuration.renderer_url returns nil, the call to renderer_url.match?(...) on the line above raises NoMethodError — a subclass of NameError, not of SocketError, SystemCallError, URI::Error, or TypeError. The endpoint would respond with 500 instead of the intended 503.

Add a nil guard before the URI parse so the failure raises an explicit ArgumentError, and widen the rescue if that case should map to a 503:

Suggested change

```diff
-rescue SocketError, SystemCallError, URI::Error, TypeError
+rescue SocketError, SystemCallError, URI::Error, TypeError, ArgumentError
```

Or, more defensively, guard the nil case explicitly:

```ruby
renderer_url = ReactOnRailsPro.configuration.renderer_url
raise ArgumentError, "renderer_url not configured" if renderer_url.nil?
```

Paired together, these keep an unhandled NoMethodError from masking the misconfiguration as an opaque 500.

```ruby
# connect_timeout is supported by the Ruby versions in this guide's prerequisites.
renderer_url = ReactOnRailsPro.configuration.renderer_url
renderer_uri = URI.parse(renderer_url)
renderer_port = renderer_url.match?(/:\d+(?:[\/?#]|$)/) ? renderer_uri.port : 3800
```

The regex re-parses the URL string to detect whether an explicit port was set, because URI.parse("http://localhost").port returns 80 (HTTP default) rather than nil. A cleaner way to express the same intent:

Suggested change

```diff
-renderer_port = renderer_url.match?(/:\d+(?:[\/?#]|$)/) ? renderer_uri.port : 3800
+renderer_port = renderer_uri.port if renderer_url.to_s.match?(/:\d+/)
+renderer_port ||= 3800
```

Or even more explicitly, relying on the fact that the Node Renderer default (3800) differs from every scheme default:

```ruby
renderer_port = [80, 443].include?(renderer_uri.port) && !renderer_url.to_s.include?(":#{renderer_uri.port}") ? 3800 : renderer_uri.port
```

The current regex (/:\d+(?:[\/?#]|$)/) is correct in practice, but its intent (distinguishing "no port in URL" from "port 80 in URL") is not immediately obvious. A brief inline comment on that line explaining why a regex is needed (rather than just calling .port) would help future readers.
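The underlying URI behavior the regex works around can be shown in isolation (a standalone sketch, not the controller code from the PR):

```ruby
require "uri"

# URI.parse substitutes the scheme default when no port is written,
# so the parsed port alone cannot distinguish these two URLs.
with_port    = URI.parse("http://localhost:80")
without_port = URI.parse("http://localhost")

puts with_port.port    # 80
puts without_port.port # 80 (scheme default, not nil)

# A regex over the raw string is one way to recover the distinction:
def explicit_port?(url)
  url.match?(/:\d+(?:[\/?#]|$)/)
end

puts explicit_port?("http://localhost:80")          # true
puts explicit_port?("http://localhost")             # false
puts explicit_port?("http://localhost:3800/render") # true
```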

```ruby
raise ArgumentError, "renderer_url not configured" if renderer_url.nil? || renderer_url.empty?

renderer_uri = URI.parse(renderer_url)
renderer_port = renderer_url.match?(/:\d+(?:[\/?#]|$)/) ? renderer_uri.port : 3800
```

The port-detection regex is fragile. URI.parse("http://localhost").port returns 80 (HTTP default), not nil, so you can compare against URI::HTTP.default_port instead of pattern-matching the raw string:

Suggested change

```diff
-renderer_port = renderer_url.match?(/:\d+(?:[\/?#]|$)/) ? renderer_uri.port : 3800
+renderer_port = (renderer_uri.port == renderer_uri.class.default_port) ? 3800 : renderer_uri.port
```

This is clearer, with one trade-off: an explicit http://renderer:80 compares equal to the scheme default, so this version falls back to 3800, whereas the regex would correctly keep port 80.

There's also a subtle issue: if someone passes https://…, renderer_uri.class.default_port is 443, so the fallback would kick in unintentionally. Since the renderer only speaks h2c (plain HTTP), you could add an explicit scheme guard or just use 3800 as the hardcoded fallback and document the assumption.

```ruby
renderer_port = renderer_url.match?(/:\d+(?:[\/?#]|$)/) ? renderer_uri.port : 3800
Socket.tcp("localhost", renderer_port, connect_timeout: 1) {}
head :ok
rescue ArgumentError, SocketError, SystemCallError, URI::Error, TypeError
```

Rescuing ArgumentError alongside connection errors means a misconfigured renderer_url (the raise on line 185) silently returns 503 in production — indistinguishable from a genuinely unreachable renderer. The configuration error will keep returning 503 on every probe and causing cascading availability alerts without surfacing the real cause.

Consider re-raising (or logging + re-raising) ArgumentError so it surfaces as a 500 rather than a 503:

```ruby
rescue ArgumentError
  raise # surface misconfiguration as 500, not a health-check failure
rescue SocketError, SystemCallError, URI::Error, TypeError
  head :service_unavailable
end
```

This keeps "renderer is down" vs "deployment is misconfigured" distinguishable in logs and alerting.


```ruby
renderer_uri = URI.parse(renderer_url)
renderer_port = renderer_url.match?(/:\d+(?:[\/?#]|$)/) ? renderer_uri.port : 3800
Socket.tcp("localhost", renderer_port, connect_timeout: 1) {}
```

The hardcoded "localhost" here is intentional for the same-container topology, but this is subtle and easy to miss for readers adapting the snippet to other shapes. The renderer_url is used only for the port, while the host is silently overridden.

Worth adding a brief inline callout (or a blockquote below the snippet) like:

> Topology-specific: This code always probes localhost regardless of renderer_url's host — it only borrows the port. This is correct for same-container deployments where Rails and the renderer share a network namespace. Do not reuse this snippet as-is for sidecar or separate-workload topologies where renderer_url points to a different host.

Comment on lines +280 to +285
- **Liveness**: Prefer `exec` with
`curl -sf --max-time 3 --http2-prior-knowledge http://localhost:3800/info` when curl with HTTP/2 support is available.
If you are upgrading from a `tcpSocket` liveness probe, verify curl has HTTP/2 support in the image before switching.
Start with `timeoutSeconds: 5`, `periodSeconds: 10`, and `failureThreshold: 3`, matching the Container Deployment
examples. Use `tcpSocket` only if curl is unavailable. Raise `failureThreshold`, and optionally `periodSeconds`, if
heavy CPU bursts or frequent transient pauses trigger false-positive restarts.

The liveness probe recommendation (exec curl /info) is more aggressive than is often appropriate. A liveness probe failure restarts the container, while a readiness failure only removes it from the load-balancer rotation. Under sustained CPU/GC pressure, an h2c round-trip can time out even when the process is functional — triggering a restart that makes things worse (the "liveness probe death spiral" pattern).

The recommendation and the table already mention tcpSocket as a fallback, but the prose buries it with "only if curl is unavailable." It's worth flipping the default recommendation: tcpSocket for liveness (cheap, restarts only on hard failures), exec curl /info for readiness (application-level gatekeeping). The current guidance was changed from tcpSocket liveness → exec liveness, which reverses the safer default.
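Under that flipped recommendation, the pairing would look roughly like this (timing values borrowed from the PR's examples; treat this as a starting point, not the documented defaults):

```yaml
# Readiness: application-level check over h2c; failure only removes the pod
# from rotation, so a heavier probe is acceptable here.
readinessProbe:
  exec:
    command: ['curl', '-sf', '--max-time', '3', '--http2-prior-knowledge', 'http://localhost:3800/info']
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 3
# Liveness: cheap TCP connect; restarts only on hard failures, avoiding
# restart loops under CPU/GC pressure.
livenessProbe:
  tcpSocket:
    port: 3800
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
```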

@claude

claude Bot commented May 5, 2026

Code Review

This is a documentation-only PR that adds health probe guidance for the Node Renderer. The content is thorough and the security callouts (unauthenticated /info, 0.0.0.0 binding risks) are well-placed. A few things worth addressing before merge:

Code correctness issues in the HealthController example

The Ruby snippet in the "Same Rails Container" section has three issues (see inline comments):

  1. Port regex is fragile. renderer_url.match?(/:\d+(?:[\/?#]|$)/) reads the raw string to detect an explicit port, but Ruby's URI#port already returns the scheme default (80 for http) when none is given. Comparing against renderer_uri.class.default_port is cleaner and avoids mis-classifying http://renderer:80.

  2. ArgumentError is silently swallowed. The rescue clause catches both the deliberate raise ArgumentError, "renderer_url not configured" and runtime connection failures. A misconfigured deployment will silently return 503 forever with no log signal distinguishing it from a real outage. Splitting the rescue block or re-raising ArgumentError makes the two cases distinguishable.

  3. Hardcoded "localhost" is not obvious. The code parses renderer_url but only uses its port, always connecting to localhost. Readers adapting this for a sidecar or separate-workload topology may silently probe the wrong host. A short callout (blockquote below the snippet) would prevent that copy-paste hazard.

Probe strategy: liveness default

The PR changes the Kubernetes sidecar manifest's liveness probe from tcpSocket to exec curl /info. While h2c-aware exec probes are preferable for readiness, they carry more risk for liveness: under GC or CPU pressure, a curl round-trip can time out even when the process is running fine, triggering a restart that makes things worse (the "liveness probe death spiral"). The tcpSocket-to-exec change for liveness reverses the safer default; the old tcpSocket liveness was actually appropriate. The inline comment has more detail on this.

Minor documentation notes

  • The new "Configuring Startup, Readiness, and Liveness Probes" section in js-configuration.md (~93 lines) substantially overlaps with the new sections in container-deployment.md. This will create a maintenance burden when recommendations change. Consider whether one file should be authoritative and the other should link rather than duplicate.
  • The curl HTTP/2 verification command (curl --version | grep -i http2) is repeated in at least four separate places. A single shared callout or a "Prerequisites" section at the top would reduce that repetition.
  • The Fastify handler simplification (reply.send → return) is correct and cleaner — good change.

@claude

claude Bot commented May 8, 2026

Code Review - PR 3241 (Node renderer health probes docs)

OVERVIEW: Documentation-only PR adding comprehensive health probe guidance for the Node Renderer. Technical content is accurate: three-tier probe strategy (startup/readiness/liveness), h2c incompatibility with httpGet probes, the --max-time / timeoutSeconds relationship, and Control Plane topology shapes are all correct.

WHAT WORKS WELL:

  • Accurate explanation of why httpGet probes cannot target the h2c listener
  • Correct advice to prefer tcpSocket for liveness to avoid restarting on transient GC pauses
  • The --max-time 2 / timeout 3s (Compose) vs --max-time 3 / timeoutSeconds 5 (Kubernetes) split is well-motivated and documented
  • timeoutSeconds: 1 additions to startup and liveness tcpSocket probes are the right defaults
  • configureFastify handler change (returning the object directly instead of reply.send) is idiomatic Fastify

ISSUE 1 - Unrescued URI::InvalidURIError in HealthController example:
URI.parse(renderer_url) raises URI::InvalidURIError when the URL is malformed (e.g. 'not a url'). The rescue clause only catches SocketError and SystemCallError, so a malformed but non-empty renderer_url produces an unhandled exception -> 500. The callout box says config mistakes are intentionally allowed to surface as 500s, but that rationale is documented for nil/empty -- not for a URI parse error. Recommend either also rescuing URI::Error and re-raising explicitly, or adding a code comment that URI::InvalidURIError is intentionally left unrescued so misconfiguration is visible in logs.

ISSUE 2 - Port-detection regex is fragile:
renderer_url.match?() with the hand-rolled regex works correctly in practice but is non-obvious and hard to maintain. Since URI.parse already ran, comparing renderer_uri.port against the scheme default directly would be cleaner and removes the regex entirely.

ISSUE 3 - Content duplication:
Probe values and curl flags now appear in three places: the mitigation steps in container-deployment.md, the Kubernetes sidecar manifest, and the new js-configuration.md section. The cross-references help, but any future change to timing values will require multiple edits to stay in sync.

ISSUE 4 - Backwards initialDelaySeconds comment phrasing:
The inline YAML comment 'Omit initialDelaySeconds only if the startupProbe above is configured' reads as a restriction (when to omit) rather than a trigger (when to add). Suggested rewording: 'Add initialDelaySeconds here if no startupProbe is configured; Kubernetes 1.20+ defers readiness/liveness until the startup probe succeeds.'

VERDICT: Content is technically sound and a genuine improvement. The main fix worth making before merge is clarifying the URI::InvalidURIError behaviour in the HealthController example. Regex and phrasing issues are nice-to-haves.

```ruby
renderer_port = renderer_url.match?(/:\d+(?:[\/?#]|$)/) ? renderer_uri.port : 3800
Socket.tcp("localhost", renderer_port, connect_timeout: 1) {}
head :ok
rescue SocketError, SystemCallError
```

The rescue clause catches SocketError and SystemCallError but not URI::InvalidURIError, which URI.parse raises for a malformed URL. A non-empty but syntactically invalid renderer_url (e.g. "renderer:not-a-port") would therefore raise an unhandled exception and produce a 500, which differs from the "reachability failures become 503, config errors become 500" contract described in the callout below.

If the intent is for all config errors to surface as 500s, add a comment here making that explicit:

Suggested change

```ruby
rescue SocketError, SystemCallError
  # URI::InvalidURIError (malformed renderer_url) is intentionally not rescued here;
  # it surfaces as a 500 so misconfiguration is visible in logs and alerting.
  head :service_unavailable
```

Comment on lines +187 to +188
```ruby
renderer_uri = URI.parse(renderer_url)
renderer_port = renderer_url.match?(/:\d+(?:[\/?#]|$)/) ? renderer_uri.port : 3800
```

The regex approach is correct for typical renderer URLs but is hard to reason about at a glance. Since URI.parse already ran, you can compare renderer_uri.port against the HTTP/HTTPS scheme defaults directly, which is more self-documenting and handles an unrecognised scheme visibly:

Suggested change

```diff
 renderer_uri = URI.parse(renderer_url)
-renderer_port = renderer_url.match?(/:\d+(?:[\/?#]|$)/) ? renderer_uri.port : 3800
+default_scheme_port = { 'http' => 80, 'https' => 443 }.fetch(renderer_uri.scheme, nil)
+renderer_port = renderer_uri.port != default_scheme_port ? renderer_uri.port : 3800
```

```yaml
  failureThreshold: 6
  timeoutSeconds: 1
readinessProbe:
  # Omit initialDelaySeconds only if the startupProbe above is configured.
```

The phrasing "Omit … only if" reads as a restriction (when it is safe to leave it out) rather than a trigger (when it is required). Readers who skip the startup probe may not notice they need to add initialDelaySeconds. Suggested rewording:

Suggested change

```diff
-# Omit initialDelaySeconds only if the startupProbe above is configured.
+# Add initialDelaySeconds here if no startupProbe is configured.
+# Kubernetes 1.20+ defers readiness/liveness until the startup probe succeeds.
```

```yaml
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  # Omit initialDelaySeconds only if the startupProbe above is configured.
```

Same phrasing issue as the readiness probe comment above — "Omit … only if" reads as a restriction rather than a trigger. Suggest the same rewording:

Suggested change

```diff
-# Omit initialDelaySeconds only if the startupProbe above is configured.
+# Add initialDelaySeconds here if no startupProbe is configured.
+# Kubernetes 1.20+ defers readiness/liveness until the startup probe succeeds.
```

```diff
-app.get('/health', (request, reply) => {
-  reply.send({ status: 'ok' });
+app.get('/health', () => {
+  // Return a Promise or use async/await if warm-up checks involve async operations.
```

Minor: the comment says "Return a Promise or use async/await" but these are the same thing — an async function always returns a Promise. The useful distinction is between returning a plain object (synchronous, as shown here) vs. returning a Promise / using async/await for async warm-up checks:

Suggested change

```diff
-// Return a Promise or use async/await if warm-up checks involve async operations.
+// For async warm-up checks, use an async handler: async () => { await ...; return { status: 'ok' }; }
```
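The two handler forms side by side (a sketch; warmUpCheck is a hypothetical async dependency, not part of the renderer API):

```javascript
// Hypothetical async warm-up dependency for the async variant.
async function warmUpCheck() {
  return true;
}

// Synchronous form: return the payload directly; Fastify serializes it as 200.
const syncHealth = () => ({ status: 'ok' });

// Async form: an async function always returns a Promise, which Fastify
// awaits before serializing; there is no separate "return a Promise" case.
const asyncHealth = async () => {
  await warmUpCheck();
  return { status: 'ok' };
};
```

Either form registers the same way, e.g. `app.get('/health', asyncHealth)`.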
