Two runs on file: a single-worker baseline at 5e0b6ae and a multi-worker
re-baseline at 18e1908 after #68/#71 shipped (PC2NUTS_WORKERS=2,
Redis-backed shared rate-limit storage via a sidecar container).

| | Single-worker baseline | Multi-worker (current) |
|---|---|---|
| Date | 2026-04-30 | 2026-05-01 |
| Commit | `5e0b6ae` | `18e1908` |
| uvicorn workers | 1 | 2 |
| Rate-limit backend | per-process in-memory | Redis sidecar (`redis://localhost:6379/0`), shared across workers |
| Test client | BE residential → DE PoP, single source IP, authenticated via labeled trusted token (revoked after each run) | (same) |
| Tools | bombardier v1.2.6, vegeta v12.12.0 | (same) |
| Reproduction | `scripts/perf_test.sh` | (same) |
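
For orientation, the shared-storage wiring implied by the table looks roughly like the slowapi setup below. This is a sketch, not the app's actual code; the fallback default and the use of `default_limits` (rather than per-route decorators) are assumptions.

```python
# Sketch of shared rate-limit storage across uvicorn workers (assumed wiring,
# not the app's module). With PC2NUTS_WORKERS=2 a "memory://" backend would
# give each worker its own counters; pointing storage_uri at the Redis
# sidecar makes the per-IP cap global.
import os

from fastapi import FastAPI
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(
    key_func=get_remote_address,  # per-IP counting
    storage_uri=os.environ.get("PC2NUTS_RATE_LIMIT_STORAGE_URI", "memory://"),
    default_limits=["120/minute"],  # the published anonymous cap
)

app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
```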
Multi-worker plateau under realistic random-corpus load (Scenario B): ~35-38 RPS before queue saturation. Hot-key with persistent connections (Scenario A) sustains ~50 RPS. Single-worker baseline plateaued at ~30 RPS in both.
Recommended operating point: unchanged at 27 RPS. The 3-minute sustained run holds 100% success, p99 162 ms — well inside the SLO. Multi-worker raises headroom above the operating point from ~10% to ~30-40%, not the operating point itself.
Rate-limit shared-storage verified. 130 sequential anonymous requests from a single source IP yielded exactly 120 × `200` + 10 × `429`. The Redis sidecar is reachable from both workers and the cap is enforced globally, not per-worker.
The aggregate ceiling scales roughly 1.6× with two workers, not 2×. Likely contributors: GIL contention on Pydantic serialisation, fresh-TLS overhead per request in vegeta's connection pattern, and shared platform-edge serialisation in front of the pod. Scenario A's higher ceiling (50 RPS with persistent connections) implies the per-request TLS handshake is part of the cap, not just per-request CPU.
Scenario B, the random-corpus rate sweep, is the realistic-input scenario and the basis for the headline numbers.
| Offered RPS | Achieved RPS | Success | p50 | p90 | p95 | p99 | Max |
|---|---|---|---|---|---|---|---|
| 10 | 10.0 | 100% | 57 ms | 68 ms | 71 ms | 102 ms | 120 ms |
| 20 | 20.0 | 100% | 60 ms | 70 ms | 83 ms | 111 ms | 211 ms |
| 25 | 25.1 | 100% | 56 ms | 71 ms | 79 ms | 112 ms | 181 ms |
| 30 | 30.0 | 100% | 62 ms | 75 ms | 95 ms | 122 ms | 210 ms |
| 35 | 34.8 | 100% | 63 ms | 97 ms | 110 ms | 150 ms | 170 ms |
| 40 | 38.3 | 100% | 1.71 s | 3.14 s | 3.61 s | 4.24 s | 4.60 s |
| 50 | 36.3 | 89.3% | 3.85 s | 6.63 s | 8.75 s | 9.86 s | 10.7 s |
| 60 | 52.2 | 100% | 1.60 s | 2.44 s | 2.65 s | 2.98 s | 3.14 s |
The new knee sits between 35 and 40 RPS. From 35 → 40 the throughput barely moves (38 vs 35) but tail latencies jump 30×. Beyond, behaviour is bimodal: 50 RPS hit transient platform back-pressure (107 × 503), while 60 RPS pushed through cleanly at higher achieved throughput than 50 — the platform-edge layer's overload mode is non-monotonic.
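For reference, a rate sweep of this shape can be driven with something like the sketch below. `scripts/perf_test.sh` is the authoritative harness; the targets file, output paths, and per-rate duration here are assumptions.

```python
# Hedged sketch of a Scenario-B-style vegeta rate sweep (not the real harness).
# Assumes a targets file whose entries hit /lookup with corpus postcodes and a
# trusted-token Authorization header; one attack per offered rate.
import subprocess
from pathlib import Path

RATES = (10, 20, 25, 30, 35, 40, 50, 60)  # offered RPS, as in the table above
OUT = Path("/tmp/perf")
OUT.mkdir(parents=True, exist_ok=True)

for rate in RATES:
    results = OUT / f"scenario_b_{rate}rps.bin"  # hypothetical filename
    with open(results, "wb") as sink:
        subprocess.run(
            ["vegeta", "attack",
             "-targets", "targets.txt",  # corpus-derived targets + auth header
             "-rate", str(rate),
             "-duration", "60s"],
            stdout=sink, check=True,
        )
    # Render the latency/throughput summary that feeds one row of the table.
    report = subprocess.run(
        ["vegeta", "report", "-type", "json", str(results)],
        capture_output=True, text=True, check=True,
    )
    (OUT / f"scenario_b_{rate}rps.json").write_text(report.stdout)
```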
Compared to the single-worker baseline:
| Offered RPS | p99 (single-worker) | p99 (multi-worker) | Δ |
|---|---|---|---|
| 10 | 74 ms | 102 ms | +28 ms (within noise) |
| 20 | 96 ms | 111 ms | +15 ms |
| 25 | 151 ms | 112 ms | −39 ms |
| 30 | 193 ms | 122 ms | −71 ms |
| 35 | 4.5 s | 150 ms | −4.3 s — single-worker collapsed here |
At and below the operating point the curves are similar; the win shows up at and beyond the old knee, where the new system absorbs ~20% more sustained throughput before breaking down.
Scenario A, the hot-key sweep over persistent connections (bombardier, stepped connection counts): throughput plateaus regardless of client concurrency, confirming the bottleneck is per-request work, not concurrency exhaustion on the client. The plateau is roughly 1.6× the single-worker baseline.
| Connections | Reqs/sec (single) | Reqs/sec (multi) | p99 (single) | p99 (multi) |
|---|---|---|---|---|
| 5 | 29.6 | 46.9 | 267 ms | 186 ms |
| 10 | 31.0 | 50.8 | 479 ms | 338 ms |
| 20 | 31.8 | 47.6 | 1.00 s | 746 ms |
| 40 | 30.9 | 51.4 | 2.31 s | 1.20 s |
| 80 | 30.4 | 47.9 | 6.92 s | 2.78 s |
Throughput is bounded around 50 RPS; concurrency just queues. Tail latency at saturation is ~2.5× lower under multi-worker — at c=80 the single-worker setup pushed p99 to 6.9 s, multi-worker holds it under 2.8 s.
At c≥100 the platform pushes back (unchanged from single-worker baseline). Stay below c=80 in any scripted test against this deployment.
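For reference, the concurrency sweep can be reproduced with something like the sketch below; the hot-key query parameter is a placeholder, and `scripts/perf_test.sh` remains the authoritative harness.

```python
# Hedged sketch of a Scenario-A-style bombardier concurrency sweep (not the
# real harness). One fixed "hot" lookup URL over persistent connections, with
# stepped connection counts; per the guidance above, stay below c=80.
import os
import subprocess

target = os.environ["PC2NUTS_TARGET"] + "/lookup?postcode=1000"  # hypothetical hot key
auth_header = f"Authorization: Bearer {os.environ['PC2NUTS_TOKEN']}"

for connections in (5, 10, 20, 40):
    subprocess.run(
        ["bombardier",
         "-c", str(connections),  # concurrent persistent connections
         "-d", "60s",             # duration per step
         "-l",                    # print the latency distribution
         "-H", auth_header,
         target],
        check=True,
    )
```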
The Tier 3 prefix-approximation scenario, compared to Scenario B at the same 25 RPS rate: p50 62 ms vs 56 ms; p99 115 ms vs 112 ms. The prefix-approximation path imposes no measurable latency cost at this load (matches the single-worker conclusion).
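For intuition only, a prefix-approximation fallback of the general shape the Tier 3 path's name suggests is sketched below: exact postcode first, then progressively shorter prefixes. This is purely illustrative, not the app's actual data model or code.

```python
# Illustrative only — not the service's implementation. Shows why a
# prefix-approximation fallback stays a handful of dict lookups and so adds
# no measurable latency on top of the exact-match path.
EXACT = {"1000": "BE100", "1050": "BE100"}   # made-up postcode → NUTS codes
PREFIX = {"10": "BE10", "1": "BE1"}          # made-up prefix → coarser NUTS

def lookup(postcode: str) -> str | None:
    if postcode in EXACT:                        # exact match first
        return EXACT[postcode]
    for cut in range(len(postcode) - 1, 0, -1):  # longest matching prefix
        approx = PREFIX.get(postcode[:cut])
        if approx is not None:
            return approx
    return None
```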
`/health` latency, with the `/lookup` Scenario B row at 25 RPS for comparison:

| Endpoint | p50 | p95 | p99 | Max |
|---|---|---|---|---|
| `/health` (multi-worker) | 18 ms | 37 ms | 63 ms | 91 ms |
| `/health` (single-worker baseline) | 15 ms | 19 ms | 27 ms | 62 ms |
| `/lookup` (Scenario B at 25 RPS, multi-worker) | 56 ms | 79 ms | 112 ms | 181 ms |
/health p99 is ~2× higher under multi-worker (63 ms vs 27 ms) — small absolute number, but the only place the second worker is visibly worse. Probable cause: process scheduling jitter when the OS load-balances incoming connections across two workers vs one. Worth re-measuring if /health ever becomes a hot path; not material for the /lookup ceiling.
Scenario E, the 3-minute sustained run at the recommended 27 RPS operating point:

| Metric | Single-worker | Multi-worker |
|---|---|---|
| Total requests | 4,860 | 4,860 |
| Achieved rate | 27.0/s | 27.0/s |
| Success | 100.0% | 100.0% |
| p50 / p95 / p99 / max | 46 / 89 / 132 / 324 ms | 63 / 111 / 162 / 391 ms |
| <50 ms | 73.0% | 13.8% |
| <100 ms | 97.4% | 93.2% |
| <200 ms | 99.8% | 99.6% |
| 5xx | 0 | 0 |
| 429 | 0 | 0 |
No drift over the 3-minute window; p99 stayed under 200 ms throughout. Single-worker is faster at the median (far more requests under 50 ms), while the >100 ms tail is slightly fatter under multi-worker, so net p99 is ~30 ms higher. Within the SLO either way.
Pulled from the platform's per-app overview API at the end of the run (one sample, averaged over the recent activity window):
| Metric | Value |
|---|---|
| Avg pod RAM | ~810 MB |
| Avg pod CPU | ~13% |
| Avg pod RAM as % of limit | 2.5% |
Both numbers are well clear of pressure. CPU at 13% during sustained 27 RPS is a useful corroboration that the bottleneck isn't pure compute — the workers are idle most of the time, waiting on something else (TLS, network, or scheduler). Memory headroom is comfortable for adding a third worker or an in-process cache if either becomes warranted.
A separate probe with no Authorization header was used to exercise the
per-IP cap (the trusted-token bypass turns the cap off, so the perf scenarios
can't observe it). 130 sequential requests from a single source IP, against
the published cap of 120/minute:
| Outcome | Count |
|---|---|
| `200` | 120 |
| `429` | 10 |
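A minimal sketch of how such a probe can be driven (using httpx for brevity; the endpoint path and query parameter are placeholders, the point is 130 sequential unauthenticated requests from one IP against the `120/minute` cap):

```python
# Anonymous per-IP cap probe: no Authorization header, sequential requests,
# count status codes. Target URL and query parameter are placeholders.
from collections import Counter

import httpx

TARGET = "https://example.invalid/lookup?postcode=1000"

counts: Counter[int] = Counter()
with httpx.Client(timeout=10.0) as client:
    for _ in range(130):
        counts[client.get(TARGET).status_code] += 1

print(dict(counts))  # with shared storage, expect {200: 120, 429: 10}
```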
Result is exact, not approximate. If both workers had used per-process in-memory storage (the failure mode the startup validator at `app/config.py:42-50` exists to prevent; a sketch of that check follows the list below), the effective cap would have been 240, and 130 requests from one IP would have produced 130 × `200` and zero `429`s. The 120 + 10 split is conclusive evidence that:
- `PC2NUTS_RATE_LIMIT_STORAGE_URI=redis://localhost:6379/0` is being read.
- The Redis sidecar (`library/redis@sha256:84b07a33…5cf5b27`) is reachable from both workers via the shared pod network namespace.
- slowapi's shared-counter increments are synchronised across workers.
- The `120/minute` cap is honoured globally, not per-worker.
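For illustration, the check that validator performs amounts to something like the sketch below (the real code lives at `app/config.py:42-50`; this is not it, only the two documented env vars are taken from the report):

```python
# Sketch of the startup guard against per-process rate-limit storage with
# multiple workers (illustrative; the actual validator is app/config.py:42-50).
import os

workers = int(os.environ.get("PC2NUTS_WORKERS", "1"))
storage_uri = os.environ.get("PC2NUTS_RATE_LIMIT_STORAGE_URI", "memory://")

if workers > 1 and storage_uri.startswith("memory://"):
    # Each worker would keep its own counters, so the effective per-IP cap
    # becomes workers × 120/minute and the 130-request probe would see no 429s.
    raise RuntimeError(
        "PC2NUTS_WORKERS > 1 requires shared rate-limit storage, e.g. "
        "PC2NUTS_RATE_LIMIT_STORAGE_URI=redis://localhost:6379/0"
    )
```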
- Tools, methodology, and corpus are unchanged from the single-worker baseline: same `bombardier`/`vegeta` versions, same scenarios, same target file format. Numbers are directly comparable.
- Cooldown between runs. The same 10 s pause between scenarios as before; needed to keep residual queueing from one run from polluting the next.
- Single source IP test client. The aggregate ceiling reflects a single TCP/TLS termination path; distributed traffic from many IPs would push the ceiling up to where the actual per-pod work is the bottleneck, but the recommended operating point is set by single-client latency, so this is the realistic measurement.
- Multi-worker container topology. The deployment is now a single pod with two co-located containers: `api` running uvicorn with two workers, and `redis:7-alpine` started with `--save "" --appendonly no` for in-memory rate-limit counters. Both share the pod network namespace, so `redis://localhost:6379/0` is the api-to-redis URI. Rate-limit counters reset every minute, so no persistence is needed.
- Why the 1.6× and not 2×. Two workers don't double throughput. Likely contributors, in rough order: shared edge-layer TLS termination in front of the single pod; Pydantic serialisation contending under the GIL when both workers are CPU-bound on JSON; vegeta's per-request fresh-connection pattern at higher rates putting more weight on TLS than on the lookup itself. Scenario A (persistent connections) sustains 50 RPS while Scenario B (fresh per-request connections) plateaus at 35-38; the difference is the TLS handshake cost, which a third worker won't help with.
- Recommended operating point unchanged at 27 RPS. Scenario E meets the p99 ≤ 200 ms SLO with 100% success at this rate. The multi-worker headroom buys a wider safety margin above that operating point (~30-40% vs ~10%), useful for absorbing bursts without rewriting the recommendation.
- Per-IP cap unchanged at `120/minute` (2 RPS per IP). With the aggregate ceiling now ~50 RPS, this is roughly 1/25 of the ceiling: a comfortable margin that supports up to ~25 simultaneous full-rate anonymous clients before the aggregate degrades. Bumping the cap is reasonable if dashboards show consistent under-utilisation, but not required.
- Don't push above `PC2NUTS_WORKERS=2` yet. The remaining gap between Scenario A (50 RPS) and Scenario B (35 RPS) suggests the bottleneck has shifted from pure compute to TLS and connection setup. Adding a third worker would help only if the platform's TLS termination scales with it; that is an empirical question, and the cheapest first investigation is reusing connections client-side, not adding more workers.
- Re-baseline if the topology changes. Specifically:
  - Adding a second pod replica (raising `autoScaling.max` above 1) would multiply both ceilings, and the rate-limit storage already supports it (Redis is shared per pod today; it would need to move to a cross-pod shared service when scaling out).
  - #7 (UK NSPL, +1.79M postcodes) should not change per-request latency materially (still a dict lookup) but doubles in-memory state per worker. Re-run to confirm.
- Don't run unattended high-concurrency tests. bombardier at c≥100 from a single source still triggers platform-level connection refusal. The Scenario B 50 RPS result here (107 × 503) is a milder version of the same edge back-pressure. Keep scripted load below c=80 and below 50 RPS in B-style sweeps.
```bash
# 1. Issue a labeled trusted token (operator credentials required):
export PC2NUTS_TOKEN_DB_URL='libsql://...'
export PC2NUTS_TOKEN_DB_AUTH_TOKEN='...'
python -m scripts.tokens add --label "perf-test-$(date -I)"
# Token will become active in the running service after one refresh interval (default 60 s).

# 2. Run the suite:
export PC2NUTS_TARGET='https://example.invalid'
export PC2NUTS_TOKEN='<the value printed above>'
scripts/perf_test.sh

# 3. Revoke when done:
python -m scripts.tokens revoke <id>
```

Raw outputs are written to `/tmp/perf/`. The harness automatically downloads a fresh corpus from public GISCO TERCET ZIPs on first run.