perf(proxy): bind the load balancer before loading routes #105

Merged: dviejokfs merged 10 commits into main from perf/fast-lb-bind on May 30, 2026

@dviejokfs dviejokfs commented May 29, 2026

Problem

In prod the Pingora proxy (the load balancer on 80/443) took 2.5–10s to start accepting connections. Startup ran a serial chain of blocking block_on calls before server.run() bound the listeners, dominated by:

  • Two full route-table loads (each an N+1 query loop over every domain/env/deployment/container/project/node)
  • An inline DNS reconcile_all() awaited at the end of every load
  • A TimescaleDB CALL refresh_continuous_aggregate backfill
  • A Docker connect+ping

Change: bind first, load asynchronously

The load balancer now binds 80/443 immediately and loads routes in the background.

  • RouteTableListener::start_listening subscribes to PG NOTIFY first (closing the missed-NOTIFY window), then spawns the initial load instead of awaiting it — the proxy no longer waits for routes to bind.
  • Removed the duplicate load_routes() in setup_proxy_server. It only existed because the on-demand sleeping-domain callback was registered too late; it's now registered in serve before the first load, so the first load populates sleeping domains / on-demand configs.
  • load_routes bumps the generation + notifies waiters before the DNS reconcile, and the reconcile is now fire-and-forget (never gates a load or the readiness wait).
  • run_post_migration_backfill is detached from establish_connection and spawned on the long-lived runtime (idempotent; the refresh policy catches up).
  • resolve_peer readiness guard: when the route table has never loaded, a request briefly waits for the first load (wait_until_loaded, 5s) and re-looks-up before falling back to the console — covers the cold-start window the async load introduces. After the first load, an unmatched host falls through immediately, exactly as before.
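
The readiness guard in the last bullet can be sketched with standard-library primitives. This is a minimal sketch, not the proxy's code: `RouteTable`, `publish`, and the `"console"` fallback string are hypothetical stand-ins, and it uses a blocking `Condvar` where the real proxy presumably relies on its async runtime. Only `has_loaded`, `wait_until_loaded`, and `resolve_peer` are names taken from the description above.

```rust
use std::collections::HashMap;
use std::sync::{Condvar, Mutex};
use std::time::Duration;

#[derive(Default)]
struct State {
    generation: u64,                 // 0 means "never loaded"
    routes: HashMap<String, String>, // host -> upstream (hypothetical shape)
}

/// Hypothetical stand-in for the proxy's route table.
pub struct RouteTable {
    state: Mutex<State>,
    loaded_cv: Condvar,
}

impl RouteTable {
    pub fn new() -> Self {
        Self { state: Mutex::new(State::default()), loaded_cv: Condvar::new() }
    }

    /// Swap in a freshly loaded table, bump the generation, wake waiters.
    pub fn publish(&self, routes: HashMap<String, String>) {
        let mut st = self.state.lock().unwrap();
        st.routes = routes;
        st.generation += 1;
        self.loaded_cv.notify_all();
    }

    pub fn has_loaded(&self) -> bool {
        self.state.lock().unwrap().generation > 0
    }

    /// Block until the first load is published or `timeout` elapses.
    /// Returns true if the table is loaded when the call returns.
    pub fn wait_until_loaded(&self, timeout: Duration) -> bool {
        let guard = self.state.lock().unwrap();
        let (guard, _timed_out) = self
            .loaded_cv
            .wait_timeout_while(guard, timeout, |st| st.generation == 0)
            .unwrap();
        guard.generation > 0
    }

    fn lookup(&self, host: &str) -> Option<String> {
        self.state.lock().unwrap().routes.get(host).cloned()
    }

    /// Resolve a host. Only the cold-start path ever waits.
    pub fn resolve_peer(&self, host: &str, cold_wait: Duration) -> String {
        if let Some(upstream) = self.lookup(host) {
            return upstream;
        }
        if !self.has_loaded() && self.wait_until_loaded(cold_wait) {
            // The first load landed while we waited: look the host up again.
            if let Some(upstream) = self.lookup(host) {
                return upstream;
            }
        }
        "console".to_string() // fallback target, name assumed for illustration
    }
}
```

The wait is reachable only while the generation is still zero, so once the first load has been published an unmatched host falls through without any delay.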

The admin gate stays pre-bind and fail-closed (a ~1ms single-row read that only governs the management console, not app routing).

Verification

Local run, timestamped:

  • Proxy binds at T+168ms (from DB-init) while the route table finishes loading 124ms later — i.e. the listener was accepting connections before routes were loaded.
  • HTTP 200 served through the proxy; backfill (events_hourly … complete) and preview-gateway reconcile (✅ reconciled) both complete off the bind path.
  • cargo check --bin temps clean; 346 tests pass including 4 new readiness tests for has_loaded / wait_until_loaded.

Not in scope (follow-ups)

  • The N+1 inside load_routes is unchanged — the background load is still slow at prod scale, it just no longer blocks the bind. Batching those queries is the highest-value follow-up.
  • Docker ping still runs pre-bind; a wedged Docker socket could still delay the bind. Deferring it is a candidate follow-up.
  • The cold-window readiness wait is validated by unit tests but has not yet been exercised live at scale (the local load was too fast to hit the branch).
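
To make the N+1 cost in the first follow-up concrete, the sketch below counts round trips against an in-memory stand-in. Everything here is hypothetical (`FakeDb`, `load_n_plus_one`, `load_batched`), and the real `load_routes` walks more entity types than the single domain-to-container hop shown.

```rust
use std::cell::Cell;
use std::collections::HashMap;

/// Hypothetical in-memory "database" that counts round trips.
pub struct FakeDb {
    pub containers_by_domain: HashMap<u32, String>,
    pub queries: Cell<usize>,
}

impl FakeDb {
    fn tick(&self) {
        self.queries.set(self.queries.get() + 1);
    }

    /// One query: list every domain id.
    pub fn list_domains(&self) -> Vec<u32> {
        self.tick();
        let mut ids: Vec<u32> = self.containers_by_domain.keys().copied().collect();
        ids.sort();
        ids
    }

    /// One query per call: fetch the container for a single domain.
    pub fn container_for(&self, domain: u32) -> Option<String> {
        self.tick();
        self.containers_by_domain.get(&domain).cloned()
    }

    /// One query total: fetch containers for all ids at once
    /// (the moral equivalent of `WHERE domain_id = ANY($1)`).
    pub fn containers_for(&self, domains: &[u32]) -> HashMap<u32, String> {
        self.tick();
        domains
            .iter()
            .filter_map(|d| self.containers_by_domain.get(d).map(|c| (*d, c.clone())))
            .collect()
    }
}

/// N+1 shape: one list query, then one query per row.
pub fn load_n_plus_one(db: &FakeDb) -> Vec<(u32, String)> {
    db.list_domains()
        .into_iter()
        .filter_map(|d| db.container_for(d).map(|c| (d, c)))
        .collect()
}

/// Batched shape: one list query plus one bulk query, joined in memory.
pub fn load_batched(db: &FakeDb) -> Vec<(u32, String)> {
    let ids = db.list_domains();
    let by_id = db.containers_for(&ids);
    ids.into_iter()
        .filter_map(|d| by_id.get(&d).cloned().map(|c| (d, c)))
        .collect()
}
```

With N domains the first form issues N+1 queries and the second issues 2, while both return the same rows.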

Note for reviewers

This branch carries the in-process RouteReloadSubscriber (ForceRouteReload over the shared queue) it was built on top of — that work was already in the working tree, and its relocation inside serve/mod.rs is interleaved with this restructure and not cleanly separable. Unrelated frontend CLI (.ts) changes were deliberately left out.


Also in this PR: feat(cli) TEMPS_CONTEXT override

Bundled per request (unrelated to the proxy change). Adds a TEMPS_CONTEXT env var that pins the active CLI context for a shell/CI session without mutating .contexts.json, surfaces the override in context ls / context use / whoami, and fixes loginWithApiKey to validate against the server passed via positional/--url (previously it could validate against the wrong server and wipe credentials). Includes contexts.test.ts unit tests for the resolver.

dviejokfs added 10 commits May 29, 2026 19:52
The Pingora proxy took 2.5-10s to start accepting connections in prod
because startup ran a serial chain of blocking block_on calls before
server.run() bound 80/443 — dominated by two full route-table loads
(each an N+1 query loop), an inline DNS reconcile, a TimescaleDB
aggregate backfill, and a Docker ping.

Bind first, load asynchronously:

- RouteTableListener::start_listening subscribes to PG NOTIFY first
  (closing the missed-NOTIFY window) and spawns the initial load instead
  of awaiting it, so the proxy binds without waiting for routes.
- Remove the duplicate load_routes() in setup_proxy_server by registering
  the on-demand sleeping-domain callback in serve before the first load
  runs, so the first load populates sleeping domains / on-demand configs.
- load_routes bumps generation + notifies waiters BEFORE the DNS reconcile,
  and the reconcile is now fire-and-forget (never gates a load).
- run_post_migration_backfill is detached from establish_connection and
  spawned on the long-lived runtime (idempotent; refresh policy catches up).
- resolve_peer: when the table has never loaded, briefly wait for the first
  load (wait_until_loaded, 5s) and re-lookup before falling back to the
  console — covers the cold-start window the async load introduces.
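
The subscribe-then-spawn ordering in the first bullet can be illustrated as follows. This is a minimal sketch under stated assumptions: an `mpsc` channel stands in for the PG NOTIFY subscription, a thread stands in for the spawned task, and this `start_listening` signature is hypothetical rather than the one in the codebase.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Returns immediately with a handle for sending change notifications
/// and the background worker's join handle.
pub fn start_listening(
    table: Arc<Mutex<Vec<String>>>,
    load_routes: impl Fn() -> Vec<String> + Send + 'static,
) -> (Sender<()>, JoinHandle<()>) {
    // 1. Subscribe first: anything sent from this point on is queued,
    //    even while the initial load is still running.
    let (notify_tx, notify_rx) = channel::<()>();

    // 2. Spawn the initial load instead of waiting for it, so the caller
    //    can go on to bind its listeners right away.
    let handle = thread::spawn(move || {
        *table.lock().unwrap() = load_routes();

        // 3. Drain notifications, including any that arrived during step 2.
        //    The loop ends once every sender has been dropped.
        for _ in notify_rx {
            *table.lock().unwrap() = load_routes();
        }
    });

    (notify_tx, handle)
}
```

Because the channel exists before the load starts, a change that lands mid-load is never lost: at worst it triggers one redundant reload afterwards.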

Verified locally: proxy binds at T+168ms while the route table finishes
loading 124ms later; HTTP 200 served through the proxy; backfill and
preview-gateway reconcile complete off the bind path. 346 tests pass
incl. 4 new readiness tests for has_loaded/wait_until_loaded.

NOTE: carries the in-process RouteReloadSubscriber (ForceRouteReload over
the shared queue) it was built on top of — its relocation in serve/mod.rs
is interleaved with this restructure and not separable.

Adds a TEMPS_CONTEXT environment variable that pins the active CLI
context for the current shell / CI session without mutating the shared
.contexts.json (mirroring how TEMPS_API_URL overrides the resolved URL).

- envContextName() / pickActiveContext(): TEMPS_CONTEXT, when set, selects
  the active context; a missing name returns null (unauthenticated) with a
  one-time stderr warning rather than silently falling back to another
  context's credentials — picking the wrong server silently is how you push
  to prod by accident.
- `context ls` and `context use` reflect the env override: the active
  marker / JSON isActive follow TEMPS_CONTEXT, and `use` warns when an env
  pin overrides the switch so it doesn't appear to silently no-op.
- `whoami` labels the active context with "(env: TEMPS_CONTEXT)" when the
  env var drove the selection.
- loginWithApiKey: resolve the target server (positional/--url) BEFORE
  validating the key, so an api-key login against a named server doesn't
  validate against the active context / localhost default, fail, and wipe
  credentials.
- contexts.test.ts: unit tests for the resolver + missing-context warning.
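
The resolution rule in the first bullet can be sketched as a pure function. The CLI itself is TypeScript; the rule is rendered in Rust here as an illustration only, and the `Active` enum, the function signature, and the warning text are all hypothetical. Treating a blank value as unset is also an assumption, not something the commit states.

```rust
/// Where the active context came from.
#[derive(Debug, PartialEq)]
pub enum Active {
    FromEnv(String),
    FromFile(String),
    Unauthenticated,
}

/// `env_pin` is the value of TEMPS_CONTEXT (None if unset), `stored_active`
/// the active context recorded in .contexts.json, and `known` the configured
/// context names. Returns the selection plus an optional warning that the
/// caller would print to stderr once.
pub fn pick_active_context(
    env_pin: Option<&str>,
    stored_active: Option<&str>,
    known: &[&str],
) -> (Active, Option<String>) {
    // Assumption: a blank TEMPS_CONTEXT behaves like an unset one.
    match env_pin.map(str::trim).filter(|s| !s.is_empty()) {
        // The env pin wins when it names a real context.
        Some(name) if known.contains(&name) => (Active::FromEnv(name.to_string()), None),
        // A pin that names nothing yields "unauthenticated" plus a warning,
        // never a silent fallback to another context's credentials.
        Some(name) => (
            Active::Unauthenticated,
            Some(format!("TEMPS_CONTEXT={name} does not match any configured context")),
        ),
        // No pin: use whatever the shared file says is active.
        None => match stored_active.filter(|n| known.contains(n)) {
            Some(name) => (Active::FromFile(name.to_string()), None),
            None => (Active::Unauthenticated, None),
        },
    }
}
```

Returning the warning instead of printing it keeps the resolver free of side effects, which makes the missing-context case straightforward to unit test.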

Unrelated to the proxy bind change in this branch; bundled per request.

Adds [Unreleased] entries for the async-route-loading proxy bind
(Changed), the TEMPS_CONTEXT env override (Added), and the api-key
login server-resolution fix (Fixed).

Add a hard-coded AI-agent taxonomy (OpenAI, Anthropic, Perplexity,
Google, Apple, Meta, Amazon, ByteDance, Common Crawl, Cohere, Diffbot,
You.com, DuckDuckGo, Brave, Mistral, xAI, and more) that classifies
crawler user agents into (provider, agent) pairs at ingest time, stored
as the canonical agent name in proxy_logs.bot_name.

Backend (temps-proxy):
- ai_agent_detector module with RegexSet-based detection + 7 unit tests
- ai_agent / ai_provider / is_ai_agent filters on GET /proxy-logs
- GET /proxy-logs/stats/ai-agents      per-agent/provider breakdown
- GET /proxy-logs/stats/ai-pages       top pages crawled by AI, with a
  distinct-agent count and optional exact-path filter
- GET /proxy-logs/ai-agents/known      taxonomy for UI dropdowns

Migration: composite (project_id, <dim>, timestamp DESC) indexes on
proxy_logs for the project-scoped Request Logs filters (status, method,
environment, is_bot, deployment) -- idempotent, mirrors the hypertable
compression layout.

Frontend (web):
- AiAgentLogo component + 21 provider logos (white chip so monochrome
  brand marks stay legible in dark mode)
- AI Agents overview card (top 5 + View all) and full AiAgentsDetail
  page: ranked agents (by provider/agent) + pages-crawled-by-AI table
- Page detail shows an "AI Agents" stat (distinct agents + requests)
  linking into the bot-filtered request log
- Request Logs redesign: advanced filters collapsed behind a toggle;
  AI/provider/path drill-down context shown as removable chips and
  actually applied to the query (previously dropped). Removed the
  unindexed user-agent LIKE search.

The "Pages crawled by AI" table only showed a distinct-agent count per
path. Make each page row expandable to reveal which agents hit that page
and how many times each.

- get_ai_agent_breakdown gains an optional exact-path filter so a single
  page returns its per-agent counts (bot_name GROUP BY scoped to path).
  The path param already existed on the shared OpenAPI query, so no SDK
  change is needed.
- AiAgentsDetail: page rows toggle an inline PageAgentBreakdown that
  lazily fetches the path-scoped breakdown (only on expand) and lists
  each agent with logo, share %, and request count. Clicking an agent
  opens the request log filtered to that page + agent; a "View all"
  link opens the page's full AI traffic.

Web: rework the AI Agents detail page from a cramped two-column grid into
two full-width tabs (Agents | Pages crawled) since the two datasets are
unrelated. Agents render as a proper ranked table (agent · share · unique
IPs · requests); pages stay expandable to the per-agent breakdown. The
three summary metrics move into compact header badges so the content sits
at the top instead of being pushed down by a tall KPI block.

CLI: add `temps analytics ai-agents`, `ai-pages`, and `ai-page <path>`
mirroring the web view, backed by the AI breakdown endpoints. Regenerate
the CLI SDK/openapi for the new routes and make the stdout output helpers
no-op in `--json`/quiet mode so machine-readable output isn't corrupted.
Document the new commands in the temps-cli SKILL.

Container logs viewer: infer a severity per line (ERROR/WARN/INFO/DEBUG/
TRACE) from common log shapes, add level filtering, and a pause/resume
control that freezes the visible rope while buffering the tail (capped at
maxLogs) so a chatty service can't grow unbounded. Switch match-scroll to
the virtualizer's scrollToIndex, parse and surface the optional server
timestamp as a structured field, and add a scroll-to-bottom action.
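
Per-line severity inference of the kind described above might look like the sketch below. The viewer is a web component, so this Rust version is an illustration of the idea only: `Level`, `infer_level`, the token aliases (`ERR`, `FATAL`, `WARNING`), and the ten-token window are all assumptions.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Guess a severity from common log shapes such as `[ERROR] ...`,
/// `level=warn ...`, or `2026-05-29T19:52:00Z INFO ...`.
/// Returns None when no level token appears near the start of the line.
pub fn infer_level(line: &str) -> Option<Level> {
    let upper = line.to_ascii_uppercase();
    upper
        // Split on punctuation and whitespace so brackets, `=`, and `:` vanish.
        .split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|tok| !tok.is_empty())
        // Only inspect the leading tokens, so a level word deep inside the
        // message body does not reclassify the line.
        .take(10)
        .find_map(|tok| match tok {
            "ERROR" | "ERR" | "FATAL" => Some(Level::Error),
            "WARN" | "WARNING" => Some(Level::Warn),
            "INFO" => Some(Level::Info),
            "DEBUG" => Some(Level::Debug),
            "TRACE" => Some(Level::Trace),
            _ => None,
        })
}

/// Level filtering: keep lines whose inferred level is in `wanted`.
/// Lines with no inferred level are kept only when `keep_unknown` is set.
pub fn filter_lines<'a>(lines: &[&'a str], wanted: &[Level], keep_unknown: bool) -> Vec<&'a str> {
    lines
        .iter()
        .copied()
        .filter(|line| match infer_level(line) {
            Some(level) => wanted.contains(&level),
            None => keep_unknown,
        })
        .collect()
}
```

Matching whole tokens rather than substrings avoids classifying a word like "terror" or "information" as a level.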

Masking: add value-based secret detection alongside the existing key-name
heuristic — connection-string userinfo, Authorization/Bearer tokens, and
JWTs are partially redacted in place (scheme://user:•••@host) so the
structure stays debuggable. The container env-var table now masks
sensitive values with a reveal/hide toggle, catching cases like
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer … and SENTRY_DSN where the
key name alone wouldn't flag them.
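
The connection-string case of the value-based masking can be sketched as below. This is an illustration in Rust rather than the frontend code: `mask_userinfo` and its parsing rules are assumptions, and only the output shape (`scheme://user:•••@host`) comes from the commit message. Bearer tokens and JWTs are not covered.

```rust
/// Redact the password in a URL-style value while keeping scheme, user,
/// host, and path visible. Values without `user:password@` pass through
/// unchanged.
pub fn mask_userinfo(value: &str) -> String {
    let Some(scheme_end) = value.find("://") else {
        return value.to_string();
    };
    let rest_start = scheme_end + 3;
    let rest = &value[rest_start..];

    // The authority ends at the first '/', '?', '#', or whitespace.
    let authority_end = rest
        .find(|c: char| matches!(c, '/' | '?' | '#') || c.is_whitespace())
        .unwrap_or(rest.len());
    let authority = &rest[..authority_end];

    // Userinfo is everything before the last '@' inside the authority.
    let Some(at) = authority.rfind('@') else {
        return value.to_string();
    };
    let userinfo = &authority[..at];

    // No ':' means there is a user but no password to hide.
    let Some(colon) = userinfo.find(':') else {
        return value.to_string();
    };

    format!(
        "{}{}:•••{}",
        &value[..rest_start], // scheme plus "://"
        &userinfo[..colon],   // user name stays readable
        &rest[at..]           // "@host..." onwards is untouched
    )
}
```

Restricting the `@` search to the authority keeps a path such as `/a@b:c` from being mistaken for credentials.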
AI crawlers (ClaudeBot, OAI-SearchBot, PerplexityBot, ...) never appeared
on the AI Agents analytics page because the live ingest path —
ProxyLogBatchWriter — only ran CrawlerDetector, which returns a loose UA
substring (e.g. `ClaudeBot/1.0` -> `"Bot/"`). The ai_agent_detector was
wired into ProxyLogService instead, which is not the path that writes
proxy logs in production.

The analytics query filters `bot_name = ANY(known_agents)`, so substrings
never match the taxonomy and the page (and provider breakdown) stay empty.

enrich_entry now runs ai_agent_detector::detect first (canonical name like
`ClaudeBot`), falling back to CrawlerDetector for non-AI bots — identical
to ProxyLogService. Going-forward only; existing rows keep their old
bot_name and need a separate backfill.

Adds test_enrich_entry_detects_ai_agent_canonical_name covering ClaudeBot,
OAI-SearchBot, PerplexityBot, CCBot, and Meta-ExternalAgent.
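
The detect-first, fall-back-second order that `enrich_entry` now follows can be sketched like this. It is a simplified illustration, not the module's code: it uses plain substring matching instead of the `RegexSet` the real detector uses, `SAMPLE_AGENTS` is a small hand-picked subset of the taxonomy, and `loose_crawler` is a hypothetical stand-in for `CrawlerDetector`.

```rust
/// (needle, canonical agent name, provider). Order matters: more specific
/// names must come before names they contain.
const SAMPLE_AGENTS: &[(&str, &str, &str)] = &[
    ("Applebot-Extended", "Applebot-Extended", "Apple"),
    ("Applebot", "Applebot", "Apple"),
    ("OAI-SearchBot", "OAI-SearchBot", "OpenAI"),
    ("ClaudeBot", "ClaudeBot", "Anthropic"),
    ("PerplexityBot", "PerplexityBot", "Perplexity"),
    ("CCBot", "CCBot", "Common Crawl"),
];

/// Return the canonical (agent, provider) pair for a known AI crawler.
pub fn detect_ai_agent(user_agent: &str) -> Option<(&'static str, &'static str)> {
    SAMPLE_AGENTS
        .iter()
        .find(|(needle, _, _)| user_agent.contains(needle))
        .map(|(_, agent, provider)| (*agent, *provider))
}

/// Stand-in for a generic crawler detector that reports the matched
/// substring rather than a canonical name.
pub fn loose_crawler(user_agent: &str) -> Option<&'static str> {
    ["Bot/", "crawler", "spider"]
        .into_iter()
        .find(|needle| user_agent.contains(needle))
}

/// The value to store as the bot name: canonical AI agent name first,
/// generic crawler match second, None for ordinary traffic.
pub fn bot_name(user_agent: &str) -> Option<&'static str> {
    detect_ai_agent(user_agent)
        .map(|(agent, _provider)| agent)
        .or_else(|| loose_crawler(user_agent))
}
```

With the order reversed, a `ClaudeBot/1.0` user agent would be stored as the loose `Bot/` match and never equal a taxonomy name, which is the empty-page symptom described above.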
…t names

The AI Agents analytics page was empty because historical proxy_logs rows
carry loose CrawlerDetector substrings (e.g. ClaudeBot/1.0 -> "Bot/")
instead of canonical taxonomy names, which the page filters on. The code
fix (3929bbf) handles new rows; this backfills existing ones so the page
has history.

Scoped to the last 7 days for two reasons tied to the TimescaleDB
hypertable: chunks >7 days are compressed (immutable) and >30 days are
dropped by retention. Crucially this uses a plain single-table UPDATE with
the AI-agent regex in the WHERE clause, NOT a self-join over proxy_logs --
the self-join form forces decompression and aborts with "tuple
decompression limit exceeded" on a compressed hypertable (verified locally
against 96 compressed chunks). The single-table form lets the planner
exclude compressed chunks and only writes matching rows.

CASE mirrors ai_agent_detector::AGENT_PATTERNS exactly (32 agents / 21
providers) in the same specificity order (Applebot-Extended before
Applebot, OAI-SearchBot before openai/, etc.). Idempotent via
`IS DISTINCT FROM`; down() is a no-op (substrings aren't recoverable).

Verified end-to-end against a seeded local hypertable: correct canonical
names, non-AI bots and humans untouched, re-run affects 0 rows.

The backfill must NOT run as a Sea-ORM migration: Migrator::up executes
inside establish_connection, which runs BEFORE the Pingora proxy binds its
listeners. A full-table UPDATE there would block proxy startup -- the exact
latency the fast-LB-bind work removes.

Remove the migration (02cdf12) and ship the verified, decompression-safe
UPDATE as scripts/backfill-ai-agent-bot-names.sql instead, run manually
against the DB via psql. Includes an optional read-only preview and a
post-run verification query.

The going-forward code fix (3929bbf, ProxyLogBatchWriter runs
ai_agent_detector) is unaffected and remains the long-term fix; this script
only reclassifies pre-existing rows when an operator chooses to run it.

@dviejokfs dviejokfs merged commit 579a624 into main May 30, 2026
11 checks passed