perf(proxy): bind the load balancer before loading routes by dviejokfs · Pull Request #105 · gotempsh/temps

dviejokfs · 2026-05-29T17:53:09Z

Problem

In prod the Pingora proxy (the load balancer on 80/443) took 2.5–10s to start accepting connections. Startup ran a serial chain of blocking block_on calls before server.run() bound the listeners, dominated by:

- Two full route-table loads (each an N+1 query loop over every domain/env/deployment/container/project/node)
- An inline DNS reconcile_all() awaited at the end of every load
- A TimescaleDB CALL refresh_continuous_aggregate backfill
- A Docker connect+ping

Change: bind first, load asynchronously

The load balancer now binds 80/443 immediately and loads routes in the background.

- RouteTableListener::start_listening subscribes to PG NOTIFY first (closing the missed-NOTIFY window), then spawns the initial load instead of awaiting it, so the proxy no longer waits for routes before binding.
- Removed the duplicate load_routes() in setup_proxy_server. It only existed because the on-demand sleeping-domain callback was registered too late; it's now registered in serve before the first load, so the first load populates sleeping domains / on-demand configs.
- load_routes bumps the generation + notifies waiters before the DNS reconcile, and the reconcile is now fire-and-forget (never gates a load or the readiness wait).
- run_post_migration_backfill is detached from establish_connection and spawned on the long-lived runtime (idempotent; the refresh policy catches up).
- resolve_peer readiness guard: when the route table has never loaded, a request briefly waits for the first load (wait_until_loaded, 5s) and re-looks-up before falling back to the console. This covers the cold-start window the async load introduces. After the first load, an unmatched host falls through immediately, exactly as before.
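
The readiness guard described above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: the `Readiness` struct, `mark_loaded`, the `CONSOLE` constant, and the `HashMap` route table are assumptions made for the example, and the real `has_loaded` / `wait_until_loaded` are presumably async rather than `Condvar`-based.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical fallback target standing in for the management console.
const CONSOLE: &str = "console";

/// Hypothetical readiness flag: false until the first route load completes.
#[derive(Default)]
struct Readiness {
    loaded: Mutex<bool>,
    first_load: Condvar,
}

impl Readiness {
    fn has_loaded(&self) -> bool {
        *self.loaded.lock().unwrap()
    }

    /// Called by the loader once the route table has been populated.
    fn mark_loaded(&self) {
        *self.loaded.lock().unwrap() = true;
        self.first_load.notify_all();
    }

    /// Blocks until the first load or the timeout; returns whether it loaded.
    fn wait_until_loaded(&self, timeout: Duration) -> bool {
        let guard = self.loaded.lock().unwrap();
        let (guard, _timed_out) = self
            .first_load
            .wait_timeout_while(guard, timeout, |loaded| !*loaded)
            .unwrap();
        *guard
    }
}

/// Looks up a host; only waits when the table had never loaded at entry.
fn resolve_peer(
    host: &str,
    routes: &Mutex<HashMap<String, String>>,
    ready: &Readiness,
    wait: Duration,
) -> String {
    // Read the flag before the lookup so a load that finishes in between
    // still triggers the re-lookup instead of a premature console fallback.
    let was_loaded = ready.has_loaded();
    let lookup = || routes.lock().unwrap().get(host).cloned();
    if let Some(upstream) = lookup() {
        return upstream;
    }
    if !was_loaded && ready.wait_until_loaded(wait) {
        if let Some(upstream) = lookup() {
            return upstream;
        }
    }
    CONSOLE.to_string()
}

fn main() {
    let routes: Arc<Mutex<HashMap<String, String>>> = Arc::default();
    let ready = Arc::new(Readiness::default());

    // Background load: the "listener" below is usable before this finishes.
    let (bg_routes, bg_ready) = (Arc::clone(&routes), Arc::clone(&ready));
    let loader = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        bg_routes
            .lock()
            .unwrap()
            .insert("app.example.com".to_string(), "10.0.0.5:3000".to_string());
        bg_ready.mark_loaded();
    });

    let cold = resolve_peer("app.example.com", &routes, &ready, Duration::from_secs(5));
    loader.join().unwrap();
    let unknown = resolve_peer("nope.example.com", &routes, &ready, Duration::from_secs(5));
    println!("cold request -> {cold}, unknown host after load -> {unknown}");
}
```

In this sketch a request that arrives during the cold window waits for the loader and then resolves normally, while an unknown host after the first load skips the wait entirely.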

The admin gate stays pre-bind and fail-closed (a ~1ms single-row read that only governs the management console, not app routing).

Verification

Local run, timestamped:

- Proxy binds at T+168ms (from DB-init) while the route table finishes loading 124ms later, i.e. the listener was accepting connections before routes were loaded.
- HTTP 200 served through the proxy; backfill (events_hourly … complete) and preview-gateway reconcile (✅ reconciled) both complete off the bind path.
- cargo check --bin temps clean; 346 tests pass including 4 new readiness tests for has_loaded / wait_until_loaded.

Not in scope (follow-ups)

- The N+1 inside load_routes is unchanged: the background load is still slow at prod scale, it just no longer blocks the bind. Batching those queries is the highest-value follow-up.
- Docker ping still runs pre-bind; a wedged Docker socket could still delay the bind. Deferring it is a candidate follow-up.
- The cold-window readiness wait is validated by unit tests but not yet exercised live at scale (the local load was too fast to hit the branch).

Note for reviewers

This branch carries the in-process RouteReloadSubscriber (ForceRouteReload over the shared queue) it was built on top of — that work was already in the working tree, and its relocation inside serve/mod.rs is interleaved with this restructure and not cleanly separable. Unrelated frontend CLI (.ts) changes were deliberately left out.

Also in this PR: `feat(cli)` TEMPS_CONTEXT override

Bundled per request (unrelated to the proxy change). Adds a TEMPS_CONTEXT env var that pins the active CLI context for a shell/CI session without mutating .contexts.json, surfaces the override in context ls / context use / whoami, and fixes loginWithApiKey to validate against the server passed via positional/--url (previously it could validate against the wrong server and wipe credentials). Includes contexts.test.ts unit tests for the resolver.

The Pingora proxy took 2.5-10s to start accepting connections in prod because startup ran a serial chain of blocking block_on calls before server.run() bound 80/443, dominated by two full route-table loads (each an N+1 query loop), an inline DNS reconcile, a TimescaleDB aggregate backfill, and a Docker ping.

Bind first, load asynchronously:

- RouteTableListener::start_listening subscribes to PG NOTIFY first (closing the missed-NOTIFY window) and spawns the initial load instead of awaiting it, so the proxy binds without waiting for routes.
- Remove the duplicate load_routes() in setup_proxy_server by registering the on-demand sleeping-domain callback in serve before the first load runs, so the first load populates sleeping domains / on-demand configs.
- load_routes bumps generation + notifies waiters BEFORE the DNS reconcile, and the reconcile is now fire-and-forget (never gates a load).
- run_post_migration_backfill is detached from establish_connection and spawned on the long-lived runtime (idempotent; refresh policy catches up).
- resolve_peer: when the table has never loaded, briefly wait for the first load (wait_until_loaded, 5s) and re-lookup before falling back to the console. This covers the cold-start window the async load introduces.

Verified locally: proxy binds at T+168ms while the route table finishes loading 124ms later; HTTP 200 served through the proxy; backfill and preview-gateway reconcile complete off the bind path. 346 tests pass incl. 4 new readiness tests for has_loaded/wait_until_loaded.

NOTE: carries the in-process RouteReloadSubscriber (ForceRouteReload over the shared queue) it was built on top of; its relocation in serve/mod.rs is interleaved with this restructure and not separable.

Adds a TEMPS_CONTEXT environment variable that pins the active CLI context for the current shell / CI session without mutating the shared .contexts.json (mirroring how TEMPS_API_URL overrides the resolved URL).

- envContextName() / pickActiveContext(): TEMPS_CONTEXT, when set, selects the active context; a missing name returns null (unauthenticated) with a one-time stderr warning rather than silently falling back to another context's credentials. Picking the wrong server silently is how you push to prod by accident.
- `context ls` and `context use` reflect the env override: the active marker / JSON isActive follow TEMPS_CONTEXT, and `use` warns when an env pin overrides the switch so it doesn't appear to silently no-op.
- `whoami` labels the active context with "(env: TEMPS_CONTEXT)" when the env var drove the selection.
- loginWithApiKey: resolve the target server (positional/--url) BEFORE validating the key, so an api-key login against a named server doesn't validate against the active context / localhost default, fail, and wipe credentials.
- contexts.test.ts: unit tests for the resolver + missing-context warning.

Unrelated to the proxy bind change in this branch; bundled per request.

Adds [Unreleased] entries for the async-route-loading proxy bind (Changed), the TEMPS_CONTEXT env override (Added), and the api-key login server-resolution fix (Fixed).

Add a hard-coded AI-agent taxonomy (OpenAI, Anthropic, Perplexity, Google, Apple, Meta, Amazon, ByteDance, Common Crawl, Cohere, Diffbot, You.com, DuckDuckGo, Brave, Mistral, xAI, and more) that classifies crawler user agents into (provider, agent) pairs at ingest time, stored as the canonical agent name in proxy_logs.bot_name.

Backend (temps-proxy):

- ai_agent_detector module with RegexSet-based detection + 7 unit tests
- ai_agent / ai_provider / is_ai_agent filters on GET /proxy-logs
- GET /proxy-logs/stats/ai-agents: per-agent/provider breakdown
- GET /proxy-logs/stats/ai-pages: top pages crawled by AI, with a distinct-agent count and optional exact-path filter
- GET /proxy-logs/ai-agents/known: taxonomy for UI dropdowns

Migration: composite (project_id, <dim>, timestamp DESC) indexes on proxy_logs for the project-scoped Request Logs filters (status, method, environment, is_bot, deployment) -- idempotent, mirrors the hypertable compression layout.

Frontend (web):

- AiAgentLogo component + 21 provider logos (white chip so monochrome brand marks stay legible in dark mode)
- AI Agents overview card (top 5 + View all) and full AiAgentsDetail page: ranked agents (by provider/agent) + pages-crawled-by-AI table
- Page detail shows an "AI Agents" stat (distinct agents + requests) linking into the bot-filtered request log
- Request Logs redesign: advanced filters collapsed behind a toggle; AI/provider/path drill-down context shown as removable chips and actually applied to the query (previously dropped). Removed the unindexed user-agent LIKE search.

The "Pages crawled by AI" table only showed a distinct-agent count per path. Make each page row expandable to reveal which agents hit that page and how many times each.

- get_ai_agent_breakdown gains an optional exact-path filter so a single page returns its per-agent counts (bot_name GROUP BY scoped to path). The path param already existed on the shared OpenAPI query, so no SDK change is needed.
- AiAgentsDetail: page rows toggle an inline PageAgentBreakdown that lazily fetches the path-scoped breakdown (only on expand) and lists each agent with logo, share %, and request count. Clicking an agent opens the request log filtered to that page + agent; a "View all" link opens the page's full AI traffic.

Web: rework the AI Agents detail page from a cramped two-column grid into two full-width tabs (Agents | Pages crawled) since the two datasets are unrelated. Agents render as a proper ranked table (agent · share · unique IPs · requests); pages stay expandable to the per-agent breakdown. The three summary metrics move into compact header badges so the content sits at the top instead of being pushed down by a tall KPI block.

CLI: add `temps analytics ai-agents`, `ai-pages`, and `ai-page <path>` mirroring the web view, backed by the AI breakdown endpoints. Regenerate the CLI SDK/openapi for the new routes and make the stdout output helpers no-op in `--json`/quiet mode so machine-readable output isn't corrupted. Document the new commands in the temps-cli SKILL.


Container logs viewer: infer a severity per line (ERROR/WARN/INFO/DEBUG/TRACE) from common log shapes, add level filtering, and a pause/resume control that freezes the visible rope while buffering the tail (capped at maxLogs) so a chatty service can't grow unbounded. Switch match-scroll to the virtualizer's scrollToIndex, parse and surface the optional server timestamp as a structured field, and add a scroll-to-bottom action.

Masking: add value-based secret detection alongside the existing key-name heuristic. Connection-string userinfo, Authorization/Bearer tokens, and JWTs are partially redacted in place (scheme://user:•••@host) so the structure stays debuggable. The container env-var table now masks sensitive values with a reveal/hide toggle, catching cases like OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer … and SENTRY_DSN where the key name alone wouldn't flag them.
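
As an illustration of the in-place redaction shape (scheme://user:•••@host), here is a small sketch of the connection-string case only. The function name `mask_userinfo` and its parsing rules are assumptions for the example; the actual viewer is presumably frontend code with its own detection logic, and this sketch does not cover Bearer tokens, JWTs, or passwords containing unescaped `/`.

```rust
/// Hypothetical helper: hides the password in `scheme://user:password@host...`
/// while leaving scheme, user, host, port, and path visible.
fn mask_userinfo(value: &str) -> String {
    let Some(scheme_end) = value.find("://") else {
        return value.to_string();
    };
    let authority_start = scheme_end + 3;
    let rest = &value[authority_start..];

    // The authority ends at the first path, query, or fragment delimiter.
    let authority_len = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    let authority = &rest[..authority_len];

    // No '@' means no userinfo; no ':' before it means no password to hide.
    let Some(at) = authority.rfind('@') else {
        return value.to_string();
    };
    let Some(colon) = authority[..at].find(':') else {
        return value.to_string();
    };

    format!(
        "{}{}:•••{}",
        &value[..authority_start],
        &authority[..colon],
        &rest[at..]
    )
}

fn main() {
    for raw in [
        "postgres://app:s3cret@db.internal:5432/temps",
        "https://example.com/a@b",
        "redis://cache.internal:6379",
    ] {
        println!("{}", mask_userinfo(raw));
    }
}
```

Values without a `user:password@` authority pass through unchanged, which keeps ordinary URLs readable in the env-var table.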

AI crawlers (ClaudeBot, OAI-SearchBot, PerplexityBot, ...) never appeared on the AI Agents analytics page because the live ingest path, ProxyLogBatchWriter, only ran CrawlerDetector, which returns a loose UA substring (e.g. `ClaudeBot/1.0` -> `"Bot/"`). The ai_agent_detector was wired into ProxyLogService instead, which is not the path that writes proxy logs in production. The analytics query filters `bot_name = ANY(known_agents)`, so substrings never match the taxonomy and the page (and provider breakdown) stay empty.

enrich_entry now runs ai_agent_detector::detect first (canonical name like `ClaudeBot`), falling back to CrawlerDetector for non-AI bots, identical to ProxyLogService. Going-forward only; existing rows keep their old bot_name and need a separate backfill.

Adds test_enrich_entry_detects_ai_agent_canonical_name covering ClaudeBot, OAI-SearchBot, PerplexityBot, CCBot, and Meta-ExternalAgent.
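
The detection order described above (canonical AI-agent name first, generic crawler substring as fallback) can be sketched as follows. This is a simplified illustration, not the module in this PR: it uses plain case-insensitive substring matching instead of the RegexSet the real ai_agent_detector uses, and `SKETCH_PATTERNS`, `GENERIC_MARKERS`, and all function names here are hypothetical. The table covers only a handful of agents.

```rust
/// Hypothetical subset of the taxonomy: (lowercase needle, provider, canonical agent).
/// Ordered most-specific first so Applebot-Extended wins over Applebot.
const SKETCH_PATTERNS: &[(&str, &str, &str)] = &[
    ("oai-searchbot", "OpenAI", "OAI-SearchBot"),
    ("claudebot", "Anthropic", "ClaudeBot"),
    ("perplexitybot", "Perplexity", "PerplexityBot"),
    ("applebot-extended", "Apple", "Applebot-Extended"),
    ("applebot", "Apple", "Applebot"),
    ("ccbot", "Common Crawl", "CCBot"),
    ("meta-externalagent", "Meta", "Meta-ExternalAgent"),
];

/// Hypothetical loose markers standing in for the generic crawler detector.
const GENERIC_MARKERS: &[&str] = &["Bot/", "bot/", "crawler", "spider"];

/// Returns (provider, canonical agent) for the first matching pattern.
fn detect_ai_agent(user_agent: &str) -> Option<(&'static str, &'static str)> {
    let ua = user_agent.to_ascii_lowercase();
    SKETCH_PATTERNS
        .iter()
        .find(|entry| ua.contains(entry.0))
        .map(|entry| (entry.1, entry.2))
}

/// Returns the loose substring that matched, mimicking the old behaviour.
fn detect_generic_crawler(user_agent: &str) -> Option<&'static str> {
    GENERIC_MARKERS
        .iter()
        .copied()
        .find(|marker| user_agent.contains(*marker))
}

/// AI-agent taxonomy first, generic crawler detection only as a fallback.
fn bot_name(user_agent: &str) -> Option<String> {
    detect_ai_agent(user_agent)
        .map(|(_provider, agent)| agent.to_string())
        .or_else(|| detect_generic_crawler(user_agent).map(|m| m.to_string()))
}

fn main() {
    for ua in [
        "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)",
        "Mozilla/5.0 (compatible; Applebot-Extended/0.1)",
        "SomeBot/2.0",
        "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0",
    ] {
        println!("{:?} <- {ua}", bot_name(ua));
    }
}
```

Running the generic detector alone on a ClaudeBot user agent yields the loose `"Bot/"` marker, which is why a `bot_name = ANY(known_agents)` filter would never match it.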

…t names

The AI Agents analytics page was empty because historical proxy_logs rows carry loose CrawlerDetector substrings (e.g. ClaudeBot/1.0 -> "Bot/") instead of canonical taxonomy names, which the page filters on. The code fix (3929bbf) handles new rows; this backfills existing ones so the page has history.

Scoped to the last 7 days for two reasons tied to the TimescaleDB hypertable: chunks >7 days are compressed (immutable) and >30 days are dropped by retention. Crucially this uses a plain single-table UPDATE with the AI-agent regex in the WHERE clause, NOT a self-join over proxy_logs -- the self-join form forces decompression and aborts with "tuple decompression limit exceeded" on a compressed hypertable (verified locally against 96 compressed chunks). The single-table form lets the planner exclude compressed chunks and only writes matching rows.

CASE mirrors ai_agent_detector::AGENT_PATTERNS exactly (32 agents / 21 providers) in the same specificity order (Applebot-Extended before Applebot, OAI-SearchBot before openai/, etc.). Idempotent via `IS DISTINCT FROM`; down() is a no-op (substrings aren't recoverable).

Verified end-to-end against a seeded local hypertable: correct canonical names, non-AI bots and humans untouched, re-run affects 0 rows.

The backfill must NOT run as a Sea-ORM migration: Migrator::up executes inside establish_connection, which runs BEFORE the Pingora proxy binds its listeners. A full-table UPDATE there would block proxy startup -- the exact latency the fast-LB-bind work removes. Remove the migration (02cdf12) and ship the verified, decompression-safe UPDATE as scripts/backfill-ai-agent-bot-names.sql instead, run manually against the DB via psql. Includes an optional read-only preview and a post-run verification query. The going-forward code fix (3929bbf, ProxyLogBatchWriter runs ai_agent_detector) is unaffected and remains the long-term fix; this script only reclassifies pre-existing rows when an operator chooses to run it.

dviejokfs added 10 commits May 29, 2026 19:52

docs(changelog): record fast LB bind + TEMPS_CONTEXT CLI changes

50beaa5


dviejokfs merged commit 579a624 into main May 30, 2026
11 checks passed

dviejokfs merged 10 commits into main from perf/fast-lb-bind
