perf(proxy): bind the load balancer before loading routes#105
Merged
Conversation
The Pingora proxy took 2.5-10s to start accepting connections in prod because startup ran a serial chain of blocking block_on calls before server.run() bound 80/443 — dominated by two full route-table loads (each an N+1 query loop), an inline DNS reconcile, a TimescaleDB aggregate backfill, and a Docker ping. Bind first, load asynchronously: - RouteTableListener::start_listening subscribes to PG NOTIFY first (closing the missed-NOTIFY window) and spawns the initial load instead of awaiting it, so the proxy binds without waiting for routes. - Remove the duplicate load_routes() in setup_proxy_server by registering the on-demand sleeping-domain callback in serve before the first load runs, so the first load populates sleeping domains / on-demand configs. - load_routes bumps generation + notifies waiters BEFORE the DNS reconcile, and the reconcile is now fire-and-forget (never gates a load). - run_post_migration_backfill is detached from establish_connection and spawned on the long-lived runtime (idempotent; refresh policy catches up). - resolve_peer: when the table has never loaded, briefly wait for the first load (wait_until_loaded, 5s) and re-lookup before falling back to the console — covers the cold-start window the async load introduces. Verified locally: proxy binds at T+168ms while the route table finishes loading 124ms later; HTTP 200 served through the proxy; backfill and preview-gateway reconcile complete off the bind path. 346 tests pass incl. 4 new readiness tests for has_loaded/wait_until_loaded. NOTE: carries the in-process RouteReloadSubscriber (ForceRouteReload over the shared queue) it was built on top of — its relocation in serve/mod.rs is interleaved with this restructure and not separable.
Adds a TEMPS_CONTEXT environment variable that pins the active CLI context for the current shell / CI session without mutating the shared .contexts.json (mirroring how TEMPS_API_URL overrides the resolved URL). - envContextName() / pickActiveContext(): TEMPS_CONTEXT, when set, selects the active context; a missing name returns null (unauthenticated) with a one-time stderr warning rather than silently falling back to another context's credentials — picking the wrong server silently is how you push to prod by accident. - `context ls` and `context use` reflect the env override: the active marker / JSON isActive follow TEMPS_CONTEXT, and `use` warns when an env pin overrides the switch so it doesn't appear to silently no-op. - `whoami` labels the active context with "(env: TEMPS_CONTEXT)" when the env var drove the selection. - loginWithApiKey: resolve the target server (positional/--url) BEFORE validating the key, so an api-key login against a named server doesn't validate against the active context / localhost default, fail, and wipe credentials. - contexts.test.ts: unit tests for the resolver + missing-context warning. Unrelated to the proxy bind change in this branch; bundled per request.
Adds [Unreleased] entries for the async-route-loading proxy bind (Changed), the TEMPS_CONTEXT env override (Added), and the api-key login server-resolution fix (Fixed).
Add a hard-coded AI-agent taxonomy (OpenAI, Anthropic, Perplexity, Google, Apple, Meta, Amazon, ByteDance, Common Crawl, Cohere, Diffbot, You.com, DuckDuckGo, Brave, Mistral, xAI, and more) that classifies crawler user agents into (provider, agent) pairs at ingest time, stored as the canonical agent name in proxy_logs.bot_name. Backend (temps-proxy): - ai_agent_detector module with RegexSet-based detection + 7 unit tests - ai_agent / ai_provider / is_ai_agent filters on GET /proxy-logs - GET /proxy-logs/stats/ai-agents per-agent/provider breakdown - GET /proxy-logs/stats/ai-pages top pages crawled by AI, with a distinct-agent count and optional exact-path filter - GET /proxy-logs/ai-agents/known taxonomy for UI dropdowns Migration: composite (project_id, <dim>, timestamp DESC) indexes on proxy_logs for the project-scoped Request Logs filters (status, method, environment, is_bot, deployment) -- idempotent, mirrors the hypertable compression layout. Frontend (web): - AiAgentLogo component + 21 provider logos (white chip so monochrome brand marks stay legible in dark mode) - AI Agents overview card (top 5 + View all) and full AiAgentsDetail page: ranked agents (by provider/agent) + pages-crawled-by-AI table - Page detail shows an "AI Agents" stat (distinct agents + requests) linking into the bot-filtered request log - Request Logs redesign: advanced filters collapsed behind a toggle; AI/provider/path drill-down context shown as removable chips and actually applied to the query (previously dropped). Removed the unindexed user-agent LIKE search.
The "Pages crawled by AI" table only showed a distinct-agent count per path. Make each page row expandable to reveal which agents hit that page and how many times each. - get_ai_agent_breakdown gains an optional exact-path filter so a single page returns its per-agent counts (bot_name GROUP BY scoped to path). The path param already existed on the shared OpenAPI query, so no SDK change is needed. - AiAgentsDetail: page rows toggle an inline PageAgentBreakdown that lazily fetches the path-scoped breakdown (only on expand) and lists each agent with logo, share %, and request count. Clicking an agent opens the request log filtered to that page + agent; a "View all" link opens the page's full AI traffic.
Web: rework the AI Agents detail page from a cramped two-column grid into two full-width tabs (Agents | Pages crawled) since the two datasets are unrelated. Agents render as a proper ranked table (agent · share · unique IPs · requests); pages stay expandable to the per-agent breakdown. The three summary metrics move into compact header badges so the content sits at the top instead of being pushed down by a tall KPI block. CLI: add `temps analytics ai-agents`, `ai-pages`, and `ai-page <path>` mirroring the web view, backed by the AI breakdown endpoints. Regenerate the CLI SDK/openapi for the new routes and make the stdout output helpers no-op in `--json`/quiet mode so machine-readable output isn't corrupted. Document the new commands in the temps-cli SKILL.
Container logs viewer: infer a severity per line (ERROR/WARN/INFO/DEBUG/ TRACE) from common log shapes, add level filtering, and a pause/resume control that freezes the visible rope while buffering the tail (capped at maxLogs) so a chatty service can't grow unbounded. Switch match-scroll to the virtualizer's scrollToIndex, parse and surface the optional server timestamp as a structured field, and add a scroll-to-bottom action. Masking: add value-based secret detection alongside the existing key-name heuristic — connection-string userinfo, Authorization/Bearer tokens, and JWTs are partially redacted in place (scheme://user:•••@host) so the structure stays debuggable. The container env-var table now masks sensitive values with a reveal/hide toggle, catching cases like OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer … and SENTRY_DSN where the key name alone wouldn't flag them.
AI crawlers (ClaudeBot, OAI-SearchBot, PerplexityBot, ...) never appeared on the AI Agents analytics page because the live ingest path — ProxyLogBatchWriter — only ran CrawlerDetector, which returns a loose UA substring (e.g. `ClaudeBot/1.0` -> `"Bot/"`). The ai_agent_detector was wired into ProxyLogService instead, which is not the path that writes proxy logs in production. The analytics query filters `bot_name = ANY(known_agents)`, so substrings never match the taxonomy and the page (and provider breakdown) stay empty. enrich_entry now runs ai_agent_detector::detect first (canonical name like `ClaudeBot`), falling back to CrawlerDetector for non-AI bots — identical to ProxyLogService. Going-forward only; existing rows keep their old bot_name and need a separate backfill. Adds test_enrich_entry_detects_ai_agent_canonical_name covering ClaudeBot, OAI-SearchBot, PerplexityBot, CCBot, and Meta-ExternalAgent.
…t names The AI Agents analytics page was empty because historical proxy_logs rows carry loose CrawlerDetector substrings (e.g. ClaudeBot/1.0 -> "Bot/") instead of canonical taxonomy names, which the page filters on. The code fix (3929bbf) handles new rows; this backfills existing ones so the page has history. Scoped to the last 7 days for two reasons tied to the TimescaleDB hypertable: chunks >7 days are compressed (immutable) and >30 days are dropped by retention. Crucially this uses a plain single-table UPDATE with the AI-agent regex in the WHERE clause, NOT a self-join over proxy_logs -- the self-join form forces decompression and aborts with "tuple decompression limit exceeded" on a compressed hypertable (verified locally against 96 compressed chunks). The single-table form lets the planner exclude compressed chunks and only writes matching rows. CASE mirrors ai_agent_detector::AGENT_PATTERNS exactly (32 agents / 21 providers) in the same specificity order (Applebot-Extended before Applebot, OAI-SearchBot before openai/, etc.). Idempotent via `IS DISTINCT FROM`; down() is a no-op (substrings aren't recoverable). Verified end-to-end against a seeded local hypertable: correct canonical names, non-AI bots and humans untouched, re-run affects 0 rows.
The backfill must NOT run as a Sea-ORM migration: Migrator::up executes inside establish_connection, which runs BEFORE the Pingora proxy binds its listeners. A full-table UPDATE there would block proxy startup -- the exact latency the fast-LB-bind work removes. Remove the migration (02cdf12) and ship the verified, decompression-safe UPDATE as scripts/backfill-ai-agent-bot-names.sql instead, run manually against the DB via psql. Includes an optional read-only preview and a post-run verification query. The going-forward code fix (3929bbf, ProxyLogBatchWriter runs ai_agent_detector) is unaffected and remains the long-term fix; this script only reclassifies pre-existing rows when an operator chooses to run it.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
In prod the Pingora proxy (the load balancer on 80/443) took 2.5–10s to start accepting connections. Startup ran a serial chain of blocking
block_oncalls beforeserver.run()bound the listeners, dominated by:reconcile_all()awaited at the end of every loadCALL refresh_continuous_aggregatebackfillChange: bind first, load asynchronously
The load balancer now binds 80/443 immediately and loads routes in the background.
RouteTableListener::start_listeningsubscribes to PGNOTIFYfirst (closing the missed-NOTIFY window), then spawns the initial load instead of awaiting it — the proxy no longer waits for routes to bind.load_routes()insetup_proxy_server. It only existed because the on-demand sleeping-domain callback was registered too late; it's now registered inservebefore the first load, so the first load populates sleeping domains / on-demand configs.load_routesbumps the generation + notifies waiters before the DNS reconcile, and the reconcile is now fire-and-forget (never gates a load or the readiness wait).run_post_migration_backfillis detached fromestablish_connectionand spawned on the long-lived runtime (idempotent; the refresh policy catches up).resolve_peerreadiness guard: when the route table has never loaded, a request briefly waits for the first load (wait_until_loaded, 5s) and re-looks-up before falling back to the console — covers the cold-start window the async load introduces. After the first load, an unmatched host falls through immediately, exactly as before.The admin gate stays pre-bind and fail-closed (a ~1ms single-row read that only governs the management console, not app routing).
Verification
Local run, timestamped:
HTTP 200served through the proxy; backfill (events_hourly … complete) and preview-gateway reconcile (✅ reconciled) both complete off the bind path.cargo check --bin tempsclean; 346 tests pass including 4 new readiness tests forhas_loaded/wait_until_loaded.Not in scope (follow-ups)
load_routesis unchanged — the background load is still slow at prod scale, it just no longer blocks the bind. Batching those queries is the highest-value follow-up.Note for reviewers
This branch carries the in-process
RouteReloadSubscriber(ForceRouteReloadover the shared queue) it was built on top of — that work was already in the working tree, and its relocation insideserve/mod.rsis interleaved with this restructure and not cleanly separable. Unrelated frontend CLI (.ts) changes were deliberately left out.Also in this PR:
feat(cli)TEMPS_CONTEXT overrideBundled per request (unrelated to the proxy change). Adds a
TEMPS_CONTEXTenv var that pins the active CLI context for a shell/CI session without mutating.contexts.json, surfaces the override incontext ls/context use/whoami, and fixesloginWithApiKeyto validate against the server passed via positional/--url(previously it could validate against the wrong server and wipe credentials). Includescontexts.test.tsunit tests for the resolver.