fix(embedding): scope long batch timeout to embed-jobs worker (#445) by cjimti · Pull Request #446 · txn2/mcp-data-platform

cjimti · 2026-05-19T23:02:59Z

Summary

Closes #445.

v1.64.0 switched EmbedBatch to Ollama's batched /api/embed endpoint (#435) without raising the HTTP timeout that governs it. On CPU-only Ollama a 30-text batched POST routinely exceeds the 30s DefaultTimeout, so the api-gateway embed-jobs worker timeout-storms on every spec write, retries five times, fails terminally, gets re-enqueued by the reconciler, and loops. From the operator's perspective the spec badge alternates between indexing and failed and looks "stuck on queued".

Approach: scoped, not global

A first pass raised embedding.DefaultTimeout from 30s to 5m. The adversarial review caught that this also affects synchronous request-path callers (memory_recall, memory_manage, capture_insight, apigateway queryVectorFor), which would now hang an MCP tools/call for 5 minutes on a wedged Ollama instead of failing in 30s. That is a real regression for interactive tool calls.

The shipped fix scopes the longer budget to the async worker only:

embedding.DefaultTimeout stays at 30s. The constant's comment explicitly documents that the singular /api/embeddings path is what it's tuned for, and points at the worker's dedicated provider for the batched path.
New apigateway.embed_jobs.embed_timeout config (duration, default 5m). The worker constructs its own Ollama provider with this timeout via a new Platform.workerEmbedder() helper. The shared p.embeddingProv that request-path callers use is unchanged.
Provider kinds other than ollama (noop today, future kinds) reuse the shared instance. Today only Ollama has the long-tail batch latency profile that needs this.

New config

apigateway:
  embed_jobs:
    workers: 1
    embed_timeout: 5m

Field	Default	Affects
`embed_jobs.embed_timeout`	`5m`	The api-gateway embed-jobs worker only. Request-path embedding calls (`memory_recall`, `memory_manage`, `capture_insight`, apigateway query-vector) continue to use `memory.embedding.ollama.timeout` (default 30s via `embedding.DefaultTimeout`).

Tests

Unit (`pkg/platform/worker_embedder_test.go`)

TestWorkerEmbedder_OllamaUsesEmbedTimeout proves the worker gets a fresh embedding.Provider with the configured timeout AND the shared instance keeps its 30s default. This is the regression-prevention assertion: a future change that accidentally re-globalizes the long timeout fails this test loudly.
TestWorkerEmbedder_OllamaDefaultsTo5Min covers the zero-value fallback.
TestWorkerEmbedder_NoopReturnsShared covers the non-Ollama path.

The tests reach the unexported http.Client.Timeout via reflection. A future rename of that field produces a loud failed assertion (durations are compared against specific values), not a silent skip.

Integration against a real Ollama (`pkg/platform/integration_embedjobs_realollama_test.go`, build tag `integration`)

The reason #445 shipped undetected is that every prior integration test used synthetic delays in stub Computers. None exercised actual Ollama batch latency.

This PR adds the missing coverage: spins up real Ollama in testcontainers, pulls nomic-embed-text, seeds a 30-operation spec, enqueues a job, and asserts the embedding pass completes inside the worker's timeout budget. Pre-fix behavior (30s global default) would fail this test on the very batch shape that caused the incident; post-fix it passes with margin.

The shared internal/testollama helper uses sync.Once to start the Ollama container once per test process. Provider(timeout) returns a fresh provider with the supplied timeout so each call site names the production code path it's exercising (30s for request-path, 5m for worker). The pattern is reusable by every upcoming embedding-related consumer (DataHub semantic search, prompt discovery, knowledge insights recall, portal asset search).

Verification

make verify clean: tools-check, gofmt, race tests, total + patch coverage above gate, lint (patch-scoped against origin/main), gosec, govulncheck, semgrep, codeql, doc-check, dead-code, mutation testing, goreleaser dry-run.
make test-integration clean against pgvector + real Ollama. The real-Ollama test takes ~3.5min (model pull dominates; the 30-op batched embed itself completes in ~2 seconds once warm).
Adversarial sub-agent review: 2 rounds. Round 1 surfaced (a) the global timeout regression for request-path callers and (b) test poll deadline equaling the HTTP timeout with no margin. Both fixed. Round 2 returned CLEAN.

Operator workaround on v1.64.0 (no upgrade required)

Add timeout: 5m under memory.embedding.ollama in the platform configmap and roll the deployment. Once v1.64.1 is rolled, the override under memory.embedding.ollama.timeout can be removed because the worker no longer shares that path; operators who want to keep the request-path timeout at 60s and the worker timeout at 5m can now set them independently.

Test plan

make verify passes locally.
make test-integration passes locally against pgvector + real Ollama.
Adversarial review verdict CLEAN.
CI green.
Manual: deploy to a cluster running CPU-only Ollama with a multi-op spec (30+ operations), confirm the spec indexes without timeout errors and the badge ticks indexing N/M upward to completion.
Manual: with memory.embedding.ollama.timeout unset, confirm a wedged Ollama still fails memory_recall in ~30s (the request-path budget) while the embed-jobs worker continues with its 5m budget.

v1.64.0 shipped the Ollama batch endpoint (#435) without raising the HTTP timeout that governs the path. On CPU-only Ollama a 30-text batch exceeds the 30s default `DefaultTimeout`, so the api-gateway embed-jobs worker timeout-storms on every spec write, retries five times, fails terminally, gets re-enqueued by the reconciler, and loops. From the operator's perspective the spec badge alternates between "indexing" and looks "stuck on queued". ## Fix The fix is scoped to the async worker. The shared `embedding.Provider` that request-path callers (`memory_recall`, `memory_manage`, `capture_insight`, apigateway `queryVectorFor`) use keeps its 30s default so a wedged Ollama fails an MCP `tools/call` in seconds, not minutes. Only the embed-jobs worker gets the longer budget. - `DefaultTimeout` stays at 30s. The constant's comment now explicitly documents that the singular `/api/embeddings` path is what it's tuned for, and points at the worker's dedicated provider for the batched path. - New `apigateway.embed_jobs.embed_timeout` config (default 5m). The worker constructs its own Ollama provider with this timeout via `Platform.workerEmbedder()`. Provider != ollama (noop, future kinds) reuses the shared instance unchanged. ## Tests Three new unit tests in `worker_embedder_test.go` cover the scoping: - `TestWorkerEmbedder_OllamaUsesEmbedTimeout` proves the worker gets a fresh Provider with the configured timeout AND the shared instance keeps its 30s default (the regression-prevention assertion). - `TestWorkerEmbedder_OllamaDefaultsTo5Min` covers the zero-value fallback. - `TestWorkerEmbedder_NoopReturnsShared` covers the non-Ollama path. The tests reach the unexported `http.Client.Timeout` via reflection; a future rename of that field produces a loud failed assertion (the duration compares against specific values), not a silent skip. ## Integration test against real Ollama New `pkg/platform/integration_embedjobs_realollama_test.go` (build tag `integration`) spins up real Ollama in testcontainers, pulls `nomic-embed-text`, enqueues a 30-operation spec, and asserts the embedding pass completes within the worker's timeout budget. This is the regression test that would have caught #445 before release; prior integration tests used synthetic delays in stub Computers and never exercised actual Ollama batch latency. The shared `internal/testollama` helper uses `sync.Once` to start the Ollama container once per test process. `Provider(timeout)` returns a fresh embedding.Provider with the supplied timeout so call sites name the production path they're exercising (30s for request-path, 5m for worker). testcontainers' ryuk reaper handles teardown at process exit. ## Verification - `make verify` clean (tools-check, fmt, race tests, lint patch-scoped against `origin/main`, gosec, govulncheck, semgrep, codeql, doc-check, dead-code, mutation testing, goreleaser dry-run). - `make test-integration` clean against pgvector + real Ollama. Real- Ollama test runtime: ~3.5min (model pull dominates; the 30-op batched embed itself completes in ~2 seconds once warm). - Adversarial sub-agent review: 2 rounds. Round 1 surfaced (a) the global timeout change would affect synchronous request-path callers too and (b) the test poll deadline equaled the HTTP timeout with no margin. Both fixed. Round 2 CLEAN. ## Operator workaround on v1.64.0 (no upgrade required) Add `timeout: 5m` under `memory.embedding.ollama` in the platform configmap and roll the deployment.

codecov · 2026-05-19T23:06:33Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.19%. Comparing base (ffed470) to head (2fdc237).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #446   +/-   ##
=======================================
  Coverage   86.18%   86.19%           
=======================================
  Files         235      235           
  Lines       32241    32252   +11     
=======================================
+ Hits        27787    27798   +11     
  Misses       3239     3239           
  Partials     1215     1215

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

cjimti merged commit c91e711 into main May 19, 2026
9 checks passed

cjimti deleted the fix/embedding-default-timeout branch May 19, 2026 23:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(embedding): scope long batch timeout to embed-jobs worker (#445)#446

fix(embedding): scope long batch timeout to embed-jobs worker (#445)#446
cjimti merged 1 commit into
mainfrom
fix/embedding-default-timeout

cjimti commented May 19, 2026

Uh oh!

codecov Bot commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

cjimti commented May 19, 2026

Summary

Approach: scoped, not global

New config

Tests

Unit (pkg/platform/worker_embedder_test.go)

Integration against a real Ollama (pkg/platform/integration_embedjobs_realollama_test.go, build tag integration)

Verification

Operator workaround on v1.64.0 (no upgrade required)

Test plan

Uh oh!

codecov Bot commented May 19, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Unit (`pkg/platform/worker_embedder_test.go`)

Integration against a real Ollama (`pkg/platform/integration_embedjobs_realollama_test.go`, build tag `integration`)