
feat(apigateway): incremental embed progress and configurable worker concurrency (#430) #437

Merged
cjimti merged 1 commit into main from feat/apigateway-embed-progress-and-concurrency
May 19, 2026

Conversation

@cjimti cjimti commented May 19, 2026

Summary

Closes #430.

A 164-operation spec on a CPU-only Ollama deployment took ~10 minutes to index while the UI sat at indexing 0/164 the entire run and snapped to 164/164 at the end. From an operator's perspective the run looked frozen. That workflow surfaced two independent gaps, both fixed in this PR. The third part of #430 (the Ollama batch endpoint) was closed earlier by #435.

Approach

Progress visibility

Migration 000046 adds embedded_so_far INTEGER NOT NULL DEFAULT 0 to api_catalog_embedding_jobs. The worker publishes the counter at every chunk boundary via a new Store.UpdateProgress call. The catalog status endpoint reads the column. The portal's EmbeddingStatusBadge renders indexing N/M from embedded_so_far while JobStatus == running, distinct from pending (queued, 0) and succeeded (the green N/N indexed state).

The atomic all-or-nothing semantic of the spec's vector upsert is preserved: the existing DELETE+INSERT transaction is unchanged. embedded_so_far is a separate column on the job row, populated by best-effort writes. A DB error on the progress write is logged at debug level but does not abort the embed pass: the final Complete is the authoritative success signal.
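The best-effort write described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `progressStore` interface, the `UpdateProgress` parameter list, `bestEffortProgress`, and `failingStore` are all assumptions made for the example.

```go
package main

import (
	"context"
	"errors"
	"log/slog"
)

// progressStore is a hypothetical narrow view of the job store; the
// parameter list of UpdateProgress is assumed, not taken from the PR.
type progressStore interface {
	UpdateProgress(ctx context.Context, jobID, workerID string, embedded int) error
}

// bestEffortProgress returns a callback that publishes the counter and,
// on failure, logs at debug level instead of returning the error, so the
// embed pass keeps running.
func bestEffortProgress(ctx context.Context, s progressStore, log *slog.Logger, jobID, workerID string) func(int) {
	return func(embedded int) {
		if err := s.UpdateProgress(ctx, jobID, workerID, embedded); err != nil {
			log.Debug("embed progress write failed", "job_id", jobID, "error", err)
		}
	}
}

// failingStore always errors; it stands in for a database outage.
type failingStore struct{ calls int }

func (f *failingStore) UpdateProgress(context.Context, string, string, int) error {
	f.calls++
	return errors.New("db unavailable")
}

// demoBestEffort publishes two chunk boundaries against the failing store
// and reports how many writes were attempted; neither failure propagates.
func demoBestEffort() int {
	s := &failingStore{}
	publish := bestEffortProgress(context.Background(), s, slog.Default(), "job-1", "worker-a")
	publish(4)
	publish(8)
	return s.calls
}
```

Because the callback has no error return, the embed loop has no way to abort on a failed progress write, which matches the intent that only the final Complete decides success.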

ComputeOperationEmbeddings gains an optional progress func(int) callback that fires once with the reused-row count up front (so a fully-cached refresh ticks straight to operation_count without waiting for a no-op embed pass) and again after every embedInBatches chunk. The embed-jobs Computer adapter wires the callback to Store.UpdateProgress keyed by (id, worker_id, status='running') so a stale worker whose lease was rotated cannot clobber the new lease-holder's count.
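The callback contract above (one publish with the reused-row count, then one per chunk) might look roughly like this. The function `embedWithProgress` and its parameters are hypothetical stand-ins for ComputeOperationEmbeddings and embedInBatches, and the sketch assumes the callback receives a cumulative count.

```go
package main

// embedWithProgress is a hypothetical stand-in for the embed pass. It
// publishes the reused-row count first, then the cumulative count after
// each chunk. A nil progress callback is allowed and simply skipped.
func embedWithProgress(total, reused, chunkSize int, progress func(int)) {
	if chunkSize <= 0 {
		chunkSize = 1 // guard against a non-terminating loop
	}
	done := reused
	if progress != nil {
		progress(done) // up-front publish of rows that need no embedding
	}
	for done < total {
		n := chunkSize
		if remaining := total - done; remaining < n {
			n = remaining
		}
		// A real implementation would call the embedder for n operations here.
		done += n
		if progress != nil {
			progress(done)
		}
	}
}

// collectProgress records every value the callback receives.
func collectProgress(total, reused, chunkSize int) []int {
	var seen []int
	embedWithProgress(total, reused, chunkSize, func(n int) { seen = append(seen, n) })
	return seen
}
```

With 12 operations, 4 reused rows, and a chunk size of 4, `collectProgress(12, 4, 4)` records 4, 8, 12. A fully cached refresh such as `collectProgress(5, 5, 2)` records a single value equal to the operation count, without running any chunk.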

The counter resets to 0 only on Claim. Terminal rows (succeeded / failed) and pending rows recovered from a lease expiry may still carry a prior attempt's value; callers gate display on Status == running so the stale value never reaches the UI. The doc comments and the JSON response struct both call this out explicitly.
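The counter lifecycle and the lease guard can be modeled in memory as below. This is only a sketch under stated assumptions: `jobRow`, `claim`, `updateProgress`, and `displayProgress` are hypothetical names, and the real guard lives in the UpdateProgress SQL rather than in Go code.

```go
package main

// jobRow is a hypothetical in-memory model of one row in
// api_catalog_embedding_jobs, reduced to the fields discussed here.
type jobRow struct {
	Status        string // "pending", "running", "succeeded", or "failed"
	WorkerID      string
	EmbeddedSoFar int
}

// claim hands the job to workerID. It is the only place the counter resets.
func (j *jobRow) claim(workerID string) {
	j.Status = "running"
	j.WorkerID = workerID
	j.EmbeddedSoFar = 0
}

// updateProgress mirrors a write keyed by (id, worker_id, status='running'):
// it applies only while the caller still holds the lease on a running row,
// and reports whether the write took effect.
func (j *jobRow) updateProgress(workerID string, embedded int) bool {
	if j.Status != "running" || j.WorkerID != workerID {
		return false
	}
	j.EmbeddedSoFar = embedded
	return true
}

// displayProgress returns the counter and whether it should be shown.
// Non-running rows may carry a stale value, so they are never displayed.
func displayProgress(j jobRow) (int, bool) {
	if j.Status != "running" {
		return 0, false
	}
	return j.EmbeddedSoFar, true
}
```

In this model a worker whose lease was rotated gets `false` back from `updateProgress` and leaves the new lease-holder's count untouched, while a pending row that still carries a prior attempt's value is hidden by `displayProgress`.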

Worker concurrency

New apigateway.embed_jobs.workers config (default 1, preserves prior single-goroutine behavior). Worker.Start spawns N goroutines that share one wakeup channel and one stopCh. The existing FOR UPDATE SKIP LOCKED LIMIT 1 clause in Claim plus the 10-minute lease guarantee keep two goroutines (in the same pod or across pods) from picking the same job.

After each successful Claim the worker also calls Notify once so a sibling goroutine can drain the next pending job in parallel; the buffered wakeup channel coalesces redundant Notifies. The Notify-after-Claim is placed after the error check so a DB outage that storms Claim errors does not also storm sibling-wakeup attempts.
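A minimal sketch of the shared-wakeup pattern described above, assuming a buffer size of 1 for the wakeup channel and an in-memory queue in place of the database. The `pool` type and all of its methods are hypothetical; in the PR the mutual exclusion comes from FOR UPDATE SKIP LOCKED and the lease, not from a mutex.

```go
package main

import "sync"

// pool is a hypothetical miniature of the worker: N goroutines share one
// buffered wakeup channel and one stop channel.
type pool struct {
	wakeup chan struct{}
	stopCh chan struct{}
	wg     sync.WaitGroup // tracks worker goroutines
	jobs   sync.WaitGroup // tracks unfinished jobs (demo only)

	mu      sync.Mutex
	pending []string
	done    []string
}

func newPool() *pool {
	return &pool{wakeup: make(chan struct{}, 1), stopCh: make(chan struct{})}
}

// notify never blocks: if a wake-up is already queued, the extra one is
// dropped, which is how redundant notifies coalesce.
func (p *pool) notify() {
	select {
	case p.wakeup <- struct{}{}:
	default:
	}
}

// claim pops one pending job under the mutex. The mutex stands in for the
// database-level guarantee that no two workers claim the same job.
func (p *pool) claim() (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.pending) == 0 {
		return "", false
	}
	job := p.pending[0]
	p.pending = p.pending[1:]
	return job, true
}

// drain claims jobs until none are left, waking a sibling after each
// successful claim so the next job can run in parallel.
func (p *pool) drain() {
	for {
		job, ok := p.claim()
		if !ok {
			return
		}
		p.notify()
		p.mu.Lock()
		p.done = append(p.done, job)
		p.mu.Unlock()
		p.jobs.Done()
	}
}

// start launches n goroutines that wait on the shared channels.
func (p *pool) start(n int) {
	for i := 0; i < n; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for {
				select {
				case <-p.stopCh:
					return
				case <-p.wakeup:
					p.drain()
				}
			}
		}()
	}
}

// runPool processes all jobs with the given worker count and returns how
// many completed.
func runPool(workers int, jobs []string) int {
	p := newPool()
	p.pending = append(p.pending, jobs...)
	p.jobs.Add(len(jobs))
	p.start(workers)
	p.notify()
	p.jobs.Wait()
	close(p.stopCh)
	p.wg.Wait()
	return len(p.done)
}
```

The notify call sits after the successful claim, so a failed claim (an empty queue here, a DB error in the PR) never triggers a sibling wake-up.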

Worker gains a Concurrency() int accessor so the wiring test can prove the value flowed from config through WorkerConfig without exporting the cfg field itself.

Wire-format change

GET /api/v1/admin/api-catalogs/{id}/embedding-status response gains one new field:

{
  "spec_name": "users",
  "operation_count": 47,
  "embedding_count": 0,
  "embedded_so_far": 12,
  "job_status": "running",
  "job_attempts": 1,
  "job_last_error": "",
  "job_updated_at": "2026-05-19T11:42:03Z"
}

embedded_so_far is omitempty so existing consumers that ignore unknown fields see no change in shape when the counter is zero.
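The effect of omitempty on the response shape can be checked with a trimmed struct. The type `embeddingStatus` and its Go field names are assumptions for illustration; only the JSON keys come from the example above.

```go
package main

import "encoding/json"

// embeddingStatus is a hypothetical, trimmed version of the response
// struct. Only embedded_so_far carries omitempty.
type embeddingStatus struct {
	SpecName       string `json:"spec_name"`
	OperationCount int    `json:"operation_count"`
	EmbeddingCount int    `json:"embedding_count"`
	EmbeddedSoFar  int    `json:"embedded_so_far,omitempty"`
	JobStatus      string `json:"job_status"`
}

// encodeStatus marshals the status to a JSON string, returning an empty
// string if marshaling fails.
func encodeStatus(s embeddingStatus) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}
```

A running job with `EmbeddedSoFar: 12` serializes with the `embedded_so_far` key, while a row whose counter is zero omits the key entirely, so the payload is identical to the pre-change shape.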

Tests

Unit

  • TestUpdateProgress_HappyPath / _LeaseRotatedIsNoop / _DBError against PostgresStore via sqlmock.
  • TestWorker_PublishesChunkProgress proves the worker hooks the chunk callback into Store.UpdateProgress keyed by (id, worker_id).
  • TestWorker_ProgressWriteFailureIsLogged proves a DB error on the progress write does not abort the job.
  • TestWorker_ConcurrencyProcessesJobsInParallel proves 4 jobs at 50ms each run in well under the 200ms serial baseline with Concurrency=4 (typical wall time ~60ms).
  • TestComputeOperationEmbeddings_ProgressCallback asserts the initial reused-publish and the chunk-done publishes happen and that the cumulative count reaches operation_count.
  • TestWireAPIGatewayEmbedJobsFromDB_WiresWorkerWithConfiguredConcurrency asserts apigateway.embed_jobs.workers: 3 produces a Worker reporting Concurrency() == 3.

Integration (build tag integration)

pkg/platform/integration_embedjobs_progress_test.go starts pgvector/pgvector:pg16, enqueues a job against a slow stub Computer that publishes 4/8/12 across three 100ms chunks, polls SpecStatuses while running, asserts embedded_so_far is strictly increasing across observations and terminal status is succeeded. Runtime ~5s.

Verification

  • make verify clean: tools-check, gofmt, race tests, total + patch coverage above gate (patch coverage 100%), golangci-lint (patch-scoped against origin/main), gosec, govulncheck, semgrep, codeql, doc-check, dead-code, mutation testing, goreleaser dry-run.
  • make test-integration clean against pgvector.
  • Adversarial sub-agent review: 3 rounds. Round 1 surfaced a doc-vs-code mismatch on the EmbeddedSoFar lifecycle (only Claim resets it, comments claimed Retry did too) plus em dashes; both fixed. Round 2 surfaced a dead slowComputer.calls field reintroducing a round-1 pattern, plus a wiring test whose name promised more than it asserted; both fixed (drop field, add Worker.Concurrency() + assertion). Round 3 returned CLEAN.

Drive-by fix

test/e2e/helpers/admin.go was failing the integration build because AuditConfig.Enabled and KnowledgeConfig.Enabled had drifted to *bool. The helper now wraps with a local boolPtr.

Test plan

  • make verify passes locally.
  • make test-integration passes locally against pgvector.
  • Adversarial review verdict CLEAN.
  • CI green.
  • Manual: deploy with apigateway.embed_jobs.workers: 2, save two API specs back to back, confirm both worker goroutines pick a different spec (lease + SKIP LOCKED in action), confirm both spec badges tick indexing N/M upward before flipping to green.
  • Manual: with a slow embedder (CPU-only Ollama on a multi-op spec), confirm the catalog editor badge ticks past 0/N before completion instead of staying at 0 until the final commit.

…concurrency (#430)

A 164-operation spec on a CPU-only Ollama deployment took ~10 minutes
to index while the UI sat at 0/164 the entire run and snapped to
164/164 at the end. That workflow surfaced two independent gaps, both
fixed here. The third part (Ollama batch endpoint) was closed by
#435.

## Progress visibility

Migration 000046 adds embedded_so_far INT NOT NULL DEFAULT 0 to
api_catalog_embedding_jobs. The worker publishes the counter at
every chunk boundary via a new Store.UpdateProgress call so the
catalog status endpoint can render "running, N/M" while the spec's
DELETE+INSERT upsert is still pending. The atomic all-or-nothing
write of the embedding vectors is preserved; the counter is a
separate column read only while JobStatus == running.

ComputeOperationEmbeddings gains an optional progress callback that
fires once with the reused-row count up front, then again after
every embedInBatches chunk. The embed-jobs Computer adapter wires
the callback to Store.UpdateProgress. UpdateProgress writes are
best-effort: a DB error is logged at debug level but does not abort
the embed pass.

The counter resets to 0 only on Claim. Terminal rows (succeeded /
failed) and pending rows recovered from a lease expiry may still
carry a prior attempt's value; callers gate display on
Status == running so the stale value never reaches the UI.

## Worker concurrency

New apigateway.embed_jobs.workers config (default 1, preserves prior
behavior). Worker.Start spawns N goroutines that share the queue;
the existing FOR UPDATE SKIP LOCKED + lease guarantee in Claim keep
two goroutines (in the same pod or across pods) from picking the
same job. After each successful Claim the worker also calls Notify
once so a sibling can drain the next pending job in parallel; the
buffered wakeup channel coalesces redundant Notifies. Worker gains
a Concurrency() accessor so the wiring test can assert the value
flowed from config through WorkerConfig.

## Tests

- Unit: TestUpdateProgress_{HappyPath,LeaseRotatedIsNoop,DBError}
  (sqlmock against PostgresStore).
- Unit: TestWorker_PublishesChunkProgress proves the worker hooks
  the chunk callback into Store.UpdateProgress keyed by (id, worker).
- Unit: TestWorker_ProgressWriteFailureIsLogged proves a DB error
  on the progress write does not abort the job (final Complete is
  the authoritative success signal).
- Unit: TestWorker_ConcurrencyProcessesJobsInParallel proves 4 jobs
  at 50ms each run in well under the 200ms serial baseline with
  Concurrency=4.
- Unit: TestComputeOperationEmbeddings_ProgressCallback asserts the
  initial reused-publish and the chunk-done publishes happen.
- Wiring: TestWireAPIGatewayEmbedJobsFromDB_WiresWorkerWithConfiguredConcurrency
  asserts apigateway.embed_jobs.workers=3 produces a Worker reporting
  Concurrency()==3.
- Integration (build tag integration): starts pgvector/pgvector:pg16,
  enqueues a job against a slow stub Computer that publishes 4/8/12
  across three 100ms chunks, polls SpecStatuses while running, asserts
  embedded_so_far is strictly increasing across observations and
  terminal status is succeeded.

Pre-existing test/e2e/helpers/admin.go also fixed (AuditConfig.Enabled
and KnowledgeConfig.Enabled drifted to *bool; helper added a local
boolPtr).

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 96.42857% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.18%. Comparing base (2d5bd40) to head (952017d).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...kg/toolkits/apigateway/embedjobs/store_postgres.go | 95.65% | 0 Missing and 1 partial ⚠️ |
| pkg/toolkits/apigateway/embedjobs/worker.go | 91.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #437      +/-   ##
==========================================
+ Coverage   86.11%   86.18%   +0.06%     
==========================================
  Files         235      235              
  Lines       32208    32241      +33     
==========================================
+ Hits        27737    27787      +50     
+ Misses       3260     3239      -21     
- Partials     1211     1215       +4     


@cjimti cjimti merged commit ffed470 into main May 19, 2026
9 checks passed
@cjimti cjimti deleted the feat/apigateway-embed-progress-and-concurrency branch May 19, 2026 19:58


Development

Successfully merging this pull request may close these issues.

apigateway embed worker: progress invisible during runs; ollama EmbedBatch is sequential, worker is single-goroutine

1 participant