Skip to content

feat(knowledge-base): validate vector store bindings on create, copy, and delete#1372

Merged
lyingbug merged 1 commit into
Tencent:mainfrom
ochanism:enhancement/1206-3
May 18, 2026
Merged

feat(knowledge-base): validate vector store bindings on create, copy, and delete#1372
lyingbug merged 1 commit into
Tencent:mainfrom
ochanism:enhancement/1206-3

Conversation

@ochanism
Copy link
Copy Markdown
Contributor

@ochanism ochanism commented May 18, 2026

Context

Part of #993 — Phase 2 (Per-KB VectorStore Binding) of the multi-store roadmap.

Phase 2 PR sequence (this is PR 3 of 5):

# PR Status
1 #994vector_store_id column + migration MERGED
2 #1310 — KB-scoped retrieve-engine factory + 25-site refactor MERGED
3 this PR — KB create/retrieve API + defensive logic here
4 multi-store fan-out search planned
5 KB create/detail UI planned

Depends on #994 and #1310.

Changes at a glance

  • POST /knowledge-bases accepts an optional vector_store_id and validates it against the caller's tenant scope + the engine registry before persisting.
  • Two new typed error codes distinguish the validation outcomes:
    • 2200 ErrVectorStoreBindingInvalid — UUID is malformed, does not exist, or belongs to another tenant (single message; intentionally no enumeration oracle).
    • 2201 ErrVectorStoreUnavailable — store exists in DB but is not registered in the engine registry; check connection_config.
  • KB responses (POST / GET /:id / PUT / PUT /:id/pin) gain four read-only fields describing the bound store: vector_store_name, vector_store_source, vector_store_engine_type, vector_store_status. ListKnowledgeBases deliberately does not include these to avoid N+1 queries.
  • Cross-tenant shared-KB responses receive a suppressed payload: vector_store_id is stripped and metadata fields are set to source="shared". The owner tenant's store inventory cannot be correlated or enumerated.
  • POST /knowledge-bases/copy rejects mismatched clones synchronously, before enqueueing the async clone task:
    • 400 if source.embedding_model_id != target.embedding_model_id
    • 400 if source.vector_store_id != target.vector_store_id
  • DELETE /vector-stores/:id refuses to remove a store that still has bound knowledge bases. The check runs inside a transaction that row-locks the store on PostgreSQL (SELECT … FOR UPDATE) and serializes via WAL on SQLite. A panic in the in-memory registry unregister is recovered into a structured warning rather than silently leaking a stale engine.
  • vector_store_id is immutable post-creation. The GORM <-:create tag (from feat: add vector_store_id column to knowledge_bases for per-KB vector store binding #994) blocks every ORM update path; the service-layer UpdateKnowledgeBaseRequest DTO omits the field entirely; a reflection-based test catches any future regression in either layer.
  • Empty-string vector_store_id is normalized to nil at the create boundary and inside SharesStoreWith, so rows persisted by callers that did not run Normalize() first (raw-SQL writes, external migrations, ops scripts) still compare correctly.

File-by-file summary

Production

File Change
internal/types/knowledgebase.go HasVectorStore(), Normalize(), SharesStoreWith() helpers (all nil-safe)
internal/types/vectorstore.go StoreDisplay DTO + StoreSource* constants (env / user / shared / unavailable) + display-payload constructors
internal/types/interfaces/knowledgebase.go CountByVectorStoreID(ctx, *gorm.DB, tenantID, storeID) interface — accepts a handle so tx and non-tx callers share one implementation
internal/types/interfaces/vectorstore.go ResolveStoreView + BatchResolveStoreView on VectorStoreService
internal/application/repository/knowledgebase.go CountByVectorStoreID implementation; uses GORM's auto soft-delete scope (no explicit deleted_at IS NULL predicate)
internal/application/service/retriever/factory.go VerifyBinding(ctx, registry, ownership, tenantID, storeID) — exposes the existing ownership + registry sentinel hierarchy so the KB create path can reuse the same source of truth
internal/application/service/knowledgebase.go validateVectorStoreBinding (UUID parse → VerifyBinding → typed AppError translation); embedding-model + store defenses inside CopyKnowledgeBase; new-target clone now inherits VectorStoreID; Update godoc documents the immutability contract
internal/application/service/vectorstore.go DI extended with kbRepo + *gorm.DB + cached envStores; DeleteStore transactional binding guard with PG row-lock; unregisterSafely panic-recovery; ResolveStoreView + BatchResolveStoreView implementations
internal/container/container.go Startup assertion that Dialector.Name() is "postgres" or "sqlite" — guards against silent SQLite fallback under a future driver swap
internal/errors/errors.go New error codes 2200 / 2201 + matching constructors
internal/logger/logger.go WarnWithFields(ctx, fields, msg) helper + Fields alias
internal/handler/knowledgebase.go buildKBResponse map-merge helper (avoids KnowledgeBase.MarshalJSON swallowing the new fields); resolveKBStoreView (env / shared / DB / unavailable branches); response builder applied to Create / Get / Update / TogglePin; Copy pre-flight (same checks as the service-layer worker entry); typed-AppError unwrap on Create so 2200 / 2201 reach the client

API docs

File Change
docs/api/knowledge-base.md vector_store_id request field; new response fields + meanings table; POST /copy synchronous pre-flight section; new error code rows; explicit note that GET /knowledge-bases (list) does NOT include resolved store metadata
docs/api/vector-store.md DELETE /:id binding-guard section with example 400 body; updated error-code table; env-store section addendum

Tests

File Change
internal/types/knowledgebase_test.go HasVectorStore, Normalize, SharesStoreWith matrices including the empty-string-vs-nil normalization
internal/application/repository/knowledgebase_sqlite_test.go TestCountByVectorStoreID against in-memory sqlite — tenant scope, soft-delete auto-scope, empty-string row exclusion, shared-tx idempotency
internal/application/service/retriever/factory_test.go TestVerifyBinding — ownership infra error, not owned, registry miss, success, cross-tenant
internal/application/service/knowledgebase_pr3_test.go (new) Create validation matrix (UUID parse, normalization, sentinel wrap, error code distinction) + CopyKB defenses including round-trip assertion that the new KB inherits VectorStoreID
internal/application/service/vectorstore_test.go DeleteStore guard against real sqlite tx (success, bound count, soft-deleted ignored, cross-tenant ignored, rollback, panic recovery) + ResolveStoreView / BatchResolveStoreView matrices
internal/handler/knowledgebase_request_test.go (new) Reflection assertion that neither UpdateKnowledgeBaseRequest nor KnowledgeBaseConfig exposes VectorStoreID (structural enforcement of the service-layer immutability tier)
internal/handler/knowledgebase_copy_preflight_test.go (new) Synchronous 400 for mismatched embedding model / vector store; user-facing message must not contain internal roadmap labels
internal/handler/knowledgebase_pr3_response_test.go (new) Typed error code 2200 / 2201 preserved at the HTTP boundary; raw infra errors still fall through to 500; shared-KB response strips vector_store_id while owner response retains it

Backward compatibility

A pre-existing client that does not send vector_store_id sees zero behavior change.

Surface Compatibility analysis
KB create without vector_store_id JSON unmarshal → kb.VectorStoreID = nilNormalize() is a no-op → HasVectorStore() is false → validation skipped → DB row stores NULL → retrieve-engine factory falls back to the tenant's effective engines (env store), same as before this PR
KB create with vector_store_id = "" Normalized to nil and treated identically to "not sent" (matches the factory's nil/empty pre-condition from #1310)
KB update DTO does not accept vector_store_id; a client that round-trips a full KB object (GET → modify → PUT) is silently ignored on this field — round-trip pattern is not broken
KB response shape Four new string fields are added; clients that ignore unknown JSON fields (the REST norm and the WeKnora frontend's pattern) are unaffected. Strict-schema clients should treat these as additive
POST /knowledge-bases/copy Same embedding model + same store → 100% identical behavior. Different embedding model used to be a silent corruption path (mixed vector spaces); now a 400. Different vector store used to land on a shared physical index implicitly; now an explicit 400 with a message pointing at the not-yet-supported migration capability
DELETE /vector-stores/:id Guard is a no-op when there are no bound KBs. At merge time of this PR every KB has vector_store_id IS NULL (no UI yet binds a store), so the guard is effectively dormant until clients start opting in
KBDeletePayload / IndexDeletePayload in Asynq VectorStoreID *string was added with omitempty in #1310; tasks enqueued before that upgrade decode it as nil and transparently fall back to the pre-serialized effective engines — same here
Migrations None. The column and the composite index were added by #994

Behavior change (intentional)

The current CopyKnowledgeBase silently allows cloning between KBs with different embedding models, producing vectors that are incomparable across the resulting KBs. This PR turns that case into a 400 with a clear message. Cross-store clones are 400'd for the same reason — cross-store data migration is a separate piece of work tracked in the parent issue.

Known limitations

  • Create-Delete race — a KB can be created against a store that is simultaneously being deleted. The factory rejects subsequent searches with ErrVectorStoreForbidden / NotFound, and the response view surfaces vector_store_status: "unavailable" so the UI can guide recovery. Closing the window fully would require SELECT … FOR SHARE on the store row during every CreateKB, which is disproportionate cost for a corner case.
  • Multi-replica registry stalenessUnregisterByStoreID only updates the in-memory registry of the deleting process. Sibling replicas continue to serve the engine until process restart. Documented; cluster-wide invalidation is a separate hardening item.
  • ListKnowledgeBases does not enrich store metadata — N+1 avoidance. The detail endpoint covers the upcoming UI; a batch resolver (BatchResolveStoreView) is shipped on the service interface for the follow-up UI PR to wire up.
  • KBDeletePayload / IndexDeletePayload are unsigned — pre-existing risk inherited from refactor(retriever): introduce factory for KB-scoped retrieve engine resolution #1310. The factory's ownership chain (payload.TenantID ↔ store's tenant_id) prevents cross-tenant delete even with a tampered payload. Adding HMAC is a separate hardening PR.

Test plan

  • go build ./...
  • go vet ./...
  • go test -race ./internal/application/service/retriever/... (sentinel matrix + parallel invocation)
  • go test -race ./internal/types/... (nil-safe helpers + JSON shape regressions)
  • go test ./internal/application/repository/... (CountByVectorStoreID + GORM <-:create immutability)
  • go test ./internal/application/service/... (CreateKB validation matrix, CopyKB defenses, DeleteStore guard against sqlite, ResolveStoreView / BatchResolveStoreView)
  • go test ./internal/handler/... (typed error code preservation, shared-KB UUID suppression, copy pre-flight, reflection-based immutability)
  • Manual smoke against PostgreSQL: create / get / delete flow with and without vector_store_id
  • Manual smoke against the SQLite Lite edition

… and delete

Wires KnowledgeBase.VectorStoreID and the ownership-aware retrieve factory
into the user-facing knowledge-base lifecycle:

- POST /knowledge-bases validates the requested vector_store_id against
  the caller's tenant scope and the engine registry. New error codes
  ErrVectorStoreBindingInvalid (2200) and ErrVectorStoreUnavailable (2201)
  distinguish the typed branches without echoing UUIDs to the client.
- GET / POST / PUT / PUT-pin responses embed the bound store's display
  metadata (name, source, engine_type, status) without exposing any
  connection credentials. Cross-tenant shared KBs receive a suppressed
  payload (vector_store_id stripped, source="shared") so operator-chosen
  store names cannot be enumerated across tenants.
- POST /knowledge-bases/copy synchronously rejects clones whose target
  has a different embedding model or vector store, before the async
  clone task is enqueued. The async clone worker re-applies the same
  checks for defense in depth.
- DELETE /vector-stores/:id refuses to remove a store with bound KBs,
  inside a transaction that row-locks the store on PostgreSQL and
  serializes via WAL on SQLite. unregister-from-registry is wrapped in
  defer/recover so a panic surfaces as a structured warning instead of
  silently leaking a stale engine.
- vector_store_id is immutable after creation. The GORM <-:create tag
  blocks every ORM update path; the service-layer DTO omits the field
  entirely; a reflection-based regression test catches any future
  maintainer who adds it back to either layer.
- Empty-string vector_store_id is normalized to nil at both the create
  path and inside SharesStoreWith, so rows persisted by callers that
  did not run Normalize first cannot trip false same-store comparisons.

Part of Tencent#993. Depends on Tencent#994 and Tencent#1310.
@ochanism ochanism force-pushed the enhancement/1206-3 branch from ce62712 to a2474ce Compare May 18, 2026 04:30
@lyingbug lyingbug merged commit 0e8de61 into Tencent:main May 18, 2026
ochanism added a commit to ochanism/WeKnora that referenced this pull request May 18, 2026
Multi-KB hybrid search now groups KBs by their bound VectorStore (partition
key (storeID, owner_tenant_id)), retrieves in parallel via errgroup with a
SetLimit(4) cap and a per-group timeout (MULTI_STORE_RETRIEVE_TIMEOUT_SEC,
default 30s), and merges results. When the collected results span more than
one engine type, an EngineAwareNormalizer rescales vector scores to [0, 1];
keyword (BM25) scores pass through to the existing RRF fusion. Single-group
calls take the fast path with zero fan-out overhead, preserving today's
behavior for deployments where every KB has vector_store_id = NULL.

Embedding-model consistency is now enforced explicitly via
ResolveEmbeddingModelKeys. Multi-KB searches across KBs whose resolved
model identities differ return BadRequest instead of silently producing
incomparable scores. Cross-tenant Organization-shared KBs are preserved by
partitioning on KB.TenantID so the factory's ownership lookup runs against
the source tenant.

Service-layer typed AppErrors (ErrVectorStoreBindingInvalid 2200 /
ErrVectorStoreUnavailable 2201) are mapped from PR2 sentinel hierarchy and
preserved end-to-end: the iterative FAQ path returns them rather than
swallowing, and the HybridSearch handler routes typed AppErrors to the
client unchanged instead of downgrading to 500.

Part of Tencent#993 (Phase 2: Per-KB VectorStore Binding).
Phase 2 roadmap item: PR 4 (Multi-store fan-out search).
Depends on Tencent#994, Tencent#1310, Tencent#1372.
ochanism added a commit to ochanism/WeKnora that referenced this pull request May 18, 2026
Multi-KB hybrid search now groups KBs by their bound VectorStore (partition
key (storeID, owner_tenant_id)), retrieves in parallel via errgroup with a
SetLimit(4) cap and a per-group timeout (MULTI_STORE_RETRIEVE_TIMEOUT_SEC,
default 30s), and merges results. When the collected results span more than
one engine type, an EngineAwareNormalizer rescales vector scores to [0, 1];
keyword (BM25) scores pass through to the existing RRF fusion. Single-group
calls take the fast path with zero fan-out overhead, preserving today's
behavior for deployments where every KB has vector_store_id = NULL.

Embedding-model consistency is now enforced explicitly via
ResolveEmbeddingModelKeys. Multi-KB searches across KBs whose resolved
model identities differ return BadRequest instead of silently producing
incomparable scores. Cross-tenant Organization-shared KBs are preserved by
partitioning on KB.TenantID so the factory's ownership lookup runs against
the source tenant.

Service-layer typed AppErrors (ErrVectorStoreBindingInvalid 2200 /
ErrVectorStoreUnavailable 2201) are mapped from PR2 sentinel hierarchy and
preserved end-to-end: the iterative FAQ path returns them rather than
swallowing, and the HybridSearch handler routes typed AppErrors to the
client unchanged instead of downgrading to 500.

Part of Tencent#993 (Phase 2: Per-KB VectorStore Binding).
Phase 2 roadmap item: PR 4 (Multi-store fan-out search).
Depends on Tencent#994, Tencent#1310, Tencent#1372.
ochanism added a commit to ochanism/WeKnora that referenced this pull request May 18, 2026
Multi-KB hybrid search now groups KBs by their bound VectorStore (partition
key (storeID, owner_tenant_id)), retrieves in parallel via errgroup with a
SetLimit(4) cap and a per-group timeout (MULTI_STORE_RETRIEVE_TIMEOUT_SEC,
default 30s), and merges results. When the collected results span more than
one engine type, an EngineAwareNormalizer rescales vector scores to [0, 1];
keyword (BM25) scores pass through to the existing RRF fusion. Single-group
calls take the fast path with zero fan-out overhead, preserving today's
behavior for deployments where every KB has vector_store_id = NULL.

Embedding-model consistency is now enforced explicitly via
ResolveEmbeddingModelKeys. Multi-KB searches across KBs whose resolved
model identities differ return BadRequest instead of silently producing
incomparable scores. Cross-tenant Organization-shared KBs are preserved by
partitioning on KB.TenantID so the factory's ownership lookup runs against
the source tenant.

Service-layer typed AppErrors (ErrVectorStoreBindingInvalid 2200 /
ErrVectorStoreUnavailable 2201) are mapped from PR2 sentinel hierarchy and
preserved end-to-end: the iterative FAQ path returns them rather than
swallowing, and the HybridSearch handler routes typed AppErrors to the
client unchanged instead of downgrading to 500.

Part of Tencent#993 (Phase 2: Per-KB VectorStore Binding).
Phase 2 roadmap item: PR 4 (Multi-store fan-out search).
Depends on Tencent#994, Tencent#1310, Tencent#1372.
ochanism added a commit to ochanism/WeKnora that referenced this pull request May 18, 2026
Multi-KB hybrid search now groups KBs by their bound VectorStore (partition
key (storeID, owner_tenant_id)), retrieves in parallel via errgroup with a
SetLimit(4) cap and a per-group timeout (MULTI_STORE_RETRIEVE_TIMEOUT_SEC,
default 30s), and merges results. When the collected results span more than
one engine type, an EngineAwareNormalizer rescales vector scores to [0, 1];
keyword (BM25) scores pass through to the existing RRF fusion. Single-group
calls take the fast path with zero fan-out overhead, preserving today's
behavior for deployments where every KB has vector_store_id = NULL.

Embedding-model consistency is now enforced explicitly via
ResolveEmbeddingModelKeys. Multi-KB searches across KBs whose resolved
model identities differ return BadRequest instead of silently producing
incomparable scores. Cross-tenant Organization-shared KBs are preserved by
partitioning on KB.TenantID so the factory's ownership lookup runs against
the source tenant. Foreign-tenant KB UUIDs injected via the request body
are rejected via kbShareService.HasTenantKBPermission (Plan 3 of Tencent#1303,
3-D capped) before any retrieval; rejected scopes surface as 404 to avoid
leaking foreign KB existence.

Service-layer typed AppErrors (ErrVectorStoreBindingInvalid 2200 /
ErrVectorStoreUnavailable 2201) are mapped from PR2 sentinel hierarchy and
preserved end-to-end: the iterative FAQ path returns them rather than
swallowing, and the HybridSearch handler routes typed AppErrors to the
client unchanged instead of downgrading to 500.

Part of Tencent#993 (Phase 2: Per-KB VectorStore Binding).
Phase 2 roadmap item: PR 4 (Multi-store fan-out search).
Depends on Tencent#994, Tencent#1310, Tencent#1372.
ochanism added a commit to ochanism/WeKnora that referenced this pull request May 18, 2026
Multi-KB hybrid search now groups KBs by their bound VectorStore (partition
key (storeID, owner_tenant_id)), retrieves in parallel via errgroup with a
SetLimit(4) cap and a per-group timeout (MULTI_STORE_RETRIEVE_TIMEOUT_SEC,
default 30s), and merges results. When the collected results span more than
one engine type, an EngineAwareNormalizer rescales vector scores to [0, 1];
keyword (BM25) scores pass through to the existing RRF fusion. Single-group
calls take the fast path with zero fan-out overhead, preserving today's
behavior for deployments where every KB has vector_store_id = NULL.

Embedding-model consistency is now enforced explicitly via
ResolveEmbeddingModelKeys. Multi-KB searches across KBs whose resolved
model identities differ return BadRequest instead of silently producing
incomparable scores. Cross-tenant Organization-shared KBs are preserved by
partitioning on KB.TenantID so the factory's ownership lookup runs against
the source tenant. Foreign-tenant KB UUIDs injected via the request body
are rejected via kbShareService.HasTenantKBPermission (Plan 3 of Tencent#1303,
3-D capped) before any retrieval; rejected scopes surface as 404 to avoid
leaking foreign KB existence.

Service-layer typed AppErrors (ErrVectorStoreBindingInvalid 2200 /
ErrVectorStoreUnavailable 2201) are mapped from PR2 sentinel hierarchy and
preserved end-to-end: the iterative FAQ path returns them rather than
swallowing, and the HybridSearch handler routes typed AppErrors to the
client unchanged instead of downgrading to 500.

Part of Tencent#993 (Phase 2: Per-KB VectorStore Binding).
Phase 2 roadmap item: PR 4 (Multi-store fan-out search).
Depends on Tencent#994, Tencent#1310, Tencent#1372.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants