feat(grep): integrate VikingDB bm25 keyword search for grep engine #2144
ByteDanceLiuYang wants to merge 10 commits into
    node_limit: i32,
    level_limit: i32,
    engine: Option<String>,
    switch_to_remote_threshold: Option<i32>,
Since these parameters are rarely used, consider putting them in ovcli.conf rather than exposing them as flags.
    offset: int = 0,
    filters: Optional[Dict[str, Any]] = None,
    output_fields: Optional[List[str]] = None,
    mode: Optional[str] = None,
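OK, I have removed them for now. These 2 parameters do not actually need to be passed in openviking at the moment; omitting them simply uses the default behavior.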
…che, use Literal, Split regex alternation into individual keywords for bm25 (max 10)
…v suffixes in version comparison
…nt, not necessary to fallback to local fs
…arams in keywords search
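One of the commits above mentions splitting a regex alternation into individual keywords for bm25 (max 10). A minimal sketch of what such a split could look like is given below. The function name split_alternation, the handling of groups and character classes, and the deduplication step are assumptions made for illustration; they are not taken from the PR's code.

```python
from typing import List


def split_alternation(pattern: str, max_keywords: int = 10) -> List[str]:
    """Split a regex on top-level '|' into at most max_keywords parts.

    Hypothetical helper: alternations nested inside (...) or [...] and
    escaped pipes ("\\|") are left untouched.
    """
    parts: List[str] = []
    current: List[str] = []
    depth = 0          # nesting depth of (...) groups
    in_class = False   # True while inside a [...] character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Keep an escaped character attached to the current part.
            current.append(pattern[i:i + 2])
            i += 2
            continue
        if in_class:
            in_class = ch != "]"
        elif ch == "[":
            in_class = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        elif ch == "|" and depth == 0:
            # Top-level alternation: close the current part.
            parts.append("".join(current))
            current = []
            i += 1
            continue
        current.append(ch)
        i += 1
    parts.append("".join(current))
    # Drop empty alternatives, deduplicate in order, then cap the count.
    unique = list(dict.fromkeys(p.strip() for p in parts if p.strip()))
    return unique[:max_keywords]


if __name__ == "__main__":
    print(split_alternation("VikingDB|bm25|grep"))
```

Each returned part would then be sent as its own bm25 keyword, while the full original regex is still applied in the local matching phase, so over-recall from the keyword split does not change the final result.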
Summary
The existing grep performs a full filesystem traversal: it walks the directory tree, reads every file, and applies the regex line by line. This becomes prohibitively slow on large codebases (for example, tens of thousands of files and hundreds of MB), where a single grep can take minutes.

This PR introduces a two-phase grep strategy: use VikingDB search_by_keywords (bm25 mode) as a coarse-grained recall filter to narrow down candidate files, then perform precise local regex matching only on those candidates. The engine is configurable per-request (auto/fs). In auto mode, the system adaptively switches between pure fs and vikingdb+fs based on collection size and schema compatibility, with automatic fallback on any failure.

Type of Change
Feature Usage
API & Client
GrepRequest params: engine: Literal["auto", "fs"] = "auto", switch_to_remote_threshold: int = 1000, remote_return_limit: int = 100
--engine, --switch-to-remote-threshold, --remote-return-limit flags on the ov grep command

API Parameters Added (in grep API)

| Parameter | Type | Default |
| --- | --- | --- |
| engine | Literal["auto", "fs"] | "auto" |
| switch_to_remote_threshold | int | 1000 |
| remote_return_limit | int | 100 |

These parameters are passed per-request on the grep API endpoint, alongside the existing pattern, uri, etc. The count_cache_ttl is hardcoded to 60 seconds (not configurable).

Usage Example
1. Configure ov.conf for VikingDB backend

The storage.vectordb section must use the volcengine or vikingdb backend to enable bm25 recall. Example:

    {
      "storage": {
        // ...
        "vectordb": {
          "backend": "volcengine",
          "volcengine": {
            "ak": "YOUR_AK",
            "sk": "YOUR_SK",
            "region": "cn-beijing"
          },
          "name": "my_collection_for_ov",
          "index_name": "my_index_1"
        }
      }
    }

2. Basic grep (auto mode, default)
    ov --account default --user default grep --uri viking://resources/code 'VikingDB'

This uses engine=auto by default. If the collection has ≥1000 L2 records and supports FullText, it uses vikingdb bm25 recall + fs precise match; otherwise it falls back to pure fs.

3. Force filesystem grep

    ov --account default --user default grep --uri viking://resources/code --engine fs 'VikingDB'

4. Always use vikingdb (threshold=0)

    ov --account default --user default grep --uri viking://resources/code \
      --switch-to-remote-threshold 0 'VikingDB'

5. Increase vikingdb recall limit

    ov --account default --user default grep --uri viking://resources/code \
      --switch-to-remote-threshold 0 --remote-return-limit 500 'VikingDB'

This recalls up to 500 candidate files from vikingdb bm25 before doing local regex matching.
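The two-phase flow exercised by these commands (coarse recall, then precise local regex matching) can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the corpus is an in-memory dict instead of a real filesystem, and two_phase_grep, grep_in_files, and fake_recall are hypothetical names, with fake_recall standing in for the VikingDB bm25 call.

```python
import re
from typing import Callable, Dict, Iterable, List, Tuple

Match = Tuple[str, int, str]  # (uri, 1-based line number, line text)


def grep_in_files(pattern: str, files: Dict[str, str],
                  uris: Iterable[str]) -> List[Match]:
    """Phase 2: apply the regex line by line, only to the given files."""
    regex = re.compile(pattern)
    matches: List[Match] = []
    for uri in uris:
        text = files.get(uri)
        if text is None:  # candidate no longer exists locally
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                matches.append((uri, lineno, line))
    return matches


def two_phase_grep(pattern: str, files: Dict[str, str],
                   recall: Callable[[str, int], List[str]],
                   remote_return_limit: int = 100) -> List[Match]:
    """Phase 1: ask the recall backend for candidates, then run phase 2.

    Assumption: any recall error falls back to scanning every file.
    """
    try:
        candidates = recall(pattern, remote_return_limit)
    except Exception:
        candidates = list(files)
    return grep_in_files(pattern, files, candidates)


# Tiny in-memory corpus used only for this illustration.
FILES = {
    "a.py": "import VikingDB\nx = 1",
    "b.py": "print('hello')",
    "c.md": "vikingdb notes\nVikingDB bm25",
}


def fake_recall(keyword: str, limit: int) -> List[str]:
    """Stand-in for bm25 recall: case-insensitive substring containment."""
    hits = [u for u, t in FILES.items() if keyword.lower() in t.lower()]
    return hits[:limit]


if __name__ == "__main__":
    for match in two_phase_grep("VikingDB", FILES, fake_recall):
        print(match)
```

Because the recall step is only a filter, a loose keyword match costs extra local work but cannot produce a false match; a recall that misses a file, however, would hide real matches, which is why the recall limit matters.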
Changes Made
1. Grep Engine Modes
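auto (default): if the collection supports FullText and its record count reaches switch_to_remote_threshold, uses vikingdb recall + fs match; otherwise falls back to pure fs.
fs: always uses pure filesystem grep.

Auto mode decision chain: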
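A minimal sketch of such a decision chain, based only on the behavior described in this PR (explicit fs override, FullText check, record-count threshold with a 60-second count cache, fallback to fs on any failure). The order of the checks, the function names resolve_grep_engine and get_cached_count, and the callback signatures are assumptions for illustration.

```python
import time
from typing import Callable, Dict, Tuple

COUNT_CACHE_TTL = 60.0  # seconds; the PR describes this value as hardcoded

# Per-URI cache: uri -> (timestamp, record count)
_count_cache: Dict[str, Tuple[float, int]] = {}


def get_cached_count(uri: str, count_fn: Callable[[str], int],
                     now: Callable[[], float] = time.monotonic) -> int:
    """Return the record count for uri, refreshing it after the TTL."""
    current = now()
    cached = _count_cache.get(uri)
    if cached is not None and current - cached[0] < COUNT_CACHE_TTL:
        return cached[1]
    count = count_fn(uri)
    _count_cache[uri] = (current, count)
    return count


def resolve_grep_engine(engine: str, uri: str,
                        has_fulltext: Callable[[], bool],
                        count_fn: Callable[[str], int],
                        switch_to_remote_threshold: int = 1000) -> str:
    """Return "fs" or "vikingdb+fs" for a single grep request."""
    if engine == "fs":
        return "fs"  # explicit override, no remote calls at all
    try:
        if not has_fulltext():
            return "fs"  # collection lacks the content field / FullText
        count = get_cached_count(uri, count_fn)
        if count < switch_to_remote_threshold:
            return "fs"  # small collection: local traversal is cheap
    except Exception:
        return "fs"  # any failure falls back to the filesystem
    return "vikingdb+fs"


if __name__ == "__main__":
    print(resolve_grep_engine("auto", "viking://resources/code",
                              lambda: True, lambda uri: 5000))
```

With switch_to_remote_threshold set to 0, the count check can never select fs, which matches the "always use vikingdb" usage shown earlier as long as the collection supports FullText.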
2. Collection Interface Layer
search_by_keywords gains mode: Optional[str] and fields: Optional[List[str]] params across all 6 collection implementations (vikingdb, volcengine, volcengine_api_key, http, local, mock)

3. Schema & Config
content text field in context collection schema for FullText indexing
FullText config: [{"Field": "content", "Analyzer": {"Tokenizer": "standard"}}]
schema_version: "0.3.18" added to collection embedding metadata for version-aware compatibility checks
engine (Literal["auto", "fs"]), switch_to_remote_threshold (int, default=1000, ≥0), remote_return_limit (int, default=100, 1–100000); count_cache_ttl hardcoded to 60s

4. Data Pipeline
embedding_msg_converter.py writes vectorization_text[:65536] to the content field

5. Business Logic (viking_fs.py)
grep() refactored with engine dispatch: _resolve_grep_engine() → _grep_fs() or _grep_vikingdb_then_fs()
_grep_vikingdb_then_fs(): bm25 recall → _grep_in_files() precise regex match; auto-fallback to fs on vikingdb errors
_get_cached_count(): per-URI count cache with hardcoded TTL=60s
_collection_has_fulltext(): checks content field + FullText config in collection metadata

6. Backend & Adapter
CollectionAdapter.search_by_keywords() delegates to collection, normalizes records
VikingVectorIndexBackend.search_by_keywords() and get_collection_meta() async methods

7. User-Agent Header
All VikingDB HTTP requests now include a User-Agent header with the format openviking/{version} (e.g., openviking/0.3.18). This helps the VikingDB server side identify request sources for troubleshooting and traffic analytics.

8. Schema Compatibility
Existing collections created before v0.3.18 will not have the content field or FullText config. On startup, init_context_collection() detects this via schema_version in the Description metadata and logs a warning. The grep engine=auto path automatically falls back to fs mode for such collections. To enable vikingdb-based grep, the collection must be recreated.

Testing
113/113 storage tests pass
Updated test_init_context_collection_warns_on_mismatched_nonempty_collection: reflects current behavior (warn + return False instead of raising)
Updated test_context_collection_contains_content_field_for_fulltext: validates content field and FullText config presence
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have tested this on the following platforms:
Benchmark
1. Step-by-step workflow
Benchmark scripts are located at benchmark/retrieval/grep/vikingdb_bm25/:

2. Results
Environment: Debian 10, 12c24m
Total Data: 80,000 files, ~4GB, 4-level directory tree
Key Findings:
zero results, while bm25 returns empty immediately.
|matches faster in fs because it hits more files early.