| 1 | +# RuVector Dev Tooling — `skillsmith-doc-retrieval` MCP |
| 2 | + |
| 3 | +> ⚠️ **Runtime status: BLOCKED on [SMI-4426](https://linear.app/smith-horn-group/issue/SMI-4426)** |
| 4 | +> |
| 5 | +> Wave 2 Step 1 ([PR #722](https://github.com/smith-horn/skillsmith/pull/722)) |
| 6 | +> ships this package as a **scaffold**. Unit tests cover chunker, metadata |
| 7 | +> store, and config guards (19/19 green), but the `@ruvector/core@0.1.30` |
| 8 | +> integration surface has **not** been runtime-validated. `skill_docs_search` |
| 9 | +> and `skill_docs_reindex` will throw a clear SMI-4426 error at invocation
| 10 | +> instead of crashing on a null reference. `skill_docs_status` works (metadata-only). The rest of
| 11 | +> this guide documents the *intended* end-state — treat it as a spec until |
| 12 | +> SMI-4426 lands. |
| 13 | +> |
| 14 | +> See SMI-4426 for the API mismatch findings (`VectorDb` vs `VectorDB`, |
| 15 | +> `withDimensions(n)` factory, opaque persistence, distance-like scores, |
| 16 | +> platform native bindings). |
| 17 | +
| 18 | +Local, private semantic search over the Skillsmith doc corpus. Wraps |
| 19 | +`@ruvector/core` with Skillsmith's existing `EmbeddingService` so agents can |
| 20 | +call three tools (`skill_docs_search`, `skill_docs_reindex`, `skill_docs_status`)
| 21 | +instead of `Read`-ing whole guides. |
| 22 | + |
| 23 | +Phase 1 of [SMI-4416](https://linear.app/smith-horn-group/issue/SMI-4416) / |
| 24 | +[SMI-4417](https://linear.app/smith-horn-group/issue/SMI-4417). See ADR-117 |
| 25 | +for the design rationale and alternatives considered. |
| 26 | + |
| 27 | +--- |
| 28 | + |
| 29 | +## Setup (first run) |
| 30 | + |
| 31 | +```bash |
| 32 | +docker compose --profile dev up -d |
| 33 | +docker exec skillsmith-dev-1 npm install |
| 34 | +docker exec skillsmith-dev-1 npm run build -w packages/doc-retrieval-mcp |
| 35 | + |
| 36 | +# Build the initial .rvf — runs on host because we do not index CI artifacts |
| 37 | +git submodule update --init # required: docs/internal must be present |
| 38 | +node packages/doc-retrieval-mcp/dist/src/cli.js reindex --full |
| 39 | +``` |
| 40 | + |
| 41 | +Output lands at `.ruvector/skillsmith-docs.rvf` + |
| 42 | +`.ruvector/metadata.json` + `.ruvector/.index-state.json`. All three are |
| 43 | +git-ignored. `.git-crypt-ignore` is **not** needed — smudge/clean filters |
| 44 | +never run on untracked files. |
| 45 | + |
| 46 | +Restart Claude Code so it picks up the new `.mcp.json` entry. |
| 47 | + |
| 48 | +--- |
| 49 | + |
| 50 | +## Tools |
| 51 | + |
| 52 | +| Tool | Purpose | Shape | |
| 53 | +|------|---------|-------| |
| 54 | +| `skill_docs_search` | Semantic doc search | `{ query, k?, min_score?, scope_globs? } → { chunks: [{ id, file_path, line_start, line_end, heading_chain, text, score }] }` | |
| 55 | +| `skill_docs_reindex` | Rebuild / refresh | `{ mode: 'full' \| 'incremental' }` | |
| 56 | +| `skill_docs_status` | Index health check | `{} → { chunkCount, fileCount, lastIndexedSha, lastRunAt, rvfPath, corpusVersion }` | |
| 57 | + |
| 58 | +### Score semantics |
| 59 | + |
| 60 | +Scores are cosine similarity in `[0, 1]`; higher is better. Default `min_score = 0.30`.
| 61 | + |
| 62 | +| Range | Meaning | |
| 63 | +|-------|---------| |
| 64 | +| `< 0.25` | Noise | |
| 65 | +| `0.25–0.40` | Weakly related | |
| 66 | +| `0.40–0.60` | Loosely relevant | |
| 67 | +| `0.60–0.80` | Strongly relevant | |
| 68 | +| `> 0.80` | Near-duplicate / exact | |
| 69 | + |
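The bands above can be read as a simple lookup. The sketch below is illustrative only: `scoreBand` and `filterByMinScore` are hypothetical names, not part of the package. It assumes lower bounds are inclusive (the table does not say which band owns a boundary value) and that `min_score` drops chunks scoring below the threshold.

```typescript
// Hypothetical helper: maps a similarity score to the bands in the table above.
// Boundary handling is an assumption; the table does not specify it.
type ScoreBand =
  | "noise"
  | "weakly related"
  | "loosely relevant"
  | "strongly relevant"
  | "near-duplicate";

function scoreBand(score: number): ScoreBand {
  if (score < 0.25) return "noise";
  if (score < 0.4) return "weakly related";
  if (score < 0.6) return "loosely relevant";
  if (score <= 0.8) return "strongly relevant";
  return "near-duplicate";
}

// Default threshold quoted in this guide.
const DEFAULT_MIN_SCORE = 0.3;

// Assumed semantics: keep only chunks whose score is at least minScore.
function filterByMinScore<T extends { score: number }>(
  chunks: T[],
  minScore: number = DEFAULT_MIN_SCORE,
): T[] {
  return chunks.filter((c) => c.score >= minScore);
}
```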
| 70 | +--- |
| 71 | + |
| 72 | +## Corpus |
| 73 | + |
| 74 | +Defined in |
| 75 | +[`packages/doc-retrieval-mcp/src/corpus.config.json`](../../packages/doc-retrieval-mcp/src/corpus.config.json): |
| 76 | +`CLAUDE.md`, `CONTRIBUTING.md`, `README.md`, `.claude/development/**`, |
| 77 | +`.claude/skills/**/SKILL.md`, `.claude/templates/**`, `docs/internal/**`, |
| 78 | +`packages/*/README.md`. The indexer refuses to start if the |
| 79 | +`docs/internal/` submodule is uninitialized — it would silently omit |
| 80 | +private content otherwise. |
| 81 | + |
| 82 | +## Chunk sizing — design note |
| 83 | + |
| 84 | +`all-MiniLM-L6-v2` has a **256-token hard cap** and `EmbeddingService.embed` |
| 85 | +further truncates input to 1000 chars (~250 tokens). Chunks target |
| 86 | +240 tokens (≈960 chars) with a 48-token overlap. The original plan targeted
| 87 | +500-token chunks, which was infeasible with this model — the second half |
| 88 | +of every chunk would have been ignored by the encoder. Phase 3 |
| 89 | +([SMI-4419](https://linear.app/smith-horn-group/issue/SMI-4419)) revisits |
| 90 | +this if we adopt a longer-context model. |
| 91 | + |
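To make the window arithmetic concrete, here is a minimal sliding-window sketch. It is not the package's chunker: `chunkWords` is a hypothetical name, whitespace-separated words stand in for model tokens, and heading-aware splitting is ignored. With a 240-unit target and 48-unit overlap, each window advances by 192 units.

```typescript
// Sketch only: whitespace-separated words stand in for model tokens.
// The real chunker's tokenizer and heading handling are not shown in this guide.
interface Chunk {
  start: number; // index of the first word in the chunk
  end: number; // index one past the last word
  text: string;
}

function chunkWords(text: string, target = 240, overlap = 48): Chunk[] {
  if (overlap >= target) throw new Error("overlap must be smaller than target");
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = target - overlap; // 192 with the defaults
  const chunks: Chunk[] = [];
  for (let start = 0; start < words.length; start += step) {
    const end = Math.min(start + target, words.length);
    chunks.push({ start, end, text: words.slice(start, end).join(" ") });
    if (end === words.length) break; // this window already reached the end of the input
  }
  return chunks;
}
```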
| 92 | +--- |
| 93 | + |
| 94 | +## Privacy boundary |
| 95 | + |
| 96 | +1. `.ruvector/` is **git-ignored** and **CI-refused**. The indexer exits |
| 97 | + non-zero if `CI=true` or `SKILLSMITH_CI=true`. It also refuses to write |
| 98 | + outside `$REPO_ROOT/.ruvector/`. |
| 99 | +2. `.mcp.json` carries an explicit `disabledTools` block listing 37 Ruflo |
| 100 | + tools with remote-persistence surfaces (AgentDB, hive-mind_memory, |
| 101 | + transfer_*, memory_store, etc.). Authoritative list lives in |
| 102 | + [`docs/internal/architecture/ruflo-tool-classification.md`](../../docs/internal/architecture/ruflo-tool-classification.md) |
| 103 | + (SMI-4420). Re-audit when Ruflo bumps a minor version. |
| 104 | +3. The corpus includes `docs/internal/**/*.md` (private submodule). The |
| 105 | + resulting `.rvf` is a searchable index of that content — treat it with |
| 106 | + the same confidentiality as the submodule itself. |
| 107 | + |
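The two refusals in item 1 can be sketched as a single pre-flight check. This is an assumption-heavy illustration rather than the indexer's code: `isCi`, `normalizePosix`, and `assertSafeToIndex` are hypothetical names, the path handling covers absolute POSIX paths only (no symlinks, no Windows paths), and it throws where the real indexer is described as exiting non-zero.

```typescript
// Hypothetical guard mirroring the CI refusal and the output-path refusal.
type Env = Record<string, string | undefined>;

function isCi(env: Env): boolean {
  return env["CI"] === "true" || env["SKILLSMITH_CI"] === "true";
}

// Resolve "." and ".." segments in an absolute POSIX path (simplified).
function normalizePosix(p: string): string {
  const out: string[] = [];
  for (const seg of p.split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") out.pop();
    else out.push(seg);
  }
  return "/" + out.join("/");
}

function assertSafeToIndex(env: Env, repoRoot: string, outputPath: string): void {
  if (isCi(env)) throw new Error("refusing to run in CI");
  const allowed = normalizePosix(repoRoot + "/.ruvector");
  const target = normalizePosix(outputPath);
  // Accept the directory itself or anything strictly beneath it.
  if (target !== allowed && !target.startsWith(allowed + "/")) {
    throw new Error(`refusing to write outside ${allowed}`);
  }
}
```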
| 108 | +--- |
| 109 | + |
| 110 | +## Post-commit hook |
| 111 | + |
| 112 | +`.husky/post-commit` runs an incremental re-index in the background when: |
| 113 | + |
| 114 | +- `$REPO_ROOT/.ruvector/skillsmith-docs.rvf` exists (first run is manual). |
| 115 | +- `packages/doc-retrieval-mcp/dist/src/cli.js` exists (package is built). |
| 116 | +- `CI` and `SKILLSMITH_CI` are unset. |
| 117 | + |
| 118 | +The indexer uses `GIT_OPTIONAL_LOCKS=0` and passes |
| 119 | +`--no-optional-locks` to every `git diff` invocation, avoiding the |
| 120 | +SMI-2536 smudge-filter branch-switch hazard. Hook failure is non-fatal |
| 121 | +and non-blocking. |
| 122 | + |
| 123 | +To disable the auto-reindex, delete the `.rvf` (the hook skips when no index
| 124 | +exists) or remove `packages/doc-retrieval-mcp/dist/` so the CLI is not found.
| 125 | + |
| 126 | +--- |
| 127 | + |
| 128 | +## Operations |
| 129 | + |
| 130 | +### Rebuild from scratch |
| 131 | + |
| 132 | +```bash |
| 133 | +rm -rf .ruvector/ |
| 134 | +node packages/doc-retrieval-mcp/dist/src/cli.js reindex --full |
| 135 | +``` |
| 136 | + |
| 137 | +### Verify a query end-to-end |
| 138 | + |
| 139 | +```bash |
| 140 | +node packages/doc-retrieval-mcp/dist/src/cli.js status |
| 141 | +node -e "import('./packages/doc-retrieval-mcp/dist/src/search.js').then(m => m.search({ query: 'git-crypt worktrees', k: 3 })).then(r => console.log(JSON.stringify(r, null, 2)))" |
| 142 | +``` |
| 143 | + |
| 144 | +### Token-delta measurement (Wave 2 Step 6 gate) |
| 145 | + |
| 146 | +```bash |
| 147 | +node scripts/token-delta-harness.mjs run --mode baseline |
| 148 | +node scripts/token-delta-harness.mjs run --mode measured |
| 149 | +node scripts/token-delta-harness.mjs compare |
| 150 | +``` |
| 151 | + |
| 152 | +Pass = ≥40% median input-token reduction across the three tasks in |
| 153 | +[`scripts/ruvector-harness-tasks.json`](../../scripts/ruvector-harness-tasks.json). |
| 154 | +Fail = Phase 2 abandoned, retro filed. |
| 155 | + |
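One plausible reading of the gate, sketched under stated assumptions: compute each task's percentage reduction in input tokens, take the median across tasks, and compare it with 40. The harness may aggregate differently (for example, by comparing medians of raw token counts), and `reductionPct`, `median`, and `passesGate` are hypothetical names. Under this reading, per-task reductions of 30%, 45%, and 60% have a median of 45% and would pass.

```typescript
// Assumption: the gate is the median of per-task percentage reductions.
function reductionPct(baseline: number, measured: number): number {
  if (baseline <= 0) throw new Error("baseline must be positive");
  return ((baseline - measured) / baseline) * 100;
}

function median(values: number[]): number {
  if (values.length === 0) throw new Error("no values");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

function passesGate(
  runs: { baseline: number; measured: number }[],
  thresholdPct = 40,
): boolean {
  const reductions = runs.map((r) => reductionPct(r.baseline, r.measured));
  return median(reductions) >= thresholdPct;
}
```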
| 156 | +--- |
| 157 | + |
| 158 | +## Troubleshooting |
| 159 | + |
| 160 | +| Symptom | Fix | |
| 161 | +|---------|-----| |
| 162 | +| `index not built` error from `skill_docs_search` | `node packages/doc-retrieval-mcp/dist/src/cli.js reindex --full` | |
| 163 | +| `required submodule 'docs/internal' is not initialized` | `git submodule update --init` | |
| 164 | +| `refusing to run in CI` | Expected — indexer never runs in CI. | |
| 165 | +| MCP server doesn't appear in Claude Code | Restart Claude Code after editing `.mcp.json`. Run the package build first: `docker exec skillsmith-dev-1 npm run build -w packages/doc-retrieval-mcp`. | |
| 166 | +| Stale results after many edits | `rm -rf .ruvector && node packages/doc-retrieval-mcp/dist/src/cli.js reindex --full` | |
| 167 | + |
| 168 | +--- |
| 169 | + |
| 170 | +## Deferred |
| 171 | + |
| 172 | +Phase 2 promotes `skill_docs_search` into `@skillsmith/mcp-server` with |
| 173 | +an `installed`/`registry` scope split (registry side uses pgvector on |
| 174 | +Supabase, not RuVector — Deno cannot load the native module). Phase 3 |
| 175 | +evaluates longer-context embedding models and potentially replaces the |
| 176 | +HNSW brute-force fallback in `packages/core/src/embeddings/hnsw-store.ts` |
| 177 | +(SMI-1519 / SMI-4419). |