Hybrid semantic + keyword search over Claude Code conversation history, exposed as a CLI and an MCP server.
A personal research surface for retrieval-quality work over Claude Code conversation history. The aim is to make hybrid retrieval over a developer's own chat archive measurable and improvable, not to be production infrastructure. See Retrieval Evaluation for the harness and the methodology used to assess it; see Architecture for how the pieces fit together.
- Background
- Install
- Usage
- Architecture
- Retrieval Evaluation
- Chunking
- Configuration
- Development
- Security
- API
- Maintainer
- Contributing
- License
Claude Code already records every session as JSONL under ~/.claude/projects/. That archive grows fast and becomes hard to search with grep alone, especially across projects. Claude KB pipes that archive into a Qdrant collection with hybrid (dense + sparse) retrieval and exposes it back to Claude Code as an MCP server, so the agent can search its own history without leaving the editor.
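For orientation, each session file is one JSON object per line. A minimal sketch of walking that archive (the `uuid` and `type` field names are assumptions for illustration, not guaranteed by Claude Code's format):

```python
import json
from pathlib import Path

def iter_messages(projects_dir: Path):
    """Yield one parsed JSON object per line across all session files."""
    for session_file in projects_dir.glob("*/*.jsonl"):
        with session_file.open() as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)

# Usage (against the default archive location):
# for msg in iter_messages(Path.home() / ".claude" / "projects"):
#     print(msg.get("uuid"), msg.get("type"))
```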
Scope and non-goals:
- Scope: a measurable retrieval surface over one developer's local Claude Code archive. Optimised for a single laptop and a local Qdrant instance.
- Non-goals: multi-tenant deployment, hosted SaaS, ingesting non-Claude-Code corpora, replacing a general-purpose RAG framework.
- Status: alpha. The author uses it daily; assume rough edges and expect to read source.
Prerequisites: Python 3.13+, uv, Docker (for the local Qdrant instance), Claude Code (for the MCP integration).
```shell
# 1. Install the CLI
uv tool install claude-kb

# 2. Start a local Qdrant
docker compose up -d

# 3. Import your Claude Code conversation history
kb import-claude-code-chats

# 4. Register the MCP server with Claude Code
claude mcp add -s user kb -- kb mcp
```

After step 4, Claude Code has access to two tools, kb_search and kb_get, against your imported history. The first import re-embeds every message and may take several minutes; subsequent imports are incremental and only embed new messages.
```shell
uv tool upgrade claude-kb
kb --version
```

```shell
kb search "recency boost implementation"
kb search "error handling" --project claude-kb --from 2026-01-01 --limit 5
kb get <message-uuid>
kb get-thread <message-uuid> --depth 3
kb status
kb ai  # LLM-optimized command schema
```

Full flag list per command: `kb <command> --help`.
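`kb get-thread` walks a message's parent chain; a minimal sketch of that traversal (the `parent_of` mapping from message UUID to parent UUID is an assumption for illustration):

```python
def thread_context(message_id, parent_of, depth=3):
    """Return up to `depth` ancestors of a message, oldest first."""
    chain = []
    current = parent_of.get(message_id)
    while current is not None and len(chain) < depth:
        chain.append(current)
        current = parent_of.get(current)
    chain.reverse()  # present the thread in chronological order
    return chain
```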
Once registered with claude mcp add, Claude Code can call:
- `kb_search(query, ...)` - hybrid search; optional filters for project, conversation, role, date range, score threshold; optional grouping by conversation.
- `kb_get(message_id | conversation_id, ...)` - retrieve a single message, a thread context, or restore a full conversation transcript.
Streamable HTTP transport is also supported:
```shell
kb mcp --transport http --port 3000
```

See docs/mcp-api.md for the full schema reference.
```
~/.claude/projects/*/<session>.jsonl
  -> parse (import_claude.py)
  -> classify content_type (prose/tool_use/tool_result/thinking/mixed)
  -> embed dense BGE-base 768d
  -> Qdrant collection: conversations_hybrid
  -> retrieve query_points(dense)
       + server-side filters: project, conversation_id, role, date range,
         primary_content_type (default-deny on tool_result + thinking)
       + score_threshold
  -> post-process recency boost / compact / grouping
  -> CLI (kb ...) | MCP server (kb_search, kb_get)
```
One Qdrant point per Claude Code message; no sub-message chunking. Dense retrieval uses BAAI/bge-base-en-v1.5 (768d, L2-normalised). Recent messages are boosted post-retrieval with +0.2 * exp(-age / 1 week). Tool-result and thinking blocks are excluded from search results by default - both are dominant noise sources in code-conversation corpora; users opt in via include_tool_results=True / include_thinking=True when needed.
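The post-retrieval scoring described above can be sketched in plain Python. The hit shape and field names are illustrative (not the project's actual models), and in the real system the content-type filter runs server-side in Qdrant; it is shown inline here for clarity:

```python
import math

WEEK_SECONDS = 7 * 24 * 3600
EXCLUDED_BY_DEFAULT = {"tool_result", "thinking"}

def rescore(hits, now, include_tool_results=False, include_thinking=False):
    """Drop excluded content types, apply the recency boost
    +0.2 * exp(-age / 1 week), and re-sort by boosted score."""
    excluded = set(EXCLUDED_BY_DEFAULT)
    if include_tool_results:
        excluded.discard("tool_result")
    if include_thinking:
        excluded.discard("thinking")
    kept = [h for h in hits if h["primary_content_type"] not in excluded]
    for h in kept:
        age = now - h["timestamp"]  # seconds since the message was written
        h["score"] += 0.2 * math.exp(-age / WEEK_SECONDS)
    return sorted(kept, key=lambda h: h["score"], reverse=True)
```

Note the boost is additive on the similarity score, so a week-old message gains roughly 0.074 while a brand-new one gains the full 0.2.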
The collection schema also reserves a sparse vector slot, but the production search path is dense-only. The eval (docs/retrieval-experiments-2026-05.md) showed every hybrid configuration tested (BM25 fusion, bge-m3, Qwen3-Embedding-8B) regresses Recall@10 by 0.075-0.22 on this corpus shape; sparse vectors are stored only to keep the door open for future experiments.
Full diagram and per-stage notes: docs/architecture.md.
Measured on the maintainer's corpus (~690k messages, 20 hand-graded queries across five categories, conversation-level grading with cross-phrasing to defeat the selection bias of self-grading). At --min-score 0.0, k=10:
| Mode | Recall@10 | MRR@10 |
|---|---|---|
| dense-only, content-type filter on (default) | 0.368 | 0.397 |
| dense-only, recency boost on | 0.368 | 0.440 |
| dense-only, filter off | 0.361 | 0.389 |
| hybrid (RRF of dense + BM25) | regressed -0.075 vs dense-only on the 28-query expanded test | — |
| sparse-only (BM25) | regressed -0.18 vs dense-only on the 28-query expanded test | — |
Five hybrid- and encoder-replacement experiments were tested across this work (BM25 hyperparameter tuning, RRF prefetch pool size, bge-m3 dense, bge-m3 sparse, Qwen3-Embedding-8B). All but one (RRF prefetch_factor=30, +0.024 MRR) regressed Recall@10 by 0.075-0.22 versus dense-only BGE-base. The corpus shape - short-form English code-conversation messages, ~47 words/doc median - is the constraint, not the encoder. Full table, methodology, and per-experiment failure analysis: docs/retrieval-experiments-2026-05.md.
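For reference, the two headline metrics can be computed as follows (a generic sketch; under conversation-level grading the ranked items and relevant set would be conversation IDs):

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant item in the top k, else 0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0
```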
The most measurable improvement of the work was a server-side filter that excludes tool_result and thinking blocks from search results by default (controlled by the primary_content_type payload tag). It lifted dense-only Recall@10 from 0.352 → 0.502 in an earlier iteration on a related corpus split. The current 20-query eval is published in docs/evaluation.md with per-query and per-category breakdowns.
Harness: `scripts/run_eval.py`; query set: `tests/eval/queries.jsonl`. The harness will not fabricate metrics; if queries are ungraded it prints `ungraded, N queries pending` and exits 0.
Adjacent measurements: MCP response token reduction (29% mean / 86% peak from compact mode, see CHANGELOG.md), restore-mode unit tests (tests/test_search_service.py), content-type classifier tests (tests/test_content_type.py).
One Claude Code message, one Qdrant point. No sub-message chunking. The choice is load-bearing for the rest of the design (point IDs are message UUIDs, kb_get round-trips with kb_search), and it accepts known tradeoffs (SPLADE input truncated to 8000 chars per message; long-form prose recall is weaker than a sliding-window approach would deliver).
Why this is the right unit, what we lose, alternatives considered, and when to revisit: docs/chunking.md.
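A sketch of the one-message-one-point mapping described above. Payload field names are illustrative; only the UUID-as-point-ID and the 8000-char truncation come from the text:

```python
SPARSE_INPUT_LIMIT = 8000  # chars; sparse-encoder input is truncated here

def message_to_point(msg):
    """Map one Claude Code message to one Qdrant-style point dict."""
    text = msg["text"]
    return {
        "id": msg["uuid"],                        # point ID is the message UUID
        "sparse_input": text[:SPARSE_INPUT_LIMIT],
        "payload": {
            "conversation_id": msg["conversation_id"],
            "primary_content_type": msg["content_type"],
        },
    }
```

Because the point ID is the message UUID, a `kb_search` hit can be passed straight to `kb_get` without any extra lookup table.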
Environment variables (or a .env file in the working directory):
| Variable | Default | Purpose |
|---|---|---|
| `QDRANT_URL` | `http://localhost:6333` | Qdrant endpoint. Override for remote clusters. |
| `QDRANT_API_KEY` | unset | API key for Qdrant Cloud. |
| `EMBEDDING_MODEL` | `BAAI/bge-base-en-v1.5` | HuggingFace model name for the dense encoder. |
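A minimal `.env` for a remote Qdrant Cloud cluster might look like this (the values are placeholders, not real endpoints or keys):

```shell
QDRANT_URL=https://your-cluster.cloud.qdrant.io:6333
QDRANT_API_KEY=your-api-key
EMBEDDING_MODEL=BAAI/bge-base-en-v1.5
```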
Apple Silicon (MPS), CUDA, and CPU are auto-detected by sentence-transformers. The dense encoder is the production retrieval signal; the collection schema reserves a sparse vector slot but the production search path does not query it (see Retrieval Evaluation for why).
```shell
git clone https://github.com/tenequm/claude-kb.git
cd claude-kb
uv sync --extra dev
just check        # ty type-check + ruff lint + format
uv run pytest -q  # unit tests
```

Pre-commit is configured via `.pre-commit-config.yaml` (ruff, secrets scan, basic hygiene).
Vulnerability reporting policy: SECURITY.md. The MCP server binds to 127.0.0.1 by default, queries a local Qdrant instance only, and exposes only read operations.
The MCP server exposes two tools. Both are read-only, idempotent, and run entirely against a local Qdrant instance.
| Tool | Purpose | Key parameters |
|---|---|---|
| `kb_search` | Hybrid semantic + keyword search across all imported messages. | `query`, `limit`, `project`, `conversation_id`, `role`, `from_date`, `to_date`, `min_score`, `boost_recent`, `group_by_conversation` |
| `kb_get` | Retrieve a single message, a thread context, or restore a full conversation transcript. | `message_id`, `conversation_id`, `up_to`, `context_depth`, `max_messages` |
Output models, filter application order, error modes, and non-obvious filter semantics are documented in docs/mcp-api.md. Pydantic models live in src/claude_kb/models.py.
Misha Kolesnik - @tenequm - misha@kolesnik.io
Issues and PRs are welcome at https://github.com/tenequm/claude-kb. Commit messages follow Conventional Commits (feat:, fix:, docs:, chore:, refactor:, test:). Please run just check and uv run pytest -q before opening a PR.
MIT (c) 2025 Misha Kolesnik