perf(mining): batch per-chunk upserts + optional GPU acceleration#1185
Conversation
The miner upserted one drawer per ChromaDB call, paying tokenizer + ONNX session setup per chunk. The embedding device was CPU-only because no `EmbeddingFunction` was ever wired through the backend. Two changes, each a speedup in its own right; stacked they give ~10x end-to-end on a medium corpus (20 files, 568 drawers):

1. Batched upsert. `process_file` and `_file_chunks_locked` now collect all chunks of a file into a single `collection.upsert(...)` so the embedding model runs one forward pass per file instead of N.
2. Hardware-accelerated embedding function. New `mempalace/embedding.py` wraps `ONNXMiniLM_L6_V2` with configurable `preferred_providers`. `MEMPALACE_EMBEDDING_DEVICE` (or `embedding_device` in config.json) selects auto / cpu / cuda / coreml / dml. Unavailable accelerators log a warning and fall back to CPU.

The factory subclasses `ONNXMiniLM_L6_V2` and spoofs its `name()` to `"default"` so the persisted EF identity matches existing palaces created with ChromaDB's bare `DefaultEmbeddingFunction` -- same model, same 384-dim vectors, no rebuild needed when turning GPU on. `ChromaBackend.get_collection` / `create_collection` now pass the resolved EF on every call so miner writes and searcher reads agree.

Benchmarks (i9-12900KF + RTX 3090, medium scenario, 568 drawers):

- per-chunk + CPU: 19.77s · 29 drw/s (baseline)
- batched + CPU: 8.07s · 70 drw/s (2.4x)
- batched + CUDA: 2.15s · 264 drw/s (9.2x)

Reproducible via `benchmarks/mine_bench.py`. Install paths:

```
pip install mempalace[gpu]     # NVIDIA CUDA
pip install mempalace[dml]     # DirectML (Windows)
pip install mempalace[coreml]  # macOS Neural Engine
```

Mine header now prints `Device: cpu|cuda|...` so users can confirm the accelerator engaged.
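The batched-upsert shape described in change (1) is easy to sketch. The helper below is illustrative only (the name `upsert_file_chunks` and the id scheme are assumptions, not the PR's actual code); it shows the collect-then-single-upsert pattern:

```python
def upsert_file_chunks(collection, file_path: str, chunks: list[str]) -> int:
    """Collect every chunk of one file, then write them in a single upsert.

    One batched call means the embedding model runs one forward pass over
    all chunks instead of paying tokenizer + session overhead per chunk.
    """
    batch_docs, batch_ids, batch_metas = [], [], []
    for i, chunk in enumerate(chunks):
        batch_docs.append(chunk)
        batch_ids.append(f"{file_path}::chunk{i}")
        batch_metas.append({"source": file_path, "chunk_index": i})
    if batch_docs:  # skip empty files: Chroma rejects zero-length upserts
        collection.upsert(documents=batch_docs, ids=batch_ids, metadatas=batch_metas)
    return len(batch_docs)
```

On the real miner this runs once per file, so the embedding model sees one batch per file rather than one call per chunk.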
Pull request overview
Improves mining throughput and enables optional hardware-accelerated embeddings by batching ChromaDB upserts per file and wiring a configurable ONNX Runtime execution provider through the Chroma backend.
Changes:
- Batch per-file drawer upserts in the project miner and conversation miner to reduce embedding overhead.
- Add a new embedding-function factory with configurable device/provider selection (auto/cpu/cuda/coreml/dml) and expose it via config/env.
- Pass the resolved embedding function into Chroma collection creation/open calls; add optional dependency extras for accelerator runtimes.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| uv.lock | Adds locked optional-dependency entries for gpu/dml/coreml extras (onnxruntime variants). |
| pyproject.toml | Defines new optional-dependency extras and documents how to select devices. |
| mempalace/miner.py | Reworks mining to batch per-file upserts and prints the resolved embedding device. |
| mempalace/convo_miner.py | Batches per-file upserts for conversation mining. |
| mempalace/embedding.py | New embedding-function factory with provider resolution, caching, and device description. |
| mempalace/config.py | Adds embedding_device config/env property. |
| mempalace/backends/chroma.py | Wires embedding function into get_collection / create_collection. |
| benchmarks/mine_bench.py | Adds a benchmark script comparing legacy per-chunk upserts vs batched upserts across devices. |
```python
# Batch all chunks of this file into a single upsert so the embedding
# model runs one forward pass over the whole file instead of N passes
# of one chunk each. On CPU this is typically a 10-30x speedup; on
# GPU the speedup is larger because per-call overhead dominates.
try:
    source_mtime = os.path.getmtime(source_file)
except OSError:
    source_mtime = None

batch_docs: list = []
batch_ids: list = []
batch_metas: list = []
```
process_file() now materializes batch_docs/batch_ids/batch_metas for the entire file at once. Given MAX_FILE_SIZE is 500MB and CHUNK_SIZE is 800 chars, a single large text file can produce an enormous number of chunks; duplicating those into three additional lists can blow memory and can also exceed Chroma/SQLite practical batch limits. Consider chunking the upsert into bounded sub-batches (e.g., 256–5000 items) to keep memory and DB variable counts under control while still getting most of the batching speedup.
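A bounded sub-batching helper along the lines of this suggestion might look like the sketch below (the name `iter_subbatches` and the 1000-item default are illustrative; the comment proposes anywhere in the 256-5000 range):

```python
def iter_subbatches(docs: list, ids: list, metas: list, max_batch: int = 1000):
    """Yield (docs, ids, metas) slices capped at max_batch items each,
    keeping the three parallel lists aligned so every id still matches
    its document and metadata."""
    for start in range(0, len(docs), max_batch):
        end = start + max_batch
        yield docs[start:end], ids[start:end], metas[start:end]
```

Callers would then loop `for d, i, m in iter_subbatches(batch_docs, batch_ids, batch_metas): collection.upsert(documents=d, ids=i, metadatas=m)`, keeping the per-file embedding batching while bounding peak memory and SQLite variable counts.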
```python
# Batch the whole file into one upsert so the embedding model runs
# a single forward pass for all chunks — dramatically faster than
# one call per chunk, especially on GPU where per-call overhead
# dominates over the actual matmul.
batch_docs: list = []
batch_ids: list = []
batch_metas: list = []
filed_at = datetime.now().isoformat()
```
This function batches all chunks for a file into one collection.upsert(...). For large conversation transcripts this can create very large batch_* lists and a huge single upsert, risking high peak memory and potential Chroma/SQLite batch limits. Consider splitting the lists into bounded sub-batches (keeping filed_at constant) to preserve most performance gains without risking OOMs.
```python
logger.warning(
    "embedding_device=%r requested but %s is not installed — "
    "falling back to CPU. Install mempalace[gpu] for CUDA.",
    device,
    preferred,
)
```
The fallback warning always says "Install mempalace[gpu] for CUDA" even when the requested device is coreml or dml. This will mislead users on macOS/Windows. Suggest tailoring the message based on device (e.g., coreml→mempalace[coreml], dml→mempalace[dml], cuda→mempalace[gpu]) or wording it generically ("install the matching extra for your device").
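One way to tailor the hint, sketched with a hypothetical lookup table (`_EXTRA_FOR_DEVICE` and `fallback_hint` are illustrative names, not the PR's actual code):

```python
# Map each requested device to the pip extra that provides its runtime.
_EXTRA_FOR_DEVICE = {
    "cuda": "mempalace[gpu]",
    "dml": "mempalace[dml]",
    "coreml": "mempalace[coreml]",
}

def fallback_hint(device: str) -> str:
    """Build a device-specific install hint for the CPU-fallback warning."""
    extra = _EXTRA_FOR_DEVICE.get(device)
    if extra is None:
        # Unknown device: fall back to generic wording rather than guessing.
        return "install the matching extra for your device"
    return f"Install {extra} for {device}."
```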
```python
def get_embedding_function(device: Optional[str] = None):
    """Return a cached embedding function bound to the requested device.

    ``device=None`` reads from :class:`MempalaceConfig.embedding_device`.
    The returned function is shared across calls with the same resolved
    provider list so we only pay model-load cost once per process.
    """
    if device is None:
        from .config import MempalaceConfig

        device = MempalaceConfig().embedding_device

    providers, effective = _resolve_providers(device)
    cache_key = tuple(providers)
    cached = _EF_CACHE.get(cache_key)
    if cached is not None:
        return cached

    ef_cls = _build_ef_class()
    ef = ef_cls(preferred_providers=providers)
    _EF_CACHE[cache_key] = ef
    logger.info("Embedding function initialized (device=%s providers=%s)", effective, providers)
    return ef


def describe_device(device: Optional[str] = None) -> str:
    """Return a short human-readable label for the resolved device.

    Used by the miner CLI header so users can see at a glance whether GPU
    acceleration actually engaged.
    """
    if device is None:
        from .config import MempalaceConfig

        device = MempalaceConfig().embedding_device
    _, effective = _resolve_providers(device)
    return effective
```
New behavior in get_embedding_function() / describe_device() isn’t covered by tests (provider resolution, unknown device fallback warning, caching semantics). Adding unit tests with monkeypatch (e.g., stubbing onnxruntime.get_available_providers) would help prevent regressions across platforms where provider availability differs.
```python
@property
def embedding_device(self):
    """Hardware device for the ONNX embedding model.

    Values: ``"auto"`` (default), ``"cpu"``, ``"cuda"``, ``"coreml"``,
    ``"dml"``. Read from env ``MEMPALACE_EMBEDDING_DEVICE`` first, then
    ``embedding_device`` in ``config.json``, then ``"auto"``.

    ``auto`` resolves to the first available accelerator at runtime via
    :mod:`mempalace.embedding`; requesting an unavailable accelerator
    logs a warning and falls back to CPU.
    """
    env_val = os.environ.get("MEMPALACE_EMBEDDING_DEVICE")
    if env_val:
        return env_val.strip().lower()
    return str(self._file_config.get("embedding_device", "auto")).strip().lower()
```
embedding_device is a new config surface area but there are no tests validating precedence (env var vs config.json vs default) or normalization (trim/lower). Consider adding coverage in tests/test_config.py similar to the existing palace_path env override tests so future refactors don’t break device selection.
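A minimal sketch of such a precedence test: it restates the diff's lookup rule in a standalone helper (the name `resolve_embedding_device` is hypothetical) so the env-over-file-over-default order and the trim/lower normalization can be asserted without importing mempalace:

```python
import os

def resolve_embedding_device(file_config: dict, environ=os.environ) -> str:
    """Mirror of the precedence rule shown in the diff: env var first,
    then config.json, then the "auto" default, normalized to lowercase."""
    env_val = environ.get("MEMPALACE_EMBEDDING_DEVICE")
    if env_val:
        return env_val.strip().lower()
    return str(file_config.get("embedding_device", "auto")).strip().lower()

# Precedence + normalization checks, in the spirit of the suggested tests:
assert resolve_embedding_device({}, environ={}) == "auto"
assert resolve_embedding_device({"embedding_device": "CUDA "}, environ={}) == "cuda"
assert resolve_embedding_device(
    {"embedding_device": "cpu"},
    environ={"MEMPALACE_EMBEDDING_DEVICE": " CoreML "},
) == "coreml"
```

A real test in tests/test_config.py would exercise `MempalaceConfig` itself via monkeypatch, as the existing palace_path env override tests do.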
bensig left a comment
Reviewed locally on `perf/batched-upsert-gpu`. Quality is high — clean design, honest fallbacks, well-documented EF workaround, real benchmarks. Approving with one ask + a few minor notes worth a follow-up commit if you want.
Full pytest matches v3.3.3: 1234 passed in 35s on this branch.
Strengths
- Two wins kept independent. Batching gives ~2.4× CPU; GPU adds another 3.8× on top. Either alone is useful, the combination is multiplicative.
- EF identity rename to `"default"` is the high-value detail. The persisted-EF check in ChromaDB 1.x would otherwise force a palace rebuild on the cutover. Lets the new GPU-capable EF read palaces created by the bare library default and vice versa. The inline comment in `_build_ef_class` documents the why clearly enough that the next reader won't accidentally undo it.
- Fallback chain is solid. Unknown device → CPU+warn, missing accelerator → CPU+warn, EF construction error → ChromaDB's own default. The "still works on a laptop" goal is honored.
- One-shot warning via `_WARNED` — won't spam logs across a 10k-file mine. Process-global EF cache keyed by provider tuple — one model load per device per process.
- `Device:` line in the mine banner — small UX win, users see whether GPU actually engaged.
- Optional deps split into 3 extras (`gpu` / `dml` / `coreml`) — no over-pulling for users who only have one acceleration path available.
Ask: unit tests for `mempalace/embedding.py`
The module has zero tests today. `_resolve_providers` has 5+ branches all driven by a string and all mockable without a GPU:
```python
from unittest.mock import patch

from mempalace.embedding import _resolve_providers

@patch("onnxruntime.get_available_providers")
def test_auto_picks_cuda(mock):
    mock.return_value = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    assert _resolve_providers("auto") == (
        ["CUDAExecutionProvider", "CPUExecutionProvider"],
        "cuda",
    )

@patch("onnxruntime.get_available_providers")
def test_auto_falls_to_cpu(mock): ...

def test_cuda_missing_warns(): ...
def test_unknown_device_warns(): ...
def test_onnxruntime_import_error(): ...
```
This is exactly the kind of code that breaks silently in three months when ChromaDB or onnxruntime renames a provider string and we never notice because nothing exercises the resolver. Cheap to add, high coverage win.
Minor notes (not blockers)
- Atomicity trade-off is unmentioned in the diff. Per-chunk → per-file upsert means one bad chunk takes the whole file's chunks down. Almost certainly the right call, but worth a one-line comment in `process_file` so the next reader doesn't accidentally "fix" it back to per-chunk thinking they're improving error isolation.
- `filed_at` is now per-file in `convo_miner` (was per-chunk). Practically fine, arguably more correct semantically — flagging only because behavior changed and isn't called out.
- `tomli` dep + `add_to_known_entities` helper look unrelated to the PR title — likely develop-merge overlap. Worth a quick check with #1183 to make sure we're not duplicating work.
- Subclassing `chromadb.utils.embedding_functions.ONNXMiniLM_L6_V2` is an internal-API dependency on ChromaDB. Calculated risk for the persistence-compat win, but worth a comment so a future reader knows it's deliberate, not accidental.
- CUDA-installed-but-no-driver edge case: `onnxruntime-gpu` may report `CUDAExecutionProvider` as available on a machine without a CUDA driver — session creation then fails at runtime. The `[CUDAExecutionProvider, CPUExecutionProvider]` cascade may save us, but a brief test or note would confirm.
None of the minor notes block — they're "the next person reading this code in six months will thank you" material. The unit tests are the one I'd push for before merge.
Approving on the strength of the design, test passes, and clean fallback story.
@copilot apply changes based on the comments in this thread; also read the comment from Ben.
Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/3213a67a-6871-4bb2-9ae0-23fa11001a22 Co-authored-by: igorls <[email protected]>
Addressed the embedding unit-test ask and related follow-up items in 0315124. Added provider-resolution/cache tests, config precedence/normalization tests, bounded upsert batches for large files/transcripts, and device-specific fallback warning text.
Applied the requested review-thread changes and Ben’s unit-test ask in 0315124. The branch now has bounded Chroma upsert batches, tailored accelerator fallback warnings, embedding/config coverage, and targeted validation passing.
…HNSW fixes

Bring in 29 commits from upstream/develop since the last merge (2026-04-23).

Major absorbed changes:
- MemPalace#976 (Felipe Truman): HNSW graph corruption fix, mine_global_lock for fan-out, MAX_PRECOMPACT_BLOCK_ATTEMPTS=2 for /compact deadlocks. Closes MemPalace#974/MemPalace#965/MemPalace#955; likely resolves MemPalace#1172 too.
- MemPalace#1179 (Igor): CLI mempalace search routes through _hybrid_rank, legacy-metric warning + _warn_if_legacy_metric, invariant tests on hnsw:space=cosine across all 5 collection-creation paths.
- MemPalace#1180/MemPalace#1183/MemPalace#1184: cross-wing topic tunnels, init mine UX, --auto-mine.
- MemPalace#1185 (perf/batched-upsert-gpu): batched ChromaDB upserts, GPU device detection via mempalace/embedding.py.
- MemPalace#1182: graceful Ctrl-C during mempalace mine.
- MemPalace#1168: tunnel permissions security fix.

Conflict resolutions (8 files):
- searcher.py: kept fork's CLI delegation through search_memories (cleaner single-source-of-truth path); upstream's inline-retrieval branch dropped. Merged upstream's print formatting (cosine= + bm25=) with fork's matched_via reporting from the sqlite BM25 fallback.
- backends/chroma.py: kept fork's _BLOB_FIX_MARKER + palace_path arg to ChromaCollection (MemPalace#1171 write lock); merged upstream's **ef_kwargs (embedding_function support from MemPalace#1185). Removed duplicate _pin_hnsw_threads (fork already cherry-picked Felipe's earlier).
- mcp_server.py: kept fork's palace_path arg + upstream's clearer comment on the hnsw:num_threads=1 rationale.
- miner.py: took upstream's serial mine() flow (mine_global_lock + Ctrl-C); RESTORED fork's strict detect_room — substring matching from upstream breaks the fork-only test_detect_room_priority1_no_substring_match. Added `import re` for word-boundary regex matching. Fork-ahead concurrent mining (workers=, ThreadPoolExecutor from 5cd14bd) is dropped — the daemon migration deprioritizes local concurrent mining; can re-port if needed.
- CHANGELOG.md: combined fork's segfault-trio narrative with upstream's v3.3.4 release notes.
- HOOKS_TUTORIAL.md: took the MEMPALACE_PYTHON env var name (fork README was stale; hooks already use this name per fork-ahead MemPalace#19).
- test_convo_miner_unit.py: took both contextlib + pytest imports.
- test_searcher.py: imported upstream's 3 new TestSearchCLI tests but skipped them with TODOs — they assume upstream's inline-retrieval CLI with simpler mocks. A rewrite is needed for the fork's delegated search_memories path (sqlite BM25 fallback + scope counting).

Test result: 1334 passed, 3 skipped (the upstream tests above), 3 warnings.

Implications for open fork PRs:
- MemPalace#1171 (cross-process write lock at adapter) becomes structurally redundant given MemPalace#976's mine_global_lock + daemon-strict serialization. Slated for close with a thank-you to Felipe.
- MemPalace#1173/MemPalace#1177 still relevant (quarantine threshold + blob_seq marker).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
After today's work:
- Merged upstream/develop (29 commits), absorbing MemPalace#976 (HNSW + mine_global_lock + PreCompact attempt-cap), MemPalace#1179 (CLI BM25), and the MemPalace#1180/MemPalace#1182/MemPalace#1183/MemPalace#1185 features.
- Filed PR MemPalace#1198 (`_tokenize` None guard) upstream.
- Closed PR MemPalace#1171 (adapter-level flock) — superseded by MemPalace#976 + daemon-strict.
- palace-daemon migration commits (c09582c + 0e97b19) make the daemon the single canonical writer; the in-tree adapter flock no longer carries its weight.

README:
- Lead paragraph: 2026-04-21 → 2026-04-25 sync date, 165,632 → 151,420 drawer count, daemon-on-disks topology, listed absorbed upstream PRs.
- Fork change queue: dropped the MemPalace#1171 row, added a MemPalace#1198 row.
- "Multi-client coordination" section: rewritten — palace-daemon is now the fork's primary answer (was "deferred"), MemPalace#1171 retired, narrative shifted to defense-in-depth around MemPalace#976 + the daemon.
- Two-layer memory model: the storage path is now palace-daemon HTTP, not ~/.mempalace/palace.
- Ecosystem palace-daemon row: marks the integration as complete (commits cited).
- Open upstream PRs table: refreshed to MemPalace org URLs, added a MemPalace#1198 row, removed MemPalace#1171, added the 2026-04-25 develop sync to the merged list, added MemPalace#1171 to the closed list with rationale.
- Test count 1096 → 1334.

CLAUDE.md:
- Row 20 (MemPalace#1198): "Upstream PR pending" → "Filed as MemPalace#1198 on 2026-04-24".
- Row MemPalace#1171 in the PR table: open → closed (with MemPalace#976 supersession).
- Added a MemPalace#1198 row to the PR table.
- Top-of-section count: "14 merged, 7 open, 9 closed" → "14 merged, 7 open (including MemPalace#1198), 10 closed (added MemPalace#1171)" + sync date 2026-04-25.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
`mempalace mine` was slow because every drawer was upserted in its own `collection.upsert(...)` call (one tokenizer + ONNX forward pass per chunk), and because no `EmbeddingFunction` was ever wired through the backend (CPU-only, by omission). This PR fixes both, independently:
1. Batched upsert: one `collection.upsert(...)` per file instead of per chunk.
2. Hardware-accelerated embedding function: wraps `ONNXMiniLM_L6_V2` with configurable `preferred_providers`. Selectable via `MEMPALACE_EMBEDDING_DEVICE` (auto/cpu/cuda/coreml/dml) or `embedding_device` in `~/.mempalace/config.json`. Wired through mempalace/backends/chroma.py so miner writes and searcher reads always agree.

Same `all-MiniLM-L6-v2` model, same 384-dim vectors — existing palaces need no rebuild when turning GPU on. The factory subclasses `ONNXMiniLM_L6_V2` and renames it `"default"` so ChromaDB's persisted-EF check matches palaces created with the bare library default.

Benchmark
Hardware: Intel Core i9-12900KF (24 threads) · NVIDIA RTX 3090 · Linux 6.8 · Python 3.13.12

Software: `chromadb 1.5.7` · `onnxruntime 1.24.4` (CPU) / `onnxruntime-gpu 1.25.0` (CUDA)

Method: seeded synthetic corpora, fresh palace per run, model warm-loaded before the first scenario.
| Configuration | Time | Throughput | Speedup |
|---|---|---|---|
| per-chunk + CPU (old baseline) | 19.77 s | 29 drw/s | 1.0× |
| batched + CPU (this PR, no GPU) | 8.07 s | 70 drw/s | 2.4× |
| batched + CUDA (this PR, `mempalace[gpu]`) | 2.15 s | 264 drw/s | 9.2× |

(Medium scenario, 568 drawers.)

Isolating the two wins (medium scenario, same numbers sliced):
Key observation: GPU alone is not enough. Per-chunk + CUDA (47 drw/s) is only modestly better than per-chunk + CPU (29 drw/s) — per-call launch overhead dominates. Batching is what unlocks the accelerator.
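A toy cost model makes the point concrete (the 30 ms launch overhead and 2 ms per-chunk compute below are illustrative assumptions, not measured values):

```python
def throughput(n_chunks: int, calls: int, per_call_overhead_ms: float,
               per_chunk_ms: float) -> float:
    """Chunks per second given a fixed per-call cost plus linear per-chunk work."""
    total_ms = calls * per_call_overhead_ms + n_chunks * per_chunk_ms
    return n_chunks / (total_ms / 1000.0)

# 568 chunks, hypothetical 30 ms launch overhead, 2 ms of compute per chunk:
per_chunk = throughput(568, calls=568, per_call_overhead_ms=30, per_chunk_ms=2)
batched   = throughput(568, calls=1,   per_call_overhead_ms=30, per_chunk_ms=2)
# Batching removes 567 of the 568 fixed-cost payments, so throughput jumps
# by more than 10x even though the per-chunk compute is unchanged.
```

Under these toy numbers, the accelerator only pays off once the fixed per-call cost is amortized over the whole file, which is exactly the observation above.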
Reproduce via `benchmarks/mine_bench.py`.

Install paths

```
pip install mempalace[gpu]     # NVIDIA CUDA
pip install mempalace[dml]     # DirectML (Windows)
pip install mempalace[coreml]  # macOS Neural Engine
```

The `mempalace mine` header now prints `Device: cpu|cuda|...` so users can confirm the accelerator actually engaged.

Known ergonomics gap (follow-up)
`pip install mempalace[gpu]` alone is not sufficient on Linux today — current `onnxruntime-gpu` wheels do not bundle cuDNN/cuBLAS/cudart, and the `nvidia-*-cu12` runtime wheels require `LD_LIBRARY_PATH` to be set before Python starts. The failure mode is a silent fallback to CPU. A follow-up PR should: (a) extend `[gpu]` to also pull `nvidia-cudnn-cu12 nvidia-cuda-runtime-cu12 nvidia-cublas-cu12`, (b) `ctypes.CDLL`-preload their `.so` files from `site-packages/nvidia/*/lib` at import time (the same trick PyTorch uses), and (c) turn the silent fallback into a loud, actionable warning. Not scoped into this PR — this one is about the core speedup.
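The preload step (b) could be sketched roughly as follows (a sketch under the assumption that the `nvidia-*-cu12` wheels lay libraries out as `site-packages/nvidia/<pkg>/lib`; it is a harmless no-op when no such wheels are installed):

```python
import ctypes
import glob
import os
import site

def preload_nvidia_libs() -> int:
    """Preload CUDA runtime .so files shipped inside nvidia-*-cu12 wheels so
    onnxruntime-gpu can resolve them without LD_LIBRARY_PATH being set.
    Returns the number of libraries successfully loaded (0 when none exist)."""
    loaded = 0
    for pkg_dir in site.getsitepackages():
        pattern = os.path.join(pkg_dir, "nvidia", "*", "lib", "*.so*")
        for lib_path in sorted(glob.glob(pattern)):
            try:
                # RTLD_GLOBAL makes the symbols visible to later dlopen calls.
                ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
                loaded += 1
            except OSError:
                pass  # broken symlink or wrong arch: skip, never crash
    return loaded
```

This would run once at package import, before onnxruntime creates its first session.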
Test plan
- `uv run pytest tests/ --ignore=tests/benchmarks` — 1234 passed locally
- `uvx --from 'ruff>=0.4.0,<0.5' ruff check` + `format --check` clean
- `pre-commit` hooks (ruff + ruff-format) pass
- `MEMPALACE_EMBEDDING_DEVICE=cuda` on a host without CUDA logs a clear fallback warning instead of crashing