Add Apple Silicon acceleration (MPS/explicit CPU) for embedding — up to 256x faster (addresses #1068)#1133
Conversation
b66a8ed to
c16fd4b
Compare
|
Rebased onto Changes from the original PR:
All 37 tests still green locally (15 embedding + 22 config). Happy to iterate on anything — particularly open to making torch an optional extra ( |
|
One more note on generalization — this is an architecture-level fix, not a machine-specific one. The root cause (
Absolute rates scale with GPU core count, but the relative speedup (vs today's default) stays in the 50-250x range across the whole M1→M5 range because ChromaDB's The fix also generalizes beyond Apple Silicon:
So this PR is a net-positive for every platform, with the sharpest wins on Apple Silicon because that's where the default path is actively broken. One open caveat for small-RAM M1 base (8 GB): unified memory pressure from torch + model weights + ChromaDB + OS may push into swap. The |
|
Hi @midweste @sha2fiddy — flagging two people who've been working on adjacent angles: @midweste this PR composes cleanly with your #1085 (batch ChromaDB inserts). Embedding (here) and write path (yours) are orthogonal bottlenecks — combined they should give a compound speedup on end-to-end mine. @sha2fiddy thanks for filing #1068. Your diagnosis (ORT thread pool) is one valid lens; this PR argues the root cause is CoreML op-fallback thrashing and targets raw throughput instead of CPU-capping. Happy to compare notes — your repair-mode work in #1126 / #1135 will be useful for anyone who hit HNSW corruption mid-mine (which I actually triggered during benchmark, ha). Either way, the PR is rebased on v3.3.2 and the CI is just waiting on a maintainer to approve workflows (first-time contributor gate). Let me know if anything looks off. |
Thank you! Funny that you mention HNSW corruption as my other pull request addresses SIGINT and other failures May also help with:
Although a per chunk test imo would be the gold standard. |
Nice — #1113 is the prevention side, #1126/#1135 is the recovery side. Between the three PRs the mine lifecycle should be pretty bulletproof. Will take a look at #1113. |
|
@bensig can you test this on Apple Sillicon? Sound good, but I cannot test it here. |
igorls
left a comment
There was a problem hiding this comment.
Thanks for the careful diagnosis and the reproducible benchmark — the CoreML op-fallback explanation is convincing and matches ChromaDB's own source comment. A few items to address before this can land:
Blockers
1. Backwards-compatibility break on existing palaces. CI is red on Linux 3.11 and macOS with this error pattern across many tests:
An embedding function already exists in the collection configuration, and a new one is provided.
Embedding function conflict: new: onnx_mini_lm_l6_v2 vs persisted: default
Collections created today are persisted with EF name default (ChromaDB's bundled identity, no explicit EF). After this PR, ChromaBackend.get_collection always passes an EF on reopen, and ChromaDB rejects the mismatch. Every existing user palace would fail to open after upgrade.
Two ways to handle this: (a) only attach an EF on create, leave get unchanged so persisted EFs are honored; or (b) detect the persisted EF name on open and skip injection when it differs (with a one-time advisory log). The test failures should also exercise the upgrade path for an existing palace, not only fresh creates.
2. torch and sentence-transformers should be optional extras, not required dependencies. The auto-detect logic in mempalace/embedding.py returns onnx_default for any non-Apple, non-CUDA host — i.e. the majority of users. Those users would pay ~800 MB of mandatory disk and a much slower pip install for zero behavior change.
Suggested pyproject layout:
dependencies = [
"chromadb>=1.5.4,<2",
"pyyaml>=6.0,<7",
]
[project.optional-dependencies]
gpu = [
"sentence-transformers>=2.2.0",
"torch>=2.0.0",
]Mac/CUDA users opt in with pip install "mempalace[gpu]". The try/except ImportError paths already in _detect_auto_backend handle missing deps gracefully — flipping the default install just means auto lands on onnx_default instead of st_mps for users who haven't installed the extras, which is the same behavior they have today.
3. Lint is failing. tests/test_embedding.py:3:8: F401 [*] os imported but unused. Auto-fixable with ruff check --fix.
4. Base branch. This PR targets main, but develop is the integration branch in this repo — main lags. Please retarget to develop and rebase. The mergeable_state: dirty will likely resolve in the same step.
Substantive concerns
5. HNSW reproducibility caveat. The PR description says existing palaces don't need re-mining because of ~1e-6 FP drift between ONNX and PyTorch runtimes of the same weights. That's true for embedding compatibility (cosine distance is robust to that level of noise), but HNSW graph construction is order- and value-sensitive — a regenerated index from one runtime won't match an index built from the other. For users mining over time on mixed runtimes, query results will stay correct but bit-identical reproducibility is not guaranteed. Worth saying so explicitly in the README and in the embedding module docstring.
6. Local test environment didn't reflect the runtime. The PR description mentions tests were run against chromadb 0.6.3, but pyproject.toml pins chromadb>=1.5.4,<2. Future runs should target the supported floor (pip install -e ".[dev]" from a clean venv) so issues like the EF-conflict break above surface locally before CI.
Minor
7. _get_configured_embedding_function instantiates MempalaceConfig() on every get_collection / create_collection call. Hot path — worth caching the resolved EF (e.g. module-level @functools.lru_cache(maxsize=1) keyed by backend+model) so config parsing isn't repeated per collection access.
The technical idea is sound and the speedup is meaningful for users who opt in. Once items 1–4 are resolved, this should be much easier to land.
…-transformers) Rebased PR MemPalace#1133 onto develop. The original PR (sentence-transformers backend with its own config knob) and develop's onnxruntime-providers framework (a4868a3, fbd0904) overlapped in scope, so this revision preserves develop's framework end-to-end and adds a single new device, ``mps``, on top of it. What this adds -------------- * ``MEMPALACE_EMBEDDING_DEVICE=mps`` — routes through PyTorch + ``sentence-transformers`` instead of ONNX Runtime, running ``all-MiniLM-L6-v2`` directly on the Apple Metal GPU. * New optional extra ``mempalace[mps]`` carrying ``torch>=2.0`` and ``sentence-transformers>=2.2``. Base install is unchanged — users who don't install the extra get exactly the pre-PR behavior. * ``auto`` now prefers ``mps`` over ``coreml`` on Apple Silicon when the [mps] extra is present. ``coreml`` is retained as an explicit opt-in for users who want ONNX Runtime's CoreML provider. Why mps and coreml are both needed ---------------------------------- ChromaDB's bundled ``ONNXMiniLM_L6_V2`` enables ``CoreMLExecutionProvider`` by default, which silently falls back op-by-op to CPU for ``all-MiniLM-L6-v2`` because some ops are not yet implemented in CoreML's MLProgram lowering. The resulting ANE↔CPU copies cost more than they save. Measured on Apple M5, 200 real conversation chunks, batch 32, 3 runs + 1 warmup: onnx_default (CoreML on) : 2 chunks/s (baseline) onnx_cpu_only : 45 chunks/s (22x) st_cpu : 197 chunks/s (97x) st_mps : 523 chunks/s (256x) The fix at the ONNX-Runtime / CoreML-provider interface layer is upstream's problem; on Apple Silicon today, MPS via PyTorch is the fastest reliable path. The relative speedup persists across the M1→M5 range because the underlying CoreML bug is software, not silicon. Backwards compatibility ----------------------- The same ``_build_ef_class`` rename trick from develop is used here: ``_MempalaceMPS`` overrides ``name()`` to return ``"default"``, so a palace previously created with ChromaDB's bundled ``DefaultEmbeddingFunction`` reopens cleanly under the MPS EF without tripping the chromadb 1.x ``Embedding function conflict`` guard. End-to-end smoke check on this machine: ``cos(mps_embed, cpu_embed) = 1.000000`` over 384-d vectors, i.e. drift below FP32 cosine resolution. Tests ----- 9 new tests in ``tests/test_embedding.py`` cover: * explicit ``mps`` resolves to the sentinel when torch+ST+MPS line up * missing ``[mps]`` extra → CPU fallback + actionable warning pointing at ``pip install mempalace[mps]`` * ST installed but no Metal (Linux/Intel-Mac) → CPU + different warning * ``auto`` prefers ``mps`` over CoreML when both are available * ``auto`` falls back to CoreML when torch is absent * ``get_embedding_function`` routes the MPS sentinel to the sentence-transformers branch, not the ONNX builder * MPS branch shares the same ``_EF_CACHE`` as the ONNX branch * the ``_MPS_SENTINEL`` string itself is not accidentally treated as a user-supplied device name * ``_torch_mps_available()`` returns False without raising when torch is missing The pre-existing onnx-focused tests now neutralize MPS in their fixture so they keep their original meaning on dev machines that have the [mps] extra installed. Full ``mempalace`` test suite: 1317 passed. Includes ``benchmarks/apple_silicon_bench.py`` — a reproducible side-by-side benchmark of all five paths (onnx-default, onnx-cpu-only, onnx-coreml-explicit, st_cpu, st_mps) for users to verify on their own hardware. Addresses PR MemPalace#1133 review feedback from @igorls ----------------------------------------------- 1. Backwards-compatibility — solved by reusing develop's ``name()=="default"`` rename trick (the same fix already on develop) for both the ONNX and MPS EFs. 2. Optional dependency — torch + sentence-transformers are now the ``[mps]`` extra, not required deps. Base install is unchanged. 3. Lint — ``ruff check .`` clean, ``ruff format --check`` clean on all touched files. 4. Branch — this commit is on ``develop``, not ``main``. 5. HNSW reproducibility caveat — documented in both the embedding module docstring and the README: query results stay correct, exact ``recall@k`` can shift between runtimes, pin the device for strict reproducibility. 6. Test environment — re-ran the full suite against the supported chromadb floor (1.5.8) in a clean py3.13 venv; the original PR's 0.6.3 venv was the reason the EF-conflict regression went un-noticed locally. 7. Caching — develop's existing ``_EF_CACHE`` already does this, so no change needed; the MPS branch participates in the same cache. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
c16fd4b to
233c349
Compare
|
Thanks for the careful review @igorls — addressing all four blockers + the substantive items + the minor one. The branch is now rebased onto Heads-up on a strategic pivot before the per-item walkthrough: between the original PR and now, #1068's solution landed on The diagnosis still holds (and the M5 numbers Per-item1. Backwards-compatibility break. ✅ The 2. Optional dependencies. ✅ [project.optional-dependencies]
gpu = ["onnxruntime-gpu>=1.16"] # NVIDIA via ORT
dml = ["onnxruntime-directml>=1.16"] # Windows AMD/Intel/NVIDIA via ORT
coreml = ["onnxruntime>=1.16"] # Apple ANE via ORT
mps = ["sentence-transformers>=2.2.0", "torch>=2.0.0"] # Apple Metal via PyTorch3. Lint. ✅ 4. Base branch. ✅ Retargeted to 5. HNSW reproducibility caveat. ✅ Documented in both the
6. Test environment. ✅ Re-ran the full suite from a fresh py3.13 venv with 7. New tests9 new tests in
Benchmark
Happy to keep iterating. CC @bensig — if you wanted to verify on Apple Silicon you mentioned, the new path is |
Add Apple Silicon acceleration (MPS/explicit CPU) for embedding
Addresses #1068 — with a different root-cause diagnosis and a solution that actually speeds up ingest rather than just capping CPU.
TL;DR
ONNXMiniLM_L6_V2enables all ONNX providers includingCoreMLExecutionProvider, which silently falls back op-by-op to CPU forall-MiniLM-L6-v2. The ANE↔CPU data copies cost more than they save.ChromaBackend-levelembedding_backendoption with six choices:auto,onnx_default,onnx_cpu,st_cpu,st_mps,st_cuda.auto— picksst_mpson Apple Silicon,st_cudaon NVIDIA GPUs, orst_cpuelsewhere.sentence-transformers+torchare added as required dependencies so every user gets the fast path without extra setup.MEMPAL_EMBED_BACKEND=onnx_default. Existing palaces do not need to be re-mined (~1e-6 FP drift between ONNX and Torch runtimes of the same model weights).Motivation
Ingesting 4,759 Claude Code conversation transcripts (670 MB of
.jsonl) on an Apple M5 took 65 minutes withmempalace mine, with the process pegged at ~450% CPU (4.5 cores saturated). Matches the symptom in #1068.The reporter in #1068 diagnosed this as ORT's
intra_oppool defaulting to physical-core count and proposed capping viaSessionOptions. Capping CPU (their reported 62.9% after the change) makesmempalace minemore polite, but it does not make ingest faster. This PR argues the root cause is different.Look at the source hint inside ChromaDB's own embedding function
(chroma/chromadb/utils/embedding_functions/onnx_mini_lm_l6_v2.py):
ChromaDB already knows CoreML doesn't fit
all-MiniLM-L6-v2, but still passesproviders = self.ort.get_available_providers()— which on Apple Silicon is['CoreMLExecutionProvider', 'AzureExecutionProvider', 'CPUExecutionProvider'].onnxruntimethen tries CoreML op-by-op and falls back to CPU for unsupported ops — each fallback costs a tensor copy between memory pools.Explicitly selecting CPU-only (no CoreML attempt) is ~22x faster on the same hardware. Running the same model on Metal GPU via
sentence-transformersis ~256x faster.Benchmark — isolated embedding (5-backend comparison)
Apple M5 (10-core: 4P + 6E), macOS 25.0.0, Python 3.9.6, 200 real chunks from
~/.claude/projects/, batch 32, 3 runs + 1 warmup:onnx_default(ChromaDB bundled)onnx_cpu(explicit CPU only)onnx_coreml(force CoreML)st_cpu(sentence-transformers CPU)st_mps(sentence-transformers Metal GPU)[1]
onnx_defaulthas high variance because CoreML-then-CPU op fallback is sensitive to competing load. Idle first run hits ~24s; one run of three under load stretched to 4+ minutes. The variance itself is a symptom.Benchmark — end-to-end real mine
Same 100 real Claude Code JSONL files (~4.8 MB, 1,430 drawers ingested into a fresh palace twice), toggling only
MEMPAL_EMBED_BACKEND:onnx_defaultst_mpsEnd-to-end speedup: ~4x on an Apple M5. Real-world speedup is lower than the isolated 60-256x because the mine pipeline also spends time on JSONL parsing, chunking, sqlite writes, and HNSW index maintenance — but embedding is no longer the bottleneck. Identical drawer count proves no data loss from the swap.
Reproduce: same 100 files, same palace structure, same pytest-baselined code — only the env var changes. A minimal reproduction script is attached as a PR comment.
What this PR adds
mempalace/embedding.py(new)Central module that returns a ChromaDB-compatible embedding function based on config.
Valid backends:
auto,onnx_default,onnx_cpu,st_cpu,st_mps,st_cuda.autochooses:st_mpsif macOS arm64 +torch.backends.mps.is_available()st_cudaiftorch.cuda.is_available()onnx_defaultotherwise (zero extra deps required)All backends gracefully fall back if their dependencies are missing (e.g.
st_mpswithout torch →onnx_default, not a crash).mempalace/config.py(modified)Two new config properties with the usual env-var > file > default priority:
mempalace/backends/chroma.py(modified) — single injection pointChromaBackend.get_collectionandChromaBackend.create_collectionnow readcfg.embedding_backend/cfg.embedding_model, build an embedding function viamempalace.embedding.get_embedding_function, and pass it toclient.get_or_create_collection(..., embedding_function=ef).If the embedding module fails to load (missing deps, bad config), the helper returns
Noneand ChromaDB falls back to its bundled default — strict backward compatibility. A single_get_configured_embedding_function()helper keeps read and write paths consistent.This aligns with the RFC 001 pluggable backend architecture —
miner.py,convo_miner.py,searcher.py,layers.py, and the MCP server all inherit the speedup because they obtain their collection throughChromaBackend. No other module-level changes required.pyproject.tomldependencies = [ "chromadb>=1.5.4,<2", "pyyaml>=6.0,<7", + "sentence-transformers>=2.2.0", + "torch>=2.0.0", ]Footprint on macOS arm64 grows by ~800 MB (torch + transformers + tokenizers). Justified by the 60-256x ingest speedup — users consistently trade hundreds of minutes of wall time against 800 MB of disk.
tests/test_embedding.py(new)15 unit tests covering:
benchmarks/apple_silicon_bench.py(new)Reproducible 5-backend benchmark runner. Supports real conversation data (
--source <dir>) or synthetic chunks. Outputs a markdown table directly suitable for PR descriptions.--skip <backend>flag to sidestep backends that hang (e.g. forced CoreML).README.mdAdds a concise
## Embedding backendsection documenting the new options and linking to the benchmark script.Backward compatibility
embedding_modelis changed from the default, palaces must be re-mined (embeddings are not comparable across models)._get_configured_embedding_functionreturnsNoneon any import/config failure → ChromaDB's default takes over → existing tests continue to pass unchanged.Testing
37 tests pass locally on Python 3.9 + chromadb 0.6.3 and the new patches are compatible with the embedding_function kwarg in chromadb 1.5.x. Full CI coverage (lint + Python 3.9/3.11/3.13) will validate the upstream matrix on merge.
Related follow-ups (out of scope for this PR)
convo_miner.file_already_minedis per-file, not per-chunk. A file partially ingested and then interrupted is never re-processed — I observed 128 files silently stuck in this state during a bulk mine. A future PR should check chunk-index completeness.all-MiniLM-L6-v2has weak semantic grounding for Chinese / CJK / domain-specific acronyms. A follow-up could makeautodetect non-ASCII-dominant content atmempalace inittime and suggest a multilingual model (paraphrase-multilingual-MiniLM-L12-v2orBAAI/bge-m3), with the caveat that switching model requires re-mining.onnx_cpubeing 5.5x-22x faster thanonnx_defaulton the same hardware suggests ChromaDB's default could be improved upstream (dropCoreMLExecutionProviderfrom the provider list on Darwin arm64). Worth filing separately withchroma-core/chroma— this PR works around it at the MemPalace layer in the meantime.Happy to split this into smaller PRs if preferred (e.g. "just drop CoreML from onnx providers" first, then "add st_mps/st_cuda backends" second). Let me know what structure works best.