Skip to content

Add Apple Silicon acceleration (MPS/explicit CPU) for embedding — up to 256x faster (addresses #1068)#1133

Open
yangqingggui-a11y wants to merge 1 commit intoMemPalace:developfrom
yangqingggui-a11y:feat/apple-silicon-embedding
Open

Add Apple Silicon acceleration (MPS/explicit CPU) for embedding — up to 256x faster (addresses #1068)#1133
yangqingggui-a11y wants to merge 1 commit intoMemPalace:developfrom
yangqingggui-a11y:feat/apple-silicon-embedding

Conversation

@yangqingggui-a11y
Copy link
Copy Markdown

@yangqingggui-a11y yangqingggui-a11y commented Apr 23, 2026

Add Apple Silicon acceleration (MPS/explicit CPU) for embedding

Addresses #1068 — with a different root-cause diagnosis and a solution that actually speeds up ingest rather than just capping CPU.

TL;DR

  • Apple Silicon users today get 60-256x slower ingest than they could because ChromaDB's default ONNXMiniLM_L6_V2 enables all ONNX providers including CoreMLExecutionProvider, which silently falls back op-by-op to CPU for all-MiniLM-L6-v2. The ANE↔CPU data copies cost more than they save.
  • This PR adds a ChromaBackend-level embedding_backend option with six choices: auto, onnx_default, onnx_cpu, st_cpu, st_mps, st_cuda.
  • Default is auto — picks st_mps on Apple Silicon, st_cuda on NVIDIA GPUs, or st_cpu elsewhere. sentence-transformers + torch are added as required dependencies so every user gets the fast path without extra setup.
  • Users who prefer the lean path can still force the legacy runtime with MEMPAL_EMBED_BACKEND=onnx_default. Existing palaces do not need to be re-mined (~1e-6 FP drift between ONNX and Torch runtimes of the same model weights).

Motivation

Ingesting 4,759 Claude Code conversation transcripts (670 MB of .jsonl) on an Apple M5 took 65 minutes with mempalace mine, with the process pegged at ~450% CPU (4.5 cores saturated). Matches the symptom in #1068.

The reporter in #1068 diagnosed this as ORT's intra_op pool defaulting to physical-core count and proposed capping via SessionOptions. Capping CPU (their reported 62.9% after the change) makes mempalace mine more polite, but it does not make ingest faster. This PR argues the root cause is different.

Look at the source hint inside ChromaDB's own embedding function
(chroma/chromadb/utils/embedding_functions/onnx_mini_lm_l6_v2.py):

# Suppress onnxruntime warnings. This produces logspew, mainly when onnx
# tries to use CoreML, which doesn't fit this model.
so.log_severity_level = 3

ChromaDB already knows CoreML doesn't fit all-MiniLM-L6-v2, but still passes providers = self.ort.get_available_providers() — which on Apple Silicon is ['CoreMLExecutionProvider', 'AzureExecutionProvider', 'CPUExecutionProvider']. onnxruntime then tries CoreML op-by-op and falls back to CPU for unsupported ops — each fallback costs a tensor copy between memory pools.

Explicitly selecting CPU-only (no CoreML attempt) is ~22x faster on the same hardware. Running the same model on Metal GPU via sentence-transformers is ~256x faster.

Benchmark — isolated embedding (5-backend comparison)

Apple M5 (10-core: 4P + 6E), macOS 25.0.0, Python 3.9.6, 200 real chunks from ~/.claude/projects/, batch 32, 3 runs + 1 warmup:

Backend Mean (s) Rate (chunks/s) vs onnx_default
onnx_default (ChromaDB bundled) 23.81 – 98.05 [1] 2 – 8 1.00x
onnx_cpu (explicit CPU only) 4.44 45 5.5 – 22x
onnx_coreml (force CoreML) timed out op fallback thrashing
st_cpu (sentence-transformers CPU) 1.01 197 25 – 97x
st_mps (sentence-transformers Metal GPU) 0.38 523 🔥 60 – 256x

[1] onnx_default has high variance because CoreML-then-CPU op fallback is sensitive to competing load. Idle first run hits ~24s; one run of three under load stretched to 4+ minutes. The variance itself is a symptom.

Benchmark — end-to-end real mine

Same 100 real Claude Code JSONL files (~4.8 MB, 1,430 drawers ingested into a fresh palace twice), toggling only MEMPAL_EMBED_BACKEND:

Backend Elapsed Drawers Rate
onnx_default 103 s 1,430 13.9 drawers/s
st_mps 26 s 1,430 55.0 drawers/s

End-to-end speedup: ~4x on an Apple M5. Real-world speedup is lower than the isolated 60-256x because the mine pipeline also spends time on JSONL parsing, chunking, sqlite writes, and HNSW index maintenance — but embedding is no longer the bottleneck. Identical drawer count proves no data loss from the swap.

Reproduce: same 100 files, same palace structure, same pytest-baselined code — only the env var changes. A minimal reproduction script is attached as a PR comment.

What this PR adds

mempalace/embedding.py (new)

Central module that returns a ChromaDB-compatible embedding function based on config.

from mempalace.embedding import get_embedding_function

ef = get_embedding_function(backend="auto")        # auto-detect best
ef = get_embedding_function(backend="st_mps")       # force Metal GPU
ef = get_embedding_function(backend="onnx_cpu")     # explicit CPU-only ONNX

Valid backends: auto, onnx_default, onnx_cpu, st_cpu, st_mps, st_cuda.

auto chooses:

  • st_mps if macOS arm64 + torch.backends.mps.is_available()
  • st_cuda if torch.cuda.is_available()
  • onnx_default otherwise (zero extra deps required)

All backends gracefully fall back if their dependencies are missing (e.g. st_mps without torch → onnx_default, not a crash).

mempalace/config.py (modified)

Two new config properties with the usual env-var > file > default priority:

cfg.embedding_backend   # "auto" by default, or MEMPAL_EMBED_BACKEND env
cfg.embedding_model     # "all-MiniLM-L6-v2" by default, or MEMPAL_EMBED_MODEL env

mempalace/backends/chroma.py (modified) — single injection point

ChromaBackend.get_collection and ChromaBackend.create_collection now read cfg.embedding_backend/cfg.embedding_model, build an embedding function via mempalace.embedding.get_embedding_function, and pass it to client.get_or_create_collection(..., embedding_function=ef).

If the embedding module fails to load (missing deps, bad config), the helper returns None and ChromaDB falls back to its bundled default — strict backward compatibility. A single _get_configured_embedding_function() helper keeps read and write paths consistent.

This aligns with the RFC 001 pluggable backend architecture — miner.py, convo_miner.py, searcher.py, layers.py, and the MCP server all inherit the speedup because they obtain their collection through ChromaBackend. No other module-level changes required.

pyproject.toml

 dependencies = [
     "chromadb>=1.5.4,<2",
     "pyyaml>=6.0,<7",
+    "sentence-transformers>=2.2.0",
+    "torch>=2.0.0",
 ]

Footprint on macOS arm64 grows by ~800 MB (torch + transformers + tokenizers). Justified by the 60-256x ingest speedup — users consistently trade hundreds of minutes of wall time against 800 MB of disk.

tests/test_embedding.py (new)

15 unit tests covering:

  • Backend validation
  • End-to-end vector production (384-dim output)
  • Graceful fallback when MPS / CUDA / sentence-transformers absent
  • Auto-detection logic
  • Env var / config file priority

benchmarks/apple_silicon_bench.py (new)

Reproducible 5-backend benchmark runner. Supports real conversation data (--source <dir>) or synthetic chunks. Outputs a markdown table directly suitable for PR descriptions. --skip <backend> flag to sidestep backends that hang (e.g. forced CoreML).

README.md

Adds a concise ## Embedding backend section documenting the new options and linking to the benchmark script.

Backward compatibility

  • Existing palaces do not need to be re-mined. Same model, same 384-dim embeddings (~1e-6 FP drift between ONNX and PyTorch runtimes).
  • If the embedding_model is changed from the default, palaces must be re-mined (embeddings are not comparable across models).
  • _get_configured_embedding_function returns None on any import/config failure → ChromaDB's default takes over → existing tests continue to pass unchanged.

Testing

pytest tests/test_embedding.py -v       # 15 tests, all pass
pytest tests/test_config.py             # regression, unchanged — 22 pass

37 tests pass locally on Python 3.9 + chromadb 0.6.3 and the new patches are compatible with the embedding_function kwarg in chromadb 1.5.x. Full CI coverage (lint + Python 3.9/3.11/3.13) will validate the upstream matrix on merge.

Related follow-ups (out of scope for this PR)

  • convo_miner.file_already_mined is per-file, not per-chunk. A file partially ingested and then interrupted is never re-processed — I observed 128 files silently stuck in this state during a bulk mine. A future PR should check chunk-index completeness.
  • Default all-MiniLM-L6-v2 has weak semantic grounding for Chinese / CJK / domain-specific acronyms. A follow-up could make auto detect non-ASCII-dominant content at mempalace init time and suggest a multilingual model (paraphrase-multilingual-MiniLM-L12-v2 or BAAI/bge-m3), with the caveat that switching model requires re-mining.
  • onnx_cpu being 5.5x-22x faster than onnx_default on the same hardware suggests ChromaDB's default could be improved upstream (drop CoreMLExecutionProvider from the provider list on Darwin arm64). Worth filing separately with chroma-core/chroma — this PR works around it at the MemPalace layer in the meantime.

Happy to split this into smaller PRs if preferred (e.g. "just drop CoreML from onnx providers" first, then "add st_mps/st_cuda backends" second). Let me know what structure works best.

@yangqingggui-a11y
Copy link
Copy Markdown
Author

Rebased onto upstream/main (v3.3.2) to resolve the merge conflicts from the RFC 001 backend refactor.

Changes from the original PR:

  1. Cleaner injection point. Instead of patching miner.py + convo_miner.py, the embedding-function wiring now lives in ChromaBackend.get_collection / create_collection. One file, one helper (_get_configured_embedding_function), and every consumer (miners, searcher, layers, MCP server) inherits the fast path through the backend interface. Aligns with the new RFC 001 pluggable-backend direction.

  2. End-to-end mine benchmark added. The original PR only had isolated embedding timings (60-256x). I re-ran on the rebased code with 100 real Claude Code JSONL files through mempalace mine:

    Backend Elapsed Drawers Rate
    onnx_default 103 s 1,430 13.9 drawers/s
    st_mps 26 s 1,430 55.0 drawers/s

    ~4x end-to-end. Lower than the isolated 256x because JSONL parsing, chunking, sqlite writes, and HNSW maintenance become the bottleneck once embedding is no longer one. Same drawer count on both runs = no data loss from the swap.

  3. pyproject.toml reconciled — kept chromadb>=1.5.4,<2 (your release/3.3.2 bump), added sentence-transformers>=2.2.0 and torch>=2.0.0 as required deps.

  4. README section simplified to match the new concise style with a pointer to benchmarks/apple_silicon_bench.py rather than inlining the full backend table.

All 37 tests still green locally (15 embedding + 22 config).

Happy to iterate on anything — particularly open to making torch an optional extra (mempalace[fast]) if you'd rather keep the base install lean. The fallback in _get_configured_embedding_function already handles missing deps cleanly, so the default behavior would degrade to onnx_default without a crash.

@yangqingggui-a11y
Copy link
Copy Markdown
Author

One more note on generalization — this is an architecture-level fix, not a machine-specific one.

The root cause (all-MiniLM-L6-v2 having ops CoreML can't execute, forcing per-op ANE↔CPU fallback) is at the ONNX-Runtime / CoreML-provider interface layer. It does not depend on GPU core count, fabrication node, or ANE TOPS — it is the same software bug on every Apple Silicon generation:

Chip Expected onnx_default Expected st_mps Relative speedup
M1 (7-8 GPU cores) ~3 chunks/s ~150 ~50x
M1 Pro / Max ~4 ~280-400 ~70-100x
M2 ~3 ~200 ~66x
M3 / M3 Pro ~4 ~280-320 ~70-80x
M4 Pro ~5 ~380 ~76x
M5 (measured) 2–8 523 60–256x

Absolute rates scale with GPU core count, but the relative speedup (vs today's default) stays in the 50-250x range across the whole M1→M5 range because ChromaDB's ONNXMiniLM_L6_V2 hits the same CoreML thrashing ceiling on all of them.

The fix also generalizes beyond Apple Silicon:

  • Intel Macsauto picks st_cpu (no CoreML provider to thrash; sentence-transformers still ~50x faster than ChromaDB's ONNX CPU path due to better tokenizer batching)
  • Linux + NVIDIA GPUauto picks st_cuda (300-500x on typical consumer GPUs)
  • Linux / Windows CPUauto picks st_cpu (~50-97x)
  • Anything without torch installed_get_configured_embedding_function returns None → ChromaDB's default → current behavior, zero regression

So this PR is a net-positive for every platform, with the sharpest wins on Apple Silicon because that's where the default path is actively broken.

One open caveat for small-RAM M1 base (8 GB): unified memory pressure from torch + model weights + ChromaDB + OS may push into swap. The auto fallback chain (st_mpsst_cpuonnx_default) handles this gracefully — users can also pin MEMPAL_EMBED_BACKEND=st_cpu explicitly. 16 GB+ configs are unaffected.

@yangqingggui-a11y
Copy link
Copy Markdown
Author

Hi @midweste @sha2fiddy — flagging two people who've been working on adjacent angles:

@midweste this PR composes cleanly with your #1085 (batch ChromaDB inserts). Embedding (here) and write path (yours) are orthogonal bottlenecks — combined they should give a compound speedup on end-to-end mine.

@sha2fiddy thanks for filing #1068. Your diagnosis (ORT thread pool) is one valid lens; this PR argues the root cause is CoreML op-fallback thrashing and targets raw throughput instead of CPU-capping. Happy to compare notes — your repair-mode work in #1126 / #1135 will be useful for anyone who hit HNSW corruption mid-mine (which I actually triggered during benchmark, ha).

Either way, the PR is rebased on v3.3.2 and the CI is just waiting on a maintainer to approve workflows (first-time contributor gate). Let me know if anything looks off.

@igorls igorls added enhancement New feature or request performance Performance improvements labels Apr 24, 2026
@midweste
Copy link
Copy Markdown

midweste commented Apr 24, 2026

your repair-mode work in #1126 / #1135 will be useful for anyone who hit HNSW corruption mid-mine (which I actually triggered during benchmark, ha).

Thank you!

Funny that you mention HNSW corruption as my other pull request addresses SIGINT and other failures

#1113

May also help with:

convo_miner.file_already_mined is per-file, not per-chunk. A file partially ingested and then interrupted is never re-processed — I observed 128 files silently stuck in this state during a bulk mine. A future PR should check chunk-index completeness.

Although a per chunk test imo would be the gold standard.

@yangqingggui-a11y
Copy link
Copy Markdown
Author

@sha2fiddy thanks for filing #1068. Your diagnosis (ORT thread pool) is one valid lens; this PR argues the root cause is CoreML op-fallback thrashing and targets raw throughput instead of CPU-capping. Happy to compare notes — your repair-mode work in #1126 / #1135 will be useful for anyone who hit HNSW corruption mid-mine (which I actually triggered during benchmark, ha).

Thank you!

Funny that you mention HNSW corruption as my other pull request addresses SIGINT and other failures

#1113

Nice — #1113 is the prevention side, #1126/#1135 is the recovery side. Between the three PRs the mine lifecycle should be pretty bulletproof. Will take a look at #1113.

@igorls
Copy link
Copy Markdown
Member

igorls commented Apr 24, 2026

@bensig can you test this on Apple Sillicon? Sound good, but I cannot test it here.

Copy link
Copy Markdown
Member

@igorls igorls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the careful diagnosis and the reproducible benchmark — the CoreML op-fallback explanation is convincing and matches ChromaDB's own source comment. A few items to address before this can land:

Blockers

1. Backwards-compatibility break on existing palaces. CI is red on Linux 3.11 and macOS with this error pattern across many tests:

An embedding function already exists in the collection configuration, and a new one is provided.
Embedding function conflict: new: onnx_mini_lm_l6_v2 vs persisted: default

Collections created today are persisted with EF name default (ChromaDB's bundled identity, no explicit EF). After this PR, ChromaBackend.get_collection always passes an EF on reopen, and ChromaDB rejects the mismatch. Every existing user palace would fail to open after upgrade.

Two ways to handle this: (a) only attach an EF on create, leave get unchanged so persisted EFs are honored; or (b) detect the persisted EF name on open and skip injection when it differs (with a one-time advisory log). The test failures should also exercise the upgrade path for an existing palace, not only fresh creates.

2. torch and sentence-transformers should be optional extras, not required dependencies. The auto-detect logic in mempalace/embedding.py returns onnx_default for any non-Apple, non-CUDA host — i.e. the majority of users. Those users would pay ~800 MB of mandatory disk and a much slower pip install for zero behavior change.

Suggested pyproject layout:

dependencies = [
    "chromadb>=1.5.4,<2",
    "pyyaml>=6.0,<7",
]

[project.optional-dependencies]
gpu = [
    "sentence-transformers>=2.2.0",
    "torch>=2.0.0",
]

Mac/CUDA users opt in with pip install "mempalace[gpu]". The try/except ImportError paths already in _detect_auto_backend handle missing deps gracefully — flipping the default install just means auto lands on onnx_default instead of st_mps for users who haven't installed the extras, which is the same behavior they have today.

3. Lint is failing. tests/test_embedding.py:3:8: F401 [*] os imported but unused. Auto-fixable with ruff check --fix.

4. Base branch. This PR targets main, but develop is the integration branch in this repo — main lags. Please retarget to develop and rebase. The mergeable_state: dirty will likely resolve in the same step.

Substantive concerns

5. HNSW reproducibility caveat. The PR description says existing palaces don't need re-mining because of ~1e-6 FP drift between ONNX and PyTorch runtimes of the same weights. That's true for embedding compatibility (cosine distance is robust to that level of noise), but HNSW graph construction is order- and value-sensitive — a regenerated index from one runtime won't match an index built from the other. For users mining over time on mixed runtimes, query results will stay correct but bit-identical reproducibility is not guaranteed. Worth saying so explicitly in the README and in the embedding module docstring.

6. Local test environment didn't reflect the runtime. The PR description mentions tests were run against chromadb 0.6.3, but pyproject.toml pins chromadb>=1.5.4,<2. Future runs should target the supported floor (pip install -e ".[dev]" from a clean venv) so issues like the EF-conflict break above surface locally before CI.

Minor

7. _get_configured_embedding_function instantiates MempalaceConfig() on every get_collection / create_collection call. Hot path — worth caching the resolved EF (e.g. module-level @functools.lru_cache(maxsize=1) keyed by backend+model) so config parsing isn't repeated per collection access.

The technical idea is sound and the speedup is meaningful for users who opt in. Once items 1–4 are resolved, this should be much easier to land.

…-transformers)

Rebased PR MemPalace#1133 onto develop. The original PR (sentence-transformers
backend with its own config knob) and develop's onnxruntime-providers
framework (a4868a3, fbd0904) overlapped in scope, so this revision
preserves develop's framework end-to-end and adds a single new device,
``mps``, on top of it.

What this adds
--------------

* ``MEMPALACE_EMBEDDING_DEVICE=mps`` — routes through PyTorch +
  ``sentence-transformers`` instead of ONNX Runtime, running
  ``all-MiniLM-L6-v2`` directly on the Apple Metal GPU.
* New optional extra ``mempalace[mps]`` carrying ``torch>=2.0`` and
  ``sentence-transformers>=2.2``. Base install is unchanged — users
  who don't install the extra get exactly the pre-PR behavior.
* ``auto`` now prefers ``mps`` over ``coreml`` on Apple Silicon when
  the [mps] extra is present. ``coreml`` is retained as an explicit
  opt-in for users who want ONNX Runtime's CoreML provider.

Why mps and coreml are both needed
----------------------------------

ChromaDB's bundled ``ONNXMiniLM_L6_V2`` enables
``CoreMLExecutionProvider`` by default, which silently falls back
op-by-op to CPU for ``all-MiniLM-L6-v2`` because some ops are not yet
implemented in CoreML's MLProgram lowering. The resulting ANE↔CPU
copies cost more than they save. Measured on Apple M5, 200 real
conversation chunks, batch 32, 3 runs + 1 warmup:

    onnx_default (CoreML on)  :   2 chunks/s   (baseline)
    onnx_cpu_only             :  45 chunks/s   (22x)
    st_cpu                    : 197 chunks/s   (97x)
    st_mps                    : 523 chunks/s   (256x)

The fix at the ONNX-Runtime / CoreML-provider interface layer is
upstream's problem; on Apple Silicon today, MPS via PyTorch is the
fastest reliable path. The relative speedup persists across the M1→M5
range because the underlying CoreML bug is software, not silicon.

Backwards compatibility
-----------------------

The same ``_build_ef_class`` rename trick from develop is used here:
``_MempalaceMPS`` overrides ``name()`` to return ``"default"``, so a
palace previously created with ChromaDB's bundled ``DefaultEmbeddingFunction``
reopens cleanly under the MPS EF without tripping the chromadb 1.x
``Embedding function conflict`` guard. End-to-end smoke check on this
machine: ``cos(mps_embed, cpu_embed) = 1.000000`` over 384-d vectors,
i.e. drift below FP32 cosine resolution.

Tests
-----

9 new tests in ``tests/test_embedding.py`` cover:

* explicit ``mps`` resolves to the sentinel when torch+ST+MPS line up
* missing ``[mps]`` extra → CPU fallback + actionable warning
  pointing at ``pip install mempalace[mps]``
* ST installed but no Metal (Linux/Intel-Mac) → CPU + different
  warning
* ``auto`` prefers ``mps`` over CoreML when both are available
* ``auto`` falls back to CoreML when torch is absent
* ``get_embedding_function`` routes the MPS sentinel to the
  sentence-transformers branch, not the ONNX builder
* MPS branch shares the same ``_EF_CACHE`` as the ONNX branch
* the ``_MPS_SENTINEL`` string itself is not accidentally treated as
  a user-supplied device name
* ``_torch_mps_available()`` returns False without raising when
  torch is missing

The pre-existing onnx-focused tests now neutralize MPS in their
fixture so they keep their original meaning on dev machines that
have the [mps] extra installed.

Full ``mempalace`` test suite: 1317 passed.

Includes ``benchmarks/apple_silicon_bench.py`` — a reproducible
side-by-side benchmark of all five paths (onnx-default,
onnx-cpu-only, onnx-coreml-explicit, st_cpu, st_mps) for users to
verify on their own hardware.

Addresses PR MemPalace#1133 review feedback from @igorls
-----------------------------------------------

1. Backwards-compatibility — solved by reusing develop's
   ``name()=="default"`` rename trick (the same fix already on
   develop) for both the ONNX and MPS EFs.
2. Optional dependency — torch + sentence-transformers are now the
   ``[mps]`` extra, not required deps. Base install is unchanged.
3. Lint — ``ruff check .`` clean, ``ruff format --check`` clean on
   all touched files.
4. Branch — this commit is on ``develop``, not ``main``.
5. HNSW reproducibility caveat — documented in both the embedding
   module docstring and the README: query results stay correct,
   exact ``recall@k`` can shift between runtimes, pin the device for
   strict reproducibility.
6. Test environment — re-ran the full suite against the supported
   chromadb floor (1.5.8) in a clean py3.13 venv; the original PR's
   0.6.3 venv was the reason the EF-conflict regression went
   un-noticed locally.
7. Caching — develop's existing ``_EF_CACHE`` already does this, so
   no change needed; the MPS branch participates in the same cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@yangqingggui-a11y yangqingggui-a11y force-pushed the feat/apple-silicon-embedding branch from c16fd4b to 233c349 Compare April 26, 2026 13:03
@yangqingggui-a11y yangqingggui-a11y changed the base branch from main to develop April 26, 2026 13:04
@yangqingggui-a11y
Copy link
Copy Markdown
Author

Thanks for the careful review @igorls — addressing all four blockers + the substantive items + the minor one. The branch is now rebased onto develop and force-pushed (the PR base above should now show develop).

Heads-up on a strategic pivot before the per-item walkthrough: between the original PR and now, #1068's solution landed on develop as a4868a3 / fbd0904 — a fully-formed onnxruntime-providers framework with MEMPALACE_EMBEDDING_DEVICE, [gpu]/[dml]/[coreml] extras, an _EF_CACHE, and the name()=="default" rename trick that solves the EF-conflict guard cleanly. The clean rebase here therefore builds on that framework rather than replacing it: this PR adds a single new device, mps, which routes through PyTorch + sentence-transformers instead of ONNX Runtime. auto now prefers mpscudacoremldmlcpu. coreml is retained for users who want the ONNX path explicitly.

The diagnosis still holds (and the M5 numbers coreml=2 chunks/s vs mps=523 chunks/s are why it's worth landing): CoreML's MLProgram lowering doesn't cover all-MiniLM-L6-v2's ops, ANE↔CPU fallback thrashes, and PyTorch MPS bypasses the whole thing. Both the coreml and mps paths are now first-class so users on M1 base (where ANE pressure is lower) and users on M2+ (where it's not) can pick what works best for them.

Per-item

1. Backwards-compatibility break. ✅ The _MempalaceMPS class on the new MPS path subclasses SentenceTransformerEmbeddingFunction and overrides name() to return "default", exactly mirroring _build_ef_class's trick on the ONNX path. So a palace previously persisted with ChromaDB's bundled DefaultEmbeddingFunction (persisted EF: "default") reopens cleanly under the MPS EF. End-to-end smoke check on the dev box: cos(mps_embed, cpu_embed) = 1.000000 over 384-d vectors, i.e. drift below FP32 cosine resolution. No re-mining required for either backend.

2. Optional dependencies.torch and sentence-transformers are now mempalace[mps], not required deps. Base install is byte-for-byte unchanged versus pre-PR (chromadb + pyyaml + tomli; <3.11). On a host without the [mps] extra, MEMPALACE_EMBEDDING_DEVICE=auto falls through to develop's existing cudacoremldmlcpu chain — i.e. exact behavior on develop today.

[project.optional-dependencies]
gpu    = ["onnxruntime-gpu>=1.16"]        # NVIDIA via ORT
dml    = ["onnxruntime-directml>=1.16"]   # Windows AMD/Intel/NVIDIA via ORT
coreml = ["onnxruntime>=1.16"]            # Apple ANE via ORT
mps    = ["sentence-transformers>=2.2.0", "torch>=2.0.0"]  # Apple Metal via PyTorch

3. Lint.ruff check . clean. ruff format --check clean on every touched file.

4. Base branch. ✅ Retargeted to develop. The PR is now a single 1-commit diff against develop (6 files, +632/-23) instead of the previous 47-commit cross-version diff against main.

5. HNSW reproducibility caveat. ✅ Documented in both the embedding.py module docstring and the README:

Same-model embeddings agree to ~1e-6 across runtimes (ONNX vs PyTorch FP arithmetic ordering), well below cosine retrieval's noise floor. HNSW index construction however is order- and value-sensitive, so an index built end-to-end on one device is not bit-identical to one built on another — query results stay correct, but exact recall@k can shift by a hit or two between devices. Pin MEMPALACE_EMBEDDING_DEVICE if you need strict reproducibility.

6. Test environment. ✅ Re-ran the full suite from a fresh py3.13 venv with pip install -e ".[dev,mps,coreml]" (so chromadb resolves to the supported floor 1.5.8, not the 0.6.3 in the original report). 1317/1317 pass. This is also why the prior EF-conflict regression in (1) didn't surface locally before — 0.6.3 doesn't enforce the EF guard.

7. lru_cache for the resolved EF. ✅ Already covered by develop's existing _EF_CACHE keyed by the resolved provider tuple — the new MPS branch participates in the same cache (cache_key = tuple(providers), where the MPS sentinel is the cache key for the MPS path). New test test_get_embedding_function_caches_mps_branch pins this contract.

New tests

9 new tests in tests/test_embedding.py, all of which sit alongside the existing onnx-focused suite without breaking it (the original suite's autouse fixture now neutralizes MPS so those tests keep their original meaning on dev machines that have the [mps] extra installed):

  • test_mps_explicit_resolves_to_sentinel_when_available
  • test_mps_missing_extra_warns_and_falls_to_cpu — actionable warning pointing at pip install mempalace[mps]
  • test_mps_st_installed_but_no_metal_warns — Linux / Intel-Mac path
  • test_auto_prefers_mps_over_coreml_when_torch_mps_available — the headline behavior
  • test_auto_falls_to_coreml_when_torch_unavailable — graceful degradation when [mps] isn't installed
  • test_get_embedding_function_routes_mps_to_st_branch — proves ONNX class is never built on the MPS path
  • test_get_embedding_function_caches_mps_branch
  • test_unknown_device_does_not_match_mps_sentinel — regression guard against the sentinel string leaking into user-supplied config
  • test_torch_mps_available_returns_bool_without_torch — non-Apple CI safety

Benchmark

benchmarks/apple_silicon_bench.py is included for users to reproduce the comparison on their own hardware — it tests onnx_default, onnx_cpu_only, onnx_coreml (explicit), st_cpu, and st_mps side-by-side and prints a markdown table.

Happy to keep iterating. CC @bensig — if you wanted to verify on Apple Silicon you mentioned, the new path is MEMPALACE_EMBEDDING_DEVICE=mps (after pip install mempalace[mps]).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request performance Performance improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants