proposal: concurrent mining via ThreadPoolExecutor + bulk_check_mined() pre-fetch #1088

@jphein

Description

Context

On our fork's palace (165,632 drawers, mixed project files + conversation transcripts), a cold single-threaded run of mempalace mine takes ~14 min. Two changes bring it to ~3 min on a 4-core machine without changing any other semantics:

  1. bulk_check_mined() — paginated pre-fetch of (source_file, mtime) pairs in 10 K batches, so the "is this file already mined?" check doesn't do a per-file ChromaDB query.
  2. --workers N flag on mempalace mine — a ThreadPoolExecutor(max_workers=N) fans out process_file() across not-yet-mined files.
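To make the pre-fetch concrete, here is a minimal sketch (not the fork's actual code). fetch_page is a hypothetical stand-in for a thin wrapper over a paginated metadata query, e.g. collection.get(limit=..., offset=..., include=["metadatas"]); the source_file and mtime metadata keys are assumed:

```python
from typing import Callable

def bulk_check_mined(fetch_page: Callable[[int, int], list[dict]],
                     batch_size: int = 10_000) -> dict[str, float]:
    """Pre-fetch (source_file, mtime) pairs in batches so the
    "is this file already mined?" check becomes a dict lookup
    instead of a per-file ChromaDB query.

    fetch_page(limit, offset) is a placeholder for a paginated
    metadata fetch; it returns a list of metadata dicts, empty
    once the collection is exhausted.
    """
    mined: dict[str, float] = {}
    offset = 0
    while True:
        page = fetch_page(batch_size, offset)
        if not page:
            break
        for meta in page:
            # Drawers from the same source file overwrite each other;
            # any drawer's recorded mtime identifies the mined version.
            mined[meta["source_file"]] = meta["mtime"]
        offset += len(page)
    return mined
```

A file is then considered already mined when mined.get(path) equals its current mtime.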

Both have been running in production on jphein/mempalace since 2026-04-10. Per-file correctness is preserved because process_file() still acquires mine_lock() per file (from #784) before writing, so the fan-out never races on the same file's drawers.

Why an issue rather than a PR

--workers overlaps in intent with #784 (file-level locking, merged 2026-04-13): both care about safe concurrency during mining. The mechanisms differ, though. #784 serializes writes to any one file's drawers via mine_lock(), while --workers parallelizes work across distinct files and relies on that lock for safety.

The question is whether the maintainer sees this as a natural extension of #784's concurrency story, or prefers a different direction (e.g., multi-process orchestration outside a single mine invocation). Would rather ask than file a 200-line PR that goes the wrong way.

Concrete numbers

Run environment: MacBook M2, Python 3.13, chromadb 1.5.8, 165 K-drawer palace (~8 K unique source files).

| Config | Wall time |
| --- | --- |
| Single-threaded (current upstream) | ~14 min |
| bulk_check_mined() pre-fetch only | ~8 min |
| --workers 4 + bulk_check_mined() | ~3 min |
| --workers 8 + bulk_check_mined() | ~2 min 50 s (diminishing returns past 4 workers) |

Open questions

  1. Is in-process fan-out with ThreadPoolExecutor of interest upstream, or is "multiple mempalace mine invocations handling disjoint subsets" the preferred concurrency model?
  2. Should --workers default to 1 (current single-threaded behavior) with explicit opt-in, or to min(4, cpu_count())?
  3. Any concern about bulk_check_mined() memory footprint for palaces with O(100 K+) unique source files? (On our 165 K-drawer palace with ~8 K unique files, the pre-fetch is <1 MB. An O(500 K)-file palace would be ~4 MB — still fine, but a chunked iterator might be warranted at extreme scales.)

Happy to open a PR immediately if the direction is approved. If not, close this with a note and we'll keep it fork-local.

Code for reference (fork main):
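The fork's code isn't pasted here, but the fan-out is roughly the following shape. This is a sketch, not the actual implementation: process_file is assumed to acquire mine_lock() internally (per #784), mined is assumed to be the bulk_check_mined() result, and the mine() signature is illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor

def mine(files: dict[str, float], mined: dict[str, float],
         process_file, workers: int = 1) -> int:
    """Fan process_file() out over not-yet-mined files.

    files maps path -> current mtime; mined maps path -> mtime at
    mining time (from bulk_check_mined()). process_file is assumed
    to take mine_lock() per file (#784), so concurrent workers never
    race on the same file's drawers. Returns the number of files
    that needed mining.
    """
    # A file needs mining if it is new or its mtime changed.
    todo = [p for p, mtime in files.items() if mined.get(p) != mtime]
    if workers <= 1:
        for path in todo:          # preserve current single-threaded path
            process_file(path)
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() drains the iterator, so worker exceptions re-raise here.
            list(pool.map(process_file, todo))
    return len(todo)
```

The workers <= 1 branch keeps today's behavior bit-for-bit, which is why defaulting --workers to 1 (open question 2) is the conservative choice.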


Labels: area/mining (File and conversation mining), enhancement (New feature or request), performance (Performance improvements)
