PreCompact hook: double-ingest race condition causes HNSW corruption (no palace-wide write lock) #1253

@apollion69

Summary

MemPalace 3.3.3 has a race condition in hooks_cli.py that causes concurrent ChromaDB writes when the PreCompact hook fires, leading to HNSW index corruption.

Root cause

hook_precompact() at line ~656 calls two ingest paths without any palace-wide lock:

  1. _ingest_transcript(transcript_path) — spawns an async subprocess.Popen (fire-and-forget)
  2. _mine_sync(...) — runs a sync subprocess.run immediately after

Both paths write to the same ChromaDB collection. If a Stop/SessionEnd hook or a background mempalace mine is already running, the result is two concurrent HNSW writers.

Neither path is gated by the existing mine.pid PID guard — _ingest_transcript bypasses it entirely, and _mine_sync is a separate code path.
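To make the overlap concrete, here is a minimal sketch of the call pattern described above. It is not MemPalace code: `WRITER` and `demo_overlap` are hypothetical stand-ins, and the child processes only sleep instead of touching ChromaDB. It shows that `subprocess.Popen` returns while its child is still alive, so a `subprocess.run` issued right after it starts a second process alongside the first.

```python
import subprocess
import sys

# Hypothetical stand-in for a writer process; it only sleeps.
WRITER = [sys.executable, "-c", "import time; time.sleep(0.3)"]


def demo_overlap():
    """Return True if the background child is still alive when the sync call starts."""
    # Fire-and-forget: Popen returns as soon as the child has been started.
    background = subprocess.Popen(WRITER)
    # poll() returns None while the child has not exited yet.
    still_running = background.poll() is None
    # The synchronous call now runs concurrently with the background child.
    subprocess.run(WRITER, check=True)
    background.wait()
    return still_running
```

If both children were real writers on the same collection, this is the window in which they would run at the same time.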

Observable symptom

hook.log shows:

chromadb.errors.InternalError: Error in compaction: Failed to apply logs to the hnsw segment writer

In our case, link_lists.bin (HNSW higher-level connections) grew from ~50 MB to 210 GB apparent size, of which 41 GB was actually allocated on disk (sparse file expansion), before we caught it.
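A quick way to check for the same symptom is to compare a file's apparent size with the space actually allocated to it. The helper below is a sketch; `size_report` is a hypothetical name, and it assumes a Linux host where `st_blocks` is counted in 512-byte units.

```python
import os


def size_report(path):
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    apparent = st.st_size           # logical length, what `ls -l` reports
    allocated = st.st_blocks * 512  # blocks actually allocated (512-byte units on Linux)
    return apparent, allocated
```

A large gap between the two numbers for link_lists.bin indicates a sparse file like the one described above.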

Reproduction

  1. Enable both PreCompact and SessionEnd hooks in Claude Code settings.json
  2. Have a long session (so both hooks fire close together on context compaction)
  3. Observe HNSW errors in hook.log; running du on the palace directory shows it growing

Proposed fix

Add a palace-wide fcntl.flock(LOCK_EX) in hooks_cli.py before any HNSW write operation. The lock file should be shared across all three write paths:

  • hook_stop() → _ingest_transcript() / _maybe_auto_ingest()
  • hook_precompact() → both ingest calls
  • CLI mempalace mine command

Example pattern (already working in our workaround):

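import fcntl
import os

# palace_dir: path to the palace directory, assumed to be defined by the caller
LOCK_FILE = os.path.join(palace_dir, ".palace-write.lock")
with open(LOCK_FILE, "w") as lf:
    fcntl.flock(lf, fcntl.LOCK_EX)  # blocks until no other process holds the lock
    # ... all HNSW writes here
    # the lock is released when lf is closed at the end of the with block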
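One way to share this pattern across the three write paths is a small context manager. This is a sketch under assumptions, not the MemPalace implementation: `palace_write_lock` is a hypothetical name, and the lock file name is taken from the example above.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def palace_write_lock(palace_dir):
    """Hold an exclusive advisory lock on the palace for the duration of the block."""
    lock_path = os.path.join(palace_dir, ".palace-write.lock")
    # Append mode creates the file if it is missing and never truncates it.
    with open(lock_path, "a") as lf:
        fcntl.flock(lf, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield lock_path
        finally:
            fcntl.flock(lf, fcntl.LOCK_UN)
```

Note that a fire-and-forget child started with Popen keeps running after the parent leaves the `with` block, so the lock would need to be taken inside the child process that performs the writes, not only around the Popen call.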

Workaround (used in production)

We disabled PreCompact entirely (no-op exit 0 script) and moved conversation mining to a cron job with a shared flock -n guard. See the monkey-patch approach in stop_diary_only.py that disables the async paths while keeping the diary checkpoint.
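For reference, the `flock -n` behaviour (give up immediately if someone else holds the lock) can be sketched in Python with `LOCK_NB`. The function name `run_if_unlocked` and its arguments are hypothetical; the actual cron job and stop_diary_only.py are not shown here.

```python
import fcntl


def run_if_unlocked(lock_path, job):
    """Run job() only if the lock is free; return False without waiting otherwise."""
    with open(lock_path, "a") as lf:
        try:
            # LOCK_NB makes flock fail immediately instead of blocking.
            fcntl.flock(lf, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # another writer holds the lock; skip this run
        job()
        return True  # lock is released when lf is closed
```

Skipping instead of waiting suits a periodic job: a run that loses the lock is simply picked up by the next scheduled run.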

Environment

  • MemPalace 3.3.3
  • ChromaDB (bundled version)
  • Claude Code SessionEnd + PreCompact hooks enabled
  • Host: WSL2 on Windows

Related

  • Single-slot mine.pid PID guard has no O_EXCL and is bypassed by both _ingest_transcript (Popen) and _mine_sync paths — worth hardening separately
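
As an illustration of the O_EXCL point, the sketch below creates a PID file atomically so that only one process can claim it. `claim_pid_file` is a hypothetical helper, not the existing mine.pid code, and it deliberately leaves stale-file cleanup out of scope.

```python
import os


def claim_pid_file(pid_path):
    """Atomically create the PID file; return False if it already exists."""
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY
    try:
        # O_EXCL makes creation fail if the file exists, closing the check-then-write gap.
        fd = os.open(pid_path, flags, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True
```

A PID file left behind by a crashed process would still need separate handling, which is one reason an flock-based guard is simpler: the kernel drops the lock when the process exits.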
