Environment
- MemPalace 3.3.3 (pip install -e from develop branch, 74M clone)
- Windows 11, Python 3.14.3 in venv
- chromadb 1.5.8 (default backend)
- Local-only palace at `C:\Users\stugr\Studio-palace`
- Single-collection palace populated incrementally over a multi-hour bulk-mine session
Summary
After a long-running bulk mine completed cleanly (12 wings, 11h 46m total, `ALL_DONE` marker logged, mempalace processes exited cleanly, no errors in mine logs), `mempalace search` fails with a pickle deserialization error in the HNSW segment metadata. `mempalace repair` and `mempalace migrate` cannot recover because they hit the same broken read path. The workaround is non-obvious. The data layer in `chroma.sqlite3` is fully intact; only the HNSW acceleration index metadata is corrupt.
Reproduction shape
- Run a series of `mempalace mine <dir> --mode projects --wing <name>` invocations sequentially in a script over many hours, hitting one shared palace.
- Each wing completes cleanly. Final `ALL_DONE` marker present. No errors.
- Next session:
  - `mempalace search "<any query>"` fails with the error below.
  - `mempalace repair --yes` fails with the same error (its own read path traverses the broken segment).
  - `mempalace migrate --dry-run` hangs/errors on the same path.
I cannot give a minimal reproduction yet (it took ~12 hours of mining and a 1.2 GB end-state palace to surface), but the symptom set is consistent with an interrupted final flush of `index_metadata.pickle` while the rest of the segment files are well-formed.
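For intuition, here is a toy illustration (plain Python, not MemPalace or ChromaDB code) of why a stray byte at an opcode position fails in exactly this way. Python's own parser phrases the failure as "invalid load key" rather than "unsupported opcode", but the mechanism matches:

```python
import pickle

# Build a valid protocol-2 pickle; byte index 2 is the first real opcode
# after the 2-byte protocol header, so overwriting it simulates garbage
# landing on an opcode boundary mid-stream.
blob = bytearray(pickle.dumps([1, 2, 3], protocol=2))
blob[2] = ord("w")  # b'w' is not a valid pickle opcode

try:
    pickle.loads(bytes(blob))
except pickle.UnpicklingError as exc:
    print(exc)  # invalid load key, 'w'.
```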
Error
```
Search error: Error executing plan: Error sending backfill request to compactor:
Error constructing hnsw segment reader: Error creating hnsw segment reader:
Error deserializing pickle file: eval error at offset 8171: unsupported opcode 'w'
```
Same error from `repair` and `migrate`.
State at the time of failure
- `<palace>/chroma.sqlite3` — 1.04 GB, intact, valid SQLite, no journal residue, all drawers queryable via direct SQL
- `<palace>/<segment-uuid>/data_level0.bin` — 167 MB, mtime = end of the clf wing
- `<palace>/<segment-uuid>/header.bin` — 100 B
- `<palace>/<segment-uuid>/length.bin` — 400 KB
- `<palace>/<segment-uuid>/link_lists.bin` — 856 KB
- `<palace>/<segment-uuid>/index_metadata.pickle` — 12 MB (this is the corrupt file)
- All segment-dir files share the same mtime, suggesting they were written by the same process before exit
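The "intact" claim above is cheap to re-verify with nothing but the stdlib; this snippet assumes only the palace path from my environment and asks SQLite itself, with no ChromaDB schema knowledge:

```python
import sqlite3

# Adjust to your palace location
con = sqlite3.connect(r"C:\Users\stugr\Studio-palace\chroma.sqlite3")

# ('ok',) means the file passes SQLite's own consistency checks
print(con.execute("PRAGMA integrity_check;").fetchone())

# List tables so row counts can be spot-checked with direct SQL
for (name,) in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
):
    print(name)

con.close()
```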
Workaround that worked
```bash
# Back up the segment dir for forensics
cp -r <palace>/<segment-uuid> <palace>/<segment-uuid>.BACKUP

# Delete only the corrupt pickle (keep the .bin files)
rm <palace>/<segment-uuid>/index_metadata.pickle

# The next mempalace operation triggers a metadata rebuild from the .bin files
mempalace search "anything"
```
After this, search works again immediately. ChromaDB rebuilds the metadata pickle from the surviving HNSW bin files.
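For reference, a sketch of what an automated version of this workaround could look like (hypothetical helper, not MemPalace API). It validates the pickle with `pickletools.genops`, which walks the opcode stream without executing anything, and only then applies the same back-up-and-delete steps:

```python
import pickletools
import shutil
from pathlib import Path

def quarantine_corrupt_metadata(segment_dir: Path) -> bool:
    """Back up the segment and drop index_metadata.pickle if it is unreadable.

    Returns True if the pickle was quarantined, False if it parsed cleanly.
    """
    meta = segment_dir / "index_metadata.pickle"
    try:
        # genops walks the opcode stream without executing the pickle,
        # which is exactly the level at which this corruption shows up
        for _ in pickletools.genops(meta.read_bytes()):
            pass
        return False  # metadata parses; nothing to do
    except Exception:
        # Preserve everything for forensics, then delete only the pickle;
        # the next read rebuilds it from the surviving .bin files.
        backup = segment_dir.parent / (segment_dir.name + ".BACKUP")
        shutil.copytree(segment_dir, backup, dirs_exist_ok=True)
        meta.unlink()
        return True
```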
Suggested fixes / discussion
- Make `mempalace repair` resilient to a corrupt `index_metadata.pickle`: at minimum, detect the pickle deserialization error and apply the workaround automatically (delete the pickle, let chromadb rebuild it). Right now `repair` reads the segment via the same path as `search`, so it hits the same wall.
- Write `index_metadata.pickle` atomically: write to a sibling temp file, then rename, so an interrupted final flush leaves the previous version intact rather than a half-written file (a sketch of this pattern follows the list).
- Documentation: add the manual workaround to a troubleshooting section so users don't have to discover it through trial and error. The pickle-only delete is non-destructive as long as the .bin files are intact.
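A minimal sketch of that atomic-write pattern, assuming the writer controls the final serialization step (illustrative function, not existing ChromaDB code):

```python
import os
import pickle
import tempfile

def atomic_pickle_dump(obj, dest_path: str) -> None:
    """Write obj to dest_path without ever exposing a half-written file."""
    fd, tmp_path = tempfile.mkstemp(
        dir=os.path.dirname(dest_path) or ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before the rename
        # Atomic on POSIX; on Windows os.replace overwrites in a single step,
        # so readers see either the old file or the new one, never a hybrid.
        os.replace(tmp_path, dest_path)
    except BaseException:
        os.unlink(tmp_path)  # discard the partial temp file; old file untouched
        raise
```

With this pattern an interrupted final flush leaves the previous, loadable pickle in place instead of a torn file.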
Happy to share the BACKUP segment dir privately if it would help diagnose the exact pickle corruption pattern (12 MB; contains the broken pickle for forensic inspection).
Forensic context
I have the corrupted `index_metadata.pickle` preserved on disk. A truncation / mid-stream-write theory fits the "offset 8171: unsupported opcode 'w'" pattern: the offset suggests the pickle parser successfully decoded the first ~8 KB of opcodes and then hit a byte that isn't a valid pickle opcode. I haven't dug into the pickle bytes yet; happy to do so if useful.
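A minimal probe for whenever that happens, using only `pickletools` (the backup path is a placeholder): `genops` walks the opcode stream and raises at the first byte it cannot interpret, which should confirm whether everything before the reported offset is well-formed:

```python
import pickletools

with open(r"<palace>/<segment-uuid>.BACKUP/index_metadata.pickle", "rb") as f:
    data = f.read()

last_pos = 0
try:
    for opcode, arg, pos in pickletools.genops(data):
        if pos is not None:
            last_pos = pos
except ValueError as exc:
    # pickletools reports e.g. "at position 8171, opcode b'w' unknown"
    print(f"last opcode started at byte {last_pos}; parser error: {exc}")
    print("bytes around the failure:", data[max(0, last_pos - 16):last_pos + 48])
```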