Environment
- MemPalace 3.3.3 (pip install -e from develop branch, 74M clone)
- Windows 11, Python 3.14.3 in venv
- chromadb 1.5.8 (default backend)
- Local-only palace at `C:\Users\stugr\Studio-palace`
- Single-collection palace populated incrementally over a multi-hour bulk-mine session
Summary
After a long-running bulk mine completed cleanly (12 wings, 11h 46m total, `ALL_DONE` marker logged, mempalace processes exited cleanly, no errors in mine logs), `mempalace search` fails with a pickle deserialization error in the HNSW segment metadata. `mempalace repair` and `mempalace migrate` cannot recover because they hit the same broken read path. The workaround is non-obvious. The data layer in `chroma.sqlite3` is fully intact; only the HNSW acceleration index metadata is corrupt.
Reproduction shape
- Run a series of `mempalace mine <dir> --mode projects --wing <name>` invocations sequentially in a script over many hours, hitting one shared palace.
- Each wing completes cleanly. Final `ALL_DONE` marker present. No errors.
- Next session:
  - `mempalace search "<any query>"` fails with the error below.
  - `mempalace repair --yes` fails with the same error (its own read path traverses the broken segment).
  - `mempalace migrate --dry-run` hangs/errors on the same path.
I cannot give a minimal reproduction yet (it took ~12 hours of mining and a 1.2 GB end-state palace to surface), but the symptom set is consistent with an interrupted final flush of `index_metadata.pickle` while the rest of the segment files are well-formed.
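For intuition, here is a toy illustration (plain Python, not MemPalace or ChromaDB code) of why a stray byte at an opcode position fails in exactly this way. Python's own parser phrases the failure as "invalid load key" rather than "unsupported opcode", but the mechanism matches:

```python
import pickle

# Build a valid protocol-2 pickle; byte index 2 is the first real opcode
# after the 2-byte protocol header, so overwriting it simulates garbage
# landing on an opcode boundary mid-stream.
blob = bytearray(pickle.dumps([1, 2, 3], protocol=2))
blob[2] = ord("w")  # b'w' is not a valid pickle opcode

try:
    pickle.loads(bytes(blob))
except pickle.UnpicklingError as exc:
    print(exc)  # invalid load key, 'w'.
```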
Error
```
Search error: Error executing plan: Error sending backfill request to compactor:
Error constructing hnsw segment reader: Error creating hnsw segment reader:
Error deserializing pickle file: eval error at offset 8171: unsupported opcode 'w'
```
Same error from `repair` and `migrate`.
State at the time of failure
- `<palace>/chroma.sqlite3` — 1.04 GB, intact, valid SQLite, no journal residue, all drawers queryable via direct SQL
- `<palace>/<segment-uuid>/data_level0.bin` — 167 MB, mtime = end of the clf wing
- `<palace>/<segment-uuid>/header.bin` — 100 B
- `<palace>/<segment-uuid>/length.bin` — 400 KB
- `<palace>/<segment-uuid>/link_lists.bin` — 856 KB
- `<palace>/<segment-uuid>/index_metadata.pickle` — 12 MB (this is the corrupt file)
- All segment-dir files share the same mtime, suggesting they were written by the same process before exit
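The "intact" claim above is cheap to re-verify with nothing but the stdlib; this snippet assumes only the palace path from my environment and asks SQLite itself, with no ChromaDB schema knowledge:

```python
import sqlite3

# Adjust to your palace location
con = sqlite3.connect(r"C:\Users\stugr\Studio-palace\chroma.sqlite3")

# ('ok',) means the file passes SQLite's own consistency checks
print(con.execute("PRAGMA integrity_check;").fetchone())

# List tables so row counts can be spot-checked with direct SQL
for (name,) in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
):
    print(name)

con.close()
```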
Workaround that worked
```bash
# Back up the segment dir for forensics
cp -r <palace>/<segment-uuid> <palace>/<segment-uuid>.BACKUP

# Delete only the corrupt pickle (keep the .bin files)
rm <palace>/<segment-uuid>/index_metadata.pickle

# The next mempalace operation triggers a metadata rebuild from the .bin files
mempalace search "anything"
```
After this, search works again immediately. ChromaDB rebuilds the metadata pickle from the surviving HNSW bin files.
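For reference, a sketch of what an automated version of this workaround could look like (hypothetical helper, not MemPalace API). It validates the pickle with `pickletools.genops`, which walks the opcode stream without executing anything, and only then applies the same back-up-and-delete steps:

```python
import pickletools
import shutil
from pathlib import Path

def quarantine_corrupt_metadata(segment_dir: Path) -> bool:
    """Back up the segment and drop index_metadata.pickle if it is unreadable.

    Returns True if the pickle was quarantined, False if it parsed cleanly.
    """
    meta = segment_dir / "index_metadata.pickle"
    try:
        # genops walks the opcode stream without executing the pickle,
        # which is exactly the level at which this corruption shows up
        for _ in pickletools.genops(meta.read_bytes()):
            pass
        return False  # metadata parses; nothing to do
    except Exception:
        # Preserve everything for forensics, then delete only the pickle;
        # the next read rebuilds it from the surviving .bin files.
        backup = segment_dir.parent / (segment_dir.name + ".BACKUP")
        shutil.copytree(segment_dir, backup, dirs_exist_ok=True)
        meta.unlink()
        return True
```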
Suggested fixes / discussion
- Make `mempalace repair` resilient to a corrupt `index_metadata.pickle`: at minimum, detect the pickle deserialization error and apply the workaround automatically (delete the pickle, let chromadb rebuild it). Right now `repair` reads the segment via the same path as `search`, so it hits the same wall.
- Write `index_metadata.pickle` atomically: write to a sibling temp file, then rename, so an interrupted final flush leaves the previous version intact rather than a half-written file (a sketch of this pattern follows the list).
- Documentation: add the manual workaround to a troubleshooting section so users don't have to discover it through trial and error. The pickle-only delete is non-destructive as long as the .bin files are intact.
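A minimal sketch of that atomic-write pattern, assuming the writer controls the final serialization step (illustrative function, not existing ChromaDB code):

```python
import os
import pickle
import tempfile

def atomic_pickle_dump(obj, dest_path: str) -> None:
    """Write obj to dest_path without ever exposing a half-written file."""
    fd, tmp_path = tempfile.mkstemp(
        dir=os.path.dirname(dest_path) or ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before the rename
        # Atomic on POSIX; on Windows os.replace overwrites in a single step,
        # so readers see either the old file or the new one, never a hybrid.
        os.replace(tmp_path, dest_path)
    except BaseException:
        os.unlink(tmp_path)  # discard the partial temp file; old file untouched
        raise
```

With this pattern an interrupted final flush leaves the previous, loadable pickle in place instead of a torn file.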
Happy to share the BACKUP segment dir privately if it would help diagnose the exact pickle corruption pattern (12 MB; contains the broken pickle for forensic inspection).
Forensic context
I have the corrupted `index_metadata.pickle` preserved on disk. A truncation / mid-stream-write theory fits the "offset 8171: unsupported opcode 'w'" pattern: the offset suggests the pickle parser successfully decoded the first ~8 KB of opcodes and then hit a byte that isn't a valid pickle opcode. I haven't dug into the pickle bytes yet; happy to do so if useful.
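A minimal probe for whenever that happens, using only `pickletools` (the backup path is a placeholder): `genops` walks the opcode stream and raises at the first byte it cannot interpret, which should confirm whether everything before the reported offset is well-formed:

```python
import pickletools

with open(r"<palace>/<segment-uuid>.BACKUP/index_metadata.pickle", "rb") as f:
    data = f.read()

last_pos = 0
try:
    for opcode, arg, pos in pickletools.genops(data):
        if pos is not None:
            last_pos = pos
except ValueError as exc:
    # pickletools reports e.g. "at position 8171, opcode b'w' unknown"
    print(f"last opcode started at byte {last_pos}; parser error: {exc}")
    print("bytes around the failure:", data[max(0, last_pos - 16):last_pos + 48])
```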