Skip to content

feat(backends): pickle-counts gate in quarantine_stale_hnsw#1271

Open
sha2fiddy wants to merge 2 commits intoMemPalace:developfrom
sha2fiddy:feat/quarantine-pickle-consistency-check
Open

feat(backends): pickle-counts gate in quarantine_stale_hnsw#1271
sha2fiddy wants to merge 2 commits intoMemPalace:developfrom
sha2fiddy:feat/quarantine-pickle-consistency-check

Conversation

@sha2fiddy
Copy link
Copy Markdown
Contributor

@sha2fiddy sha2fiddy commented Apr 29, 2026

Problem

quarantine_stale_hnsw misses one corruption shape: index_metadata.pickle where len(id_to_label) != total_elements_added. Stage 2's byte-sniff sees the PROTO + STOP markers and returns healthy, so the segment stays in place. HNSW traversal eventually dereferences the unmapped label and SIGSEGVs.

Same shape as chroma-core/chroma#6979. Hit on a real 522k-drawer palace 2026-04-28; recovered with #1126's repair --mode hnsw. This gate would have caught it before the crash.

Fix

A stage 3 check runs only when stage 2 passes: deserialize index_metadata.pickle and compare len(id_to_label) == total_elements_added.

  • If counts disagree, quarantine to <uuid>.drift-<timestamp> (same path as the stage 2 failure).
  • If counts agree, leave in place.
  • If the unpickler refuses, keys are missing, or types are unexpected, return undetermined and let stage 2's verdict stand. Older 0.6.x palaces with attr-object metadata are not quarantined by this gate alone.

_BoundedUnpickler refuses any non-builtin find_class, so a structurally-valid-but-malicious pickle cannot execute code at quarantine time.

Tests

8 new in tests/test_backends.py:

  • Unit (_segment_pickle_counts_consistent): matching, off-by-one, class-pickle (refused), missing keys, missing file.
  • Integration (quarantine_stale_hnsw): quarantine on off-by-one, leave-alone on consistent counts, leave-alone on byte-sniff-pass + unparseable (the 0.6.x path).

53/53 in test_backends.py and 1469/1469 in the full suite pass. ruff clean. No dependency or lockfile changes.

Adds a third gate that catches the off-by-one corruption shape from
chroma-core/chroma#6979: ``len(id_to_label) != total_elements_added``
in ``index_metadata.pickle``. The byte-sniff in
``_segment_appears_healthy`` cannot see this — the pickle is
structurally valid; the inconsistency lives in the values, and HNSW
traversal eventually lands on the unmapped label and SIGSEGVs (same
shape diagnosed at MemPalace#1046).

Uses a bounded unpickler that refuses any non-builtin
``find_class``, so a structurally-valid-but-malicious pickle cannot
execute arbitrary code at quarantine-check time. Older 0.6.x palaces
that stored metadata as a custom-class instance return "undetermined"
and the existing byte-sniff verdict stands.

8 new tests; 53/53 in test_backends.py and 1469/1469 in the full
suite pass.
Local autoformat hook used ruff 0.15.9 which produced output the CI's
pinned `ruff>=0.4.0,<0.5` rejected. Pinned local venv to match and
re-formatted. Reverts an incidental reformat of the
test_make_client_quarantines_only_on_first_call_per_palace assert
block that 0.15.9 introduced.
@sha2fiddy sha2fiddy force-pushed the feat/quarantine-pickle-consistency-check branch from d20d4f8 to cd25502 Compare April 29, 2026 17:33
@sha2fiddy sha2fiddy marked this pull request as ready for review April 29, 2026 21:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant