
RFC: Synapse Phase 10–14 — Model Guard, Cross-Wing Balancing, Score Explainability, Adaptive Compaction, Paginated Scoring #914

@matrix9neonebuchadnezzar2199-sketch

Description

RFC: Synapse Phase 10–14

Continuation of RFC #595 (Phase 5–9). This RFC proposes five new Synapse pipeline phases that address open, unresolved issues in the repository. All phases are opt-in via RetrievalProfile flags and fully backward-compatible — existing behavior is unchanged when flags are off.

Motivation

Phases 5–9 (PR #596) added MMR deduplication, Pinned Memory, Query Expansion, Supersede Detection, and the Consolidation Engine on top of ChromaDB's default col.query(). However, several critical problems remain open with no PRs addressing them:

| Problem | Issues | Status |
|---|---|---|
| Embedding model mismatch between ingest and query — silent search failure | #903, #912 | Open, no PR |
| Large wings dominate search results — small wings get zero hits | #860 | Open, no PR |
| Ranking logic is opaque — no way to explain why a drawer ranked higher | (none filed) | No issue exists |
| preCompact hook permanently blocks /compact in Claude Code | #906, #858, #856 | Open, no PR |
| col.get(limit=10000) silently truncates on large palaces (>10K drawers) | #850, #851, #723 | Open, no PR |

This RFC addresses all five with new Synapse pipeline phases.

DX Guarantee

Every phase includes developer-experience instrumentation (timing, tracing, dry-run support, replay logging, and test hooks). All DX features are observability-only — they never modify scores, rankings, candidate selection, or any write path. With all DX flags at their defaults, the pipeline produces byte-identical results to a build without these features. This invariant is enforced by a dedicated test (test_dx_flags_off_identical_results) that runs the same query with all DX flags ON and OFF and asserts that the returned drawer IDs, order, and scores match exactly.

| DX capability | Default | When OFF |
|---|---|---|
| Phase-level timing | always on | N/A — ~100 ns overhead per phase; no effect on scores |
| Candidate trace | false | single if guard, zero allocation |
| Dry run | false | not a flag — explicit per-call argument; normal calls unaffected |
| Replay logging | false | no writes beyond existing Query Expansion logging |
| Assertion hooks | none registered | iteration over empty list is a no-op |

Phase 10: Model Guard — Embedding Consistency Validation

Problem

The MCP server relies on ChromaDB's built-in default embedding function (all-MiniLM-L6-v2, 384 dimensions). There is no centralized embedding model configuration. If a user ingests with all-mpnet-base-v2 (768-dim), every MCP query silently returns garbage because 384-dim query vectors are compared against 768-dim stored vectors. The dimensional mismatch makes cosine similarity mathematically meaningless, and even when dimensions match, different models produce incompatible vector spaces.

Approach

Insert a validation gate at the very start of the Synapse pipeline (before Phase 1 LTP scoring).

  1. Build-time stamp: During mempalace mine, write the embedding model name and dimension to collection.metadata (ChromaDB native) and optionally to palace_meta.json. Format: {"embedding_model": "all-mpnet-base-v2", "embedding_dim": 768}.
  2. Query-time check: At the start of search_memories(), read the stored model metadata and compare against the currently loaded model's name and dimension.
  3. On mismatch: Add "model_guard": "MISMATCH" to pipeline_trace and populate a warnings field: "ingest model: all-mpnet-base-v2 (768), query model: all-MiniLM-L6-v2 (384) — results may be unreliable". If RetrievalProfile.model_guard_strict is true, return an empty result set with an error message instead of garbage results.
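The query-time check reduces to a metadata comparison. A self-contained sketch of steps 2–3, assuming the build-time stamp format above (the function name and return shape are illustrative, not existing MemPalace API):

```python
def check_model_guard(collection_meta: dict, current_model: str,
                      current_dim: int, strict: bool = False) -> dict:
    """Compare the model stamped at ingest time against the query-time model."""
    ingest_model = collection_meta.get("embedding_model")
    ingest_dim = collection_meta.get("embedding_dim")
    if ingest_model == current_model and ingest_dim == current_dim:
        return {"model_guard": "MATCH"}
    result = {
        "model_guard": "MISMATCH",
        "warnings": [
            f"ingest model: {ingest_model} ({ingest_dim}), "
            f"query model: {current_model} ({current_dim}) — results may be unreliable"
        ],
    }
    if strict:
        # Caller returns an empty result set with an error instead of garbage.
        result["block"] = True
    return result
```

With the stamp `{"embedding_model": "all-mpnet-base-v2", "embedding_dim": 768}` and a query-time `all-MiniLM-L6-v2` (384), this returns `"MISMATCH"` plus the warning string shown above.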

RetrievalProfile flags

  • model_guard: bool (default true) — enable/disable the check.
  • model_guard_strict: bool (default false) — if true, block search on mismatch instead of warning.

Developer experience

  • Timing: phase_timing_ms.model_guard reports the time spent on metadata read + comparison (typically <2 ms). Always included in pipeline_trace.
  • Candidate trace: When candidate_trace: true, a mismatch event is recorded as {"phase": "model_guard", "action": "warn" | "block", "ingest_model": "...", "query_model": "...", "dim_ingest": 768, "dim_query": 384}. This lets developers confirm that Model Guard fired and what it detected.
  • Dry run: dry_run=true executes the full check but never writes to palace_meta.json — useful for verifying detection logic against a test palace without altering its metadata.
  • Replay: The replay_packet records model_guard_result: "MATCH" | "MISMATCH" so that replayed searches reflect the original model state, even if the model has since been changed.
  • Test hooks: register_hook("post_model_guard", fn) fires after validation with context {"match": bool, "ingest_model": str, "query_model": str}. Test example:
def test_model_guard_detects_mismatch():
    result = {}
    pipeline.register_hook("post_model_guard", lambda ctx: result.update(ctx))
    pipeline.run(query="test", profile=mismatched_profile)
    assert result["match"] is False

Impact: Eliminates silent search failure. Low implementation cost (metadata read/write + comparison only). Acts as a safety net for all future embedding model PRs (#515, #553, #756).


Phase 11: Cross-Wing Balancing — Wing-Aware Diversity in MMR

Problem

When a palace has wings of vastly different sizes (e.g., a work wing with hundreds of JSON/YAML files vs. a flights wing with a few travel itineraries), unscoped search returns results exclusively from the largest wing. Searching "cambodia" returns swagger.yaml and schema.rb from work instead of the trip itinerary from flights that explicitly mentions "Cambodia", "Phnom Penh", and "Siem Reap". Similarity scores are negative (-0.53 to -0.58), meaning all results are poor — but the large wing's vectors dominate the nearest-neighbor space simply due to volume.

Approach

Extend Phase 5 (MMR) with a wing-diversity penalty term.

Standard MMR:

score(d) = λ · sim(q, d) − (1−λ) · max_{d' ∈ S} sim(d, d')

Cross-Wing MMR:

score(d) = λ · sim(q, d) − (1−λ) · max_{d' ∈ S} sim(d, d') − γ · wing_saturation(d, S)

Where wing_saturation(d, S) = count(d.wing in S) / |S| — the fraction of already-selected results that share the same wing as candidate d.

To ensure small-wing drawers enter the candidate pool, the initial ChromaDB query requests n_results = k × candidate_pool_multiplier rather than the bare k; setting the multiplier to wing_count (e.g., 5 × 20 = 100) gives every wing a chance to contribute candidates. MMR then selects the final k from this expanded pool.
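The modified greedy selection loop can be sketched as follows — a hypothetical illustration of the formula above, assuming candidates arrive with a precomputed query similarity and a pairwise-similarity callable (names are not existing API):

```python
def cross_wing_mmr(candidates, k, lam=0.7, gamma=0.3, pair_sim=None):
    """Greedy MMR selection with a wing-saturation penalty.

    candidates: dicts with "id", "wing", "sim" (similarity to the query).
    pair_sim: callable (id_a, id_b) -> similarity between two drawers.
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(d):
            # Standard MMR redundancy term: similarity to the closest selected drawer.
            redundancy = max((pair_sim(d["id"], s["id"]) for s in selected), default=0.0)
            # Wing-saturation term: fraction of selected results sharing d's wing.
            saturation = (sum(1 for s in selected if s["wing"] == d["wing"]) / len(selected)
                          if selected else 0.0)
            return lam * d["sim"] - (1 - lam) * redundancy - gamma * saturation
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With `gamma=0.0` the saturation term vanishes and the loop reduces to standard MMR, which is the backward-compatibility claim the flags below rely on.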

RetrievalProfile flags

  • wing_diversity_weight: float (default 0.3) — the γ parameter. Set to 0.0 to disable (equivalent to standard MMR).
  • candidate_pool_multiplier: int (default 1) — multiplier for the initial n_results. Set to wing_count for full cross-wing coverage.

Developer experience

  • Timing: phase_timing_ms.cross_wing_balancing reports the time spent computing wing_saturation penalties and re-ranking. Reported separately from phase_timing_ms.mmr so developers can see the incremental cost of the diversity layer.
  • Candidate trace: When candidate_trace: true, each drawer dropped by Cross-Wing Balancing is logged with the reason:
{
  "drawer_id": "abc-123",
  "dropped_at": "cross_wing_balancing",
  "reason": "wing 'work' already saturated (4/5 selected results)",
  "wing_saturation": 0.80,
  "original_mmr_score": 0.71,
  "penalized_score": 0.47
}

This directly answers "why didn't my flights drawer show up?" — because work had already filled 4 of 5 slots.

  • Dry run: dry_run=true computes all penalties and produces the full candidate trace without persisting any LTP updates that might be triggered by the expanded candidate pool.
  • Replay: replay_packet records wing_counts_at_query_time: {"work": 4200, "flights": 23, ...} so the saturation math can be reproduced exactly.
  • Test hooks: register_hook("post_cross_wing", fn) fires with context {"candidates_before": [...], "candidates_after": [...], "wing_saturation_scores": {...}}. Test example:
def test_cross_wing_surfaces_small_wing():
    captured = {}
    pipeline.register_hook("post_cross_wing", lambda ctx: captured.update(ctx))
    pipeline.run(query="cambodia", profile=diverse_profile)
    wings = {c["wing"] for c in captured["candidates_after"][:5]}
    assert "flights" in wings

Impact: Resolves the "large room dominates" problem without requiring --room filters. Incremental change to existing MMR logic. Setting γ = 0 produces identical results to current behavior.


Phase 12: Score Explainability — Per-Drawer Score Breakdown

Problem

The current pipeline_trace reports aggregate metrics (phases_applied, phases_skipped, total_candidates_in/out, elapsed_ms) but does not explain why any individual drawer ranked where it did. This makes it impossible to debug ranking issues, validate the pipeline's value-add over raw ChromaDB, or produce transparent benchmarks.

Approach

Attach a score_breakdown object to each returned drawer's metadata.

{
  "score_breakdown": {
    "cosine_similarity": 0.82,
    "ltp_boost": 0.15,
    "mmr_penalty": -0.08,
    "wing_diversity_penalty": -0.02,
    "pinned_boost": 0.04,
    "supersede_penalty": 0.00,
    "final_score": 0.91
  }
}

Implementation:

  1. Each phase emits a ScoreEvent(phase: str, delta: float, reason: str) when it modifies a drawer's score.
  2. Each drawer carries a score_events: list[ScoreEvent] through the pipeline.
  3. At the end of the pipeline, score_events are aggregated into score_breakdown.
  4. score_breakdown is only included when RetrievalProfile.explain is true (default false) to avoid overhead in normal searches.
  5. The mempalace_search MCP tool accepts an explain: bool argument to enable this on a per-call basis.

This follows the Explainable IR methodology described in Anand et al. (arXiv:2211.02405) — pointwise score decomposition across ranking features.
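The event-and-aggregate mechanism of steps 1–3 can be sketched in a few lines. The `ScoreEvent` shape follows step 1; everything else (field names, the cosine base term) is illustrative:

```python
from dataclasses import dataclass

@dataclass
class ScoreEvent:
    phase: str
    delta: float
    reason: str

def aggregate_breakdown(base_similarity: float, events: list) -> dict:
    """Fold a drawer's score_events into its score_breakdown object."""
    breakdown = {"cosine_similarity": base_similarity}
    for e in events:
        # Multiple events from the same phase accumulate into one line item.
        breakdown[e.phase] = breakdown.get(e.phase, 0.0) + e.delta
    breakdown["final_score"] = round(sum(breakdown.values()), 6)
    return breakdown
```

Feeding in the deltas from the example above (base 0.82, +0.15 LTP, −0.08 MMR, −0.02 wing diversity, +0.04 pinned) reproduces the final_score of 0.91.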

RetrievalProfile flags

  • explain: bool (default false) — include score_breakdown in results.

Developer experience

Phase 12 is itself a DX feature — score_breakdown is the developer's primary debugging tool for ranking questions. The additional DX instrumentation layers on top of it:

  • Timing: phase_timing_ms.score_explainability reports the aggregation cost of ScoreEvent lists into score_breakdown. When explain: false, this phase is skipped entirely (no timing entry).
  • Candidate trace: When both candidate_trace: true and explain: true, dropped drawers also include their score_breakdown at the point of exclusion. This answers "drawer X was dropped at MMR — what was its score at that moment?":
{
  "drawer_id": "abc-123",
  "dropped_at": "mmr",
  "reason": "similarity to xyz-789 = 0.92 > threshold 0.85",
  "score_at_drop": {
    "cosine_similarity": 0.78,
    "ltp_boost": 0.10,
    "partial_score": 0.88
  }
}
  • Dry run: dry_run=true with explain=true produces a complete score_breakdown for every candidate (not just the final k), giving a full "what-if" view of the entire ranking.
  • Replay: replay_packet records the full score_events list for every returned drawer, enabling exact reconstruction of the scoring process after the fact.
  • Test hooks: register_hook("post_score_event", fn) fires on every individual ScoreEvent emission. This is the finest-grained hook — it enables assertions like "MMR penalty for drawer X was between -0.1 and -0.05":
def test_mmr_penalty_range():
    events = []
    pipeline.register_hook("post_score_event", lambda e: events.append(e))
    pipeline.run(query="test", profile=explain_profile)
    mmr_events = [e for e in events if e["phase"] == "mmr"]
    for e in mmr_events:
        assert -0.2 <= e["delta"] <= 0.0

Impact: Full transparency into ranking decisions. Enables A/B comparison ("cosine-only" vs. "full pipeline") for benchmarking.


Phase 13: Adaptive Compaction — LTP-Based Memory Compression

Problem

The preCompact hook in Claude Code permanently blocks /compact. When the context window fills up, the hook fires and instructs the model to save everything to MemPalace first — but even after saving, compaction cannot resume because the hook has no mechanism to signal "save complete, proceed with compaction." The root cause is the absence of any criteria for deciding what to keep and what to compress.

Approach

Use Synapse's LTP scores and Supersede Detection to automatically identify compaction candidates.

  1. Trigger: When context token usage exceeds a configurable threshold (e.g., 80%), the Synapse pipeline generates a "compaction candidate list."
  2. Candidate criteria (all must be met):
    • ltp_score < compaction_threshold (default 0.3) — not important long-term.
    • is_superseded = true — a newer drawer has replaced this one.
    • last_accessed > compaction_age_days (default 30) — not recently referenced.
  3. Compaction action: Feed candidates to the Consolidation Engine (Phase 9) for summarization. Original drawers are moved to soft-archive (reversible). The consolidated summary drawer replaces them in active search.
  4. Hook integration: The preCompact hook calls mempalace_session_context to get compaction candidates → executes Consolidation → returns "compaction OK" to Claude Code. This creates an automated flow instead of a permanent block.
  5. Optional decay: Apply a time-based decay factor to LTP scores: decayed_ltp = ltp_score × e^(−λt), where t is days since last access and λ is the decay constant. This models Ebbinghaus's forgetting curve — memories that are never revisited naturally become compaction candidates.
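The candidate criteria and the optional decay combine into a small predicate. A hypothetical sketch (drawer field names mirror the trace format shown under "Developer experience" below; none of this is existing API):

```python
import math

def decayed_ltp(ltp_score: float, days_since_access: float, lam: float = 0.01) -> float:
    """Ebbinghaus-style exponential decay: ltp × e^(−λt)."""
    return ltp_score * math.exp(-lam * days_since_access)

def is_compaction_candidate(drawer: dict, threshold: float = 0.3,
                            age_days: int = 30, decay_enabled: bool = False,
                            lam: float = 0.01) -> bool:
    """All three criteria must hold for a drawer to become a candidate."""
    score = drawer["ltp_score"]
    if decay_enabled:
        score = decayed_ltp(score, drawer["days_since_access"], lam)
    return (score < threshold
            and drawer["is_superseded"]
            and drawer["days_since_access"] > age_days)
```

For the trace example below (ltp 0.18, 47 days since access, λ = 0.01), the decayed score is 0.18 × e^(−0.47) ≈ 0.11, comfortably under the 0.3 threshold.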

RetrievalProfile flags

  • adaptive_compaction: bool (default false) — enable automatic candidate generation.
  • compaction_threshold: float (default 0.3) — LTP score below which a drawer becomes a candidate.
  • compaction_age_days: int (default 30) — minimum days since last access.
  • decay_enabled: bool (default false) — apply forgetting-curve decay to LTP scores.
  • decay_lambda: float (default 0.01) — decay rate constant.

Developer experience

  • Timing: phase_timing_ms.adaptive_compaction reports the total time for candidate selection + consolidation planning. Broken into sub-timings: ltp_scan_ms, supersede_check_ms, consolidation_plan_ms.
  • Candidate trace: This phase's trace is especially critical because compaction removes drawers from active search. When candidate_trace: true, every compaction candidate is logged with full justification:
{
  "drawer_id": "old-note-42",
  "action": "compact",
  "reason": "ltp=0.18 < 0.3, superseded_by=new-note-99, last_accessed=47 days ago",
  "ltp_score": 0.18,
  "decayed_ltp": 0.12,
  "superseded_by": "new-note-99",
  "days_since_access": 47,
  "consolidated_into": "summary-drawer-7"
}
  • Dry run: Essential for this phase. dry_run=true executes the full candidate selection and consolidation planning but performs zero writes — no soft-archive moves, no summary drawer creation, no LTP updates. Returns the complete compaction plan as output so the developer can review what would be compressed before committing. mempalace_consolidate(mode="evaluate") (existing) is extended to work with the auto-generated candidate list.
  • Replay: replay_packet records compaction_candidates: [...] and compaction_plan: {groups: [...], summary_count: N}. This allows post-mortem analysis: "the compaction on April 15th archived 23 drawers — were they all correctly identified?"
  • Test hooks: register_hook("post_compaction_plan", fn) fires after candidate selection with context {"candidates": [...], "groups": [...], "would_archive": int, "would_create_summaries": int}. Test example:
def test_compaction_does_not_touch_high_ltp():
    captured = {}
    pipeline.register_hook("post_compaction_plan", lambda ctx: captured.update(ctx))
    pipeline.run_compaction(profile=decay_profile)
    for c in captured["candidates"]:
        assert c["ltp_score"] < 0.3
        assert c["is_superseded"] is True

Impact: Unblocks the /compact workflow in Claude Code. Introduces cognitively plausible memory management. Dry run + candidate trace make compaction auditable and reversible.

Dependency: Phase 14 (Paginated Scoring) should be implemented first, since Adaptive Compaction needs to scan all drawers to identify candidates across the entire palace.


Phase 14: Paginated Scoring — Large Palace Support

Problem

col.get(limit=10000) is hardcoded in multiple paths (miner.py, mcp_server.py). On palaces with >10,000 drawers, this silently truncates results — mempalace status shows "10,000 drawers" when the actual count is 122,686. Wing/room breakdowns are incomplete (some wings are entirely missing). The same limitation prevents Synapse phases (LTP scoring, Supersede Detection) from scanning the full palace.

Approach

Introduce cursor-based pagination inside the Synapse pipeline.

  1. Accurate count: Use col.count() for the true total. Display this in mempalace_status.
  2. Batched iteration: Iterate with offset + limit (batch size 5,000):
    total = col.count()
    offset = 0
    while offset < total:
        batch = col.get(limit=5000, offset=offset, include=["metadatas"])
        process_batch(batch)
        offset += len(batch["metadatas"])
  3. Incremental LTP updates: LTP scores are persisted in synapse.sqlite3. Batch processing updates only drawers whose filed_at is newer than the last scan timestamp — no full rescan needed on every call.
  4. Bounded scan: RetrievalProfile.max_scan_depth (default 50000) limits the total number of drawers scanned in a single pipeline run, preventing runaway processing on extremely large palaces.
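Steps 1–4 combine into a single bounded loop. A self-contained sketch against any collection-like object exposing count() and get() with ChromaDB-style limit/offset arguments (the process_batch callback and return shape are illustrative):

```python
def paginated_scan(col, process_batch, batch_size: int = 5000,
                   max_scan_depth: int = 50000) -> dict:
    """Iterate a collection in batches, stopping at max_scan_depth."""
    total = col.count()                      # true total, never truncated
    limit = min(total, max_scan_depth)       # bounded scan (step 4)
    offset = 0
    batches = 0
    while offset < limit:
        batch = col.get(limit=min(batch_size, limit - offset),
                        offset=offset, include=["metadatas"])
        process_batch(batch)
        offset += len(batch["metadatas"])
        batches += 1
    return {"total": total, "batches_processed": batches,
            "last_offset": offset,
            "scan_depth_reached": offset < total}  # True means truncated
```

The returned dict matches the pagination_state recorded in the replay packet: on a 122,686-drawer palace with the default depth of 50,000, `scan_depth_reached` would be true, flagging the truncation that col.get(limit=10000) currently hides.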

RetrievalProfile flags

  • paginated_scoring: bool (default true) — enable paginated iteration.
  • batch_size: int (default 5000) — number of drawers per batch.
  • max_scan_depth: int (default 50000) — upper bound on total drawers scanned.

Developer experience

  • Timing: phase_timing_ms.paginated_scoring reports total iteration time, with sub-timings per batch: batch_timings_ms: [12, 14, 11, 13, ...]. This reveals whether specific batches are slow (e.g., batch 8 takes 200 ms because it hits a large wing), enabling targeted optimization.
{
  "phase_timing_ms": {
    "paginated_scoring": {
      "total_ms": 340,
      "batches": 25,
      "batch_timings_ms": [12, 14, 11, 13, "..."],
      "avg_batch_ms": 13.6,
      "slowest_batch": {"index": 8, "ms": 47, "offset": 40000}
    }
  }
}
  • Candidate trace: When candidate_trace: true, each batch reports its contribution to the pipeline:
{
  "batch_index": 3,
  "offset": 15000,
  "drawers_in_batch": 5000,
  "ltp_updated": 127,
  "ltp_skipped_unchanged": 4873,
  "supersede_candidates_found": 4
}

This lets developers see which regions of the palace contain actively changing drawers vs. stable ones, informing batch size tuning.

  • Dry run: dry_run=true iterates through all batches and computes LTP scores and supersede candidates but writes nothing to synapse.sqlite3. Returns a summary of what would be updated. Useful for estimating the cost of a full-palace scan before committing.
  • Replay: replay_packet records pagination_state: {total: 122686, batches_processed: 25, last_offset: 122686, scan_depth_reached: false}. This allows developers to confirm that a past search actually covered the full palace or was truncated by max_scan_depth.
  • Test hooks: register_hook("post_batch", fn) fires after each batch with context {"batch_index": int, "offset": int, "batch_size": int, "ltp_updates": int}. Test example:
def test_pagination_covers_full_palace():
    batches = []
    pipeline.register_hook("post_batch", lambda ctx: batches.append(ctx))
    pipeline.run(query="test", profile=paginated_profile, palace=large_palace)
    total_processed = sum(b["batch_size"] for b in batches)
    assert total_processed == large_palace.count()

Impact: Correct status display on large palaces. Full Synapse pipeline coverage regardless of palace size. Incremental updates minimize repeated computation. Unblocks Phase 13 (Adaptive Compaction) which requires full-palace scans.


Implementation Order

Phase 10 (Model Guard)            → lowest cost, highest safety impact
  ↓
Phase 11 (Cross-Wing MMR)         → incremental change to existing MMR
  ↓
Phase 12 (Score Explainability)   → transparency + benchmark proof
  ↓
Phase 14 (Paginated Scoring)      → prerequisite for Phase 13
  ↓
Phase 13 (Adaptive Compaction)    → depends on Phase 14 for full-palace scan

Each phase will be submitted as a separate PR against develop, following the same pattern as PR #596.

Test Plan

Each phase adds tests to tests/test_synapse_advanced.py and/or new test files:

  • Phase 10: Test match/mismatch detection, strict mode blocking, pipeline_trace output, timing entry, candidate trace event format, dry-run metadata preservation, replay packet model_guard_result, post_model_guard hook firing.
  • Phase 11: Test wing-balanced results vs. standard MMR, γ = 0 backward compatibility, timing separation from MMR, candidate trace with saturation scores, replay wing counts, post_cross_wing hook context.
  • Phase 12: Test score_breakdown structure, explain=false omits it, score component correctness, dropped-drawer score snapshots in candidate trace, full-candidate dry-run output, replay score_events list, post_score_event hook granularity.
  • Phase 13: Test candidate selection criteria, consolidation integration, decay formula, sub-timing breakdown, compaction candidate trace with full justification, dry-run producing zero writes, replay compaction plan, post_compaction_plan hook assertions.
  • Phase 14: Test pagination on mock collections >10K, incremental update correctness, max_scan_depth enforcement, per-batch timing, batch candidate trace, dry-run scan summary, replay pagination state, post_batch hook full-palace coverage.
  • DX invariant: test_dx_flags_off_identical_results — runs the same query with all DX flags ON and OFF, asserts byte-identical drawer IDs, order, and scores.

Target: ~60–80 new tests across all five phases.

