
RFC: Synapse Phase 10–14 — Model Guard, Cross-Wing Balancing, Score Explainability, Adaptive Compaction, Paginated Scoring #914

@matrix9neonebuchadnezzar2199-sketch

Description

RFC: Synapse Phase 10–14

Continuation of RFC #595 (Phase 5–9). This RFC proposes five new Synapse pipeline phases that address open, unresolved issues in the repository. All phases are opt-in via RetrievalProfile flags and fully backward-compatible — existing behavior is unchanged when flags are off.

Motivation

Phases 5–9 (PR #596) added MMR deduplication, Pinned Memory, Query Expansion, Supersede Detection, and the Consolidation Engine on top of ChromaDB's default col.query(). However, several critical problems remain open with no PRs addressing them:

| Problem | Issues | Status |
|---|---|---|
| Embedding model mismatch between ingest and query — silent search failure | #903, #912 | Open, no PR |
| Large wings dominate search results — small wings get zero hits | #860 | Open, no PR |
| Ranking logic is opaque — no way to explain why a drawer ranked higher | (none filed) | No issue exists |
| preCompact hook permanently blocks /compact in Claude Code | #906, #858, #856 | Open, no PR |
| col.get(limit=10000) silently truncates on large palaces (>10K drawers) | #850, #851, #723 | Open, no PR |

This RFC addresses all five with new Synapse pipeline phases.

DX Guarantee

Every phase includes developer-experience instrumentation (timing, tracing, dry-run support, replay logging, and test hooks). All DX features are observability-only — they never modify scores, rankings, candidate selection, or any write path. With all DX flags at their defaults, the pipeline produces byte-identical results to a build without these features. This invariant is enforced by a dedicated test (test_dx_flags_off_identical_results) that runs the same query with all DX flags ON and OFF and asserts that the returned drawer IDs, order, and scores match exactly.

| DX capability | Default | When OFF |
|---|---|---|
| Phase-level timing | always on | N/A — ~100 ns overhead per phase; no effect on scores |
| Candidate trace | false | single if guard, zero allocation |
| Dry run | false | not a flag — explicit per-call argument; normal calls unaffected |
| Replay logging | false | no writes beyond existing Query Expansion logging |
| Assertion hooks | none registered | iteration over empty list is a no-op |

Phase 10: Model Guard — Embedding Consistency Validation

Problem

The MCP server relies on ChromaDB's built-in default embedding function (all-MiniLM-L6-v2, 384 dimensions). There is no centralized embedding model configuration. If a user ingests with all-mpnet-base-v2 (768-dim), every MCP query silently returns garbage because 384-dim query vectors are compared against 768-dim stored vectors. The dimensional mismatch makes cosine similarity mathematically meaningless, and even when dimensions match, different models produce incompatible vector spaces.

Approach

Insert a validation gate at the very start of the Synapse pipeline (before Phase 1 LTP scoring).

  1. Build-time stamp: During mempalace mine, write the embedding model name and dimension to collection.metadata (ChromaDB native) and optionally to palace_meta.json. Format: {"embedding_model": "all-mpnet-base-v2", "embedding_dim": 768}.
  2. Query-time check: At the start of search_memories(), read the stored model metadata and compare against the currently loaded model's name and dimension.
  3. On mismatch: Add "model_guard": "MISMATCH" to pipeline_trace and populate a warnings field: "ingest model: all-mpnet-base-v2 (768), query model: all-MiniLM-L6-v2 (384) — results may be unreliable". If RetrievalProfile.model_guard_strict is true, return an empty result set with an error message instead of garbage results.
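The query-time check reduces to a metadata comparison. A self-contained sketch of steps 2–3, assuming the build-time stamp format above (the function name and return shape are illustrative, not existing MemPalace API):

```python
def check_model_guard(collection_meta: dict, current_model: str,
                      current_dim: int, strict: bool = False) -> dict:
    """Compare the model stamped at ingest time against the query-time model."""
    ingest_model = collection_meta.get("embedding_model")
    ingest_dim = collection_meta.get("embedding_dim")
    if ingest_model == current_model and ingest_dim == current_dim:
        return {"model_guard": "MATCH"}
    result = {
        "model_guard": "MISMATCH",
        "warnings": [
            f"ingest model: {ingest_model} ({ingest_dim}), "
            f"query model: {current_model} ({current_dim}) — results may be unreliable"
        ],
    }
    if strict:
        # Caller returns an empty result set with an error instead of garbage.
        result["block"] = True
    return result
```

With the stamp `{"embedding_model": "all-mpnet-base-v2", "embedding_dim": 768}` and a query-time `all-MiniLM-L6-v2` (384), this returns `"MISMATCH"` plus the warning string shown above.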

RetrievalProfile flags

  • model_guard: bool (default true) — enable/disable the check.
  • model_guard_strict: bool (default false) — if true, block search on mismatch instead of warning.

Developer experience

  • Timing: phase_timing_ms.model_guard reports the time spent on metadata read + comparison (typically <2 ms). Always included in pipeline_trace.
  • Candidate trace: When candidate_trace: true, a mismatch event is recorded as {"phase": "model_guard", "action": "warn" | "block", "ingest_model": "...", "query_model": "...", "dim_ingest": 768, "dim_query": 384}. This lets developers confirm that Model Guard fired and what it detected.
  • Dry run: dry_run=true executes the full check but never writes to palace_meta.json — useful for verifying detection logic against a test palace without altering its metadata.
  • Replay: The replay_packet records model_guard_result: "MATCH" | "MISMATCH" so that replayed searches reflect the original model state, even if the model has since been changed.
  • Test hooks: register_hook("post_model_guard", fn) fires after validation with context {"match": bool, "ingest_model": str, "query_model": str}. Test example:
def test_model_guard_detects_mismatch():
    result = {}
    pipeline.register_hook("post_model_guard", lambda ctx: result.update(ctx))
    pipeline.run(query="test", profile=mismatched_profile)
    assert result["match"] is False

Impact: Eliminates silent search failure. Low implementation cost (metadata read/write + comparison only). Acts as a safety net for all future embedding model PRs (#515, #553, #756).


Phase 11: Cross-Wing Balancing — Wing-Aware Diversity in MMR

Problem

When a palace has wings of vastly different sizes (e.g., a work wing with hundreds of JSON/YAML files vs. a flights wing with a few travel itineraries), unscoped search returns results exclusively from the largest wing. Searching "cambodia" returns swagger.yaml and schema.rb from work instead of the trip itinerary from flights that explicitly mentions "Cambodia", "Phnom Penh", and "Siem Reap". Similarity scores are negative (-0.53 to -0.58), meaning all results are poor — but the large wing's vectors dominate the nearest-neighbor space simply due to volume.

Approach

Extend Phase 5 (MMR) with a wing-diversity penalty term.

Standard MMR:

score(d) = λ · sim(q, d) − (1−λ) · max_{d' ∈ S} sim(d, d')

Cross-Wing MMR:

score(d) = λ · sim(q, d) − (1−λ) · max_{d' ∈ S} sim(d, d') − γ · wing_saturation(d, S)

Where wing_saturation(d, S) = count(d.wing in S) / |S| — the fraction of already-selected results that share the same wing as candidate d.

To ensure small-wing drawers enter the candidate pool, the initial ChromaDB query requests n_results = k × candidate_pool_multiplier rather than the bare k; setting the multiplier to wing_count (e.g., 5 × 20 = 100) gives every wing a chance to contribute candidates. MMR then selects the final k from this expanded pool.
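The modified greedy selection loop can be sketched as follows — a hypothetical illustration of the formula above, assuming candidates arrive with a precomputed query similarity and a pairwise-similarity callable (names are not existing API):

```python
def cross_wing_mmr(candidates, k, lam=0.7, gamma=0.3, pair_sim=None):
    """Greedy MMR selection with a wing-saturation penalty.

    candidates: dicts with "id", "wing", "sim" (similarity to the query).
    pair_sim: callable (id_a, id_b) -> similarity between two drawers.
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(d):
            # Standard MMR redundancy term: similarity to the closest selected drawer.
            redundancy = max((pair_sim(d["id"], s["id"]) for s in selected), default=0.0)
            # Wing-saturation term: fraction of selected results sharing d's wing.
            saturation = (sum(1 for s in selected if s["wing"] == d["wing"]) / len(selected)
                          if selected else 0.0)
            return lam * d["sim"] - (1 - lam) * redundancy - gamma * saturation
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With `gamma=0.0` the saturation term vanishes and the loop reduces to standard MMR, which is the backward-compatibility claim the flags below rely on.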

RetrievalProfile flags

  • wing_diversity_weight: float (default 0.3) — the γ parameter. Set to 0.0 to disable (equivalent to standard MMR).
  • candidate_pool_multiplier: int (default 1) — multiplier for the initial n_results. Set to wing_count for full cross-wing coverage.

Developer experience

  • Timing: phase_timing_ms.cross_wing_balancing reports the time spent computing wing_saturation penalties and re-ranking. Reported separately from phase_timing_ms.mmr so developers can see the incremental cost of the diversity layer.
  • Candidate trace: When candidate_trace: true, each drawer dropped by Cross-Wing Balancing is logged with the reason:
{
  "drawer_id": "abc-123",
  "dropped_at": "cross_wing_balancing",
  "reason": "wing 'work' already saturated (4/5 selected results)",
  "wing_saturation": 0.80,
  "original_mmr_score": 0.71,
  "penalized_score": 0.47
}

This directly answers "why didn't my flights drawer show up?" — because work had already filled 4 of 5 slots.

  • Dry run: dry_run=true computes all penalties and produces the full candidate trace without persisting any LTP updates that might be triggered by the expanded candidate pool.
  • Replay: replay_packet records wing_counts_at_query_time: {"work": 4200, "flights": 23, ...} so the saturation math can be reproduced exactly.
  • Test hooks: register_hook("post_cross_wing", fn) fires with context {"candidates_before": [...], "candidates_after": [...], "wing_saturation_scores": {...}}. Test example:
def test_cross_wing_surfaces_small_wing():
    captured = {}
    pipeline.register_hook("post_cross_wing", lambda ctx: captured.update(ctx))
    pipeline.run(query="cambodia", profile=diverse_profile)
    wings = {c["wing"] for c in captured["candidates_after"][:5]}
    assert "flights" in wings

Impact: Resolves the "large room dominates" problem without requiring --room filters. Incremental change to existing MMR logic. Setting γ = 0 produces identical results to current behavior.


Phase 12: Score Explainability — Per-Drawer Score Breakdown

Problem

The current pipeline_trace reports aggregate metrics (phases_applied, phases_skipped, total_candidates_in/out, elapsed_ms) but does not explain why any individual drawer ranked where it did. This makes it impossible to debug ranking issues, validate the pipeline's value-add over raw ChromaDB, or produce transparent benchmarks.

Approach

Attach a score_breakdown object to each returned drawer's metadata.

{
  "score_breakdown": {
    "cosine_similarity": 0.82,
    "ltp_boost": 0.15,
    "mmr_penalty": -0.08,
    "wing_diversity_penalty": -0.02,
    "pinned_boost": 0.04,
    "supersede_penalty": 0.00,
    "final_score": 0.91
  }
}

Implementation:

  1. Each phase emits a ScoreEvent(phase: str, delta: float, reason: str) when it modifies a drawer's score.
  2. Each drawer carries a score_events: list[ScoreEvent] through the pipeline.
  3. At the end of the pipeline, score_events are aggregated into score_breakdown.
  4. score_breakdown is only included when RetrievalProfile.explain is true (default false) to avoid overhead in normal searches.
  5. The mempalace_search MCP tool accepts an explain: bool argument to enable this on a per-call basis.

This follows the Explainable IR methodology described in Anand et al. (arXiv:2211.02405) — pointwise score decomposition across ranking features.
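The event-and-aggregate mechanism of steps 1–3 can be sketched in a few lines. The `ScoreEvent` shape follows step 1; everything else (field names, the cosine base term) is illustrative:

```python
from dataclasses import dataclass

@dataclass
class ScoreEvent:
    phase: str
    delta: float
    reason: str

def aggregate_breakdown(base_similarity: float, events: list) -> dict:
    """Fold a drawer's score_events into its score_breakdown object."""
    breakdown = {"cosine_similarity": base_similarity}
    for e in events:
        # Multiple events from the same phase accumulate into one line item.
        breakdown[e.phase] = breakdown.get(e.phase, 0.0) + e.delta
    breakdown["final_score"] = round(sum(breakdown.values()), 6)
    return breakdown
```

Feeding in the deltas from the example above (base 0.82, +0.15 LTP, −0.08 MMR, −0.02 wing diversity, +0.04 pinned) reproduces the final_score of 0.91.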

RetrievalProfile flags

  • explain: bool (default false) — include score_breakdown in results.

Developer experience

Phase 12 is itself a DX feature — score_breakdown is the developer's primary debugging tool for ranking questions. The additional DX instrumentation layers on top of it:

  • Timing: phase_timing_ms.score_explainability reports the aggregation cost of ScoreEvent lists into score_breakdown. When explain: false, this phase is skipped entirely (no timing entry).
  • Candidate trace: When both candidate_trace: true and explain: true, dropped drawers also include their score_breakdown at the point of exclusion. This answers "drawer X was dropped at MMR — what was its score at that moment?":
{
  "drawer_id": "abc-123",
  "dropped_at": "mmr",
  "reason": "similarity to xyz-789 = 0.92 > threshold 0.85",
  "score_at_drop": {
    "cosine_similarity": 0.78,
    "ltp_boost": 0.10,
    "partial_score": 0.88
  }
}
  • Dry run: dry_run=true with explain=true produces a complete score_breakdown for every candidate (not just the final k), giving a full "what-if" view of the entire ranking.
  • Replay: replay_packet records the full score_events list for every returned drawer, enabling exact reconstruction of the scoring process after the fact.
  • Test hooks: register_hook("post_score_event", fn) fires on every individual ScoreEvent emission. This is the finest-grained hook — it enables assertions like "MMR penalty for drawer X was between -0.1 and -0.05":
def test_mmr_penalty_range():
    events = []
    pipeline.register_hook("post_score_event", lambda e: events.append(e))
    pipeline.run(query="test", profile=explain_profile)
    mmr_events = [e for e in events if e["phase"] == "mmr"]
    for e in mmr_events:
        assert -0.2 <= e["delta"] <= 0.0

Impact: Full transparency into ranking decisions. Enables A/B comparison ("cosine-only" vs. "full pipeline") for benchmarking.


Phase 13: Adaptive Compaction — LTP-Based Memory Compression

Problem

The preCompact hook in Claude Code permanently blocks /compact. When the context window fills up, the hook fires and instructs the model to save everything to MemPalace first — but even after saving, compaction cannot resume because the hook has no mechanism to signal "save complete, proceed with compaction." The root cause is the absence of any criteria for deciding what to keep and what to compress.

Approach

Use Synapse's LTP scores and Supersede Detection to automatically identify compaction candidates.

  1. Trigger: When context token usage exceeds a configurable threshold (e.g., 80%), the Synapse pipeline generates a "compaction candidate list."
  2. Candidate criteria (all must be met):
    • ltp_score < compaction_threshold (default 0.3) — not important long-term.
    • is_superseded = true — a newer drawer has replaced this one.
    • last_accessed > compaction_age_days (default 30) — not recently referenced.
  3. Compaction action: Feed candidates to the Consolidation Engine (Phase 9) for summarization. Original drawers are moved to soft-archive (reversible). The consolidated summary drawer replaces them in active search.
  4. Hook integration: The preCompact hook calls mempalace_session_context to get compaction candidates → executes Consolidation → returns "compaction OK" to Claude Code. This creates an automated flow instead of a permanent block.
  5. Optional decay: Apply a time-based decay factor to LTP scores: decayed_ltp = ltp_score × e^(−λt), where t is days since last access and λ is the decay constant. This models Ebbinghaus's forgetting curve — memories that are never revisited naturally become compaction candidates.
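The candidate criteria and the optional decay combine into a small predicate. A hypothetical sketch (drawer field names mirror the trace format shown under "Developer experience" below; none of this is existing API):

```python
import math

def decayed_ltp(ltp_score: float, days_since_access: float, lam: float = 0.01) -> float:
    """Ebbinghaus-style exponential decay: ltp × e^(−λt)."""
    return ltp_score * math.exp(-lam * days_since_access)

def is_compaction_candidate(drawer: dict, threshold: float = 0.3,
                            age_days: int = 30, decay_enabled: bool = False,
                            lam: float = 0.01) -> bool:
    """All three criteria must hold for a drawer to become a candidate."""
    score = drawer["ltp_score"]
    if decay_enabled:
        score = decayed_ltp(score, drawer["days_since_access"], lam)
    return (score < threshold
            and drawer["is_superseded"]
            and drawer["days_since_access"] > age_days)
```

For the trace example below (ltp 0.18, 47 days since access, λ = 0.01), the decayed score is 0.18 × e^(−0.47) ≈ 0.11, comfortably under the 0.3 threshold.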

RetrievalProfile flags

  • adaptive_compaction: bool (default false) — enable automatic candidate generation.
  • compaction_threshold: float (default 0.3) — LTP score below which a drawer becomes a candidate.
  • compaction_age_days: int (default 30) — minimum days since last access.
  • decay_enabled: bool (default false) — apply forgetting-curve decay to LTP scores.
  • decay_lambda: float (default 0.01) — decay rate constant.

Developer experience

  • Timing: phase_timing_ms.adaptive_compaction reports the total time for candidate selection + consolidation planning. Broken into sub-timings: ltp_scan_ms, supersede_check_ms, consolidation_plan_ms.
  • Candidate trace: This phase's trace is especially critical because compaction removes drawers from active search. When candidate_trace: true, every compaction candidate is logged with full justification:
{
  "drawer_id": "old-note-42",
  "action": "compact",
  "reason": "ltp=0.18 < 0.3, superseded_by=new-note-99, last_accessed=47 days ago",
  "ltp_score": 0.18,
  "decayed_ltp": 0.12,
  "superseded_by": "new-note-99",
  "days_since_access": 47,
  "consolidated_into": "summary-drawer-7"
}
  • Dry run: Essential for this phase. dry_run=true executes the full candidate selection and consolidation planning but performs zero writes — no soft-archive moves, no summary drawer creation, no LTP updates. Returns the complete compaction plan as output so the developer can review what would be compressed before committing. mempalace_consolidate(mode="evaluate") (existing) is extended to work with the auto-generated candidate list.
  • Replay: replay_packet records compaction_candidates: [...] and compaction_plan: {groups: [...], summary_count: N}. This allows post-mortem analysis: "the compaction on April 15th archived 23 drawers — were they all correctly identified?"
  • Test hooks: register_hook("post_compaction_plan", fn) fires after candidate selection with context {"candidates": [...], "groups": [...], "would_archive": int, "would_create_summaries": int}. Test example:
def test_compaction_does_not_touch_high_ltp():
    captured = {}
    pipeline.register_hook("post_compaction_plan", lambda ctx: captured.update(ctx))
    pipeline.run_compaction(profile=decay_profile)
    for c in captured["candidates"]:
        assert c["ltp_score"] < 0.3
        assert c["is_superseded"] is True

Impact: Unblocks the /compact workflow in Claude Code. Introduces cognitively plausible memory management. Dry run + candidate trace make compaction auditable and reversible.

Dependency: Phase 14 (Paginated Scoring) should be implemented first, since Adaptive Compaction needs to scan all drawers to identify candidates across the entire palace.


Phase 14: Paginated Scoring — Large Palace Support

Problem

col.get(limit=10000) is hardcoded in multiple paths (miner.py, mcp_server.py). On palaces with >10,000 drawers, this silently truncates results — mempalace status shows "10,000 drawers" when the actual count is 122,686. Wing/room breakdowns are incomplete (some wings are entirely missing). The same limitation prevents Synapse phases (LTP scoring, Supersede Detection) from scanning the full palace.

Approach

Introduce cursor-based pagination inside the Synapse pipeline.

  1. Accurate count: Use col.count() for the true total. Display this in mempalace_status.
  2. Batched iteration: Iterate with offset + limit (batch size 5,000):
    total = col.count()
    offset = 0
    while offset < total:
        batch = col.get(limit=5000, offset=offset, include=["metadatas"])
        process_batch(batch)
        offset += len(batch["metadatas"])
  3. Incremental LTP updates: LTP scores are persisted in synapse.sqlite3. Batch processing updates only drawers whose filed_at is newer than the last scan timestamp — no full rescan needed on every call.
  4. Bounded scan: RetrievalProfile.max_scan_depth (default 50000) limits the total number of drawers scanned in a single pipeline run, preventing runaway processing on extremely large palaces.
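Steps 1–4 combine into a single bounded loop. A self-contained sketch against any collection-like object exposing count() and get() with ChromaDB-style limit/offset arguments (the process_batch callback and return shape are illustrative):

```python
def paginated_scan(col, process_batch, batch_size: int = 5000,
                   max_scan_depth: int = 50000) -> dict:
    """Iterate a collection in batches, stopping at max_scan_depth."""
    total = col.count()                      # true total, never truncated
    limit = min(total, max_scan_depth)       # bounded scan (step 4)
    offset = 0
    batches = 0
    while offset < limit:
        batch = col.get(limit=min(batch_size, limit - offset),
                        offset=offset, include=["metadatas"])
        process_batch(batch)
        offset += len(batch["metadatas"])
        batches += 1
    return {"total": total, "batches_processed": batches,
            "last_offset": offset,
            "scan_depth_reached": offset < total}  # True means truncated
```

The returned dict matches the pagination_state recorded in the replay packet: on a 122,686-drawer palace with the default depth of 50,000, `scan_depth_reached` would be true, flagging the truncation that col.get(limit=10000) currently hides.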

RetrievalProfile flags

  • paginated_scoring: bool (default true) — enable paginated iteration.
  • batch_size: int (default 5000) — number of drawers per batch.
  • max_scan_depth: int (default 50000) — upper bound on total drawers scanned.

Developer experience

  • Timing: phase_timing_ms.paginated_scoring reports total iteration time, with sub-timings per batch: batch_timings_ms: [12, 14, 11, 13, ...]. This reveals whether specific batches are slow (e.g., batch 8 takes 200 ms because it hits a large wing), enabling targeted optimization.
{
  "phase_timing_ms": {
    "paginated_scoring": {
      "total_ms": 340,
      "batches": 25,
      "batch_timings_ms": [12, 14, 11, 13, "..."],
      "avg_batch_ms": 13.6,
      "slowest_batch": {"index": 8, "ms": 47, "offset": 40000}
    }
  }
}
  • Candidate trace: When candidate_trace: true, each batch reports its contribution to the pipeline:
{
  "batch_index": 3,
  "offset": 15000,
  "drawers_in_batch": 5000,
  "ltp_updated": 127,
  "ltp_skipped_unchanged": 4873,
  "supersede_candidates_found": 4
}

This lets developers see which regions of the palace contain actively changing drawers vs. stable ones, informing batch size tuning.

  • Dry run: dry_run=true iterates through all batches and computes LTP scores and supersede candidates but writes nothing to synapse.sqlite3. Returns a summary of what would be updated. Useful for estimating the cost of a full-palace scan before committing.
  • Replay: replay_packet records pagination_state: {total: 122686, batches_processed: 25, last_offset: 122686, scan_depth_reached: false}. This allows developers to confirm that a past search actually covered the full palace or was truncated by max_scan_depth.
  • Test hooks: register_hook("post_batch", fn) fires after each batch with context {"batch_index": int, "offset": int, "batch_size": int, "ltp_updates": int}. Test example:
def test_pagination_covers_full_palace():
    batches = []
    pipeline.register_hook("post_batch", lambda ctx: batches.append(ctx))
    pipeline.run(query="test", profile=paginated_profile, palace=large_palace)
    total_processed = sum(b["batch_size"] for b in batches)
    assert total_processed == large_palace.count()

Impact: Correct status display on large palaces. Full Synapse pipeline coverage regardless of palace size. Incremental updates minimize repeated computation. Unblocks Phase 13 (Adaptive Compaction) which requires full-palace scans.


Implementation Order

Phase 10 (Model Guard)            → lowest cost, highest safety impact
  ↓
Phase 11 (Cross-Wing MMR)         → incremental change to existing MMR
  ↓
Phase 12 (Score Explainability)   → transparency + benchmark proof
  ↓
Phase 14 (Paginated Scoring)      → prerequisite for Phase 13
  ↓
Phase 13 (Adaptive Compaction)    → depends on Phase 14 for full-palace scan

Each phase will be submitted as a separate PR against develop, following the same pattern as PR #596.

Test Plan

Each phase adds tests to tests/test_synapse_advanced.py and/or new test files:

  • Phase 10: Test match/mismatch detection, strict mode blocking, pipeline_trace output, timing entry, candidate trace event format, dry-run metadata preservation, replay packet model_guard_result, post_model_guard hook firing.
  • Phase 11: Test wing-balanced results vs. standard MMR, γ = 0 backward compatibility, timing separation from MMR, candidate trace with saturation scores, replay wing counts, post_cross_wing hook context.
  • Phase 12: Test score_breakdown structure, explain=false omits it, score component correctness, dropped-drawer score snapshots in candidate trace, full-candidate dry-run output, replay score_events list, post_score_event hook granularity.
  • Phase 13: Test candidate selection criteria, consolidation integration, decay formula, sub-timing breakdown, compaction candidate trace with full justification, dry-run producing zero writes, replay compaction plan, post_compaction_plan hook assertions.
  • Phase 14: Test pagination on mock collections >10K, incremental update correctness, max_scan_depth enforcement, per-batch timing, batch candidate trace, dry-run scan summary, replay pagination state, post_batch hook full-palace coverage.
  • DX invariant: test_dx_flags_off_identical_results — runs the same query with all DX flags ON and OFF, asserts byte-identical drawer IDs, order, and scores.

Target: ~60–80 new tests across all five phases.

