JP's production fork of milla-jovovich/mempalace
This fork tracks upstream/develop through the 2026-04-27 sync and runs in production on a 151,478-drawer palace behind palace-daemon at disks.jphe.in:8085. It carries 16 fork-ahead changes that compose with, rather than replace, bensig's release direction; four landed upstream on 2026-04-26 (#1173, #1177, #1198, #1201). 1,500 tests pass on main. What is new here is what we've learned, not just what we've fixed.
On 2026-04-26 the canonical 151K-drawer palace ran an automatic migration on first daemon restart — "Migrated 667 checkpoint drawer(s) from main → mempalace_session_recovery; mempalace_search now queries content-only." That move closed a class of failure that recall benchmarks deliberately don't measure: the gap between finding the right document and grounding the model on something useful. The same Cat 9 A/B that surfaced the failure on 2026-04-25 re-ran post-migration and the predicted convergence held:
| metric | pre-migration | post-migration |
|---|---|---|
| kind=all tokens / question | 632 | 974 |
| kind=content tokens / question | 3 | 1,267 |
| gap between kind=all and kind=content | 210× | 1.3× |
Both modes now return real content. The structural fix did the work the algorithmic patch (kind= filter + over-fetch) couldn't. Empirical detail at ~/Projects/notebook/data/cat9-postmigrate/REPORT.md; the long-form story behind it lives at notebook/essays/2026-04-25-mempalace-lessons.md.
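The gap figures in the table are plain arithmetic on the token counts. A minimal sketch, assuming the gap is the ratio of the larger mode to the smaller one (the function name is ours; the table's 210× is this ratio truncated):

```python
# Tokens per question from the table above (pre- and post-migration).
TOKENS = {
    "pre": {"all": 632, "content": 3},
    "post": {"all": 974, "content": 1267},
}


def mode_gap(all_tokens: int, content_tokens: int) -> float:
    """Ratio of the larger mode to the smaller one, so the gap is always >= 1."""
    hi, lo = max(all_tokens, content_tokens), min(all_tokens, content_tokens)
    return hi / lo


for phase, row in TOKENS.items():
    print(f"{phase}: {mode_gap(row['all'], row['content']):.1f}x")
# pre: 210.7x
# post: 1.3x
```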
The fork has converged on three principles. Treat them as the design test for future work.
The unit of memory in MemPalace is the verbatim utterance — chats, tool calls, mined files, the literal text the user produced or witnessed. Anything else (Stop-hook checkpoints, summaries, KG triples, agent journals, AAAK-encoded reflections) is derivative of that verbatim record. Derivative writes are useful but they are a different kind of thing: their right read pattern is event-shaped (session_id, time, agent), not semantic similarity.
Most public AI memory systems frame the problem the other way around: ingest raw, transform on write, store the derivative as canonical. Mem0 extracts "memories." Zep and Letta tier and summarize. Cognee builds a knowledge graph. Hindsight retains/recalls/reflects with LLM-extracted facts. In each, the verbatim original is gone or, at best, retrievable only through a layer of inference that has already lost nuance. The fork's bet is the inverse: keep verbatim canonical, key derivative layers for their actual access pattern, and treat any derivative store as rebuildable from the verbatim. Derivative layers can then be replaced or re-derived without losing underlying truth. The April-2026 verbatim cohort (MemPalace, Celiums, Longhand) went public within ~8 days of each other, alongside the long-standing mcp-memory-service (2024-12); the timing is suggestive.
Mixing verbatim and derivative in one corpus is the disease the checkpoint split treats. Recovery checkpoints, transcript-mining outputs, future KG-triple stores, and Haiku-enriched topic docs all want their own homes. The main mempalace_drawers collection holds verbatim only; sibling collections (mempalace_session_recovery shipped, more proposed) hold derivatives keyed for their actual read pattern.
This axis is implicit in upstream's RFC 001 (get_collection(palace, collection_name=...) already supports it) but isn't yet named in the spec. Worth making explicit upstream — multi-collection-by-purpose is the architectural move that future backends should plan for.
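One way to picture multi-collection-by-purpose is a routing table from record kind to collection name. This is a sketch only: `mempalace_drawers` and `mempalace_session_recovery` come from the text above, while the `Kind` enum, the helper names, and the `mempalace_kg_triples` collection are hypothetical.

```python
from enum import Enum


class Kind(Enum):
    VERBATIM = "verbatim"
    CHECKPOINT = "checkpoint"
    KG_TRIPLE = "kg_triple"


# First two names are from the text; the third is a made-up placeholder
# for a proposed sibling collection.
COLLECTION_FOR_KIND = {
    Kind.VERBATIM: "mempalace_drawers",
    Kind.CHECKPOINT: "mempalace_session_recovery",
    Kind.KG_TRIPLE: "mempalace_kg_triples",  # hypothetical
}


def collection_for(kind: Kind) -> str:
    """Pick the write target by purpose, at write time."""
    return COLLECTION_FOR_KIND[kind]


def searchable_collections() -> list[str]:
    """Semantic search only ever sees the verbatim collection."""
    return [COLLECTION_FOR_KIND[Kind.VERBATIM]]
```

The point of the shape is that derivative kinds never need a query-time predicate to be excluded; they are simply not in the collection that search opens.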
A week of filter tuning, BM25 fallback, and over-fetch parameters could not make kind=content return more than 3 tokens per question on the canonical palace. ~640 Stop-hook auto-save checkpoint drawers — 0.4% of the corpus — dominated 80%+ of every vector top-N because they were short, query-term-saturated, and embedded close to recent prompts. Recall@5 was 0.984 the whole time. End-to-end answer quality collapsed.
Then we moved them out of the corpus. One structural change — a separate ChromaDB collection for the recovery store, no algorithmic change to ranking — and kind=content jumped to 1,267 tokens per question. The lesson is durable: when corpus shape is wrong, no amount of post-filter cleverness substitutes for fixing the corpus.
This generalizes to every retrieval system that ingests by default and filters by query. Solve it at write time, by purpose, not at query time, by predicate.
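A toy illustration of why the query-time predicate loses. The similarity scores below are invented, and the sketch models only a fixed top-N followed by a filter (not the over-fetch variant); it is not the fork's ranking code.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    drawer_id: str
    kind: str          # "content" or "checkpoint"
    similarity: float


def top_n(corpus: list[Hit], n: int) -> list[Hit]:
    """Rank by similarity, highest first, and keep the first n."""
    return sorted(corpus, key=lambda h: h.similarity, reverse=True)[:n]


# Invented scores: checkpoints are short and query-term-saturated, so they
# embed closer to the query than real content does.
checkpoints = [Hit(f"ckpt-{i}", "checkpoint", 0.90 - i * 0.01) for i in range(8)]
content = [Hit(f"doc-{i}", "content", 0.80 - i * 0.01) for i in range(8)]

# Query-time predicate: rank the mixed corpus, then filter the top 5.
mixed_top5 = top_n(checkpoints + content, 5)
filtered = [h for h in mixed_top5 if h.kind == "content"]

# Write-time split: checkpoints never entered the searched corpus.
split_top5 = top_n(content, 5)

print(len(filtered), len(split_top5))  # 0 5
```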
The usual case for local AI memory is data sovereignty. The deeper benefit, surfaced this week, is the right to audit your own integration shape. Cat 9 in the SME framework — "the Handshake" — names a class of failure that recall benchmarks miss: the gap between retrieval working and the model actually being grounded on the retrieved content. We could only measure it because we own every layer of the stack. A vendor product would have shown us 0.984 R@5 on a dashboard and called it a day.
If you build memory systems and don't run integration measurements, you don't know how big this gap is on your deployment. A 0.984 / 17% split (engram-2's claim) is real and structural, and on the canonical palace it traces directly to checkpoint dominance: fixable, but only because we could see it. End-to-end LongMemEval on the post-migration palace is now in flight; the principle moves from theory to practice as those numbers land.
The deeper read on local-first AI memory: the sovereignty argument lands in court; the right to measure lands in production. The TechEmpower bridge essay at notebook/essays/2026-04-25-techempower-bridge.md develops this further.
Four claims that fall out of the thesis when you take it seriously and run it in production for a few months.
Corpus shape is not a tuning parameter; it's an architectural choice. The 2026-04-25 → 2026-04-26 collection split closed a 210× gap between kind=all and kind=content tokens per question that no amount of kind= filtering, over-fetch tuning, or BM25 fallback had touched. Retrieval algorithms have less leverage over end-to-end quality than the shape of what you ingest; when the corpus is wrong-shaped, you don't filter your way out, you split.
Verbatim storage is load-bearing as the canonical layer. Derivative work (KG, summaries, decay scores, embeddings under different models) is welcome as long as it stays next to the verbatim record, not replacing it. The integrity of every downstream layer depends on being able to re-derive from the original — drop the original and every layer above it is fragile.
The right to measure is the local-first benefit that matters in production. Sovereignty wins arguments; auditability wins debugging sessions. Cat 9 / The Handshake on this fork's deployment was findable because we own every layer of the stack — a vendor product would have shown 0.984 R@5 on a dashboard and called it shipped.
The integration gap (Cat 9 / Handshake) is real, reproducible, and measurable. Engram-2's "17% E2E QA" claim landed on a real failure surface — checkpoint domination of vector top-N — and the structural fix demonstrably closes it on this corpus. The 632/3 → 974/1267 token convergence above is the structural-fix proxy; the end-to-end LongMemEval run on the post-migration palace is in flight, with results to publish at notebook/data/cat9-postmigrate-e2e/ (TODO: link when committed).
Underneath all four, the operational work that doesn't make headlines is still mostly the two hard things — naming (wing/room/topic taxonomies, the verbatim-vs-derivative split was itself a naming clarification, multi-label tags, embedding-model identity across collections, what kind should mean) and cache invalidation (HNSW staleness detection, graph-cache write-invalidation, the kind= filter that went inert post-split, decay/recency weighting, stale auto-loaded docs, the .blob_seq_ids_migrated marker). Karlton's joke is durable for a reason: every retrieval system eventually has to engineer good answers to both, and this one is no exception. The thesis above is the part of the work that generalizes; the two-hard-things are the part that keeps showing up on every PR.
We surveyed the memory-system landscape in April 2026 and found no verbatim-first local system with MCP. Every alternative transforms content on write — extracted facts, knowledge graphs, tiered summaries — losing the original text.
| System | Verbatim? | Local? | MCP? | First public | Notes |
|---|---|---|---|---|---|
| MemPalace | Yes | Yes | Yes | 2026-04-06 (v3.0.0) | What we have. 151,478 drawers as of 2026-04-26 — 150,811 in main, 667 in recovery. Verbatim drawers + wings/rooms scope + SQLite KG + BM25/vector hybrid search. |
| Longhand | Yes | Yes | Yes, 16-tool MCP | 2026-04-14 (v0.5.2; repo 2026-04-09) | Closest cousin. Claude Code-specific — reads ~/.claude/projects/*.jsonl directly. SQLite (raw JSON per event) + ChromaDB (embeddings of pre-computed "episodes"). Deterministic file-state replay via stored diffs. |
| Celiums | Yes | Yes (SQLite, Docker, or DO) | Yes, 6-tool MCP | 2026-04-08 (repo) | Stores full module text with PAD emotional vectors, importance scores, and circadian metadata. Bundles a 500K+ expert-module knowledge base alongside personal memory — different product shape. |
| mcp-memory-service | Yes by default (opt-in consolidation) | Yes (SQLite) or Cloudflare Workers | Yes | 2024-12-26 | The long-standing verbatim option. Turn-level storage; MiniLM embeddings local. Targets LangGraph / CrewAI / AutoGen plus Claude. |
| Hindsight | No — LLM extracts facts | Yes (Docker) | Yes | 2026-01-05 | Three ops: retain / recall / reflect. Original text is lost. |
| Mem0 / OpenMemory | No — extracts "memories" | Partial | Yes | 2023-06 | Cloud-first; OpenMemory is local-mode sibling. |
| Cognee | No — knowledge graph | Yes | Yes | 2023-08 | "Knowledge Engine" via ECL pipeline. |
| Letta | No — tiered summarization | Yes | No | 2023-10 (as MemGPT) | Rebrand kept the repo. |
| engram | Structured fields, not raw | Yes | Yes | 2026-04-11 | Go + SQLite FTS5. |
| CaviraOSS OpenMemory | No — temporal graph | Yes | Yes | 2025-10-26 | SQL-native. |
The April-2026 cluster (MemPalace, Celiums, and Longhand storing verbatim, plus engram with structured fields, all within ~8 days) is striking: it suggests the "store it raw and retrieve well" pattern reached independent critical mass right around the same time. The differentiator: verbatim storage is the foundation; everything else (tags, KG, decay, summaries) is enrichment layered on top. If any layer fails or needs rebuilding, the underlying truth is still there. The same architectural call has been winning in observability for a decade. Grafana Loki's verbatim-event store, with the recent Kafka rearchitect (10× faster aggregated queries, 20× less data scanned), is what mature verbatim-first systems eventually do under scale pressure, and a useful precedent for the substrate exploration below.
Status: exploring — not committed.
The fork is evaluating a Postgres-based backend (pgvector for vector search, Apache AGE for graph traversal) as a candidate implementation against the upstream RFC 001 backend seam. This is composition, not a fork-led architectural shift: BaseBackend + BaseCollection + PalaceRef + the entry-point registry already live in upstream develop at mempalace/backends/, explicitly designed so third-party backends register via Python entry points without touching core. The architectural decision was upstream's; the fork's contribution would be choosing pgvector + AGE as one specific implementation worth picking.
What this would consolidate. Vector search, full-text search, graph traversal, and the temporal entity-relationship store all in a single engine. Today: ChromaDB (HNSW vectors), SQLite (BM25 + KG triples + corpus_origin index), graph cache (in-process). Under Postgres: one connection, one transaction model, one backup story, one operational surface.
The bridge pattern. Microsoft's pgvector ↔ Apache AGE post (Raunak, 2026-04-15) describes the architectural reference: pgvector cosine similarity scores written as SIMILAR_TO edges in the AGE property graph, making vector similarity itself a traversable relationship. The KG-extraction work (P4/P5) lands much more naturally when the graph is in-database than it does in a separate SQLite alongside ChromaDB.
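An in-memory analog of the bridge pattern, to make the idea concrete. This sketch does not touch Postgres, pgvector, or AGE: it only shows cosine similarity between embeddings being turned into `SIMILAR_TO` edge tuples above a threshold. The function names, the threshold, and the toy vectors are assumptions.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_to_edges(embeddings: dict[str, list[float]], threshold: float):
    """Return (source, 'SIMILAR_TO', target, score) for each pair at or above threshold."""
    edges = []
    for (a, va), (b, vb) in combinations(sorted(embeddings.items()), 2):
        score = cosine(va, vb)
        if score >= threshold:
            edges.append((a, "SIMILAR_TO", b, round(score, 3)))
    return edges


toy = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0]}
edges = similar_to_edges(toy, threshold=0.8)
print(edges)  # one edge: d1 -> d2
```

Once similarity is materialized as an edge, a graph traversal can mix it with explicit relationships, which is the property the in-database version is meant to provide.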
Why graph structure matters. Dave Plummer's "My Custom AI Went Superhuman Yesterday..." (Dave's Garage, 2026-02-28) is the conceptual reference: his Tempest AI couldn't reason about the playing field as flat coordinates — it needed the actual geometric structure of the 3D web. Memory retrieval is a related claim: an AI cannot reason about memory as flat vectors alone; the relational structure (entity → entity, conversation → mined-doc, decision → outcome) is what lets it navigate. Vectors get you "topically nearby"; the graph gets you "actually related."
What stays the same. The verbatim-first commitment is unchanged — Postgres tables would hold the same canonical raw text, just on a different storage engine. The multi-collection-by-purpose pattern (Principle 1 of the thesis) maps directly onto Postgres schemas or per-collection tables. Composition with upstream stays the rule, including here: this is a backend implementation against the seam, not a parallel reimplementation. If the evaluation pans out, the natural ship shape is a separate pip install package wired via entry-point registration; the fork's main branch keeps tracking upstream develop and ChromaDB stays the default.
What's still open. Embedding-model identity across the migration window. Operational ergonomics versus the current daemon-fronted ChromaDB story. Whether the bridge pattern survives at 150K+ drawers without a custom indexing strategy. Whether the bench numbers justify the migration cost at all. The honest version is "I don't know yet which engine is better on this corpus and want to find out" — same posture as the Hybrid retrieval A/B.
Three bands of work, all instances of the principles above. Detail rows in the appendix at the bottom.
- Structural retrieval fixes (Principle 1, Principle 2). Multi-collection split moves Stop-hook checkpoints to a dedicated `mempalace_session_recovery` collection: physically absent from `mempalace_search`, queryable via the new `mempalace_session_recovery_read` MCP tool. PreCompact incorporated. Auto-migrates on first daemon restart. The transitional `kind=` filter and over-fetch hack are gone (2026-04-27); the structural fix made them inert. `drawer_id` surfacing on every search/diary/recovery hit so callers can build citation popovers and follow-ups.
- Single-writer architecture (Principle 3). palace-daemon is the only process that opens the palace; clients connect over HTTP. ChromaDB 1.5.x's HNSW concurrency hazards (#974/#965/#823 family) become structurally impossible. Cold-start integrity sniff-test on segment metadata files prevents `quarantine_stale_hnsw` from destroying healthy indexes during async-flush lag. Cherry-pick of upstream #1085 for 10–30× mining speedup; cherry-pick of upstream-PR-#1094 for boundary-level None-metadata coercion that closes a per-site-guard family.
- Deterministic hook saves (Principles 1+2+3 compose). Silent saves bypass auto-memory conflicts entirely: the LLM is no longer in the save path, so `decision: "block"` race conditions and Claude's auto-memory winning over MCP tools both go away. Save marker advances only after confirmed write. `systemMessage` notification surfaces results. PreCompact writes a recovery-collection marker before mining + compaction so context-boundary events leave a queryable timestamp.
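The "save marker advances only after confirmed write" rule from the last bullet can be sketched as follows. This is an illustration of the invariant, not the hook's actual code; `save_then_advance` and the `write` callback are hypothetical names.

```python
from typing import Callable


def save_then_advance(
    pending: list[str],
    marker: int,
    write: Callable[[list[str]], bool],
) -> int:
    """Return the new marker. It moves only if `write` confirms success."""
    batch = pending[marker:]
    if not batch:
        return marker  # nothing new to save
    try:
        confirmed = write(batch)
    except Exception:
        confirmed = False  # a failed write must not advance the marker
    return len(pending) if confirmed else marker
```

Because the marker stays put on failure, the next hook firing retries the same messages instead of silently skipping them.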
git clone https://github.com/jphein/mempalace.git
cd mempalace
python -m venv venv && source venv/bin/activate
pip install -e ".[dev]"
mempalace init ~/Projects --yes
mempalace mine ~/Projects/myproject
mempalace search "why did we switch to GraphQL"

For a daemon-fronted deployment (recommended once palace size reaches the multi-thousand-drawer range), see palace-daemon's setup. The fork's scripts/deploy.sh is a one-command Syncthing-aware redeploy: push fork main, restart palace-daemon, post-restart import-check that the new fork-ahead surface is loaded.
A Stop hook fires every 15 messages in Claude Code, writes directly to mempalace_session_recovery via the Python API (no LLM in the loop), and renders a terminal line so the user sees the save land:
{"systemMessage": "✦ 13 memories woven into the palace — investigate, description, symlinkj"}

search_memories (via mempalace_search MCP tool) returns results with scope-authoritative context so callers can tell when the vector layer underdelivered:
{
"query": "kiyo xhci usb crash fix razer",
"total_before_filter": 15,
"available_in_scope": 137949,
"warnings": [],
"results": [
{"drawer_id": "drawer_kiyo-xhci-fix_technical_a8b2c4...", "wing": "projects",
"room": "technical", "similarity": 0.859, "matched_via": "drawer", ...},
{"drawer_id": "drawer_kiyo-xhci-fix_technical_d5e7f9...", "wing": "kiyo-xhci-fix",
"room": "technical", "similarity": 0.852, "matched_via": "drawer", ...}
]
}

When the HNSW index is genuinely degraded (rare, post-fix), the same call returns warnings: ["vector search returned 0 of 5 requested; filled 5 from sqlite+BM25 keyword match"] with hits tagged "matched_via": "sqlite_bm25_fallback"; data is never silently hidden.
After the 2026-04-26 migration, the example queries from a week ago all return content rather than checkpoint word-soup. The kind= parameter retired 2026-04-27 — the structural split made it inert.
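A caller can use the `warnings` and `matched_via` fields shown above to decide how much to trust a result set. The helper below is a sketch under that assumption; `summarize_response` is a hypothetical name and the sample response uses made-up IDs.

```python
def summarize_response(response: dict) -> dict:
    """Condense a search response into the signals a caller needs."""
    results = response.get("results", [])
    fallback_ids = [
        r["drawer_id"]
        for r in results
        if r.get("matched_via") == "sqlite_bm25_fallback"
    ]
    warnings = list(response.get("warnings", []))
    return {
        "hits": len(results),
        "warnings": warnings,
        "fallback_ids": fallback_ids,
        "degraded": bool(warnings) or bool(fallback_ids),
    }


# Hypothetical response, shaped like the example above but with invented values.
degraded = {
    "query": "example",
    "warnings": ["vector search returned 0 of 1 requested; filled 1 from sqlite+BM25 keyword match"],
    "results": [{"drawer_id": "drawer_example_1", "matched_via": "sqlite_bm25_fallback"}],
}
print(summarize_response(degraded)["degraded"])  # True
```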
Three operational principles that inform PR review alongside the thesis above. They predate the thesis but converge on the same conclusions.
Write the raw text first; derive everything else lazily, from unambiguous signals, with a graceful fallback when derivation fails. The verbatim archive is the one thing that must always succeed. Optional enrichment (LLM topic extraction, AAAK encoding, concept chunking) is welcome as long as it stays opt-in, additive, and never a prerequisite for the write to complete.
The inverse — making classification a gate — is where the fork's earliest visible bugs came from: room=None crashes, a stopword list at 285 English entries papering over false positives, wing misassignment. Entity detection misfires, classifiers force wrong rooms, LLM-extracted "facts" lose nuance and can't be un-extracted. The fork's design test for any new write-path feature is now: does this require interpreting content at write time? If yes, derive lazily instead.
Same instinct as the verbatim-vs-derivative axis. Derivative work belongs next to the verbatim record, never replacing it.
Hierarchy works when it's derived from unambiguous signals (cwd, transcript path, project directory) — not when it's hand-classified by content inspection. The earlier mistake was conflating "hierarchy is bad" with "mandatory synchronous classification is bad" — different claims.
Good uses of hierarchy, which we keep:
- Browseable scope for serendipitous recall across 151K drawers.
- Deletion and retention as a unit. Purging an abandoned project is one operation, not a risky query-then-delete.
- Disambiguation without query gymnastics. The same keyword across years of unrelated work.
- Auto-surfacing priors. A wing derived from cwd is a cheap, unambiguous scoping signal.
Bad uses, which we're unwinding:
- Required at write time (caused all the crashes).
- Derived from content-inspection heuristics (NER, keyword matching) rather than unambiguous signals.
- Single-label, as if every drawer had one true parent. Cross-cutting concerns belong in tags (P0).
- Deep nesting when shallow would do.
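Deriving scope from unambiguous signals amounts to a fallback chain. The sketch below follows the derivation order given in the roadmap (cwd → transcript path → project hint → optional entity hint → unfiled). The function name is hypothetical, and taking the transcript's parent directory name as the wing is our assumption.

```python
from pathlib import PurePosixPath
from typing import Optional


def derive_wing(
    cwd: Optional[str] = None,
    transcript_path: Optional[str] = None,
    project_hint: Optional[str] = None,
    entity_hint: Optional[str] = None,
) -> str:
    """Return the first available signal, never inspecting drawer content."""
    if cwd:
        name = PurePosixPath(cwd).name
        if name:
            return name
    if transcript_path:
        # Assumption: the directory holding the transcript names the project.
        parent = PurePosixPath(transcript_path).parent.name
        if parent:
            return parent
    for hint in (project_hint, entity_hint):
        if hint:
            return hint
    return "unfiled"  # the write still succeeds without a classification


print(derive_wing(cwd="/home/jp/Projects/mempalace"))  # mempalace
```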
Spend the algorithmic budget on retrieval, where quality compounds. Classification quality has a hard ceiling set by the accuracy of the classifier, and a write-time classifier won't be that accurate. Vector + BM25 + optional scope filter already beats the hierarchy on its own. Tags (P0), feedback (P3), and decay (P2) extend without requiring write-time commitment.
Effort spent tuning the entity detector is effort not spent on the thing that pays compounding returns.
Reorganized 2026-04-26 around the verbatim-vs-derivative axis. Each item evaluated against the three architectural principles + the three thesis principles above.
- P0: Multi-label tags (1-2 days, additive). Tags are the cross-cutting-concerns layer that hierarchy can't provide. Add `tags` metadata (3-8 per drawer) extracted during mining via TF-IDF or longest-non-stopword heuristic. Adjacent: #1033 (`<private>` tag filter, @zackchiutw) is single-purpose; full multi-label additive on top. Optional opt-in `--enrich` flag for Haiku-extracted topic tags (96.6% R@5 baseline → competitive before rerank).
- P1: Derive hierarchy from unambiguous signals (half day). Reframe from "best-effort classification" to "derive from cwd, transcript path, project directory." Default wing to source dir name (already mostly works). Demote entity detector to last-resort hint, not gate. Documents the derivation order: cwd → transcript path → project hint → (optional) entity hint → unfiled.
- P6: Input sanitization on writes (half day). Strip known injection patterns. Flag with `sanitized: true` metadata, don't block. 10K char cap. Low priority while local-only.
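For P0, a TF-IDF tag extractor needs nothing beyond the standard library. This is a minimal sketch, not the planned implementation: the tiny stopword set, the smoothed IDF formula, and the function names are all assumptions; only the 3-8 tags-per-drawer range comes from the roadmap item.

```python
import math
import re
from collections import Counter

# Deliberately tiny, illustrative stopword set.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "we", "is", "it", "for", "on"}


def tokenize(text: str) -> list[str]:
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return [w for w in words if w not in STOPWORDS and len(w) > 2]


def tfidf_tags(docs: list[str], max_tags: int = 8) -> list[list[str]]:
    """Return up to max_tags tags per document, highest TF-IDF first."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    tags = []
    for toks in tokenized:
        tf = Counter(toks)
        scored = {
            t: (c / len(toks)) * math.log((1 + n) / (1 + df[t]))
            for t, c in tf.items()
        }
        # Sort by score descending, then alphabetically for determinism.
        ranked = sorted(scored, key=lambda t: (-scored[t], t))
        tags.append(ranked[:max_tags])
    return tags


docs = [
    "graphql schema migration for the api",
    "usb driver crash fix for the webcam",
    "graphql resolver caching",
]
print(tfidf_tags(docs, max_tags=3)[0])  # ['api', 'migration', 'schema']
```

Terms shared across documents ("graphql" here) score lower than terms unique to one drawer, which is the behavior wanted from a cross-cutting tag layer.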
- P8: Corpus partitioning by purpose (architectural). The checkpoint collection split is the first instance. `mempalace_session_recovery` for hook-fired audit data; future siblings for transcript-mine outputs (the #1083 family, currently being addressed at the hook layer by #1199, though the collection-level move is the durable fix), KG-triple store (P4), Haiku-enriched topic docs (companion to P0). Worth flagging in RFC 001 so future backends know multi-collection-per-palace is the canonical pattern.
- P4: KG auto-population + entity resolution (1.5 days). Hooks extract `subject/predicate/object` triples on every save using heuristics (no LLM). Triples land in their own store (KG SQLite is already separate, P8-aligned). Normalize entity IDs; alias table + Levenshtein. Triples are derived: re-mine if extraction improves; verbatim untouched. Note: under the Postgres + pgvector + AGE substrate exploration, the graph lives in-database (AGE) rather than in a separate SQLite, which makes this work meaningfully more natural to implement.
- P5: Temporal fact validity (1 day, depends on P4). KG triples get a context slot (SPOC: subject-predicate-object-context). Reference: Zep's Graphiti. Same Postgres+AGE caveat as P4: temporal validity ranges are SQL-native on Postgres in a way they aren't across two engines.
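P4's "alias table + Levenshtein" entity resolution can be sketched in a few lines. The lookup order (exact alias first, then nearest known entity within a small edit distance), the `max_distance` default, and the function names are assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def resolve_entity(name: str, aliases: dict[str, str], known: list[str],
                   max_distance: int = 2) -> str:
    """Map a raw mention to a canonical entity ID, or keep it as a new entity."""
    key = name.strip().lower()
    if key in aliases:                      # 1. explicit alias wins
        return aliases[key]
    best = min(known, key=lambda k: levenshtein(key, k), default=None)
    if best is not None and levenshtein(key, best) <= max_distance:
        return best                         # 2. near-miss spelling
    return key                              # 3. unseen entity, normalized


print(resolve_entity("chromdb", {}, ["chromadb", "sqlite"]))  # chromadb
```

Because triples are derived data, a wrong merge here is recoverable: adjust the alias table or threshold and re-mine from the verbatim drawers.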
- P2: Decay / recency weighting (tracked upstream). Handled by #1032 (Weibull decay, MERGEABLE). Independent `mempalace prune --stale-days 180` CLI is still a fork opportunity.
- P3: Feedback loops (rerank tracked upstream; rating still open). #1032 covers Tier 0 LLM rerank (96.6% → 99.4% with Haiku). Tier 1+: `mempalace_rate_memory(drawer_id, useful: bool)` MCP tool, implicit echo/fizzle signals. Reference: Celiums's novelty + emotional + circadian importance scoring.
- P7: Alternative storage modes (tracked upstream + fork-side pgvector+AGE evaluation in flight). Upstream owns the RFC 001 seam and the four backend-implementation PRs. Fork is exploring Postgres + pgvector + Apache AGE as one specific implementation against that seam: composition, not a parallel reimplementation. See the dedicated section earlier for what's being evaluated and what's still open.
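For readers unfamiliar with Weibull decay (P2), the usual form is the Weibull survival function, exp(-(t/λ)^k), applied as a multiplier on the retrieval score. The sketch below shows that general shape only; the scale and shape values are placeholders and are not the parameters used in #1032, and the function names are hypothetical.

```python
import math


def weibull_weight(age_days: float, scale_days: float = 180.0, shape: float = 1.5) -> float:
    """Weibull survival function: 1.0 for fresh drawers, decaying toward 0 with age."""
    if age_days <= 0:
        return 1.0
    return math.exp(-((age_days / scale_days) ** shape))


def decayed_score(similarity: float, age_days: float, **params: float) -> float:
    """Similarity down-weighted by age; verbatim content itself is untouched."""
    return similarity * weibull_weight(age_days, **params)


for age in (0, 30, 180, 365):
    print(age, round(weibull_weight(age), 3))
```

At `age_days == scale_days` the weight is exactly exp(-1), about 0.368, whatever the shape; the shape parameter controls how sharply the curve bends around that point.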
- Expanding hierarchy types (tunnels, closets, new room categories). Adding categories doesn't address the write-time classification problem. Tags (P0) and derived scope (P1) do.
- Full architecture rewrite — not worth migration cost.
- Dual-granularity ANN, dream engine, foresight signals (Karta-inspired) — require LLM calls on every write. Zero-LLM philosophy makes these opt-in at best.
- FTS5 parallel index — right idea (engram proves it), significant infrastructure alongside ChromaDB. Revisit after tags and decay are proven.
engram-2 published a benchmark note stating MemPalace achieves 0.984 R@5 on LongMemEval but only 17% end-to-end question-answering accuracy. We located one concrete instance of the gap — checkpoint domination of mempalace_search results — and the structural fix shipped 2026-04-25 → 2026-04-26 demonstrably closes it on this corpus. Pre-migration kind=content returned 3 tokens/Q; post-migration it returns 1,267. The corpus-shape thesis proved out.
End-to-end LongMemEval-S through this fork against a modern reader model is now in flight; results will land at notebook/data/cat9-postmigrate-e2e/REPORT.md (TODO: link when committed). Predicted: substantially better than 17% post-migration, possibly close to recall ceiling, with a chunk-size + embedding-model-alignment headroom delta still to characterize (P0 Haiku enrichment, #442 collection-bound model identity). The structural-fix snapshot is what the migration buys today; the E2E number is the durable claim.
BM25 + vector with reciprocal rank fusion vs current hybrid-rerank pipeline. Don't pre-decide the winner. The honest version is "I don't know which is better on my corpus and want to find out."
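Reciprocal rank fusion itself is small: each document scores the sum of 1 / (k + rank) over the rankings it appears in. A minimal sketch with the conventional k = 60; the drawer IDs and the two input rankings are invented for illustration, and this is not the fork's pipeline code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; ranks are 1-based, ties broken alphabetically."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25 = ["d1", "d2", "d3"]       # keyword ranking (made up)
vector = ["d3", "d1", "d4"]     # embedding ranking (made up)
fused = reciprocal_rank_fusion([bm25, vector])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

RRF uses ranks rather than raw scores, so BM25 and cosine similarity never need to be put on a common scale, which is what makes it a fair baseline for the A/B.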
The SME framework's Cat 9 is an underappreciated piece of the memory-systems landscape: every deployment runs into the integration gap, yet the field's benchmarks deliberately don't measure it. Worth scaling up: what does Cat 9 look like on Longhand, Celiums, mcp-memory-service? An apples-to-apples comparison would surface whether the verbatim-first cohort shares an integration shape or whether each system has its own gap. Adapter work is tracked at jphein/multipass-structural-memory-eval. Grafana's o11y-bench (April 2026) is the same instinct applied to observability (bench what agents actually do with the data, not just retrieval-side metrics) and worth tracking as the pattern matures across domains.
@kostadis raised in upstream #1018: a manually curated palace alongside the auto-mined chat palace. The hooks dump everything into one palace today, polluting curated content. Right fix is multi-palace support with per-hook target flag — design needs review (does it fit the single-palace_path model? does it want palace_name aliases?). P8 (collection partitioning) might absorb this — different collections per purpose inside the same palace, vs. multiple palaces. Decide once we've tried the lighter move first.
Knowledge lives across 7+ layers: global CLAUDE.md, project CLAUDE.md, auto-memory, docs/, superpowers specs, code comments, MemPalace. The auto-loaded layers go stale and actively mislead. MemPalace is the only layer that can't go stale (verbatim + timestamped), but it is never auto-loaded. The planned /verify-docs slash command pattern-matches version strings, file paths, PR numbers, and URLs, then verifies them against current state. Cleaning stale docs prevents more wrong assumptions than any amount of auto-querying.
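The extraction half of such a check is plain pattern matching. The sketch below covers only that half (finding checkable claims in a doc); the regexes are rough illustrations, the names are hypothetical, and the verification against current state is out of scope here.

```python
import re

URL = re.compile(r"https?://[^\s)>\]]+")
PATTERNS = {
    "version": re.compile(r"\bv?\d+\.\d+\.\d+\b"),
    "pr_number": re.compile(r"#\d+\b"),
    "path": re.compile(r"(?:~|\.{1,2})?/[\w.\-]+(?:/[\w.\-*]+)+"),
}


def extract_claims(text: str) -> dict[str, list[str]]:
    """Collect checkable strings from a doc, grouped by claim type."""
    claims = {"url": sorted(set(URL.findall(text)))}
    # Blank out URLs first so their path segments are not double-counted.
    rest = URL.sub(" ", text)
    for name, pattern in PATTERNS.items():
        claims[name] = sorted(set(pattern.findall(rest)))
    return claims


sample = (
    "Pinned at v3.3.2 after #1085; see https://example.com/docs "
    "and ~/.claude/projects/demo/memory/a.md for details"
)
print(extract_claims(sample))
```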
A meaningful shift in 2026-04: this fork increasingly composes with upstream rather than carrying parallel implementations.
- Cherry-picks (in-flight upstream PRs we use early): #1085 batched inserts (commit `6be6fff`), #1087 rewrite `cmd_purge` via `delete(where=)` (`366a9ad`), #1094 None-metadata coercion (`43d728d`).
- Coordinated reviews: #1199 (rmdes' unbounded-ingest fix: pulled and tested locally, +1 with composition note), #1219 (pepo72's drawer_id, narrower than ours; offered the diary/recovery extension), RFC 001 #743 (storage backend spec; flagged the multi-collection-by-purpose pattern as worth naming explicitly).
- Closed in favor of upstream: #1171 cross-process write lock (closed 2026-04-25; Felipe's #976 `mine_global_lock` at the right layer plus daemon-strict architecture obsoleted ours).
The fork ships structural moves first, validates them on the canonical palace, then either contributes upstream as PRs or aligns with upstream's parallel implementation. The composition is the point.
Claude Code has two complementary memory layers, used in tandem:
| Layer | Storage | Size | Consolidation | Purpose |
|---|---|---|---|---|
| Auto-memory | ~/.claude/projects/*/memory/*.md | 17 files (this project) | None (manual writes) | Preferences, feedback, context |
| MemPalace | palace-daemon at http://disks.jphe.in:8085 (ChromaDB on the daemon host) | 151,478 drawers (150,811 main + 667 recovery) | None (write-only archive) | Verbatim conversations, tool output, code |
Neither has automatic consolidation. Claude Code has unreleased "Auto Dream" consolidation behind a disabled feature flag (anthropics/claude-code#38461) — if it ships, it covers only the lightweight layer. MemPalace decay (P2) and feedback (P3) remain the right priorities for the verbatim archive.
From a 2026-04-21 sweep of upstream MemPalace issue + comment + discussion history. State moves; check the repos directly for current status.
- palace-daemon (@rboarescu): FastAPI gateway + MCP-over-HTTP proxy. Three asyncio semaphores (read / write / mine). Pins correctness floor at MemPalace ≥3.3.2. This fork migrated to palace-daemon on 2026-04-24 (`c09582c` wired MCP + hooks; `0e97b19` added daemon-strict mode). All reads and writes from the plugin flow through the daemon; auto-migrate-on-startup of the checkpoint split landed as palace-daemon `034023c` (Phase E). JP's deployment runs at `jphein/palace-daemon`.
- engram (@NickCirv): File-read interception for AI coding assistants. Uses MemPalace as one of six context providers via `mcp-mempalace mempalace-search`; caches with 1h TTL. Upstream discussion #798.
- engram (@harreh3iesh; different project, same name): Hooks + tools for AI memory, first-class MemPalace backend. Stuck detector (`PreToolUse` hook counts Grep/Glob calls and nudges the AI when spinning) is a pattern worth borrowing. Upstream discussion #748.
- cdd-mempalace (@fuzzymoomoo): Bridge library mapping Context-Driven Development methodology onto wings/halls/rooms. Multiple active upstream PRs.
- multipass-structural-memory-eval (@M0nkeyFl0wer): Nine-category diagnostic framework. "Category 9: The Handshake" tests integration under production model usage, not just offline retrieval; this is the gap our LongMemEval numbers don't close. Forked at jphein/multipass-structural-memory-eval. The mempalace-daemon adapter at `sme/adapters/mempalace_daemon.py` talks HTTP/MCP only (no parallel `PersistentClient`, daemon-strict-compatible). The Cat 9 A/B harness used for the 2026-04-25 → 2026-04-26 measurements lives here.
- agentmemory (@rohitg00) — BM25 + vector hybrid. 95.2% R@5 on LongMemEval-S with same MiniLM embedding model. Filed methodology review in upstream #747.
- engram-2 — Rust CLI, deterministic, SQLite + FTS5 only. Hybrid via Gemini embeddings + FTS5 reciprocal rank fusion. 0.990 R@5 vs MemPalace's 0.984 with no reranking, claims 17% end-to-end QA for MemPalace — the critique above. Memory-layer-budgeting (identity / critical / topic / deep tiers with token accounting) is worth studying.
- Tiro (project-tiro) (@esagduyu) — Same data-spine architecture (FastAPI + ChromaDB + SQLite + sentence-transformers + MCP) but curated input domain (web pages, email newsletters as clean markdown). Architectural twin to MemPalace's auto-mine-everything: same stack, different input shape. Forked at jphein/project-tiro.
- RLM (Recursive Language Models) (@alexzhang13, MIT OASYS): LM offloads context as a REPL variable and recursively decomposes. Targets near-infinite context length. Forked at jphein/rlm; integration example at `examples/mempalace_demo.py`. Smoke-tested 2026-04-25 against the 151K palace via Foundry gpt-5.3-chat: RLM autonomously called `mempalace_search` from docstring alone, returned cited answers in 4 iterations / ~23s. That same test surfaced the checkpoint-noise problem the structural fix now solves. Composition pattern (per familiar.realm.watch v0.3): RLM as outer orchestrator, MemPalace + familiar's `/v1/chat/completions` as its tools.
- ASI-Evolve (@GAIR-NLP): Closed-loop autonomous research agent (Researcher / Engineer / Analyzer). Two parallel memory systems: Cognition Store (upfront domain knowledge) and Experiment Database (every trial). Validated on neural architecture design (+0.97 over DeltaNet, ~3× recent human gains). arXiv 2603.29640. Forked at jphein/ASI-Evolve. The Cognition Store is exactly the role MemPalace would play.
Built on top of or alongside MemPalace, by community contributors who use the palace as substrate:
- GraphPalace (@web3guru888) — graph-layer build. Forked at jphein/GraphPalace.
- mempalace-viz (@JoeDoesJits) — visualization layer (wings, rooms, tunnels, drawer counts). Forked at jphein/mempalace-viz.
- AutomataArena (@astrutt) — multi-agent orchestration substrate. Forked at jphein/AutomataArena.
| Fork | Contributor work |
|---|---|
| jphein/mempalace | this fork |
| fuzzymoomoo/cdd-mempalace | 10 comment refs; CDD integration layer |
| potterdigital/mempalace | author of upstream #1081 |
| vnguyen-lexipol/mempalace | author of upstream #851 |
7 PRs open against upstream as of 2026-04-27.
| PR | Status | Description |
|---|---|---|
| #660 | CI green, awaiting review | L1 importance pre-filter |
| #1005 | CI green, Dialectician-acked | Warnings + sqlite BM25 top-up — never silently return fewer results than scope contains |
| #1024 | CI green, qodo-acked | Configurable chunk_size / chunk_overlap / min_chunk_size |
| #1086 | CI green, awaiting review | mempalace export CLI wrapper |
| #1087 | CI green, rewritten 2026-04-26 per @igorls's review | mempalace purge --wing/--room via delete(where=) (no nuke-and-rebuild) |
| #1094 | CI green, awaiting review | Coerce None metadatas to {} at ChromaCollection boundary |
| #1142 | CI green, @bensig accepted 2026-04-23 | docs/RELEASING.md |
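As an illustration of the #1094 row above, the idea of coercing `None` metadatas once at the collection boundary can be sketched as follows. This is a minimal sketch, not the PR's code: `coerce_metadatas` is a hypothetical helper name, and it assumes the flat `get`-style result shape (a dict with parallel `ids` and `metadatas` lists).

```python
def coerce_metadatas(result):
    """Return a copy of a get-style result whose metadatas are all dicts.

    Assumed shape: {"ids": [...], "metadatas": [...] or None, ...}.
    Entries that are None become {}, so callers can use .get() safely.
    """
    ids = result.get("ids") or []
    metadatas = result.get("metadatas")
    if metadatas is None:
        # Metadatas missing entirely: pad one empty dict per id.
        metadatas = [None] * len(ids)
    cleaned = dict(result)  # shallow copy; the input is left untouched
    cleaned["metadatas"] = [m if m is not None else {} for m in metadatas]
    return cleaned


raw = {"ids": ["d1", "d2"], "metadatas": [None, {"wing": "notes"}]}
safe = coerce_metadatas(raw)
wings = [m.get("wing", "unknown") for m in safe["metadatas"]]
```

Normalizing in one place means read paths no longer need their own `if meta is None` guard, which is the "one site instead of N" argument.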
Forward-looking, in rough priority order. The substrate exploration is the biggest open question; everything else is incremental against the existing direction.
- Continue pgvector + Apache AGE evaluation against the RFC 001 backend seam (`BaseBackend` + entry-point registry, already in upstream develop). Frame it as a candidate implementation, not a commitment. See Substrate exploration above for the bridge pattern and references.
- Publish Cat 9 end-to-end results on the post-migration palace at `notebook/data/cat9-postmigrate-e2e/REPORT.md`, with adapter parity numbers across the verbatim-first cohort once the SME harness lands.
- Publish the multipass-structural-memory-eval harness with adapters for MemPalace, Longhand, Celiums, mcp-memory-service so Cat 9 / The Handshake stops being a one-deployment story.
- Land P0 (multi-label tags) and P2 (decay/recency). P2 is tracked upstream via #1032; P0 stays fork-side until upstream wants it.
- Publish the verbatim-vs-derivative axis as a standalone essay, distinct from the README. The axis is doing more work than the README has space to spell out.
- Coordinate with upstream on the multi-collection-by-purpose pattern, which is implicit in RFC 001 today and worth naming explicitly so future backends plan for it.
- Agent-shaped CLI surface. MCP brings palace data into Claude Code via tool calls; the peer surface is a pipe-friendly CLI with structured-output flags so agents, hooks, or scripts can call `mempalace search ... --json` and route results into context without the MCP roundtrip. Grafana's GCX CLI is the prior art for this pattern in observability: bring the data to where the agent lives, don't force the agent into a separate UI. Today's `mempalace` CLI is operator-shaped (status / mine / repair / search); the next-generation surface should be agent-callable, with first-class JSON output and conventions that compose with shell pipelines and slash commands.
- First-class support across the AI coding agent ecosystem. Today's integration is Claude Code-specific (Stop / PreCompact hooks, `~/.claude/projects/*.jsonl` mining). Target the broader set: Claude Code, OpenCode, Cursor, Aider, Gemini CLI, Codex CLI, Warp, and adjacent. The path is upstream's RFC 002 source-adapter spec (tracking #989): each agent ships a `pip install mempalace-source-<agent>` package mapping its session format (Claude Code's JSONL, OpenCode's SQLite, Cursor's `workspaceStorage/*.vscdb`, Aider's `.aider.chat.history.md`, Gemini/Codex log shapes, …) onto the canonical drawer shape with parity on `session_id` / `agent` / `wing` derivation. Existing third-party prototypes already proposed against RFC 002: OpenCode SQLite #23, Cursor SQLite #274 (earlier JSONL variant #232), Pi agent JSONL #169, and a combined Cursor + factory.ai session miner #702; each becomes a `mempalace-source-*` package once the spec lands. Three integration cells: read is universal (the MCP server is already agent-agnostic and works wherever MCP is supported), mine is per-agent via RFC 002 adapters, and hook/event wiring lands wherever the host exposes a hook surface (mining-on-cron is the fallback). The fork unblocks the pattern by helping land RFC 002; per-agent adapter PRs land from their respective authors.
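To make the agent-shaped CLI item concrete, here is a rough sketch of the consuming side: a small filter that turns JSON search output into a context snippet for an agent or hook. Everything about the payload is an assumption for illustration. The `--json` flag is the roadmap proposal rather than a shipped interface, and the output shape used here (a JSON array of objects with `drawer_id` and `text` keys) and the name `hits_to_context` are hypothetical.

```python
import json


def hits_to_context(json_text, max_chars=2000):
    """Render search hits as cited lines, stopping at a character budget.

    json_text: JSON array of {"drawer_id": ..., "text": ...} objects
               (assumed shape; the real output format is not specified).
    """
    hits = json.loads(json_text)
    lines = []
    used = 0
    for hit in hits:
        line = f"[{hit['drawer_id']}] {hit['text']}"
        if used + len(line) > max_chars:
            break  # stay inside the budget rather than truncating mid-hit
        lines.append(line)
        used += len(line) + 1  # +1 accounts for the joining newline
    return "\n".join(lines)


sample = json.dumps([
    {"drawer_id": "d1", "text": "alpha"},
    {"drawer_id": "d2", "text": "beta"},
])
context = hits_to_context(sample)
```

In a pipeline, a script built around this function would read the CLI's stdout from `sys.stdin` and print the snippet, so a hook could route it into the agent's context without an MCP roundtrip.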
```bash
# Setup
git clone https://github.com/jphein/mempalace.git
cd mempalace
python -m venv venv && source venv/bin/activate
pip install -e ".[dev]"

# Develop
python -m pytest tests/ -q               # 1500 tests (benchmarks deselected)
mempalace status                         # palace health
ruff check . && ruff format --check .    # lint + format

# Doc maintenance (canonical YAML + renderer, see CLAUDE.md)
./scripts/render-docs.py                 # regenerate FORK_CHANGELOG from docs/fork-changes.yaml
./scripts/check-docs.sh                  # lint test count, fork hashes, render parity, upstream PR states

# Deploy fork main → palace-daemon on disks
./scripts/deploy.sh                      # one command: push, sync, restart, health, import-check
```

The full enumeration of fork-ahead changes. For the narrative, see What this fork ships above. This is the inventory for verifying claims, looking up specific commits, or picking a contribution.
The canonical source is docs/fork-changes.yaml; FORK_CHANGELOG.md is regenerated from it. Run ./scripts/check-docs.sh to verify everything below resolves to live state.
| Area | Change | Status | Files |
|---|---|---|---|
| Search | Move Stop-hook auto-save checkpoints to dedicated mempalace_session_recovery ChromaDB collection (Principle 1+2). Phases A–E shipped 2026-04-25 → 2026-04-26: collection adapter, write routing, new mempalace_session_recovery_read MCP tool, migration (idempotent, ID/metadata-preserving), PreCompact incorporation, palace-daemon lifespan auto-migrate. Canonical 151K palace migrated 667 checkpoints on 2026-04-26 10:24:09 PDT. Cat 9 A/B re-run shows 632/3 → 974/1267 token convergence. | PR pending; fork commits e266365 (A–C) → 42817d7 (D + PreCompact); palace-daemon 034023c (E); 18 new tests | palace.py, mcp_server.py, migrate.py, cli.py, hooks_cli.py |
| Search | Surface drawer_id in mempalace_search results, mempalace_diary_read entries, and mempalace_session_recovery_read payload. ChromaDB primary key was returned but never plumbed into the result-building loop. Defensive zip-with-id-pad for test mocks. | PR pending; fork commit 9a8bb77; upstream #1219 (@pepo72) is the narrower searcher-only equivalent. | searcher.py, mcp_server.py, tests/..., website/reference/mcp-tools.md |
| Reliability | hook_precompact writes a session-recovery checkpoint marker before mining + compaction. Mirrors hook_stop's _save_diary_direct call; same routing path (recovery collection, queryable by session_id). | Bundled with phase D in 42817d7 | mempalace/hooks_cli.py |
| Performance | Cherry-picked upstream #1085 (@midweste): batch ChromaDB inserts in miner. New _build_drawer() + add_drawers(). Reported 10–30× mining speedup. | Cherry-pick of open #1085; fork commit 6be6fff. Becomes a no-op when #1085 merges. | mempalace/miner.py |
| Reliability | Cherry-picked upstream #1094: coerce None metadatas at chromadb boundary. Closes the per-site-guard family of None-metadata bugs (#999, #1198, #1201) at one site instead of N. | Cherry-pick of open #1094; fork commit 43d728d | backends/chroma.py, tests/test_backends.py |
| CLI | mempalace purge --wing/--room via collection.delete(where=...). Earlier nuke-and-rebuild draft predicated on #521's race; @igorls's review traced the stack: the race is on the upsert path, not delete-by-where. Simpler version preserves embedding fn, no rmtree window, routes through ChromaBackend. | #1087, rewritten 2026-04-26 per review | cli.py, tests/test_cli.py |
| CLI | mempalace export CLI wrapper for upstream's existing export_palace(). | #1086 | cli.py |
| Performance | L1 importance pre-filter: importance >= 3 first, full scan fallback. | #660 | layers.py |
| Config | Configurable chunking parameters: chunk_size (800), chunk_overlap (100), min_chunk_size (50) in config.json, exposed via MempalaceConfig. | #1024 | config.py, miner.py, convo_miner.py |
| Search | Warnings + sqlite BM25 top-up when vector underdelivers: search_memories returns warnings: [...] + available_in_scope; fallback hits tagged matched_via: "sqlite_bm25_fallback". The palace never silently returns fewer results than the scope contains. | #1005 | searcher.py |
| Docs | docs/RELEASING.md with mempalace-mcp pre-release grep. | #1142, accepted by @bensig 2026-04-23 | docs/RELEASING.md |
| Hooks | mempal_save_hook.sh Python auto-detection (MEMPAL_PYTHON → repo venv → system python3). Same pattern in .claude-plugin/. Replied on #1049 offering autodetect, awaiting maintainer arbitration on #1069. | PR pending after #1069 direction | hooks/mempal_save_hook.sh, .claude-plugin/hooks/... |
| Hooks | Transcript auto-mining with correct defaults + hook_auto_mine config flag. Superseded by @sha2fiddy's #1110 for part 1 (opt-out flag); part 2 (_ingest_transcript shape change) remains fork-only. | Issue #1083 | hooks_cli.py |
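The "warnings + BM25 top-up" row above describes a contract: when vector search underdelivers, fill the gap from a keyword fallback, tag those hits, and say so in the response. The sketch below illustrates that contract only. The response keys `warnings`, `available_in_scope`, and the `matched_via: "sqlite_bm25_fallback"` tag come from the table; the function name, the hit shape (dicts with a `drawer_id` key), and the warning wording are assumptions, and the keyword hits are passed in rather than queried from SQLite.

```python
FALLBACK_TAG = "sqlite_bm25_fallback"


def top_up_results(vector_hits, keyword_hits, n_results, available_in_scope):
    """Fill a short vector result list from keyword hits, with a warning."""
    results = [dict(hit) for hit in vector_hits[:n_results]]
    warnings = []
    # Never expect more hits than the scope actually holds.
    expected = min(n_results, available_in_scope)
    vector_count = len(results)
    if vector_count < expected:
        seen = {hit["drawer_id"] for hit in results}
        for hit in keyword_hits:
            if len(results) >= expected:
                break
            if hit["drawer_id"] in seen:
                continue  # skip drawers the vector pass already returned
            seen.add(hit["drawer_id"])
            results.append({**hit, "matched_via": FALLBACK_TAG})
        warnings.append(
            f"vector search returned {vector_count} of {expected} expected hits; "
            f"added {len(results) - vector_count} via keyword fallback"
        )
    return {
        "results": results,
        "warnings": warnings,
        "available_in_scope": available_in_scope,
    }


response = top_up_results(
    vector_hits=[{"drawer_id": "d1", "text": "alpha"}],
    keyword_hits=[
        {"drawer_id": "d1", "text": "alpha"},
        {"drawer_id": "d2", "text": "beta"},
        {"drawer_id": "d3", "text": "gamma"},
    ],
    n_results=3,
    available_in_scope=5,
)
```

Tagging fallback hits keeps the two retrieval paths distinguishable downstream, so a caller can weigh a keyword match differently from a semantic one.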
- 2026-04-26: #1173 (`quarantine_stale_hnsw` cold-start gate + integrity sniff), #1177 (`.blob_seq_ids_migrated` marker), #1198 (`_tokenize` None guard), #1201 (`palace_graph` None metadata)
- 2026-04-23: #659 (diary `wing` parameter, hook derives it from transcript path)
- 2026-04-22: #661 (graph cache), #673 (deterministic hook saves), #1021 (Claude Code 2.1.114 stdout fixes)
- 2026-04-21 (in v3.3.2): #1000 (`quarantine_stale_hnsw`), #1023 (PID file guard), #681 (Unicode checkmark)
- 2026-04-18: #999 (None-metadata guards across 8 read paths)
- In v3.3.0: #664, #682, #683, #684, #635 (via #667)
- #1171 (cross-process write lock — superseded by #976 + daemon-strict)
- #1146 (duplicate of @igorls's #1147)
- #1115 (premature, withdrew pending #1069 arbitration)
- #629, #632, #662, #663, #738, #1036 — all superseded; see commit history for context
Articles and surveys that shaped the fork's direction.
- lhl/agentic-memory: multi-system analysis. The MemPalace review at `ANALYSIS-mempalace.md` seeded the original 7-item roadmap.
- codingwithcody.com, "MemPalace: digital castles on sand": TagMem-promotion critique whose hierarchy-causes-bugs argument produced architectural principles 1 and 2.
- OSS Insight, Agent Memory Race 2026: competitive landscape survey.
- InfoQ, "Grafana rearchitects Loki with Kafka and ships a CLI to bring observability into coding agents": verbatim-first observability precedent at scale; GCX CLI as agent-bridge prior art; o11y-bench as parallel to multipass-structural-memory-eval. Cited in the verbatim-cluster paragraph, the Cat 9 investigation, and the agent-shaped-CLI roadmap item.
- Microsoft Tech Community, "Combining pgvector and Apache AGE: knowledge graph & semantic intelligence in a single engine" (Raunak, 2026-04-15): bridge-pattern reference for the substrate exploration, in which pgvector cosine scores are written as `SIMILAR_TO` edges in the AGE property graph.
- Dave's Garage, "My Custom AI Went Superhuman Yesterday..." (Dave Plummer, 2026-02-28): conceptual reference for why graph structure matters in retrieval: vectors get you "topically nearby"; the graph gets you "actually related."
- Phil Karlton's two hard things: naming and cache invalidation. Cited in "What this fork has learned" because, even at 151K drawers and post-thesis, the day-to-day operational work is still mostly these two.
- Karta — contradiction detection, dream-engine feedback loop, foresight signals. Inspires P3/P4/P5; the heavier per-write LLM features are deprioritized.
- Codex memory — citation-driven retention. Influences P3.
- ByteRover CLI — 5-tier progressive retrieval. Pattern to consider for context-feeding.
- engram — Go + SQLite FTS5; file-read interception prototype. Cited in deprioritized FTS5 item and the auto-surfacing problem.
- context-engine — exponential decay implementation that ports directly into P2.
- Verbatim-first cohort — Longhand, Celiums, mcp-memory-service. Different scopes, same architectural call: keep the drawer verbatim, layer richer metadata on top.
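For the decay/recency idea referenced above (context-engine's exponential decay, roadmap item P2), a minimal sketch of half-life decay applied to a similarity score follows. The formula is the standard exponential half-life form; the function name, the 30-day half-life, and the choice to multiply decay into the similarity score are assumptions for illustration, not context-engine's or MemPalace's implementation.

```python
from datetime import datetime, timedelta, timezone


def decayed_score(similarity, created_at, now, half_life_days=30.0):
    """Scale a similarity score by 0.5 ** (age / half_life).

    A drawer exactly one half-life old keeps half its score. Future
    timestamps are clamped to age zero so the score is never inflated.
    """
    age_days = max((now - created_at).total_seconds() / 86400.0, 0.0)
    return similarity * 0.5 ** (age_days / half_life_days)


now = datetime(2026, 4, 27, tzinfo=timezone.utc)
fresh = decayed_score(0.8, now, now)
month_old = decayed_score(0.8, now - timedelta(days=30), now)
```

Applying decay at read time keeps the stored drawer verbatim and leaves the half-life tunable per query, which fits the fork's rule that derivative signals should be rebuildable.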
Comparison table columns filled 2026-04-14–18; feature status drifts. Cite upstream before treating any row as current. TagMem is omitted; we couldn't find a public repo for it.
MIT — see LICENSE.