Replies: 2 comments 1 reply
You raised some very important points and explained them well. Many corrections have been made to the documentation, and MemPalace's public tagline was recently changed from "highest-scoring AI memory system ever benchmarked" to "best-benchmarked open-source AI memory system", but that is still a claim the project cannot support. The R@K vs QA distinction has been documented in multiple issues besides #39 (#27, #29, #125, #314, #333, #367, #875).
Great discussion. The metric confusion here is worth addressing head-on. Retrieval Recall@5 ("did we find the right document in the top 5?") and QA Accuracy ("did we answer the question correctly?") are fundamentally different metrics, and the LongMemEval paper (ICLR 2025) reports both separately.
Reporting retrieval recall as if it were QA accuracy inflates scores by 20-30 percentage points. I recently ran the full LongMemEval benchmark against Celiums with end-to-end QA accuracy (retrieve, then an LLM synthesizes the answer, then a separate LLM judges it), testing 5 models.
Retrieval rate was 100%: the engine always found the right session. The gap down to 62.3% QA accuracy is pure synthesis difficulty (temporal reasoning, multi-session aggregation). Everything is reproducible: benchmarks/BENCHMARKS.md. Full analysis: celiums.ai/blog/mempalace-mirage-benchmark-fraud
The industry needs more projects publishing honest, reproducible numbers. Happy to discuss methodology.
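To make the two metrics concrete, here is a minimal sketch of both evaluation modes. Everything here is hypothetical scaffolding: `retrieve`, `synthesize_answer`, and `judge_answer` are placeholder callables, not APIs from any of the projects discussed.

```python
def retrieval_recall_at_k(question, gold_session_id, retrieve, k=5):
    """Retrieval-only: 1 if the gold session appears in the top k results."""
    hits = retrieve(question, top_k=k)
    return int(gold_session_id in [h["session_id"] for h in hits])

def qa_accuracy(question, gold_answer, retrieve, synthesize_answer,
                judge_answer, k=5):
    """End-to-end: retrieve, let an LLM write an answer, then have a
    separate LLM judge score it against the gold answer."""
    hits = retrieve(question, top_k=k)
    answer = synthesize_answer(question, hits)
    return int(judge_answer(question, gold_answer, answer))
```

The second function can fail even when the first returns 1, which is exactly the 100% retrieval vs 62.3% QA gap described above.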
Hey, great work on mempalace. The core insight — raw verbatim text + good embeddings beats LLM extraction — is genuinely valuable and your benchmark scripts are refreshingly reproducible.
I maintain agentmemory, a persistent memory system for AI coding agents. We recently ran LongMemEval-S ourselves and got 95.2% R@5 (BM25+vector hybrid) using the same all-MiniLM-L6-v2 embedding model. While comparing approaches, we dug into the benchmark methodology and wanted to share some findings that might help strengthen your claims.
What we verified
What concerns us
1. Metric category error on LongMemEval
LongMemEval is an end-to-end QA benchmark. Every score on the published leaderboard is QA accuracy (retrieve + generate answer + GPT-4o judge). The 96.6% is recall_any@5, a retrieval-only metric that never generates an answer or invokes a judge. This makes the number incomparable to anything on the leaderboard.
An independent tester (Issue #39) ran the full pipeline and got 82.6% QA accuracy — competitive but substantially different from 96.6%.
We label our own 95.2% explicitly as "retrieval recall, not end-to-end QA accuracy" in LONGMEMEVAL.md. Suggesting the same clarity here would help the community compare fairly.
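For readers unfamiliar with the metric, recall_any@k fits in a few lines. This is an illustrative sketch of the metric's definition, not mempalace's actual scoring code:

```python
def recall_any_at_k(retrieved_ids, gold_ids, k=5):
    """1 if ANY gold evidence session is in the top k retrieved, else 0.
    Note what is absent: no answer generation, no LLM judge."""
    return int(any(g in retrieved_ids[:k] for g in gold_ids))
```

Averaged over questions this produces a percentage that looks like an accuracy, but it only measures whether retrieval surfaced the evidence.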
2. The 100% R@5 score and the 3 targeted patches
The path from 96.6% → 100% involved 3 hand-coded patches for 3 specific failing questions (quoted-phrase boost, person-name boost, nostalgia pattern). Your own BENCHMARKS.md acknowledges this is "teaching to the test." The held-out split was created after the patches, not before.
The 98.4% held-out score is more credible but the split is post-hoc. A pre-registered dev/test split would make this bulletproof.
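A pre-registered split is cheap to implement; the key is freezing it (seed included) before any patching begins. A minimal sketch, with hypothetical function and parameter names:

```python
import random

def preregister_split(question_ids, dev_frac=0.5, seed=42):
    """Shuffle deterministically and split BEFORE any tuning.
    Patches may only be motivated by dev failures; test is scored once."""
    rng = random.Random(seed)
    ids = sorted(question_ids)  # canonical order so the split is reproducible
    rng.shuffle(ids)
    cut = int(len(ids) * dev_frac)
    return ids[:cut], ids[cut:]
```

Committing the seed and the resulting ID lists to the repo before tuning makes the held-out claim verifiable from git history alone.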
3. LoCoMo 100% with top_k=50
The 100% LoCoMo claim uses top_k=50 against conversations with at most 32 sessions, which retrieves the entire conversation. Your BENCHMARKS.md correctly notes that "the embedding retrieval step is bypassed entirely." The honest number (88.9% R@10 at top_k=10) should lead.
4. ConvoMem "2x Mem0"
The 92.9% is retrieval recall; Mem0's published numbers are QA accuracy. Different metrics on the same dataset.
5. --mode raw benchmarks ChromaDB, not mempalace
In raw mode, zero mempalace code executes: no palace, no wings, no rooms, no AAAK. The 96.6% is really a benchmark of ChromaDB + MiniLM-L6-v2. That is useful information, but it shouldn't be attributed to the palace architecture.
What we think is genuinely strong
How agentmemory approaches this differently
We took the opposite bet: compress observations into structured facts/narratives to keep context injection under a token budget (~2K tokens vs raw text). Our search uses triple-stream retrieval (BM25 + vector + knowledge graph) with RRF fusion.
On LongMemEval-S with the same embedding model, we're 1.4pp behind on R@5 but ahead on R@10 (98.6% vs ~97.6%). BM25 adds recall depth that pure vector search misses.
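The RRF fusion step mentioned above is simple enough to sketch in a few lines. This is illustrative, not agentmemory's actual code; the constant k=60 is the conventional default from the original RRF formulation:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over streams of 1/(k + rank).
    `ranked_lists` maps a stream name (e.g. bm25/vector/graph) to an
    ordered list of doc ids, best first. Returns ids in fused order."""
    scores = defaultdict(float)
    for ranking in ranked_lists.values():
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked moderately well by all three streams outscores one that a single stream ranks first, which is where the hybrid's R@10 depth comes from.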
Suggestion: the two approaches are complementary
mempalace excels at "what does this codebase contain?" (static corpus search).
agentmemory excels at "what did I do across sessions?" (temporal memory).
A developer could use both.
Would be interesting to explore integration points — mempalace's knowledge graph feeding into agentmemory's context injection, for example.
Not trying to start a flame war — genuinely think both projects push the space forward. The benchmark methodology feedback is meant to strengthen your claims, not undermine them. Happy to discuss any of this.