Replies: 2 comments 1 reply
You raised some very important points and explained them well. Many corrections have been made to the documentation, and MemPalace's public tagline was recently changed from "highest-scoring AI memory system ever benchmarked" to "best-benchmarked open-source AI memory system", but that is still a claim the project cannot support. The R@K vs QA distinction has been documented in multiple issues besides #39 (#27, #29, #125, #314, #333, #367, #875).
Great discussion. The metric confusion here is worth addressing head-on. Retrieval Recall@5 ("did we find the right document in the top 5?") and QA Accuracy ("did we answer the question correctly?") are fundamentally different metrics, and the LongMemEval paper (ICLR 2025) reports both separately.
Reporting retrieval recall as if it were QA accuracy inflates scores by 20-30 percentage points. I recently ran the full LongMemEval benchmark against Celiums with end-to-end QA accuracy (retrieve, then an LLM synthesizes the answer, then a separate LLM judges it), testing 5 models.
Retrieval rate was 100%: the engine always found the right session. The gap down to 62.3% QA accuracy is pure synthesis difficulty (temporal reasoning, multi-session aggregation). Everything is reproducible: benchmarks/BENCHMARKS.md. Full analysis: celiums.ai/blog/mempalace-mirage-benchmark-fraud
The industry needs more projects publishing honest, reproducible numbers. Happy to discuss methodology.
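To make the two metrics concrete, here is a minimal sketch of both evaluation modes. Everything here is hypothetical scaffolding: `retrieve`, `synthesize_answer`, and `judge_answer` are placeholder callables, not APIs from any of the projects discussed.

```python
def retrieval_recall_at_k(question, gold_session_id, retrieve, k=5):
    """Retrieval-only: 1 if the gold session appears in the top k results."""
    hits = retrieve(question, top_k=k)
    return int(gold_session_id in [h["session_id"] for h in hits])

def qa_accuracy(question, gold_answer, retrieve, synthesize_answer,
                judge_answer, k=5):
    """End-to-end: retrieve, let an LLM write an answer, then have a
    separate LLM judge score it against the gold answer."""
    hits = retrieve(question, top_k=k)
    answer = synthesize_answer(question, hits)
    return int(judge_answer(question, gold_answer, answer))
```

The second function can fail even when the first returns 1, which is exactly the 100% retrieval vs 62.3% QA gap described above.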
Hey, great work on mempalace. The core insight — raw verbatim text + good embeddings beats LLM extraction — is genuinely valuable and your benchmark scripts are refreshingly reproducible.
I maintain agentmemory, a persistent memory system for AI coding agents. We recently ran LongMemEval-S ourselves and got 95.2% R@5 (BM25+vector hybrid) using the same all-MiniLM-L6-v2 embedding model. While comparing approaches, we dug into the benchmark methodology and wanted to share some findings that might help strengthen your claims.
What we verified
What concerns us
1. Metric category error on LongMemEval
LongMemEval is an end-to-end QA benchmark. Every score on the published leaderboard is QA accuracy (retrieve + generate answer + GPT-4o judge). The 96.6% is recall_any@5, a retrieval-only metric that never generates an answer or invokes a judge. This makes the number incomparable to anything on the leaderboard.
An independent tester (Issue #39) ran the full pipeline and got 82.6% QA accuracy — competitive but substantially different from 96.6%.
We label our own 95.2% explicitly as "retrieval recall, not end-to-end QA accuracy" in LONGMEMEVAL.md. Suggesting the same clarity here would help the community compare fairly.
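For readers unfamiliar with the metric, recall_any@k fits in a few lines. This is an illustrative sketch of the metric's definition, not mempalace's actual scoring code:

```python
def recall_any_at_k(retrieved_ids, gold_ids, k=5):
    """1 if ANY gold evidence session is in the top k retrieved, else 0.
    Note what is absent: no answer generation, no LLM judge."""
    return int(any(g in retrieved_ids[:k] for g in gold_ids))
```

Averaged over questions this produces a percentage that looks like an accuracy, but it only measures whether retrieval surfaced the evidence.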
2. The 100% R@5 score and the 3 targeted patches
The path from 96.6% → 100% involved 3 hand-coded patches for 3 specific failing questions (quoted-phrase boost, person-name boost, nostalgia pattern). Your own BENCHMARKS.md acknowledges this is "teaching to the test." The held-out split was created after the patches, not before.
The 98.4% held-out score is more credible but the split is post-hoc. A pre-registered dev/test split would make this bulletproof.
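A pre-registered split is cheap to implement; the key is freezing it (seed included) before any patching begins. A minimal sketch, with hypothetical function and parameter names:

```python
import random

def preregister_split(question_ids, dev_frac=0.5, seed=42):
    """Shuffle deterministically and split BEFORE any tuning.
    Patches may only be motivated by dev failures; test is scored once."""
    rng = random.Random(seed)
    ids = sorted(question_ids)  # canonical order so the split is reproducible
    rng.shuffle(ids)
    cut = int(len(ids) * dev_frac)
    return ids[:cut], ids[cut:]
```

Committing the seed and the resulting ID lists to the repo before tuning makes the held-out claim verifiable from git history alone.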
3. LoCoMo 100% with top_k=50
The 100% LoCoMo claim uses top_k=50 against conversations with at most 32 sessions, which retrieves the entire conversation. Your BENCHMARKS.md correctly notes that "the embedding retrieval step is bypassed entirely." The honest number (88.9% R@10 at top_k=10) should lead.
4. ConvoMem "2x Mem0"
The 92.9% is retrieval recall; Mem0's published numbers are QA accuracy. Different metrics on the same dataset.
5. --mode raw benchmarks ChromaDB, not mempalace
In raw mode, zero mempalace code executes: no palace, no wings, no rooms, no AAAK. The 96.6% is really a benchmark of ChromaDB + MiniLM-L6-v2. That is useful information, but it shouldn't be attributed to the palace architecture.
What we think is genuinely strong
How agentmemory approaches this differently
We took the opposite bet: compress observations into structured facts/narratives to keep context injection under a token budget (~2K tokens vs raw text). Our search uses triple-stream retrieval (BM25 + vector + knowledge graph) with RRF fusion.
On LongMemEval-S with the same embedding model, we're 1.4pp behind on R@5 but ahead on R@10 (98.6% vs ~97.6%). BM25 adds recall depth that pure vector search misses.
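The RRF fusion step mentioned above is simple enough to sketch in a few lines. This is illustrative, not agentmemory's actual code; the constant k=60 is the conventional default from the original RRF formulation:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over streams of 1/(k + rank).
    `ranked_lists` maps a stream name (e.g. bm25/vector/graph) to an
    ordered list of doc ids, best first. Returns ids in fused order."""
    scores = defaultdict(float)
    for ranking in ranked_lists.values():
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked moderately well by all three streams outscores one that a single stream ranks first, which is where the hybrid's R@10 depth comes from.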
Suggestion: the two approaches are complementary
mempalace excels at "what does this codebase contain?" (static corpus search).
agentmemory excels at "what did I do across sessions?" (temporal memory).
A developer could use both.
Would be interesting to explore integration points — mempalace's knowledge graph feeding into agentmemory's context injection, for example.
Not trying to start a flame war — genuinely think both projects push the space forward. The benchmark methodology feedback is meant to strengthen your claims, not undermine them. Happy to discuss any of this.