WMB-100K — Wontopos Memory Benchmark v2.1

An enterprise-scale situational benchmark for AI memory systems — 4.3M tokens, 2,708 questions.

Store 4.3M tokens (2.3M documents + 105K conversation turns), then prove your memory system can retrieve the right information for real-world situations.

What WMB-100K Tests

WMB-100K measures: can your memory system retrieve the right information for real situations?

Not LLM reasoning ability
Not response generation quality
Only situational retrieval accuracy + false memory defense

Memory systems don't answer questions — they provide information to LLMs. WMB-100K tests whether the memory system returned the right memories for the situation. The LLM interpretation is out of scope.

Important: Pure retrieval systems (e.g., WME) store and retrieve memories only. They do not interpret or reason about them. To use retrieved memories in production, connect your memory system with an LLM or a personality layer (e.g., WPE) that interprets the memories for the situation. Systems that internally use LLMs (e.g., Hindsight, Mem0) process memories before returning them, which may affect both accuracy and speed.

What Changed in V2

	V1	V2
Questions	Fact lookup ("What time does the user wake up?")	Situational ("Should we schedule a morning meeting?")
Scoring	Keyword matching	GPT-4o-mini semantic judge
Focus	Did you find the fact?	Did you bring the right memories for the situation?

Scale

Benchmark	Turns	Tokens	Questions	False Memory Test
LOCOMO (Maharana et al., 2024)	600	~50K	~1,500	No
LongMemEval (Wu et al., 2024)	~1,000	~100K	500	No
WMB-100K	105,591	4.3M	2,708	Yes (400)

Part	Data	Tokens
Part A	10 document domains (Wikipedia, public domain)	2.3M
Part B	10 conversation categories (~10K turns each)	~2.0M
Total	Store all 4.3M tokens to answer all questions	~4.3M

How It Works

Phase 1: Ingestion

Feed your memory system all data: 2.3M tokens of documents (10 domains) + 105,591 turns of conversation (10 categories).

Phase 2: Query

Ask 2,708 situational questions. Your system returns relevant memories.

Phase 3: Score

LLM judges evaluate whether returned memories contain the information needed for each situation. Quick mode uses 1 judge, Official mode uses 3 judges with majority vote.

Question Types

S1 — Scored (determines your final score)

Single-memory situational questions. One relevant memory needed to address the situation.

Part A S1: 1,000 questions (documents)
Part B S1: 1,000 questions (conversation)

Example: "A friend invites the user to an early breakfast at 7am. What should they know about the user's morning routine?"

S2-S7 — Analysis (accuracy % reported separately, not scored)

Type	Description	Memories Needed	Difficulty
S2 Multi-Memory	Combine 2-3 memories	2-3	★★
S3 Cross-Category	Connect different domains	2-3	★★★
S4 Temporal	Track changes over time	2+	★★★
S5 Adversarial	Wrong premise, retrieve correct memory	1-2	★★★★
S6 Contradiction	User said conflicting things, retrieve both	2+	★★★★
S7 Reasoning Chain	3+ memories needed in sequence	3+	★★★★★

S2-S7 use the same judge (Quick or Official mode) with CORRECT/WRONG binary scoring, reported as accuracy percentages.

FM — False Memory (penalty)

400 questions about things never mentioned. Correct response: return nothing.

Scoring

Part B:  1,000 S1 questions × 0.1 = 100 points max

Score = Part B - FM Penalty - Speed Penalty = 100 max

FM Penalty: each false positive × -0.25 (200 Part B probes, max -50)
Speed Penalty: applied per question based on response latency
Scores can go negative (no 0 clamping). Actual scores are always shown.

Part A (optional): 1,000 S1 document questions. Reported separately for systems
that support document ingestion. Not included in the main score.

Two Scoring Modes

	Quick Mode	Official Mode
Judge	GPT-4o-mini (1 LLM)	3 LLMs majority vote
LLMs used	GPT-4o-mini only	GPT-4o-mini + Claude Haiku + Gemini Flash
Passing rule	1/1 CORRECT	2/3 CORRECT (majority)
Cost	~$0.36	~$1.16
Use case	Self-testing, development	Leaderboard submission
Label	"Self-tested"	"Verified"

Official mode prevents bias toward any single LLM's judgment. A question is CORRECT only if at least 2 out of 3 judges agree.

Judge Prompt

Each question includes required_memories — the specific information the system must return.

Judge Input:
  Question: "Should we schedule a morning meeting with the user?"
  Required: ["user wakes up at 7:15", "user is not a morning person"]
  Returned: [what your system returned]

Judge Output: CORRECT or WRONG

The exact judge prompt is in scripts/score.py. Temperature: 0. No partial credit.

Fairness note: The judge receives only the question, required_memories, and returned memories. The gold_answer field is never shown to the judge to prevent anchoring bias.

Speed Penalty

Response latency matters in production. A system that takes 60 seconds to recall a memory is unusable regardless of accuracy.

Response Time	Penalty per Question
0 - 300ms	0 (no penalty)
300ms - 500ms	-0.01
500ms - 1,000ms	-0.05
1,000ms+	-0.1

Speed is measured per recall/search call, not including ingestion time.

FM Penalty Ratio (-0.25 vs +0.1)

The 2.5x penalty reflects that false memories are more harmful than missing memories. A missing memory means "I don't know" — inconvenient but safe. A false memory means confidently returning wrong information — potentially dangerous in production (wrong medical history, wrong legal details, wrong user preferences).

Grades

Score	Grade
90-100	Exceptional
80-89	Excellent
70-79	Good
60-69	Fair
50-59	Below Average
0-49	Failing

Results

WMB-100K v2.1

System	Part B (/100)	FM Penalty	Speed Penalty	Score (/100)	Grade	p50	p95	Judge
No results yet	—	—	—	—	—	—	—	—

S2-S7 Accuracy:

System	S2	S3	S4	S5	S6	S7
No results yet	—	—	—	—	—	—

Results will be published when Official mode (3-LLM majority vote) testing is complete. Submit your own results via GitHub Issues.

WME Detailed Results

WME is a retrieval-only system — it stores and retrieves memories without any LLM. Results vary by evaluation method:

Configuration	Score	Evaluation	Why
WME (Official)	—	3-LLM majority (GPT-4o-mini + Claude Haiku + Gemini Flash)	Official scored result. Same standard as all other systems.
WME + WPE (Official)	—	3-LLM majority (GPT-4o-mini + Claude Haiku + Gemini Flash)	Full pipeline: WME retrieves, WPE interprets. Official scored.

WME Official: Scored by the same 3-LLM majority vote as every other system. No special treatment.
WME + WPE Official: Full pipeline scored by the same 3-LLM majority vote. WME retrieves, WPE interprets.

WME alone is the engine. WPE is the brain. Together, they form the complete system.

Note on V1 results: In V1 testing with keyword matching, Mem0 retrieved 84 correct memories out of 2,224 questions (3.8%) and LangChain retrieved 527 (23.7%). However, both scored 0.0 net because FM penalty (-100) exceeded raw points. The 0.0 score reflects net-after-penalty, not zero retrieval. V2 results with semantic scoring may differ.

Why WMB-100K Exists

Existing benchmarks (LOCOMO, LongMemEval) have a fundamental problem: every vendor scores themselves using different evaluation methods, then claims #1.

LOCOMO's official scoring (Token-Overlap F1) gives GPT-4 full context only 32.1%
Yet vendors self-report 60-90% using their own evaluation methods
Different scoring criteria make cross-vendor comparison meaningless

WMB-100K solves this with fixed judges that everyone must use:

Mode	Judges	Rule
Quick	GPT-4o-mini	1/1 CORRECT
Official	GPT-4o-mini + Claude Haiku 4.5 + Gemini 2.0 Flash	2/3 majority

No self-scoring. No custom evaluation. Same 3 models judge everyone equally.

Cross-Benchmark Reference

Note: These scores are not directly comparable. Each benchmark uses different evaluation methods, datasets, and scoring criteria. Vendors self-report using their own methods.

System	LOCOMO (600 turns)	LongMemEval (1K turns)	WMB-100K (100K turns)
Full Context (GPT-4)	32.1% (official F1)	—	$1,638+ per run
Mem0	66.9% (self-reported)	49.0% (self-reported)	Testing
Supermemory	#1 (self-reported)	81.6% (self-reported)	Not tested
Hindsight	—	91.4% (self-reported)	Incomplete
OpenAI Memory	52.9%	—	Not tested

Cost of Full Context Approach

Model	Estimated cost per run
GPT-4o-mini	~$98
GPT-4o	~$1,638
Claude Sonnet	~$1,967
Claude Opus	~$9,835

Estimates based on per-question context loading at March 2026 pricing.

Submit Your Results

To add your system to the leaderboard, open a GitHub Issue with your result.json file. We will verify and add it.

Data

Synthetic Conversation Data

Conversation data (Part B) was generated using Claude Haiku (Anthropic). Each category contains ~10K turns of synthetic dialogue with ~100 facts naturally embedded in noise. The data is synthetic — not real user conversations.

Known limitations of synthetic data:

Conversations may be more structured than real human dialogue
Fact distribution may be more uniform than organic conversations
Emotional/social dynamics may be simplified

Document Data (Part A)

Document data is sourced from Wikipedia (public domain, Creative Commons). 10 domains, ~230K tokens each.

Data Schema

Question format (all_questions.json):

{
  "id": "daily_life.S1.001",
  "category": "daily_life",
  "qtype": "S1Situational",
  "text": "A friend invites the user to breakfast at 7am. What should they know?",
  "gold_answer": "The user wakes up at 7:15 and is not a morning person.",
  "required_memories": ["user wakes up at 7:15", "user is not a morning person"],
  "gold_turn_ids": [120, 453],
  "points": 0.1,
  "false_penalty": 0.0
}

Conversation turn format ({category}.jsonl):

{
  "turn_id": 120,
  "speaker": "user",
  "text": "I usually wake up around 7:15, barely making it on time...",
  "embedded_facts": ["daily_life.004"]
}

10 Conversation Categories

#	Category	Topics	Facts
1	`daily_life`	Routines, meals, habits	100
2	`relationships`	Family, friends, partner	100
3	`work_career`	Projects, salary, promotion	100
4	`health_fitness`	Exercise, injuries, diet	100
5	`travel_places`	Trips, restaurants, moving	100
6	`media_taste`	Movies, books, music, games	100
7	`finance_goals`	Savings, loans, investments	100
8	`pets_hobbies`	Photography, climbing, cat	100
9	`education_skills`	Languages, courses, certs	100
10	`beliefs_values`	Philosophy, politics, goals	100

10 Document Domains

#	Domain	Source	Tokens
1-10	Daily Life, Economics, History, Law, Literature, Medicine, Philosophy, Psychology, Science, Technology	Wikipedia	~230K each

Quick Start

Requirements

Python 3.10+
pip install openai anthropic
OpenAI API key (for scoring — Quick ~$0.36, Official ~$1.16)
Your memory system with store/search interface

Run

# 1. Clone
git clone https://github.com/Irina1920/WMB-100K
cd WMB-100K

# 2. Install dependencies
pip install openai anthropic

# 3. Write an adapter (see scripts/test_mem0.py for example)

# 4. Run your adapter
export OPENAI_API_KEY=sk-...
python scripts/your_adapter.py full

# 5. Score (Quick mode — 1 judge, ~$0.36)
python scripts/score.py datasets/answers.json "YourSystem"

# 5b. Score (Official mode — 3 judges majority vote, ~$1.16)
python scripts/score.py datasets/answers.json "YourSystem" --official

Adapter Template

def store(user_id: str, content: str) -> None:
    """Store a memory."""
    your_system.add(content, user_id=user_id)

def search(user_id: str, query: str) -> list[str]:
    """Search memories, return relevant text."""
    results = your_system.search(query, user_id=user_id)
    return [r["text"] for r in results]

See scripts/test_mem0.py and scripts/test_langchain.py for working examples.

Limitations

Synthetic conversations: Generated by Claude Haiku, not real user data. Real conversations are messier, more ambiguous, and less structured.
English only: All questions and data are in English. Performance on other languages is untested.
Two systems tested: Only Mem0 and LangChain FAISS have been tested so far. Results may not generalize to all memory systems.
LLM judge variability: Different LLMs produce different scores (up to 10% variance). Official mode uses 3-LLM majority vote to mitigate this. Quick mode uses GPT-4o-mini only.
Wikipedia documents: Part A uses Wikipedia text, which may not represent domain-specific enterprise documents.
FM question design: False memory probes are synthetically generated and may not cover all realistic hallucination patterns.
Vendor-created benchmark: WMB-100K was created by Wontopos, which also develops WME (a memory system tested on this benchmark). While the benchmark uses independent LLM judges and all data/code is open source for verification, users should be aware of this relationship. We encourage independent reproduction and welcome third-party audits.

Cost to Run

Step	Quick Mode	Official Mode
Dataset (included in repo)	$0	$0
Scoring	~$0.36 (1 judge)	~$1.16 (3 judges)
Your system's ingestion costs	Varies	Varies

Citation

@misc{wmb100k2026,
  title={WMB-100K v2.1: A 100,000-Turn Situational Benchmark for AI Memory Systems},
  author={Wontopos},
  year={2026},
  url={https://github.com/Irina1920/WMB-100K}
}

License

Apache 2.0 — Dataset, benchmark tool, and scoring system are free to use.

Fairness

We are committed to making WMB-100K a fair and unbiased benchmark. If you find any issues with the scoring methodology, question design, data quality, or anything that could affect fairness, please let us know immediately.

Open a GitHub Issue
Or email: official@wontopos.com

We will investigate and fix any reported issues promptly. A benchmark is only valuable if it is trusted.

Contact

Maintained by Wontopos.


General	official@wontopos.com
CEO	sunwoo.ceo@wontopos.com
Marketing	xcx135@wontopos.com
Frontend	LoseWoo@wontopos.com

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
datasets		datasets
documents		documents
scripts		scripts
src		src
.gitignore		.gitignore
COMPARISON.md		COMPARISON.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

WMB-100K — Wontopos Memory Benchmark v2.1

What WMB-100K Tests

What Changed in V2

Scale

How It Works

Phase 1: Ingestion

Phase 2: Query

Phase 3: Score

Question Types

S1 — Scored (determines your final score)

S2-S7 — Analysis (accuracy % reported separately, not scored)

FM — False Memory (penalty)

Scoring

Two Scoring Modes

Judge Prompt

Speed Penalty

FM Penalty Ratio (-0.25 vs +0.1)

Grades

Results

WMB-100K v2.1

WME Detailed Results

Why WMB-100K Exists

Cross-Benchmark Reference

Cost of Full Context Approach

Submit Your Results

Data

Synthetic Conversation Data

Document Data (Part A)

Data Schema

10 Conversation Categories

10 Document Domains

Quick Start

Requirements

Run

Adapter Template

Limitations

Cost to Run

Citation

License

Fairness

Contact

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages