Audience: developers and students who want to understand the framework behind WARNERCO's memory architecture and immediately know which file implements each piece.
Reading time: ~15 minutes.
Reading order: read this first, then run the live demo at coala-memory-walkthrough.md. Either order works, but this explainer assumes you haven't done the demo yet.
CoALA = Cognitive Architectures for Language Agents. Sumers, Yao, Narasimhan, and Griffiths, Transactions on Machine Learning Research, February 2024 (arXiv:2309.02427).
The paper takes 50 years of cognitive-science research on human memory and proposes that language agents should organize their memory the same way humans do — four distinct tiers, each with different read/write/decay rules. Not one big bucket called "memory."
The four tiers answer four different questions:
| Tier | The question | Human analogue |
|---|---|---|
| Working | "What am I thinking about right now?" | Inner monologue while reading this sentence |
| Episodic | "What happened to me?" | "I remember asking that 5 minutes ago" |
| Semantic | "What do I know about the world?" | "Hydraulics can overheat" — a fact, untethered from when you learned it |
| Procedural | "What do I know how to do?" | Riding a bike. The skill itself, not the memory of using it. |
Most pre-2024 "agent memory" systems were one vector store. Everything went in. Vector search was the only way out. That collapses four very different problems into one mediocre solution. CoALA's pitch is that each tier needs a different backend because they have different physics:
| Failure mode of "one vector store" | What CoALA fixes |
|---|---|
| A casual aside from 30s ago competes for retrieval ranking with a durable engineering fact from 6 months ago | Working memory and semantic memory live in different stores with different decay |
| No notion of "this happened to me at this time" — can't ask "what did we discuss yesterday?" | Episodic memory has explicit timestamps and a recency-weighted scoring formula |
| Procedural skills get embedded as text and randomly retrieved — agent might "discover" a workflow it shouldn't run | Procedural memory is user-invoked, versioned, and lives in source control, not in a searchable index |
| One write rule (embed + index) for everything — slow, expensive, one-size-fits-all | Each tier has its own write path; cheap stuff stays cheap |
Concrete WARNERCO example: when you ask "why is WRN-00006's thermal subsystem failing?", the agent should pull (a) what you just wrote in your scratchpad (working), (b) what you asked 2 turns ago (episodic), (c) the schematic itself (semantic), and (d) the diagnostic prompt template (procedural). One bucket can't do that without leaking context across categories.
The agent's scratchpad for the current session. Observations, inferences, and notes that should inform the next reasoning step but probably don't matter next week. Volatile in spirit — even though we persist it to SQLite for class demos, the contract is "session-scoped."
Properties: cheap to write, cheap to read, small in volume, session-scoped, no relevance scoring (just recency-ordered injection).
| What | Where |
|---|---|
| Adapter | app/adapters/scratchpad_store.py |
| Storage | data/scratchpad/notes.db (SQLite) |
| Models | app/models/scratchpad.py |
| LangGraph node | inject_scratchpad — see app/langgraph/flow.py line 243 |
| MCP tools | warn_scratchpad_write, warn_scratchpad_read, warn_scratchpad_clear, warn_scratchpad_stats |
Predicate vocabulary (cognitive operations on entries):
observed · inferred · relevant_to · summarized_as · contradicts · supersedes · depends_on
Special features in WARNERCO (beyond raw CoALA):
- LLM minimization on write — content gets compressed by an LLM call before storage; the original is kept side-by-side.
- LLM enrichment on write — context-expansion runs once at write time, cached in the row, so reads are pure SQL (no LLM cost).
- Token budget for injection — the LangGraph node respects a configurable budget (default 1500 tokens) so working memory can't dominate the prompt.
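The token-budget behavior is easy to picture in code. Here is a minimal, runnable sketch of recency-ordered, budget-capped injection — illustrative only: the `Note` dataclass stands in for the real model in app/models/scratchpad.py, and the 4-chars-per-token estimate is an assumption, not the project's tokenizer.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Note:  # minimal stand-in for the real scratchpad row model
    subject: str
    predicate: str
    object_: str
    content: str
    created_at: datetime

def inject_scratchpad(entries: list[Note], budget_tokens: int = 1500) -> str:
    """Newest-first injection under a token budget, so working memory
    can't dominate the prompt. Token cost uses a rough 4-chars-per-token
    estimate (an assumption, not the project's tokenizer)."""
    lines: list[str] = []
    used = 0
    for e in sorted(entries, key=lambda n: n.created_at, reverse=True):
        cost = max(1, len(e.content) // 4)
        if used + cost > budget_tokens:
            break  # stop injecting once the budget is spent
        lines.append(f"[{e.predicate}] {e.subject} -> {e.object_}: {e.content}")
        used += cost
    return "\n".join(lines)
```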
```
Tools → warn_scratchpad_write
subject: WRN-00006
predicate: observed
object_: thermal_system
content: Operator reports thermal trips during heavy hydraulic load
minimize: true
enrich: true
```
Then warn_scratchpad_stats shows tokens_saved and savings_percentage — the LLM minimization in action.
Specific past events with timestamps. What happened, when, how important it was. The agent's autobiographical record. Append-only — you don't edit yesterday's events; they just get less relevant as time passes.
Properties: append-only, time-indexed, importance-scored at write, recall is the Park et al. recency × importance × relevance formula.
From Generative Agents (Park et al., UIST 2023), implemented verbatim in EpisodicStore.recall():
```
total = α_recency    · 0.5^(hours_since / half_life_hours)
      + α_importance · stored_importance
      + α_relevance  · cosine(query_terms, summary_terms)
```
WARNERCO defaults (override via EPISODIC_* env vars):
| Knob | Default | What it does |
|---|---|---|
| EPISODIC_WEIGHT_RECENCY | 0.4 | How much the agent values "I remembered this recently" |
| EPISODIC_WEIGHT_IMPORTANCE | 0.3 | How much the agent values "this seemed important when it happened" |
| EPISODIC_WEIGHT_RELEVANCE | 0.3 | How much the agent values "this matches what I'm asked about now" |
| EPISODIC_RECENCY_HALF_LIFE_HOURS | 24.0 | How fast recency decays — 24h means a 1-day-old event scores half a brand-new one |
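Here is the whole formula as runnable Python — a sketch that mirrors the description above, assuming dict-shaped events; the real implementation is EpisodicStore.recall() in app/adapters/episodic_store.py and its signature may differ.

```python
import math
import time

def _relevance(query_terms: set[str], summary_terms: set[str]) -> float:
    """Bag-of-words cosine over binary term vectors — the pedagogical shortcut."""
    if not query_terms or not summary_terms:
        return 0.0
    overlap = len(query_terms & summary_terms)
    return overlap / math.sqrt(len(query_terms) * len(summary_terms))

def score_event(event: dict, query: str, *,
                w_recency: float = 0.4, w_importance: float = 0.3,
                w_relevance: float = 0.3, half_life_hours: float = 24.0) -> float:
    """Park et al. recency + importance + relevance, with the WARNERCO defaults."""
    hours_since = (time.time() - event["created_at"]) / 3600.0
    recency = 0.5 ** (hours_since / half_life_hours)  # exponential decay
    relevance = _relevance(set(query.lower().split()),
                           set(event["summary"].lower().split()))
    return (w_recency * recency
            + w_importance * event["importance"]
            + w_relevance * relevance)
```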
Crucially, warn_episodic_recall returns the per-event score breakdown in the response, so you can see why each memory surfaced. That's the slide-worthy moment of the demo.
| What | Where |
|---|---|
| Adapter | app/adapters/episodic_store.py |
| Storage | data/episodic/events.db (SQLite, single 8-column table) |
| Models | app/models/episodic.py |
| LangGraph nodes | recall_episodes (gated to ANALYTICS/DIAGNOSTIC, line 276) + log_episode (always, line 547) in app/langgraph/flow.py |
| MCP tools | warn_episodic_log, warn_episodic_recall, warn_episodic_recent, warn_episodic_stats |
Event kinds: USER_TURN · AGENT_RESPONSE · TOOL_CALL · OBSERVATION
Pedagogical simplifications, flagged in code:
- Bag-of-words cosine for relevance, not embeddings. Three lines of readable code, zero embedding spend per recall. Swap to embeddings in the _relevance() helper — a comment in episodic_store.py shows where.
- Importance heuristic in log_episode: errors → 0.8, diagnostic → 0.6, analytics → 0.4, search/lookup → 0.3. Real agents would learn this; we keep it as a transparent rule a student can read in two lines (sketched just below).
- Episodic recall is gated to ANALYTICS and DIAGNOSTIC intents only. Pure ID lookups don't need session history; ungated recall just pollutes the context.
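The importance heuristic really is a two-line rule. A sketch of it as stated above, assuming intent labels match the pipeline's LOOKUP / DIAGNOSTIC / ANALYTICS / SEARCH:

```python
def importance_for(intent: str, errored: bool = False) -> float:
    """log_episode's transparent importance rule, written out as a sketch."""
    if errored:
        return 0.8  # errors are the most memorable events
    return {"DIAGNOSTIC": 0.6, "ANALYTICS": 0.4}.get(intent, 0.3)  # search/lookup → 0.3
```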
After running 2–3 diagnostic queries with the same session_id:
```
Tools → warn_episodic_recall
query: thermal hydraulic problems
k: 5
```
The scores array in the response shows {recency, importance, relevance, total} for each event. Point to it. That's the formula made visible.
Durable, generalizable knowledge. Facts. The kind of thing you wouldn't say "I remember when I learned that" about — you just know it. Expensive to write (embeddings cost money), and contradictions need conflict resolution.
Properties: durable, persistent across sessions, expensive to write, scales to thousands of records.
Three swappable backends behind a single interface, plus a knowledge graph alongside:
| Backend | When to use | File |
|---|---|---|
| JSON | Prototyping, fastest startup, keyword-only fallback | app/adapters/json_store.py |
| Chroma | Local dev, true vector search, ~2-5s startup | app/adapters/chroma_store.py |
| Azure AI Search | Enterprise / production scale | app/adapters/azure_search_store.py |
Switch via MEMORY_BACKEND={json,chroma,azure_search} in .env. Same MemoryStore interface, three implementations.
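A switch like that usually reduces to a tiny factory. A sketch, with assumed constructor names — only ChromaMemoryStore is confirmed elsewhere in this doc; the other two class names are guesses:

```python
import os

def make_memory_store():
    """Pick the semantic-memory backend from .env — a sketch of the switch,
    not the project's actual factory."""
    backend = os.getenv("MEMORY_BACKEND", "json")
    if backend == "chroma":
        from app.adapters.chroma_store import ChromaMemoryStore
        return ChromaMemoryStore()
    if backend == "azure_search":
        from app.adapters.azure_search_store import AzureSearchMemoryStore  # assumed class name
        return AzureSearchMemoryStore()
    from app.adapters.json_store import JsonMemoryStore  # assumed class name
    return JsonMemoryStore()
```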
| What | Where |
|---|---|
| Source of truth | data/schematics/schematics.json — 25 robot schematics |
| Vector store cache | data/chroma/ (gitignored — regenerate with ChromaMemoryStore().index_all()) |
| Knowledge graph | app/adapters/graph_store.py — SQLite + NetworkX, 117 entities and 221 relationships |
| LangGraph node | retrieve — see app/langgraph/flow.py line 312 |
| MCP tools | warn_semantic_search, warn_get_robot, warn_list_robots, warn_index_schematic, plus 6 more (CRUD + comparison + explanation) |
| Graph tools | warn_graph_neighbors, warn_graph_path, warn_graph_stats, warn_add_relationship |
Note that the knowledge graph is also semantic memory in CoALA terms — same tier, different shape. Vector search finds similar things; graph queries find connected things. Use both for comprehensive retrieval. The 9-node LangGraph pipeline runs query_graph before retrieve so graph context primes vector search.
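The "graph primes vector search" idea fits in a few lines. A sketch with illustrative function and parameter names — the real node is query_graph in app/langgraph/flow.py, and the MemoryStore search signature is assumed:

```python
import networkx as nx

def graph_primed_search(store, graph: nx.Graph, query: str,
                        seed_entity: str | None, top_k: int = 5):
    """Expand the query with graph neighbors before vector search —
    connected things prime the search for similar things."""
    neighbors: list[str] = []
    if seed_entity and seed_entity in graph:
        neighbors = [str(n) for n in graph.neighbors(seed_entity)]
    expanded = " ".join([query, *neighbors]) if neighbors else query
    return store.search(expanded, top_k=top_k)  # MemoryStore interface assumed
```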
```
Tools → warn_semantic_search
query: thermal sensors for cold environments
session_id: class-demo
top_k: 5
```
Versioned skills/workflows the agent can perform. System prompts, tool definitions, slash-command playbooks. The instructions for how to do something, not the memory of having done it.
Properties: user-invoked (this is the critical CoALA design choice), versioned, lives in source control with CI publishing changes.
CoALA is explicit about this — direct quote from the paper:
"Learning new actions by writing to procedural memory is significantly riskier than writing to episodic or semantic memory, as it can easily introduce bugs or allow an agent to subvert its designers' intentions."
This is why MCP Prompts are surfaced as slash commands invoked by the user, not auto-pulled by the model. The agent doesn't get to discover its own skills mid-flight.
| What | Where |
|---|---|
| Source | @mcp.prompt() decorators in app/mcp_tools.py |
| Catalog helper | app/adapters/coala_overview.py (PROCEDURAL_PROMPTS constant) |
| Catalog resource | memory://procedural-catalog |
| Versioning | Each prompt's docstring is tagged [CoALA: procedural memory — version 1.0.0] |
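What a registration looks like in practice — a sketch assuming the FastMCP package; the prompt body is invented for illustration, and only the decorator pattern and the version tag come from this doc:

```python
from fastmcp import FastMCP

mcp = FastMCP("warnerco")

@mcp.prompt()
def diagnostic_prompt(robot_id: str) -> str:
    """Structured diagnostic analysis for one schematic.

    [CoALA: procedural memory — version 1.0.0]
    """
    # Illustrative body — the real template lives in app/mcp_tools.py
    return (
        f"You are a WARNERCO diagnostic engineer. Analyze schematic {robot_id}: "
        "list symptoms, affected subsystems, and recommended next checks."
    )
```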
The five registered prompts:
| Prompt | What it does |
|---|---|
| diagnostic_prompt | Structured diagnostic analysis for one schematic |
| comparison_prompt | Side-by-side comparison of two schematics |
| search_strategy_prompt | Coaching prompt — how to search the catalog effectively |
| maintenance_report_prompt | Maintenance-report template for a robot model |
| schematic_review_prompt | Technical-review checklist for a schematic |
```
Resources → memory://procedural-catalog
```
Returns a JSON list of all five prompts with their versions and descriptions. In Claude Desktop or VS Code Copilot Chat, the same prompts appear as slash commands in the UI — /diagnostic_prompt WRN-00006, etc.
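The catalog resource itself is little more than a wrapper around the PROCEDURAL_PROMPTS constant. A sketch, reusing the mcp instance from the prompt sketch above; the JSON shape is assumed, not verified:

```python
import json

from app.adapters.coala_overview import PROCEDURAL_PROMPTS

@mcp.resource("memory://procedural-catalog")
def procedural_catalog() -> str:
    """Expose the prompt catalog for inspection — a sketch; the real
    registration lives in app/mcp_tools.py."""
    return json.dumps(PROCEDURAL_PROMPTS)
```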
Working and episodic memory are short-term. Semantic is long-term. Consolidation is the bridge — the moment an agent decides "this short-term observation is durable enough to commit to long-term knowledge." Same idea as your brain consolidating short-term memory into long-term during sleep.
| What | Where |
|---|---|
| Module | app/langgraph/consolidate.py |
| MCP tool | warn_consolidate_memory |
| Mechanism | MCP Sampling (ctx.sample()) — the server asks the client's LLM to extract durable facts |
| Output | Synthetic Schematic records with id="FACT-*", category="consolidated_fact", model="MEMORY", written to the vector store |
The flow:
- Pull recent scratchpad entries (working memory) + recent episodes (episodic memory)
- Build a single prompt asking the LLM to extract durable, generalizable facts
- Call ctx.sample(...) — server-initiated LLM completion via the client
- Write each fact as a synthetic Schematic row into the vector store
- Log the consolidation cycle itself as an OBSERVATION event in episodic memory — so consolidation becomes its own memory
Pedagogical simplification: ADD-only. We don't implement Mem0's full ADD/UPDATE/DELETE/NOOP loop. Each cycle just appends new facts. Production agents need conflict resolution; that's homework. Source: header docstring of consolidate.py calls this out explicitly.
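Put together, the cycle is short. A condensed sketch — the Fact dataclass, the adapter handles (store, log_episode), the prompt wording, and the .text access on the sampled result are all illustrative stand-ins; the real ~120 lines are in app/langgraph/consolidate.py:

```python
from dataclasses import dataclass
from uuid import uuid4

@dataclass
class Fact:  # stand-in for the synthetic Schematic records (FACT-*)
    id: str
    model: str
    category: str
    description: str

async def consolidate(ctx, notes: list[str], episodes: list[str],
                      store, log_episode, max_facts: int = 3) -> list[Fact]:
    """One consolidation cycle. `store` and `log_episode` stand in for the
    real vector-store and episodic adapters."""
    prompt = (
        f"Extract up to {max_facts} durable, generalizable facts, one per line.\n\n"
        "Working-memory notes:\n" + "\n".join(notes)
        + "\n\nRecent episodes:\n" + "\n".join(episodes)
    )
    result = await ctx.sample(prompt)  # MCP Sampling: the client's LLM answers
    facts = [Fact(id=f"FACT-{uuid4().hex[:8]}", model="MEMORY",
                  category="consolidated_fact", description=line.strip())
             for line in result.text.splitlines() if line.strip()][:max_facts]
    for fact in facts:
        store.add(fact)                # ADD-only: no UPDATE/DELETE/NOOP dedup
    log_episode(kind="OBSERVATION",    # the cycle becomes its own memory
                summary=f"Consolidated {len(facts)} facts")
    return facts
```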
After populating scratchpad + episodic memory with a few entries:
```
Tools → warn_consolidate_memory
since_minutes: 60
max_facts: 3
session_id: class-demo
```
Then warn_semantic_search query="consolidated" — the new FACT-* records appear in retrieval. And warn_episodic_recent shows the new OBSERVATION event recording the cycle.
This is the elegant part. The five MCP primitives line up almost perfectly with CoALA operations:
| MCP Primitive | What it does in CoALA terms | WARNERCO example |
|---|---|---|
| Tools | Read/write all four tiers | warn_scratchpad_write writes working; warn_episodic_recall reads episodic; warn_semantic_search reads semantic |
| Resources | Expose memory state for inspection | memory://coala-overview shows live counts of all four tiers |
| Prompts | ARE procedural memory | The 5 @mcp.prompt() registrations |
| Sampling | Drives consolidation (working/episodic → semantic) | ctx.sample() inside warn_consolidate_memory |
| Elicitation | User-confirmed risky writes | Used by warn_replacement_advisor for multi-step user input |
If you're designing a memory-aware MCP server from scratch: pick your CoALA tiers first, then choose which primitive carries each tier's reads and writes. Architecture follows the framework, not the other way around.
The pipeline at app/langgraph/flow.py is the canonical example of all four tiers exercised in one turn:
```
parse_intent → query_graph → inject_scratchpad → recall_episodes →
retrieve → compress_context → reason → respond → log_episode
```
| Node | CoALA tier touched | What it does |
|---|---|---|
| parse_intent | (control) | Classify query as LOOKUP / DIAGNOSTIC / ANALYTICS / SEARCH |
| query_graph | semantic (graph variant) | Pull related entities from the knowledge graph |
| inject_scratchpad | working | Add session notes to context |
| recall_episodes | episodic (read) | Pull top-k past events using Park et al. — gated to DIAGNOSTIC/ANALYTICS |
| retrieve | semantic (vector) | Vector search the schematics |
| compress_context | (control) | Stitch all four context blocks together within budget |
| reason | (LLM) | Call Azure OpenAI with the compressed context |
| respond | (control) | Format response |
| log_episode | episodic (write) | Append this turn as a new event with importance heuristic |
Procedural memory isn't in the pipeline because it's user-invoked — slash commands trigger prompts directly, not via the RAG flow.
(Source: docs/diagrams/schematica-memory-architecture.mmd)
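The wiring itself is compact. A topology-only sketch that flattens the conditional gate on recall_episodes into a straight line; it assumes the node functions are importable from flow.py and uses a plain dict where the real graph uses a typed state:

```python
from langgraph.graph import StateGraph, END

from app.langgraph import flow  # the real node functions

NODES = ["parse_intent", "query_graph", "inject_scratchpad", "recall_episodes",
         "retrieve", "compress_context", "reason", "respond", "log_episode"]

g = StateGraph(dict)  # flow.py uses a typed state; dict keeps the sketch short
for name in NODES:
    g.add_node(name, getattr(flow, name))
g.set_entry_point("parse_intent")
for src, dst in zip(NODES, NODES[1:]):
    g.add_edge(src, dst)  # real code gates recall_episodes on intent
g.add_edge("log_episode", END)
pipeline = g.compile()
```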
WARNERCO is a teaching app, not a production agent. Four deliberate simplifications, each flagged in code with a # CoALA NOTE: comment:
- Bag-of-words cosine for episodic relevance, not embeddings. Three lines of readable code instead of an embedding API call per recall. Swap-in is a 3-line edit at _relevance() in episodic_store.py. Production agents use embeddings.
- ADD-only consolidation. No Mem0-style ADD/UPDATE/DELETE/NOOP dedup against existing facts. The pattern is shown; production hardening is the reader's homework.
- Consolidated facts share the Schematic shape. A cleaner design would give them a separate "fact" collection, but that requires forking three vector adapters. Tags + a category prefix make them filterable; that's enough for class.
- No bi-temporal edges (Zep/Graphiti style). Episodic memory has only created_at, not t_valid/t_invalid. The Park et al. recency decay covers the temporal teaching point at half the schema complexity.
If you're using WARNERCO as a starting point for a real product, the homework list is exactly those four bullets.
- Run the live demo: coala-memory-walkthrough.md — a ~4-minute classroom path that hits all four tiers in one continuous scenario.
- Read the source — the files ARE documentation:
  - app/adapters/episodic_store.py — the most heavily commented adapter; reads top-to-bottom as a tutorial
  - app/langgraph/consolidate.py — the sleep-cycle pattern in ~120 lines
  - app/adapters/coala_overview.py — the live four-tier snapshot helper
- Go deeper on the theory: research_synthesis/claude-deep-research.md — the longest of four independent deep-research reports converging on CoALA. Covers Mem0, Zep/Graphiti, Letta/MemGPT, and the production-readiness gaps WARNERCO deliberately leaves open.
- Read the original CoALA paper: Sumers et al. 2024, arXiv:2309.02427.
CoALA says agent memory needs four tiers, not one bucket — Working, Episodic, Semantic, Procedural — because they have different read/write/decay rules.
WARNERCO implements all four in one Python codebase under src/warnerco/backend/app/, exercises them in a 9-node LangGraph pipeline, exposes 28 MCP tools across them, surfaces the live state via memory://coala-overview, and ties them together with a "sleep cycle" consolidation tool that uses MCP Sampling to promote short-term observations into long-term knowledge.
That's the framework. Now go run the demo.