Skip to content

fix: use upsert and deterministic IDs to prevent data stagnation#140

Merged
bensig merged 5 commits intoMemPalace:mainfrom
igorls:fix/data-integrity
Apr 8, 2026
Merged

fix: use upsert and deterministic IDs to prevent data stagnation#140
bensig merged 5 commits intoMemPalace:mainfrom
igorls:fix/data-integrity

Conversation

@igorls
Copy link
Copy Markdown
Member

@igorls igorls commented Apr 7, 2026

Problem

1. Data stagnation in miner (HIGH)

miner.add_drawer() uses collection.add(), which throws on duplicate IDs. Modified files never get their palace content updated:

except Exception as e:
    if "already exists" in str(e).lower():
        return False  # ← silently ignored

2. Non-deterministic drawer IDs in MCP (MEDIUM)

tool_add_drawer() generates IDs using content[:100] + datetime.now().isoformat(). Same content at different times → different IDs → duplicate entries.

Fix

Deterministic IDs

-drawer_id = hashlib.md5((content[:100] + datetime.now().isoformat()).encode())
+drawer_id = hashlib.md5(content.encode())

Same content always produces the same ID. Combined with upsert, this makes add_drawer idempotent.

Upsert

Both miner.add_drawer() and tool_add_drawer() now use collection.upsert():

  • New content → inserted
  • Existing ID with updated content → updated
  • Identical content → no-op (same embeddings)

92 tests pass (includes test infrastructure from PR #131).

Copilot AI review requested due to automatic review settings April 7, 2026 20:28
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates drawer ingestion to avoid stale data by enabling overwrites on existing IDs, and adds/expands automated tests while migrating packaging metadata toward uv/hatchling.

Changes:

  • Switch ChromaDB writes from add() to upsert() in the miner and MCP write path, and make MCP drawer IDs deterministic.
  • Add substantial pytest coverage for MCP server tools, search API, dialect, and knowledge graph behavior with shared fixtures.
  • Remove requirements.txt and update pyproject.toml build/dev tooling configuration.

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
mempalace/miner.py Uses collection.upsert() for drawer writes (intended to allow updates).
mempalace/mcp_server.py Uses deterministic content-based drawer IDs and upsert() for MCP-added drawers.
mempalace/__init__.py Bumps __version__ to 3.0.0.
tests/conftest.py Adds shared fixtures for isolated palace/KG setup and seeded Chroma collections.
tests/test_mcp_server.py Adds unit/integration tests for MCP request dispatch and tool handlers.
tests/test_searcher.py Adds tests for search_memories() programmatic API and filters.
tests/test_knowledge_graph.py Adds tests covering entity/triple lifecycle, temporal queries, timeline, stats.
tests/test_dialect.py Adds tests for AAAK dialect compression/encoding/decoding utilities.
pyproject.toml Moves to hatchling, adds dependency groups, updates dev deps.
requirements.txt Removed in favor of pyproject.toml/uv.lock.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/miner.py
Comment on lines 416 to 420
"""Add one drawer to the palace."""
drawer_id = f"drawer_{wing}_{room}_{hashlib.md5((source_file + str(chunk_index)).encode()).hexdigest()[:16]}"
try:
collection.add(
collection.upsert(
documents=[content],
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching to collection.upsert() enables updating an existing drawer ID, but process_file() still short-circuits on file_already_mined() (same module) and returns before any chunks are (re)added. As a result, modified files will still never reach this upsert call, so the data-stagnation problem described in the PR remains. Consider changing the skip logic to re-mine when the source file changes (e.g., store and compare a file hash/mtime in metadata, or add a --force/--refresh mode) and optionally delete drawers for chunks that no longer exist after re-chunking.

Copilot uses AI. Check for mistakes.
Comment thread mempalace/mcp_server.py Outdated
Comment thread tests/conftest.py Outdated
Comment on lines +83 to +87
def collection(palace_path):
"""A ChromaDB collection pre-seeded in the temp palace."""
client = chromadb.PersistentClient(path=palace_path)
col = client.get_or_create_collection("mempalace_drawers")
return col
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The collection/seeded_collection fixtures create a chromadb.PersistentClient but never explicitly release it. On Windows, ChromaDB can hold file handles after tests finish, and the current shutil.rmtree(..., ignore_errors=True) teardown can silently leave temp directories behind. Consider explicitly dropping references / forcing GC in teardown (or adopting the existing Windows-safe cleanup pattern used elsewhere in the repo) so failures don’t get masked and temp dirs don’t accumulate.

Copilot uses AI. Check for mistakes.
Comment thread tests/test_mcp_server.py Outdated
Comment thread tests/test_mcp_server.py Outdated
@bensig
Copy link
Copy Markdown
Collaborator

bensig commented Apr 7, 2026

Hey Igor — CI is failing, looks like it needs a rebase on main (we've merged a bunch of changes recently). Can you rebase and push?

@bensig
Copy link
Copy Markdown
Collaborator

bensig commented Apr 7, 2026

Same situation — your other PRs landed and created conflicts here. One more rebase on main should do it.

@bensig
Copy link
Copy Markdown
Collaborator

bensig commented Apr 8, 2026

One more conflict from #135 merging — should be the last one. Sorry for the cascade!

@igorls igorls force-pushed the fix/data-integrity branch from a9199c3 to 10ea480 Compare April 8, 2026 17:34
@bensig
Copy link
Copy Markdown
Collaborator

bensig commented Apr 8, 2026

Almost there — 6 lint errors, all semicolons in tests (_client, _col = ...; del _client). Split onto two lines and it's good to merge.

igorls added 4 commits April 8, 2026 15:11
MCP tool_add_drawer:
- Make drawer_id content-based: hash full content instead of
  content[:100] + timestamp. Same content → same ID, eliminating
  TOCTOU race conditions
- Switch from col.add() to col.upsert() so re-filing with updated
  content updates the existing drawer

miner.add_drawer:
- Switch from collection.add() to collection.upsert() so re-mining
  a modified file updates instead of silently failing
- Remove the try/except catching 'already exists' — upsert handles
  this naturally

Findings: MemPalace#11 (HIGH — add ignores updates), MemPalace#6 (MEDIUM — TOCTOU),
          MemPalace#13 (MEDIUM — non-deterministic IDs)

Includes test infrastructure from PR MemPalace#131.
92 tests pass.
@igorls igorls force-pushed the fix/data-integrity branch from 6a34cc9 to a0bcd0c Compare April 8, 2026 18:12
@bensig bensig merged commit 2c4abb9 into MemPalace:main Apr 8, 2026
4 checks passed
@milla-jovovich milla-jovovich mentioned this pull request Apr 9, 2026
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants