Phase 0 (b): Run-0 calibration harness + exercise CI gates #8
Merged
… image fix

(c) GraphitiHttpRouter: production Graphiti writer.

kb_engine/sync.py: new GraphitiHttpRouter class. POSTs records with target="graphiti" to a graphiti-mcp HTTP endpoint as MCP JSON-RPC `tools/call` envelopes (default tool: `add_memory`). Records with other targets are skipped. Failures are captured in router.failures by default; --strict raises. The HTTP `post` callable is dependency-injected so tests use a fake response without httpx installed.

Per AGENTS.md hard rule #1: this is the **designated** Graphiti writer. Plugin code never POSTs to Graphiti directly; handlers produce DerivedRecord(target="graphiti", ...), the daemon's router list includes a GraphitiHttpRouter (alongside JsonlRouter + NullRouter as needed), and only the daemon's router runs the HTTP write.

+7 tests covering:
- posts an episode (verifies envelope shape)
- skips non-graphiti records
- handles HTTP errors non-strict (records to failures)
- strict mode raises
- network exception non-strict
- falls back to frontmatter for episode_body when no explicit body provided
- increments JSON-RPC id per call

kb-engine total: 49 tests.

(infra) Optional-dependency extras for kb-engine:

    [project.optional-dependencies]
    postgres = ["psycopg[binary]>=3.1", "pgvector>=0.2"]
    neo4j = ["neo4j>=5"]
    openai = ["openai>=1.0"]
    watch = ["watchdog>=4"]
    yaml = ["pyyaml>=6.0"]
    http = ["httpx>=0.27"]
    all = [<every backend>]

chio-pack now declares `kb-engine[all]` so `uv sync` brings every store driver in. The lazy-import pattern in store/embed.py / store/postgres.py / store/neo4j.py / sync.py still works without the extras for environments that don't need a particular backend.

(infra) docker-compose.yml fix: the graphiti-mcp image was guessed as `zepai/graphiti-mcp:latest` (doesn't exist on Docker Hub). Verified via `docker ps` against the running PR #599 instance: the actual image is `zepai/knowledge-graph-mcp:standalone`. Updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
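
For readers who want a concrete picture of the router behaviour described in (c), here is a minimal sketch, not the code in kb_engine/sync.py. The class name `SketchGraphitiRouter`, the `DerivedRecord` fields, the `(status, text)` return shape of the injected `post` callable, the `add_memory` argument names, and the JSON serialisation of frontmatter are all assumptions made for illustration. Only the `tools/call` envelope layout follows the JSON-RPC 2.0 / MCP convention.

```python
import json
from dataclasses import dataclass, field
from itertools import count
from typing import Any, Callable, Optional


@dataclass
class DerivedRecord:
    """Hypothetical stand-in for the real DerivedRecord; field names are assumed."""
    target: str
    name: str
    body: Optional[str] = None
    frontmatter: dict = field(default_factory=dict)


# Assumed signature of the injected HTTP callable: (url, payload) -> (status, text).
PostFn = Callable[[str, dict], Any]


class SketchGraphitiRouter:
    def __init__(self, url: str, post: PostFn, tool: str = "add_memory",
                 strict: bool = False) -> None:
        self.url = url
        self.post = post
        self.tool = tool
        self.strict = strict
        self.failures: list = []
        self._ids = count(1)  # JSON-RPC id, incremented once per call

    def route(self, record: DerivedRecord) -> bool:
        """Return True if the record was posted, False if skipped or failed."""
        if record.target != "graphiti":
            return False  # other targets belong to other routers
        # Fall back to the frontmatter when no explicit body was provided.
        episode_body = (record.body if record.body is not None
                        else json.dumps(record.frontmatter, sort_keys=True))
        envelope = {
            "jsonrpc": "2.0",
            "id": next(self._ids),
            "method": "tools/call",
            "params": {
                "name": self.tool,
                # Argument names below are assumed, not taken from the PR.
                "arguments": {"name": record.name, "episode_body": episode_body},
            },
        }
        try:
            status, text = self.post(self.url, envelope)
            if status >= 400:
                raise RuntimeError(f"HTTP {status}: {text}")
        except Exception as exc:
            if self.strict:
                raise
            self.failures.append((record.name, str(exc)))
            return False
        return True
```

Injecting `post` keeps the sketch testable with a plain function that records payloads, which mirrors the reason the commit gives for avoiding a hard httpx dependency in tests.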
(b) Calibration harness skeleton per ADR-0004. Three rater abstractions:
- HumanRater: interactive stdin prompt, prints scenario + augmentation
body to stderr, reads four 1–5 scores. rater-A is the human (@connor).
- AnthropicRater: lazy-imports the anthropic SDK, prompts a model
with one of two system rubrics, parses a JSON object
{clarity, accuracy, actionability, brevity}. rater-B and rater-C
use this with claude-sonnet-4-6 (canonical) and
claude-haiku-4-5-20251001 (accuracy-emphasis variant).
- DeterministicRater: returns configured fake scores; used by tests
and by `--dry-run` so the harness can be exercised without an
API key or human in the loop.
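
The rater abstraction above could be sketched roughly as follows. This is an illustration under assumptions, not the module's code: the `Rater` protocol, the `rate(scenario, augmentation_body)` signature, the `parse_scores` helper, and the `Sketch` class name are hypothetical. Only the four dimension names and the 1–5 range come from the description.

```python
import json
from dataclasses import dataclass
from typing import Dict, Protocol

DIMENSIONS = ("clarity", "accuracy", "actionability", "brevity")


def parse_scores(raw: str) -> Dict[str, int]:
    """Parse a model reply that is a JSON object with the four 1-5 scores.

    Hypothetical helper: raises KeyError for a missing dimension and
    ValueError for a score that is not an integer in the range 1..5.
    """
    data = json.loads(raw)
    scores: Dict[str, int] = {}
    for dim in DIMENSIONS:
        value = data[dim]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
        scores[dim] = value
    return scores


class Rater(Protocol):
    """Assumed common interface shared by the three rater kinds."""
    rater_id: str

    def rate(self, scenario: str, augmentation_body: str) -> Dict[str, int]: ...


@dataclass
class SketchDeterministicRater:
    """Returns the configured scores regardless of input (tests and --dry-run)."""
    rater_id: str
    scores: Dict[str, int]

    def rate(self, scenario: str, augmentation_body: str) -> Dict[str, int]:
        return dict(self.scores)  # copy so callers cannot mutate the config
```

A model-backed rater would call something like `parse_scores` on the reply text, while the deterministic rater bypasses parsing entirely, which is what lets the harness run without an API key.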
Two system rubrics live in this module:
CANONICAL_RUBRIC_SYSTEM (rater-A and rater-B) and
ACCURACY_EMPHASIS_SYSTEM (rater-C, deliberately diverged per
ADR-0004 to surface accuracy/brevity trade-offs at calibration time).
Loads scenarios from chio-pack/eval/fixtures/cap-error-explanation/
via PyYAML; picks the augmentation by name from the scenario's
`augmentations:` list. `calibrate(scenario_path, pool, ...)` returns
a CalibrationRun (dataclass) with one RaterScore per rater plus a
`disagreement_flags()` helper that returns dimensions where
max - min > 1 (strict; a diff of 1 does NOT flag — verified by test).
`render_calibration_md(run)` produces the 12-row table
(3 raters × 4 dimensions) the calibration ADR template expects.
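
The disagreement rule and the 12-row table can be illustrated with a small sketch. The dataclass fields and the table's column layout are assumptions (the ADR template's actual columns are not shown here). Only the strict `max - min > 1` rule and the 3 × 4 row count come from the text.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

DIMENSIONS = ("clarity", "accuracy", "actionability", "brevity")


@dataclass
class RaterScore:
    rater_id: str
    scores: Dict[str, int]  # one 1-5 score per dimension

    def mean(self) -> float:
        return mean(self.scores[d] for d in DIMENSIONS)


@dataclass
class CalibrationRun:
    run_number: int
    scenario: str
    rater_scores: List[RaterScore]

    def disagreement_flags(self) -> List[str]:
        """Dimensions where the spread across raters is strictly greater than 1."""
        flagged = []
        for dim in DIMENSIONS:
            values = [rs.scores[dim] for rs in self.rater_scores]
            if max(values) - min(values) > 1:  # a spread of exactly 1 does not flag
                flagged.append(dim)
        return flagged


def render_calibration_md(run: CalibrationRun) -> str:
    """Render one row per (rater, dimension) pair; the column layout is assumed."""
    lines = ["| rater | dimension | score |", "| --- | --- | --- |"]
    for rs in run.rater_scores:
        for dim in DIMENSIONS:
            lines.append(f"| {rs.rater_id} | {dim} | {rs.scores[dim]} |")
    return "\n".join(lines)
```

With three raters this yields 12 data rows below the two header lines, and a dimension scored 4, 4, 5 stays unflagged while one scored 2, 4, 3 is flagged.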
CLI: `python -m chio_pack.eval.calibration --dry-run --run-number 0`
walks all 10 scenarios with deterministic raters and prints JSON.
Real runs use --real --augmentation raw|enriched|baseline.
+8 tests in tests/test_calibration.py covering pool size, scenario
loading, unknown-augmentation raises, disagreement flag detection
(>1 strict, =1 not flagged), render produces 12 rows, RaterScore.mean(),
DeterministicRater determinism. chio-pack total: 39 → 47 tests.
This is the harness only. Running Run-0 against the full 10 scenarios
with three real raters is gated on ANTHROPIC_API_KEY (rater-B,
rater-C) plus a sit-down with @connor (rater-A); both are out of band
for this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Outcome evals — 2026-05-09 02:20 UTC
Deferred (block on Phase 2)
Summary
- `HumanRater` (interactive), `AnthropicRater` (lazy-imports anthropic SDK; canonical + accuracy-emphasis system prompts), `DeterministicRater` (tests + `--dry-run`). `CalibrationRun.disagreement_flags()` flags dimensions where `max - min > 1` strict (a diff of 1 does NOT flag; verified by test).
- `chio-dev ingest ../arc --no-postgres` walked 260,956 files, ingested 40,606, upserted 41,889 nodes + 1,283 edges into Neo4j (`MATCH (n) RETURN labels(n)[0], count(*)` → 69,947 ChioEntity, 39,313 ChioFile, 1 ChioCrate). First time the live Neo4j ingest path has seen real arc data.
- … `test_calibration.py`).

What's deliberately out of scope
- … `ANTHROPIC_API_KEY` (rater-B, rater-C) plus a sit-down with @bb-connor (rater-A). Both are out of band: this PR adds the harness, not the calibration data.
- … `chunks_inserted: 0`); re-running with `OPENAI_API_KEY` + Postgres is a follow-up; the Neo4j-only path is what's exercised here.

Test plan
- `kb-engine` tests pass (49)
- `chio-pack` tests pass (39, includes 8 new calibration tests)
- `python ops/ci/check-imports.py` passes (no engine ↔ pack boundary violations)
- `python -m chio_pack.eval.calibration --dry-run --run-number 0` walks all 10 cap-error scenarios cleanly
- `chio-dev ingest ../arc --no-postgres` end-to-end run committed nodes + edges into the running Neo4j

🤖 Generated with Claude Code