Phase 0 (b): Run-0 calibration harness + exercise CI gates by bb-connor · Pull Request #8 · backbay-labs/chio-developer-base

bb-connor · 2026-05-09T02:20:26Z

Summary

Run-0 calibration harness for the cap-error-explanation eval per ADR-0004. Three rater abstractions: HumanRater (interactive), AnthropicRater (lazy-imports anthropic SDK; canonical + accuracy-emphasis system prompts), DeterministicRater (tests + --dry-run). CalibrationRun.disagreement_flags() flags dimensions where max - min > 1 strict (a diff of 1 does NOT flag — verified by test).
End-to-end ingest verified on this laptop against the existing PR #599 docker stack: chio-dev ingest ../arc --no-postgres walked 260,956 files, ingested 40,606, upserted 41,889 nodes + 1,283 edges into Neo4j (MATCH (n) RETURN labels(n)[0], count(*) → 69,947 ChioEntity, 39,313 ChioFile, 1 ChioCrate). First time the live Neo4j ingest path has seen real arc data.
Exercises CI gates flipped to blocking in #5 / commit 3732954: boundary check, kb-engine 49 tests, chio-pack 39 tests (+8 new in test_calibration.py).
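
The strict disagreement rule in the summary can be sketched as follows. This is a minimal illustration, not the code in chio_pack.eval.calibration: the `RaterScore` fields, the `DIMENSIONS` tuple, and the standalone `disagreement_flags` function are assumptions chosen to match the four dimensions and the `max - min > 1` rule described above.

```python
from dataclasses import dataclass

# Assumed dimension names, taken from the JSON keys the raters return.
DIMENSIONS = ("clarity", "accuracy", "actionability", "brevity")


@dataclass(frozen=True)
class RaterScore:
    """Hypothetical stand-in for the harness's per-rater score record."""
    rater: str
    clarity: int
    accuracy: int
    actionability: int
    brevity: int

    def mean(self) -> float:
        return sum(getattr(self, d) for d in DIMENSIONS) / len(DIMENSIONS)


def disagreement_flags(scores, threshold=1):
    """Return dimensions whose spread (max - min) strictly exceeds threshold."""
    flagged = []
    for dim in DIMENSIONS:
        values = [getattr(s, dim) for s in scores]
        if max(values) - min(values) > threshold:  # strict: a diff of 1 does not flag
            flagged.append(dim)
    return flagged


scores = [
    RaterScore("rater-A", clarity=4, accuracy=5, actionability=3, brevity=4),
    RaterScore("rater-B", clarity=4, accuracy=3, actionability=4, brevity=4),
    RaterScore("rater-C", clarity=5, accuracy=4, actionability=3, brevity=4),
]
flags = disagreement_flags(scores)
print(flags)
```

Here only accuracy spans two points (5 vs 3), so it is the only flagged dimension; clarity and actionability differ by exactly 1 and stay unflagged.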

What's deliberately out of scope

Running Run-0 with three real raters is gated on ANTHROPIC_API_KEY (rater-B, rater-C) plus a sit-down with @bb-connor (rater-A). Both are out of band — this PR adds the harness, not the calibration data.
Embeddings (chunks_inserted: 0) — re-running with OPENAI_API_KEY + Postgres is a follow-up; the Neo4j-only path is what's exercised here.

Test plan

kb-engine tests pass (49)
chio-pack tests pass (39, includes 8 new calibration tests)
python ops/ci/check-imports.py passes (no engine ↔ pack boundary violations)
python -m chio_pack.eval.calibration --dry-run --run-number 0 walks all 10 cap-error scenarios cleanly
chio-dev ingest ../arc --no-postgres end-to-end run committed nodes + edges into the running Neo4j

🤖 Generated with Claude Code

… image fix

(c) GraphitiHttpRouter: production Graphiti writer.

kb_engine/sync.py: new GraphitiHttpRouter class. POSTs records with target="graphiti" to a graphiti-mcp HTTP endpoint as MCP JSON-RPC `tools/call` envelopes (default tool: `add_memory`). Records with other targets are skipped. Failures are captured in router.failures by default; --strict raises. The HTTP `post` callable is dependency-injected so tests use a fake response without httpx installed.

Per AGENTS.md hard rule #1: this is the **designated** Graphiti writer. Plugin code never POSTs Graphiti directly; handlers produce DerivedRecord(target="graphiti", ...), the daemon's router list includes a GraphitiHttpRouter (alongside JsonlRouter + NullRouter as needed), and only the daemon's router runs the HTTP write.

+7 tests covering:
- posts an episode (verifies envelope shape)
- skips non-graphiti records
- handles HTTP errors non-strict (records to failures)
- strict mode raises
- network exception non-strict
- falls back to frontmatter for episode_body when no explicit body provided
- increments JSON-RPC id per call

kb-engine total: 49 tests.

(infra) Optional-dependency extras for kb-engine:

    [project.optional-dependencies]
    postgres = ["psycopg[binary]>=3.1", "pgvector>=0.2"]
    neo4j = ["neo4j>=5"]
    openai = ["openai>=1.0"]
    watch = ["watchdog>=4"]
    yaml = ["pyyaml>=6.0"]
    http = ["httpx>=0.27"]
    all = [<every backend>]

chio-pack now declares `kb-engine[all]` so `uv sync` brings every store driver in. The lazy-import pattern in store/embed.py / store/postgres.py / store/neo4j.py / sync.py still works without the extras for environments that don't need a particular backend.

(infra) docker-compose.yml fix: graphiti-mcp image was guessed `zepai/graphiti-mcp:latest` (doesn't exist on Docker Hub). Verified via `docker ps` against the running PR #599 instance: actual image is `zepai/knowledge-graph-mcp:standalone`. Updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
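
To make the router behaviour described in this commit concrete, here is a rough sketch of a router that wraps records in MCP JSON-RPC `tools/call` envelopes and sends them through an injected `post` callable. It is not the GraphitiHttpRouter source: the `Record` and `SketchRouter` classes, the `post(url, payload) -> status` signature, the endpoint URL, and the `name` / `episode_body` argument keys are all assumptions made for illustration.

```python
import itertools
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Record:
    """Hypothetical stand-in for DerivedRecord."""
    target: str
    name: str
    body: str


@dataclass
class SketchRouter:
    url: str
    post: Callable[[str, dict], int]  # injected; assumed to return an HTTP status code
    tool: str = "add_memory"
    strict: bool = False
    failures: list = field(default_factory=list)
    _ids: Any = field(default_factory=lambda: itertools.count(1))

    def envelope(self, record: Record) -> dict:
        # JSON-RPC 2.0 request using the MCP "tools/call" method.
        return {
            "jsonrpc": "2.0",
            "id": next(self._ids),  # incremented once per call
            "method": "tools/call",
            "params": {
                "name": self.tool,
                # Argument keys are assumed, not taken from the real tool schema.
                "arguments": {"name": record.name, "episode_body": record.body},
            },
        }

    def route(self, record: Record) -> bool:
        if record.target != "graphiti":
            return False  # other targets are skipped, no envelope is built
        payload = self.envelope(record)
        try:
            status = self.post(self.url, payload)
            if status >= 400:
                raise RuntimeError(f"HTTP {status}")
        except Exception as exc:
            if self.strict:
                raise
            self.failures.append((record.name, str(exc)))
            return False
        return True


sent = []


def fake_post(url: str, payload: dict) -> int:
    """Test double: records the payload instead of doing network I/O."""
    sent.append(payload)
    return 200


router = SketchRouter(url="http://localhost:8000/mcp", post=fake_post)  # hypothetical URL
router.route(Record("graphiti", "ep-1", "hello"))
router.route(Record("jsonl", "ep-2", "skipped"))
router.route(Record("graphiti", "ep-3", "world"))
print([p["id"] for p in sent])
```

Injecting `post` keeps the sketch free of any HTTP client import, which mirrors the stated reason tests can run without httpx installed.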

@connor

(b) Calibration harness skeleton per ADR-0004.

Three rater abstractions:
- HumanRater: interactive stdin prompt, prints scenario + augmentation body to stderr, reads four 1–5 scores. rater-A is the human (@connor).
- AnthropicRater: lazy-imports the anthropic SDK, prompts a model with one of two system rubrics, parses a JSON object {clarity, accuracy, actionability, brevity}. rater-B and rater-C use this with claude-sonnet-4-6 (canonical) and claude-haiku-4-5-20251001 (accuracy-emphasis variant).
- DeterministicRater: returns configured fake scores; used by tests and by `--dry-run` so the harness can be exercised without an API key or human in the loop.

Two system rubrics live in this module: CANONICAL_RUBRIC_SYSTEM (rater-A and rater-B) and ACCURACY_EMPHASIS_SYSTEM (rater-C, deliberately diverged per ADR-0004 to surface accuracy/brevity trade-offs at calibration time).

Loads scenarios from chio-pack/eval/fixtures/cap-error-explanation/ via PyYAML; picks the augmentation by name from the scenario's `augmentations:` list.

`calibrate(scenario_path, pool, ...)` returns a CalibrationRun (dataclass) with one RaterScore per rater plus a `disagreement_flags()` helper that returns dimensions where max - min > 1 (strict; a diff of 1 does NOT flag, verified by test). `render_calibration_md(run)` produces the 12-row table (3 raters × 4 dimensions) the calibration ADR template expects.

CLI: `python -m chio_pack.eval.calibration --dry-run --run-number 0` walks all 10 scenarios with deterministic raters and prints JSON. Real runs use --real --augmentation raw|enriched|baseline.

+8 tests in tests/test_calibration.py covering pool size, scenario loading, unknown-augmentation raises, disagreement flag detection (>1 strict, =1 not flagged), render produces 12 rows, RaterScore.mean(), DeterministicRater determinism. chio-pack total: 39 → 47 tests.

This is the harness only. Running Run-0 against the full 10 scenarios with three real raters is gated on ANTHROPIC_API_KEY (rater-B, rater-C) plus a sit-down with @connor (rater-A); both are out of band for this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
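
As an illustration of the score parsing and the 12-row table mentioned in this commit, the sketch below validates a JSON reply with the four dimensions and renders one row per rater and dimension. The function names `parse_scores` and `render_rows`, the column layout, and the sample replies are assumptions; the real `render_calibration_md` may format its table differently.

```python
import json

# Assumed dimension names, taken from the JSON keys listed in the commit message.
DIMENSIONS = ("clarity", "accuracy", "actionability", "brevity")


def parse_scores(reply: str) -> dict:
    """Parse a rater's JSON reply and check every dimension is an int in 1..5."""
    data = json.loads(reply)
    scores = {}
    for dim in DIMENSIONS:
        value = data[dim]  # KeyError if the rater omitted a dimension
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
        scores[dim] = value
    return scores


def render_rows(by_rater: dict) -> list:
    """Render a header plus one row per (rater, dimension) pair."""
    rows = ["| Rater | Dimension | Score |", "| --- | --- | --- |"]
    for rater, scores in by_rater.items():
        for dim in DIMENSIONS:
            rows.append(f"| {rater} | {dim} | {scores[dim]} |")
    return rows


# Made-up replies, for illustration only.
replies = {
    "rater-A": '{"clarity": 4, "accuracy": 5, "actionability": 3, "brevity": 4}',
    "rater-B": '{"clarity": 4, "accuracy": 3, "actionability": 4, "brevity": 4}',
    "rater-C": '{"clarity": 5, "accuracy": 4, "actionability": 3, "brevity": 4}',
}
by_rater = {rater: parse_scores(text) for rater, text in replies.items()}
rows = render_rows(by_rater)
print("\n".join(rows))
```

With three raters and four dimensions, the output has two header lines followed by 12 data rows.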

github-actions · 2026-05-09T02:20:42Z

Outcome evals — 2026-05-09 02:20 UTC

Generated by chio-pack-eval Phase 0 skeleton. No real runners yet — see PHASE-0.md.

| Eval | Fixtures | Status | Notes |
| --- | --- | --- | --- |
| `time-to-first-correct-fix` | 8 | BLOCKED — runner | fixtures present; runner is Phase 1 deliverable |
| `repeated-mistake-rate` | 0 | BLOCKED — runner | no fixtures glob; runner is Phase 1 deliverable |
| `conformance-harness-recall` | 20 | BLOCKED — runner | fixtures present; runner is Phase 1 deliverable |
| `capability-error-explanation` | 10 | BLOCKED — runner | fixtures present; runner is Phase 1 deliverable |

Deferred (block on Phase 2)

signed-retrieval — blocked until phase-2b: Requires kb-engine/kb_engine/receipt/envelope.py — see PLAN.md Moonshot 2.
pr-impact-gate-precision-recall — blocked until phase-2a: Requires chio-pr-gate/ — see PLAN.md Moonshot 1.

bb-connor and others added 2 commits May 8, 2026 22:13

bb-connor merged commit 21190f5 into main May 9, 2026
1 check passed

bb-connor deleted the phase-1.5-calibration-and-ci-exercise branch May 9, 2026 02:21
