
Phase 0 (b): Run-0 calibration harness + exercise CI gates #8

Merged: bb-connor merged 2 commits into main from phase-1.5-calibration-and-ci-exercise on May 9, 2026

@bb-connor

Summary

  • Run-0 calibration harness for the cap-error-explanation eval per ADR-0004. Three rater abstractions: HumanRater (interactive), AnthropicRater (lazy-imports the anthropic SDK; canonical and accuracy-emphasis system prompts), DeterministicRater (used by tests and --dry-run). CalibrationRun.disagreement_flags() flags each dimension where max - min > 1. The comparison is strict: a difference of exactly 1 does not flag (verified by test).
  • End-to-end ingest verified on this laptop against the existing PR #599 docker stack: chio-dev ingest ../arc --no-postgres walked 260,956 files, ingested 40,606, upserted 41,889 nodes + 1,283 edges into Neo4j (MATCH (n) RETURN labels(n)[0], count(*) → 69,947 ChioEntity, 39,313 ChioFile, 1 ChioCrate). First time the live Neo4j ingest path has seen real arc data.
  • Exercises CI gates flipped to blocking in #5 / commit 3732954: boundary check, kb-engine 49 tests, chio-pack 39 tests (+8 new in test_calibration.py).
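
The strict disagreement rule from the first bullet can be sketched as follows. This is a minimal illustration, not the harness's actual code: the class and method names come from the PR text, but the field layout (`rater`, `scores`) and the example values are assumptions.

```python
from dataclasses import dataclass, field

DIMENSIONS = ("clarity", "accuracy", "actionability", "brevity")


@dataclass
class RaterScore:
    rater: str               # assumed field name
    scores: dict[str, int]   # assumed layout: dimension -> 1..5 score

    def mean(self) -> float:
        return sum(self.scores.values()) / len(self.scores)


@dataclass
class CalibrationRun:
    scores: list[RaterScore] = field(default_factory=list)

    def disagreement_flags(self) -> list[str]:
        """Return dimensions whose score spread is strictly greater than 1."""
        flagged = []
        for dim in DIMENSIONS:
            values = [s.scores[dim] for s in self.scores]
            if max(values) - min(values) > 1:  # a spread of exactly 1 is not flagged
                flagged.append(dim)
        return flagged


# Made-up scores, for illustration only.
run = CalibrationRun([
    RaterScore("rater-A", {"clarity": 4, "accuracy": 5, "actionability": 3, "brevity": 4}),
    RaterScore("rater-B", {"clarity": 5, "accuracy": 3, "actionability": 3, "brevity": 4}),
    RaterScore("rater-C", {"clarity": 4, "accuracy": 4, "actionability": 4, "brevity": 2}),
])
print(run.disagreement_flags())
```

Here clarity and actionability each have a spread of 1 and stay unflagged, while accuracy and brevity have a spread of 2 and are flagged.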

What's deliberately out of scope

  • Running Run-0 with three real raters is gated on ANTHROPIC_API_KEY (rater-B, rater-C) plus a sit-down with @bb-connor (rater-A). Both are out of band — this PR adds the harness, not the calibration data.
  • Embeddings (chunks_inserted: 0) — re-running with OPENAI_API_KEY + Postgres is a follow-up; the Neo4j-only path is what's exercised here.

Test plan

  • kb-engine tests pass (49)
  • chio-pack tests pass (39, includes 8 new calibration tests)
  • python ops/ci/check-imports.py passes (no engine ↔ pack boundary violations)
  • python -m chio_pack.eval.calibration --dry-run --run-number 0 walks all 10 cap-error scenarios cleanly
  • chio-dev ingest ../arc --no-postgres end-to-end run committed nodes + edges into the running Neo4j
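
For readers unfamiliar with import-boundary checks, the idea behind a script like ops/ci/check-imports.py can be sketched with the standard `ast` module. The real script's rules are not shown in this PR; the helper name `forbidden_imports` and the rule tested here (engine code must not import `chio_pack`) are assumptions for illustration.

```python
import ast


def forbidden_imports(source: str, forbidden_root: str) -> list[str]:
    """Return absolute imports in `source` that fall under `forbidden_root`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # relative imports (level > 0) are ignored
        else:
            continue
        for name in names:
            if name == forbidden_root or name.startswith(forbidden_root + "."):
                hits.append(name)
    return hits


# Hypothetical engine-side source that crosses the boundary.
engine_src = "import json\nfrom chio_pack.eval import calibration\n"
print(forbidden_imports(engine_src, "chio_pack"))
```

A CI gate built on this would walk the engine's source files and exit non-zero if any file returns a non-empty list.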

🤖 Generated with Claude Code

bb-connor and others added 2 commits May 8, 2026 22:13
… image fix

(c) GraphitiHttpRouter — production Graphiti writer.

  kb_engine/sync.py: new GraphitiHttpRouter class. POSTs records with
  target="graphiti" to a graphiti-mcp HTTP endpoint as MCP JSON-RPC
  `tools/call` envelopes (default tool: `add_memory`). Records with
  other targets are skipped. Failures are captured in router.failures
  by default; --strict raises. The HTTP `post` callable is dependency-
  injected so tests use a fake response without httpx installed.

  Per AGENTS.md hard rule #1: this is the **designated** Graphiti writer.
  Plugin code never POSTs Graphiti directly — handlers produce
  DerivedRecord(target="graphiti", ...), the daemon's router list
  includes a GraphitiHttpRouter (alongside JsonlRouter + NullRouter as
  needed), and only the daemon's router runs the HTTP write.

  +7 tests covering: posts an episode (verifies envelope shape), skips
  non-graphiti records, handles HTTP errors non-strict (records to
  failures), strict mode raises, network exception non-strict, falls
  back to frontmatter for episode_body when no explicit body provided,
  increments JSON-RPC id per call. kb-engine total: 49 tests.
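
The router behavior described above (envelope shape, skip rule, failure capture, injected `post`) can be sketched roughly as below. This is not the code in kb_engine/sync.py: the class name, the dict-shaped record, the `(status, body)` return type of `post`, the argument names inside `arguments`, and the URL are all assumptions.

```python
import itertools
from typing import Any, Callable

# Hypothetical signature for the injected HTTP callable: (url, json_payload) -> (status, body).
PostFn = Callable[[str, dict[str, Any]], tuple[int, str]]


class GraphitiRouterSketch:
    """Illustrative stand-in for GraphitiHttpRouter; not the real class."""

    def __init__(self, url: str, post: PostFn, tool: str = "add_memory", strict: bool = False):
        self.url = url
        self._post = post
        self.tool = tool
        self.strict = strict
        self.failures: list[tuple[str, str]] = []
        self._ids = itertools.count(1)  # JSON-RPC id increments per call

    def route(self, record: dict[str, Any]) -> bool:
        if record.get("target") != "graphiti":
            return False  # other targets are skipped, not failed
        envelope = {
            "jsonrpc": "2.0",
            "id": next(self._ids),
            "method": "tools/call",
            "params": {
                "name": self.tool,
                "arguments": {
                    "name": record["name"],
                    # fall back to frontmatter when no explicit body is given
                    "episode_body": record.get("body") or record.get("frontmatter", ""),
                },
            },
        }
        try:
            status, text = self._post(self.url, envelope)
            if status >= 400:
                raise RuntimeError(f"HTTP {status}: {text}")
        except Exception as exc:
            if self.strict:
                raise
            self.failures.append((record["name"], str(exc)))
            return False
        return True


sent: list[dict[str, Any]] = []


def fake_post(url: str, payload: dict[str, Any]) -> tuple[int, str]:
    sent.append(payload)
    return 200, "ok"


router = GraphitiRouterSketch("http://localhost:8000/mcp", fake_post)  # placeholder URL
router.route({"target": "graphiti", "name": "ep-1", "body": "hello"})
router.route({"target": "jsonl", "name": "skipped"})
```

Injecting `post` keeps the sketch testable without an HTTP client installed, which is the motivation the commit message gives for the real router.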

(infra) Optional-dependency extras for kb-engine:
    [project.optional-dependencies]
    postgres = ["psycopg[binary]>=3.1", "pgvector>=0.2"]
    neo4j    = ["neo4j>=5"]
    openai   = ["openai>=1.0"]
    watch    = ["watchdog>=4"]
    yaml     = ["pyyaml>=6.0"]
    http     = ["httpx>=0.27"]
    all      = [<every backend>]
  chio-pack now declares `kb-engine[all]` so `uv sync` brings every
  store driver in. The lazy-import pattern in store/embed.py /
  store/postgres.py / store/neo4j.py / sync.py still works without
  the extras for environments that don't need a particular backend.
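
A lazy-import helper in the spirit of the pattern mentioned above might look like this. It is a generic sketch, not the code in store/*.py; the helper name `require` and the error wording are assumptions.

```python
import importlib
from types import ModuleType


def require(module_name: str, extra: str) -> ModuleType:
    """Import an optional backend at call time, with a hint about the missing extra."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name!r} is not installed; "
            f"install the optional extra, e.g. kb-engine[{extra}]"
        ) from exc


# Called inside the function that needs the backend, not at module import time,
# so environments without that extra can still import the package.
json_mod = require("json", "all")  # stdlib module used here only so the sketch runs anywhere
print(json_mod.dumps({"ok": True}))
```

Because the import happens only when a backend is actually used, a Neo4j-only environment never touches the Postgres or OpenAI drivers.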

(infra) docker-compose.yml fix:
  graphiti-mcp image was guessed `zepai/graphiti-mcp:latest` (doesn't
  exist on Docker Hub). Verified via `docker ps` against the running
  PR #599 instance: actual image is
  `zepai/knowledge-graph-mcp:standalone`. Updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

(b) Calibration harness skeleton per ADR-0004. Three rater abstractions:

  - HumanRater: interactive stdin prompt, prints scenario + augmentation
    body to stderr, reads four 1–5 scores. rater-A is the human (@connor).
  - AnthropicRater: lazy-imports the anthropic SDK, prompts a model
    with one of two system rubrics, parses a JSON object
    {clarity, accuracy, actionability, brevity}. rater-B and rater-C
    use this with claude-sonnet-4-6 (canonical) and
    claude-haiku-4-5-20251001 (accuracy-emphasis variant).
  - DeterministicRater: returns configured fake scores; used by tests
    and by `--dry-run` so the harness can be exercised without an
    API key or human in the loop.
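
The JSON-parsing step of a model-backed rater can be sketched without calling any SDK. The function name `parse_scores` is hypothetical, and applying the 1–5 range to model replies is an assumption carried over from the HumanRater description.

```python
import json

DIMENSIONS = ("clarity", "accuracy", "actionability", "brevity")


def parse_scores(reply: str) -> dict[str, int]:
    """Extract the four rubric scores from a model reply containing one JSON object."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in reply")
    data = json.loads(reply[start:end + 1])  # JSONDecodeError is a ValueError subclass
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim}: expected an integer 1-5, got {value!r}")
        scores[dim] = value
    return scores


reply = 'Scores: {"clarity": 4, "accuracy": 5, "actionability": 3, "brevity": 2}'
print(parse_scores(reply))
```

Validating the range at parse time means a malformed model reply fails loudly instead of skewing the disagreement flags.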

  Two system rubrics live in this module:
  CANONICAL_RUBRIC_SYSTEM (rater-A and rater-B) and
  ACCURACY_EMPHASIS_SYSTEM (rater-C, deliberately diverged per
  ADR-0004 to surface accuracy/brevity trade-offs at calibration time).

  Loads scenarios from chio-pack/eval/fixtures/cap-error-explanation/
  via PyYAML; picks the augmentation by name from the scenario's
  `augmentations:` list. `calibrate(scenario_path, pool, ...)` returns
  a CalibrationRun (dataclass) with one RaterScore per rater plus a
  `disagreement_flags()` helper that returns dimensions where
  max - min > 1 (strict; a diff of 1 does NOT flag — verified by test).

  `render_calibration_md(run)` produces the 12-row table
  (3 raters × 4 dimensions) the calibration ADR template expects.
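
The 12-row shape can be illustrated with a small renderer. The column layout and the function name `render_rows` are assumptions; the ADR template's actual columns are not shown in this commit, and the scores are made up.

```python
DIMENSIONS = ("clarity", "accuracy", "actionability", "brevity")


def render_rows(scores: dict[str, dict[str, int]]) -> str:
    """Render one Markdown table row per (rater, dimension) pair."""
    lines = ["| Rater | Dimension | Score |", "| --- | --- | --- |"]
    for rater, dims in scores.items():
        for dim in DIMENSIONS:
            lines.append(f"| {rater} | {dim} | {dims[dim]} |")
    return "\n".join(lines)


# Illustrative scores only.
example = {
    "rater-A": {"clarity": 4, "accuracy": 5, "actionability": 3, "brevity": 4},
    "rater-B": {"clarity": 5, "accuracy": 3, "actionability": 3, "brevity": 4},
    "rater-C": {"clarity": 4, "accuracy": 4, "actionability": 4, "brevity": 2},
}
table = render_rows(example)
print(table)
```

With three raters and four dimensions the output has two header lines followed by 12 data rows.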

  CLI: `python -m chio_pack.eval.calibration --dry-run --run-number 0`
  walks all 10 scenarios with deterministic raters and prints JSON.
  Real runs use --real --augmentation raw|enriched|baseline.

  +8 tests in tests/test_calibration.py covering pool size, scenario
  loading, unknown-augmentation raises, disagreement flag detection
  (>1 strict, =1 not flagged), render produces 12 rows, RaterScore.mean(),
  DeterministicRater determinism. chio-pack total: 39 → 47 tests.

This is the harness only. Running Run-0 against the full 10 scenarios
with three real raters is gated on ANTHROPIC_API_KEY (rater-B,
rater-C) plus a sit-down with @connor (rater-A); both are out of band
for this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@github-actions

github-actions Bot commented May 9, 2026

Outcome evals — 2026-05-09 02:20 UTC

Generated by chio-pack-eval Phase 0 skeleton. No real runners yet — see PHASE-0.md.

| Eval | Fixtures | Status | Notes |
| --- | --- | --- | --- |
| time-to-first-correct-fix | 8 | BLOCKED — runner | fixtures present; runner is Phase 1 deliverable |
| repeated-mistake-rate | 0 | BLOCKED — runner | no fixtures glob; runner is Phase 1 deliverable |
| conformance-harness-recall | 20 | BLOCKED — runner | fixtures present; runner is Phase 1 deliverable |
| capability-error-explanation | 10 | BLOCKED — runner | fixtures present; runner is Phase 1 deliverable |

Deferred (block on Phase 2)

  • signed-retrieval — blocked until phase-2b: Requires kb-engine/kb_engine/receipt/envelope.py — see PLAN.md Moonshot 2.
  • pr-impact-gate-precision-recall — blocked until phase-2a: Requires chio-pr-gate/ — see PLAN.md Moonshot 1.

@bb-connor bb-connor merged commit 21190f5 into main May 9, 2026
1 check passed
@bb-connor bb-connor deleted the phase-1.5-calibration-and-ci-exercise branch May 9, 2026 02:21