feat(js): auto-config verification layer (preflight + postflight) for goldenmatch-js v0.3.0 #48
Merged
benzsevern merged 26 commits into main on Apr 15, 2026
Prerequisite for upcoming autoconfig verification layer (Check 6 + weight cap in classifier fixes). Matches Python's confidence semantics: 0.9 (both heuristics agree), 0.7 (one heuristic), 0.3 (string fallthrough).
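The three confidence tiers in this commit can be sketched as below. This is only an illustration: `classifyColumn`, the `Vote` type, and the idea that the two heuristics are a name-based and a data-based vote are assumptions, not the classifier's real API. The tie-break on disagreement (name vote wins at 0.7) is also assumed.

```typescript
// Hypothetical sketch; not the actual goldenmatch-js classifier.
// A vote is the column type a heuristic proposes, or null if it abstains.
type Vote = string | null;

interface Classification {
  type: string;
  confidence: number;
}

function classifyColumn(nameVote: Vote, dataVote: Vote): Classification {
  // Both heuristics fired and agree.
  if (nameVote !== null && nameVote === dataVote) {
    return { type: nameVote, confidence: 0.9 };
  }
  // Only one usable signal. Assumption: on disagreement the name-based vote wins.
  const single = nameVote ?? dataVote;
  if (single !== null) {
    return { type: single, confidence: 0.7 };
  }
  // Neither heuristic fired: fall through to a plain string column.
  return { type: "string", confidence: 0.3 };
}
```

A downstream check such as the weight cap can then treat anything below 0.5 as low confidence, which in this scheme means only the string fallthrough.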
…unt_no / guid / uuid
Types, dataclasses, ConfigValidationError, makePreflightReport factory, stripConventionPrivate utility. preflight/postflight bodies stubbed; real implementations land in Phase 2 and Phase 3.
…t to results
GoldenMatchConfig gains _preflightReport / _strictAutoconfig / _domainProfile as non-readonly optional fields (minimum escape hatch for strict-readonly config; see spec §7). DedupeResult and MatchResult gain optional readonly postflightReport. The list is closed: future internal state uses a side-table pattern instead.
…repair
- Add DOMAIN_EXTRACTED_COLS constant to domain.ts (__brand__, __model__, __version__) so preflight can distinguish 'missing but producible' column references from hard errors.
- Replace preflight stub with a real implementation that walks every matchkey + blocking key reference, flags anything not present in the first row, and auto-repairs config.domain when _domainProfile is stashed on the config and a __<col>__ reference turns up.
- Keep types.ts -> autoconfigVerify.ts runtime cycle broken by using only 'import type' from types.ts inside autoconfigVerify.ts.
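A minimal sketch of the missing-column check described in this commit. The constant's values come from the commit message; `classifyColumnRefs`, `RefStatus`, and the `hasDomainProfile` flag are hypothetical names. Treating a `__<col>__` reference as a hard miss when no domain profile is stashed is an assumption.

```typescript
// Values taken from the commit message; everything else here is hypothetical.
const DOMAIN_EXTRACTED_COLS = new Set<string>(["__brand__", "__model__", "__version__"]);

type RefStatus = "present" | "producible" | "missing";

function classifyColumnRefs(
  firstRow: Record<string, unknown>,
  refs: readonly string[],
  hasDomainProfile: boolean,
): Map<string, RefStatus> {
  const status = new Map<string, RefStatus>();
  for (const ref of refs) {
    if (ref in firstRow) {
      // The column exists in the data as-is.
      status.set(ref, "present");
    } else if (hasDomainProfile && DOMAIN_EXTRACTED_COLS.has(ref)) {
      // Domain extraction could create this column, so the config is repairable.
      status.set(ref, "producible");
    } else {
      // Nothing can produce this column: a hard error.
      status.set(ref, "missing");
    }
  }
  return status;
}
```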
… matchkeys Drop exact matchkeys whose referenced column is either near-unique (cardinality_ratio >= 0.99 — every row its own block) or near-constant (ratio <= 0.01 — one giant block). Both drops are repaired warnings. If the drops empty the matchkey list, emit a no_matchkeys_remain error so ConfigValidationError fires downstream instead of producing a silent no-op run.
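The cardinality rule above can be sketched roughly as follows. The thresholds (0.99 and 0.01) and the `cardinality_high` and `no_matchkeys_remain` names come from the commit messages; the `Matchkey` and `Finding` shapes, the function names, and the `cardinality_low` label are assumptions made for illustration.

```typescript
type Row = Record<string, unknown>;

interface Matchkey {
  name: string;
  type: "exact" | "weighted";
  columns: string[];
}

interface Finding {
  check: string;
  severity: "warning" | "error";
  subject: string;
  repaired: boolean;
}

// Distinct values divided by row count.
function cardinalityRatio(rows: Row[], column: string): number {
  const distinct = new Set(rows.map((r) => String(r[column] ?? "")));
  return distinct.size / rows.length;
}

function dropDegenerateExactMatchkeys(
  rows: Row[],
  matchkeys: Matchkey[],
): { kept: Matchkey[]; findings: Finding[] } {
  const kept: Matchkey[] = [];
  const findings: Finding[] = [];
  for (const mk of matchkeys) {
    let check: string | null = null;
    // Only exact matchkeys are checked; an empty dataset is left alone.
    if (mk.type === "exact" && rows.length > 0) {
      for (const col of mk.columns) {
        const ratio = cardinalityRatio(rows, col);
        if (ratio >= 0.99) check = "cardinality_high"; // every row its own block
        else if (ratio <= 0.01) check = "cardinality_low"; // label assumed
        if (check !== null) break;
      }
    }
    if (check === null) kept.push(mk);
    else findings.push({ check, severity: "warning", subject: mk.name, repaired: true });
  }
  if (kept.length === 0) {
    // Nothing left to match on: surface an error instead of a silent no-op run.
    findings.push({ check: "no_matchkeys_remain", severity: "error", subject: "matchkeys", repaired: false });
  }
  return { kept, findings };
}
```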
Walks every BlockingKeyConfig, groups a <= 10k-row sample by raw field concatenation, and warns when the p99 block size exceeds 5000 (scoring will be slow) or the p50 is <2 (blocking is too selective, most records end up alone). Raw-value grouping is a coarse proxy for the real transform-applied blocker but catches the typical 'everyone has the same state' / 'ID column used as blocker' failures.
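As a rough illustration of the block-size check, the sketch below groups a sample by raw field values and compares percentiles against the limits quoted above (p99 above 5000, p50 below 2). The function names, the warning labels, taking the first 10,000 rows as the sample, and the nearest-rank percentile computed over blocks are all assumptions; the real check may differ on each point.

```typescript
type Row = Record<string, unknown>;

// Nearest-rank percentile over an ascending array (an assumed definition).
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  const idx = Math.min(sortedAsc.length - 1, Math.max(0, rank - 1));
  return sortedAsc[idx] ?? 0;
}

function blockSizeWarnings(rows: Row[], fields: string[], sampleLimit = 10000): string[] {
  // Assumption: the sample is simply the first `sampleLimit` rows.
  const sample = rows.slice(0, sampleLimit);
  const counts = new Map<string, number>();
  for (const row of sample) {
    // Raw-value grouping: concatenate the untransformed field values.
    const key = fields.map((f) => String(row[f] ?? "")).join("|");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const sizes = Array.from(counts.values()).sort((a, b) => a - b);
  const warnings: string[] = [];
  if (sizes.length === 0) return warnings;
  if (percentile(sizes, 99) > 5000) warnings.push("block_too_large"); // scoring will be slow
  if (percentile(sizes, 50) < 2) warnings.push("block_too_selective"); // most records alone
  return warnings;
}
```

With this definition, a blocker on a constant column yields one oversized block, while a blocker on an ID column yields only singleton blocks.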
Walks every matchkey field and demotes/drops scorers that depend on remote model downloads:
- 'embedding' -> 'ensemble' (in-place scorer swap)
- 'record_embedding' fields dropped entirely
- weighted matchkey rerank=true -> false
Skipped entirely when allowRemoteAssets=true or llmScorer.enabled=true (caller has already committed to remote round-trips). Matchkeys that lose all their fields to the demotion emit remote_asset_matchkey_empty and are removed.
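The demotion rules in this commit might look roughly like the sketch below. The scorer names and the `remote_asset_matchkey_empty` label come from the commit message; the `Matchkey`, `FieldConfig`, and `DemoteOptions` shapes and the function name are hypothetical, and the sketch returns copies rather than mutating in place.

```typescript
interface FieldConfig {
  field: string;
  scorer: string;
}

interface Matchkey {
  name: string;
  type: "exact" | "weighted";
  fields: FieldConfig[];
  rerank?: boolean;
}

interface DemoteOptions {
  allowRemoteAssets?: boolean;
  llmScorerEnabled?: boolean;
}

function demoteRemoteScorers(
  matchkeys: Matchkey[],
  opts: DemoteOptions = {},
): { matchkeys: Matchkey[]; findings: string[] } {
  // The caller has already opted in to remote round-trips: leave the config alone.
  if (opts.allowRemoteAssets || opts.llmScorerEnabled) {
    return { matchkeys, findings: [] };
  }
  const out: Matchkey[] = [];
  const findings: string[] = [];
  for (const mk of matchkeys) {
    const fields = mk.fields
      .filter((f) => f.scorer !== "record_embedding") // dropped entirely
      .map((f) => (f.scorer === "embedding" ? { ...f, scorer: "ensemble" } : f));
    if (mk.fields.length > 0 && fields.length === 0) {
      // Every field depended on a remote model: remove the matchkey and report it.
      findings.push("remote_asset_matchkey_empty:" + mk.name);
      continue;
    }
    const demoted: Matchkey = { ...mk, fields };
    if (mk.type === "weighted" && mk.rerank === true) demoted.rerank = false;
    out.push(demoted);
  }
  return { matchkeys: out, findings };
}
```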
…elds Looks up each weighted matchkey field against the passed-in profiles (optional — no-op when profiles absent). If the classifier confidence is <0.5 and the configured weight is >0.5, cap the weight at 0.5. This keeps a field that was classified with low confidence from dominating the weighted score just because it happened to land with a high weight during autoconfig construction.
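The weight cap reduces to a small transformation. In this sketch the 0.5 thresholds come from the commit message, while `capLowConfidenceWeights` and the `ColumnProfile` and `WeightedField` shapes are hypothetical stand-ins for the real types.

```typescript
interface ColumnProfile {
  name: string;
  confidence: number;
}

interface WeightedField {
  field: string;
  weight: number;
}

function capLowConfidenceWeights(
  fields: WeightedField[],
  profiles?: ColumnProfile[],
): WeightedField[] {
  // No profiles passed in: the check is a no-op.
  if (!profiles) return fields;
  const confidenceByName = new Map<string, number>();
  for (const p of profiles) confidenceByName.set(p.name, p.confidence);
  return fields.map((f) => {
    const confidence = confidenceByName.get(f.field);
    // Cap only when the classifier was unsure AND the weight is above the cap.
    if (confidence !== undefined && confidence < 0.5 && f.weight > 0.5) {
      return { ...f, weight: 0.5 };
    }
    return f;
  });
}
```

For example, a field classified at confidence 0.3 with weight 0.9 comes back with weight 0.5, while a field at confidence 0.9 keeps its weight.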
- Extend AutoconfigOptions with optional strict + allowRemoteAssets flags.
- Detect domain from columns and stash _domainProfile on the config when confidence >0.7, so preflight Check 1 can auto-repair __<col>__ references.
- Run preflight at the end of autoConfigureRows with the profiled ColumnProfile list as 'profiles' and the caller's allowRemoteAssets setting; throw ConfigValidationError on unrepairable errors.
- Stamp _preflightReport on the returned config; stamp _strictAutoconfig only when strict=true.
- Update two pre-existing autoconfig tests to reflect preflight's new effect: exact_email / exact_phone matchkeys built from 100%-unique fixtures are now dropped as cardinality_high (repaired warning in the preflight report, weighted_identity still present).
_applyPostflight helper shared between both pipelines (mirrors Python's _apply_postflight). isPreflightReport guard rejects stale/wrong-type objects. Threshold adjustments apply to pairScores before clustering; empty-pair case logs an advisory. runMatchPipeline threads postflightReport through from the delegated runDedupePipeline result.
README: new Verification section. Version: 0.1.0 -> 0.3.0. JSDoc: every exported symbol in autoconfigVerify.ts documented.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Summary
Ports the Python v1.5.0 auto-config verification layer to the TypeScript port at `packages/goldenmatch-js/`. Ships as `goldenmatch-js-v0.3.0` on npm.
What shipped
Preflight (6 checks) — `preflight(rows, config, opts)` at end of `autoConfigureRows`:
Postflight (4 signals) — `postflight(rows, config, {pairScores})` inside `runDedupePipeline` and `runMatchPipeline`:
Four classifier fixes (Phase 0 — prerequisite):
Config escape hatch: `GoldenMatchConfig` gains three non-readonly optional underscore fields (`_preflightReport`, `_strictAutoconfig`, `_domainProfile`). All existing public fields stay `readonly`. `stripConventionPrivate` utility strips them for YAML/JSON export. Scope-creep warning in-code.
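To illustrate the export step, here is a minimal sketch of stripping underscore-prefixed fields before serialization. `stripUnderscoreFields` is a hypothetical stand-in, not the shipped `stripConventionPrivate`; it assumes a shallow strip of every top-level key that starts with `_`, which may not match the real utility's behavior.

```typescript
// Hypothetical stand-in for the export-time strip; shallow, top-level keys only.
function stripUnderscoreFields(config: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(config)) {
    // Convention-private fields start with an underscore and are not exported.
    if (key.charAt(0) !== "_") out[key] = config[key];
  }
  return out;
}

// Usage: internal state never reaches the serialized config.
const exported = JSON.stringify(
  stripUnderscoreFields({ threshold: 0.8, _preflightReport: {}, _strictAutoconfig: true }),
);
```

Returning a copy keeps the stamped report available on the in-memory config while the serialized form stays limited to public fields.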
Public API adds (top-level `goldenmatch` exports):
Breaking changes
Verification
Planning artifacts
Test plan
🤖 Generated with Claude Code