feat(xorq-stats): compile_batch_expr for portable Phase-1 aggregate expressions #766
paddymul wants to merge 2 commits into
… as portable expressions
Adds `XorqStatPipeline.compile_batch_expr(table) -> (expr, errors)` so callers
can extract the Phase-1 batched-aggregate expression without executing it.
Pass an `xo.table(schema, name=...)` UnboundTable to get a portable, reusable
stat expression that can be cataloged / shipped / rebound to any source later.
The result shape is `(1 row) x (1 + N_batch_results)` with columns
`__total_length__` and `<col>|<stat>` — same naming the internal Phase-1
result reader has always used; now promoted to a class constant
(`TOTAL_LENGTH_KEY`) so external callers don't hard-code it.
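A minimal sketch of how a caller might unpack that one-row result, assuming the row has already been materialised as a plain mapping from column name to scalar. The helper name `split_batch_row` is hypothetical and not part of the pipeline; only the `__total_length__` key and the `<col>|<stat>` naming come from the description above.

```python
# Mirrors the value described for XorqStatPipeline.TOTAL_LENGTH_KEY.
TOTAL_LENGTH_KEY = "__total_length__"


def split_batch_row(row):
    """Split a one-row result mapping into (total_length, {col: {stat: value}}).

    Hypothetical helper for illustration only.
    """
    total_length = row[TOTAL_LENGTH_KEY]
    per_column = {}
    for key, value in row.items():
        if key == TOTAL_LENGTH_KEY:
            continue
        # rpartition splits on the last '|', so a '|' inside the column
        # name is tolerated; stat names are assumed not to contain one.
        col, _, stat = key.rpartition("|")
        per_column.setdefault(col, {})[stat] = value
    return total_length, per_column


row = {
    "__total_length__": 4,
    "price|min": 1.5,
    "price|max": 9.0,
    "qty|null_count": 1,
}
total, stats = split_batch_row(row)
```

Reading the key from a shared constant rather than a string literal is the point of promoting `TOTAL_LENGTH_KEY`: external readers stay in sync if the name ever changes.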
Refactors `process_table`'s Phase-1 build loop into `_build_batch_agg_exprs`,
shared by both the public method and the execution path. No behaviour change
to `process_table`: construction-time failures still land in the per-column
accumulator as `Err`, just routed through `StatError`.
Histograms are intentionally excluded — they're Phase 2, parameterised on
scalar min/max from Phase 1, so they can't be folded into one expression.
Computed Python stats (`non_null_count`, `nan_per`, `distinct_per`, `_type`,
`typing_stats`) are also out — they're pure Python on resolved scalars.
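To illustrate why these two groups sit outside the batched expression, here is a rough sketch in plain Python. The formulas and function names are assumptions made for illustration, not the pipeline's actual definitions; what matters is that both functions take already-resolved Phase-1 scalars as input.

```python
def derived_stats(total_length, null_count, distinct_count):
    """Computed stats from resolved Phase-1 scalars (assumed formulas)."""
    if total_length == 0:
        # Avoid dividing by zero on an empty table.
        return {"non_null_count": 0, "nan_per": 0.0, "distinct_per": 0.0}
    return {
        "non_null_count": total_length - null_count,
        "nan_per": null_count / total_length,
        "distinct_per": distinct_count / total_length,
    }


def histogram_edges(col_min, col_max, bins=10):
    """Equal-width bin edges; needs min and max as concrete numbers first."""
    width = (col_max - col_min) / bins
    return [col_min + i * width for i in range(bins + 1)]


# Phase 1 must have executed before either call can be made.
phase1 = {"total": 10, "nulls": 2, "distinct": 5, "min": 0.0, "max": 10.0}
computed = derived_stats(phase1["total"], phase1["nulls"], phase1["distinct"])
edges = histogram_edges(phase1["min"], phase1["max"], bins=5)
```

Because the bin edges depend on values that only exist after Phase 1 runs, a histogram query has to be built in a second round trip rather than folded into the same aggregate.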
Rebind pattern documented in the docstring:
    unbound = xo.table(schema, name="t")
    expr, _ = pipeline.compile_batch_expr(unbound)
    bound = expr.op().replace({unbound.op(): real_source.op()}).to_expr()
    df = bound.execute()
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
📦 TestPyPI package published
pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.2.dev26006465187
or with uv:
uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.2.dev26006465187
MCP server for Claude Code
claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.2.dev26006465187" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table
📖 Docs preview
🎨 Storybook preview
- Assert batch funcs provide exactly one stat key. This surfaces the long-standing implicit constraint (the named aggregate column uses ``sf.provides[0].name``; multiple provides would silently drop after the first).
- Clarify in process_table why construction_errors aren't appended to all_errors directly: they reach the caller via resolve_accumulator.
- compile_batch_expr docstring: note the table.aggregate wrapper is already applied, and that the rebind source must be schema-compatible.
- test_returns_unbound_when_given_unbound: walk the op tree via ``op.find(UnboundTable)`` instead of substring-matching repr(expr), so the test survives ibis repr-format changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
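The single-key constraint from the first bullet can be sketched as follows. The `StatKey` and `BatchStatFunc` classes and the `batch_column_name` helper are hypothetical stand-ins; only the `provides[0].name` access and the `<col>|<stat>` naming are taken from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class StatKey:
    name: str


@dataclass
class BatchStatFunc:
    # Hypothetical stand-in for a batch stat function descriptor.
    provides: list = field(default_factory=list)


def batch_column_name(col, sf):
    """Build the '<col>|<stat>' aggregate name, refusing ambiguous funcs."""
    # Without this check, any key after provides[0] would be dropped silently.
    assert len(sf.provides) == 1, (
        f"batch stat funcs must provide exactly one key, got {len(sf.provides)}"
    )
    return f"{col}|{sf.provides[0].name}"


name = batch_column_name("price", BatchStatFunc([StatKey("min")]))
```

Failing loudly at construction time turns a silent data loss into an immediate, attributable error.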
Design notes —
Summary
Adds `XorqStatPipeline.compile_batch_expr(table) -> (expr, errors)`, a way to extract the Phase-1 batched-aggregate expression without executing it. Pass an `xo.table(schema, name=...)` UnboundTable and you get a portable, reusable summary-stats expression you can catalog, ship across processes, or rebind to any source later.

Motivating use case: xorq/buckaroo users want to save the summary-stats config as an unbound expression so it can live alongside other catalog entries instead of being trapped inside `process_table`'s execute loop.

What's in the expression

Shape: `(1 row) x (1 + N_batch_results)`. Columns:

- `__total_length__`: table-level row count (promoted to `XorqStatPipeline.TOTAL_LENGTH_KEY`)
- `<col>|<stat>` for every batch-phase (col, stat) pair that survived the column filter

Only Phase-1 batched stats are folded in (`null_count`, `min`, `max`, `distinct_count`, `mean`, `std`, `median`). Histograms and pure-Python computed stats (`non_null_count`, `nan_per`, `distinct_per`, `_type`, `typing_stats`) are not: histograms need scalar min/max from Phase 1, and the computed stats are Python on resolved scalars.

Internals

Refactors `process_table`'s Phase-1 build loop into `_build_batch_agg_exprs(table)` returning `(agg_exprs, batch_items, errors)`. Shared by the new public method and the execution path. No behaviour change to `process_table`: construction-time failures still land in the per-column accumulator as `Err`, just routed through `StatError` now.

Usage

Test plan

- `pytest tests/unit/test_xorq_compile_batch_expr.py`: 7 new tests pass (unbound stays unbound, column naming, no histogram, rebind matches `process_table` baseline, real-table input, construction-error surfacing, `process_table` regression)
- `pytest tests/unit/test_xorq_*.py`: 79 existing xorq tests pass

🤖 Generated with Claude Code