fix(sort-pushdown): restore SortExec elimination after stats-based file reorder#22493
Conversation
…le reorder PR apache#21956 added a `column_in_file_schema` signal that routes plain-column sort requests with stripped output_ordering into the Inexact branch of `FileScanConfig::try_pushdown_sort`. The Inexact branch unconditionally strips `output_ordering` even when the post-sort file groups have become non-overlapping and the ordering would re-validate — so `SortExec` stayed above the source where it had previously been eliminated. Symptom: with `WITH ORDER (key ASC)` (or inferred parquet `sorting_columns` metadata) plus files listed out of order on disk, PushdownSort reordered the files but kept the SortExec, regressing the behaviour PR apache#21182 introduced. Fix: in `rebuild_with_source`, when `is_exact=false` but `all_non_overlapping=true` after the stats-based sort, re-validate the declared ordering against the new file groups. If `ordering_satisfy` passes, keep `output_ordering` so the outer wrapper can upgrade Inexact back to Exact. The outer `FileScanConfig::try_pushdown_sort` now also inspects `output_ordering` on the Inexact branch and returns Exact when it survived. Safety: - When `self.output_ordering` is empty (no `WITH ORDER`, no `sorting_columns` metadata) the re-validate falls through, ordering is stripped, and SortExec is kept. - When `all_non_overlapping=false`, ordering is stripped as before. SLT updates: `sort_pushdown.slt` tests 4.1, 6.1, 8.1, G.1, G.2 now expect the SortExec to be eliminated (the behaviour `apache#21182` originally delivered and SLT comments still describe). New `Test 5b` covers the safety case: files written without `ORDER BY` (no sorting_columns) and no `WITH ORDER` must keep SortExec even when stats happen to be non-overlapping.
There was a problem hiding this comment.
Pull request overview
This PR fixes a regression in sort pushdown where SortExec was no longer eliminated for single-partition, multi-file scans after stats-based file reordering (especially for WITH ORDER / inferred parquet sorting_columns cases) by re-validating ordering after reorder and allowing Inexact to upgrade back to Exact when warranted.
Changes:
- Re-validate declared
output_orderingafter stats-based file reordering whenall_non_overlapping=true, preserving ordering when it now satisfies the requested sort. - In
FileScanConfig::try_pushdown_sort, upgrade theInexactbranch toExactwhenoutput_orderingsurvives rebuild (enablingSortExecelimination again). - Update
sort_pushdown.sltexpectations to reflect restoredSortExecelimination and add a new “no declared ordering” safety test.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
datafusion/datasource/src/file_scan_config/sort_pushdown.rs |
Adds post-reorder re-validation to preserve output_ordering (enabling SortExec elimination) on previously-Inexact paths. |
datafusion/datasource/src/file_scan_config/mod.rs |
Allows Inexact pushdown results to upgrade to Exact when rebuild preserves output_ordering. |
datafusion/sqllogictest/test_files/sort_pushdown.slt |
Updates explain-plan expectations for restored elimination and adds a safety regression test for “no declared ordering”. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
1. NULL safety guard
In `rebuild_with_source`'s Inexact-branch re-validate path, mirror
the existing `try_sort_file_groups_by_statistics` fallback by
refusing to upgrade to Exact when any file in the (post-sort) file
groups has NULLs in the sort columns. Otherwise NULLs sitting at the
end of a non-last file (NULLS LAST) would land mid-stream after
concatenation, breaking the ordering.
2. SLT test cleanup
- Test 5b: add DROP TABLE statements for no_decl_low / no_decl_mid /
no_decl_high / no_decl_parquet so the test is self-contained.
- Test 5c (new): NULL-safety regression test — files with reversed
filesystem naming so the Inexact branch fires, and the lower-id
file contains NULLs in the sort column. Asserts SortExec stays
and query results are correct.
The doc-comment for FileScanConfig::try_pushdown_sort still described
the pre-fix Inexact branch ("SortExec kept, scan optimized"). Update
the diagram to reflect the new behaviour: rebuild_with_source can
upgrade back to Exact when post-sort file groups are non-overlapping,
the request re-validates, and no NULLs sit in the sort columns of
non-last files. Also note that the Unsupported fallback now mirrors
the same NULL check.
|
This restores SortExec elimination that #21956 accidentally regressed — plain-column wrong-order-file cases now go through Inexact and unconditionally lose Thanks! |
Replace the nested `if all_non_overlapping { if is_exact {..} else {..} }
else { false }` in `rebuild_with_source` with a `match (all_non_overlapping,
is_exact)`. No behaviour change — the three arms map one-to-one to the prior
branches and make the "overlap → drop", "exact → keep", "inexact →
re-validate" cases explicit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xact upgrade When `try_pushdown_sort` upgrades the `Inexact` result back to `Exact` (post-sort file groups non-overlapping + declared ordering re-validates), the `SortExec` is removed. But the source rebuilt from the `Inexact` result still carried `sort_order_for_reorder` / `reverse_row_groups`, so the parquet opener kept reordering row groups at runtime. With the `SortExec` gone, that runtime reorder is no longer a harmless optimisation — it is load-bearing for correctness, and it is wrong for DESC requests. The opener sorts row groups ASC-by-min and then `reverse()`s them; two row groups within a single file that share the same `min` but differ in `max` get mis-ordered by the reverse. Example: a file declared `WITH ORDER (id DESC)` containing `[10,8,8,8]` whose row groups are `[10,8]` and `[8,8]` (both min 8) streams as `8,8,10,8`. Previously the `SortExec` re-sorted that; once eliminated, the wrong order is the final result. Because the upgrade only fires when each file's declared ordering re-validates against the non-overlapping post-sort file groups, reading the files in their natural (declared-sorted) order already yields the requested ordering — no runtime reorder is needed. Restore the original, hint-free source on upgrade, matching the `Unsupported → Exact` path which also reads files naturally. Existing tests 4.1 / 6.1 / 8.1 / G.1 / G.2 lose the now-absent `sort_order_for_reorder` / `reverse_row_groups` from their `DataSourceExec` plans; data results are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add Test 8b: a DESC scan over a file whose row groups share the same min value (`[10,8,8,8]` → row groups `[10,8]` and `[8,8]`), listed out of order on disk so the Inexact→Exact upgrade fires and SortExec is eliminated. Without the fix this returns `8,8,10,8,...` (id=10 in third position): the opener reorders row groups ASC-by-min then reverses, mis-ordering the two min=8 row groups, and the eliminated SortExec no longer masks it. The test asserts both the clean plan (no runtime reorder hints) and the correct DESC result. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(sort-pushdown): drop runtime row-group reorder hints on Inexact→Exact upgrade (correctness)
adriangb
left a comment
There was a problem hiding this comment.
This looks great thank you!
| if config.output_ordering.is_empty() { | ||
| Ok(SortOrderPushdownResult::Inexact { | ||
| inner: Arc::new(config), | ||
| }) | ||
| } else { |
There was a problem hiding this comment.
I wonder if we can use some sort of early return / early break to avoid the indentation for the else branch.
There was a problem hiding this comment.
Good point @adriangb , addressed in latest PR.
- adriangb r3294820727: flatten the if/else in try_pushdown_sort Inexact arm into an early-return guard clause so the Exact-upgrade block sits at one less level of indentation - typos CI: 'mis-orders' tripped the spell checker on leading 'mis'; reword the comment as 'reorders ... incorrectly' (no behaviour change)
|
Thanks @adriangb for review, merged to main now, will submit a PR target to branch 54. |
…ats-based file reorder (#22501) ## Which issue does this PR close? Cherry-pick of #22493 onto `branch-54`. ## Rationale for this change `branch-54` includes #21956 (`feat: globally reorder files and row groups by statistics for TopK queries`), which introduced a regression: for plain-column, multi-file scans where the on-disk file order does not match the declared sort order, `SortExec` was no longer eliminated even when stats-based reorder produced non-overlapping file groups whose declared ordering re-validated. #22493 restores the pre-#21956 sort-elimination behaviour by re-validating `output_ordering` after `rebuild_with_source` reorders files, and (per @adriangb's correctness follow-up) restoring the original hint-free `file_source` on the Inexact→Exact upgrade so leftover `reverse_row_groups` / `sort_order_for_reorder` hints don't mis-order row groups within a single file once the `SortExec` safety net is gone. ## What changes are included in this PR? Straight cherry-pick of merge commit `94c58d086`. Includes: - `FileScanConfig::try_pushdown_sort` Inexact arm: re-validate, upgrade to Exact (with file_source restore), guard with NULL safety + early-return - `rebuild_with_source`: `match (all_non_overlapping, is_exact)` decision table for keep_ordering - SLT updates restoring `SortExec` elimination expectations + Tests 5b/5c/8b for the NULL-safety and same-min row-group edge cases ## Are these changes tested? Cherry-picked cleanly (auto-merge in `sort_pushdown.rs`). `cargo build -p datafusion-datasource` — passes. `cargo test -p datafusion-sqllogictest --test sqllogictests -- sort_pushdown` — passes. ## Are there any user-facing changes? Same as #22493: plain-column wrong-order-files cases regain SortExec elimination when files happen to be non-overlapping by statistics. No new API. Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ange detection apache#22460 added a consumer-side DynamicFilterTracker that subscribes once to every still-incomplete DynamicFilterPhysicalExpr inside a predicate and answers "has it changed?" with a single atomic load per filter — no tree walk on every check. Three call sites on this branch were polling the predicate's snapshot_generation on every batch / row-group boundary; switch them over: - RowGroupPruner (push_decoder.rs): owns a DynamicFilterTracking and rebuilds its PruningPredicate only when tracker.changed() reports an update. Falls back to a single up-front build for Static / AllComplete predicates. - Parquet opener force_per_rg_runs gate: "dynamic AND not yet complete" is exactly the Watching variant of DynamicFilterTracking — match on that instead of is_dynamic_physical_expr + is_filter_complete. - ParquetSource fmt_extra marker (dynamic_rg_pruning=eligible): use DynamicFilterTracking::classify(...).contains_dynamic_filter() so the deprecated is_dynamic_physical_expr can go. is_filter_complete and its unit tests in opener/mod.rs are removed — the semantics they asserted are now covered by DynamicFilterTracking tests in datafusion_physical_expr::expressions::dynamic_filters. SLT expectations for sort_pushdown.slt are also brought in line with upstream main (the SortExec-elimination changes from apache#22493 landed between this branch's last push and now).
Which issue does this PR close?
Regression introduced by #21956 — single-partition multi-file scans with
WITH ORDER(or inferred parquetsorting_columnsmetadata) plus fileslisted out of order on disk no longer have their
SortExeceliminated,even when the stats-based file reorder produces a non-overlapping layout.
This was the central case PR #21182 was designed to fix.
Rationale for this change
In the Phase 2 sort pushdown design (#21182), when a file source returned
Unsupportedbecausevalidated_output_ordering()stripped its declaredordering, the outer
FileScanConfig::try_pushdown_sortinvoked a fallback(
try_sort_file_groups_by_statistics) that reordered files by min/maxstats, re-validated the ordering against the new file groups, and could
upgrade the result back to
Exact— dropping the outerSortExec.PR #21956 added the
column_in_file_schemasignal so that plain-columnsort requests would return
Inexact(withsort_order_for_reorderset)instead of
Unsupported. That enabled runtime per-RG reorder, but italso pulled the typical wrong-file-order case out of the
Unsupported-fallback re-validation path. The Inexact branch ofrebuild_with_sourcealways stripsoutput_ordering, so even when thepost-sort file groups became non-overlapping and the declared ordering
would re-validate, the outer wrapper returned
Inexactand PushdownSortkept the
SortExec.The SLT comment on test 6.1 (
# … → SortExec eliminated) still describesthe pre-#21956 behaviour, but the recorded expectation was updated to
match the post-#21956 plan where SortExec stayed — a silent regression.
What changes are included in this PR?
rebuild_with_sourceWhen
is_exact=falsebut the stats-based file sort producedall_non_overlapping=true, re-validate the declaredoutput_orderingagainst the new file groups. If
ordering_satisfypasses, preserveoutput_orderinginstead of stripping it.Outer
FileScanConfig::try_pushdown_sortOn the
Inexactbranch, inspectconfig.output_orderingafter rebuildand return
Exactif it survived. This restores theUnsupported → upgrade Exactsemantics, but on theInexactbranch theplain-column path now follows.
Safety
WITH ORDERand noparquet
sorting_columnsmetadata),self.output_orderingis empty,the re-validate produces no orderings,
ordering_satisfyreturnsfalse, andSortExecstays. min/max stats alone never trigger anupgrade.
all_non_overlapping=false(overlapping files post-sort),output_orderingis stripped exactly as before.Tests
sort_pushdown.slt:WITH ORDER ASC),8.1 (
WITH ORDER DESCwith reverse), G.1, G.2 (multi-partition SPM + BufferExec) — expectations updated to match the
restored
SortExec-eliminated plans (matches the PR feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) #21182 era plansand the SLT comments that were already in the file).
ORDER BY(nosorting_columnsmetadata) + external table withoutWITH ORDER:asserts that even when min/max stats happen to be non-overlapping,
SortExecis kept.Are these changes tested?
Yes —
sort_pushdown.sltcovers the affected paths comprehensively.Full
cargo test -p datafusion-sqllogictest --test sqllogictestsandcargo test -p datafusion-datasource --lib/-p datafusion-datasource-parquet --lib/-p datafusion-physical-optimizerpass.
cargo clippy -D warningsclean.Are there any user-facing changes?
Yes — for the in-scope scenarios above,
EXPLAINplans now show theSortExecremoved andoutput_orderingset onDataSourceExec. Queryresults are unaffected.