Skip to content

[branch-54] Cherry-pick #22493: restore SortExec elimination after stats-based file reorder#22501

Merged
zhuqi-lucas merged 1 commit into
apache:branch-54from
zhuqi-lucas:fix-sort-elim-branch-54
May 24, 2026
Merged

[branch-54] Cherry-pick #22493: restore SortExec elimination after stats-based file reorder#22501
zhuqi-lucas merged 1 commit into
apache:branch-54from
zhuqi-lucas:fix-sort-elim-branch-54

Conversation

@zhuqi-lucas
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Cherry-pick of #22493 onto branch-54.

Rationale for this change

branch-54 includes #21956 (feat: globally reorder files and row groups by statistics for TopK queries), which introduced a regression: for plain-column, multi-file scans where the on-disk file order does not match the declared sort order, SortExec was no longer eliminated even when stats-based reorder produced non-overlapping file groups whose declared ordering re-validated.

#22493 restores the pre-#21956 sort-elimination behaviour by re-validating output_ordering after rebuild_with_source reorders files, and (per @adriangb's correctness follow-up) restoring the original hint-free file_source on the Inexact→Exact upgrade so leftover reverse_row_groups / sort_order_for_reorder hints don't mis-order row groups within a single file once the SortExec safety net is gone.

What changes are included in this PR?

Straight cherry-pick of merge commit 94c58d086. Includes:

  • FileScanConfig::try_pushdown_sort Inexact arm: re-validate, upgrade to Exact (with file_source restore), guard with NULL safety + early-return
  • rebuild_with_source: match (all_non_overlapping, is_exact) decision table for keep_ordering
  • SLT updates restoring SortExec elimination expectations + Tests 5b/5c/8b for the NULL-safety and same-min row-group edge cases

Are these changes tested?

Cherry-picked cleanly (auto-merge in sort_pushdown.rs).
cargo build -p datafusion-datasource — passes.
cargo test -p datafusion-sqllogictest --test sqllogictests -- sort_pushdown — passes.

Are there any user-facing changes?

Same as #22493: plain-column wrong-order-files cases regain SortExec elimination when files happen to be non-overlapping by statistics. No new API.

…le reorder (apache#22493)

## Which issue does this PR close?

- Closes apache#22494

Regression introduced by apache#21956 — single-partition multi-file scans with
`WITH ORDER` (or inferred parquet `sorting_columns` metadata) plus files
listed out of order on disk no longer have their `SortExec` eliminated,
even when the stats-based file reorder produces a non-overlapping
layout.
This was the central case PR apache#21182 was designed to fix.

## Rationale for this change

In the Phase 2 sort pushdown design (apache#21182), when a file source
returned
`Unsupported` because `validated_output_ordering()` stripped its
declared
ordering, the outer `FileScanConfig::try_pushdown_sort` invoked a
fallback
(`try_sort_file_groups_by_statistics`) that reordered files by min/max
stats, re-validated the ordering against the new file groups, and could
upgrade the result back to `Exact` — dropping the outer `SortExec`.

PR apache#21956 added the `column_in_file_schema` signal so that plain-column
sort requests would return `Inexact` (with `sort_order_for_reorder` set)
instead of `Unsupported`. That enabled runtime per-RG reorder, but it
also pulled the typical wrong-file-order case out of the
`Unsupported`-fallback re-validation path. The Inexact branch of
`rebuild_with_source` always strips `output_ordering`, so even when the
post-sort file groups became non-overlapping and the declared ordering
would re-validate, the outer wrapper returned `Inexact` and PushdownSort
kept the `SortExec`.

The SLT comment on test 6.1 (`# … → SortExec eliminated`) still
describes
the pre-apache#21956 behaviour, but the recorded expectation was updated to
match the post-apache#21956 plan where SortExec stayed — a silent regression.

## What changes are included in this PR?

### `rebuild_with_source`

When `is_exact=false` but the stats-based file sort produced
`all_non_overlapping=true`, re-validate the declared `output_ordering`
against the new file groups. If `ordering_satisfy` passes, preserve
`output_ordering` instead of stripping it.

### Outer `FileScanConfig::try_pushdown_sort`

On the `Inexact` branch, inspect `config.output_ordering` after rebuild
and return `Exact` if it survived. This restores the
`Unsupported → upgrade Exact` semantics, but on the `Inexact` branch the
plain-column path now follows.

### Safety

- When the source has no declared ordering (no `WITH ORDER` and no
  parquet `sorting_columns` metadata), `self.output_ordering` is empty,
  the re-validate produces no orderings, `ordering_satisfy` returns
  `false`, and `SortExec` stays. min/max stats alone never trigger an
  upgrade.
- When `all_non_overlapping=false` (overlapping files post-sort),
  `output_ordering` is stripped exactly as before.

### Tests

`sort_pushdown.slt`:

- Tests **4.1** (parquet metadata), **6.1** (`WITH ORDER ASC`),
  **8.1** (`WITH ORDER DESC` with reverse), **G.1**, **G.2** (multi-
  partition SPM + BufferExec) — expectations updated to match the
  restored `SortExec`-eliminated plans (matches the PR apache#21182 era plans
  and the SLT comments that were already in the file).
- New **Test 5b** — files written without `ORDER BY` (no
  `sorting_columns` metadata) + external table without `WITH ORDER`:
  asserts that even when min/max stats happen to be non-overlapping,
  `SortExec` is kept.

## Are these changes tested?

Yes — `sort_pushdown.slt` covers the affected paths comprehensively.
Full `cargo test -p datafusion-sqllogictest --test sqllogictests` and
`cargo test -p datafusion-datasource --lib` / `-p
datafusion-datasource-parquet --lib` / `-p
datafusion-physical-optimizer`
pass. `cargo clippy -D warnings` clean.

## Are there any user-facing changes?

Yes — for the in-scope scenarios above, `EXPLAIN` plans now show the
`SortExec` removed and `output_ordering` set on `DataSourceExec`. Query
results are unaffected.

---------

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 94c58d0)
Copilot AI review requested due to automatic review settings May 24, 2026 15:45
@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate labels May 24, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Cherry-pick of #22493 onto branch-54 to restore SortExec elimination behaviour that was regressed by #21956. When a file source returns Inexact because pre-sort validated_output_ordering() stripped the declaration (files listed out of order on disk), the outer try_pushdown_sort now re-validates the declared ordering against the stats-reordered file groups and, if it passes (and no NULL-safety issue applies), upgrades back to Exact and restores the original hint-free file_source so leftover reverse_row_groups / sort_order_for_reorder hints don't mis-order row groups within a single file once the SortExec safety net is gone.

Changes:

  • FileScanConfig::try_pushdown_sort Inexact arm: re-check output_ordering, upgrade to Exact and restore original file_source (hint-free) on upgrade.
  • rebuild_with_source: explicit match (all_non_overlapping, is_exact) decision table; (true, false) re-validates ordering after stats reorder with a NULL-in-non-last-file guard.
  • SLT updates restoring SortExec-eliminated expectations for tests 4.1, 6.1, 8.1, G.1, G.2 plus new tests 5b/5c/8b covering no-declaration, NULL-safety and same-min row-group edge cases.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
datafusion/datasource/src/file_scan_config/mod.rs Updates try_pushdown_sort Inexact arm to re-check ordering after rebuild and upgrade to Exact (with original file_source restored); refreshes the docstring flow diagram.
datafusion/datasource/src/file_scan_config/sort_pushdown.rs Replaces the is_exact && all_non_overlapping keep with a (all_non_overlapping, is_exact) decision table; adds (true, false) branch that runs a NULL-in-sort-cols check and ordering_satisfy re-validation.
datafusion/sqllogictest/test_files/sort_pushdown.slt Restores SortExec-eliminated EXPLAIN expectations for tests 4.1/6.1/8.1/G.1/G.2; adds tests 5b (no declaration), 5c (NULL safety), 8b (same-min RG within one file).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@zhuqi-lucas zhuqi-lucas requested review from adriangb and alamb May 24, 2026 15:46
Copy link
Copy Markdown
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Since this fixes a regression in the yet unreleased 54.0.0 I think we should cherry pick it. Can you make sure we update the 54.0.0 tracking issue?

@zhuqi-lucas
Copy link
Copy Markdown
Contributor Author

Thanks. Since this fixes a regression in the yet unreleased 54.0.0 I think we should cherry pick it. Can you make sure we update the 54.0.0 tracking issue?

Thanks @adriangb , i added this PR to the tracking issue for 54.0.0.

@zhuqi-lucas zhuqi-lucas merged commit 1875365 into apache:branch-54 May 24, 2026
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants