Skip to content

SortExec elimination regression for wrong-order multi-file scans with WITH ORDER / sorting_columns metadata (regression from #21956) #22494

@zhuqi-lucas

Description

@zhuqi-lucas

Describe the bug

PR #21956 added the column_in_file_schema signal in ParquetSource::try_pushdown_sort, which routes plain-column sort requests through the Inexact branch when the source's output_ordering was stripped by validated_output_ordering() (files listed in wrong order on disk).

The Inexact branch of FileScanConfig::try_pushdown_sort calls rebuild_with_source with is_exact=false, which unconditionally strips output_ordering even when the post-sort file groups are non-overlapping and the declared ordering would re-validate. As a result, SortExec stays above the source for the canonical Phase 2 scenario (#21182) — even when stats-based file reorder restores a perfectly valid ordering.

Before #21956, these cases returned Unsupported and were upgraded to Exact via try_sort_file_groups_by_statistics (the fallback) — SortExec was eliminated.

Repro

See sort_pushdown.slt Test 6.1 — reversed_with_order_parquet with three out-of-order files: the comment says "SortExec eliminated" but the recorded plan keeps SortExec.

CREATE EXTERNAL TABLE reversed_with_order_parquet(id INT, value INT)
STORED AS PARQUET
LOCATION 'test_files/scratch/sort_pushdown/reversed/'   -- [a_high(7-9), b_mid(4-6), c_low(1-3)] alphabetical
WITH ORDER (id ASC);

EXPLAIN SELECT * FROM reversed_with_order_parquet ORDER BY id ASC;
-- physical_plan
-- 01)SortExec: expr=[id@0 ASC NULLS LAST], preserve_partitioning=[false]   ← should be gone
-- 02)--DataSourceExec: file_groups={1 group: [[c_low, b_mid, a_high]]}, ..., sort_order_for_reorder=[id@0 ASC NULLS LAST]

Expected

SortExec eliminated after Phase 2 reorder when:

  • ordering is declared (WITH ORDER or parquet sorting_columns metadata),
  • post-sort file groups are non-overlapping,
  • no NULLs in the sort columns of non-last files.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions