feat: Implement columnStatistics() for Nimble SelectiveNimbleReader to enable file-level filter pushdown by kewang1024 · Pull Request #627 · facebookincubator/nimble

kewang1024 · 2026-03-31T18:14:13Z

Summary:
Nimble writes rich file-level column statistics via VectorizedFileStats
(min/max/count/nullCount per column for integer, floating-point, and
string types) in the "columnar.vectorized_stats" optional section.
However, SelectiveNimbleReader::columnStatistics() currently returns
nullptr, so Nimble files cannot participate in file-level filter
pushdown, the mechanism used in HiveConnectorUtil::testFilters() to
skip entire files whose stats prove no rows can match the query filter.

This diff bridges the gap by implementing columnStatistics() in
SelectiveNimbleReader:

- Adds toCommonColumnStatistics() helper that converts
  nimble::ColumnStatistics to dwio::common::ColumnStatistics subclasses:
  - IntegralStatistics -> IntegerColumnStatistics (min/max)
  - FloatingPointStatistics -> DoubleColumnStatistics (min/max)
  - StringStatistics -> StringColumnStatistics (min/max)
  - Base ColumnStatistics -> base ColumnStatistics (valueCount/hasNull/size)
- Loads VectorizedFileStats in ReaderBase at construction time,
  exposed via fileColumnStats(). This is shared by both columnStatistics()
  (for file-level filter pushdown) and computeStatsBasedRowSize()
  (for row size estimation), eliminating duplicate stats loading.
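
The type mapping above can be illustrated with a minimal, self-contained sketch. It is not the code in this diff: every type in the `sketch` namespace is a hypothetical stand-in for the corresponding nimble:: or dwio::common:: class, `toCommon()` is an illustrative name, the `size` field is omitted, and deriving `hasNull` from `nullCount > 0` is an assumption. It requires C++17.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <type_traits>
#include <variant>

namespace sketch {

// Hypothetical stand-ins for the writer-side (Nimble) statistics.
struct BaseStats {
  uint64_t valueCount{0};
  uint64_t nullCount{0};
};
struct IntegralStats : BaseStats {
  std::optional<int64_t> min, max;
};
struct FloatingPointStats : BaseStats {
  std::optional<double> min, max;
};
struct StringStats : BaseStats {
  std::optional<std::string> min, max;
};
using NimbleStats =
    std::variant<BaseStats, IntegralStats, FloatingPointStats, StringStats>;

// Hypothetical stand-ins for the reader-facing statistics classes.
struct CommonStats {
  virtual ~CommonStats() = default;
  uint64_t valueCount{0};
  bool hasNull{false};
};
struct IntegerCommonStats : CommonStats {
  std::optional<int64_t> min, max;
};
struct DoubleCommonStats : CommonStats {
  std::optional<double> min, max;
};
struct StringCommonStats : CommonStats {
  std::optional<std::string> min, max;
};

// Copies min/max from a typed source into the matching typed target.
template <typename Out, typename In>
std::unique_ptr<Out> withRange(const In& in) {
  auto out = std::make_unique<Out>();
  out->min = in.min;
  out->max = in.max;
  return out;
}

// Picks the target subclass from the source type, then fills the
// fields shared by every statistics object.
inline std::unique_ptr<CommonStats> toCommon(const NimbleStats& stats) {
  return std::visit(
      [](const auto& s) -> std::unique_ptr<CommonStats> {
        using T = std::decay_t<decltype(s)>;
        std::unique_ptr<CommonStats> out;
        if constexpr (std::is_same_v<T, IntegralStats>) {
          out = withRange<IntegerCommonStats>(s);
        } else if constexpr (std::is_same_v<T, FloatingPointStats>) {
          out = withRange<DoubleCommonStats>(s);
        } else if constexpr (std::is_same_v<T, StringStats>) {
          out = withRange<StringCommonStats>(s);
        } else {
          out = std::make_unique<CommonStats>();
        }
        out->valueCount = s.valueCount;
        out->hasNull = s.nullCount > 0; // assumption, see lead-in
        return out;
      },
      stats);
}

} // namespace sketch
```

A caller would downcast the result (for example with `dynamic_cast`) to read the typed min/max, and fall back to the base fields when the column type carries no range.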

End-to-end call chain for file-level filter pushdown:

Query with filter WHERE col > 200 on a Nimble file with col values [0, 100]:

SplitReader::prepareSplit()
  -> checkIfSplitIsEmpty()
    -> filterOnStats()
      -> testFilters()
        -> reader->columnStatistics(colId)
          -> [NEW] ReaderBase::fileColumnStats() (loaded at construction)
          -> [NEW] toCommonColumnStatistics() converts to IntegerColumnStatistics{min=0, max=100}
        -> testFilter(filter=">200", stats={min=0, max=100}, ...)
          -> testInt64Range(0, 100, mayHaveNull) returns false
        -> return false -> FILE SKIPPED
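
The final step of the chain, deciding from min/max alone that no row can match, can be sketched as follows. This is a simplified illustration rather than the Velox implementation: `IntRangeStats`, `GreaterThan`, and `mayMatch()` are hypothetical names, and the sketch assumes nulls never satisfy the comparison, so it ignores the null flag.

```cpp
#include <cstdint>
#include <optional>

namespace sketch {

// Hypothetical file-level statistics for one integer column.
struct IntRangeStats {
  std::optional<int64_t> min;
  std::optional<int64_t> max;
};

// Hypothetical filter representing `col > lowerExclusive`.
struct GreaterThan {
  int64_t lowerExclusive;
};

// Returns false only when the statistics prove that no row can pass
// the filter. Missing statistics are treated as "may match", so the
// file is read rather than skipped.
inline bool mayMatch(
    const GreaterThan& filter,
    const std::optional<IntRangeStats>& stats) {
  if (!stats.has_value() || !stats->max.has_value()) {
    return true;
  }
  return *stats->max > filter.lowerExclusive;
}

} // namespace sketch
```

With stats {min=0, max=100} and the filter `col > 200`, `mayMatch()` returns false and the file can be skipped. With no stats at all, which corresponds to the previous nullptr behavior, it returns true and the file is always read.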

Differential Revision: D98945345

meta-codesync · 2026-03-31T18:14:20Z

@kewang1024 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98945345.

meta-cla bot added the CLA Signed label Mar 31, 2026

meta-codesync bot added fb-exported meta-exported labels Mar 31, 2026

kewang1024 force-pushed the export-D98945345 branch 4 times, most recently from 6d2cf0e to f192fd2 Compare April 1, 2026 07:51

meta-codesync bot changed the title ~~Implement columnStatistics() for Nimble SelectiveNimbleReader to enable file-level filter pushdown~~ feat: Implement columnStatistics() for Nimble SelectiveNimbleReader to enable file-level filter pushdown Apr 1, 2026

kewang1024 force-pushed the export-D98945345 branch from f192fd2 to 59d2a02 Compare April 1, 2026 08:07

kewang1024 wants to merge 1 commit into facebookincubator:main from kewang1024:export-D98945345
