Skip to content

Fix Parquet LIST/MAP wrapper extraction in native and Avro readers#18325

Merged
xiangfu0 merged 1 commit intomasterfrom
codex/parquet-list-element-reader-fix
Apr 26, 2026
Merged

Fix Parquet LIST/MAP wrapper extraction in native and Avro readers#18325
xiangfu0 merged 1 commit intomasterfrom
codex/parquet-list-element-reader-fix

Conversation

@xiangfu0
Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 commented Apr 24, 2026

Problem

The native (ParquetNativeRecordReader) and Avro-backed (ParquetAvroRecordReader) Parquet readers were leaking the LIST/MAP wrapper structs into Pinot rows. A column of array<string> came back as [{"element":"abc"}, {"element":"xyz"}] instead of ["abc", "xyz"], and a map<string,string> came back as {"key_value": [{"key":"k","value":"v"}]} instead of {"k":"v"}. The broken shape propagated into segment generation, multi-value scans, and JSON paths.

Fixes #17420.

Before

{
  "topLevelTags": [{"element": "top-a"}, {"element": "top-b"}],
  "topLevelProperties": {"key_value": [{"key": "k-a", "value": "v-a"}, {"key": "k-b", "value": "v-b"}]},
  "metadata": {
    "element": "real-element-field",
    "tags": [{"element": "abc"}, {"element": "xyz"}]
  }
}

After

{
  "topLevelTags": ["top-a", "top-b"],
  "topLevelProperties": {"k-a": "v-a", "k-b": "v-b"},
  "metadata": {
    "element": "real-element-field",
    "tags": ["abc", "xyz"]
  }
}

Fix

Drive wrapper detection from the file schema, not from the extracted row data.

ParquetNativeRecordExtractor — use Parquet LogicalTypeAnnotation to tell LIST and MAP groups apart from plain structs.

  • extractList: hoist the per-row isListElementWrapper check out of the loop and dispatch the whole list down a single branch. Handles the standard 3-level encoding and legacy 2-level encodings (repeated primitive, repeated multi-field group, repeated single-field group not named element).
  • extractKeyValueMap: resolve the key/value field indices once from the schema and reuse them for every entry.

ParquetAvroRecordExtractor — check the Avro field schema first, then recurse via transformValue on the inner element field so nested LIST<LIST<...>> wrappers and handleDeprecatedTypes (e.g. INT96 → long) keep applying at every level.

Real struct fields named element are preserved; only fields the external schema actually identifies as LIST/MAP wrappers are normalized.

Note on MAP order

Parquet does not guarantee MAP entries are returned in any particular order on read (neither sorted nor insertion order). Writers, page boundaries, and dictionary encodings can all reorder entries. If you need a stable order, use LIST<STRUCT<key, value>> instead of the native MAP logical type. This is documented inline on extractKeyValueMap.

Tests

ParquetCollectionRecordReaderTest covers:

  • Hand-authored Parquet schemas through both readers.
  • Avro-schema-written Parquet files through both readers.
  • A checked-in golden Parquet fixture (collection-reader-fixture.parquet) with primitive types, DECIMAL/DATE/TIMESTAMP logical types, nested structs, LIST and MAP of scalars, struct lists, empty collections, and a real struct field named element.
  • Legacy 2-level list encodings (single-field non-element, multi-field, single-field-named-element wrapping a struct).
  • Nullable list elements.
  • Nested LIST<LIST<STRING>> through the Avro reader, including null inner element and null inner wrapper.

Validation

  • ./mvnw -pl pinot-plugins/pinot-input-format/pinot-parquet test — 19/19 pass.
  • ./mvnw spotless:apply checkstyle:check license:check -pl pinot-plugins/pinot-input-format/pinot-parquet — clean.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 97.14286% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 63.39%. Comparing base (068907e) to head (0638c31).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
...utformat/parquet/ParquetNativeRecordExtractor.java 96.96% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18325      +/-   ##
============================================
- Coverage     63.61%   63.39%   -0.22%     
- Complexity     1659     1668       +9     
============================================
  Files          3246     3252       +6     
  Lines        197549   198661    +1112     
  Branches      30577    30770     +193     
============================================
+ Hits         125662   125950     +288     
- Misses        61847    62642     +795     
- Partials      10040    10069      +29     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 ?
java-21 63.39% <97.14%> (-0.19%) ⬇️
temurin 63.39% <97.14%> (-0.22%) ⬇️
unittests 63.39% <97.14%> (-0.22%) ⬇️
unittests1 55.37% <ø> (-0.23%) ⬇️
unittests2 34.90% <97.14%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@xiangfu0 xiangfu0 force-pushed the codex/parquet-list-element-reader-fix branch 2 times, most recently from ad14680 to 53091f4 Compare April 24, 2026 21:34
@xiangfu0 xiangfu0 changed the title [codex] Fix Parquet list element wrapper extraction [codex] Fix Parquet collection wrapper extraction Apr 24, 2026
@xiangfu0 xiangfu0 force-pushed the codex/parquet-list-element-reader-fix branch from 53091f4 to 99f4698 Compare April 24, 2026 21:38
@xiangfu0 xiangfu0 marked this pull request as ready for review April 24, 2026 21:53
@xiangfu0 xiangfu0 changed the title [codex] Fix Parquet collection wrapper extraction Fix Parquet collection wrapper extraction Apr 24, 2026
@xiangfu0 xiangfu0 added enhancement Improvement to existing functionality ingestion Related to data ingestion pipeline extension-point Adds or modifies an extension/SPI point backward-incompat Introduces a backward-incompatible API or behavior change plugins Related to the plugin system labels Apr 24, 2026
@xiangfu0 xiangfu0 force-pushed the codex/parquet-list-element-reader-fix branch 2 times, most recently from 6c0097c to 7f86283 Compare April 24, 2026 23:58
@xiangfu0 xiangfu0 requested a review from Jackie-Jiang April 25, 2026 00:07
@xiangfu0 xiangfu0 force-pushed the codex/parquet-list-element-reader-fix branch 5 times, most recently from 3e75c5f to 8896e95 Compare April 25, 2026 01:40
@xiangfu0 xiangfu0 force-pushed the codex/parquet-list-element-reader-fix branch from cd06923 to 706c33c Compare April 25, 2026 06:09
@xiangfu0 xiangfu0 changed the title Fix Parquet collection wrapper extraction Fix Parquet LIST/MAP wrapper extraction in native and Avro readers Apr 25, 2026
@xiangfu0 xiangfu0 force-pushed the codex/parquet-list-element-reader-fix branch from 706c33c to 036cc98 Compare April 25, 2026 06:11
@xiangfu0 xiangfu0 requested a review from Jackie-Jiang April 25, 2026 06:12
Copy link
Copy Markdown
Contributor Author

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one high-signal correctness issue; see inline comment.

@xiangfu0 xiangfu0 force-pushed the codex/parquet-list-element-reader-fix branch from 036cc98 to 9c566f0 Compare April 25, 2026 20:48
@xiangfu0 xiangfu0 force-pushed the codex/parquet-list-element-reader-fix branch from 9c566f0 to e5c29d8 Compare April 25, 2026 21:36
@xiangfu0 xiangfu0 force-pushed the codex/parquet-list-element-reader-fix branch 5 times, most recently from 784c483 to 59bc43c Compare April 26, 2026 00:06
@xiangfu0 xiangfu0 requested a review from Jackie-Jiang April 26, 2026 02:46
@xiangfu0 xiangfu0 force-pushed the codex/parquet-list-element-reader-fix branch from 59bc43c to a3d401a Compare April 26, 2026 05:24
The native (ParquetNativeRecordReader) and Avro-backed
(ParquetAvroRecordReader) Parquet readers were leaking the LIST/MAP
wrapper structs into Pinot rows, so a column of array<string> came
back as [{"element":"abc"}, {"element":"xyz"}] instead of
["abc", "xyz"], and a map<string,string> came back as
{"key_value": [{"key":"k","value":"v"}]} instead of {"k":"v"}. The
broken shape propagated into segment generation, multi-value scans,
and JSON paths.

Approach: align with Apache Arrow / Parquet LogicalTypes spec
-------------------------------------------------------------
Both readers now follow Apache Arrow's Parquet behavior — wrapper
detection is driven by the Parquet LogicalType annotations and the
spec backward-compat rules, never by guessing from value shape or
field names.

Avro reader: set `parquet.avro.add-list-element-records=false` in the
Hadoop config so parquet-avro flattens the standard 3-level LIST
encoding directly to Avro `array<elem-type>`. With this off, there is
no LIST wrapper to strip on the Pinot side — user-defined records
like `array<record<UserTag, [element]>>` round-trip cleanly because
the file's Avro schema (when present in metadata) is honored as-is,
and hand-authored Parquet `LIST<T>` surfaces as flat values without
the wrapper artifact. The extractor reduces to plain delegation plus
the existing INT96 promotion.

Native reader (ParquetNativeRecordExtractor): apply the Parquet
LogicalTypes backward-compat rules in extractList:
  1. Repeated primitive: the primitive IS the element (no wrapper).
  2. Repeated multi-field group: the group IS the element.
  3. Repeated single-field group named `array` or `<list>_tuple`:
     the group IS the element (legacy convention).
  4. Otherwise (single-field group, any other name): the inner field
     IS the element — strip the wrapper.
Also hoists isListElementWrapper out of the per-row loop and resolves
key/value field indices once for MAP entries. Documents that Parquet
does NOT guarantee MAP read order; users wanting a stable order
should use LIST<STRUCT<key, value>> instead.

Behavior matches Apache Arrow / parquet-cpp / parquet-avro
(with add-list-element-records=false) and the Parquet LogicalTypes
spec, so the same Parquet bytes produce the same logical rows across
readers.

Tests
-----
ParquetCollectionRecordReaderTest covers:
- Hand-authored Parquet schemas through both readers.
- Avro-schema-written Parquet files through both readers.
- A checked-in golden Parquet fixture
  (collection-reader-fixture.parquet) with primitive types,
  DECIMAL/DATE/TIMESTAMP logical types, nested structs, LIST and MAP
  of scalars, struct lists, empty collections, and a real struct
  field named `element`.
- Legacy LIST encodings (single-field non-`element` is flattened per
  spec rule 4; multi-field group is preserved per rule 2).
- Nullable list elements.
- Nested LIST<LIST<STRING>> through the Avro reader, including null
  inner element and null inner wrapper.
- A regression test for user-authored
  `array<record<UserTag, [element: string]>>` confirming the inner
  records survive untouched (case B).

Fixes #17420
@xiangfu0 xiangfu0 force-pushed the codex/parquet-list-element-reader-fix branch from a3d401a to 0638c31 Compare April 26, 2026 05:31
@xiangfu0 xiangfu0 merged commit 4499bf5 into master Apr 26, 2026
11 checks passed
@xiangfu0 xiangfu0 deleted the codex/parquet-list-element-reader-fix branch April 26, 2026 05:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backward-incompat Introduces a backward-incompatible API or behavior change enhancement Improvement to existing functionality extension-point Adds or modifies an extension/SPI point ingestion Related to data ingestion pipeline plugins Related to the plugin system

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ParquetNativeRecordReader and ParquetAvroRecordReader expand a json struct MV element as a MV element of maps

3 participants