Fix Parquet LIST/MAP wrapper extraction in native and Avro readers#18325
Merged
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18325 +/- ##
============================================
- Coverage 63.61% 63.39% -0.22%
- Complexity 1659 1668 +9
============================================
Files 3246 3252 +6
Lines 197549 198661 +1112
Branches 30577 30770 +193
============================================
+ Hits 125662 125950 +288
- Misses 61847 62642 +795
- Partials 10040 10069 +29
xiangfu0
commented
Apr 25, 2026
Contributor
Author
xiangfu0
left a comment
Found one high-signal correctness issue; see inline comment.
Jackie-Jiang
approved these changes
Apr 26, 2026
The native (ParquetNativeRecordReader) and Avro-backed
(ParquetAvroRecordReader) Parquet readers were leaking the LIST/MAP
wrapper structs into Pinot rows, so a column of array<string> came
back as [{"element":"abc"}, {"element":"xyz"}] instead of
["abc", "xyz"], and a map<string,string> came back as
{"key_value": [{"key":"k","value":"v"}]} instead of {"k":"v"}. The
broken shape propagated into segment generation, multi-value scans,
and JSON paths.
Approach: align with Apache Arrow / Parquet LogicalTypes spec
-------------------------------------------------------------
Both readers now follow Apache Arrow's Parquet behavior — wrapper
detection is driven by the Parquet LogicalType annotations and the
spec backward-compat rules, never by guessing from value shape or
field names.
Avro reader: set `parquet.avro.add-list-element-records=false` in the
Hadoop config so parquet-avro flattens the standard 3-level LIST
encoding directly to Avro `array<elem-type>`. With this off, there is
no LIST wrapper to strip on the Pinot side — user-defined records
like `array<record<UserTag, [element]>>` round-trip cleanly because
the file's Avro schema (when present in metadata) is honored as-is,
and hand-authored Parquet `LIST<T>` surfaces as flat values without
the wrapper artifact. The extractor reduces to plain delegation plus
the existing INT96 promotion.
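
The INT96 promotion mentioned above can be pictured with a minimal sketch. It assumes the conventional Impala/Hive INT96 layout (8 little-endian bytes of nanoseconds within the day, then 4 little-endian bytes of Julian day) and assumes epoch milliseconds as the target `long`; the commit message does not spell out the conversion. The class `Int96Sketch` and both of its methods are hypothetical names, not Pinot APIs.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper illustrating INT96 -> long promotion; not Pinot code. */
final class Int96Sketch {
  private static final long JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588L; // 1970-01-01
  private static final long MILLIS_PER_DAY = 86_400_000L;
  private static final long NANOS_PER_MILLI = 1_000_000L;

  private Int96Sketch() {
  }

  /** Decodes a 12-byte INT96 timestamp into epoch milliseconds (assumed target unit). */
  static long toEpochMillis(byte[] int96) {
    if (int96.length != 12) {
      throw new IllegalArgumentException("INT96 values are exactly 12 bytes");
    }
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong(); // first 8 bytes
    long julianDay = buf.getInt();   // last 4 bytes
    return (julianDay - JULIAN_DAY_OF_UNIX_EPOCH) * MILLIS_PER_DAY + nanosOfDay / NANOS_PER_MILLI;
  }

  /** Builds an INT96 value in the same layout, for demonstration only. */
  static byte[] encode(int julianDay, long nanosOfDay) {
    return ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
        .putLong(nanosOfDay).putInt(julianDay).array();
  }
}
```

For example, `Int96Sketch.toEpochMillis(Int96Sketch.encode(2440588, 0L))` yields `0`, since Julian day 2440588 is the Unix epoch.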
Native reader (ParquetNativeRecordExtractor): apply the Parquet
LogicalTypes backward-compat rules in extractList:
1. Repeated primitive: the primitive IS the element (no wrapper).
2. Repeated multi-field group: the group IS the element.
3. Repeated single-field group named `array` or `<list>_tuple`:
the group IS the element (legacy convention).
4. Otherwise (single-field group, any other name): the inner field
IS the element — strip the wrapper.
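
The four rules above can be sketched as a single schema-only predicate. This is a rough illustration, not the code in `ParquetNativeRecordExtractor`: `SchemaNode` is a hypothetical stand-in for a Parquet schema type, and `isElementWrapper` is an illustrative name that only mirrors the rule list as written here.

```java
import java.util.List;

/** Hypothetical, simplified schema node; not parquet-mr's Type API. */
final class SchemaNode {
  final String name;
  final boolean primitive;
  final List<SchemaNode> fields; // empty for primitives

  private SchemaNode(String name, boolean primitive, List<SchemaNode> fields) {
    this.name = name;
    this.primitive = primitive;
    this.fields = fields;
  }

  static SchemaNode primitive(String name) {
    return new SchemaNode(name, true, List.of());
  }

  static SchemaNode group(String name, SchemaNode... fields) {
    return new SchemaNode(name, false, List.of(fields));
  }
}

final class ListWrapperRules {
  private ListWrapperRules() {
  }

  /**
   * Returns true when the repeated child of a LIST-annotated group is a
   * synthetic wrapper whose single inner field is the real element (rule 4).
   * Returns false when the repeated child itself is the element (rules 1-3).
   */
  static boolean isElementWrapper(String listName, SchemaNode repeated) {
    if (repeated.primitive) {
      return false; // rule 1: repeated primitive is the element
    }
    if (repeated.fields.size() != 1) {
      return false; // rule 2: multi-field group is the element
    }
    if (repeated.name.equals("array") || repeated.name.equals(listName + "_tuple")) {
      return false; // rule 3: legacy naming convention, group is the element
    }
    return true; // rule 4: strip the wrapper, the inner field is the element
  }
}
```

Because the decision depends only on the schema, it can be computed once per column rather than once per row or per element, and a real struct field that happens to be named `element` is never inspected by value.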
Also hoists isListElementWrapper out of the per-row loop and resolves
key/value field indices once for MAP entries. Documents that Parquet
does NOT guarantee MAP read order; users wanting a stable order
should use LIST<STRUCT<key, value>> instead.
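
A minimal sketch of the "resolve indices once" idea for MAP entries follows. It assumes each entry arrives as a positional `Object[]` and that the entry group's field names are known from the schema; looking the fields up by the names `key` and `value` is an assumption of this sketch, and `MapEntrySketch.toMap` is a hypothetical name rather than the actual `extractKeyValueMap` signature.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of building a map from key_value entries. */
final class MapEntrySketch {
  private MapEntrySketch() {
  }

  static Map<Object, Object> toMap(List<String> entryFieldNames, List<Object[]> entries) {
    // Resolve positions from the schema once, outside the per-entry loop.
    int keyIdx = entryFieldNames.indexOf("key");
    int valueIdx = entryFieldNames.indexOf("value");
    if (keyIdx < 0) {
      throw new IllegalArgumentException("entry group has no 'key' field");
    }
    // LinkedHashMap keeps the order in which entries were read; that read
    // order is itself not guaranteed by Parquet.
    Map<Object, Object> result = new LinkedHashMap<>();
    for (Object[] entry : entries) {
      Object value = valueIdx < 0 ? null : entry[valueIdx]; // value field may be absent
      result.put(entry[keyIdx], value); // on duplicate keys the last entry wins
    }
    return result;
  }
}
```

Callers that compare results with `Map.equals` are unaffected by entry order, which fits the caveat that read order is unspecified.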
Behavior matches Apache Arrow / parquet-cpp / parquet-avro
(with add-list-element-records=false) and the Parquet LogicalTypes
spec, so the same Parquet bytes produce the same logical rows across
readers.
Tests
-----
ParquetCollectionRecordReaderTest covers:
- Hand-authored Parquet schemas through both readers.
- Avro-schema-written Parquet files through both readers.
- A checked-in golden Parquet fixture
(collection-reader-fixture.parquet) with primitive types,
DECIMAL/DATE/TIMESTAMP logical types, nested structs, LIST and MAP
of scalars, struct lists, empty collections, and a real struct
field named `element`.
- Legacy LIST encodings (single-field non-`element` is flattened per
spec rule 4; multi-field group is preserved per rule 2).
- Nullable list elements.
- Nested LIST<LIST<STRING>> through the Avro reader, including null
inner element and null inner wrapper.
- A regression test for user-authored
`array<record<UserTag, [element: string]>>` confirming the inner
records survive untouched (case B).
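
To make the nested and nullable cases above concrete, here is a rough sketch of schema-driven unwrapping on a toy model. It assumes raw LIST values arrive in the leaked shape (a list of `{"element": v}` wrappers from the standard 3-level encoding) and handles only that encoding; `CollectionShapeSketch`, `Kind`, `Type`, and `extract` are hypothetical names and do not appear in the PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model: unwrap LIST wrappers based on the schema, never on value shape. */
final class CollectionShapeSketch {
  enum Kind { PRIMITIVE, STRUCT, LIST }

  static final class Type {
    final Kind kind;
    final Type element;             // set for LIST only
    final Map<String, Type> fields; // set for STRUCT only

    private Type(Kind kind, Type element, Map<String, Type> fields) {
      this.kind = kind;
      this.element = element;
      this.fields = fields;
    }

    static Type primitive() {
      return new Type(Kind.PRIMITIVE, null, Map.of());
    }

    static Type list(Type element) {
      return new Type(Kind.LIST, element, Map.of());
    }

    static Type struct(Map<String, Type> fields) {
      return new Type(Kind.STRUCT, null, fields);
    }
  }

  private CollectionShapeSketch() {
  }

  static Object extract(Type type, Object raw) {
    if (raw == null) {
      return null; // null list, null struct, or null leaf
    }
    switch (type.kind) {
      case LIST: {
        List<Object> out = new ArrayList<>();
        for (Object wrapper : (List<?>) raw) {
          // A null wrapper and a wrapper holding null both become a null element.
          Object inner = wrapper == null ? null : ((Map<?, ?>) wrapper).get("element");
          out.add(extract(type.element, inner)); // recurse for LIST<LIST<...>>
        }
        return out;
      }
      case STRUCT: {
        // Struct fields are copied by name, so a real field called "element" survives.
        Map<?, ?> rawStruct = (Map<?, ?>) raw;
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Type> field : type.fields.entrySet()) {
          out.put(field.getKey(), extract(field.getValue(), rawStruct.get(field.getKey())));
        }
        return out;
      }
      default:
        return raw;
    }
  }
}
```

With a schema of `struct{element: primitive, tags: list<primitive>}`, the raw value `{"element": "real-element-field", "tags": [{"element": "abc"}, {"element": "xyz"}]}` comes out as `{"element": "real-element-field", "tags": ["abc", "xyz"]}`: the struct field is untouched and only the LIST column is flattened.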
Fixes #17420
Problem
The native (`ParquetNativeRecordReader`) and Avro-backed (`ParquetAvroRecordReader`) Parquet readers were leaking the LIST/MAP wrapper structs into Pinot rows. A column of `array<string>` came back as `[{"element":"abc"}, {"element":"xyz"}]` instead of `["abc", "xyz"]`, and a `map<string,string>` came back as `{"key_value": [{"key":"k","value":"v"}]}` instead of `{"k":"v"}`. The broken shape propagated into segment generation, multi-value scans, and JSON paths.

Fixes #17420.

Before
    {
      "topLevelTags": [{"element": "top-a"}, {"element": "top-b"}],
      "topLevelProperties": {"key_value": [{"key": "k-a", "value": "v-a"}, {"key": "k-b", "value": "v-b"}]},
      "metadata": {
        "element": "real-element-field",
        "tags": [{"element": "abc"}, {"element": "xyz"}]
      }
    }

After
    {
      "topLevelTags": ["top-a", "top-b"],
      "topLevelProperties": {"k-a": "v-a", "k-b": "v-b"},
      "metadata": {
        "element": "real-element-field",
        "tags": ["abc", "xyz"]
      }
    }

Fix
Drive wrapper detection from the file schema, not from the extracted row data.
- `ParquetNativeRecordExtractor`: use Parquet `LogicalTypeAnnotation` to tell LIST and MAP groups apart from plain structs.
  - `extractList`: hoist the per-row `isListElementWrapper` check out of the loop and dispatch the whole list down a single branch. Handles the standard 3-level encoding and legacy 2-level encodings (repeated primitive, repeated multi-field group, repeated single-field group not named `element`).
  - `extractKeyValueMap`: resolve the `key`/`value` field indices once from the schema and reuse them for every entry.
- `ParquetAvroRecordExtractor`: check the Avro field schema first, then recurse via `transformValue` on the inner element field so nested `LIST<LIST<...>>` wrappers and `handleDeprecatedTypes` (e.g. INT96 → long) keep applying at every level.

Real struct fields named `element` are preserved; only fields the external schema actually identifies as LIST/MAP wrappers are normalized.

Note on MAP order
Parquet does not guarantee MAP entries are returned in any particular order on read (neither sorted nor insertion order). Writers, page boundaries, and dictionary encodings can all reorder entries. If you need a stable order, use `LIST<STRUCT<key, value>>` instead of the native MAP logical type. This is documented inline on `extractKeyValueMap`.

Tests
`ParquetCollectionRecordReaderTest` covers:
- A checked-in golden Parquet fixture (`collection-reader-fixture.parquet`) with primitive types, `DECIMAL`/`DATE`/`TIMESTAMP` logical types, nested structs, LIST and MAP of scalars, struct lists, empty collections, and a real struct field named `element`.
- Legacy LIST encodings (single-field non-`element`, multi-field, single-field-named-`element` wrapping a struct).
- Nested `LIST<LIST<STRING>>` through the Avro reader, including null inner element and null inner wrapper.

Validation
- `./mvnw -pl pinot-plugins/pinot-input-format/pinot-parquet test`: 19/19 pass.
- `./mvnw spotless:apply checkstyle:check license:check -pl pinot-plugins/pinot-input-format/pinot-parquet`: clean.