Commit 784c483
Fix Parquet LIST/MAP wrapper extraction in native and Avro readers
The native (ParquetNativeRecordReader) and Avro-backed
(ParquetAvroRecordReader) Parquet readers were leaking the LIST/MAP
wrapper structs into Pinot rows, so a column of array<string> came
back as [{"element":"abc"}, {"element":"xyz"}] instead of
["abc", "xyz"], and a map<string,string> came back as
{"key_value": [{"key":"k","value":"v"}]} instead of {"k":"v"}. The
broken shape propagated into segment generation, multi-value scans,
and JSON paths.
Approach: align with Apache Arrow / Parquet LogicalTypes spec
-------------------------------------------------------------
Both readers now follow Apache Arrow's Parquet behavior: wrapper
detection is driven by the Parquet LogicalType annotations and the
spec backward-compat rules, never by guessing from value shape or
field names.
Avro reader: set `parquet.avro.add-list-element-records=false` in the
Hadoop config so parquet-avro flattens the standard 3-level LIST
encoding directly to Avro `array<elem-type>`. With the flag off, there
is no LIST wrapper to strip on the Pinot side. User-defined records
like `array<record<UserTag, [element]>>` round-trip cleanly because
the file's Avro schema (when present in metadata) is honored as-is,
and hand-authored Parquet `LIST<T>` surfaces as flat values without
the wrapper artifact. The extractor reduces to plain delegation plus
the existing INT96 promotion.
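A minimal sketch of the flag setting, using a plain `Map` as a stand-in for the Hadoop config so the snippet stays dependency-free. The class and method names (`AvroListConfigSketch`, `withFlatLists`) are hypothetical and not part of this commit; only the property name comes from the message above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; a Map stands in for the Hadoop config object. */
final class AvroListConfigSketch {
  /** Property name quoted in the commit message. */
  static final String ADD_LIST_ELEMENT_RECORDS = "parquet.avro.add-list-element-records";

  private AvroListConfigSketch() {
  }

  /** Returns a copy of the given settings with the list-element-record flag turned off. */
  static Map<String, String> withFlatLists(Map<String, String> base) {
    Map<String, String> conf = new HashMap<>(base);
    // With Hadoop on the classpath, the equivalent call would be
    // Configuration#setBoolean(ADD_LIST_ELEMENT_RECORDS, false).
    conf.put(ADD_LIST_ELEMENT_RECORDS, Boolean.toString(false));
    return conf;
  }
}
```

Copying the input rather than mutating it is a choice made for this sketch only; the actual reader may well set the flag on its config object in place.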
Native reader (ParquetNativeRecordExtractor): apply the Parquet
LogicalTypes backward-compat rules in extractList:
1. Repeated primitive: the primitive IS the element (no wrapper).
2. Repeated multi-field group: the group IS the element.
3. Repeated single-field group named `array` or `<list>_tuple`:
   the group IS the element (legacy convention).
4. Otherwise (single-field group, any other name): the inner field
   IS the element, so strip the wrapper.
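The four rules can be sketched as a single predicate. This is an illustration under stated assumptions, not the code in this commit: `SchemaNode` is a hypothetical stand-in for a Parquet schema node (it is not the parquet-java `Type` API), and the signature of `isListElementWrapper` is assumed; only the method name appears in the message.

```java
import java.util.List;

/** Hypothetical stand-in for a Parquet schema node; not the parquet-java Type API. */
final class SchemaNode {
  final String name;
  final boolean primitive;
  final List<SchemaNode> fields; // empty for primitives

  private SchemaNode(String name, boolean primitive, List<SchemaNode> fields) {
    this.name = name;
    this.primitive = primitive;
    this.fields = fields;
  }

  static SchemaNode primitive(String name) {
    return new SchemaNode(name, true, List.of());
  }

  static SchemaNode group(String name, SchemaNode... fields) {
    return new SchemaNode(name, false, List.of(fields));
  }
}

final class ListWrapperRules {
  private ListWrapperRules() {
  }

  /**
   * Returns true when the repeated field only wraps the element (rule 4),
   * and false when the repeated field is itself the element (rules 1 to 3).
   *
   * @param listName name of the enclosing LIST-annotated field
   * @param repeated the repeated child of that field
   */
  static boolean isListElementWrapper(String listName, SchemaNode repeated) {
    if (repeated.primitive) {
      return false; // rule 1: repeated primitive is the element
    }
    if (repeated.fields.size() != 1) {
      // rule 2: multi-field group is the element
      // (assumption: an empty group is treated the same way here)
      return false;
    }
    if (repeated.name.equals("array") || repeated.name.equals(listName + "_tuple")) {
      return false; // rule 3: legacy naming convention
    }
    return true; // rule 4: single-field group with any other name is a wrapper
  }
}
```

Because the answer depends only on the schema, it can be computed once per column rather than once per row, which is consistent with the hoisting described in the message.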
Also hoists isListElementWrapper out of the per-row loop and resolves
key/value field indices once for MAP entries. Documents that Parquet
does NOT guarantee MAP read order; users wanting a stable order
should use LIST<STRUCT<key, value>> instead.
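A rough sketch of the MAP side, assuming each entry arrives as an `Object[]` laid out in schema field order. `MapEntryFlattener` and its signature are hypothetical, and the key/value positions are looked up by name purely for illustration; the commit may resolve them differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that turns repeated key/value entries into a flat map. */
final class MapEntryFlattener {
  private MapEntryFlattener() {
  }

  /**
   * @param entryFieldNames field names of the repeated entry group, in schema order
   * @param entries         one Object[] per entry, laid out in the same order
   */
  static Map<Object, Object> flatten(List<String> entryFieldNames, List<Object[]> entries) {
    // Resolve both positions once, before the loop, instead of once per entry.
    int keyIndex = entryFieldNames.indexOf("key");
    int valueIndex = entryFieldNames.indexOf("value");
    if (keyIndex < 0 || valueIndex < 0) {
      throw new IllegalArgumentException("entry group needs 'key' and 'value' fields");
    }
    // LinkedHashMap keeps the order in which entries arrived; that is still
    // not an order Parquet guarantees.
    Map<Object, Object> result = new LinkedHashMap<>();
    for (Object[] entry : entries) {
      result.put(entry[keyIndex], entry[valueIndex]);
    }
    return result;
  }
}
```

With this shape, the entry `{"key":"k","value":"v"}` from the description above becomes the single mapping `k` to `v`.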
Behavior matches Apache Arrow / parquet-cpp / parquet-avro
(with add-list-element-records=false) and the Parquet LogicalTypes
spec, so the same Parquet bytes produce the same logical rows across
readers.
Tests
-----
ParquetCollectionRecordReaderTest covers:
- Hand-authored Parquet schemas through both readers.
- Avro-schema-written Parquet files through both readers.
- A checked-in golden Parquet fixture
(collection-reader-fixture.parquet) with primitive types,
DECIMAL/DATE/TIMESTAMP logical types, nested structs, LIST and MAP
of scalars, struct lists, empty collections, and a real struct
field named `element`.
- Legacy LIST encodings (single-field non-`element` is flattened per
spec rule 4; multi-field group is preserved per rule 2).
- Nullable list elements.
- Nested LIST<LIST<STRING>> through the Avro reader, including a null
inner element and a null inner wrapper.
- A regression test for user-authored
`array<record<UserTag, [element: string]>>` confirming the inner
records survive untouched (case B).
Fixes #174201
Parent 068907e, commit 784c483
6 files changed
Lines changed: 851 additions & 15 deletions
File tree
- pinot-plugins/pinot-input-format/pinot-parquet/src
- main/java/org/apache/pinot/plugin/inputformat/parquet
- test
- java/org/apache/pinot/plugin/inputformat/parquet
- resources
Lines changed: 10 additions & 1 deletion
Lines changed: 92 additions & 14 deletions
Lines changed: 7 additions & 0 deletions