benchmarks for writing REE arrays to parquet by Rich-T-kid · Pull Request #9936 · apache/arrow-rs

Rich-T-kid · 2026-05-07T03:43:01Z

Which issue does this PR close?

Closes #9935 (Add benchmarks for REE to parquet).

Rationale for this change

There is currently no way to tell which approach to writing REE columns to Parquet is more performant. This PR aims to solve that.

What changes are included in this PR?

Added a create_string_ree_bench_batch() function that builds record batches of REE data; it plugs into the existing benchmark structure.

For controlling the shape of the generated REE arrays, I currently have two constants, MIN_RUN and MAX_RUN, that bound the run length. The intent is to let benchmarks cover long uniform runs as well as shorter / more sparse data, rather than only one shape.
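To make the run-length bounds concrete, here is a minimal sketch of generating cumulative run ends whose run lengths fall between two constants. It is not the code from this PR: the constant values, the `Lcg` helper, and `gen_run_ends` are all hypothetical names chosen for illustration, and the sketch uses only the standard library rather than the arrow crate.

```rust
/// Hypothetical bounds on run length, mirroring the MIN_RUN / MAX_RUN idea.
/// The values here are illustrative, not the ones used in the PR.
const MIN_RUN: usize = 4;
const MAX_RUN: usize = 64;

/// Tiny deterministic generator (a 64-bit LCG) so the sketch needs no crates.
struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random value in the inclusive range `lo..=hi`.
    fn next_in(&mut self, lo: usize, hi: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let span = (hi - lo + 1) as u64;
        lo + ((self.0 >> 33) % span) as usize
    }
}

/// Returns cumulative run ends covering exactly `len` logical rows.
/// Every run has a length in MIN_RUN..=MAX_RUN, except the final run,
/// which may be shorter because it is clamped to `len`.
fn gen_run_ends(len: usize, seed: u64) -> Vec<i32> {
    let mut rng = Lcg(seed);
    let mut run_ends = Vec::new();
    let mut end = 0usize;
    while end < len {
        let run = rng.next_in(MIN_RUN, MAX_RUN);
        end = (end + run).min(len);
        run_ends.push(end as i32);
    }
    run_ends
}
```

Raising both constants would produce long uniform runs, while lowering them would produce shorter, more fragmented data, which is the range of shapes the benchmarks are meant to cover.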

An alternative would be a small params struct with defaults that callers can override. Happy to switch to that if it's preferred, but it would require changing other call sites.
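As a rough illustration of the params-struct alternative, the sketch below uses a `Default` impl plus struct update syntax so callers override only the fields they care about. `ReeBenchParams`, its fields, and the default values are hypothetical and do not appear in the PR.

```rust
/// Hypothetical parameters for REE benchmark data; names are illustrative.
#[derive(Debug, Clone, PartialEq)]
struct ReeBenchParams {
    /// Shortest run length to generate.
    min_run: usize,
    /// Longest run length to generate.
    max_run: usize,
}

impl Default for ReeBenchParams {
    fn default() -> Self {
        // Illustrative defaults only.
        Self { min_run: 4, max_run: 64 }
    }
}

/// Overrides a single field and takes the rest from the defaults.
fn short_runs() -> ReeBenchParams {
    ReeBenchParams { max_run: 8, ..Default::default() }
}

/// Overrides both bounds to model long uniform runs.
fn long_runs() -> ReeBenchParams {
    ReeBenchParams { min_run: 512, max_run: 1024 }
}
```

Existing call sites could pass `ReeBenchParams::default()`, which is the call-site change mentioned above.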

Are these changes tested?

yes

Are there any user-facing changes?

no

Rich-T-kid · 2026-05-14T16:55:52Z

        let mut file = Empty::default();
-        let mut writer =
-            ArrowWriter::try_new(&mut file, batch.schema(), Some(props.clone())).unwrap();
+        let Ok(mut writer) = ArrowWriter::try_new(&mut file, batch.schema(), Some(props.clone()))


This approach adds no overhead for the regular (non-REE) branches, since unwrap should boil down to the same machine code as this expression. The issue is that it doesn't allow errors with other causes to be propagated. This shouldn't be an issue for the existing benchmarks, since they all run fine. Is there another way to go about this?
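One possible answer to the question above, sketched under assumptions: match on the error and skip only the "unsupported" case, so any other failure still surfaces. `WriterError`, `try_new_writer`, and `writer_or_skip` are hypothetical stand-ins written for this example; they are not the parquet crate's error type or API.

```rust
/// Hypothetical error type; not the parquet crate's real error enum.
#[derive(Debug, PartialEq)]
enum WriterError {
    /// Stand-in for "this data type is not supported yet".
    Unsupported(String),
    /// Stand-in for any other failure.
    Other(String),
}

/// Hypothetical constructor standing in for a fallible writer setup.
fn try_new_writer(data_type: &str) -> Result<String, WriterError> {
    match data_type {
        "ree" => Err(WriterError::Unsupported(data_type.to_string())),
        "" => Err(WriterError::Other("empty schema".to_string())),
        other => Ok(format!("writer<{}>", other)),
    }
}

/// Returns Ok(Some(writer)) on success, Ok(None) when the type is
/// unsupported (the caller skips that benchmark), and Err for anything
/// else, so unrelated failures are not silently swallowed.
fn writer_or_skip(data_type: &str) -> Result<Option<String>, WriterError> {
    match try_new_writer(data_type) {
        Ok(writer) => Ok(Some(writer)),
        Err(WriterError::Unsupported(_)) => Ok(None),
        Err(other) => Err(other),
    }
}
```

Compared with a bare `let Ok(..) = .. else`, this keeps the skip narrow at the cost of one extra match arm on the error path.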

I don't follow; why do we need these changes in the first place?

This PR is to introduce benchmarks for #8016, so that we can gauge the performance of changes as they happen. I thought it'd make sense to make it a separate PR since it can be isolated from the write logic and would be easier to review.

Generally it's easier to have benchmarks merged to main since it lets us use the bot to run the comparison; given we don't have a baseline anyway since it's not really supported, maybe we can just comment out the lines with the REE benchmarks in this PR, e.g.

    // let batch = create_ree_bench_batch(DataType::Utf8, BATCH_SIZE, 0.25, 0.75).unwrap();
    // batches.push(("string_ree", batch));
    // let batch = create_ree_bench_batch(DataType::Int32, BATCH_SIZE, 0.25, 0.75).unwrap();
    // batches.push(("int32_ree", batch));

That way we don't need this handling for potentially unsupported datatypes in the benchmarks, which can be confusing.

We can still have the benchmark code ready, and the main PR later won't get bogged down (other than uncommenting the lines).

This makes sense to me. I commented out the benchmarks and included a lint bypass to pass CI, with a comment linking this discussion.
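The thread does not say which lint was bypassed. A minimal sketch, assuming it is `dead_code` firing on a helper whose only call sites are commented out, might look like this; `create_ree_bench_values` and `bench_batch_names` are hypothetical names used only for illustration.

```rust
// Assumption: the helper is unused only because its call sites are
// commented out until REE writing is supported, so the dead_code lint
// is silenced here instead of deleting the helper.
#[allow(dead_code)]
fn create_ree_bench_values(len: usize) -> Vec<i32> {
    (0..len as i32).collect()
}

/// Names of the benchmark batches that are currently enabled.
fn bench_batch_names() -> Vec<&'static str> {
    let names = vec!["primitive", "string"];
    // Disabled for now; uncomment (and make `names` mutable) once the
    // writer supports REE:
    // names.push("string_ree");
    names
}
```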

Rich-T-kid · 2026-05-22T23:04:46Z

@Jefffrey / @brancz when you get a chance, could you take a look at this?

Jefffrey · 2026-05-23T03:50:39Z

I don't follow; why do we need these changes in the first place?

Rich-T-kid · 2026-05-24T15:51:09Z

pushed up a revision 🚀

draft of benchmarks

c449cea

Rich-T-kid changed the title ~~draft of benchmarks~~ benchmarks for writing REE arrays to parquet May 14, 2026

Rich-T-kid force-pushed the rich-T-kid/REE-to-Parquet-BenchMarks branch from b59c826 to 3179d31 Compare May 14, 2026 16:48

github-actions Bot added the parquet and arrow labels May 14, 2026

Rich-T-kid commented May 14, 2026

Comment thread arrow/src/util/data_gen.rs

introduce benchmarks for REE -> Parquet. added match statment arrow_writer to allow for ree to be included without erroring out

801e4a5

Rich-T-kid force-pushed the rich-T-kid/REE-to-Parquet-BenchMarks branch from 3179d31 to 801e4a5 Compare May 14, 2026 16:52

Rich-T-kid commented May 14, 2026

Rich-T-kid marked this pull request as ready for review May 14, 2026 16:56

Rich-T-kid mentioned this pull request May 14, 2026

Support converting RunEndEncodedType to parquet #8016

Open

Rich-T-kid commented May 14, 2026

Comment thread parquet/benches/arrow_writer.rs Outdated

includes other datatypes for ree bench & removes un-needed comment

5272752

Jefffrey reviewed May 23, 2026

Revised PR comments

0be749f

comment out benches and include comment explaining lint bypass

badf2d2

Jefffrey approved these changes May 25, 2026

Rich-T-kid wants to merge 5 commits into apache:main from Rich-T-kid:rich-T-kid/REE-to-Parquet-BenchMarks
