benchmarks for writing REE arrays to parquet#9936
Conversation
b59c826 to
3179d31
Compare
…riter to allow for ree to be included without erroring out
3179d31 to
801e4a5
Compare
| let mut file = Empty::default(); | ||
| let mut writer = | ||
| ArrowWriter::try_new(&mut file, batch.schema(), Some(props.clone())).unwrap(); | ||
| let Ok(mut writer) = ArrowWriter::try_new(&mut file, batch.schema(), Some(props.clone())) |
There was a problem hiding this comment.
This approach adds no overhead for the regular (non-ree) branches since unwrap should boil down to the same machine code as this expression. the issue is this doesn't allow for errors to be propagated for other reasons. this shouldnt be an issue for existing benchmarks since they all run fine with no issues. is there another way to go about this?
There was a problem hiding this comment.
I don't follow; why do we need these changes in the first place?
There was a problem hiding this comment.
This PR is to introduce benchmarks for #8016 , so that we can gauge the performance of changes as they happen. I thought I'd make sense to make it a separate PR since it can be isolated from the write logic & would be easier to review.
There was a problem hiding this comment.
Generally its easier to have benchmarks merged to main since it lets us use the bot to run the comparison; given we don't have a baseline anyway since its not really supported, maybe we can just comment out the lines with the ree benchmarks in this PR, e.g.
// let batch = create_ree_bench_batch(DataType::Utf8, BATCH_SIZE, 0.25, 0.75).unwrap();
// batches.push(("string_ree", batch));
// let batch = create_ree_bench_batch(DataType::Int32, BATCH_SIZE, 0.25, 0.75).unwrap();
// batches.push(("int32_ree", batch));So we don't need to have this handling for potentially unsupported datatypes in the benchmarks which can be confusing.
This way we can still have this benchmark code ready and the main PR later won't get bogged down (other than uncommenting the lines)
There was a problem hiding this comment.
this makes sense to me, commented out the benchmarks, and included a lint bypass to pass CI , with a comment linking this discussion.
| let mut file = Empty::default(); | ||
| let mut writer = | ||
| ArrowWriter::try_new(&mut file, batch.schema(), Some(props.clone())).unwrap(); | ||
| let Ok(mut writer) = ArrowWriter::try_new(&mut file, batch.schema(), Some(props.clone())) |
There was a problem hiding this comment.
I don't follow; why do we need these changes in the first place?
|
pushed up a revision 🚀 |
Which issue does this PR close?
Rationale for this change
there is no way to currently tell which approach to writing out REE columns to parquet is more performant. This PR aims to solve that.
What changes are included in this PR?
Added a
create_string_ree_bench_batch()function that builds record batches of REE data — it plugs into the existing benchmark structure.For controlling the shape of the generated REE arrays, I currently have two constants,
MIN_RUNandMAX_RUN, that bound the run length. The intent is to let benchmarks cover long uniform runs as well as shorter / more sparse data, rather than only one shape.An alternative would be a small params struct with defaults that callers can override — happy to switch to that if it's preferred, but that would require changing other callsites
Are these changes tested?
yes
Are there any user-facing changes?
no