Releases: apache/datafusion-comet
0.16.0
DataFusion Comet 0.16.0 Changelog
This release consists of 127 commits from 17 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: report task output metrics in Spark UI #3999 (0lai0)
- fix: cast to and from timestamp_ntz #4008 (parthchandra)
- fix: support to_json on Spark 4.0 #4036 (andygrove)
- fix: enable arrays_overlap #3901 (kazuyukitanimura)
- fix: Iceberg reflection for current() on TableOperations hierarchy #3895 (karuppayya)
- fix: fall back to Spark for shuffle/sort/aggregate on non-default collated strings [Spark 4] #4035 (andygrove)
- fix: scalar subquery pushdown and reuse for CometNativeScanExec (SPARK-43402) #4053 (mbutrovich)
- fix: fall back for shredded Variant scans on Spark 4.0 #4084 (andygrove)
- fix: enable Spark 4 SQL tests previously ignored for issues #3313 and #3314 #4092 (andygrove)
- fix: fall back to Spark for hash join and sort-merge join on non-default collated string keys [Spark 4] #4095 (0lai0)
- fix: reject string/binary read as numeric in native_datafusion scan #4091 (andygrove)
- fix: reject incompatible decimal precision/scale in native_datafusion scan #4090 (andygrove)
- fix: throw SchemaColumnConvertNotSupportedException from native_datafusion schema mismatch #4117 (andygrove)
- fix: substring with negative start index #4017 (kazuyukitanimura)
- fix: honor strictFloatingPoint in RangePartitioning #4167 (0lai0)
- fix: [Spark 4.1.1] preserve stored allowDecimalPrecisionLoss in DecimalPrecision rule #4179 (andygrove)
- fix: [Spark 4.1.1] preserve parent struct nullness when all requested fields missing in Parquet #4190 (andygrove)
- fix: support Spark 4.1 BloomFilter V2 format and bit-scattering #4196 (andygrove)
- fix: JNI local reference cleanup in JVMClasses::with_env #4225 (0lai0)
- fix: broadcast exchange bypasses AQE partition coalescing #4163 (andygrove)
- fix: resolve Scala compiler warnings for auto-tupling and bare try #4227 (andygrove)
- fix: [Spark 4.1] preserve union output partitioning in CometUnionExec #4207 (andygrove)
- fix: re-enable tests skipped for Spark 4.1 (issue #4098) #4253 (andygrove)
- fix: cargo clean before release build to avoid stale native libs #4257 (andygrove)
Performance related:
- perf: avoid redundant columnar shuffle when both parent and child are non-Comet #4010 (andygrove)
- perf: reduce per-node allocations in to_native_metric_node #4075 (andygrove)
Implemented enhancements:
- feat: enable native Iceberg reader by default #3819 (andygrove)
- feat: support `collect_set` #3954 (comphead)
- feat: non-AQE DPP for native Parquet scans, broadcast exchange reuse for DPP subqueries #4011 (mbutrovich)
- feat: add support for array_position expression #3172 (andygrove)
- feat: Cast string to timestamp_ntz #4034 (parthchandra)
- feat: Add TimestampNTZType support for unix_timestamp #4039 (parthchandra)
- feat: fix array_compact for Spark 4.0 and correct return type metadata #3796 (andygrove)
- feat: task-level input metrics (bytesRead) for Iceberg native scan #4128 (mbutrovich)
- feat: add MapSort expression support for Spark 4.0 #4076 (andygrove)
- feat: Support Spark expression `str_to_map` #3654 (unknowntpo)
- feat: add support for timestamp_seconds expression #3146 (andygrove)
- feat: add config to gate converting Spark shuffle to Comet shuffle when child is non-Comet plan #4166 (andygrove)
- feat: AQE DPP for native Parquet scans with broadcast reuse #4112 (mbutrovich)
- feat: support regular BuildRight+LeftAnti hash join #4073 (viirya)
- feat: add bug-triage Claude skill #4109 (andygrove)
- feat: support `PartialMerge` aggregation mode #4003 (comphead)
- feat: add encode time tracking for shuffle operations #4068 (0lai0)
- feat: Add support for Spark ToDegrees and ToRadians math expressions #3786 (rafafrdz)
- feat: Add support for Spark Acosh, Asinh, Atanh math expressions #3787 (rafafrdz)
- feat: Add support for Spark Cbrt math expression #3788 (rafafrdz)
- feat: Add support for Spark Pi math expression #3789 (rafafrdz)
- feat: support Parquet field ID matching in native_datafusion scan #4216 (mbutrovich)
- feat: support AQE DPP broadcast reuse for Iceberg native scans #4215 (mbutrovich)
- feat: add support for url_encode, url_decode, and try_url_decode #4231 (parthchandra)
- feat: support TimestampType join keys in SortMergeJoin #3986 (andygrove)
Documentation updates:
- docs: Add changelog for 0.15.0 #4000 (andygrove)
- docs: Update README and benchmark results for 0.15.0 release #3995 (andygrove)
- docs: fix errors in benchmark pages #4001 (andygrove)
- docs: split compatibility guide into multiple pages #4055 (andygrove)
- docs: Generate expression compatibility docs from code #4057 (andygrove)
- doc: update documentation for cast and datetime functions #4058 (parthchandra)
- docs: add compatibility documentation to all expressions #4067 (andygrove)
- docs: rename SQL File Tests to Comet SQL Tests #4108 (andygrove)
- docs: add Understanding Comet Plans user guide page #4086 (andygrove)
- docs: support conditional content for snapshot vs release builds #4030 (andygrove)
- docs: update Spark version support and add version compatibility page #4138 (andygrove)
- docs: improve review skill and contributor guide for serde patterns #4132 (andygrove)
- docs: Fix errors in list of supported Spark versions #4141 (andygrove)
- docs: Update roadmap in contributor guide #4144 (andygrove)
- docs: add implement-comet-expression Claude skill #4158 (andygrove)
- docs: add roadmap items for spillable hash join, UDF support, memory management, and 1.0.0 #4171 (andygrove)
- docs: start Spark 4.1 known-limitations section, seeded with #4199 #4202 (andygrove)
- docs: document Spark 4 IntelliJ setup #4198 (yuboxx)
- docs: refresh Gluten comparison with ANSI, Spark 4, and Iceberg coverage #4169 (andygrove)
- docs: check off 53 implemented expressions in support doc #4147 (andygrove)
- docs: replace p...
0.15.0
DataFusion Comet 0.15.0 Changelog
This release consists of 142 commits from 19 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: enable native_datafusion Spark SQL tests previously ignored in #3315 #3696 (andygrove)
- fix: route file-not-found errors through SparkError JSON path #3699 (andygrove)
- fix: fall back from native_datafusion for duplicate fields in case-insensitive mode #3687 (andygrove)
- fix: enable more Spark SQL tests for `native_datafusion` (DynamicPartitionPruningSuite/ExplainSuite) #3694 (andygrove)
- fix: Correct GetArrayItem null handling for dynamic indices and re-enable native execution #3709 (0lai0)
- fix: enable native_datafusion Spark SQL tests for #3320, #3401, #3719 #3718 (andygrove)
- fix: Native engine crashes on literal DateTrunc and TimestampTrunc #3668 (0lai0)
- fix: Use the loaded Comet extension too (Spark 3.5.8) #3707 (martin-g)
- fix: Use thread context classloader for Iceberg class loading #3738 (karuppayya)
- fix: disable ANSI mode in benchmarks to avoid exceptions on invalid input #3750 (parthchandra)
- fix: fix string to timestamp cast for UTC timestamps #3656 (parthchandra)
- fix: native error message not propagated to SparkException on empty errorClass #3727 (manuzhang)
- fix: add timezone and special formats support for cast string to timestamp #3730 (parthchandra)
- fix: handle inf/-inf/nan in ShimSparkErrorConverter cast overflow #3768 (manuzhang)
- fix: handle scalar decimal value overflow correctly in ANSI mode #3803 (parthchandra)
- fix: correct array_append return type and mark as Compatible #3795 (andygrove)
- fix: remove broken directBuffer feature for parquet reads #3814 (andygrove)
- fix: remove unnecessary IgnoreCometNativeDataFusion tags from 3.5.8 diff #3831 (andygrove)
- fix: query tolerance= in SQL file tests now also asserts Comet native execution #3797 (andygrove)
- fix: include scan impl in PR Linux artifact names #3853 (manuzhang)
- fix: correct invalid Option.contains assertion in cast test #3851 (manuzhang)
- fix: native_datafusion: case-insensitive mode doesn't detect duplicate/ambiguous Parquet fields #3808 (vaibhawvipul)
- fix: cache object stores and bucket regions to reduce DNS query volume #3802 (andygrove)
- fix: skip Comet columnar shuffle for stages with DPP scans #3879 (andygrove)
- fix: Native_datafusion reports correct files and bytes scanned #3798 (0lai0)
- fix: address clippy collapsible_match warnings #3863 (manuzhang)
- fix: parameterize file count in Native_datafusion metrics test #3896 (0lai0)
- fix: Make cast string to timestamp compatible with Spark #3884 (parthchandra)
- fix: add EmptySchemaShufflePartitioner and test from #3858 #3893 (mbutrovich)
- fix: use min instead of max when capping write buffer size to Int range #3914 (andygrove)
- fix: Update TPC-DS q36a golden file for Spark 4.0 decimal UNION widening change #3915 (parthchandra)
- fix: audit array_insert expression for correctness and test coverage #3890 (andygrove)
- fix: handle ambiguous and non-existent local times #3865 (matthewalex4)
- fix: improve tracing feature #3688 (andygrove)
- fix: make tan and atan2 compatible #3849 (kazuyukitanimura)
- fix: checkSparkAnswer displays incorrect labels #3927 (parthchandra)
- fix: support full-width and null characters, and negative scale in string to decimal #3922 (parthchandra)
- fix: enable Corr #3892 (kazuyukitanimura)
- fix: array to array cast #2897 (manuzhang)
- fix: exclude tpcds-plan-stability extended.txt files from rat license check #3964 (andygrove)
- fix: use UTC for Arrow schema timezone in SparkToColumnar conversions #3878 (andygrove)
- fix: remove spurious .flatten call that garbled SortMergeJoin fallback messages #3968 (andygrove)
- fix: Add legacy mode handling to cast Decimal to String #3939 (parthchandra)
- fix: improve test coverage for decimal to primitive type casts #3948 (parthchandra)
- fix: fix decimal div and add tests #3952 (parthchandra)
- fix: make shuffle fallback decisions sticky across planning passes #3982 (andygrove)
Performance related:
- perf: Coalesce broadcast exchange batches before broadcasting #3703 (mbutrovich)
- perf: stop using FFI in native shuffle read path #3731 (andygrove)
- perf: Enable native c2r for more queries #3764 (andygrove)
- perf: Mark more operators as FFI safe to avoid deep copies #3765 (andygrove)
- perf: remove BufReader wrapper when copying spill files to shuffle output #3861 (andygrove)
- fix: share unified memory pools across native execution contexts within a task #3924 (andygrove)
Implemented enhancements:
- feat: Add PR review skill for Comet expression reviews #3711 (andygrove)
- feat: add sort_array benchmark #3758 (grorge123)
- feat: Support Spark expression days #3746 (0lai0)
- feat: expose comet metrics through Spark's external monitoring system #3708 (coderfender)
- feat: support SQL aggregate FILTER (WHERE ...) clause in native execution #3835 (viirya)
- feat: Implement CRC32C algorithm #3822 (snmvaughan)
- feat: add audit-comet-expression Claude Code skill #3793 (andygrove)
- feat: enable native_datafusion scan in auto mode #3781 (andygrove)
- feat: support LEAD and LAG window functions with IGNORE NULLS #3876 (viirya)
- feat: add standalone shuffle benchmark tool #3752 (andygrove)
- feat: Mark array_compact as Compatible and improve test coverage #3889 (andygrove)
- feat: add native support for get_json_object expression #3747 (andygrove)
- feat: Support Spark expression hours #3804 (0lai0)
- feat: add support for date_from_unix_date expression #3144 (andygrove)
- feat: support spark bin function #3928 (kazantsev-maksim)
- feat: support sort_array expression #3706 (grorge123)
Documentation updates:
- docs: Add some .lldbinit configurations for debugging document #3686 (wForget)
- docs: document Iceberg Spark tests in contributor guide #3777 (mbutrovich)
- docs: document negative zero cast-to-string incompatibility [#3811](https://github.com/...
0.14.1
DataFusion Comet 0.14.1 Changelog
This release consists of 5 commits from 1 contributor. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: [branch-0.14] backport #3802 - cache object stores and bucket regions to reduce DNS query volume #3935 (andygrove)
- fix: [branch-0.14] backport #3924 - share unified memory pools across native execution contexts #3938 (andygrove)
- fix: [branch-0.14] backport #3879 - skip Comet columnar shuffle for stages with DPP scans #3934 (andygrove)
- fix: [branch-0.14] backport #3914 - use min instead of max when capping write buffer size to Int range #3936 (andygrove)
- fix: [branch-0.14] backport #3865 - handle ambiguous and non-existent local times #3937 (andygrove)
Credits
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
5 Andy Grove
Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.
0.14.0
DataFusion Comet 0.14.0 Changelog
This release consists of 189 commits from 21 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: [iceberg] Fall back on dynamicpruning expressions for CometIcebergNativeScan #3335 (mbutrovich)
- fix: [iceberg] Disable native c2r by default #3348 (andygrove)
- fix: Fix `space()` with negative input #3347 (hsiang-c)
- fix: respect scan impl config for v2 scan #3357 (andygrove)
- fix: fix memory safety issue in native c2r #3367 (andygrove)
- fix: preserve partitioning in CometNativeScanExec for bucketed scans #3392 (andygrove)
- fix: unignore row index Spark SQL tests for native_datafusion #3414 (andygrove)
- fix: fall back to Spark when Parquet field ID matching is enabled in native_datafusion #3415 (andygrove)
- fix: Expose bucketing information from CometNativeScanExec #3437 (andygrove)
- fix: support scalar processing for `space` function #3408 (kazantsev-maksim)
- fix: Revert "perf: Remove mutable buffers from scan partition/missing columns (#3411)" [iceberg] #3486 (mbutrovich)
- fix: unignore input_file_name Spark SQL tests for native_datafusion #3458 (andygrove)
- fix: add scalar support for bit_count expression #3361 (hsiang-c)
- fix: Support concat_ws with literal NULL separator #3542 (0lai0)
- fix: handle type mismatches in native c2r conversion #3583 (andygrove)
- fix: disable native C2R for legacy Iceberg scans [iceberg] #3663 (mbutrovich)
- fix: resolve Miri UB in null struct field test, re-enable Miri on PRs #3669 (andygrove)
- fix: Support on all-literal RLIKE expression #3647 (0lai0)
- fix: Fix scan metrics test to run with both native_datafusion and native_iceberg_compat #3690 (andygrove)
Performance related:
- perf: refactor sum int with specialized implementations for each eval_mode #3054 (andygrove)
- perf: Optimize contains expression with SIMD-based scalar pattern sea… #2991 (Shekharrajak)
- perf: Add batch coalescing in BufBatchWriter to reduce IPC schema overhead #3441 (andygrove)
- perf: Use `native_datafusion` scan in benchmark scripts (6% faster for TPC-H) #3460 (andygrove)
- perf: Remove mutable buffers from scan partition/missing columns #3411 (andygrove)
- perf: [iceberg] Single-pass FileScanTask validation #3443 (mbutrovich)
- perf: Improve benchmarks for native row-to-columnar used by JVM shuffle #3290 (andygrove)
- perf: executePlan uses a channel to park executor task thread instead of yield_now() [iceberg] #3553 (mbutrovich)
- perf: Initialize tokio runtime worker threads from spark.executor.cores #3555 (andygrove)
- perf: Add Comet config for native Iceberg reader's data file concurrency [iceberg] #3584 (mbutrovich)
- perf: reuse CometConf.COMET_TRACING_ENABLED, Native, NativeUtil in NativeBatchDecoderIterator #3627 (mbutrovich)
- perf: Improve performance of native row-to-columnar transition used by JVM shuffle #3289 (andygrove)
- perf: use aligned pointer reads for SparkUnsafeRow field accessors #3670 (andygrove)
- perf: Optimize some decimal expressions #3619 (andygrove)
Implemented enhancements:
- feat: Native columnar to row conversion (Phase 2) #3266 (andygrove)
- feat: Enable native columnar-to-row by default #3299 (andygrove)
- feat: add support for `width_bucket` expression #3273 (davidlghellin)
- feat: Drop `native_comet` as a valid option for `COMET_NATIVE_SCAN_IMPL` config #3358 (andygrove)
- feat: Support date to timestamp cast #3383 (coderfender)
- feat: CometExecRDD supports per-partition plan data, reduce Iceberg native scan serialization, add DPP [iceberg] #3349 (mbutrovich)
- feat: Support right expression #3207 (Shekharrajak)
- feat: support map_contains_key expression #3369 (peterxcli)
- feat: add support for make_date expression #3147 (andygrove)
- feat: add support for next_day expression #3148 (andygrove)
- feat: implement cast from whole numbers to binary format and bool to decimal #3083 (coderfender)
- feat: Support for StringSplit #2772 (Shekharrajak)
- feat: CometNativeScan per-partition plan serde #3511 (mbutrovich)
- feat: Remove mutable buffers from scan partition/missing columns [iceberg] #3514 (andygrove)
- feat: pass spark.comet.datafusion.* configs through to DataFusion session #3455 (andygrove)
- feat: pass vended credentials to Iceberg native scan #3523 (tokoko)
- feat: Cast date to Numeric (No Op) #3544 (coderfender)
- feat: add support `crc32` expression #3498 (rafafrdz)
- feat: Support int to timestamp casts #3541 (coderfender)
- feat(benchmarks): add async-profiler support to TPC benchmark scripts #3613 (andygrove)
- feat: Cast numeric (non int) to timestamp #3559 (coderfender)
- feat: [ANSI] Ansi sql error messages #3580 (parthchandra)
- feat: enable debug assertions in CI profile, fix unaligned memory access bug #3652 (andygrove)
- feat: Enable native c2r by default, add debug asserts #3649 (andygrove)
- feat: support Spark luhn_check expression #3573 (n0r0shi)
Documentation updates:
- docs: Add changelog for 0.13.0 #3260 (andygrove)
- docs: fix bug in placement of prettier-ignore-end in generated docs #3287 (andygrove)
- docs: Add contributor guide page for SQL file tests #3333 (andygrove)
- docs: fix inaccurate claim about mutable buffers in parquet scan docs #3378 (andygrove)
- docs: Improve documentation on maven usage for running tests #3370 (andygrove)
- docs: move release process docs to contributor guide #3492 (andygrove)
- docs: improve release process documentation #3508 (andygrove)
- docs: update roadm...
0.13.0
DataFusion Comet 0.13.0 Changelog
This release consists of 169 commits from 15 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: NativeScan count assert firing for no reason #2850 (EmilyMatt)
- fix: Correct link to tracing guide in CometConf #2866 (manuzhang)
- fix: Fall back to Spark for MakeDecimal with unsupported input type #2815 (andygrove)
- fix: Normalize s3 paths for PME key retriever #2874 (mbutrovich)
- fix: modify CometNativeScan to generate the file partitions without instantiating RDD #2891 (mbutrovich)
- fix: Modulus on decimal data type mismatch #2922 (andygrove)
- fix: [iceberg] Mark nativeIcebergScanMetadata @transient #2930 (mbutrovich)
- fix: enable cast tests for Spark 4.0 #2919 (manuzhang)
- fix: Remove fallback for maps containing complex types #2943 (andygrove)
- fix: CometShuffleManager hang by deferring SparkEnv access #3002 (Shekharrajak)
- fix: format decimal to string when casting to short #2916 (manuzhang)
- fix: [iceberg] reduce granularity of metrics updates in IcebergFileStream #3050 (mbutrovich)
- fix: native shuffle now reports spill metrics correctly #3197 (andygrove)
- fix: Prevent native write when input is not Arrow format #3227 (andygrove)
- fix: Add JDK to Docker image for release build #3262 (hsiang-c)
Performance related:
- perf: [iceberg] Deduplicate serialized metadata for Iceberg native scan #2933 (mbutrovich)
- perf: Use await instead of block_on in native shuffle writer #2937 (mbutrovich)
- perf: refactor executePlan to try to avoid constantly entering Tokio runtime #2938 (mbutrovich)
- perf: Optimize lpad/rpad to remove unnecessary memory allocations per element #2963 (andygrove)
- perf: Improve performance of normalize_nan #2999 (andygrove)
- perf: Improve string expression microbenchmarks #3012 (andygrove)
- perf: Improve date/time microbenchmarks to avoid redundant/duplicate benchmarks #3020 (andygrove)
- perf: Improve aggregate expression microbenchmarks #3021 (andygrove)
- perf: Improve conditional expression microbenchmarks #3024 (andygrove)
- perf: Improve performance of date truncate #2997 (andygrove)
- perf: Add microbenchmark for comparison expressions #3026 (andygrove)
- perf: Implement more microbenchmarks for cast expressions #3031 (andygrove)
- perf: Add microbenchmark for hash expressions #3028 (andygrove)
- perf: Improve performance of CAST from string to int #3017 (coderfender)
- perf: Improve criterion benchmarks for cast string to int #3049 (andygrove)
- perf: Additional optimizations for cast from string to int #3048 (andygrove)
- perf: set DataFusion session context's target_partitions to match Spark's spark.task.cpus #3062 (mbutrovich)
- perf: don't busy-poll Tokio stream for plans without CometScan #3063 (mbutrovich)
- perf: minor optimizations in `process_sorted_row_partition` #3059 (andygrove)
- perf: optimize complex-type hash implementations #3140 (mbutrovich)
- perf: [iceberg] Remove IcebergFileStream, use iceberg-rust's parallelization, bump iceberg-rust to latest, cache SchemaAdapter #3051 (mbutrovich)
- perf: [iceberg] reduce nativeIcebergScanMetadata serialization points #3243 (mbutrovich)
- perf: reduce GC pressure in protobuf serialization #3242 (andygrove)
- perf: cache serialized query plans to avoid per-partition serialization #3246 (andygrove)
- perf: [iceberg] Use protobuf instead of JSON to serialize Iceberg partition values #3247 (parthchandra)
Implemented enhancements:
- feat: Add experimental support for native Parquet writes #2812 (andygrove)
- feat: Partially implement file commit protocol for native Parquet writes #2828 (andygrove)
- feat: CometNativeWriteExec support with native scan as a child #2839 (mbutrovich)
- feat: Add support for `explode` and `explode_outer` for array inputs #2836 (andygrove)
- feat: Support ANSI mode SUM (Decimal types) #2826 (coderfender)
- feat: Add expression registry to native planner #2851 (andygrove)
- feat: Implement native operator registry #2875 (andygrove)
- feat: Improve fallback reporting for `native_datafusion` scan #2879 (andygrove)
- feat: Enable bucket pruning with native_datafusion scans #2888 (mbutrovich)
- feat: support_ansi-mode_aggregated_benchmarking #2901 (coderfender)
- feat: [iceberg] REST catalog support for CometNativeIcebergScan #2895 (mbutrovich)
- feat: [iceberg] Support session token in Iceberg Native scan #2913 (hsiang-c)
- feat: Make shuffle writer buffer size configurable #2899 (andygrove)
- feat: Add partial support for `from_json` #2934 (andygrove)
- feat: Create benchmarks comet cast #2932 (coderfender)
- feat: Support string decimal cast #2925 (coderfender)
- feat: Remove unnecessary transition for native writes #2960 (comphead)
- feat: Initial implementation of size for array inputs #2862 (andygrove)
- feat: Support ANSI mode sum expr (int inputs) #2600 (coderfender)
- feat: Support casting string float types #2835 (coderfender)
- feat: Support ANSI mode avg expr (int inputs) #2817 (coderfender)
- feat: Add support for remote Parquet HDFS writer with openDAL #2929 (comphead)
- feat: Expand `murmur3` hash support to complex types #3077 (andygrove)
- feat: Comet Writer should respect object store settings #3042 (comphead)
- feat: add support for unix_date expression #3141 (andygrove)
- feat: add partial support for date_format expression #3201 (andygrove)
- feat: add complex type support to native Parquet writer [#32...
0.12.0
DataFusion Comet 0.12.0 Changelog
This release consists of 105 commits from 13 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: Fix `None.get` in `stringDecode` when `bin` child cannot be converted #2606 (cfmcgrady)
- fix: Update FuzzDataGenerator to produce dictionary-encoded string arrays & fix bugs that this exposes #2635 (andygrove)
- fix: Fallback to Spark for lpad/rpad for unsupported arguments & fix negative length handling #2630 (andygrove)
- fix: Mark SortOrder with floating-point as incompatible #2650 (andygrove)
- fix: Fall back to Spark for `trunc`/`date_trunc` functions when format string is unsupported, or is not a literal value #2634 (andygrove)
- fix: [native_datafusion] only pass single partition of PartitionedFiles into DataSourceExec #2675 (mbutrovich)
- fix: Fix subcommands options in fuzz-testing #2684 (manuzhang)
- fix: Do not replace SMJ with HJ for `LeftSemi` #2687 (comphead)
- fix: Apply spotless on Iceberg 1.8.1 diff [iceberg] #2700 (hsiang-c)
- fix: Fix generate-user-guide-reference-docs failure when mvn command is not executed at root #2691 (manuzhang)
- fix: Fix missing SortOrder fallback reason in range partitioning #2716 (andygrove)
- fix: CometLiteral class cast exception with arrays #2718 (andygrove)
- fix: NormalizeNaNAndZero::children() returns child's child #2732 (mbutrovich)
- fix: checkSparkMaybeThrows should compare Spark and Comet results in success case #2728 (andygrove)
- fix: Mark `WindowsExec` as incompatible #2748 (andygrove)
- fix: Add strict floating point mode and fallback to Spark for min/max/sort on floating point inputs when enabled #2747 (andygrove)
- fix: Implement producedAttributes for CometWindowExec #2789 (rahulbabarwal89)
- fix: Pass all Comet configs to native plan #2801 (andygrove)
Implemented enhancements:
- feat: Add option to write benchmark results to file #2640 (andygrove)
- feat: Implement metrics for iceberg compat #2615 (EmilyMatt)
- feat: Define function signatures in CometFuzz #2614 (andygrove)
- feat: cherry-pick UUID conversion logic from #2528 #2648 (mbutrovich)
- feat: support `concat` for strings #2604 (comphead)
- feat: Add support for `abs` #2689 (andygrove)
- feat: Support variadic function in CometFuzz #2682 (manuzhang)
- feat: CometExecRule refactor: Unify CometNativeExec creation with Serde in CometOperatorSerde trait #2768 (andygrove)
- feat: support cot #2755 (psvri)
- feat: Add bash script to build and run fuzz testing #2686 (manuzhang)
- feat: Add getSupportLevel to CometAggregateExpressionSerde trait #2777 (andygrove)
- feat: Add CI check to ensure generated docs are in sync with code #2779 (andygrove)
- feat: Add prettier enforcement #2783 (andygrove)
- feat: hyperbolic trig functions #2784 (psvri)
- feat: [iceberg] Native scan by serializing FileScanTasks to iceberg-rust #2528 (mbutrovich)
Documentation updates:
- docs: Add changelog for 0.11.0 release #2585 (mbutrovich)
- docs: Improve documentation layout #2587 (andygrove)
- docs: Publish 0.11.0 user guide #2589 (andygrove)
- docs: Put Comet logo in top nav bar, respect light/dark mode #2591 (andygrove)
- docs: Improve main landing page #2593 (andygrove)
- docs: Improve site navigation #2597 (andygrove)
- docs: Update benchmark results #2596 (andygrove)
- docs: Upgrade pydata-sphinx-theme to 0.16.1 #2602 (andygrove)
- docs: Fix redirect #2603 (andygrove)
- docs: Fix broken image link #2613 (andygrove)
- docs: Add FFI docs to contributor guide #2668 (andygrove)
- docs: Various documentation updates #2674 (andygrove)
- docs: Add supported SortOrder expressions and fix a typo #2694 (andygrove)
- docs: Minor docs update for running Spark SQL tests #2712 (andygrove)
- docs: Update contributor guide for adding a new expression #2704 (andygrove)
- docs: Documentation updates for `LocalTableScan` and `WindowExec` #2742 (andygrove)
- docs: Typo fix #2752 (wForget)
- docs: Categorize some configs as `testing` and add notes about known time zone issues #2740 (andygrove)
- docs: Run prettier on all markdown files #2782 (andygrove)
- docs: Ignore prettier formatting for generated tables #2790 (andygrove)
- docs: Add new section to contributor guide, explaining how to add a new operator #2758 (andygrove)
Other:
- chore: Start 0.12.0 development #2584 (mbutrovich)
- chore: Bump Spark from 3.5.6 to 3.5.7 #2574 (cfmcgrady)
- chore(deps): bump parquet from 56.0.0 to 56.2.0 in /native #2608 (dependabot[bot])
- chore(deps): bump tikv-jemallocator from 0.6.0 to 0.6.1 in /native #2609 (dependabot[bot])
- chore(deps): bump tikv-jemalloc-ctl from 0.6.0 to 0.6.1 in /native #2610 (dependabot[bot])
- tests: FuzzDataGenerator instead of Parquet-specific generator #2616 (mbutrovich)
- chore: Simplify on-heap memory configuration #2599 (andygrove)
- Feat: Add sha1 function impl #2471 (kazantsev-maksim)
- chore: Refactor Parquet/DataFrame fuzz data generators #2629 (andygrove)
- chore: Remove needless from_raw calls #2638 (EmilyMatt)
- chore: support DataFusion 50.3.0 #2605 (comphead)
- chore(deps): bump actions/upload-artifact from 4 to 5 #2654 (dependabot[bot])
- chore(deps): bump cc from 1.2.42 to 1.2.43 in /native #2653 (dependabot[bot])
- chore(deps): bump actions/download-artifact from 5 to 6 #2652 (dependabot[bot])
- chore: extract c...
0.11.0
DataFusion Comet 0.11.0 Changelog
This release consists of 131 commits from 15 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: temporarily ignore test for hdfs file systems #2359 (parthchandra)
- fix: Check reused broadcast plan in non-AQE and make setNumPartitions thread safe #2398 (wForget)
- fix: correct `missingInput` for `CometHashAggregateExec` #2409 (comphead)
- fix: clippy errors rust 1.9.0 update #2419 (coderfender)
- fix: Avoid spark plan execution cache preventing CometBatchRDD numPartitions change #2420 (wForget)
- fix: regressions in `CometToPrettyStringSuite` #2384 (hsiang-c)
- fix: Byte array Literals failed on cast #2432 (comphead)
- fix: Do not push down subquery filters on native_datafusion scan #2438 (wForget)
- fix: Improve error handling when resolving S3 bucket region #2440 (andygrove)
- fix: [iceberg] additional parquet independent api for iceberg integration #2442 (parthchandra)
- fix: Specify reqwest crate features #2446 (andygrove)
- fix: distributed RangePartitioning bounds calculation with native shuffle #2258 (mbutrovich)
- fix: fix regression in tpcbench.py #2512 (andygrove)
- fix: [iceberg] Close reader instance in ReadConf #2510 (hsiang-c)
- fix: Enable plan stability tests for `auto` scan #2516 (andygrove)
- fix: Capture unexpected output when retrieving JVM 17 args in Makefile #2566 (zuston)
Performance related:
- perf: New Configuration from shared conf to avoid high costs #2402 (wForget)
- perf: Use DataFusion's `count_udaf` instead of `SUM(IF(expr IS NOT NULL, 1, 0))` #2407 (andygrove)
- perf: Improve BroadcastExchangeExec conversion #2417 (wForget)
Implemented enhancements:
- feat: Add dynamic `enabled` and `allowIncompat` configs for all supported expressions #2329 (andygrove)
- feat: feature specific tests #2372 (parthchandra)
- feat: Support more date part expressions #2316 (wForget)
- feat: rpad support column for second arg instead of just literal #2099 (coderfender)
- feat: Support comet native log level conf #2379 (wForget)
- feat: Enable WeekDay function #2411 (wForget)
- feat: Add nested Array literal support #2181 (comphead)
- feat: add_additional_char_support_rpad #2436 (coderfender)
- feat: do not fallback to Spark for `COUNT(distinct)` #2429 (comphead)
- feat: implement_ansi_eval_mode_arithmetic #2136 (coderfender)
- feat: Add plan conversion statistics to extended explain info #2412 (andygrove)
- feat: implement_comet_native_lpad_expr #2102 (coderfender)
- feat: Add `backtrace` feature to simplify enabling native backtraces in `CometNativeException` #2515 (andygrove)
- feat: Support reverse function with ArrayType input #2481 (cfmcgrady)
- feat: Change default off-heap memory pool from `greedy_unified` to `fair_unified` #2526 (andygrove)
- feat: Make DiskManager `max_temp_directory_size` configurable #2479 (manuzhang)
- feat: Parquet Modular Encryption with Spark KMS for native readers #2447 (mbutrovich)
- feat: Add support for Spark-compatible cast from integral to decimal #2472 (coderfender)
- feat:Support ANSI mode integral divide #2421 (coderfender)
- feat: Add config to enable running Comet in onheap mode #2554 (andygrove)
- feat:support ansi mode rounding function #2542 (coderfender)
- feat:support ansi mode remainder function #2556 (coderfender)
- feat: Implement array-to-string cast support #2425 (cfmcgrady)
- feat: Various improvements to memory pool configuration, logging, and documentation #2538 (andygrove)
- feat: Enable complex types for columnar shuffle #2573 (mbutrovich)
- feat: support_decimal_types_bool_cast_native_impl #2490 (coderfender)
- feat: Use buf write to reduce system call on index write #2579 (zuston)
Documentation updates:
- doc: Document usage IcebergCometBatchReader.java #2347 (comphead)
- docs: Add changelog for 0.10.0 release #2361 (andygrove)
- docs: Fix error in docs #2373 (andygrove)
- docs: Fix more comet versions in docs #2374 (andygrove)
- docs: Publish 0.10.0 user guide #2394 (andygrove)
- doc: macos benches doc clarifications #2418 (comphead)
- docs: update configs.md after #2422 #2428 (mbutrovich)
- docs: update docs and tuning guide related to native shuffle #2487 (mbutrovich)
- docs: Improve EC2 benchmarking guide #2474 (andygrove)
- docs: docs_update_ansi_support #2496 (coderfender)
- docs:support lpad expression documentation update #2517 (coderfender)
- docs: doc changes to support ANSI mode integral divide #2570 (coderfender)
- docs: Split configuration guide into different sections (scan, exec, shuffle, etc) #2568 (andygrove)
- docs: doc update to support ANSI mode remainder function #2576 (coderfender)
- docs: Documentation updates #2581 (andygrove)
Other:
- chore(deps): bump uuid from 1.18.0 to 1.18.1 in /native #2336 (dependabot[bot])
- build: Check that all Scala test suites run in PR builds #2304 (andygrove)
- chore: Start 0.11.0 development #2365 (andygrove)
- chore: Split expression serde hash map into separate categories #2322 (andygrove)
- chore: exclude Iceberg diffs from rat checks #2376 (hsiang-c)
- chore: Refactor UnaryMinus serde #2378 (andygrove)
- chore: Revert "chore: [1941-Part1]: Introduce map_sort scalar function (#2… #2381 (comphead)
- chore: Refactor Literal serde [#2377](https://github.com/apache/datafusion-comet/pull/...
0.10.1
DataFusion Comet 0.10.1 Changelog
This release consists of 7 commits from 1 contributor. See credits at the end of this changelog for more information.
Documentation updates:
- docs: [branch-0.10] Update version number in branch-0.10 user guide #2395 (andygrove)
Other:
- chore: [branch-0.10] Support Spark 4.0.1 instead of 4.0.0 (#2414) #2497 (andygrove)
- build: [branch-0.10] Stop caching libcomet in CI (#2498) #2502 (andygrove)
- chore: [branch-0.10] perf: Improve BroadcastExchangeExec conversion #2501 (andygrove)
- chore: [branch-0.10] [iceberg] additional parquet independent api for iceberg integration (#2442) #2499 (andygrove)
- fix: [branch-0.10] Avoid spark plan execution cache preventing CometBatchRDD numPartitions change (#2420) #2503 (andygrove)
- build: [branch-0.10] Bump version to 0.10.1 #2508 (andygrove)
Credits
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
7 Andy Grove
Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.
0.10.0
DataFusion Comet 0.10.0 Changelog
This release consists of 183 commits from 26 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: [Iceberg] Fix decimal corruption #1985 (andygrove)
- fix: broken link in development.md #2024 (petern48)
- fix: [iceberg] Add LogicalTypeAnnotation in ParquetColumnSpec #2000 (huaxingao)
- fix: hdfs read into buffer fully #2031 (parthchandra)
- fix: Refactor arithmetic serde and fix correctness issues with EvalMode::TRY #2018 (andygrove)
- fix: clean up [iceberg] integration APIs #2032 (huaxingao)
- fix: zero Arrow Array offset before sending across FFI #2052 (mbutrovich)
- fix: [iceberg] more fixes for Iceberg integration APIs. #2078 (parthchandra)
- fix: Add support for StringDecode in Spark 4.0.0 #2075 (peter-toth)
- fix: Avoid double free in CometUnifiedShuffleMemoryAllocator #2122 (andygrove)
- fix: Remove duplicate serde code #2098 (andygrove)
- fix: Improve logic for determining when an UnpackOrDeepCopy is needed #2142 (andygrove)
- fix: Add CopyExec to inputs to SortMergeJoinExec #2155 (andygrove)
- fix: Fix repeatedly url-decode path when reading parquet from s3 using native parquet reader #2138 (Kontinuation)
- fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel #1987 (hsiang-c)
- fix: [iceberg] Fall back to spark for schemas with empty structs #2204 (andygrove)
- fix: Fix failing TPC-DS workflow in PR CI runs #2207 (andygrove)
- fix: [iceberg] order query result deterministically #2208 (hsiang-c)
- fix: use spark.comet.batchSize instead of conf.arrowMaxRecordsPerBatch for data that is coming from Java #2196 (rluvaton)
- fix: if expr nullable #2217 (Asura7969)
- fix: Support auto scan mode with Spark 4.0.0 #1975 (andygrove)
- fix: Make Sha2 fallback message more user-friendly #2213 (rishvin)
- fix: separate type checking for CometExchange and CometColumnarExchange #2241 (mbutrovich)
- fix: Fix potential resource leak in native shuffle block reader #2247 (andygrove)
- fix: Remove unreachable code in CometScanRule #2252 (andygrove)
- fix: Fall back to native_comet for encrypted Parquet scans #2250 (andygrove)
- fix: Fall back to native_comet when object store not supported by native_iceberg_compat #2251 (andygrove)
- fix: split expr.proto file (new) #2267 (kination)
- fix: handle cast to dictionary vector introduced by case when #2044 (parthchandra)
- fix: Remove check for custom S3 endpoints #2288 (andygrove)
- fix: implement lazy evaluation in Coalesce function #2270 (coderfender)
- fix: Update benchmarking scripts #2293 (andygrove)
- fix: Fix regression in NativeConfigSuite #2299 (andygrove)
- fix: Validating object store configs should not throw exception #2308 (andygrove)
- fix: TakeOrderedAndProjectExec is not reporting all fallback reasons #2323 (kazuyukitanimura)
- fix: Fallback length function with binary input #2349 (wForget)
Performance related:
- perf: Optimize AvgDecimalGroupsAccumulator #1893 (leung-ming)
- perf: Optimize SumDecimalGroupsAccumulator::update_single #2069 (leung-ming)
- perf: Avoid FFI copy in ScanExec when reading data from exchanges #2268 (andygrove)
Implemented enhancements:
- feat: Add from_unixtime support #1943 (kazuyukitanimura)
- feat: randn expression support #2010 (akupchinskiy)
- feat: monotonically_increasing_id and spark_partition_id implementation #2037 (akupchinskiy)
- feat: support map_entries #2059 (comphead)
- feat: Support Array Literal #2057 (comphead)
- feat: Add new trait for operator serde #2115 (andygrove)
- feat: limit with offset support #2070 (akupchinskiy)
- feat: Include scan implementation name in CometScan nodeName #2141 (andygrove)
- feat: Add config option to log fallback reasons #2154 (andygrove)
- feat: [iceberg] Enable Comet shuffle in Iceberg diff #2205 (andygrove)
- feat: Improve shuffle fallback reporting #2194 (andygrove)
- feat: Reset data buf of NativeBatchDecoderIterator on close #2235 (wForget)
- feat: Improve fallback mechanism for ANSI mode #2211 (andygrove)
- feat: Support hdfs with OpenDAL #2244 (wForget)
- feat: Ignore fallback info for command execs #2297 (wForget)
- feat: Improve some confusing fallback reasons #2301 (wForget)
- feat: Make supported hadoop filesystem schemes configurable #2272 (wForget)
- feat: [1941-Part1]: Introduce map-sort scalar function #2262 (rishvin)
- feat: [iceberg] delete rows support using selection vectors #2346 (parthchandra)
Documentation updates:
- docs: Update benchmark results for 0.9.0 #1959 (andygrove)
- doc: Add comment about local clippy run before submitting a pull request #1961 (akupchinskiy)
- docs: Minor improvements to Spark SQL test docs #1980 (andygrove)
- docs: Update Maven links for 0.9.0 release #1988 (andygrove)
- docs: Documentation updates for 0.9.0 release #1981 (andygrove)
- docs: Add guide showing comparison between Comet and Gluten #2012 (andygrove)
- docs: Remove legacy comment in docs #2022 (andygrove)
- docs: Update Gluten comparison to clarify that Velox is open-source #2043 (andygrove)
- docs: Improve Gluten comparison based on feedback from the community #2048 (andygrove)
- docs: added a missing export into the plan stability section #2071 (akupchinskiy)
- doc: Added documentation for supported map functions #2074 (codetyri0n)
- doc: Alternative way to start Spark Master to run benchmarks #2072 (comphead)
- docs: Update to support try arithmetic functions #2143 (coderfender)
- doc: update macos standalone spark start instructions #2103 (comphead)
- docs: Update confs to bypass Iceberg Spark issues #2166 (hsiang-c)
- docs: Add Roadmap #2191 (andygrove)
- docs: Update installation guide for 0.9.1 #2230 (andygrov...
0.9.1
DataFusion Comet 0.9.1 Changelog
This release consists of 2 commits from 1 contributor. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: [branch-0.9] Backport FFI fix #2164 (andygrove)
- fix: [branch-0.9] Avoid double free in CometUnifiedShuffleMemoryAllocator #2201 (andygrove)
Credits
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
2 Andy Grove
Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.