0.5.0
Pre-release
DataFusion Comet 0.5.0 Changelog
This release consists of 69 commits from 15 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: Unsigned type related bugs #1095 (kazuyukitanimura)
- fix: Use RDD partition index #1112 (viirya)
- fix: Various metrics bug fixes and improvements #1111 (andygrove)
- fix: Don't create CometScanExec for subclasses of ParquetFileFormat #1129 (Kimahriman)
- fix: Fix metrics regressions #1132 (andygrove)
- fix: Enable scenarios accidentally commented out in CometExecBenchmark #1151 (mbutrovich)
- fix: Spark 4.0-preview1 SPARK-47120 #1156 (kazuyukitanimura)
- fix: Document enabling comet explain plan usage in Spark (4.0) #1176 (parthchandra)
- fix: stddev_pop should not directly return 0.0 when count is 1.0 #1184 (viirya)
- fix: fix missing explanation for then branch in case when #1200 (rluvaton)
- fix: Fall back to Spark for unsupported partition or sort expressions in window aggregates #1253 (andygrove)
- fix: Fall back to Spark for distinct aggregates #1262 (andygrove)
- fix: disable initCap by default #1276 (kazuyukitanimura)
Performance related:
- perf: Stop passing Java config map into native createPlan #1101 (andygrove)
- feat: Make native shuffle compression configurable and respect spark.shuffle.compress #1185 (andygrove)
- perf: Improve query planning to more reliably fall back to columnar shuffle when native shuffle is not supported #1209 (andygrove)
- feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support #1192 (andygrove)
- feat: Implement custom RecordBatch serde for shuffle for improved performance #1190 (andygrove)
Implemented enhancements:
- feat: support array_insert #1073 (SemyonSinchenko)
- feat: enable decimal to decimal cast of different precision and scale #1086 (himadripal)
- feat: Improve ScanExec native metrics #1133 (andygrove)
- feat: Add Spark-compatible implementation of SchemaAdapterFactory #1169 (andygrove)
- feat: Improve shuffle metrics (second attempt) #1175 (andygrove)
- feat: Add a spark.comet.exec.memoryPool configuration for experimenting with various datafusion memory pool setups. #1021 (Kontinuation)
- feat: Reenable tests for filtered SMJ anti join #1211 (comphead)
- feat: add support for array_remove expression #1179 (jatin510)
Documentation updates:
- docs: Update documentation for 0.4.0 release #1096 (andygrove)
- docs: Fix readme typo FGPA -> FPGA #1117 (gstvg)
- docs: Add more technical detail and new diagram to Comet plugin overview #1119 (andygrove)
- docs: Add some documentation explaining how shuffle works #1148 (andygrove)
- docs: Update TPC-H benchmark results #1257 (andygrove)
Other:
- chore: Add changelog for 0.4.0 #1089 (andygrove)
- chore: Prepare for 0.5.0 development #1090 (andygrove)
- build: Skip installation of spark-integration and fuzz testing modules #1091 (parthchandra)
- minor: Add hint for finding the GPG key to use when publishing to maven #1093 (andygrove)
- chore: Include first ScanExec batch in metrics #1105 (andygrove)
- chore: Improve CometScan metrics #1100 (andygrove)
- chore: Add custom metric for native shuffle fetching batches from JVM #1108 (andygrove)
- chore: Remove unused StringView struct #1143 (andygrove)
- test: enable more Spark 4.0 tests #1145 (kazuyukitanimura)
- chore: Refactor cast to use SparkCastOptions param #1146 (andygrove)
- chore: Move more expressions from core crate to spark-expr crate #1152 (andygrove)
- chore: Remove dead code #1155 (andygrove)
- chore: Move string kernels and expressions to spark-expr crate #1164 (andygrove)
- chore: Move remaining expressions to spark-expr crate + some minor refactoring #1165 (andygrove)
- chore: Add ignored tests for reading complex types from Parquet #1167 (andygrove)
- test: enabling Spark tests with offHeap requirement #1177 (kazuyukitanimura)
- minor: move shuffle classes from common to spark #1193 (andygrove)
- minor: refactor to move decodeBatches to broadcast exchange code as private function #1195 (andygrove)
- minor: refactor prepare_output so that it does not require an ExecutionContext #1194 (andygrove)
- minor: remove unused source files #1202 (andygrove)
- chore: Upgrade to DataFusion 44.0.0-rc2 #1154 (andygrove)
- chore: Add safety check to CometBuffer #1050 (viirya)
- chore: Remove unreachable code #1213 (andygrove)
- test: Enable Comet by default except some tests in SparkSessionExtensionSuite #1201 (kazuyukitanimura)
- chore: extract struct expressions to folders based on spark grouping #1216 (rluvaton)
- chore: extract static invoke expressions to folders based on spark grouping #1217 (rluvaton)
- chore: Follow-on PR to fully enable onheap memory usage #1210 (andygrove)
- chore: extract agg_funcs expressions to folders based on spark grouping #1224 (rluvaton)
- chore: extract datetime_funcs expressions to folders based on spark grouping #1222 (rluvaton)
- chore: Upgrade to DataFusion 44.0.0 from 44.0.0 RC2 #1232 (rluvaton)
- chore: extract strings file to strings_func like in spark grouping #1215 (rluvaton)
- chore: extract predicate_functions expressions to folders based on spark grouping #1218 (rluvaton)
- build(deps): bump protobuf version to 3.21.12 #1234 (wForget)
- chore: extract json_funcs expressions to folders based on spark grouping #1220 (rluvaton)
- test: Enable shuffle by default in Spark tests #1240 (kazuyukitanimura)
- chore: extract hash_funcs expressions to folders based on spark grouping #1221 (rluvaton)
- build: Fix test failure caused by merging conflicting PRs #1259 (andygrove)
Credits
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
37 Andy Grove
10 Raz Luvaton
7 KAZUYUKI TANIMURA
3 Liang-Chi Hsieh
2 Parth Chandra
1 Adam Binford
1 Dharan Aditya
1 Himadri Pal
1 Jagdish Parihar
1 Kristin Cowalcijk
1 Matt Butrovich
1 Oleks V
1 Sem
1 Zhen Wang
1 gstvg
Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.