DataFusion Comet 0.5.0 Changelog

This release consists of 69 commits from 15 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

fix: Unsigned type related bugs #1095 (kazuyukitanimura)
fix: Use RDD partition index #1112 (viirya)
fix: Various metrics bug fixes and improvements #1111 (andygrove)
fix: Don't create CometScanExec for subclasses of ParquetFileFormat #1129 (Kimahriman)
fix: Fix metrics regressions #1132 (andygrove)
fix: Enable scenarios accidentally commented out in CometExecBenchmark #1151 (mbutrovich)
fix: Spark 4.0-preview1 SPARK-47120 #1156 (kazuyukitanimura)
fix: Document enabling comet explain plan usage in Spark (4.0) #1176 (parthchandra)
fix: stddev_pop should not directly return 0.0 when count is 1.0 #1184 (viirya)
fix: fix missing explanation for then branch in case when #1200 (rluvaton)
fix: Fall back to Spark for unsupported partition or sort expressions in window aggregates #1253 (andygrove)
fix: Fall back to Spark for distinct aggregates #1262 (andygrove)
fix: disable initCap by default #1276 (kazuyukitanimura)

Performance related:

perf: Stop passing Java config map into native createPlan #1101 (andygrove)
feat: Make native shuffle compression configurable and respect spark.shuffle.compress #1185 (andygrove)
perf: Improve query planning to more reliably fall back to columnar shuffle when native shuffle is not supported #1209 (andygrove)
feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support #1192 (andygrove)
feat: Implement custom RecordBatch serde for shuffle for improved performance #1190 (andygrove)

Implemented enhancements:

feat: support array_insert #1073 (SemyonSinchenko)
feat: enable decimal to decimal cast of different precision and scale #1086 (himadripal)
feat: Improve ScanExec native metrics #1133 (andygrove)
feat: Add Spark-compatible implementation of SchemaAdapterFactory #1169 (andygrove)
feat: Improve shuffle metrics (second attempt) #1175 (andygrove)
feat: Add a spark.comet.exec.memoryPool configuration for experimenting with various DataFusion memory pool setups #1021 (Kontinuation)
feat: Reenable tests for filtered SMJ anti join #1211 (comphead)
feat: add support for array_remove expression #1179 (jatin510)

Documentation updates:

docs: Update documentation for 0.4.0 release #1096 (andygrove)
docs: Fix readme typo FGPA -> FPGA #1117 (gstvg)
docs: Add more technical detail and new diagram to Comet plugin overview #1119 (andygrove)
docs: Add some documentation explaining how shuffle works #1148 (andygrove)
docs: Update TPC-H benchmark results #1257 (andygrove)

Other:

chore: Add changelog for 0.4.0 #1089 (andygrove)
chore: Prepare for 0.5.0 development #1090 (andygrove)
build: Skip installation of spark-integration and fuzz testing modules #1091 (parthchandra)
minor: Add hint for finding the GPG key to use when publishing to maven #1093 (andygrove)
chore: Include first ScanExec batch in metrics #1105 (andygrove)
chore: Improve CometScan metrics #1100 (andygrove)
chore: Add custom metric for native shuffle fetching batches from JVM #1108 (andygrove)
chore: Remove unused StringView struct #1143 (andygrove)
test: enable more Spark 4.0 tests #1145 (kazuyukitanimura)
chore: Refactor cast to use SparkCastOptions param #1146 (andygrove)
chore: Move more expressions from core crate to spark-expr crate #1152 (andygrove)
chore: Remove dead code #1155 (andygrove)
chore: Move string kernels and expressions to spark-expr crate #1164 (andygrove)
chore: Move remaining expressions to spark-expr crate + some minor refactoring #1165 (andygrove)
chore: Add ignored tests for reading complex types from Parquet #1167 (andygrove)
test: enabling Spark tests with offHeap requirement #1177 (kazuyukitanimura)
minor: move shuffle classes from common to spark #1193 (andygrove)
minor: refactor to move decodeBatches to broadcast exchange code as private function #1195 (andygrove)
minor: refactor prepare_output so that it does not require an ExecutionContext #1194 (andygrove)
minor: remove unused source files #1202 (andygrove)
chore: Upgrade to DataFusion 44.0.0-rc2 #1154 (andygrove)
chore: Add safety check to CometBuffer #1050 (viirya)
chore: Remove unreachable code #1213 (andygrove)
test: Enable Comet by default except some tests in SparkSessionExtensionSuite #1201 (kazuyukitanimura)
chore: extract struct expressions to folders based on spark grouping #1216 (rluvaton)
chore: extract static invoke expressions to folders based on spark grouping #1217 (rluvaton)
chore: Follow-on PR to fully enable onheap memory usage #1210 (andygrove)
chore: extract agg_funcs expressions to folders based on spark grouping #1224 (rluvaton)
chore: extract datetime_funcs expressions to folders based on spark grouping #1222 (rluvaton)
chore: Upgrade to DataFusion 44.0.0 from 44.0.0 RC2 #1232 (rluvaton)
chore: extract strings file to strings_func like in spark grouping #1215 (rluvaton)
chore: extract predicate_functions expressions to folders based on spark grouping #1218 (rluvaton)
build(deps): bump protobuf version to 3.21.12 #1234 (wForget)
chore: extract json_funcs expressions to folders based on spark grouping #1220 (rluvaton)
test: Enable shuffle by default in Spark tests #1240 (kazuyukitanimura)
chore: extract hash_funcs expressions to folders based on spark grouping #1221 (rluvaton)
build: Fix test failure caused by merging conflicting PRs #1259 (andygrove)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    37	Andy Grove
    10	Raz Luvaton
     7	KAZUYUKI TANIMURA
     3	Liang-Chi Hsieh
     2	Parth Chandra
     1	Adam Binford
     1	Dharan Aditya
     1	Himadri Pal
     1	Jagdish Parihar
     1	Kristin Cowalcijk
     1	Matt Butrovich
     1	Oleks V
     1	Sem
     1	Zhen Wang
     1	gstvg

Thank you also to everyone who contributed in other ways, such as filing issues, reviewing PRs, and providing feedback on this release.
