feat: support array_insert #1073

Merged: 20 commits merged into apache:main on Nov 22, 2024

Conversation

SemyonSinchenko
Member

@SemyonSinchenko SemyonSinchenko commented Nov 9, 2024

Which issue does this PR close?

Related to #1042

array_insert: SELECT array_insert(array(1, 2, 3, 4), 5, 5)
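
For reference, a tiny plain-Rust illustration of what the example above is expected to do (my own paraphrase of the semantics, not Comet code): positions are 1-based, so inserting the value 5 at position 5 of array(1, 2, 3, 4) places it after the last element.

    fn main() {
        // array_insert(array(1, 2, 3, 4), 5, 5): the position argument is 1-based
        let mut arr = vec![1, 2, 3, 4];
        let pos = 5usize;
        arr.insert(pos - 1, 5); // insert before 0-based index pos - 1
        assert_eq!(arr, vec![1, 2, 3, 4, 5]);
    }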

Rationale for this change

As described in #1042

What changes are included in this PR?

  • QueryPlanSerde.scala: I added an additional case for array_insert;
  • expr.proto: I added a new message for ArrayInsert;
  • planner.rs: I added a case for array_insert;
  • list.rs:
    • I added a new ArrayInsert struct;
    • I implemented PhysicalExpr, Display and PartialEq for it;
    • The main logic of the insertion is in fn array_insert (see the rough sketch below).

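A rough sketch of the new native expression's shape (the field names and import path are my assumptions for illustration, not necessarily what the PR uses):

    use std::sync::Arc;
    use datafusion::physical_expr::PhysicalExpr;

    /// Sketch only: the real struct also implements PhysicalExpr, Display and
    /// PartialEq, and its evaluate() delegates to fn array_insert.
    #[derive(Debug)]
    pub struct ArrayInsert {
        src_array_expr: Arc<dyn PhysicalExpr>, // the array to insert into
        pos_expr: Arc<dyn PhysicalExpr>,       // 1-based position (Int32 in Spark)
        item_expr: Arc<dyn PhysicalExpr>,      // the value being inserted
        legacy_negative_index: bool,           // Spark's legacy negative-index behaviour
    }
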
How are these changes tested?

At the moment I have added simple tests for fn array_insert and a test for QueryPlanSerde.

@SemyonSinchenko
Member Author

As far as I can tell, array_insert is not supported in DataFusion. Is list.rs a good place to have an implementation of ArrayInsert and its PhysicalExpr impl?

@SemyonSinchenko
Member Author

SemyonSinchenko commented Nov 11, 2024

@andygrove Sorry for tagging you, but I have some questions about the ticket (array_insert).

  1. [RESOLVED] array_insert was added in Spark 3.4, so all the 3.3.x tests obviously fail. I checked and it looks like the EOL for 3.3 is around the end of 2024. Technically I think I could work around the 3.3.x tests via the reflection API; my question is mostly whether I should do that given the upcoming EOL of 3.3.x.
  2. array_insert is not supported in DataFusion. I made an implementation (and it looks like it works, except for negative indices and corner cases). Is list.rs a good place for it, or should I move my code somewhere else?
  3. Spark does not support anything except Int32 for the position argument; is it OK if I support only Int32 too? In theory, other types could be supported as well, but I'm still trying to work out how to achieve that and it may become complex...

Thanks in advance! This is my first serious attempt to contribute to the project, so sorry if I'm being annoying.

@SemyonSinchenko SemyonSinchenko changed the title [WIP][DO-NOT-MERGE] feat: support array_insert [WIP] feat: support array_insert Nov 13, 2024
@SemyonSinchenko SemyonSinchenko changed the title [WIP] feat: support array_insert feat: support array_insert Nov 13, 2024
@SemyonSinchenko SemyonSinchenko changed the title feat: support array_insert [WIP] feat: support array_insert Nov 13, 2024
- added test for the negative index
- added test for the legacy spark mode
@SemyonSinchenko SemyonSinchenko changed the title [WIP] feat: support array_insert feat: support array_insert Nov 13, 2024
@SemyonSinchenko SemyonSinchenko marked this pull request as ready for review November 13, 2024 18:04
@andygrove
Member

Thanks, @SemyonSinchenko. I think it's fine to skip the test for Spark 3.3. I plan on reviewing this PR in more detail tomorrow, but it looks good from an initial read.

@codecov-commenter

codecov-commenter commented Nov 14, 2024

Codecov Report

Attention: Patch coverage is 59.09091% with 9 lines in your changes missing coverage. Please review.

Project coverage is 34.27%. Comparing base (845b654) to head (8b58d8d).
Report is 18 commits behind head on main.

Files with missing lines Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 59.09% 7 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1073      +/-   ##
============================================
- Coverage     34.46%   34.27%   -0.20%     
  Complexity      888      888              
============================================
  Files           113      113              
  Lines         43580    43355     -225     
  Branches       9658     9488     -170     
============================================
- Hits          15021    14860     -161     
- Misses        25507    25596      +89     
+ Partials       3052     2899     -153     


@SemyonSinchenko
Member Author

This PR is ready for review. The only failing check failed due to an internal GHA error:

GitHub Actions has encountered an internal error when running your job.

@NoeB
Contributor

NoeB commented Nov 19, 2024

I am not sure if this should be done together with this PR, but it would add some "free" tests. Spark 3.5 introduced array_prepend, which it implements via array_insert, and starting with 4.0 it also implements array_append via array_insert. If you want, you can copy the array_append tests and replace array_append with array_prepend for Spark 3.5+, and enable the array_append tests for Spark 4.0. You can also ignore this comment if you do not agree or if it leads to unrelated errors.

- fixes;
- tests;
- comments in the code;
In one case there is a zero in index and test fails due to spark error
@SemyonSinchenko
Member Author

Thanks for the comments and suggestions!

This PR is ready for review again.

What has changed since the last round of review:

  • I added tests for array_prepend (Spark 3.5+) and enabled the test for array_append (Spark 4.0+);
  • I added tests for the corner cases:
    • Index is negative;
    • Index is bigger than the array length;
    • Index is negative and its absolute value is greater than the array length;
    • Test of the fallback to Spark (UDF as a child);
    • Value to insert is actually null;
  • I fixed the native part's behavior for some of the corner cases;

These tests drew my attention to uncovered parts of the code, which I fixed, and I also added a couple of additional tests to the native part. I revisited how Spark does array_insert and it is trickier than I realized at first glance (roughly sketched below). I added a few additional comments to the native part of the implementation and fixed the behavior.

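To make the corner cases above concrete, here is a very rough sketch of the kind of 1-based index handling involved. This is my own paraphrase of the cases listed above, not the PR's actual code, and it deliberately leaves out the null padding Spark applies when the position is far past the end of the array:

    /// `pos` is Spark's 1-based position, `len` the current array length.
    fn resolve_insert_index(pos: i32, len: i64, legacy_negative_index: bool) -> Result<i64, String> {
        if pos == 0 {
            // Spark raises an error for position 0 (positions are 1-based)
            return Err("array_insert: position must not be zero".to_string());
        }
        if pos > 0 {
            // may be >= len; Spark handles that case by padding with nulls
            return Ok(pos as i64 - 1);
        }
        // negative positions count from the end; the legacy flag shifts them by one
        let from_end = -(pos as i64);
        let offset = if legacy_negative_index { from_end } else { from_end - 1 };
        Ok((len - offset).max(0))
    }
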
At the moment this PR is tested in multiple ways:

  • Basic tests in the native part, which are useful because they are easy to debug and very fast to run;
  • Basic tests in the native part for the so-called "legacy mode" in Spark;
  • Basic tests on the Scala side;
  • Logical tests on small data for corner cases (negative, positive, long, short, null, etc.) on the Scala side;
  • Tests for array_prepend and array_append, which call array_insert under the hood (NULLs, different data types, etc.);
  • A test of the fallback to Spark when one of the children is not supported by Comet;

So it looks to me that all the possible cases are covered and the behavior is the same as in Spark.

@SemyonSinchenko
Member Author

I am not sure if this should be done together with this PR, but it would add some "free" tests. Spark 3.5 introduced array_prepend, which it implements via array_insert, and starting with 4.0 it also implements array_append via array_insert. If you want, you can copy the array_append tests and replace array_append with array_prepend for Spark 3.5+, and enable the array_append tests for Spark 4.0. You can also ignore this comment if you do not agree or if it leads to unrelated errors.

Done!

Comment on lines 493 to 501
    let src_element_type = match src_value.data_type() {
        DataType::List(field) => field.data_type(),
        DataType::LargeList(field) => field.data_type(),
        data_type => {
            return Err(DataFusionError::Internal(format!(
                "Unexpected src array type in ArrayInsert: {:?}",
                data_type
            )))
        }
Member

minor nit: this logic for extracting a list type is repeated a few times and could be factored out into a function

Member Author

@SemyonSinchenko SemyonSinchenko Nov 21, 2024

@andygrove Thanks for the suggestion!
I moved the check of the array type (and the error logic) into a method:

    pub fn array_type(&self, data_type: &DataType) -> DataFusionResult<DataType> {
        match data_type {
            DataType::List(field) => Ok(DataType::List(Arc::clone(field))),
            DataType::LargeList(field) => Ok(DataType::LargeList(Arc::clone(field))),
            data_type => {
                return Err(DataFusionError::Internal(format!(
                    "Unexpected src array type in ArrayInsert: {:?}",
                    data_type
                )))
            }
        }
    }

At the very least it avoids returning the same error in multiple places. Is that what you suggested? Or should I move this method into a helper function and also refactor GetArrayStructFields to use such a function?

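For what it's worth, a hypothetical free-function variant of the same idea (a sketch under assumed names, not the PR's code); both ArrayInsert and other list expressions could then call it instead of repeating the match and the error:

    use arrow::datatypes::DataType;
    use datafusion::common::{DataFusionError, Result as DataFusionResult};

    pub fn as_list_type(data_type: &DataType) -> DataFusionResult<DataType> {
        match data_type {
            // accept both list flavours and return the type unchanged
            DataType::List(_) | DataType::LargeList(_) => Ok(data_type.clone()),
            other => Err(DataFusionError::Internal(format!(
                "Unexpected array type: {:?}",
                other
            ))),
        }
    }
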
P.S. Sorry for the stupid question... But can you please explain to me why we always check both List and LargeList, while Apache Spark only supports i32 indexes for arrays (the max length is Integer.MAX_VALUE - 15), which corresponds to List as far as I understand? All the code in list.rs might become a bit simpler if we made it non-generic (it would also make it simpler to implement other missing methods like array_zip).

Member

That's a good question. I wonder if the existing code for LargeList is actually being tested. It would be interesting to try removing it and see if there are any regressions. It makes sense to only handle List if Spark only supports i32 indexes.

Contributor

@Kimahriman Kimahriman Nov 28, 2024

The difference, I think, is that a LargeList can store more than Integer.MAX_VALUE entries across all rows in a single batch, so if you have multiple Spark rows each holding close to the maximum supported number of elements, they wouldn't fit into an Arrow List array. That would probably need to be supported elsewhere, but it may be worth keeping the LargeList handling around in case that scenario is supported? And other DataFusion expressions might return a LargeList even if it doesn't come directly from Spark? Does the native Parquet reader ever use a LargeList?

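A minimal sketch of the distinction being discussed (illustrative only, not Comet code): Arrow's List uses i32 offsets while LargeList uses i64 offsets, so the two variants differ in how many child values a single batch can address.

    use arrow::array::Array;
    use arrow::datatypes::DataType;

    fn describe_list_kind(array: &dyn Array) -> &'static str {
        match array.data_type() {
            // i32 offsets: at most i32::MAX child values per array
            DataType::List(_) => "List (i32 offsets)",
            // i64 offsets: needed if a batch holds more child values than i32 can address
            DataType::LargeList(_) => "LargeList (i64 offsets)",
            _ => "not a list",
        }
    }
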
Member Author

Thanks for the explanation!
You are right, I will close #1118 then

Member

@andygrove andygrove left a comment

LGTM. Thanks @SemyonSinchenko!

@andygrove andygrove merged commit 9990b34 into apache:main Nov 22, 2024
74 checks passed
@SemyonSinchenko SemyonSinchenko deleted the array-insert branch November 23, 2024 09:02
andygrove added a commit that referenced this pull request Dec 20, 2024
* feat: support array_append (#1072)

* feat: support array_append

* formatted code

* rewrite array_append plan to match spark behaviour and fixed bug in QueryPlan serde

* remove unwrap

* Fix for Spark 3.3

* refactor array_append binary expression serde code

* Disabled array_append test for spark 4.0+

* chore: Simplify CometShuffleMemoryAllocator to use Spark unified memory allocator (#1063)

* docs: Update benchmarking.md (#1085)

* feat: Require offHeap memory to be enabled (always use unified memory) (#1062)

* Require offHeap memory

* remove unused import

* use off heap memory in stability tests

* reorder imports

* test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE config (#1087)

* Add changelog for 0.4.0 (#1089)

* chore: Prepare for 0.5.0 development (#1090)

* Update version number for build

* update docs

* build: Skip installation of spark-integration  and fuzz testing modules (#1091)

* Add hint for finding the GPG key to use when publishing to maven (#1093)

* docs: Update documentation for 0.4.0 release (#1096)

* update TPC-H results

* update Maven links

* update benchmarking guide and add TPC-DS results

* include q72

* fix: Unsigned type related bugs (#1095)

## Which issue does this PR close?

Closes #1067

## Rationale for this change

Bug fix. A few expressions were failing some unsigned type related tests

## What changes are included in this PR?

 - For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits
 - `u64` becomes `Decimal(20, 0)` but there was a bug in `round()`  (`>` vs `>=`)

## How are these changes tested?

Put back tests for unsigned types

* chore: Include first ScanExec batch in metrics (#1105)

* include first batch in ScanExec metrics

* record row count metric

* fix regression

* chore: Improve CometScan metrics (#1100)

* Add native metrics for plan creation

* make messages consistent

* Include get_next_batch cost in metrics

* formatting

* fix double count of rows

* chore: Add custom metric for native shuffle fetching batches from JVM (#1108)

* feat: support array_insert (#1073)

* Part of the implementation of array_insert

* Missing methods

* Working version

* Reformat code

* Fix code-style

* Add comments about spark's implementation.

* Implement negative indices

+ fix tests for spark < 3.4

* Fix code-style

* Fix scalastyle

* Fix tests for spark < 3.4

* Fixes & tests

- added test for the negative index
- added test for the legacy spark mode

* Use assume(isSpark34Plus) in tests

* Test else-branch & improve coverage

* Update native/spark-expr/src/list.rs

Co-authored-by: Andy Grove <[email protected]>

* Fix fallback test

In one case there is a zero in index and test fails due to spark error

* Adjust the behaviour for the NULL case to Spark

* Move the logic of type checking to the method

* Fix code-style

---------

Co-authored-by: Andy Grove <[email protected]>

* feat: enable decimal to decimal cast of different precision and scale (#1086)

* enable decimal to decimal cast of different precision and scale

* add more test cases for negative scale and higher precision

* add check for compatibility for decimal to decimal

* fix code style

* Update spark/src/main/scala/org/apache/comet/expressions/CometCast.scala

Co-authored-by: Andy Grove <[email protected]>

* fix the nit in comment

---------

Co-authored-by: himadripal <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* docs: fix readme FGPA/FPGA typo (#1117)

* fix: Use RDD partition index (#1112)

* fix: Use RDD partition index

* fix

* fix

* fix

* fix: Various metrics bug fixes and improvements (#1111)

* fix: Don't create CometScanExec for subclasses of ParquetFileFormat (#1129)

* Use exact class comparison for parquet scan

* Add test

* Add comment

* fix: Fix metrics regressions (#1132)

* fix metrics issues

* clippy

* update tests

* docs: Add more technical detail and new diagram to Comet plugin overview (#1119)

* Add more technical detail and new diagram to Comet plugin overview

* update diagram

* add info on Arrow IPC

* update diagram

* update diagram

* update docs

* address feedback

* Stop passing Java config map into native createPlan (#1101)

* feat: Improve ScanExec native metrics (#1133)

* save

* remove shuffle jvm metric and update tuning guide

* docs

* add source for all ScanExecs

* address feedback

* address feedback

* chore: Remove unused StringView struct (#1143)

* Remove unused StringView struct

* remove more dead code

* docs: Add some documentation explaining how shuffle works (#1148)

* add some notes on shuffle

* reads

* improve docs

* test: enable more Spark 4.0 tests (#1145)

## Which issue does this PR close?

Part of #372 and #551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR enables more Spark 4.0 tests that were fixed by recent changes

## How are these changes tested?

tests enabled

* chore: Refactor cast to use SparkCastOptions param (#1146)

* Refactor cast to use SparkCastOptions param

* update tests

* update benches

* update benches

* update benches

* Enable more scenarios in CometExecBenchmark. (#1151)

* chore: Move more expressions from core crate to spark-expr crate (#1152)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports

* remove dead code (#1155)

* fix: Spark 4.0-preview1 SPARK-47120 (#1156)

## Which issue does this PR close?

Part of #372 and #551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR fixes the new test SPARK-47120 added in Spark 4.0

## How are these changes tested?

tests enabled

* chore: Move string kernels and expressions to spark-expr crate (#1164)

* Move string kernels and expressions to spark-expr crate

* remove unused hash kernel

* remove unused dependencies

* chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165)

* move CheckOverflow to spark-expr crate

* move NegativeExpr to spark-expr crate

* move UnboundColumn to spark-expr crate

* move ExpandExec from execution::datafusion::operators to execution::operators

* refactoring to remove datafusion subpackage

* update imports in benches

* fix

* fix

* chore: Add ignored tests for reading complex types from Parquet (#1167)

* Add ignored tests for reading structs from Parquet

* add basic map test

* add tests for Map and Array

* feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169)

* Add Spark-compatible SchemaAdapterFactory implementation

* remove prototype code

* fix

* refactor

* implement more cast logic

* implement more cast logic

* add basic test

* improve test

* cleanup

* fmt

* add support for casting unsigned int to signed int

* clippy

* address feedback

* fix test

* fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)

* test: enabling Spark tests with offHeap requirement (#1177)

## Which issue does this PR close?

## Rationale for this change

After #1062 We have not running Spark tests for native execution

## What changes are included in this PR?

Removed the off heap requirement for testing

## How are these changes tested?

Bringing back Spark tests for native execution

* feat: Improve shuffle metrics (second attempt) (#1175)

* improve shuffle metrics

* docs

* more metrics

* refactor

* address feedback

* Fix redundancy in Cargo.lock.

* Format, more post-merge cleanup.

* Compiles

* Compiles

* Remove empty file.

* Attempt to fix JNI issue and native test build issues.

* Test Fix

* Update planner.rs

Remove println from test.

---------

Co-authored-by: NoeB <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Raz Luvaton <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Parth Chandra <[email protected]>
Co-authored-by: KAZUYUKI TANIMURA <[email protected]>
Co-authored-by: Sem <[email protected]>
Co-authored-by: Himadri Pal <[email protected]>
Co-authored-by: himadripal <[email protected]>
Co-authored-by: gstvg <[email protected]>
Co-authored-by: Adam Binford <[email protected]>
andygrove added a commit that referenced this pull request Jan 16, 2025
* feat: support array_append (#1072)

* feat: support array_append

* formatted code

* rewrite array_append plan to match spark behaviour and fixed bug in QueryPlan serde

* remove unwrap

* Fix for Spark 3.3

* refactor array_append binary expression serde code

* Disabled array_append test for spark 4.0+

* chore: Simplify CometShuffleMemoryAllocator to use Spark unified memory allocator (#1063)

* docs: Update benchmarking.md (#1085)

* feat: Require offHeap memory to be enabled (always use unified memory) (#1062)

* Require offHeap memory

* remove unused import

* use off heap memory in stability tests

* reorder imports

* test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE config (#1087)

* Add changelog for 0.4.0 (#1089)

* chore: Prepare for 0.5.0 development (#1090)

* Update version number for build

* update docs

* build: Skip installation of spark-integration  and fuzz testing modules (#1091)

* Add hint for finding the GPG key to use when publishing to maven (#1093)

* docs: Update documentation for 0.4.0 release (#1096)

* update TPC-H results

* update Maven links

* update benchmarking guide and add TPC-DS results

* include q72

* fix: Unsigned type related bugs (#1095)

## Which issue does this PR close?

Closes #1067

## Rationale for this change

Bug fix. A few expressions were failing some unsigned type related tests

## What changes are included in this PR?

 - For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits
 - `u64` becomes `Decimal(20, 0)` but there was a bug in `round()`  (`>` vs `>=`)

## How are these changes tested?

Put back tests for unsigned types

* chore: Include first ScanExec batch in metrics (#1105)

* include first batch in ScanExec metrics

* record row count metric

* fix regression

* chore: Improve CometScan metrics (#1100)

* Add native metrics for plan creation

* make messages consistent

* Include get_next_batch cost in metrics

* formatting

* fix double count of rows

* chore: Add custom metric for native shuffle fetching batches from JVM (#1108)

* feat: support array_insert (#1073)

* Part of the implementation of array_insert

* Missing methods

* Working version

* Reformat code

* Fix code-style

* Add comments about spark's implementation.

* Implement negative indices

+ fix tests for spark < 3.4

* Fix code-style

* Fix scalastyle

* Fix tests for spark < 3.4

* Fixes & tests

- added test for the negative index
- added test for the legacy spark mode

* Use assume(isSpark34Plus) in tests

* Test else-branch & improve coverage

* Update native/spark-expr/src/list.rs

Co-authored-by: Andy Grove <[email protected]>

* Fix fallback test

In one case there is a zero in index and test fails due to spark error

* Adjust the behaviour for the NULL case to Spark

* Move the logic of type checking to the method

* Fix code-style

---------

Co-authored-by: Andy Grove <[email protected]>

* feat: enable decimal to decimal cast of different precision and scale (#1086)

* enable decimal to decimal cast of different precision and scale

* add more test cases for negative scale and higher precision

* add check for compatibility for decimal to decimal

* fix code style

* Update spark/src/main/scala/org/apache/comet/expressions/CometCast.scala

Co-authored-by: Andy Grove <[email protected]>

* fix the nit in comment

---------

Co-authored-by: himadripal <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* docs: fix readme FGPA/FPGA typo (#1117)

* fix: Use RDD partition index (#1112)

* fix: Use RDD partition index

* fix

* fix

* fix

* fix: Various metrics bug fixes and improvements (#1111)

* fix: Don't create CometScanExec for subclasses of ParquetFileFormat (#1129)

* Use exact class comparison for parquet scan

* Add test

* Add comment

* fix: Fix metrics regressions (#1132)

* fix metrics issues

* clippy

* update tests

* docs: Add more technical detail and new diagram to Comet plugin overview (#1119)

* Add more technical detail and new diagram to Comet plugin overview

* update diagram

* add info on Arrow IPC

* update diagram

* update diagram

* update docs

* address feedback

* Stop passing Java config map into native createPlan (#1101)

* feat: Improve ScanExec native metrics (#1133)

* save

* remove shuffle jvm metric and update tuning guide

* docs

* add source for all ScanExecs

* address feedback

* address feedback

* chore: Remove unused StringView struct (#1143)

* Remove unused StringView struct

* remove more dead code

* docs: Add some documentation explaining how shuffle works (#1148)

* add some notes on shuffle

* reads

* improve docs

* test: enable more Spark 4.0 tests (#1145)

## Which issue does this PR close?

Part of #372 and #551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR enables more Spark 4.0 tests that were fixed by recent changes

## How are these changes tested?

tests enabled

* chore: Refactor cast to use SparkCastOptions param (#1146)

* Refactor cast to use SparkCastOptions param

* update tests

* update benches

* update benches

* update benches

* Enable more scenarios in CometExecBenchmark. (#1151)

* chore: Move more expressions from core crate to spark-expr crate (#1152)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports

* remove dead code (#1155)

* fix: Spark 4.0-preview1 SPARK-47120 (#1156)

## Which issue does this PR close?

Part of #372 and #551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR fixes the new test SPARK-47120 added in Spark 4.0

## How are these changes tested?

tests enabled

* chore: Move string kernels and expressions to spark-expr crate (#1164)

* Move string kernels and expressions to spark-expr crate

* remove unused hash kernel

* remove unused dependencies

* chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165)

* move CheckOverflow to spark-expr crate

* move NegativeExpr to spark-expr crate

* move UnboundColumn to spark-expr crate

* move ExpandExec from execution::datafusion::operators to execution::operators

* refactoring to remove datafusion subpackage

* update imports in benches

* fix

* fix

* chore: Add ignored tests for reading complex types from Parquet (#1167)

* Add ignored tests for reading structs from Parquet

* add basic map test

* add tests for Map and Array

* feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169)

* Add Spark-compatible SchemaAdapterFactory implementation

* remove prototype code

* fix

* refactor

* implement more cast logic

* implement more cast logic

* add basic test

* improve test

* cleanup

* fmt

* add support for casting unsigned int to signed int

* clippy

* address feedback

* fix test

* fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)

* test: enabling Spark tests with offHeap requirement (#1177)

## Which issue does this PR close?

## Rationale for this change

After #1062 We have not running Spark tests for native execution

## What changes are included in this PR?

Removed the off heap requirement for testing

## How are these changes tested?

Bringing back Spark tests for native execution

* feat: Improve shuffle metrics (second attempt) (#1175)

* improve shuffle metrics

* docs

* more metrics

* refactor

* address feedback

* fix: stddev_pop should not directly return 0.0 when count is 1.0 (#1184)

* add test

* fix

* fix

* fix

* feat: Make native shuffle compression configurable and respect `spark.shuffle.compress` (#1185)

* Make shuffle compression codec and level configurable

* remove lz4 references

* docs

* update comment

* clippy

* fix benches

* clippy

* clippy

* disable test for miri

* remove lz4 reference from proto

* minor: move shuffle classes from common to spark (#1193)

* minor: refactor decodeBatches to make private in broadcast exchange (#1195)

* minor: refactor prepare_output so that it does not require an ExecutionContext (#1194)

* fix: fix missing explanation for then branch in case when (#1200)

* minor: remove unused source files (#1202)

* chore: Upgrade to DataFusion 44.0.0-rc2 (#1154)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports

* save

* save

* save

* remove unused imports

* clippy

* implement more hashers

* implement Hash and PartialEq

* implement Hash and PartialEq

* implement Hash and PartialEq

* benches

* fix ScalarUDFImpl.return_type failure

* exclude test from miri

* ignore correct test

* ignore another test

* remove miri checks

* use return_type_from_exprs

* Revert "use return_type_from_exprs"

This reverts commit febc1f1.

* use DF main branch

* hacky workaround for regression in ScalarUDFImpl.return_type

* fix repo url

* pin to revision

* bump to latest rev

* bump to latest DF rev

* bump DF to rev 9f530dd

* add Cargo.lock

* bump DF version

* no default features

* Revert "remove miri checks"

This reverts commit 4638fe3.

* Update pin to DataFusion e99e02b9b9093ceb0c13a2dd32a2a89beba47930

* update pin

* Update Cargo.toml

Bump to 44.0.0-rc2

* update cargo lock

* revert miri change

---------

Co-authored-by: Andrew Lamb <[email protected]>

* feat: add support for array_contains expression (#1163)

* feat: add support for array_contains expression

* test: add unit test for array_contains function

* Removes unnecessary case expression for handling null values

* chore: Move more expressions from core crate to spark-expr crate (#1152)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports

* remove dead code (#1155)

* fix: Spark 4.0-preview1 SPARK-47120 (#1156)

## Which issue does this PR close?

Part of #372 and #551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR fixes the new test SPARK-47120 added in Spark 4.0

## How are these changes tested?

tests enabled

* chore: Move string kernels and expressions to spark-expr crate (#1164)

* Move string kernels and expressions to spark-expr crate

* remove unused hash kernel

* remove unused dependencies

* chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165)

* move CheckOverflow to spark-expr crate

* move NegativeExpr to spark-expr crate

* move UnboundColumn to spark-expr crate

* move ExpandExec from execution::datafusion::operators to execution::operators

* refactoring to remove datafusion subpackage

* update imports in benches

* fix

* fix

* chore: Add ignored tests for reading complex types from Parquet (#1167)

* Add ignored tests for reading structs from Parquet

* add basic map test

* add tests for Map and Array

* feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169)

* Add Spark-compatible SchemaAdapterFactory implementation

* remove prototype code

* fix

* refactor

* implement more cast logic

* implement more cast logic

* add basic test

* improve test

* cleanup

* fmt

* add support for casting unsigned int to signed int

* clippy

* address feedback

* fix test

* fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)

* test: enabling Spark tests with offHeap requirement (#1177)

## Which issue does this PR close?

## Rationale for this change

After #1062 We have not running Spark tests for native execution

## What changes are included in this PR?

Removed the off heap requirement for testing

## How are these changes tested?

Bringing back Spark tests for native execution

* feat: Improve shuffle metrics (second attempt) (#1175)

* improve shuffle metrics

* docs

* more metrics

* refactor

* address feedback

* fix: stddev_pop should not directly return 0.0 when count is 1.0 (#1184)

* add test

* fix

* fix

* fix

* feat: Make native shuffle compression configurable and respect `spark.shuffle.compress` (#1185)

* Make shuffle compression codec and level configurable

* remove lz4 references

* docs

* update comment

* clippy

* fix benches

* clippy

* clippy

* disable test for miri

* remove lz4 reference from proto

* minor: move shuffle classes from common to spark (#1193)

* minor: refactor decodeBatches to make private in broadcast exchange (#1195)

* minor: refactor prepare_output so that it does not require an ExecutionContext (#1194)

* fix: fix missing explanation for then branch in case when (#1200)

* minor: remove unused source files (#1202)

* chore: Upgrade to DataFusion 44.0.0-rc2 (#1154)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports

* save

* save

* save

* remove unused imports

* clippy

* implement more hashers

* implement Hash and PartialEq

* implement Hash and PartialEq

* implement Hash and PartialEq

* benches

* fix ScalarUDFImpl.return_type failure

* exclude test from miri

* ignore correct test

* ignore another test

* remove miri checks

* use return_type_from_exprs

* Revert "use return_type_from_exprs"

This reverts commit febc1f1.

* use DF main branch

* hacky workaround for regression in ScalarUDFImpl.return_type

* fix repo url

* pin to revision

* bump to latest rev

* bump to latest DF rev

* bump DF to rev 9f530dd

* add Cargo.lock

* bump DF version

* no default features

* Revert "remove miri checks"

This reverts commit 4638fe3.

* Update pin to DataFusion e99e02b9b9093ceb0c13a2dd32a2a89beba47930

* update pin

* Update Cargo.toml

Bump to 44.0.0-rc2

* update cargo lock

* revert miri change

---------

Co-authored-by: Andrew Lamb <[email protected]>

* update UT

Signed-off-by: Dharan Aditya <[email protected]>

* fix typo in UT

Signed-off-by: Dharan Aditya <[email protected]>

---------

Signed-off-by: Dharan Aditya <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: KAZUYUKI TANIMURA <[email protected]>
Co-authored-by: Parth Chandra <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Raz Luvaton <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* feat: Add a `spark.comet.exec.memoryPool` configuration for experimenting with various datafusion memory pool setups. (#1021)

* feat: Reenable tests for filtered SMJ anti join (#1211)

* feat: reenable filtered SMJ Anti join tests

* feat: reenable filtered SMJ Anti join tests

* feat: reenable filtered SMJ Anti join tests

* feat: reenable filtered SMJ Anti join tests

* Add CoalesceBatchesExec around SMJ with join filter

* adding `CoalesceBatches`

* adding `CoalesceBatches`

* adding `CoalesceBatches`

* feat: reenable filtered SMJ Anti join tests

* feat: reenable filtered SMJ Anti join tests

---------

Co-authored-by: Andy Grove <[email protected]>

* chore: Add safety check to CometBuffer (#1050)

* chore: Add safety check to CometBuffer

* Add CometColumnarToRowExec

* fix

* fix

* more

* Update plan stability results

* fix

* fix

* fix

* Revert "fix"

This reverts commit 9bad173.

* Revert "Revert "fix""

This reverts commit d527ad1.

* fix BucketedReadWithoutHiveSupportSuite

* fix SparkPlanSuite

* remove unreachable code (#1213)

* test: Enable Comet by default except some tests in SparkSessionExtensionSuite (#1201)

## Which issue does this PR close?

Part of #1197

## Rationale for this change

Since `loadCometExtension` in the diffs were not using `isCometEnabled`, `SparkSessionExtensionSuite` was not using Comet. Once enabled, some test failures discovered

## What changes are included in this PR?

`loadCometExtension` now uses `isCometEnabled` that enables Comet by default
Temporary ignore the failing tests in SparkSessionExtensionSuite

## How are these changes tested?

existing tests

* extract struct expressions to folders based on spark grouping (#1216)

* chore: extract static invoke expressions to folders based on spark grouping (#1217)

* extract static invoke expressions to folders based on spark grouping

* Update native/spark-expr/src/static_invoke/mod.rs

Co-authored-by: Andy Grove <[email protected]>

---------

Co-authored-by: Andy Grove <[email protected]>

* chore: Follow-on PR to fully enable onheap memory usage (#1210)

* Make datafusion's native memory pool configurable

* save

* fix

* Update memory calculation and add draft documentation

* ready for review

* ready for review

* address feedback

* Update docs/source/user-guide/tuning.md

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* Update docs/source/user-guide/tuning.md

Co-authored-by: Kristin Cowalcijk <[email protected]>

* Update docs/source/user-guide/tuning.md

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* Update docs/source/user-guide/tuning.md

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* remove unused config

---------

Co-authored-by: Kristin Cowalcijk <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>

* feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support (#1192)

* Implement native decoding and decompression

* revert some variable renaming for smaller diff

* fix oom issues?

* make NativeBatchDecoderIterator more consistent with ArrowReaderIterator

* fix oom and prep for review

* format

* Add LZ4 support

* clippy, new benchmark

* rename metrics, clean up lz4 code

* update test

* Add support for snappy

* format

* change default back to lz4

* make metrics more accurate

* format

* clippy

* use faster unsafe version of lz4_flex

* Make compression codec configurable for columnar shuffle

* clippy

* fix bench

* fmt

* address feedback

* address feedback

* address feedback

* minor code simplification

* cargo fmt

* overflow check

* rename compression level config

* address feedback

* address feedback

* rename constant

* chore: extract agg_funcs expressions to folders based on spark grouping (#1224)

* extract agg_funcs expressions to folders based on spark grouping

* fix rebase

* extract datetime_funcs expressions to folders based on spark grouping (#1222)

Co-authored-by: Andy Grove <[email protected]>

* chore: use datafusion from crates.io (#1232)

* chore: extract strings file to `strings_func` like in spark grouping (#1215)

* chore: extract predicate_functions expressions to folders based on spark grouping (#1218)

* extract predicate_functions expressions to folders based on spark grouping

* code review changes

---------

Co-authored-by: Andy Grove <[email protected]>

* build(deps): bump protobuf version to 3.21.12 (#1234)

* extract json_funcs expressions to folders based on spark grouping (#1220)

Co-authored-by: Andy Grove <[email protected]>

* test: Enable shuffle by default in Spark tests (#1240)

## Which issue does this PR close?

## Rationale for this change

Because `isCometShuffleEnabled` is false by default, some tests were not reached

## What changes are included in this PR?

Removed `isCometShuffleEnabled` and updated spark test diff

## How are these changes tested?

existing test

* chore: extract hash_funcs expressions to folders based on spark grouping (#1221)

* extract hash_funcs expressions to folders based on spark grouping

* extract hash_funcs expressions to folders based on spark grouping

---------

Co-authored-by: Andy Grove <[email protected]>

* fix: Fall back to Spark for unsupported partition or sort expressions in window aggregates (#1253)

* perf: Improve query planning to more reliably fall back to columnar shuffle when native shuffle is not supported (#1209)

* fix regression (#1259)

* feat: add support for array_remove expression (#1179)

* wip: array remove

* added comet expression test

* updated test cases

* fixed array_remove function for null values

* removed commented code

* remove unnecessary code

* updated the test for 'array_remove'

* added test for array_remove in case the input array is null

* wip: case array is empty

* removed test case for empty array

* fix: Fall back to Spark for distinct aggregates (#1262)

* fall back to Spark for distinct aggregates

* update expected plans for 3.4

* update expected plans for 3.5

* force build

* add comment

* feat: Implement custom RecordBatch serde for shuffle for improved performance (#1190)

* Implement faster encoder for shuffle blocks

* make code more concise

* enable fast encoding for columnar shuffle

* update benches

* test all int types

* test float

* remaining types

* add Snappy and Zstd(6) back to benchmark

* fix regression

* Update native/core/src/execution/shuffle/codec.rs

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* address feedback

* support nullable flag

---------

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* docs: Update TPC-H benchmark results (#1257)

* fix: disable initCap by default (#1276)

* fix: disable initCap by default

* Update spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

Co-authored-by: Andy Grove <[email protected]>

* address review comments

---------

Co-authored-by: Andy Grove <[email protected]>

* chore: Add changelog for 0.5.0 (#1278)

* Add changelog

* revert accidental change

* move 2 items to performance section

* update TPC-DS results for 0.5.0 (#1277)

* fix: cast timestamp to decimal is unsupported (#1281)

* fix: cast timestamp to decimal is unsupported

* fix style

* revert test name and mark as ignore

* add comment

* Fix build after merge

* Fix tests after merge

* Fix plans after merge

* fix partition id in execute plan after merge (from Andy Grove)

---------

Signed-off-by: Dharan Aditya <[email protected]>
Co-authored-by: NoeB <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Raz Luvaton <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: KAZUYUKI TANIMURA <[email protected]>
Co-authored-by: Sem <[email protected]>
Co-authored-by: Himadri Pal <[email protected]>
Co-authored-by: himadripal <[email protected]>
Co-authored-by: gstvg <[email protected]>
Co-authored-by: Adam Binford <[email protected]>
Co-authored-by: Matt Butrovich <[email protected]>
Co-authored-by: Raz Luvaton <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Dharan Aditya <[email protected]>
Co-authored-by: Kristin Cowalcijk <[email protected]>
Co-authored-by: Oleks V <[email protected]>
Co-authored-by: Zhen Wang <[email protected]>
Co-authored-by: Jagdish Parihar <[email protected]>
andygrove added a commit that referenced this pull request Jan 18, 2025
* feat: support array_append (#1072)

* feat: support array_append

* formatted code

* rewrite array_append plan to match spark behaviour and fixed bug in QueryPlan serde

* remove unwrap

* Fix for Spark 3.3

* refactor array_append binary expression serde code

* Disabled array_append test for spark 4.0+

* chore: Simplify CometShuffleMemoryAllocator to use Spark unified memory allocator (#1063)

* docs: Update benchmarking.md (#1085)

* feat: Require offHeap memory to be enabled (always use unified memory) (#1062)

* Require offHeap memory

* remove unused import

* use off heap memory in stability tests

* reorder imports

* test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE config (#1087)

* Add changelog for 0.4.0 (#1089)

* chore: Prepare for 0.5.0 development (#1090)

* Update version number for build

* update docs

* build: Skip installation of spark-integration  and fuzz testing modules (#1091)

* Add hint for finding the GPG key to use when publishing to maven (#1093)

* docs: Update documentation for 0.4.0 release (#1096)

* update TPC-H results

* update Maven links

* update benchmarking guide and add TPC-DS results

* include q72

* fix: Unsigned type related bugs (#1095)

## Which issue does this PR close?

Closes #1067

## Rationale for this change

Bug fix. A few expressions were failing some unsigned type related tests

## What changes are included in this PR?

 - For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits
 - `u64` becomes `Decimal(20, 0)` but there was a bug in `round()`  (`>` vs `>=`)

## How are these changes tested?

Put back tests for unsigned types

* chore: Include first ScanExec batch in metrics (#1105)

* include first batch in ScanExec metrics

* record row count metric

* fix regression

* chore: Improve CometScan metrics (#1100)

* Add native metrics for plan creation

* make messages consistent

* Include get_next_batch cost in metrics

* formatting

* fix double count of rows

* chore: Add custom metric for native shuffle fetching batches from JVM (#1108)

* feat: support array_insert (#1073)

* Part of the implementation of array_insert

* Missing methods

* Working version

* Reformat code

* Fix code-style

* Add comments about spark's implementation.

* Implement negative indices

+ fix tests for spark < 3.4

* Fix code-style

* Fix scalastyle

* Fix tests for spark < 3.4

* Fixes & tests

- added test for the negative index
- added test for the legacy spark mode

* Use assume(isSpark34Plus) in tests

* Test else-branch & improve coverage

* Update native/spark-expr/src/list.rs

Co-authored-by: Andy Grove <[email protected]>

* Fix fallback test

In one case there is a zero in index and test fails due to spark error

* Adjust the behaviour for the NULL case to Spark

* Move the logic of type checking to the method

* Fix code-style

---------

Co-authored-by: Andy Grove <[email protected]>

* feat: enable decimal to decimal cast of different precision and scale (#1086)

* enable decimal to decimal cast of different precision and scale

* add more test cases for negative scale and higher precision

* add check for compatibility for decimal to decimal

* fix code style

* Update spark/src/main/scala/org/apache/comet/expressions/CometCast.scala

Co-authored-by: Andy Grove <[email protected]>

* fix the nit in comment

---------

Co-authored-by: himadripal <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* docs: fix readme FGPA/FPGA typo (#1117)

* fix: Use RDD partition index (#1112)

* fix: Use RDD partition index

* fix

* fix

* fix

* fix: Various metrics bug fixes and improvements (#1111)

* fix: Don't create CometScanExec for subclasses of ParquetFileFormat (#1129)

* Use exact class comparison for parquet scan

* Add test

* Add comment

* fix: Fix metrics regressions (#1132)

* fix metrics issues

* clippy

* update tests

* docs: Add more technical detail and new diagram to Comet plugin overview (#1119)

* Add more technical detail and new diagram to Comet plugin overview

* update diagram

* add info on Arrow IPC

* update diagram

* update diagram

* update docs

* address feedback

* Stop passing Java config map into native createPlan (#1101)

* feat: Improve ScanExec native metrics (#1133)

* save

* remove shuffle jvm metric and update tuning guide

* docs

* add source for all ScanExecs

* address feedback

* address feedback

* chore: Remove unused StringView struct (#1143)

* Remove unused StringView struct

* remove more dead code

* docs: Add some documentation explaining how shuffle works (#1148)

* add some notes on shuffle

* reads

* improve docs

* test: enable more Spark 4.0 tests (#1145)

## Which issue does this PR close?

Part of #372 and #551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR enables more Spark 4.0 tests that were fixed by recent changes

## How are these changes tested?

tests enabled

* chore: Refactor cast to use SparkCastOptions param (#1146)

* Refactor cast to use SparkCastOptions param

* update tests

* update benches

* update benches

* update benches

* Enable more scenarios in CometExecBenchmark. (#1151)

* chore: Move more expressions from core crate to spark-expr crate (#1152)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports

* remove dead code (#1155)

* fix: Spark 4.0-preview1 SPARK-47120 (#1156)

## Which issue does this PR close?

Part of #372 and #551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR fixes the new test SPARK-47120 added in Spark 4.0

## How are these changes tested?

tests enabled

* chore: Move string kernels and expressions to spark-expr crate (#1164)

* Move string kernels and expressions to spark-expr crate

* remove unused hash kernel

* remove unused dependencies

* chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165)

* move CheckOverflow to spark-expr crate

* move NegativeExpr to spark-expr crate

* move UnboundColumn to spark-expr crate

* move ExpandExec from execution::datafusion::operators to execution::operators

* refactoring to remove datafusion subpackage

* update imports in benches

* fix

* fix

* chore: Add ignored tests for reading complex types from Parquet (#1167)

* Add ignored tests for reading structs from Parquet

* add basic map test

* add tests for Map and Array

* feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169)

* Add Spark-compatible SchemaAdapterFactory implementation

* remove prototype code

* fix

* refactor

* implement more cast logic

* implement more cast logic

* add basic test

* improve test

* cleanup

* fmt

* add support for casting unsigned int to signed int

* clippy

* address feedback

* fix test

* fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)

* test: enabling Spark tests with offHeap requirement (#1177)

## Which issue does this PR close?

## Rationale for this change

After #1062 We have not running Spark tests for native execution

## What changes are included in this PR?

Removed the off heap requirement for testing

## How are these changes tested?

Bringing back Spark tests for native execution

* feat: Improve shuffle metrics (second attempt) (#1175)

* improve shuffle metrics

* docs

* more metrics

* refactor

* address feedback

* fix: stddev_pop should not directly return 0.0 when count is 1.0 (#1184)

* add test

* fix

* fix

* fix

* feat: Make native shuffle compression configurable and respect `spark.shuffle.compress` (#1185)

* Make shuffle compression codec and level configurable

* remove lz4 references

* docs

* update comment

* clippy

* fix benches

* clippy

* clippy

* disable test for miri

* remove lz4 reference from proto

* minor: move shuffle classes from common to spark (#1193)

* minor: refactor decodeBatches to make private in broadcast exchange (#1195)

* minor: refactor prepare_output so that it does not require an ExecutionContext (#1194)

* fix: fix missing explanation for then branch in case when (#1200)

* minor: remove unused source files (#1202)

* chore: Upgrade to DataFusion 44.0.0-rc2 (#1154)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports

* save

* save

* save

* remove unused imports

* clippy

* implement more hashers

* implement Hash and PartialEq

* implement Hash and PartialEq

* implement Hash and PartialEq

* benches

* fix ScalarUDFImpl.return_type failure

* exclude test from miri

* ignore correct test

* ignore another test

* remove miri checks

* use return_type_from_exprs

* Revert "use return_type_from_exprs"

This reverts commit febc1f1.

* use DF main branch

* hacky workaround for regression in ScalarUDFImpl.return_type

* fix repo url

* pin to revision

* bump to latest rev

* bump to latest DF rev

* bump DF to rev 9f530dd

* add Cargo.lock

* bump DF version

* no default features

* Revert "remove miri checks"

This reverts commit 4638fe3.

* Update pin to DataFusion e99e02b9b9093ceb0c13a2dd32a2a89beba47930

* update pin

* Update Cargo.toml

Bump to 44.0.0-rc2

* update cargo lock

* revert miri change

---------

Co-authored-by: Andrew Lamb <[email protected]>

* feat: add support for array_contains expression (#1163)

* feat: add support for array_contains expression

* test: add unit test for array_contains function

* Removes unnecessary case expression for handling null values

* chore: Move more expressions from core crate to spark-expr crate (#1152)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports

* remove dead code (#1155)

* fix: Spark 4.0-preview1 SPARK-47120 (#1156)

## Which issue does this PR close?

Part of #372 and #551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR fixes the new test SPARK-47120 added in Spark 4.0

## How are these changes tested?

tests enabled

* chore: Move string kernels and expressions to spark-expr crate (#1164)

* Move string kernels and expressions to spark-expr crate

* remove unused hash kernel

* remove unused dependencies

* chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165)

* move CheckOverflow to spark-expr crate

* move NegativeExpr to spark-expr crate

* move UnboundColumn to spark-expr crate

* move ExpandExec from execution::datafusion::operators to execution::operators

* refactoring to remove datafusion subpackage

* update imports in benches

* fix

* fix

* chore: Add ignored tests for reading complex types from Parquet (#1167)

* Add ignored tests for reading structs from Parquet

* add basic map test

* add tests for Map and Array

* feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169)

* Add Spark-compatible SchemaAdapterFactory implementation

* remove prototype code

* fix

* refactor

* implement more cast logic

* implement more cast logic

* add basic test

* improve test

* cleanup

* fmt

* add support for casting unsigned int to signed int

* clippy

* address feedback

* fix test

* fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)

* test: enabling Spark tests with offHeap requirement (#1177)

## Which issue does this PR close?

## Rationale for this change

After #1062 We have not running Spark tests for native execution

## What changes are included in this PR?

Removed the off heap requirement for testing

## How are these changes tested?

Bringing back Spark tests for native execution

* feat: Improve shuffle metrics (second attempt) (#1175)

* improve shuffle metrics

* docs

* more metrics

* refactor

* address feedback

* fix: stddev_pop should not directly return 0.0 when count is 1.0 (#1184)

* add test

* fix

* fix

* fix

* feat: Make native shuffle compression configurable and respect `spark.shuffle.compress` (#1185)

* Make shuffle compression codec and level configurable

* remove lz4 references

* docs

* update comment

* clippy

* fix benches

* clippy

* clippy

* disable test for miri

* remove lz4 reference from proto

* minor: move shuffle classes from common to spark (#1193)

* minor: refactor decodeBatches to make private in broadcast exchange (#1195)

* minor: refactor prepare_output so that it does not require an ExecutionContext (#1194)

* fix: fix missing explanation for then branch in case when (#1200)

* minor: remove unused source files (#1202)

* chore: Upgrade to DataFusion 44.0.0-rc2 (#1154)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports

* save

* save

* save

* remove unused imports

* clippy

* implement more hashers

* implement Hash and PartialEq

* implement Hash and PartialEq

* implement Hash and PartialEq

* benches

* fix ScalarUDFImpl.return_type failure

* exclude test from miri

* ignore correct test

* ignore another test

* remove miri checks

* use return_type_from_exprs

* Revert "use return_type_from_exprs"

This reverts commit febc1f1.

* use DF main branch

* hacky workaround for regression in ScalarUDFImpl.return_type

* fix repo url

* pin to revision

* bump to latest rev

* bump to latest DF rev

* bump DF to rev 9f530dd

* add Cargo.lock

* bump DF version

* no default features

* Revert "remove miri checks"

This reverts commit 4638fe3.

* Update pin to DataFusion e99e02b9b9093ceb0c13a2dd32a2a89beba47930

* update pin

* Update Cargo.toml

Bump to 44.0.0-rc2

* update cargo lock

* revert miri change

---------

Co-authored-by: Andrew Lamb <[email protected]>

* update UT

Signed-off-by: Dharan Aditya <[email protected]>

* fix typo in UT

Signed-off-by: Dharan Aditya <[email protected]>

---------

Signed-off-by: Dharan Aditya <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: KAZUYUKI TANIMURA <[email protected]>
Co-authored-by: Parth Chandra <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Raz Luvaton <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* feat: Add a `spark.comet.exec.memoryPool` configuration for experimenting with various datafusion memory pool setups. (#1021)

* feat: Reenable tests for filtered SMJ anti join (#1211)

* feat: reenable filtered SMJ Anti join tests

* feat: reenable filtered SMJ Anti join tests

* feat: reenable filtered SMJ Anti join tests

* feat: reenable filtered SMJ Anti join tests

* Add CoalesceBatchesExec around SMJ with join filter

* adding `CoalesceBatches`

* adding `CoalesceBatches`

* adding `CoalesceBatches`

* feat: reenable filtered SMJ Anti join tests

* feat: reenable filtered SMJ Anti join tests

---------

Co-authored-by: Andy Grove <[email protected]>

* chore: Add safety check to CometBuffer (#1050)

* chore: Add safety check to CometBuffer

* Add CometColumnarToRowExec

* fix

* fix

* more

* Update plan stability results

* fix

* fix

* fix

* Revert "fix"

This reverts commit 9bad173.

* Revert "Revert "fix""

This reverts commit d527ad1.

* fix BucketedReadWithoutHiveSupportSuite

* fix SparkPlanSuite

* remove unreachable code (#1213)

* test: Enable Comet by default except some tests in SparkSessionExtensionSuite (#1201)

## Which issue does this PR close?

Part of #1197

## Rationale for this change

Since `loadCometExtension` in the diffs was not using `isCometEnabled`, `SparkSessionExtensionSuite` was not using Comet. Once enabled, some test failures were discovered

## What changes are included in this PR?

`loadCometExtension` now uses `isCometEnabled`, which enables Comet by default
Temporarily ignore the failing tests in SparkSessionExtensionSuite

## How are these changes tested?

existing tests

* extract struct expressions to folders based on spark grouping (#1216)

* chore: extract static invoke expressions to folders based on spark grouping (#1217)

* extract static invoke expressions to folders based on spark grouping

* Update native/spark-expr/src/static_invoke/mod.rs

Co-authored-by: Andy Grove <[email protected]>

---------

Co-authored-by: Andy Grove <[email protected]>

* chore: Follow-on PR to fully enable onheap memory usage (#1210)

* Make datafusion's native memory pool configurable

* save

* fix

* Update memory calculation and add draft documentation

* ready for review

* ready for review

* address feedback

* Update docs/source/user-guide/tuning.md

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* Update docs/source/user-guide/tuning.md

Co-authored-by: Kristin Cowalcijk <[email protected]>

* Update docs/source/user-guide/tuning.md

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* Update docs/source/user-guide/tuning.md

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* remove unused config

---------

Co-authored-by: Kristin Cowalcijk <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>

* feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support (#1192)

* Implement native decoding and decompression

* revert some variable renaming for smaller diff

* fix oom issues?

* make NativeBatchDecoderIterator more consistent with ArrowReaderIterator

* fix oom and prep for review

* format

* Add LZ4 support

* clippy, new benchmark

* rename metrics, clean up lz4 code

* update test

* Add support for snappy

* format

* change default back to lz4

* make metrics more accurate

* format

* clippy

* use faster unsafe version of lz4_flex

* Make compression codec configurable for columnar shuffle

* clippy

* fix bench

* fmt

* address feedback

* address feedback

* address feedback

* minor code simplification

* cargo fmt

* overflow check

* rename compression level config

* address feedback

* address feedback

* rename constant

* chore: extract agg_funcs expressions to folders based on spark grouping (#1224)

* extract agg_funcs expressions to folders based on spark grouping

* fix rebase

* extract datetime_funcs expressions to folders based on spark grouping (#1222)

Co-authored-by: Andy Grove <[email protected]>

* chore: use datafusion from crates.io (#1232)

* chore: extract strings file to `strings_func` like in spark grouping (#1215)

* chore: extract predicate_functions expressions to folders based on spark grouping (#1218)

* extract predicate_functions expressions to folders based on spark grouping

* code review changes

---------

Co-authored-by: Andy Grove <[email protected]>

* build(deps): bump protobuf version to 3.21.12 (#1234)

* extract json_funcs expressions to folders based on spark grouping (#1220)

Co-authored-by: Andy Grove <[email protected]>

* test: Enable shuffle by default in Spark tests (#1240)

## Which issue does this PR close?

## Rationale for this change

Because `isCometShuffleEnabled` is false by default, some tests were not reached

## What changes are included in this PR?

Removed `isCometShuffleEnabled` and updated spark test diff

## How are these changes tested?

existing test

* chore: extract hash_funcs expressions to folders based on spark grouping (#1221)

* extract hash_funcs expressions to folders based on spark grouping

* extract hash_funcs expressions to folders based on spark grouping

---------

Co-authored-by: Andy Grove <[email protected]>

* fix: Fall back to Spark for unsupported partition or sort expressions in window aggregates (#1253)

* perf: Improve query planning to more reliably fall back to columnar shuffle when native shuffle is not supported (#1209)

* fix regression (#1259)

* feat: add support for array_remove expression (#1179)

* wip: array remove

* added comet expression test

* updated test cases

* fixed array_remove function for null values

* removed commented code

* remove unnecessary code

* updated the test for 'array_remove'

* added test for array_remove in case the input array is null

* wip: case array is empty

* removed test case for empty array
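
For reference, a minimal Scala sketch of the `array_remove` semantics these tests exercise; the expected results in the comments are assumptions based on standard Spark behaviour (all matching elements are removed, null elements are kept, and a NULL input array yields NULL).

```scala
import org.apache.spark.sql.SparkSession

object ArrayRemoveExample extends App {
  val spark = SparkSession.builder().master("local[1]").appName("array_remove").getOrCreate()

  // remove every element equal to 3; null elements are not considered equal and are kept
  spark.sql("SELECT array_remove(array(1, 2, 3, null, 3), 3)").show() // expected: [1, 2, null]

  // a NULL input array yields NULL, matching the null-input test above
  spark.sql("SELECT array_remove(CAST(NULL AS ARRAY<INT>), 1)").show() // expected: null

  spark.stop()
}
```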

* fix: Fall back to Spark for distinct aggregates (#1262)

* fall back to Spark for distinct aggregates

* update expected plans for 3.4

* update expected plans for 3.5

* force build

* add comment

* feat: Implement custom RecordBatch serde for shuffle for improved performance (#1190)

* Implement faster encoder for shuffle blocks

* make code more concise

* enable fast encoding for columnar shuffle

* update benches

* test all int types

* test float

* remaining types

* add Snappy and Zstd(6) back to benchmark

* fix regression

* Update native/core/src/execution/shuffle/codec.rs

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* address feedback

* support nullable flag

---------

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* docs: Update TPC-H benchmark results (#1257)

* fix: disable initCap by default (#1276)

* fix: disable initCap by default

* Update spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

Co-authored-by: Andy Grove <[email protected]>

* address review comments

---------

Co-authored-by: Andy Grove <[email protected]>

* chore: Add changelog for 0.5.0 (#1278)

* Add changelog

* revert accidental change

* move 2 items to performance section

* update TPC-DS results for 0.5.0 (#1277)

* fix: cast timestamp to decimal is unsupported (#1281)

* fix: cast timestamp to decimal is unsupported

* fix style

* revert test name and mark as ignore

* add comment

* chore: Start 0.6.0 development (#1286)

* start 0.6.0 development

* update some docs

* Revert a change

* update CI

* docs: Fix links and provide complete benchmarking scripts (#1284)

* fix links and provide complete scripts

* fix path

* fix incorrect text

* feat: Add HasRowIdMapping interface (#1288)

* fix style

* fix

* fix for plan serialization

---------

Signed-off-by: Dharan Aditya <[email protected]>
Co-authored-by: NoeB <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Raz Luvaton <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: KAZUYUKI TANIMURA <[email protected]>
Co-authored-by: Sem <[email protected]>
Co-authored-by: Himadri Pal <[email protected]>
Co-authored-by: himadripal <[email protected]>
Co-authored-by: gstvg <[email protected]>
Co-authored-by: Adam Binford <[email protected]>
Co-authored-by: Matt Butrovich <[email protected]>
Co-authored-by: Raz Luvaton <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Dharan Aditya <[email protected]>
Co-authored-by: Kristin Cowalcijk <[email protected]>
Co-authored-by: Oleks V <[email protected]>
Co-authored-by: Zhen Wang <[email protected]>
Co-authored-by: Jagdish Parihar <[email protected]>
andygrove added a commit that referenced this pull request Jan 21, 2025
…exec - 20240121 (#1316)

* feat: support array_append (#1072)

* feat: support array_append

* formatted code

* rewrite array_append plan to match Spark behaviour and fix bug in QueryPlan serde

* remove unwrap

* Fix for Spark 3.3

* refactor array_append binary expression serde code

* Disabled array_append test for spark 4.0+
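
For context, a minimal Scala sketch of the Spark 3.4+ `array_append` behaviour these commits aim to match; the expected results in the comments reflect standard Spark semantics and are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object ArrayAppendExample extends App {
  val spark = SparkSession.builder().master("local[1]").appName("array_append").getOrCreate()

  // array_append adds a single element to the end of the array (Spark 3.4+)
  spark.sql("SELECT array_append(array(1, 2), 3)").show()              // expected: [1, 2, 3]
  spark.sql("SELECT array_append(array(1, 2), NULL)").show()           // expected: [1, 2, null]
  spark.sql("SELECT array_append(CAST(NULL AS ARRAY<INT>), 1)").show() // expected: null

  spark.stop()
}
```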

* chore: Simplify CometShuffleMemoryAllocator to use Spark unified memory allocator (#1063)

* docs: Update benchmarking.md (#1085)

* feat: Require offHeap memory to be enabled (always use unified memory) (#1062)

* Require offHeap memory

* remove unused import

* use off heap memory in stability tests

* reorder imports

* test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE config (#1087)

* Add changelog for 0.4.0 (#1089)

* chore: Prepare for 0.5.0 development (#1090)

* Update version number for build

* update docs

* build: Skip installation of spark-integration  and fuzz testing modules (#1091)

* Add hint for finding the GPG key to use when publishing to maven (#1093)

* docs: Update documentation for 0.4.0 release (#1096)

* update TPC-H results

* update Maven links

* update benchmarking guide and add TPC-DS results

* include q72

* fix: Unsigned type related bugs (#1095)

## Which issue does this PR close?

Closes #1067

## Rationale for this change

Bug fix. A few expressions were failing some unsigned type related tests

## What changes are included in this PR?

 - For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits
 - `u64` becomes `Decimal(20, 0)` but there was a bug in `round()`  (`>` vs `>=`)

## How are these changes tested?

Put back tests for unsigned types

* chore: Include first ScanExec batch in metrics (#1105)

* include first batch in ScanExec metrics

* record row count metric

* fix regression

* chore: Improve CometScan metrics (#1100)

* Add native metrics for plan creation

* make messages consistent

* Include get_next_batch cost in metrics

* formatting

* fix double count of rows

* chore: Add custom metric for native shuffle fetching batches from JVM (#1108)

* feat: support array_insert (#1073)

* Part of the implementation of array_insert

* Missing methods

* Working version

* Reformat code

* Fix code-style

* Add comments about spark's implementation.

* Implement negative indices

+ fix tests for spark < 3.4

* Fix code-style

* Fix scalastyle

* Fix tests for spark < 3.4

* Fixes & tests

- added test for the negative index
- added test for the legacy spark mode

* Use assume(isSpark34Plus) in tests

* Test else-branch & improve coverage

* Update native/spark-expr/src/list.rs

Co-authored-by: Andy Grove <[email protected]>

* Fix fallback test

In one case the index is zero and the test fails due to a Spark error

* Adjust the behaviour for the NULL case to match Spark

* Move the logic of type checking to the method

* Fix code-style

---------

Co-authored-by: Andy Grove <[email protected]>
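
As a reference point, a minimal Scala sketch of the Spark 3.4+ `array_insert` semantics these commits implement: positions are 1-based, position 0 raises an error (hence the fallback-test fix above), negative positions count from the end, and the legacy Spark mode mentioned above changes the negative-index behaviour. Expected results are assumptions based on standard Spark behaviour.

```scala
import org.apache.spark.sql.SparkSession

object ArrayInsertExample extends App {
  val spark = SparkSession.builder().master("local[1]").appName("array_insert").getOrCreate()

  // 1-based positive position: insert 5 at position 5
  spark.sql("SELECT array_insert(array(1, 2, 3, 4), 5, 5)").show()  // expected: [1, 2, 3, 4, 5]

  // negative position counts from the end (non-legacy behaviour)
  spark.sql("SELECT array_insert(array(5, 4, 3, 2), -1, 1)").show() // expected: [5, 4, 3, 2, 1]

  // a NULL input array yields NULL, matching the NULL-case adjustment above
  spark.sql("SELECT array_insert(CAST(NULL AS ARRAY<INT>), 1, 0)").show() // expected: null

  // position 0 is invalid and raises an error in Spark
  // spark.sql("SELECT array_insert(array(1, 2, 3), 0, 9)").show()

  spark.stop()
}
```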

* feat: enable decimal to decimal cast of different precision and scale (#1086)

* enable decimal to decimal cast of different precision and scale

* add more test cases for negative scale and higher precision

* add check for compatibility for decimal to decimal

* fix code style

* Update spark/src/main/scala/org/apache/comet/expressions/CometCast.scala

Co-authored-by: Andy Grove <[email protected]>

* fix the nit in comment

---------

Co-authored-by: himadripal <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
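
Roughly, the behaviour being enabled here looks like the following Scala sketch; the expected values are assumptions based on Spark's non-ANSI decimal cast rules (rounding when the scale shrinks, NULL on overflow).

```scala
import org.apache.spark.sql.SparkSession

object DecimalToDecimalCastExample extends App {
  val spark = SparkSession.builder().master("local[1]").appName("decimal_cast").getOrCreate()

  // cast between decimal types with different precision and scale
  spark.sql("SELECT CAST(CAST(123.456 AS DECIMAL(9, 3)) AS DECIMAL(5, 1))").show()
  // expected: 123.5 (value rounded to the smaller scale)

  spark.sql("SELECT CAST(CAST(12345.678 AS DECIMAL(8, 3)) AS DECIMAL(4, 2))").show()
  // expected: null in non-ANSI mode (the value does not fit the target precision)

  spark.stop()
}
```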

* docs: fix readme FGPA/FPGA typo (#1117)

* fix: Use RDD partition index (#1112)

* fix: Use RDD partition index

* fix

* fix

* fix

* fix: Various metrics bug fixes and improvements (#1111)

* fix: Don't create CometScanExec for subclasses of ParquetFileFormat (#1129)

* Use exact class comparison for parquet scan

* Add test

* Add comment

* fix: Fix metrics regressions (#1132)

* fix metrics issues

* clippy

* update tests

* docs: Add more technical detail and new diagram to Comet plugin overview (#1119)

* Add more technical detail and new diagram to Comet plugin overview

* update diagram

* add info on Arrow IPC

* update diagram

* update diagram

* update docs

* address feedback

* Stop passing Java config map into native createPlan (#1101)

* feat: Improve ScanExec native metrics (#1133)

* save

* remove shuffle jvm metric and update tuning guide

* docs

* add source for all ScanExecs

* address feedback

* address feedback

* chore: Remove unused StringView struct (#1143)

* Remove unused StringView struct

* remove more dead code

* docs: Add some documentation explaining how shuffle works (#1148)

* add some notes on shuffle

* reads

* improve docs

* test: enable more Spark 4.0 tests (#1145)

## Which issue does this PR close?

Part of #372 and #551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR enables more Spark 4.0 tests that were fixed by recent changes

## How are these changes tested?

tests enabled

* chore: Refactor cast to use SparkCastOptions param (#1146)

* Refactor cast to use SparkCastOptions param

* update tests

* update benches

* update benches

* update benches

* Enable more scenarios in CometExecBenchmark. (#1151)

* chore: Move more expressions from core crate to spark-expr crate (#1152)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports

* remove dead code (#1155)

* fix: Spark 4.0-preview1 SPARK-47120 (#1156)

## Which issue does this PR close?

Part of #372 and #551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR fixes the new test SPARK-47120 added in Spark 4.0

## How are these changes tested?

tests enabled

* chore: Move string kernels and expressions to spark-expr crate (#1164)

* Move string kernels and expressions to spark-expr crate

* remove unused hash kernel

* remove unused dependencies

* chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165)

* move CheckOverflow to spark-expr crate

* move NegativeExpr to spark-expr crate

* move UnboundColumn to spark-expr crate

* move ExpandExec from execution::datafusion::operators to execution::operators

* refactoring to remove datafusion subpackage

* update imports in benches

* fix

* fix

* chore: Add ignored tests for reading complex types from Parquet (#1167)

* Add ignored tests for reading structs from Parquet

* add basic map test

* add tests for Map and Array

* feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169)

* Add Spark-compatible SchemaAdapterFactory implementation

* remove prototype code

* fix

* refactor

* implement more cast logic

* implement more cast logic

* add basic test

* improve test

* cleanup

* fmt

* add support for casting unsigned int to signed int

* clippy

* address feedback

* fix test

* fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)

* test: enabling Spark tests with offHeap requirement (#1177)

## Which issue does this PR close?

## Rationale for this change

After #1062, we have not been running Spark tests for native execution

## What changes are included in this PR?

Removed the off heap requirement for testing

## How are these changes tested?

Bringing back Spark tests for native execution

* feat: Improve shuffle metrics (second attempt) (#1175)

* improve shuffle metrics

* docs

* more metrics

* refactor

* address feedback

* fix: stddev_pop should not directly return 0.0 when count is 1.0 (#1184)

* add test

* fix

* fix

* fix

* feat: Make native shuffle compression configurable and respect `spark.shuffle.compress` (#1185)

* Make shuffle compression codec and level configurable

* remove lz4 references

* docs

* update comment

* clippy

* fix benches

* clippy

* clippy

* disable test for miri

* remove lz4 reference from proto

* minor: move shuffle classes from common to spark (#1193)

* minor: refactor decodeBatches to make private in broadcast exchange (#1195)

* minor: refactor prepare_output so that it does not require an ExecutionContext (#1194)

* fix: fix missing explanation for then branch in case when (#1200)

* minor: remove unused source files (#1202)

* chore: Upgrade to DataFusion 44.0.0-rc2 (#1154)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports

* save

* save

* save

* remove unused imports

* clippy

* implement more hashers

* implement Hash and PartialEq

* implement Hash and PartialEq

* implement Hash and PartialEq

* benches

* fix ScalarUDFImpl.return_type failure

* exclude test from miri

* ignore correct test

* ignore another test

* remove miri checks

* use return_type_from_exprs

* Revert "use return_type_from_exprs"

This reverts commit febc1f1.

* use DF main branch

* hacky workaround for regression in ScalarUDFImpl.return_type

* fix repo url

* pin to revision

* bump to latest rev

* bump to latest DF rev

* bump DF to rev 9f530dd

* add Cargo.lock

* bump DF version

* no default features

* Revert "remove miri checks"

This reverts commit 4638fe3.

* Update pin to DataFusion e99e02b9b9093ceb0c13a2dd32a2a89beba47930

* update pin

* Update Cargo.toml

Bump to 44.0.0-rc2

* update cargo lock

* revert miri change

---------

Co-authored-by: Andrew Lamb <[email protected]>

* feat: add support for array_contains expression (#1163)

* feat: add support for array_contains expression

* test: add unit test for array_contains function

* Removes unnecessary case expression for handling null values
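
As a reference, a minimal Scala sketch of the `array_contains` null semantics that the removed case expression had been handling; the expected results are assumptions based on standard Spark behaviour.

```scala
import org.apache.spark.sql.SparkSession

object ArrayContainsExample extends App {
  val spark = SparkSession.builder().master("local[1]").appName("array_contains").getOrCreate()

  spark.sql("SELECT array_contains(array(1, 2, 3), 2)").show() // expected: true
  spark.sql("SELECT array_contains(array(1, 2, 3), 4)").show() // expected: false

  // if the value is not found but the array contains a null, the result is NULL
  spark.sql("SELECT array_contains(array(1, null), 3)").show() // expected: null

  spark.stop()
}
```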

* chore: Move more expressions from core crate to spark-expr crate (#1152)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports

* remove dead code (#1155)

* fix: Spark 4.0-preview1 SPARK-47120 (#1156)

## Which issue does this PR close?

Part of #372 and #551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR fixes the new test SPARK-47120 added in Spark 4.0

## How are these changes tested?

tests enabled

* chore: Move string kernels and expressions to spark-expr crate (#1164)

* Move string kernels and expressions to spark-expr crate

* remove unused hash kernel

* remove unused dependencies

* chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165)

* move CheckOverflow to spark-expr crate

* move NegativeExpr to spark-expr crate

* move UnboundColumn to spark-expr crate

* move ExpandExec from execution::datafusion::operators to execution::operators

* refactoring to remove datafusion subpackage

* update imports in benches

* fix

* fix

* chore: Add ignored tests for reading complex types from Parquet (#1167)

* Add ignored tests for reading structs from Parquet

* add basic map test

* add tests for Map and Array

* feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169)

* Add Spark-compatible SchemaAdapterFactory implementation

* remove prototype code

* fix

* refactor

* implement more cast logic

* implement more cast logic

* add basic test

* improve test

* cleanup

* fmt

* add support for casting unsigned int to signed int

* clippy

* address feedback

* fix test

* fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)

* test: enabling Spark tests with offHeap requirement (#1177)

## Which issue does this PR close?

## Rationale for this change

After #1062, we have not been running Spark tests for native execution

## What changes are included in this PR?

Removed the off heap requirement for testing

## How are these changes tested?

Bringing back Spark tests for native execution

* feat: Improve shuffle metrics (second attempt) (#1175)

* improve shuffle metrics

* docs

* more metrics

* refactor

* address feedback

* fix: stddev_pop should not directly return 0.0 when count is 1.0 (#1184)

* add test

* fix

* fix

* fix

* feat: Make native shuffle compression configurable and respect `spark.shuffle.compress` (#1185)

* Make shuffle compression codec and level configurable

* remove lz4 references

* docs

* update comment

* clippy

* fix benches

* clippy

* clippy

* disable test for miri

* remove lz4 reference from proto

* minor: move shuffle classes from common to spark (#1193)

* minor: refactor decodeBatches to make private in broadcast exchange (#1195)

* minor: refactor prepare_output so that it does not require an ExecutionContext (#1194)

* fix: fix missing explanation for then branch in case when (#1200)

* minor: remove unused source files (#1202)

* chore: Upgrade to DataFusion 44.0.0-rc2 (#1154)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports

* save

* save

* save

* remove unused imports

* clippy

* implement more hashers

* implement Hash and PartialEq

* implement Hash and PartialEq

* implement Hash and PartialEq

* benches

* fix ScalarUDFImpl.return_type failure

* exclude test from miri

* ignore correct test

* ignore another test

* remove miri checks

* use return_type_from_exprs

* Revert "use return_type_from_exprs"

This reverts commit febc1f1.

* use DF main branch

* hacky workaround for regression in ScalarUDFImpl.return_type

* fix repo url

* pin to revision

* bump to latest rev

* bump to latest DF rev

* bump DF to rev 9f530dd

* add Cargo.lock

* bump DF version

* no default features

* Revert "remove miri checks"

This reverts commit 4638fe3.

* Update pin to DataFusion e99e02b9b9093ceb0c13a2dd32a2a89beba47930

* update pin

* Update Cargo.toml

Bump to 44.0.0-rc2

* update cargo lock

* revert miri change

---------

Co-authored-by: Andrew Lamb <[email protected]>

* update UT

Signed-off-by: Dharan Aditya <[email protected]>

* fix typo in UT

Signed-off-by: Dharan Aditya <[email protected]>

---------

Signed-off-by: Dharan Aditya <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: KAZUYUKI TANIMURA <[email protected]>
Co-authored-by: Parth Chandra <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Raz Luvaton <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* feat: Add a `spark.comet.exec.memoryPool` configuration for experimenting with various datafusion memory pool setups. (#1021)

* feat: Reenable tests for filtered SMJ anti join (#1211)

* feat: reenable filtered SMJ Anti join tests

* feat: reenable filtered SMJ Anti join tests

* feat: reenable filtered SMJ Anti join tests

* feat: reenable filtered SMJ Anti join tests

* Add CoalesceBatchesExec around SMJ with join filter

* adding `CoalesceBatches`

* adding `CoalesceBatches`

* adding `CoalesceBatches`

* feat: reenable filtered SMJ Anti join tests

* feat: reenable filtered SMJ Anti join tests

---------

Co-authored-by: Andy Grove <[email protected]>

* chore: Add safety check to CometBuffer (#1050)

* chore: Add safety check to CometBuffer

* Add CometColumnarToRowExec

* fix

* fix

* more

* Update plan stability results

* fix

* fix

* fix

* Revert "fix"

This reverts commit 9bad173.

* Revert "Revert "fix""

This reverts commit d527ad1.

* fix BucketedReadWithoutHiveSupportSuite

* fix SparkPlanSuite

* remove unreachable code (#1213)

* test: Enable Comet by default except some tests in SparkSessionExtensionSuite (#1201)

## Which issue does this PR close?

Part of #1197

## Rationale for this change

Since `loadCometExtension` in the diffs was not using `isCometEnabled`, `SparkSessionExtensionSuite` was not using Comet. Once enabled, some test failures were discovered

## What changes are included in this PR?

`loadCometExtension` now uses `isCometEnabled`, which enables Comet by default
Temporarily ignore the failing tests in SparkSessionExtensionSuite

## How are these changes tested?

existing tests

* extract struct expressions to folders based on spark grouping (#1216)

* chore: extract static invoke expressions to folders based on spark grouping (#1217)

* extract static invoke expressions to folders based on spark grouping

* Update native/spark-expr/src/static_invoke/mod.rs

Co-authored-by: Andy Grove <[email protected]>

---------

Co-authored-by: Andy Grove <[email protected]>

* chore: Follow-on PR to fully enable onheap memory usage (#1210)

* Make datafusion's native memory pool configurable

* save

* fix

* Update memory calculation and add draft documentation

* ready for review

* ready for review

* address feedback

* Update docs/source/user-guide/tuning.md

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* Update docs/source/user-guide/tuning.md

Co-authored-by: Kristin Cowalcijk <[email protected]>

* Update docs/source/user-guide/tuning.md

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* Update docs/source/user-guide/tuning.md

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* remove unused config

---------

Co-authored-by: Kristin Cowalcijk <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>

* feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support (#1192)

* Implement native decoding and decompression

* revert some variable renaming for smaller diff

* fix oom issues?

* make NativeBatchDecoderIterator more consistent with ArrowReaderIterator

* fix oom and prep for review

* format

* Add LZ4 support

* clippy, new benchmark

* rename metrics, clean up lz4 code

* update test

* Add support for snappy

* format

* change default back to lz4

* make metrics more accurate

* format

* clippy

* use faster unsafe version of lz4_flex

* Make compression codec configurable for columnar shuffle

* clippy

* fix bench

* fmt

* address feedback

* address feedback

* address feedback

* minor code simplification

* cargo fmt

* overflow check

* rename compression level config

* address feedback

* address feedback

* rename constant

* chore: extract agg_funcs expressions to folders based on spark grouping (#1224)

* extract agg_funcs expressions to folders based on spark grouping

* fix rebase

* extract datetime_funcs expressions to folders based on spark grouping (#1222)

Co-authored-by: Andy Grove <[email protected]>

* chore: use datafusion from crates.io (#1232)

* chore: extract strings file to `strings_func` like in spark grouping (#1215)

* chore: extract predicate_functions expressions to folders based on spark grouping (#1218)

* extract predicate_functions expressions to folders based on spark grouping

* code review changes

---------

Co-authored-by: Andy Grove <[email protected]>

* build(deps): bump protobuf version to 3.21.12 (#1234)

* extract json_funcs expressions to folders based on spark grouping (#1220)

Co-authored-by: Andy Grove <[email protected]>

* test: Enable shuffle by default in Spark tests (#1240)

## Which issue does this PR close?

## Rationale for this change

Because `isCometShuffleEnabled` is false by default, some tests were not reached

## What changes are included in this PR?

Removed `isCometShuffleEnabled` and updated spark test diff

## How are these changes tested?

existing test

* chore: extract hash_funcs expressions to folders based on spark grouping (#1221)

* extract hash_funcs expressions to folders based on spark grouping

* extract hash_funcs expressions to folders based on spark grouping

---------

Co-authored-by: Andy Grove <[email protected]>

* fix: Fall back to Spark for unsupported partition or sort expressions in window aggregates (#1253)

* perf: Improve query planning to more reliably fall back to columnar shuffle when native shuffle is not supported (#1209)

* fix regression (#1259)

* feat: add support for array_remove expression (#1179)

* wip: array remove

* added comet expression test

* updated test cases

* fixed array_remove function for null values

* removed commented code

* remove unnecessary code

* updated the test for 'array_remove'

* added test for array_remove in case the input array is null

* wip: case array is empty

* removed test case for empty array

* fix: Fall back to Spark for distinct aggregates (#1262)

* fall back to Spark for distinct aggregates

* update expected plans for 3.4

* update expected plans for 3.5

* force build

* add comment

* feat: Implement custom RecordBatch serde for shuffle for improved performance (#1190)

* Implement faster encoder for shuffle blocks

* make code more concise

* enable fast encoding for columnar shuffle

* update benches

* test all int types

* test float

* remaining types

* add Snappy and Zstd(6) back to benchmark

* fix regression

* Update native/core/src/execution/shuffle/codec.rs

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* address feedback

* support nullable flag

---------

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* docs: Update TPC-H benchmark results (#1257)

* fix: disable initCap by default (#1276)

* fix: disable initCap by default

* Update spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

Co-authored-by: Andy Grove <[email protected]>

* address review comments

---------

Co-authored-by: Andy Grove <[email protected]>

* chore: Add changelog for 0.5.0 (#1278)

* Add changelog

* revert accidental change

* move 2 items to performance section

* update TPC-DS results for 0.5.0 (#1277)

* fix: cast timestamp to decimal is unsupported (#1281)

* fix: cast timestamp to decimal is unsupported

* fix style

* revert test name and mark as ignore

* add comment

* chore: Start 0.6.0 development (#1286)

* start 0.6.0 development

* update some docs

* Revert a change

* update CI

* docs: Fix links and provide complete benchmarking scripts (#1284)

* fix links and provide complete scripts

* fix path

* fix incorrect text

* feat: Add HasRowIdMapping interface (#1288)

---------

Signed-off-by: Dharan Aditya <[email protected]>
Co-authored-by: NoeB <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Raz Luvaton <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: KAZUYUKI TANIMURA <[email protected]>
Co-authored-by: Sem <[email protected]>
Co-authored-by: Himadri Pal <[email protected]>
Co-authored-by: himadripal <[email protected]>
Co-authored-by: gstvg <[email protected]>
Co-authored-by: Adam Binford <[email protected]>
Co-authored-by: Matt Butrovich <[email protected]>
Co-authored-by: Raz Luvaton <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Dharan Aditya <[email protected]>
Co-authored-by: Kristin Cowalcijk <[email protected]>
Co-authored-by: Oleks V <[email protected]>
Co-authored-by: Zhen Wang <[email protected]>
Co-authored-by: Jagdish Parihar <[email protected]>