chore: Upgrade to latest DataFusion revision #909

andygrove · 2024-09-04T15:06:23Z

Which issue does this PR close?

N/A

Rationale for this change

DataFusion 42 will be released soon, so we need to confirm that none of the upstream changes cause regressions in Comet before the release.

What changes are included in this PR?

Update DataFusion revision
Refactor aggregate functions due to upstream API changes
Refactor to reduce duplicate code
Remove our copy of StatsType and use DataFusion's version
Remove our copy of down_cast_any_ref and use DataFusion's version
Implement group accumulator support for stddev and variance (or file follow-on issue)

How are these changes tested?

Existing tests.

andygrove · 2024-09-04T15:35:10Z

native/spark-expr/src/utils.rs

-/// A utility function from DataFusion. It is not exposed by DataFusion.
-pub fn down_cast_any_ref(any: &dyn Any) -> &dyn Any {
-    if any.is::<Arc<dyn PhysicalExpr>>() {
-        any.downcast_ref::<Arc<dyn PhysicalExpr>>()
-            .unwrap()
-            .as_any()
-    } else if any.is::<Box<dyn PhysicalExpr>>() {
-        any.downcast_ref::<Box<dyn PhysicalExpr>>()
-            .unwrap()
-            .as_any()
-    } else {
-        any
-    }
-}


This function is now public in DataFusion, so we use that version instead of our own copy.
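
For readers unfamiliar with the helper, here is a minimal, self-contained sketch of what it is for. It assumes a hypothetical `Expr` trait and `Column` type standing in for DataFusion's `PhysicalExpr` and a concrete expression; neither name comes from this PR. `same_column` is an illustrative caller, not Comet code. The helper has the same shape as the removed function: it unwraps an `Arc` or `Box` around the trait object so that a later `downcast_ref` sees the concrete type.

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical stand-in for DataFusion's `PhysicalExpr` trait.
pub trait Expr: Any + Send + Sync {
    fn as_any(&self) -> &dyn Any;
}

/// Hypothetical concrete expression used only for this illustration.
#[derive(Debug, PartialEq)]
pub struct Column {
    pub index: usize,
}

impl Expr for Column {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Same shape as the removed helper: if `any` is an `Arc<dyn Expr>` or a
/// `Box<dyn Expr>`, return the inner expression as `&dyn Any`; otherwise
/// return `any` unchanged.
pub fn down_cast_any_ref(any: &dyn Any) -> &dyn Any {
    if let Some(expr) = any.downcast_ref::<Arc<dyn Expr>>() {
        expr.as_any()
    } else if let Some(expr) = any.downcast_ref::<Box<dyn Expr>>() {
        expr.as_any()
    } else {
        any
    }
}

/// Illustrative caller: compares `col` with a value that may be a bare
/// `Column` or one wrapped in `Arc<dyn Expr>` / `Box<dyn Expr>`.
pub fn same_column(col: &Column, other: &dyn Any) -> bool {
    down_cast_any_ref(other)
        .downcast_ref::<Column>()
        .map(|o| o == col)
        .unwrap_or(false)
}
```

Without the unwrapping step, calling `downcast_ref::<Column>()` on a value whose dynamic type is `Arc<dyn Expr>` returns `None`, so the comparison would always fail for wrapped expressions.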

andygrove · 2024-09-04T17:28:28Z

@huaxingao There are quite a few changes to aggregates in this PR due to upstream API changes. Could you review when you get a chance?

kazuyukitanimura

LGTM pending CI

kazuyukitanimura · 2024-09-04T17:52:25Z

native/core/src/execution/datafusion/planner.rs

-                        Ok(Arc::new(SumDecimal::new("sum", child, datatype)))
+                        let func = AggregateUDF::new_from_impl(SumDecimal::new(
+                            "sum",
+                            Arc::clone(&child),


Just for me to understand, what would happen if we do not do Arc::clone() here?

We need to clone because we reference child again in the next statement. If I remove the clone, the code fails to compile:

error[E0382]: use of moved value: `child`
    --> core/src/execution/datafusion/planner.rs:1357:72
     |
1347 |                         let child = self.create_expr(expr.child.as_ref().unwrap(), Arc::clone(&schema))?;
     |                             ----- move occurs because `child` has type `Arc<dyn datafusion_physical_expr::PhysicalExpr>`, which does not implement the `Copy` trait
...
1354 |                             child,
     |                             ----- value moved here
...
1357 |                         AggregateExprBuilder::new(Arc::new(func), vec![child])
     |                                                                        ^^^^^ value used here after move

Hmm, I think the second parameter is Arc<dyn PhysicalExpr>. If that has not changed, shouldn't passing child directly work?

Oh, I see. Arc::clone actually creates another Arc<T>.

https://doc.rust-lang.org/std/sync/struct.Arc.html#impl-Clone-for-Arc%3CT,+A%3E

Yes, we recently started using Arc::clone(foo) instead of foo.clone() to make it easy to see when we are just cloning an Arc (cheap) vs a more expensive clone operation. There is a clippy lint that checks that we are using this style.
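
The move-versus-clone point above can be reproduced outside Comet. The sketch below is only an illustration: `SumLike` and `build` are hypothetical names, and an `Arc<String>` stands in for `Arc<dyn PhysicalExpr>`. The lint mentioned above is presumably `clippy::clone_on_ref_ptr`, which flags `.clone()` on reference-counted pointers; that is an assumption, since the thread does not name it.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an aggregate that keeps a handle to its input.
pub struct SumLike {
    pub name: String,
    pub input: Arc<String>,
}

impl SumLike {
    pub fn new(name: &str, input: Arc<String>) -> Self {
        Self {
            name: name.to_string(),
            input,
        }
    }
}

/// Builds the aggregate and also returns the argument list, mirroring the
/// two uses of `child` in the planner code discussed above.
pub fn build(child: Arc<String>) -> (SumLike, Vec<Arc<String>>) {
    // Passing `child` itself here would move it, and the `vec![child]`
    // below would then fail to compile with E0382.
    // `Arc::clone` only bumps the reference count; the String is not copied.
    let func = SumLike::new("sum", Arc::clone(&child));
    (func, vec![child])
}
```

Both handles returned by `build` point at the same allocation, which is why the clone is cheap. Writing `Arc::clone(&child)` rather than `child.clone()` makes that visible at the call site.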

kazuyukitanimura · 2024-09-04T18:15:13Z

Oops, some test failures

andygrove · 2024-09-04T18:55:48Z

failure:

2024-09-04T18:07:57.8583417Z - var_pop and var_samp *** FAILED *** (532 milliseconds)
2024-09-04T18:07:57.8588646Z   org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 34999.0 failed 1 times, most recent failure: Lost task 1.0 in stage 34999.0 (TID 19829) (e5b532f86b88 executor driver): org.apache.comet.CometNativeException: Invalid argument error: column types must match schema types, expected Float64 but found UInt64 at column index 1

I rolled back the group accumulator implementation.

andygrove added 13 commits September 4, 2024 08:06

update dependency version

4773c6a

update avg

4098e97

update avg_decimal

fbeaf97

update sum_decimal

1fa346d

variance

a946ce4

stddev

5c674a6

covariance

9474f2d

correlation

be6b032

save progress

cb0d86e

code compiles

f2ae56d

clippy

942930b

remove duplicate of down_cast_any_ref function

2ace729

remove duplicate of down_cast_any_ref function

81ddd56

andygrove added 5 commits September 4, 2024 10:18

machete

7ff01bf

bump DF version again and use StatsType from DataFusion

f0eacda

implement groups accumulator for stddev and variance

b1ab6db

refactor

0625ad5

fmt

23fc1c3

andygrove marked this pull request as ready for review September 4, 2024 17:27

andygrove requested review from huaxingao and kazuyukitanimura September 4, 2024 17:27

kazuyukitanimura approved these changes Sep 4, 2024

revert group accumulator

11e0938

viirya approved these changes Sep 4, 2024

kazuyukitanimura approved these changes Sep 4, 2024

andygrove merged commit 00eaa8e into apache:main Sep 5, 2024
75 checks passed

andygrove deleted the df-upgrade branch September 5, 2024 18:09

Kimahriman mentioned this pull request Sep 9, 2024

chore: Enable additional CreateArray tests #928

Merged
