Linearize binary expressions to reduce proto tree complexity #4115

isidentical · 2022-11-05T23:23:48Z

Which issue does this PR close?

Closes #4066.

Rationale for this change

This PR represents chained binary expressions (like a AND b AND c or a + b + c) in a linearized form: instead of ((a, AND, b), AND, c), they are serialized as ([a, b, c], AND). This reduces the complexity of the protobuf trees and helps serialize some complex expressions that could not be serialized before.
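To make the idea concrete, here is a minimal, self-contained sketch of the flattening and its inverse. The types `Expr`, `FlatExpr`, and `Op` and the helpers `col`, `binary`, `linearize`, and `rebuild` are hypothetical names made up for this illustration; they are not DataFusion's `Expr` or the generated protobuf structs, and the actual code in this PR may differ.

```rust
// Illustrative only: these types are hypothetical stand-ins, not DataFusion's
// `Expr` and not the structs generated from the protobuf declaration.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    And,
    Or,
}

/// Nested form: every binary node has exactly two children.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Binary {
        left: Box<Expr>,
        op: Op,
        right: Box<Expr>,
    },
}

/// Linearized form: one node holds all operands of a same-operator chain.
#[derive(Debug, Clone, PartialEq)]
enum FlatExpr {
    Column(String),
    Binary { operands: Vec<FlatExpr>, op: Op },
}

fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn binary(left: Expr, op: Op, right: Expr) -> Expr {
    Expr::Binary {
        left: Box::new(left),
        op,
        right: Box::new(right),
    }
}

/// Flatten a left-deep chain such as ((a AND b) AND c) into ([a, b, c], AND).
fn linearize(expr: &Expr) -> FlatExpr {
    match expr {
        Expr::Column(name) => FlatExpr::Column(name.clone()),
        Expr::Binary { op, .. } => {
            let mut operands = Vec::new();
            let mut current = expr;
            // Walk down the left spine for as long as the operator matches,
            // collecting right-hand operands (innermost operand comes last).
            while let Expr::Binary { left, op: inner, right } = current {
                if inner != op {
                    break;
                }
                operands.push(linearize(right));
                current = &**left;
            }
            // `current` is now the leftmost operand of the chain.
            operands.push(linearize(current));
            operands.reverse();
            FlatExpr::Binary { operands, op: *op }
        }
    }
}

/// Rebuild the nested, left-associated form from the linearized one.
fn rebuild(flat: &FlatExpr) -> Expr {
    match flat {
        FlatExpr::Column(name) => Expr::Column(name.clone()),
        FlatExpr::Binary { operands, op } => operands
            .iter()
            .map(rebuild)
            .reduce(|left, right| binary(left, *op, right))
            .expect("a binary expression has at least two operands"),
    }
}
```

In this sketch only the left spine is flattened, so a AND (b AND c) keeps its nested right-hand side and round-trips to the same grouping rather than being merged with (a AND b) AND c.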

What changes are included in this PR?

New representation of the binary expressions in serialized logical plans.

Are there any user-facing changes?

This PR changes the structure of the serialized logical plan, so I am not sure whether this qualifies as an API change. If it would be better to do this without removing the existing fields from the protobuf declaration of BinaryExpr, we could instead add a new field to the current form and represent all the extra operands there (but I think the approach in this PR is much more straightforward).

andygrove

This looks great. Thanks @isidentical. I'd like to see us implement this in the logical plan as well, eventually, as mentioned in #1434.

datafusion/proto/src/from_proto.rs

alamb

Thanks @isidentical -- this looks like a great step forward!

datafusion/proto/src/to_proto.rs

datafusion/proto/src/bytes/mod.rs

datafusion/proto/src/from_proto.rs

alamb

looks great -- thank you @isidentical

alamb · 2022-11-07T19:47:17Z

datafusion/proto/src/bytes/mod.rs

+                let or_chain = (0..n)
+                    .fold(basic_expr.clone(), |expr, _| expr.or(basic_expr.clone()));
+                // (a < 5) OR (a < 5) AND (a < 5) OR (a < 5) AND (a < 5) AND (a < 5) OR ...
+                let expr =


alamb · 2022-11-07T19:47:39Z

datafusion/proto/src/bytes/mod.rs

+        let expr_ordered = col("A").and(col("B")).and(col("C")).and(col("D"));
+        assert_eq!(expr_ordered, roundtrip_expr(&expr_ordered));
+
+        // Ensure that no other variation becomes equal


ursabot · 2022-11-07T19:52:56Z

Benchmark runs are scheduled for baseline = 3892a1f and contender = 6b71294. 6b71294 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

yahoNanJing · 2022-11-14T16:25:16Z

It seems this introduces a regression that causes a stack overflow for "sum case when".

isidentical · 2022-11-14T16:59:21Z

Thanks @yahoNanJing for the report! Would you mind sharing the logs (or linking them)? I don't see a new issue relevant to this, and the CI on main is passing, so I'm not sure when it fails 🤔

yahoNanJing · 2022-11-15T08:32:52Z

Hi @isidentical, the issue may not be related to this PR. I made a mistake in our testing environment. Really sorry for the confusion.

isidentical · 2022-11-15T16:01:09Z

Ah, no problem at all. Let me know if it resurfaces @yahoNanJing!

isidentical marked this pull request as ready for review November 6, 2022 00:37

andygrove approved these changes Nov 6, 2022

andygrove added the api change Changes the API exposed to users of the crate label Nov 6, 2022

HaoYang670 reviewed Nov 7, 2022

alamb approved these changes Nov 7, 2022

isidentical force-pushed the gh-4066 branch 2 times, most recently from a1d6b02 to 4213d8f Compare November 7, 2022 15:15

Linearize binary expressions to reduce proto tree complexity

b416aaf

isidentical force-pushed the gh-4066 branch from 4213d8f to b416aaf Compare November 7, 2022 15:21

isidentical requested a review from alamb November 7, 2022 15:22

alamb approved these changes Nov 7, 2022

alamb merged commit 6b71294 into apache:master Nov 7, 2022
