Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linearize binary expressions to reduce proto tree complexity #4115

Merged
merged 1 commit into from
Nov 7, 2022

Conversation

isidentical
Copy link
Contributor

@isidentical isidentical commented Nov 5, 2022

Which issue does this PR close?

Closes #4066.

Rationale for this change

This PR tries to represent chained binary expressions (like a AND b AND C or a + b + c) in a linearized manner (so instead of ((a, AND, b), AND, c), they are represented as ([a, b, c], AND)) which reduces the complexity of protobuf trees and help serialize some of the complex expressions that weren't possible to serialize before.

What changes are included in this PR?

New representation of the binary expressions in serialized logical plans.

Are there any user-facing changes?

This PR changes the structure in the logical plan, so not sure if this qualifies as an API change. If it might be better to actually do it without removing the existing fields from the protobuf declaration of BinaryExpr, we can also add a new field to the current form and represent all the extra operands there (but I think this one is much more straightforward).

@isidentical isidentical marked this pull request as ready for review November 6, 2022 00:37
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Thanks @isidentical. I'd like to see us implement this in the logical plan as well, eventually, as mentioned in #1434.

@andygrove andygrove added the api change Changes the API exposed to users of the crate label Nov 6, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @isidentical -- this looks like a great step forward!

@isidentical isidentical force-pushed the gh-4066 branch 2 times, most recently from a1d6b02 to 4213d8f Compare November 7, 2022 15:15
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great -- thank you @isidentical

let or_chain = (0..n)
.fold(basic_expr.clone(), |expr, _| expr.or(basic_expr.clone()));
// (a < 5) OR (a < 5) AND (a < 5) OR (a < 5) AND (a < 5) AND (a < 5) OR ...
let expr =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

let expr_ordered = col("A").and(col("B")).and(col("C")).and(col("D"));
assert_eq!(expr_ordered, roundtrip_expr(&expr_ordered));

// Ensure that no other variation becomes equal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@alamb alamb merged commit 6b71294 into apache:master Nov 7, 2022
@ursabot
Copy link

ursabot commented Nov 7, 2022

Benchmark runs are scheduled for baseline = 3892a1f and contender = 6b71294. 6b71294 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@yahoNanJing
Copy link
Contributor

Seems it brings a regression issue which causes stack overflow issue for "sum case when"

@isidentical
Copy link
Contributor Author

Thanks @yahoNanJing for the report! Would you mind sharing the logs (or linking them) (I don't see a new issue relevant to this and the CI on main is passing so not sure when it fails 🤔)

@yahoNanJing
Copy link
Contributor

Hi @isidentical, the issue may not relate to this PR. I made a mistake in our testing environment. Really sorry for bringing the confusing info.

@isidentical
Copy link
Contributor Author

Ah, no problem at all. Let me know if it resurfaces @yahoNanJing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support serializing more deeply nested AND / OR expressions
6 participants