Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Orthogonalize distribution and sort enforcement rules into EnforceDistribution and EnforceSorting #4839

Merged
merged 5 commits into from
Jan 9, 2023
Merged

Orthogonalize distribution and sort enforcement rules into EnforceDistribution and EnforceSorting #4839

merged 5 commits into from
Jan 9, 2023

Conversation

mustafasrepo
Copy link
Contributor

Which issue does this PR close?

N/A

Rationale for this change

During the review of #4691, one of the key findings was that there was a small separation of concern issue: Both BasicEnforcement and OptimizeSorts were dealing with local sorting and there was some overlap in functionality.

This PR pays this technical debt: Enforcers of distribution and sorting requirements will henceforth be two completely orthogonal rules (EnforceDistribution and EnforceSorting). Note that one can get the same result with the old BasicEnforcement rule by applying these two rules in succession.

The new EnforceSorting doesn't just enforce sorts by naively adding SortExecs, it will smartly add OR remove them as it enforces the ordering requirements. This will hopefully help with rule reuse and ease reasoning (we will not lose optimality by liberally using EnforceSorting).

What changes are included in this PR?

Local sort enforcement AND optimization is handled with a single rule. Current rule can be called multiple times without a downside in terms of the final physical plan.

Are these changes tested?

Existing tests check for plan correctness.

Are there any user-facing changes?

No.

Future Work

Some of the EnforceDistribution tests actually test the full EnforceDistribution + EnforceSorting cascade. It would be a good idea to orthogonalize those tests too.

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions labels Jan 7, 2023
@jackwener
Copy link
Member

cc @mingmwang

@alamb
Copy link
Contributor

alamb commented Jan 7, 2023

Thank you @mustafasrepo -- I plan to review this PR tomorrow

@alamb alamb changed the title Orthogonalize distribution and sort enforcement rules Orthogonalize distribution and sort enforcement rules into EnforceDistribution and EnforceSorting Jan 8, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @mustafasrepo -- this is a very nice change 👌

I agree with @jackwener it would be good to get @mingmwang 's opinion on this change if they have time. However, given all the existing tests pass I view this as a (nice) refactoring exercise that is likely to be uncontroversial

cc @liukun4515

let optimized = optimizer.optimize($PLAN, &config)?;
let optimizer = EnforceSorting {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is slightly confusing that the tests in EnforceDistribution also rely on EnforceSorting though I understand the reason for this given they both started in the same pass

Maybe we can add a comment like:

// These tests also ensure  `EnforceSorting` because they were written prior to the
// separation of `EnforceSorting` and `EnfoceDistribution`

Or something like that to give future readers a clue about the rationale for this seemingly strange choice

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- just sent a commit with a NOTE and a TODO.

@@ -287,6 +287,10 @@ impl ExecutionPlan for LocalLimitExec {
self.input.output_ordering()
}

fn maintains_input_order(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@mingmwang
Copy link
Contributor

mingmwang commented Jan 9, 2023

@mustafasrepo @alamb
Generally, I am OK with this change. I will take a closer look at the PR today.

For the original OptimizeSorts rule, I just have a feeling that the implementation was too complex and I think maybe there are ways to simply it. And I also have a feeling that the rule should be implemented in a top-down approach. I will try to do some test also.
Adding SortExec and RepartitionExec is applied in a bottom-up approach(this can also be implemented in a top-down way, it is much more complex but more powerful, I plan to implement this in future); Removing SortExec can be applied in a top-down approach because we want to remove the unnecessary Sorts which are usually in the high level of the plan tree.

@@ -832,7 +828,7 @@ fn new_join_conditions(
/// Within this function, it checks whether we need to add additional plan operators
/// of data exchanging and data ordering to satisfy the required distribution and ordering.
/// And we should avoid to manually add plan operators of data exchanging and data ordering in other places
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please modify the comment here and remove the "data ordering".

@mingmwang
Copy link
Contributor

The change LGTM except for a small comment change.

@ozankabak
Copy link
Contributor

Thanks for the review @mingmwang, comment is updated.

@alamb alamb merged commit ceff6cb into apache:master Jan 9, 2023
@alamb
Copy link
Contributor

alamb commented Jan 9, 2023

Thanks everyone

@ursabot
Copy link

ursabot commented Jan 9, 2023

Benchmark runs are scheduled for baseline = c5e2594 and contender = ceff6cb. ceff6cb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@mustafasrepo mustafasrepo deleted the feature/unify_sort_rules branch January 10, 2023 11:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants