Reorder the physical plan optimizer rules #4678

Closed
yahoNanJing opened this issue Dec 20, 2022 · 0 comments · Fixed by #4714
Labels
enhancement New feature or request

yahoNanJing commented Dec 20, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, the order of the physical plan optimizer rules does not take the dependencies between rules into account, which may cause a few issues. For example:

  • The CoalesceBatches rule may be influenced by the output of the Repartition rule. It's not necessary to add a CoalesceBatchesExec after a RepartitionExec.
  • The Repartition rule may change the partitioning from single to multiple, so the BasicEnforcement rule has to be run twice.
  • The BasicEnforcement rule is mixed with the selection of the global sort algorithm.

Describe the solution you'd like

It can be refined by the following steps:

  1. Extract the global sort algorithm selection from BasicEnforcement into a separate rule, GlobalSortSelection.
  2. Make the Repartition rule optional.
  3. Reorder the rules as follows:
    • AggregateStatistics
    • Repartition (optional)
    • GlobalSortSelection
    • JoinSelection
    • BasicEnforcement
    • CoalesceBatches (optional)
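
The proposed ordering can be sketched roughly as below. This is only an illustration, not DataFusion's actual API: the `Rule` enum, the `OptimizerOptions` struct, its two flags, and `ordered_rules` are all hypothetical names made up for this example.

```rust
/// Hypothetical labels for the rules named in this issue.
/// These are plain tags, not DataFusion's real rule types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Rule {
    AggregateStatistics,
    Repartition,
    GlobalSortSelection,
    JoinSelection,
    BasicEnforcement,
    CoalesceBatches,
}

/// Hypothetical switches for the two rules proposed as optional.
pub struct OptimizerOptions {
    pub enable_repartition: bool,
    pub enable_coalesce_batches: bool,
}

/// Builds the rule list in the order proposed above,
/// skipping the optional rules when they are disabled.
pub fn ordered_rules(opts: &OptimizerOptions) -> Vec<Rule> {
    let mut rules = vec![Rule::AggregateStatistics];
    if opts.enable_repartition {
        rules.push(Rule::Repartition);
    }
    rules.extend([
        Rule::GlobalSortSelection,
        Rule::JoinSelection,
        Rule::BasicEnforcement,
    ]);
    if opts.enable_coalesce_batches {
        rules.push(Rule::CoalesceBatches);
    }
    rules
}
```

With both flags set to `false` (for example, an engine like Ballista that does not use Repartition), the list reduces to the four mandatory rules in the same relative order.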

The reason for this ordering is as follows:

  • For Repartition, in order to increase parallelism, it changes the output partitioning of some operators in the plan tree, which influences other rules. Therefore, it should run as early as possible. It is made optional for two reasons: it is not used by the distributed engine, Ballista, and it conflicts with parts of BasicEnforcement, since it introduces additional repartitioning while BasicEnforcement aims to reduce unnecessary repartitioning.
  • For GlobalSortSelection, since it currently depends on the number of partitions to decide whether to change a single-node sort into a parallel local sort and merge, it should run after Repartition. Since it changes the output ordering of some operators, it should run before JoinSelection and BasicEnforcement, which may depend on that ordering.
  • For JoinSelection, based on statistics, it replaces the Auto mode with a concrete join implementation, such as collect left, hash join, or a future sort merge join. This choice influences whether BasicEnforcement adds additional repartitioning and local sorts to meet the distribution and ordering requirements. Therefore, it should run before BasicEnforcement.
  • For BasicEnforcement, the whole plan tree should already be determined before this rule runs.
  • For CoalesceBatches, it does not influence the distribution or ordering of the plan tree. Therefore, to avoid influencing other rules, it should run last.
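
The reasoning above amounts to a set of "must run before" constraints between rules. A minimal sketch of checking a candidate order against them follows; it is not part of DataFusion. The `MUST_PRECEDE` table and `ordering_violations` function are hypothetical, rules are identified by plain strings, and "CoalesceBatches runs last" is simplified to "after BasicEnforcement". A constraint is skipped when either rule is absent, which covers the optional rules.

```rust
/// Hypothetical (before, after) pairs taken from the reasoning in this issue.
static MUST_PRECEDE: [(&str, &str); 5] = [
    ("Repartition", "GlobalSortSelection"),
    ("GlobalSortSelection", "JoinSelection"),
    ("GlobalSortSelection", "BasicEnforcement"),
    ("JoinSelection", "BasicEnforcement"),
    // Simplification: "run last" is modeled as "after BasicEnforcement".
    ("BasicEnforcement", "CoalesceBatches"),
];

/// Returns every constraint that `order` violates.
/// Constraints involving a rule that is not in `order` are ignored.
pub fn ordering_violations(order: &[&str]) -> Vec<(&'static str, &'static str)> {
    let position = |name: &str| order.iter().position(|rule| *rule == name);
    MUST_PRECEDE
        .iter()
        .filter(|&&(before, after)| match (position(before), position(after)) {
            (Some(b), Some(a)) => b > a,
            _ => false,
        })
        .copied()
        .collect()
}
```

Passing the proposed order returns an empty list, while an order that places CoalesceBatches ahead of BasicEnforcement reports that pair as a violation.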

Describe alternatives you've considered

Additional context
