You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently the order of the physical plan optimizer rules does not consider the dependencies between rules, which may cause a few issues. For example,
The rule of CoalesceBatches may be influenced by the output of the rule of Repartition. It's not necessary to add a CoalesceBatchesExec after a RepartitionExec.
The Repartition may change the partition from single to multiple so that the rule of BasicEnforcement has to be run twice.
The BasicEnforcement is mixed with the selection of the global sort algorithm.
Describe the solution you'd like
It can be refined by the following steps:
Extract the global sort algorithm selection from the BasicEnforcement to be a separate rule, GlobalSortSelection.
Make the Repartition optional.
Reorder the rules as following:
AggregateStatistics
Repartition(optional)
GlobalSortSelection
JoinSelection
BasicEnforcement
CoalesceBatches(optional)
The reason for this ordering is as follows:
For Repartition, in order to increase the parallelism, it will change the output partitioning of some operators in the plan tree, which will influence other rules. Therefore, it should be run as soon as possible. The reason to make it optional is it's not used for the distributed engine, Ballista. And it's conflicted with some parts of the BasicEnforcement, since it will introduce additional repartitioning while the BasicEnforcement aims at reducing unnecessary repartitioning.
For GlobalSortSelection, since currently it will depend on the partition number to decide whether change the single node sort to parallel local sort and merge, it should be run after the Repartition. Since it will change the output ordering of some operators, it should be run before JoinSelection and BasicEnforcement, which may depend on that.
For JoinSelection, based on statistics, it will change the Auto mode to real join implementation, like collect left, or hash join, or future sort merge join, which will influence the BasicEnforcement to decide whether to add additional repartition and local sort to meet the distribution and ordering requirements. Therefore, it should be run before BasicEnforcement.
For BasicEnforcement, before run this rule, please make sure that the whole plan tree is determined.
For CoalesceBatches, it will not influence the distribution and ordering of the whole plan tree. Therefore, to avoid influencing other rules, it should be run at last.
Describe alternatives you've considered
Additional context
The text was updated successfully, but these errors were encountered:
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently the order of the physical plan optimizer rules does not consider the dependencies between rules, which may cause a few issues. For example,
CoalesceBatches
may be influenced by the output of the rule ofRepartition
. It's not necessary to add aCoalesceBatchesExec
after aRepartitionExec
.Repartition
may change the partition from single to multiple so that the rule ofBasicEnforcement
has to be run twice.BasicEnforcement
is mixed with the selection of the global sort algorithm.Describe the solution you'd like
It can be refined by the following steps:
BasicEnforcement
to be a separate rule,GlobalSortSelection
.Repartition
optional.The reason for this ordering is as follows:
Repartition
, in order to increase the parallelism, it will change the output partitioning of some operators in the plan tree, which will influence other rules. Therefore, it should be run as soon as possible. The reason to make it optional is it's not used for the distributed engine, Ballista. And it's conflicted with some parts of theBasicEnforcement
, since it will introduce additional repartitioning while theBasicEnforcement
aims at reducing unnecessary repartitioning.GlobalSortSelection
, since currently it will depend on the partition number to decide whether change the single node sort to parallel local sort and merge, it should be run after theRepartition
. Since it will change the output ordering of some operators, it should be run beforeJoinSelection
andBasicEnforcement
, which may depend on that.JoinSelection
, based on statistics, it will change the Auto mode to real join implementation, like collect left, or hash join, or future sort merge join, which will influence theBasicEnforcement
to decide whether to add additional repartition and local sort to meet the distribution and ordering requirements. Therefore, it should be run beforeBasicEnforcement
.BasicEnforcement
, before run this rule, please make sure that the whole plan tree is determined.CoalesceBatches
, it will not influence the distribution and ordering of the whole plan tree. Therefore, to avoid influencing other rules, it should be run at last.Describe alternatives you've considered
Additional context
The text was updated successfully, but these errors were encountered: