need to support pulling aggregate up upon join #6895
Comments
Is the result still correct?
@winoros The result is correct, and a merge join would not improve performance. The main overhead here is that we aggregate a table with 60 million rows, and the aggregate produces 2 million rows, which takes a lot of time. Instead, we should first do an outer join, which can produce a smaller result set.
This rewrite is not reasonable. In most cases, a rule should operate on a small scope, usually just a plan and its children. It should consider neither the plan's parent nor the children of its children.
@winoros You are right, a rule only needs to consider a piece of the operator tree. But here I only meant to show that we need to support the "join then aggregate" operation, so that is the point to focus on. The plan I showed in this case can be generated through a combination of other rules.
And yes, before applying a rule, we need to check that the operator trees before and after the rule are equivalent. And as you said, we may need to take data distribution, column indexes, etc. into consideration. For these subquery rewriting methods, I think we can learn from "Orthogonal Optimization of Subqueries and Aggregation".
@zz-jason
The original query 17 in the TPC-H benchmark is:
The physical plan generated by the TiDB optimizer is:

We handle the subquery with the "aggregate (HashAgg_39) then join (HashLeftJoin_45)" method. This query runs for about 3 minutes on my computer with scale factor 10.

In fact, we can pull the aggregate HashAgg_39 up and handle this subquery with the "join then aggregate" method, which means the subquery can be modified to the following form:

The corresponding execution plan for this modified query is:

The modified version runs in only 45 seconds and produces the same result as the original.
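To make the equivalence concrete, here is a minimal sketch on toy data using Python's built-in sqlite3 module, not TiDB. It assumes the standard shape of TPC-H query 17 with the validation parameters 'Brand#23' and 'MED BOX'. The table contents, the decorrelated "aggregate then join" form, and the "join then aggregate" form are illustrative assumptions and are not necessarily the exact queries or plans from this issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE part (p_partkey INTEGER PRIMARY KEY, p_brand TEXT, p_container TEXT);
CREATE TABLE lineitem (
    l_orderkey INTEGER, l_linenumber INTEGER, l_partkey INTEGER,
    l_quantity REAL, l_extendedprice REAL,
    PRIMARY KEY (l_orderkey, l_linenumber)
);
-- Toy rows (made up for illustration, not TPC-H data).
INSERT INTO part VALUES
    (1, 'Brand#23', 'MED BOX'), (2, 'Brand#23', 'MED BOX'), (3, 'Brand#11', 'LG BOX');
INSERT INTO lineitem VALUES
    (1, 1, 1, 1, 100.0), (1, 2, 1, 10, 1000.0), (2, 1, 1, 19, 1900.0),
    (2, 2, 2, 2, 70.0),  (3, 1, 2, 48, 4800.0),
    (3, 2, 3, 1, 50.0),  (4, 1, 3, 99, 990.0);
""")

# Q17-shaped query with a correlated scalar subquery.
CORRELATED = """
SELECT SUM(l_extendedprice) / 7.0
FROM lineitem, part
WHERE p_partkey = l_partkey
  AND p_brand = 'Brand#23' AND p_container = 'MED BOX'
  AND l_quantity < (SELECT 0.2 * AVG(l_quantity)
                    FROM lineitem WHERE l_partkey = p_partkey)
"""

# "Aggregate then join": aggregate all of lineitem first, then join.
AGG_THEN_JOIN = """
SELECT SUM(l.l_extendedprice) / 7.0
FROM lineitem AS l
JOIN part AS p ON p.p_partkey = l.l_partkey
JOIN (SELECT l_partkey AS agg_partkey, 0.2 * AVG(l_quantity) AS threshold
      FROM lineitem GROUP BY l_partkey) AS t ON t.agg_partkey = p.p_partkey
WHERE p.p_brand = 'Brand#23' AND p.p_container = 'MED BOX'
  AND l.l_quantity < t.threshold
"""

# "Join then aggregate": outer join the filtered rows with lineitem,
# then aggregate per outer row (grouped by the outer row's key).
JOIN_THEN_AGG = """
SELECT SUM(l_extendedprice) / 7.0
FROM (SELECT l1.l_extendedprice AS l_extendedprice
      FROM lineitem AS l1
      JOIN part AS p ON p.p_partkey = l1.l_partkey
      LEFT JOIN lineitem AS l2 ON l2.l_partkey = p.p_partkey
      WHERE p.p_brand = 'Brand#23' AND p.p_container = 'MED BOX'
      GROUP BY l1.l_orderkey, l1.l_linenumber, l1.l_quantity, l1.l_extendedprice
      HAVING l1.l_quantity < 0.2 * AVG(l2.l_quantity))
"""

results = {
    "correlated": conn.execute(CORRELATED).fetchone()[0],
    "aggregate then join": conn.execute(AGG_THEN_JOIN).fetchone()[0],
    "join then aggregate": conn.execute(JOIN_THEN_AGG).fetchone()[0],
}
for name, value in results.items():
    print(name, value)
```

Grouping by the outer row's key is what keeps the pulled-up aggregate equivalent: each outer lineitem row still gets exactly one average, computed over the lineitem rows of its part.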
However, this rule is not guaranteed to always generate a better plan, so we should take cost into consideration: when enumerating physical plans, we should consider not only the implementation rules (which physical operator to use, for example choosing between hash join and merge join) but also the transformation rules, such as aggregate push down and aggregate pull up.
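As a rough sketch of what a cost-based choice between the two shapes could look like, the toy model below charges each operator for the rows fed into it and picks the cheaper alternative. Everything here is hypothetical: the cost formula, the operator lists, the uniform distribution of lineitem rows over parts, and the figure of 2,000 selected parts are assumptions for illustration only, and none of it reflects TiDB's actual cost model. Only the 60 million and 2 million row counts come from this thread.

```python
def plan_cost(operators):
    """Toy cost: total number of rows fed into the plan's operators."""
    return sum(rows for _name, rows in operators)


def candidate_plans(lineitem_rows, parts_total, parts_selected):
    """Build the two alternative plans as lists of (operator, input rows)."""
    # Assumption: lineitem rows are spread uniformly over all parts.
    rows_per_part = lineitem_rows // parts_total
    # lineitem rows that survive the join with the filtered part table.
    outer_rows = parts_selected * rows_per_part
    return {
        "aggregate then join": [
            ("aggregate over all of lineitem", lineitem_rows),
            ("join build side, one row per part", parts_total),
            ("join probe side", outer_rows),
        ],
        "join then aggregate": [
            ("join of outer rows with lineitem", outer_rows + lineitem_rows),
            ("aggregate over the join output", outer_rows * rows_per_part),
        ],
    }


def choose_plan(lineitem_rows, parts_total, parts_selected):
    """Return the name of the cheapest candidate under the toy cost."""
    plans = candidate_plans(lineitem_rows, parts_total, parts_selected)
    return min(plans, key=lambda name: plan_cost(plans[name]))


# A selective filter on part (hypothetical: 2,000 parts survive).
selective = choose_plan(60_000_000, 2_000_000, 2_000)
# No filter at all: every part survives.
unselective = choose_plan(60_000_000, 2_000_000, 2_000_000)
print(selective)
print(unselective)
```

Under these assumptions the selective case favors "join then aggregate", while the unfiltered case favors "aggregate then join" because the join output that has to be aggregated becomes much larger than lineitem itself. That flip is the reason the pull-up should be one alternative weighed by cost rather than a rewrite that is always applied.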