Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support query block level no_decorrelate() hint #37789

Closed
time-and-fate opened this issue Sep 13, 2022 · 1 comment · Fixed by #37633
Closed

Support query block level no_decorrelate() hint #37789

time-and-fate opened this issue Sep 13, 2022 · 1 comment · Fixed by #37633
Labels
sig/planner SIG: Planner type/enhancement The issue or PR belongs to an enhancement.

Comments

@time-and-fate
Copy link
Member

Enhancement

Currently, the decorrelation in tidb is rule-based, not cost-based. But decorrelation doesn't guarantee to generate a better plan. So there are some cases that we need tidb not decorrelate a subquery but tidb directly decorrelates it, and generates a bad plan.

A typical case is like this: The outer query has a very small row count, and the subquery has an index on the correlated column. But there are some other operators (like aggregate) above the table in the subquery, which makes the IndexJoin inapplicable. If tidb decorrelates it, there would be a full scan in the subquery. However, we could have used the index to access only several rows from the table in the subquery.
Example:

create table t1(a int, b int);
create table t2(a int, b int, index ia(a));
explain select a > (select sum(b) from t2 where a = t1.b) from t1;
+------------------------------------+----------+-----------+---------------+-------------------------------------------------------------------------------------------+
| id                                 | estRows  | task      | access object | operator info                                                                             |
+------------------------------------+----------+-----------+---------------+-------------------------------------------------------------------------------------------+
| Projection_10                      | 10000.00 | root      |               | gt(cast(test.t1.a, decimal(20,0) BINARY), Column#10)->Column#11                           |
| └─HashJoin_11                      | 10000.00 | root      |               | left outer join, equal:[eq(test.t1.b, test.t2.a)]                                         |
|   ├─HashAgg_22(Build)              | 7992.00  | root      |               | group by:test.t2.a, funcs:sum(Column#13)->Column#10, funcs:firstrow(test.t2.a)->test.t2.a |
|   │ └─TableReader_23               | 7992.00  | root      |               | data:HashAgg_15                                                                           |
|   │   └─HashAgg_15                 | 7992.00  | cop[tikv] |               | group by:test.t2.a, funcs:sum(test.t2.b)->Column#13                                       |
|   │     └─Selection_21             | 9990.00  | cop[tikv] |               | not(isnull(test.t2.a))                                                                    |
|   │       └─TableFullScan_20       | 10000.00 | cop[tikv] | table:t2      | keep order:false, stats:pseudo                                                            |
|   └─TableReader_14(Probe)          | 10000.00 | root      |               | data:TableFullScan_13                                                                     |
|     └─TableFullScan_13             | 10000.00 | cop[tikv] | table:t1      | keep order:false, stats:pseudo                                                            |
+------------------------------------+----------+-----------+---------------+-------------------------------------------------------------------------------------------+

If the row count of t1 is very small and a = t1.b could filter most rows of data, row-by-row executed Apply will be much faster than the full scan + hash join strategy.

Though cost-based decorrelation is not easy for current tidb, we can provide a method to control this behavior manually.
Currently, there is only a global optimization rule blocklist to control this behavior:
https://docs.pingcap.com/tidb/dev/blocklist-control-plan#the-blocklist-of-optimization-rules-and-expression-pushdown
It's very hard to use and not friendly.
To make it easier to control, we can implement a hint like other cost-based choices. To provide more flexibility, we can make this hint a query block level hint so that we can specifically choose which Apply we don't want to decorrelate.

@time-and-fate time-and-fate added type/enhancement The issue or PR belongs to an enhancement. sig/planner SIG: Planner labels Sep 13, 2022
@time-and-fate
Copy link
Member Author

Design

  1. Make the parser support no_decorrelate() hint.
  2. Add 2 new fields to PlanBuilder to pass information between handleXXXSubquery() and buildSelect()
    One field tells buildSelect() whether we are handling a subquery. Another field tells handleXXXSubquery() whether there are valid hints in the subquery.
  3. Check the validity: The hint is invalid if we are not handling a subquery, or there are no correlated columns for this subquery.
    If the hint is invalid, report a warning.
  4. Add a new field to LogicalApply. Mark this field in handleXXXSubquery() if there is a valid hint.
  5. During decorrelation, skip marked LogicalApply.

@VelocityLight VelocityLight added affects-5.0 This bug affects 5.0.x versions. and removed affects-5.0 This bug affects 5.0.x versions. labels Dec 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/planner SIG: Planner type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants