-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
planner, expression: support multi-distinct agg under MPP mode #39973
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling The full list of commands accepted by this bot can be found here. Reviewer can indicate their review by submitting an approval review. |
add another commit to adjust the stats estimated
when split mppScalar agg into as follows: what we need to do is to adjust the partial agg, middle proj, partial proj, and the exchanger between them. some notes here:
|
return false, nil | ||
} | ||
for _, arg := range fun.Args { | ||
// bail out when args are not simple column, see GitHub issue #35417 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does that issue affect this branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied from hasheng's pr in single distinct-agg. @fixdb have some memory about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In single distinct agg's case, if the distinct key is not a simple column, planner will generate a shuffle with empty key. But In this PR, @AilinKid will project a column for complex expression, so I think the comment doesn't apply here if the projection is done before optimization.
84677f9
to
8748dad
Compare
/run-unit-test |
// for every group set,try to find its position to insert if possible,otherwise create a new grouping set. | ||
for j := len(oneNewGroupingSet) - 1; j >= 0; j-- { | ||
cur := oneNewGroupingSet[j] | ||
if targetOne.SubSetOf(cur) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the targetOne is super set of cur and j=0, then it will create the new GroupingSet at the line:126.
For example:
targetOne is [a,b]
gss is [
[[a],]
]
The result will be
gss [
[[a]]
[[a,b]]
]
The result is not expected by your comment. Are you sure this logic is correctly ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great catch, fixed~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, LGTM
planner/core/task.go
Outdated
|
||
// canUse3Stage4MultiDistinctAgg returns true if this agg can use 3 stage for multi distinct aggregation | ||
func (p *basePhysicalAgg) canUse3Stage4MultiDistinctAgg() (can bool, gss expression.GroupingSets) { | ||
if !p.ctx.GetSessionVars().Enable3StageDistinctAgg || len(p.GroupByItems) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we don't support multi distinct with group by ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the next step if needed
return partialAgg, finalAgg | ||
} | ||
|
||
func (p *basePhysicalAgg) scale3StageForDistinctAgg() (bool, expression.GroupingSets) { | ||
if p.canUse3Stage4SingleDistinctAgg() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this logic is more clearly ?
if (!Enable3StageDistinctAgg) {
return false
}
distinct_num = 0;
if (distinct_num == 1) {
canUse3Stage4SingleDistinctAgg();
} else {
canUse3Stage4MultiDistinctAgg();
}
(BTW, this is just my personal opinion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, switch judgment can be taken out here, while the distinct_num can only be scratched inside
return false, nil | ||
} | ||
for _, arg := range fun.Args { | ||
// bail out when args are not simple column, see GitHub issue #35417 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In single distinct agg's case, if the distinct key is not a simple column, planner will generate a shuffle with empty key. But In this PR, @AilinKid will project a column for complex expression, so I think the comment doesn't apply here if the projection is done before optimization.
// groupingID now is the offset of target grouping in GroupingSets. | ||
// todo: it may be changed after grouping set merge in the future. | ||
fun.GroupingID = int64(len(GroupingSets)) | ||
} else if len(fun.Args) > 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really want to bail out here? I see there are a lot of code dealing with group with multiple columns in GroupingSet
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this bailout cases like normal agg with multi args:
sum(a+b) has only one arg. --- scalarFunc(a,b). admitted
group_concat(x order by y) has two arg here. banned now
And some JSON agg will have complicated multi args too. banned now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can some comments here to make it clearer
planner/core/task.go
Outdated
// even for distinct agg multi args like count(distinct a,b), either null of a,b will cause count func to skip this row. | ||
// so we only need to set the either one arg of them to be a case when expression to target for its self-group: | ||
// Expr = (case when groupingID = ? then arg else null end) | ||
// count(distinct a,b) => count(distinct (case when groupingID = targetID then a else null end), b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I remembered we don't need the case when
here, at least for built-in agg functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a standard way, actually, we can remove this projection to let distinct-agg and normal-agg aggregate on whole scope data. (null rows included but not counted for these null-strict built aggs), what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. count(*) is an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
count(*) has already been rewritten as count(1) when entering here
] | ||
}, | ||
{ | ||
"SQL": "select count(distinct a, b), count(distinct b), count(c), sum(d) from t", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't support it at the moment, is it because TiFlash doesn't support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it needs some changes: Expand op produces non-fixed schema length(currently just append one more LongLong column --- groupingID), I will do it next step. Otherwise, this TiFlash pr will become huge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood.
planner/core/task.go
Outdated
|
||
// step2: adjust middle agg | ||
// proj4Middle output all the base col from lower op + caseWhen proj cols.(reuse partial agg stats) | ||
proj4Middle := new(PhysicalProjection).Init(p.ctx, partialHashAgg.stats, partialHashAgg.SelectBlockOffset()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is no non-distinct agg function (all of them are distinct functions), do we still need the projection operator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
proj4Partial: project for normal-agg arg with case-when
proj4Middle: project for distinct-agg arg with case-when
as mentioned in the comment above, if do it in a more efficient/non-standard way, we can remove case-when to let both null-strict agg aggregate on whole scope data
new refactor, simplify the computation since all current TiDB aggregation is null-stricted, so we can just let all distinct aggregation aggregate on whole-scope data (including their expanded rows, which means no case-when expression to select targeted data), because those expanded rows are all filled with null values for your current targeted agg, these rows all naturally ignored in it's agg computation logic. for normal agg, cases like: select count(distinct a), count(distinct b), count(c) from t; since c is not one of the grouping column, count(c) will get a doubled-result when it aggregates on whole-scope data since column c is not filled with null in other grouping sets. So normal agg still needs case-when expression to targeted for their cared group data.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't taken a careful look at how we merge the grouping sets. Others parts LGTM
// groupingID now is the offset of target grouping in GroupingSets. | ||
// todo: it may be changed after grouping set merge in the future. | ||
fun.GroupingID = int64(len(GroupingSets)) | ||
} else if len(fun.Args) > 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can some comments here to make it clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
// for every group set,try to find its position to insert if possible,otherwise create a new grouping set. | ||
for j := len(oneNewGroupingSet) - 1; j >= 0; j-- { | ||
cur := oneNewGroupingSet[j] | ||
if targetOne.SubSetOf(cur) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, LGTM
Signed-off-by: AilinKid <[email protected]>
Signed-off-by: AilinKid <[email protected]>
Signed-off-by: AilinKid <[email protected]>
Signed-off-by: AilinKid <[email protected]>
Signed-off-by: AilinKid <[email protected]>
Signed-off-by: AilinKid <[email protected]>
Signed-off-by: AilinKid <[email protected]>
Signed-off-by: AilinKid <[email protected]>
Signed-off-by: AilinKid <[email protected]>
Signed-off-by: AilinKid <[email protected]>
Signed-off-by: AilinKid <[email protected]>
Signed-off-by: AilinKid <[email protected]>
f70d1e7
to
b353681
Compare
/merge |
This pull request has been accepted and is ready to merge. Commit hash: b353681
|
Signed-off-by: AilinKid <[email protected]>
/merge |
This pull request has been accepted and is ready to merge. Commit hash: d9f693b
|
/retest |
1 similar comment
/retest |
Signed-off-by: AilinKid [email protected]
What problem does this PR solve?
Issue Number: close #16581 #34704
What is changed and how it works?
1: implement grouping sets basic expression logic
2: implement multi-distinct agg's admission logic and 3-phase plan adjustment
3: implement a mock repeat source executor to help understand what's going on and even for test convenience.
4: implement the theoretically computed stats for partial agg, projection, and exchanger related. (details commented below)
tipb requirement: pingcap/tipb#283
tiflash optional: pingcap/tiflash#6545
For review friendly, some isolated parts can be extracted as alternative small ones. You can pick friendly here first.
Demo
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.