colexec: add default aggregate function #52174
The comparison against the wrapped rowexec processor is rather positive:
And the absolute speeds are pretty good as well: string_agg. Note that the numbers are somewhat inflated when comparing against other optimized aggregate functions because |
Only the last two commits belong in this PR.
Reviewed 17 of 17 files at r3, 15 of 26 files at r4.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)
pkg/sql/colexec/aggregate_funcs.go, line 25 at r3 (raw file):
) // isAggOptimized returns whether aggFn has optimized implementation.
nit: s/has/has an/
pkg/sql/colexec/aggregate_funcs.go, line 195 at r3 (raw file):
funcAllocs := make([]aggregateFuncAlloc, len(spec.Aggregations)) var toClose Closers var idxsToConvert util.FastIntSet
Is this over-optimization? If so, I would just use `[]int`. If not, see if `newVecToDatumConverter` can use `FastIntSet` to avoid having to convert?
pkg/sql/colexec/aggregate_funcs.go, line 299 at r3 (raw file):
if err != nil { return nil, nil, toClose, err
Are closers closed on error?
pkg/sql/colexec/aggregate_funcs.go, line 345 at r3 (raw file):
inputTypes []*types.T, ) ( constructors []execinfrapb.AggregateConstructor,
I don't think we need to use named return variables here (you're initializing pretty much each one except the error anyway).
pkg/sql/colexec/count_agg_tmpl.go, line 78 at r4 (raw file):
) { var i int // Remove unused warning.
🤔 what's the cause of this unused warning?
pkg/sql/colexec/default_agg_tmpl.go, line 118 at r3 (raw file):
// 'convertedTupleIdx'. These indices are the same when there is no selection // vector but could be different if there is one. func _ADD_TUPLE(
Any chance we can use @jordanlewis' new templating framework for this and other things in this file?
pkg/sql/colexec/hash_aggregator.go, line 366 at r3 (raw file):
func (op *hashAggregator) Close(ctx context.Context) error { op.toClose.CloseAndLogOnErr(ctx, "hash-aggregator")
I think that `CloseAndLogOnErr` should be used when we cannot/don't want to return an error, which I think is not the case here. We could maybe decorate the error with the fact that we're closing from the hash aggregator, but I don't think we should swallow the error.
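To make the distinction in this comment concrete, here is a minimal sketch of the two closing styles. All names (`closer`, `closers`, `hashAgg`, `failing`) are simplified stand-ins invented for illustration, not the actual colexec types, and the real `Closers` API may differ.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// closer and closers are hypothetical, simplified stand-ins.
type closer interface {
	Close() error
}

type closers []closer

// Close closes every element and returns the first error encountered.
func (c closers) Close() error {
	var firstErr error
	for _, x := range c {
		if err := x.Close(); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}

// closeAndLogOnErr swallows the error after logging it. It is meant only
// for callers that have no way of returning an error.
func (c closers) closeAndLogOnErr(prefix string) {
	if err := c.Close(); err != nil {
		log.Printf("%s: error closing: %v", prefix, err)
	}
}

// hashAgg is a hypothetical operator that owns some closers.
type hashAgg struct {
	toClose closers
}

// Close propagates the error, decorated with the name of the operator.
func (h *hashAgg) Close() error {
	if err := h.toClose.Close(); err != nil {
		return fmt.Errorf("hash-aggregator: %w", err)
	}
	return nil
}

// failing is a closer that always returns an error.
type failing struct{}

func (failing) Close() error { return errors.New("boom") }

func main() {
	h := &hashAgg{toClose: closers{failing{}}}
	fmt.Println(h.Close())
}
```

Because `Close` on the operator already returns an `error`, wrapping with `%w` keeps the underlying error inspectable by the caller instead of hiding it in the logs.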
pkg/sql/colexec/utils_test.go, line 1439 at r3 (raw file):
} func (c *chunkingBatchSource) reset(context.Context) {
Add a `var _ resetter` assertion above?
pkg/sql/colexec/colbuilder/execplan.go, line 698 at r3 (raw file):
) } else { evalCtx.SingleDatumAggMemAccount = streamingMemAccount
Some aggregate functions like `ARRAY_AGG` are not streaming, so it is no longer true that hash = buffering, ordered = streaming. We might need to do something different here.
pkg/sql/distsql/columnar_operators_test.go, line 60 at r3 (raw file):
var da sqlbase.DatumAlloc // We need +1 because an entry for index=6 was omitted by mistake.
🤔
pkg/sql/distsql/columnar_operators_test.go, line 202 at r3 (raw file):
// on. continue // TODO(yuzefovich): here is a more tight condition,
Don't think so. Do we check the case where one returns an error but the other doesn't?
pkg/sql/distsql/columnar_utils_test.go, line 79 at r3 (raw file):
if rng.Float64() < 0.5 { randomBatchSize := 1 + rng.Intn(3) fmt.Printf("coldata.BatchSize() is set to %d\n", randomBatchSize)
Was this not useful?
pkg/sql/execinfrapb/processors.go, line 143 at r3 (raw file):
// AggregateFuncToNumArguments maps aggregate functions to the number of // arguments they take. var AggregateFuncToNumArguments = map[AggregatorSpec_Func]int{
How do we keep this up to date? If this is used only by testing code, I would rather keep it close to the tests (i.e. not in `execinfrapb`).
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)
pkg/sql/colexec/aggregate_funcs.go, line 195 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Is this over-optimization? If so, I would just use `[]int`. If not, see if `newVecToDatumConverter` can use `FastIntSet` to avoid having to convert?
Done.
pkg/sql/colexec/aggregate_funcs.go, line 299 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Are closers closed on error?
It's the same behavior as for all other `Closer`s. I've just checked `vectorizedFlowCreator.setupFlow` and no, they won't get closed. But this function will return an error only when we're trying to create an aggregate function on an unsupported type, and AFAIK we now fully support all optimized functions and have this default unoptimized one, so if an error occurs, then something has gone really wrong.
pkg/sql/colexec/aggregate_funcs.go, line 345 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
I don't think we need to use named return variables here (you're initializing pretty much each one except the error anyway).
I added named return variables in order to document the code better and make it easier to use this function. Unless you have a strong objection, I'll keep it this way.
pkg/sql/colexec/count_agg_tmpl.go, line 78 at r4 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
🤔 what's the cause of this unused warning?
It was a case of hash aggregation when we don't need to pay attention to nulls. I updated the template to generate more efficient code for that case.
pkg/sql/colexec/default_agg_tmpl.go, line 118 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Any chance we can use @jordanlewis' new templating framework for this and other things in this file?
I think the updated execgen doesn't support everything needed for all aggregate functions, so I don't think it's worth spending time on figuring out whether we could implement this particular template in that framework (some complications are that we have the `if eq "_AGGKIND" "Ordered"` condition that all aggregate function templates have, and that condition is handled on the "meta" level). I'm pretty sure it'll be easier to update all of the files at once.
pkg/sql/colexec/hash_aggregator.go, line 366 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
I think that `CloseAndLogOnErr` should be used when we cannot/don't want to return an error, which I think is not the case here. We could maybe decorate the error with the fact that we're closing from the hash aggregator, but I don't think we should swallow the error.
Actually, the aggregators will never return an error here because `defaultHashAggAlloc` implements the `Closer` interface, and in its `Close` method it calls the `tree.AggregateFunc.Close` method, which doesn't return an error.
Left a clarifying comment.
pkg/sql/colexec/utils_test.go, line 1439 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Add a `var _ resetter` assertion above?
Done.
pkg/sql/colexec/colbuilder/execplan.go, line 698 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Some aggregate functions like `ARRAY_AGG` are not streaming, so it is no longer true that hash = buffering, ordered = streaming. We might need to do something different here.
Such functions create and manage their own memory accounts. `SingleDatumAggMemAccount` is shared by all aggregate functions that need to store roughly a single datum.
pkg/sql/distsql/columnar_operators_test.go, line 60 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
🤔
Yeah, I know. Apparently, that's an artifact that has been present since 1.0.
pkg/sql/distsql/columnar_operators_test.go, line 202 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Don't think so. Do we check the case where one returns an error but the other doesn't?
Removed. Yeah, that case is checked separately and it'll have a different error message ("different number of metas returned").
pkg/sql/distsql/columnar_utils_test.go, line 79 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Was this not useful?
Yeah, I think it wasn't useful, just clogging up the output.
pkg/sql/execinfrapb/processors.go, line 143 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
How do we keep this up to date? If this is used only by testing code, I would rather keep it close to the tests (i.e. not in `execinfrapb`).
I added a note to `processors_sql.proto` which should help with keeping the map up to date.
I agree, though, the map probably doesn't have to live in `execinfrapb`, so I moved it next to the tests. I think originally I was thinking of using the map somewhere else as well, but I don't remember what that was for.
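One way to complement the note in the proto file is a check that fails when an enum value has no entry in the map. The sketch below is only an illustration: `aggFunc`, its constants, the `numAggFuncs` sentinel, and `missingAggFuncs` are all hypothetical and do not reflect how `AggregatorSpec_Func` is actually generated.

```go
package main

import "fmt"

// aggFunc is a hypothetical stand-in for the aggregate function enum.
type aggFunc int

const (
	anyNotNull aggFunc = iota
	avg
	corr
	numAggFuncs // sentinel: must stay last
)

// aggFuncToNumArguments maps each aggregate function to its argument count.
var aggFuncToNumArguments = map[aggFunc]int{
	anyNotNull: 1,
	avg:        1,
	corr:       2,
}

// missingAggFuncs returns every enum value without an entry in the map.
func missingAggFuncs() []aggFunc {
	var missing []aggFunc
	for fn := aggFunc(0); fn < numAggFuncs; fn++ {
		if _, ok := aggFuncToNumArguments[fn]; !ok {
			missing = append(missing, fn)
		}
	}
	return missing
}

func main() {
	fmt.Println(len(missingAggFuncs()))
}
```

A test calling such a helper would flag a newly added function whose map entry was forgotten.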
Reviewed 10 of 15 files at r1, 52 of 52 files at r5, 8 of 8 files at r6, 18 of 18 files at r7, 26 of 26 files at r8.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/sql/colexec/aggregate_funcs.go, line 299 at r3 (raw file):
Previously, yuzefovich wrote…
It's the same behavior as for all other `Closer`s. I've just checked `vectorizedFlowCreator.setupFlow` and no, they won't get closed. But this function will return an error only when we're trying to create an aggregate function on an unsupported type, and AFAIK we now fully support all optimized functions and have this default unoptimized one, so if an error occurs, then something has gone really wrong.
Fair enough, but maybe we shouldn't return `toClose` if we encounter an error.
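A minimal sketch of what this suggestion could look like: on a construction failure, close whatever was already created and hand back nothing to the caller. The `resource` type, `newResource`, `newAll`, and the `closedNames` bookkeeping are hypothetical and exist only to make the example observable; the actual change in the PR may be simpler (for instance, returning `nil` without closing).

```go
package main

import (
	"errors"
	"fmt"
)

// closedNames records which resources were closed, for illustration only.
var closedNames []string

// resource is a hypothetical stand-in for an object that must be closed.
type resource struct {
	name string
}

func (r *resource) Close() {
	closedNames = append(closedNames, r.name)
}

// newResource is a hypothetical constructor that fails for one input.
func newResource(name string) (*resource, error) {
	if name == "unsupported" {
		return nil, errors.New("unsupported type")
	}
	return &resource{name: name}, nil
}

// newAll creates one resource per name. On failure it closes the resources
// created so far and returns nil, so the caller has nothing left to close.
func newAll(names []string) ([]*resource, error) {
	var created []*resource
	for _, n := range names {
		r, err := newResource(n)
		if err != nil {
			for _, c := range created {
				c.Close()
			}
			return nil, err
		}
		created = append(created, r)
	}
	return created, nil
}

func main() {
	res, err := newAll([]string{"a", "b", "unsupported"})
	fmt.Println(res == nil, err, closedNames)
}
```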
pkg/sql/colexec/hash_aggregator.go, line 366 at r3 (raw file):
Previously, yuzefovich wrote…
Actually, the aggregators will never return an error here because `defaultHashAggAlloc` implements the `Closer` interface, and in its `Close` method it calls the `tree.AggregateFunc.Close` method, which doesn't return an error.
Left a clarifying comment.
Even if the current implementations don't return an error, future ones might, or the code might change to do so. I think it's more sane to propagate whatever error happens up if we can. This is partly why I didn't want to implement `CloseAndLogOnErr`, because it should be pretty rare that you want to swallow the error. I think I'm also guilty of overusing `CloseAndLogOnErr`.
pkg/sql/colexec/ordered_aggregator.go, line 339 at r7 (raw file):
// tree.AggregateFunc.Close doesn't return an error, and that's why it's ok // to "swallow" errors here because they won't actually occur. a.toClose.CloseAndLogOnErr(ctx, "ordered-aggregator")
Ditto. Please also add a comment on `CloseAndLogOnErr` that specifies that one should only use it if returning an error doesn't make sense.
pkg/sql/colexec/colbuilder/execplan.go, line 698 at r3 (raw file):
Previously, yuzefovich wrote…
Such functions create and manage their own memory accounts. `SingleDatumAggMemAccount` is shared by all aggregate functions that need to store roughly a single datum.
OK. I guess my concern was re `IsStreaming` below, but I defer to you.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)
pkg/sql/colexec/aggregate_funcs.go, line 299 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Fair enough, but maybe we shouldn't return `toClose` if we encounter an error.
Done. I don't think it's important though.
pkg/sql/colexec/hash_aggregator.go, line 366 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Even if the current implementations don't return an error, future ones might, or the code might change to do so. I think it's more sane to propagate whatever error happens up if we can. This is partly why I didn't want to implement `CloseAndLogOnErr`, because it should be pretty rare that you want to swallow the error. I think I'm also guilty of overusing `CloseAndLogOnErr`.
Ok, fair enough, updated.
pkg/sql/colexec/ordered_aggregator.go, line 339 at r7 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Ditto. Please also add a comment on `CloseAndLogOnErr` that specifies that one should only use it if returning an error doesn't make sense.
Done.
pkg/sql/colexec/colbuilder/execplan.go, line 698 at r3 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
OK. I guess my concern was re `IsStreaming` below, but I defer to you.
I think that even with functions such as `array_agg`, the ordered aggregator should still be considered "streaming".
I agree that it becomes a little confusing, but I also think that it's not very important to be exactly correct here: aggregate functions like this perform the memory accounting, so if we reach the memory limit, an error will be returned. However, this will also be the case for `rowexec.orderedAggregator`, so there is not much benefit in prohibiting such aggregate functions with `vectorize=201auto` (which is the only case when `IsStreaming` matters).
🎉 All dependencies have been resolved ! |
Reviewed 34 of 34 files at r9, 26 of 26 files at r10.
Reviewable status:complete! 1 of 0 LGTMs obtained
This commit introduces "default" aggregate function which is an adapter from `tree.AggregateFunc` to `colexec.aggregateFunc`. It works as follows:
- the aggregator (either hash or ordered) is responsible for converting all necessary vectors to `tree.Datum` columns before calling `Compute` (this allows us to share the conversion between multiple functions if they happened to take in the same columns and between multiple groups in case of the hash aggregator)
- the default aggregate function populates "arguments" to be passed into the wrapped `tree.AggregateFunc` and adds them
- when the new group is encountered, the result so far is flushed and the wrapped `tree.AggregateFunc` is reset.
One detail is that these wrapped `tree.AggregateFunc`s need to be closed, and currently that responsibility lies with the alloc object that is creating them. In the future, we might want to shift the responsibility to the aggregators.
Release note: None
Hash aggregate functions always have a non-nil `sel`, and this commit removes the code generation for the nil `sel` case (meaning it removes the dead code). It also templates out the nulls vs no-nulls cases in the `bool_and` and `bool_or` aggregates.
Release note: None
TFTR! bors r+ |
Build succeeded: |
Depends on #51337.
Depends on #52315.
colexec: add default aggregate function
This commit introduces "default" aggregate function which is an adapter from `tree.AggregateFunc` to `colexec.aggregateFunc`. It works as follows:
- the aggregator (either hash or ordered) is responsible for converting all necessary vectors to `tree.Datum` columns before calling `Compute` (this allows us to share the conversion between multiple functions if they happened to take in the same columns and between multiple groups in case of the hash aggregator)
- the default aggregate function populates "arguments" to be passed into the wrapped `tree.AggregateFunc` and adds them
- when the new group is encountered, the result so far is flushed and the wrapped `tree.AggregateFunc` is reset.
One detail is that these wrapped `tree.AggregateFunc`s need to be closed, and currently that responsibility lies with the alloc object that is creating them. In the future, we might want to shift the responsibility to the aggregators.
Addresses: #43561.
Release note: None
colexec: clean up hash aggregate functions
Hash aggregate functions always have a non-nil `sel`, and this commit removes the code generation for the nil `sel` case (meaning it removes the dead code). It also templates out the nulls vs no-nulls cases in the `bool_and` and `bool_or` aggregates.
Release note: None