-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
colexec: implement function concat_agg for vectorized engine #51148
Conversation
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. I have added a few people who may be able to assist in reviewing: 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
18e6851
to
d483b56
Compare
Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this on! It's in a very good shape already, but I left some suggestions on how to simplify the code a bit.
Reviewed 11 of 11 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @yongyanglai)
pkg/sql/colexec/concat_agg_tmpl.go, line 32 at r1 (raw file):
// {{/* // Declarations to make the template compile properly
nit: missing periods here and in several places below.
pkg/sql/colexec/concat_agg_tmpl.go, line 50 at r1 (raw file):
// {{range .}} func (a *concat_TYPE_AGGKINDAgg) Init(groups []bool, vec coldata.Vec) {
I think we don't need to do templating on _TYPE
- we know that concat_agg
works only on tree.DBytes
and tree.DString
, and both types have the same physical representation, coldata.Bytes
, so I think it is safe to hard-code that we're operating on Bytes
vectors and _TYPE
entirely.
pkg/sql/colexec/concat_agg_tmpl.go, line 82 at r1 (raw file):
[]coldata.Vec{a.vec}, func() { previousAggValMemoryUsage := _AGG_VAL_MEMORY_USAGE(a.curAgg)
I think we need to do less explicit memory accounting here. The reason is that when execgen.SET
templated function call is within PerformOperation
, then we're already measuring the updated memory usage automatically. What I mean is that a.alloc.allocator.PerformOperation
itself will measure the memory footprint of a.vec
before and after the operation is applied, so there is no need to do anything explicitly (in fact, what we have right now is double counting the memory usage).
As a result, I don't think we need this _AGG_VAL_MEMORY_USAGE
function at all.
pkg/sql/colexec/concat_agg_tmpl.go, line 121 at r1 (raw file):
} a.alloc.allocator.AdjustMemoryUsage(-(_AGG_VAL_MEMORY_USAGE(a.curAgg))) a.curAgg = zero_RET_TYPEValue
nit: no need to zero out since Flush
is called once the computation has been completed.
pkg/sql/colexec/concat_agg_tmpl.go, line 183 at r1 (raw file):
// {{/* func _ACCUMULATE_CONCAT( a *sum_SUMKIND_TYPE_AGGKINDAgg, nulls *coldata.Nulls, i int, _HAS_NULLS bool,
nit: s/sum_SUMKIND_TYPE_AGGKINDAgg/concat_TYPE_AGGKINDAgg/
.
pkg/sql/colexec/overloads_test_utils.eg.go, line 3303 at r1 (raw file):
// to go around "unused" error. _ = _overloadHelper
Our linter is complaining about this function that "r value is never used". Try this diff
diff --git a/pkg/sql/colexec/execgen/cmd/execgen/overloads_test_utils_gen.go b/pkg/sql/colexec/execgen/cmd/execgen/overloads_test_utils_gen.go
index 38c81e2dcc..2f0e030819 100644
--- a/pkg/sql/colexec/execgen/cmd/execgen/overloads_test_utils_gen.go
+++ b/pkg/sql/colexec/execgen/cmd/execgen/overloads_test_utils_gen.go
@@ -39,6 +39,7 @@ import (
// these utility test methods for datum-backed types.
// */}}
{{if and (not (eq .Left.VecMethod "Datum")) (not (eq .Right.VecMethod "Datum"))}}
+{{if and (not (eq .Left.VecMethod "Bytes")) (not (eq .Right.VecMethod "Bytes"))}}
func {{template "opName" .}}(a {{.Left.GoType}}, b {{.Right.GoType}}) {{.Right.RetGoType}} {
var r {{.Right.RetGoType}}
@@ -52,6 +53,7 @@ func {{template "opName" .}}(a {{.Left.GoType}}, b {{.Right.GoType}}) {{.Right.R
return r
}
+{{end}}
{{end}}
{{end}}
`
so that we don't generate any "perform" methods on bytes types (since those are not used, and I'll be removing this file shortly anyway).
pkg/sql/colexec/execgen/cmd/execgen/concat_agg_gen.go, line 1 at r1 (raw file):
// Copyright 2018 The Cockroach Authors.
nit: s/2018/2020/
.
pkg/sql/colexec/execgen/cmd/execgen/overloads_bin.go, line 265 at r1 (raw file):
caller, idx, err := parseNonIndexableTargetElem(targetElem) var setStatement string if err != nil && idx == "-1" {
I think there is no need for "-1" index because we always set to it to -1 when err
is non-nil, or am I missing something?
pkg/sql/logictest/testdata/logic_test/vectorize_agg, line 36 at r1 (raw file):
query TT SELECT concat_agg(_bytes), concat_agg(_string) FROM bytes_string
It would be nice to also run EXPLAIN (VEC)
on these SELECT queries to make sure that the vectorized engine is used.
433aecb
to
dc3c60d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/sql/colexec/concat_agg_tmpl.go, line 32 at r1 (raw file):
Previously, yuzefovich wrote…
nit: missing periods here and in several places below.
Done
pkg/sql/colexec/concat_agg_tmpl.go, line 50 at r1 (raw file):
Previously, yuzefovich wrote…
I think we don't need to do templating on
_TYPE
- we know thatconcat_agg
works only ontree.DBytes
andtree.DString
, and both types have the same physical representation,coldata.Bytes
, so I think it is safe to hard-code that we're operating onBytes
vectors and_TYPE
entirely.
Done
pkg/sql/colexec/concat_agg_tmpl.go, line 82 at r1 (raw file):
Previously, yuzefovich wrote…
I think we need to do less explicit memory accounting here. The reason is that when
execgen.SET
templated function call is withinPerformOperation
, then we're already measuring the updated memory usage automatically. What I mean is thata.alloc.allocator.PerformOperation
itself will measure the memory footprint ofa.vec
before and after the operation is applied, so there is no need to do anything explicitly (in fact, what we have right now is double counting the memory usage).As a result, I don't think we need this
_AGG_VAL_MEMORY_USAGE
function at all.
Done
pkg/sql/colexec/concat_agg_tmpl.go, line 121 at r1 (raw file):
Previously, yuzefovich wrote…
nit: no need to zero out since
Flush
is called once the computation has been completed.
Done
pkg/sql/colexec/concat_agg_tmpl.go, line 183 at r1 (raw file):
Previously, yuzefovich wrote…
nit:
s/sum_SUMKIND_TYPE_AGGKINDAgg/concat_TYPE_AGGKINDAgg/
.
Done
pkg/sql/colexec/overloads_test_utils.eg.go, line 3303 at r1 (raw file):
Previously, yuzefovich wrote…
Our linter is complaining about this function that "r value is never used". Try this diff
diff --git a/pkg/sql/colexec/execgen/cmd/execgen/overloads_test_utils_gen.go b/pkg/sql/colexec/execgen/cmd/execgen/overloads_test_utils_gen.go index 38c81e2dcc..2f0e030819 100644 --- a/pkg/sql/colexec/execgen/cmd/execgen/overloads_test_utils_gen.go +++ b/pkg/sql/colexec/execgen/cmd/execgen/overloads_test_utils_gen.go @@ -39,6 +39,7 @@ import ( // these utility test methods for datum-backed types. // */}} {{if and (not (eq .Left.VecMethod "Datum")) (not (eq .Right.VecMethod "Datum"))}} +{{if and (not (eq .Left.VecMethod "Bytes")) (not (eq .Right.VecMethod "Bytes"))}} func {{template "opName" .}}(a {{.Left.GoType}}, b {{.Right.GoType}}) {{.Right.RetGoType}} { var r {{.Right.RetGoType}} @@ -52,6 +53,7 @@ func {{template "opName" .}}(a {{.Left.GoType}}, b {{.Right.GoType}}) {{.Right.R return r } +{{end}} {{end}} {{end}} `
so that we don't generate any "perform" methods on bytes types (since those are not used, and I'll be removing this file shortly anyway).
Done
pkg/sql/colexec/execgen/cmd/execgen/concat_agg_gen.go, line 1 at r1 (raw file):
Previously, yuzefovich wrote…
nit:
s/2018/2020/
.
Done
pkg/sql/colexec/execgen/cmd/execgen/overloads_bin.go, line 265 at r1 (raw file):
Previously, yuzefovich wrote…
I think there is no need for "-1" index because we always set to it to -1 when
err
is non-nil, or am I missing something?
Done
This change has been removed since hard coded template doesn't need it
pkg/sql/logictest/testdata/logic_test/vectorize_agg, line 36 at r1 (raw file):
Previously, yuzefovich wrote…
It would be nice to also run
EXPLAIN (VEC)
on these SELECT queries to make sure that the vectorized engine is used.
Done
Thanks for reviews! The template is hard coded now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 11 of 11 files at r2.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @yongyanglai)
pkg/sql/colexec/concat_agg_tmpl.go, line 135 at r2 (raw file):
// memory usage of curAgg varies during aggregation, we // need the allocator to monitor this change. alloc *concat_AGGKINDAggAlloc
nit: rather than using *concat_AGGKINDAggAlloc
to get access to the allocator we should use *colmem.Allocator
object directly.
dc3c60d
to
22c4d9b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)
pkg/sql/colexec/concat_agg_tmpl.go, line 135 at r2 (raw file):
Previously, yuzefovich wrote…
nit: rather than using
*concat_AGGKINDAggAlloc
to get access to the allocator we should use*colmem.Allocator
object directly.
Done
PTAL
I have a question here. If there is possibility that aggregator run out of memory before allocator notice it. For example: SELECT concat_agg(_string_col) FROM some_table Seems that aggregate function won't flush aggregate value to output vector before it aggregate all values of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there is such a possibility, and initially I thought it was ok for the value of curAgg
be unaccounted for until the aggregation group is finished, but maybe it's unacceptable, let's see what my teammates think. cc @asubiotto @jordanlewis
An alternative would be to do what you proposed initially - adjusting memory account of the allocator after every a.curAgg = append(a.curAgg, col.Get(i)...)
, however, this will have a performance overhead.
How about this idea: the case that we want to be protected from is when thousands of values belong to the same aggregation group because in that case a.curAgg
can grow arbitrarily large until it is "flushed". We could add a protection on per-batch granularity - we could get the length of curAgg
before the loops of _ACCUMULATE_CONCAT
in Compute
and after the loops, and then we will adjust the memory account by the difference after - before
(we will also need to reduce the account by len(a.curAgg)
in Flush
method). This way the performance hit will be negligible yet we're accounting for a.curAgg
between different coldata.Batch
es. Batch size is 1024, so at most we will have 1024 x avg string elem
bytes unaccounted for temporarily, and given that CockroachDB doesn't work well with large in size values (say 1MB), then we won't account - temporarily - for at most 1GB of RAM which seems acceptable to me.
Reviewed 3 of 3 files at r3.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yongyanglai)
pkg/sql/colexec/concat_agg_tmpl.go, line 132 at r3 (raw file):
type concat_AGGKINDAgg struct { // alloc is the alloc used to create this aggregateFunc
nit: the comment needs minor adjustment.
22c4d9b
to
900cfc0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI build failed. I didn't find meaningful logs in teamcity and diff shows only changes in code comments.
I ran make test
on my linux computer, but there was seems no problem.
Should I trigger another run in teamcity?
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)
pkg/sql/colexec/concat_agg_tmpl.go, line 132 at r3 (raw file):
Previously, yuzefovich wrote…
nit: the comment needs minor adjustment.
Done
I think it's a reasonable solution and I can't figure out a more efficient method to do so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I just took look at your initial version with memory accounting, and that was basically what I outlined above. Yeah, I think we should bring it back, and you were right from the very start, my bad from steering you into wrong direction. The nice thing is that the code will now be simplified because we're hard-coding the usage on Bytes
vector. I'll wait for you to update the PR accordingly.
Reviewed 3 of 3 files at r4.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale)
900cfc0
to
b4efd70
Compare
This commit implements aggregate function "concat_agg" for vectorized engine Release note (sql change): vectorized engine now supports aggregate function "concat_agg"
b4efd70
to
dd8049d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated PR accordingly
PTAL
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks for your contribution!
bors r=yuzefovich
Reviewed 3 of 3 files at r5.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale)
Seems like bors crashed. bors r=yuzefovich |
Build succeeded |
This PR contains implementation of vectorized aggregate function "concat_agg".
Related: #43561
Release note (sql change): vectorized engine now support aggregate function "concat_agg"