colexec: refactor hash aggregator #51337
Conversation
The benchmarks of the first commit are here, and those of the second commit are here. The benchmarks of the third commit are here and here. Hash aggregation improves, but ordered aggregation shows changes that are unintuitive to me: it speeds up when nulls are not present and slows down when they are. Still, I think the third commit is worth it. The benchmarks of the whole PR are here and here. The highlight:
(force-pushed b0a3c17 to 6f5d819)
Friendly ping.
I think the performance hit that we see in cases when the group sizes are big can be alleviated (or even entirely eliminated) if we add rehashing behavior to the hash table and start out with a small number of buckets (#52257).
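The rehashing idea mentioned above can be sketched as follows. This is a toy chained hash table, not CockroachDB's vectorized one: it starts with a small number of buckets and doubles (rehashing every key) once the load factor exceeds 1. All names and thresholds are illustrative assumptions.

```go
package main

import "fmt"

// hashTable is a toy chained hash table that grows on demand.
type hashTable struct {
	buckets [][]uint64
	n       int // number of keys stored
}

func newHashTable(numBuckets int) *hashTable {
	return &hashTable{buckets: make([][]uint64, numBuckets)}
}

func (ht *hashTable) insert(k uint64) {
	// Load factor would exceed 1: double the bucket count and rehash.
	if ht.n+1 > len(ht.buckets) {
		ht.rehash(2 * len(ht.buckets))
	}
	b := k % uint64(len(ht.buckets))
	ht.buckets[b] = append(ht.buckets[b], k)
	ht.n++
}

func (ht *hashTable) rehash(newSize int) {
	old := ht.buckets
	ht.buckets = make([][]uint64, newSize)
	for _, chain := range old {
		for _, k := range chain {
			b := k % uint64(len(ht.buckets))
			ht.buckets[b] = append(ht.buckets[b], k)
		}
	}
}

func main() {
	ht := newHashTable(1) // start tiny; the table grows only if needed
	for k := uint64(0); k < 100; k++ {
		ht.insert(k)
	}
	fmt.Println(ht.n, len(ht.buckets)) // 100 keys, buckets doubled to 128
}
```

Starting small keeps the table cheap when there are few groups, while the doubling strategy keeps the amortized insert cost constant when there are many.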
(force-pushed c28de12 to c383614)
**colexec: minor cleanup of the hash table**

This commit improves some of the resetting behavior in the hash table (we now copy over whole slices rather than resetting one element at a time) and introduces nicer, shorter-named variables for some slice accesses. It also moves one non-templated method out of the template into a regular file.

Release note: None
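The "copy whole slices" idea can be illustrated with a common Go idiom: instead of an element-by-element loop, bulk-reset by copying from a preallocated all-zero slice in chunks. This is only an illustration of the pattern, not the actual hash table code.

```go
package main

import "fmt"

// resetSlow zeroes a slice one element at a time.
func resetSlow(s []uint64) {
	for i := range s {
		s[i] = 0
	}
}

// zeroes is a shared, never-written all-zero source slice.
var zeroes = make([]uint64, 1024)

// resetFast zeroes a slice by copying whole chunks from zeroes.
func resetFast(s []uint64) {
	for n := 0; n < len(s); {
		n += copy(s[n:], zeroes)
	}
}

func main() {
	s := []uint64{1, 2, 3, 4}
	resetFast(s)
	fmt.Println(s) // [0 0 0 0]
}
```

(Modern Go compilers also recognize the simple range-clear loop, so the real win in the hash table comes from replacing scattered per-element resets with contiguous whole-slice copies.)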
**colexec: refactor hash aggregator**

This commit refactors the hash aggregator to use the vectorized hash table instead of Go's map and tightly couples the hash table with the actual aggregation. The hash table is used in a mode similar to how unordered distinct uses it, with some crucial differences. The algorithm for online aggregation is now as follows:
1. Read a batch from the input.
2. Group all tuples from the batch into "equality chains" (this is done by "probing" the batch against itself, similar to unordered distinct, but instead of inserting "head" tuples into the hash table, we populate special "equality" selection vectors as well as a separate selection vector that contains the heads of the equality chains).
3. Probe the heads of the equality chains against the already existing buckets in the hash table.
4. If there are any matches, all tuples in the corresponding equality chains belong to existing groups and are aggregated into them.
5. All unmatched equality chains form new groups, so we create a separate bucket for each and aggregate its tuples into it.

The crucial observation here is that we maintain a 1-to-1 mapping between the "heads" of the aggregation groups stored in the hash table and the buckets of the aggregate functions in the `buckets` slice.

This fundamental change to the hash aggregator's algorithm shows 4-5x speedups when the group sizes are small and takes a tolerable 30-40% hit when the group sizes are big. Such a tradeoff is acceptable since the absolute speed in the latter case is still very high.

Release note: None
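The per-batch flow described above can be modeled in miniature. The sketch below groups a batch into equality chains, probes the chain heads against existing groups, and opens a new bucket per unmatched chain; a plain Go map stands in for the vectorized hash table purely to keep the illustration short, and all names are assumptions.

```go
package main

import "fmt"

// bucket holds the running state of one aggregation group (here: a sum).
type bucket struct {
	key string
	sum int64
}

type aggregator struct {
	headToBucket map[string]*bucket // stands in for the vectorized hash table
	buckets      []*bucket
}

func (a *aggregator) processBatch(keys []string, vals []int64) {
	// Step 2: build equality chains — row positions sharing a group key.
	chains := map[string][]int{}
	for i, k := range keys {
		chains[k] = append(chains[k], i)
	}
	// Steps 3-5: probe each chain head; matched chains aggregate into the
	// existing bucket, unmatched chains each get a fresh bucket.
	for k, chain := range chains {
		b, ok := a.headToBucket[k]
		if !ok {
			b = &bucket{key: k}
			a.headToBucket[k] = b
			a.buckets = append(a.buckets, b)
		}
		for _, i := range chain {
			b.sum += vals[i]
		}
	}
}

func main() {
	a := &aggregator{headToBucket: map[string]*bucket{}}
	a.processBatch([]string{"x", "y", "x"}, []int64{1, 2, 3})
	a.processBatch([]string{"y", "x"}, []int64{4, 5})
	fmt.Println(a.headToBucket["x"].sum, a.headToBucket["y"].sum) // 9 6
}
```

The point of the real design is that each chain head is probed exactly once per batch, so per-tuple hash-table work is replaced by per-group work.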
**colexec: clean up aggregate functions**

This commit does the following:
1. Changes the signature of the `Flush` method to take an `outputIdx` argument, which the hash aggregate functions use to know where to write their output (the argument is ignored by the ordered aggregate functions). This allows us to remove one `int` from the hash aggregate functions, which can be noticeable in the case of many groups.
2. Changes the signature of the `Compute` method to take a "disassembled" batch (separate vectors, the input length, and the selection vector). This allows us to avoid copying the equality chains into the selection vector of the batch in the hash aggregator.
3. Extracts base structs that implement the common functionality of aggregate functions.
4. Retunes `hashAggregatorAllocSize`, which is now 128 instead of the previous 64.

Note that I prototyped introducing an `aggregateFuncBase` interface that would be implemented by `orderedAggregateFuncBase` and `hashAggregateFuncBase` structs for step 3 above, but it showed worse performance (slower speed, slightly more allocations), so that prototype was discarded.

It also moves a couple of cancellation checks outside of the for loop and cleans up a few logic test files.

Release note: None
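A hypothetical shape for the cleaned-up signatures described in points 1 and 2 might look like the following. The names, `int64`-only vectors, and interface are illustrative assumptions, not the actual colexec API.

```go
package main

import "fmt"

// aggregateFunc sketches the cleaned-up contract: Compute receives the batch
// already "disassembled" (vectors, length, selection vector), and Flush
// receives an outputIdx telling a hash variant where to write its result.
type aggregateFunc interface {
	Compute(vecs [][]int64, inputLen int, sel []int)
	Flush(out []int64, outputIdx int)
}

// sumHashAgg aggregates one group; a hash aggregator owns many of these.
type sumHashAgg struct {
	sum int64
}

func (a *sumHashAgg) Compute(vecs [][]int64, inputLen int, sel []int) {
	col := vecs[0]
	// sel plays the role of an equality chain here: only the selected rows
	// belong to this function's group, so no batch-level copy is needed.
	for _, i := range sel[:inputLen] {
		a.sum += col[i]
	}
}

func (a *sumHashAgg) Flush(out []int64, outputIdx int) {
	// The hash variant writes at the index it is told; an ordered variant
	// would ignore outputIdx and track its own position.
	out[outputIdx] = a.sum
}

func main() {
	var agg aggregateFunc = &sumHashAgg{}
	col := []int64{10, 20, 30, 40}
	agg.Compute([][]int64{col}, 2, []int{1, 3}) // rows 1 and 3: 20 + 40
	out := make([]int64, 4)
	agg.Flush(out, 2)
	fmt.Println(out) // [0 0 60 0]
}
```

Passing `outputIdx` at flush time, rather than storing it in every function, is what removes one `int` per function instance.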
Reviewed 5 of 5 files at r1, 7 of 7 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/sql/colexec/concat_agg_tmpl.go, line 139 at r3 (raw file):
type concat_AGGKINDAgg struct {
	// {{if eq "_AGGKIND" "Ordered"}}
	orderedAggregateFuncBase
I really wish not every aggregate had to know about its context... but I guess you didn't introduce that here, either
TFTR!
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)
pkg/sql/colexec/concat_agg_tmpl.go, line 139 at r3 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I really wish not every aggregate had to know about its context... but I guess you didn't introduce that here, either
Yeah, I don't think this addition makes anything worse. Maybe this instantiation could be moved into a templated constructor method, but that can wait.
Build succeeded.
52174: colexec: add default aggregate function r=yuzefovich a=yuzefovich

Depends on #51337. Depends on #52315.

**colexec: add default aggregate function**

This commit introduces a "default" aggregate function, which is an adapter from `tree.AggregateFunc` to `colexec.aggregateFunc`. It works as follows:
- the aggregator (either hash or ordered) is responsible for converting all necessary vectors to `tree.Datum` columns before calling `Compute` (this allows us to share the conversion between multiple functions if they happen to take in the same columns, and between multiple groups in the case of the hash aggregator)
- the default aggregate function populates the "arguments" to be passed into the wrapped `tree.AggregateFunc` and adds them
- when a new group is encountered, the result so far is flushed and the wrapped `tree.AggregateFunc` is reset.

One detail is that these wrapped `tree.AggregateFunc`s need to be closed, and currently that responsibility lies with the alloc object that creates them. In the future, we might want to shift the responsibility to the aggregators.

Addresses: #43561.

Release note: None

**colexec: clean up hash aggregate functions**

Hash aggregate functions always have a non-nil `sel`, and this commit removes the code generation for the nil `sel` case (i.e., it removes dead code). It also templates out the nulls vs. no-nulls cases in the `bool_and` and `bool_or` aggregates.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
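The adapter described in the follow-up PR can be modeled in miniature. Below, `datumAggregate` stands in for the row-engine `tree.AggregateFunc` contract (fed one datum at a time), and `defaultAgg` adapts it to a columnar-style interface; every name and signature here is an assumption for illustration, not the real API.

```go
package main

import "fmt"

// datumAggregate mimics a row-at-a-time aggregate: add datums, read the
// result, reset for the next group.
type datumAggregate interface {
	Add(d interface{})
	Result() interface{}
	Reset()
}

// rowCountAgg is a trivial wrapped aggregate that counts its inputs.
type rowCountAgg struct{ n int }

func (a *rowCountAgg) Add(d interface{})   { a.n++ }
func (a *rowCountAgg) Result() interface{} { return a.n }
func (a *rowCountAgg) Reset()              { a.n = 0 }

// defaultAgg adapts a datumAggregate to a columnar interface: the caller has
// already converted the vector to a datum column, and Compute feeds the
// selected rows into the wrapped function one by one.
type defaultAgg struct {
	wrapped datumAggregate
}

func (a *defaultAgg) Compute(datumCol []interface{}, sel []int) {
	for _, i := range sel {
		a.wrapped.Add(datumCol[i])
	}
}

// Flush returns the result so far and resets the wrapped function, matching
// the "new group encountered" behavior described above.
func (a *defaultAgg) Flush() interface{} {
	res := a.wrapped.Result()
	a.wrapped.Reset()
	return res
}

func main() {
	agg := &defaultAgg{wrapped: &rowCountAgg{}}
	col := []interface{}{"a", "b", "c", "d"}
	agg.Compute(col, []int{0, 2, 3}) // three selected rows
	fmt.Println(agg.Flush())         // 3
}
```

Converting the vector to datums once, outside the adapter, is what lets multiple wrapped functions (and multiple hash groups) share a single conversion.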