51337: colexec: refactor hash aggregator r=yuzefovich a=yuzefovich

**colexec: minor cleanup of the hash table**

This commit improves some of the resetting behavior in the hash table (we now copy over whole slices rather than resetting one element at a time) and introduces nicer, shorter-named variables for some slice accesses. It also moves one non-templated method out of the template into a regular file.

Release note: None

**colexec: refactor hash aggregator**

This commit refactors the hash aggregator to use the vectorized hash table instead of Go's map and tightly couples the hash table with the actual aggregation. The hash table is used in a mode similar to how the unordered distinct uses it, with some crucial differences. The algorithm for online aggregation is now as follows:
1. read a batch from the input
2. group all tuples from the batch into "equality chains" (this is done by "probing" the batch against itself, similar to the unordered distinct, but instead of inserting "head" tuples into the hash table, we populate special "equality" selection vectors as well as a separate selection vector that contains the heads of the equality chains)
3. probe the heads of the equality chains against the buckets already present in the hash table
4. if there are any matches, all tuples in the corresponding equality chains belong to existing groups and are aggregated into them
5. all unmatched equality chains form new groups, so we create a separate bucket for each and aggregate its tuples into it.

The crucial observation here is that we maintain a 1-to-1 mapping between the "heads" of the aggregation groups stored in the hash table and the aggregate function "buckets" in the `buckets` slice.

This fundamental change to the hash aggregator's algorithm shows 4-5x speedups when the group sizes are small and takes a tolerable 30-40% hit when the group sizes are big. That tradeoff is acceptable since the absolute speed in the latter case is still very high.
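The five steps above can be sketched in scalar Go for a single SUM ... GROUP BY aggregation. All names here are hypothetical, and the index from head key to bucket is sketched with a plain Go map for brevity, whereas the real colexec code is vectorized, templated, and specifically avoids Go's map by using the vectorized hash table:

```go
package main

import "fmt"

// bucket pairs the "head" key of an aggregation group with its running
// aggregate, mirroring the 1-to-1 head<->bucket mapping described above.
type bucket struct {
	key int64
	sum int64
}

// groupBatch is a toy version of step 2: it splits a batch into "equality
// chains" (positions of all tuples sharing a key) and records the position
// of each chain's head.
func groupBatch(keys []int64) (chains map[int64][]int, heads []int) {
	chains = make(map[int64][]int)
	for i, k := range keys {
		if _, ok := chains[k]; !ok {
			heads = append(heads, i)
		}
		chains[k] = append(chains[k], i)
	}
	return chains, heads
}

func main() {
	var buckets []bucket
	headToBucket := make(map[int64]int) // stand-in for the hash table's index

	batches := [][][2]int64{
		{{1, 10}, {2, 20}, {1, 30}}, // (key, val) pairs
		{{2, 5}, {3, 7}},
	}
	for _, batch := range batches { // step 1: read a batch
		keys := make([]int64, len(batch))
		for i, t := range batch {
			keys[i] = t[0]
		}
		chains, heads := groupBatch(keys) // step 2: build equality chains
		for _, h := range heads {
			k := keys[h]
			idx, ok := headToBucket[k] // step 3: probe existing buckets
			if !ok {                   // step 5: unmatched chain -> new bucket
				idx = len(buckets)
				buckets = append(buckets, bucket{key: k})
				headToBucket[k] = idx
			}
			for _, pos := range chains[k] { // step 4: aggregate the whole chain
				buckets[idx].sum += batch[pos][1]
			}
		}
	}
	for _, b := range buckets {
		fmt.Printf("key=%d sum=%d\n", b.key, b.sum)
	}
}
```

Note that an entire equality chain is aggregated into one bucket after a single probe of its head, which is why shrinking the per-tuple bookkeeping pays off most when groups are small.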
```
Aggregator/MIN/hash/int/groupSize=1/hasNulls=false/numInputBatches=64-24      4.64MB/s ± 1%   24.34MB/s ± 1%  +424.66%  (p=0.000 n=10+10)
Aggregator/MIN/hash/int/groupSize=1/hasNulls=true/numInputBatches=64-24       4.64MB/s ± 1%   23.67MB/s ± 1%  +410.00%  (p=0.000 n=10+10)
Aggregator/MIN/hash/int/groupSize=2/hasNulls=false/numInputBatches=64-24      9.59MB/s ± 1%   45.78MB/s ± 1%  +377.31%  (p=0.000 n=10+10)
Aggregator/MIN/hash/int/groupSize=2/hasNulls=true/numInputBatches=64-24       9.60MB/s ± 1%   44.73MB/s ± 1%  +365.88%  (p=0.000 n=10+10)
Aggregator/MIN/hash/int/groupSize=32/hasNulls=false/numInputBatches=64-24      131MB/s ± 1%     211MB/s ± 0%   +61.13%  (p=0.000 n=10+10)
Aggregator/MIN/hash/int/groupSize=32/hasNulls=true/numInputBatches=64-24       124MB/s ± 1%     197MB/s ± 0%   +58.65%  (p=0.000 n=10+10)
Aggregator/MIN/hash/int/groupSize=128/hasNulls=false/numInputBatches=64-24     314MB/s ± 0%     266MB/s ± 0%   -15.28%  (p=0.000 n=10+10)
Aggregator/MIN/hash/int/groupSize=128/hasNulls=true/numInputBatches=64-24      280MB/s ± 0%     242MB/s ± 0%   -13.50%  (p=0.000 n=10+10)
Aggregator/MIN/hash/int/groupSize=512/hasNulls=false/numInputBatches=64-24     451MB/s ± 0%     282MB/s ± 0%   -37.51%  (p=0.000 n=9+10)
Aggregator/MIN/hash/int/groupSize=512/hasNulls=true/numInputBatches=64-24      382MB/s ± 1%     255MB/s ± 0%   -33.06%  (p=0.000 n=10+10)
Aggregator/MIN/hash/int/groupSize=1024/hasNulls=false/numInputBatches=64-24    471MB/s ± 1%     280MB/s ± 0%   -40.61%  (p=0.000 n=9+10)
Aggregator/MIN/hash/int/groupSize=1024/hasNulls=true/numInputBatches=64-24     400MB/s ± 0%     254MB/s ± 0%   -36.53%  (p=0.000 n=9+10)
```

Release note: None

**colexec: clean up aggregate functions**

This commit does the following:
1. changes the signature of the `Flush` method to take in an `outputIdx` argument which is used by the hash aggregate functions to know where to write their output (this argument is ignored by the ordered aggregate functions). This allows us to remove one `int` from the hash aggregate functions, which can be noticeable in the case of many groups.
2.
changes the signature of the `Compute` method to take in a "disassembled" batch (separate vectors, the input length, and the selection vector). This allows us to avoid copying the equality chains into the selection vector of the batch in the hash aggregator.
3. extracts base structs that implement the common functionality of aggregate functions.
4. retunes `hashAggregatorAllocSize`, which is now 128 instead of the previous 64.

Note that I prototyped introducing an `aggregateFuncBase` interface that would be implemented by `orderedAggregateFuncBase` and `hashAggregateFuncBase` structs for step 3 above, but it showed worse performance (slower speed, slightly more allocations), so that prototype was discarded.

This commit also moves a couple of cancel checks outside of the for loops and cleans up a few logic test files.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
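The shape of the two signature changes can be illustrated with a toy interface. The names and types below are hypothetical simplifications, not the real colexec signatures; they only show `Flush` taking a caller-provided `outputIdx` and `Compute` taking the batch already disassembled into vectors, a length, and a selection vector:

```go
package main

import "fmt"

// aggregateFunc is a hypothetical, simplified version of the cleaned-up
// aggregate-function interface described in the commit message.
type aggregateFunc interface {
	// Compute consumes a "disassembled" batch: raw vectors, the input
	// length, and a selection vector. The hash aggregator can thus pass an
	// equality chain directly as sel instead of copying it into the batch.
	Compute(vecs [][]int64, inputLen int, sel []int)
	// Flush writes the result. Hash aggregate functions use outputIdx to
	// know where to write (so they no longer store an index themselves);
	// ordered aggregate functions would ignore it.
	Flush(out []int64, outputIdx int)
}

// sumHashAgg is a toy hash-style SUM aggregate.
type sumHashAgg struct{ sum int64 }

func (a *sumHashAgg) Compute(vecs [][]int64, inputLen int, sel []int) {
	col := vecs[0]
	for _, i := range sel[:inputLen] {
		a.sum += col[i]
	}
}

func (a *sumHashAgg) Flush(out []int64, outputIdx int) {
	out[outputIdx] = a.sum // write at the caller-provided position
}

func main() {
	agg := &sumHashAgg{}
	// The equality chain {0, 2} is passed directly as the selection vector.
	agg.Compute([][]int64{{10, 20, 30}}, 2, []int{0, 2})
	out := make([]int64, 4)
	agg.Flush(out, 3) // sums 10+30 into position 3
	fmt.Println(out)
}
```

Dropping the stored output index from every hash aggregate function is what saves one `int` per function instance, which adds up when there are many groups.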