
Rework GroupByHash for faster performance and support grouping by nulls #808

Merged — 5 commits into apache:master on Aug 12, 2021

Conversation


@alamb alamb commented Aug 1, 2021

Which issue does this PR close?

Closes #790 by implementing a new design for group by hash. Built on #812, so it may be easier to review that one first.

This PR is an amazing collaborative effort and includes ideas from @Dandandan @jhorstmann @rdettai @jorgecarleitao and likely others I forgot.

Rationale for this change

  1. Regain performance lost when we added support for GROUP BY NULL; see #790 (Rework GroupByHash for faster performance and support grouping by nulls) for more details

What changes are included in this PR?

  1. Use a hash table to create the appropriate grouping, and use indexes into the group values rather than hash keys in many places
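As a rough illustration of the idea (not the actual DataFusion code — all names here are made up), grouping can resolve each key to a small integer group index once, and then address all per-group state through plain vectors indexed by that integer:

```rust
use std::collections::HashMap;

/// Toy group-by-sum: each distinct key gets a group index; per-group
/// state lives in flat vectors addressed by that index, so the hash
/// table is consulted once per input row rather than once per access.
fn group_by_sum(keys: &[Option<i64>], values: &[f64]) -> Vec<(Option<i64>, f64)> {
    // key -> index into `group_keys` / `sums`
    let mut map: HashMap<Option<i64>, usize> = HashMap::new();
    let mut group_keys: Vec<Option<i64>> = Vec::new();
    let mut sums: Vec<f64> = Vec::new();

    for (key, value) in keys.iter().zip(values) {
        // Look up (or create) the group index once; note that a NULL
        // key (None) forms its own group, mirroring GROUP BY NULL.
        let idx = *map.entry(*key).or_insert_with(|| {
            group_keys.push(*key);
            sums.push(0.0);
            group_keys.len() - 1
        });
        sums[idx] += value;
    }

    group_keys.into_iter().zip(sums).collect()
}
```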

Potential Follow On Work

Performance Summary

db-benchmark

The numbers @Dandandan measured are at #808 (comment)

DataFusion Synthetic Aggregate Benchmark (newly created for this work):

Full results, https://github.com/alamb/datafusion_aggregate_bench/blob/main/RESULTS.md . Summary:

| test | master | arrow-datafusion #808 | gby_null / master (less than 1 is better) |
|------|--------|-----------------------|-------------------------------------------|
| 100 Groups; 100M rows, int64_keys (10% nulls), f64 values (1% nulls) | 22.40s | 16.27s | 0.73 |
| 100 Groups; 100M rows, utf8_keys (10% nulls), f64 values (1% nulls) | 29.46s | 22.73s | 0.77 |
| 100 Groups; 100M rows, dictionary(utf8, int32) keys (10% nulls), f64 values (1% nulls) | 31.54s | 26.96s | 0.85 |

aggregate micro benchmarks

Tested via

cargo bench --bench aggregate_query_sql -- --save-baseline <test name>

Results

group                                                gby_null_new1                           master
-----                                                -------------                           ------
aggregate_query_group_by                             1.00      2.9±0.13ms        ? ?/sec     1.10      3.2±0.35ms        ? ?/sec
aggregate_query_group_by_u64 15 12                   1.00      3.1±0.16ms        ? ?/sec     1.00      3.1±0.08ms        ? ?/sec
aggregate_query_group_by_with_filter                 1.12      2.2±0.12ms        ? ?/sec     1.00  1969.5±41.36µs        ? ?/sec
aggregate_query_group_by_with_filter_u64 15 12       1.09      2.2±0.08ms        ? ?/sec     1.00      2.1±0.08ms        ? ?/sec
aggregate_query_no_group_by 15 12                    1.00  1195.8±70.21µs        ? ?/sec     1.02  1223.4±108.78µs        ? ?/sec
aggregate_query_no_group_by_count_distinct_narrow    1.00      5.5±0.23ms        ? ?/sec     1.09      6.0±0.90ms        ? ?/sec
aggregate_query_no_group_by_count_distinct_wide      1.09      8.0±0.59ms        ? ?/sec     1.00      7.4±0.30ms        ? ?/sec
aggregate_query_no_group_by_min_max_f64              1.08  1177.6±117.36µs        ? ?/sec    1.00  1092.1±29.93µs        ? ?/sec

Performance Source

This approach should improve speed because it:

  1. Avoids copying group values into a Vec in order to hash them, saving both time and space
  2. Avoids several hash table lookups (uses indexes into group_values instead)
  3. Uses vectorized hashing
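To illustrate point 3: "vectorized hashing" here means computing one hash per row in tight per-column loops over the whole batch, rather than materializing and hashing a per-row key tuple. A minimal sketch, using std's DefaultHasher and a naive hash-combine step (both are stand-ins for the real implementation over Arrow arrays):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Compute one hash per row, column by column, combining each column's
/// contribution into a single per-row hashes buffer.
fn hash_rows(col_a: &[i64], col_b: &[&str]) -> Vec<u64> {
    let mut hashes = vec![0u64; col_a.len()];
    // First column: seed each row's hash.
    for (h, v) in hashes.iter_mut().zip(col_a) {
        let mut s = DefaultHasher::new();
        v.hash(&mut s);
        *h = s.finish();
    }
    // Subsequent columns: fold into the existing per-row hash
    // (simple multiply-add combine, purely illustrative).
    for (h, v) in hashes.iter_mut().zip(col_b) {
        let mut s = DefaultHasher::new();
        v.hash(&mut s);
        *h = h.wrapping_mul(31).wrapping_add(s.finish());
    }
    hashes
}
```

Rows with identical key values produce identical hashes, so the result can be fed straight into the group lookup without rebuilding per-row keys.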

Are there any user-facing changes?

Faster performance

Notes:

I tried to keep the same names and structure as the existing hash algorithm (which I found easy to follow -- nice work @Dandandan and @andygrove), and I think that will make this easier to review.

Items completed

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Aug 1, 2021
@alamb alamb force-pushed the alamb/gby_null_new branch 3 times, most recently from 4265132 to 9ad6719 Compare August 5, 2021 19:03

alamb commented Aug 9, 2021

I am basically done with this PR. All that remains, in my mind, is to run some benchmarks; then I'll mark it as ready for review.


Dandandan commented Aug 9, 2021

On the db-benchmark aggregation queries:

PR vs. master:

| query | PR | master |
|-------|---------|---------|
| q1 | 33 ms | 37 ms |
| q2 | 377 ms | 325 ms |
| q3 | 986 ms | 1431 ms |
| q4 | 47 ms | 56 ms |
| q5 | 973 ms | 1287 ms |
| q7 | 932 ms | 1304 ms |
| q10 | 4040 ms | 9380 ms |

It looks like there's a small perf hit on q2, but I think the other queries more than compensate for it 🎉

@alamb alamb changed the title from "(WIP) Rework GroupByHash for faster performance and support grouping by nulls" to "Rework GroupByHash for faster performance and support grouping by nulls" Aug 9, 2021
@alamb alamb marked this pull request as ready for review August 9, 2021 19:54

@NGA-TRAN NGA-TRAN left a comment


I have skimmed #812 and this one. I do not claim to understand everything, but the code does what it is described to do. The tests and performance numbers look great.

/// scratch space used to collect indices for input rows in a
/// batch that have values to aggregate. Reset on each batch
indices: Vec<u32>,
}

Nice

eq_array_primitive!(array, index, IntervalDayTimeArray, val)
}
}
}

Nice

let u32_vals = make_typed_vec!(u8_vals, u32);
let u64_vals = make_typed_vec!(u8_vals, u64);

let str_vals = vec![Some("foo"), None, Some("bar")];

I wonder why the second value is always NULL. Would it be more general to have its position vary (first or third)?

Reply from @alamb (author):

The NULL is present to test null handling (which found a bug in my dictionary implementation, actually)

It is always the second entry because:

  1. I basically copy/pasted the tests
  2. I figured putting the validity bit in the middle (rather than at the ends) would be more likely to catch latent bugs (though your suggestion of varying its location is probably better). In theory, all the null edge cases should be handled in the underlying arrow code
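The reviewer's suggestion of varying the NULL position could be sketched as a small test helper that generates the same input with the NULL at each index (a hypothetical helper, using `Option<&str>` to stand in for a nullable Arrow string array):

```rust
/// Produce every placement of a single NULL among two non-null values,
/// so a grouping test can be run once per placement instead of always
/// putting the NULL in the middle.
fn null_permutations(vals: [&str; 2]) -> Vec<Vec<Option<&str>>> {
    (0..3)
        .map(|null_pos| {
            // Start from the non-null values, then insert the NULL
            // at position 0, 1, or 2.
            let mut row: Vec<Option<&str>> = vals.iter().copied().map(Some).collect();
            row.insert(null_pos, None);
            row
        })
        .collect()
}
```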

@Dandandan

I had a good look and think all looks GREAT!


@jorgecarleitao jorgecarleitao left a comment


:shipit: Amazing work and results 🥇. Thanks a lot for this, @alamb !


alamb commented Aug 10, 2021

Thanks @Dandandan and @jorgecarleitao -- I plan to merge #812 in first and leave this one open for another few days in case anyone else wants to comment.

@alamb alamb force-pushed the alamb/gby_null_new branch from 0051a85 to c5bc0c1 Compare August 11, 2021 08:46

alamb commented Aug 11, 2021

Rebased now that #812 has been merged


Dandandan commented Aug 11, 2021

A TPC-H query that got quite a bit faster is q13: on Parquet at SF=100, it went from 37.8s to 29.5s.

@Dandandan Dandandan merged commit fa3f099 into apache:master Aug 12, 2021
@Dandandan

Thanks @alamb ! 🎉 🎉 🎉 🎉


alamb commented Aug 12, 2021

Thanks everyone for all the help. This was a very cool experience of collaborative development for me

@alamb alamb deleted the alamb/gby_null_new branch August 12, 2021 19:51

alamb commented Aug 14, 2021 via email

@Dandandan

Hashbrown already implements many tricks like this, I believe; it's one of the fastest hash table implementations:
https://docs.rs/hashbrown/0.11.2/hashbrown/hash_map/index.html

There is also a nightly RawTable API to retrieve multiple values at once, get_each_mut, which might be a bit faster.

So far, in profiling results, I haven't seen the probing/hashmap itself being a very expensive part. AFAIK it's mostly other parts that could be optimized: updating the states/values, collision checks, converting to arrays, creating hash values, the actual sum over the array, etc.
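The "collision checks" mentioned above arise because two distinct keys can share a hash, so any probe that matches on hash must still verify full key equality before reusing a group. A toy sketch of that pattern (illustrative names only — this is neither hashbrown's nor DataFusion's code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal group table that stores (hash, group index) pairs and keeps
/// the full keys in a side vector. A real table would bucket the slots;
/// a linear scan keeps the collision check easy to see.
struct TinyGroupTable {
    slots: Vec<(u64, usize)>, // (key hash, index into `keys`)
    keys: Vec<String>,
}

impl TinyGroupTable {
    fn new() -> Self {
        Self { slots: Vec::new(), keys: Vec::new() }
    }

    fn hash(key: &str) -> u64 {
        let mut s = DefaultHasher::new();
        key.hash(&mut s);
        s.finish()
    }

    /// Return the group index for `key`, inserting a new group if absent.
    fn get_or_insert(&mut self, key: &str) -> usize {
        let h = Self::hash(key);
        for &(slot_hash, idx) in &self.slots {
            // Cheap hash comparison first; full key equality runs only
            // on a hash match -- that equality test is the "collision
            // check" whose cost shows up in profiles.
            if slot_hash == h && self.keys[idx] == key {
                return idx;
            }
        }
        let idx = self.keys.len();
        self.keys.push(key.to_string());
        self.slots.push((h, idx));
        idx
    }
}
```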

Labels
datafusion Changes in the datafusion crate performance Make DataFusion faster

Successfully merging this pull request may close these issues.

Rework GroupByHash for faster performance and support grouping by nulls
4 participants