Rework GroupByHash for faster performance and support grouping by nulls #808
Conversation
I am basically done with this PR. All that remains in my mind is to run some benchmarks, and then I'll mark it as ready for review.
On the db-benchmark aggregation queries: PR:
Master:
It looks like it's a small perf hit on q2, but I think the other 4 queries greatly compensate for this 🎉
I have skimmed #812 and this. I do not claim to understand everything, but the code does what it is described to do. The tests and performance numbers look great.
/// scratch space used to collect indices for input rows in a
/// batch that have values to aggregate. Reset on each batch
indices: Vec<u32>,
}
Nice
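To make the comment on the indices field above concrete, here is a minimal sketch (my illustration, not the PR's code) of how such a per-batch scratch buffer of row indices can be reused; the function and predicate names are hypothetical:

```rust
/// Minimal sketch, not DataFusion's implementation: fill a reusable scratch
/// buffer with the row indices of one batch that belong to a group, then
/// clear it (keeping the allocation) before the next batch.
fn collect_group_rows(
    indices: &mut Vec<u32>,
    num_rows: u32,
    belongs_to_group: impl Fn(u32) -> bool, // hypothetical predicate, for illustration
) {
    indices.clear(); // "Reset on each batch": length goes to 0, capacity is kept
    for row in 0..num_rows {
        if belongs_to_group(row) {
            indices.push(row);
        }
    }
    // the collected indices would then be used to gather/aggregate matching rows
}

fn main() {
    let mut indices: Vec<u32> = Vec::with_capacity(1024); // allocated once
    // toy predicates standing in for "row has a value to aggregate for this group"
    collect_group_rows(&mut indices, 8, |row| row % 2 == 0);
    assert_eq!(indices, vec![0, 2, 4, 6]);
    collect_group_rows(&mut indices, 8, |row| row % 3 == 0); // same allocation reused
    assert_eq!(indices, vec![0, 3, 6]);
}
```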
datafusion/src/scalar.rs
eq_array_primitive!(array, index, IntervalDayTimeArray, val)
}
}
}
Nice
datafusion/src/scalar.rs
let u32_vals = make_typed_vec!(u8_vals, u32);
let u64_vals = make_typed_vec!(u8_vals, u64);

let str_vals = vec![Some("foo"), None, Some("bar")];
I wonder why the second value is always NULL? Would it be more general to vary its position (first or third)?
The NULL is present to test null handling (which found a bug in my dictionary implementation, actually).
It is always the second entry because:
- I basically copy/pasted the tests
- I figured putting the validity bit in the middle (rather than at the ends) would be more likely to catch potential latent bugs (though your suggestion of varying its location is probably better). In theory all the null edge cases should be handled in the underlying arrow code.
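For reference, here is a minimal sketch of the test pattern being discussed, assuming the arrow crate's StringArray; this is an illustration, not the PR's actual test:

```rust
use arrow::array::{Array, StringArray};

fn main() {
    // the None sits in the middle of the vector, between two valid values
    let str_vals = vec![Some("foo"), None, Some("bar")];
    let array = StringArray::from(str_vals);

    assert!(!array.is_null(0));
    assert!(array.is_null(1)); // null handling is exercised by the middle entry
    assert_eq!(array.value(2), "bar");
}
```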
I had a good look and I think it all looks GREAT!
Amazing work and results 🥇. Thanks a lot for this, @alamb!
Thanks @Dandandan and @jorgecarleitao -- I plan to merge #812 in first and leave this one open for another few days in case anyone else wants to comment.
Rebased now that #812 has been merged.
A TPC-H query that got quite a bit faster is q13: on parquet at SF=100 it went from 37.8s -> 29.5s.
Thanks @alamb! 🎉 🎉 🎉 🎉
Thanks everyone for all the help. This was a very cool experience of collaborative development for me.
That is an interesting article.
Looking at the summary:
The common implementation of the function using hashing techniques
suffers lower throughput rate due to the collision of the insert keys in
the hashing techniques.....
I actually found it very hard to test the group by collision handling correctness, because the hashing technique in `create_hashes` was so good I could not find any example data that hashed to the same value in a reasonable amount of time -- LOL
However, the technique of searching several slots at once might indeed be relevant:
https://www.researchgate.net/figure/SIMD-accelerated-cuckoo-hashing-extended-from-Ross-et-al-14_fig1_326669722
On Sat, Aug 14, 2021 at 4:58 AM, Jorge Leitao wrote:
Potentially relevant:
https://www.researchgate.net/publication/326669722_SIMD_Vectorized_Hashing_for_Grouped_Aggregation_22nd_European_Conference_ADBIS_2018_Budapest_Hungary_September_2-5_2018_Proceedings
Hashbrown already implements many tricks like this, I believe; it's one of the fastest hash table implementations. There is also a nightly rawtable API to retrieve multiple values at once. So far, in profiling results, I haven't seen the probing/hashmap itself being a very expensive part. AFAIK it's mostly other parts that could be optimized: updating the states/values, collision checks, converting to array, creating hash values, the actual …
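For readers unfamiliar with the pattern being discussed, here is a rough sketch (an illustration under assumptions, not DataFusion's actual code) of a hashbrown RawTable used as a hash-to-group-index map, where the eq closure performs the collision check by comparing the stored group key rather than trusting the hash alone. It assumes a hashbrown version that still exposes the raw-table API behind the `raw` feature; the struct and function names are made up for the example.

```rust
use hashbrown::raw::RawTable;

/// Sketch of a hash -> group-index map: the table stores (hash, group_idx)
/// pairs, while the group keys themselves live in a side Vec.
struct GroupedHashTable {
    map: RawTable<(u64, usize)>,
    group_keys: Vec<String>,
}

impl GroupedHashTable {
    fn new() -> Self {
        Self { map: RawTable::new(), group_keys: Vec::new() }
    }

    /// Return the index of `key`'s group, inserting a new group if needed.
    fn group_index(&mut self, key: &str, hash: u64) -> usize {
        // `get` probes by hash; the closure is the collision check: different
        // keys can share a hash, so compare the stored key too.
        if let Some(&(_, idx)) = self
            .map
            .get(hash, |(h, idx)| *h == hash && self.group_keys[*idx] == key)
        {
            return idx;
        }
        let idx = self.group_keys.len();
        self.group_keys.push(key.to_string());
        // the last closure tells the table how to re-hash entries when it resizes
        self.map.insert(hash, (hash, idx), |(h, _)| *h);
        idx
    }
}

fn main() {
    let mut groups = GroupedHashTable::new();
    // deliberately collision-prone toy hash, just to show why the key
    // comparison in the closure matters (the real code uses create_hashes)
    let toy_hash = |s: &str| s.len() as u64;
    assert_eq!(groups.group_index("aa", toy_hash("aa")), 0);
    // "bb" has the same toy hash as "aa": only the collision check keeps them apart
    assert_eq!(groups.group_index("bb", toy_hash("bb")), 1);
    assert_eq!(groups.group_index("aa", toy_hash("aa")), 0);
}
```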
Which issue does this PR close?
Closes #790 by implementing a new design for group by hash. Built on #812, so it may be easier to review that one first.
This PR is an amazing collaborative effort and includes ideas from @Dandandan @jhorstmann @rdettai @jorgecarleitao and likely others I forgot.
Rationale for this change
What changes are included in this PR?
Potential Follow-On Work
Performance Summary
db-benchmark
The numbers @Dandandan measured are at #808 (comment)
DataFusion Synthetic Aggregate Benchmark (newly created for this work):
Full results: https://github.com/alamb/datafusion_aggregate_bench/blob/main/RESULTS.md. Summary:
Ratio gby_null/master (less than 1 is better) for:
- 100 Groups; 100M rows, int64_keys (10% nulls), f64 values (1% nulls)
- 100 Groups; 100M rows, utf8_keys (10% nulls), f64 values (1% nulls)
- 100 Groups; 100M rows, dictionary(utf8, int32) keys (10% nulls), f64 values (1% nulls)
aggregate micro benchmarks
Tested via
Results
Performance Source
This approach avoids the following operations which should improve its speed:
- ... group_values instead
Are there any user-facing changes?
Faster performance
Notes:
I tried to keep the same names and structure of the existing hash algorithm (as I found that easy to follow -- nice work @Dandandan and @andygrove), and I think that will make this easier to review.
Items completed
- Add ascii art diagrams (the design got simpler using the RawTable API)
- Comparison between ScalarValue and ArrayRef: Add ScalarValue::eq_array optimized comparison function #844 (a rough sketch of the idea follows below)
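As a companion to the last item above, here is a rough sketch of the kind of scalar-vs-array-element comparison such a function performs for one primitive type, written against plain arrow APIs. The helper name int64_eq_array is hypothetical; it is not the actual ScalarValue::eq_array from #844, which is implemented per type via macros such as the eq_array_primitive! call quoted earlier.

```rust
use arrow::array::{Array, ArrayRef, Int64Array};

/// Hypothetical helper, for illustration only: does `array[index]` equal `scalar`?
/// `None` represents a NULL scalar; it compares equal to a NULL array slot,
/// which is the behaviour a group-by key comparison needs (nulls group together).
fn int64_eq_array(scalar: Option<i64>, array: &ArrayRef, index: usize) -> bool {
    let array = array
        .as_any()
        .downcast_ref::<Int64Array>()
        .expect("expected Int64Array"); // downcast once, then compare raw values
    match scalar {
        None => array.is_null(index),
        Some(v) => !array.is_null(index) && array.value(index) == v,
    }
}

fn main() {
    let array: ArrayRef = std::sync::Arc::new(Int64Array::from(vec![Some(1), None, Some(3)]));
    assert!(int64_eq_array(Some(1), &array, 0));
    assert!(int64_eq_array(None, &array, 1)); // NULL key matches the NULL slot
    assert!(!int64_eq_array(Some(2), &array, 2));
}
```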