
Rework GroupByHash for faster performance and support grouping by nulls #808

Merged — 5 commits into apache:master on Aug 12, 2021

Conversation


@alamb alamb commented Aug 1, 2021

Which issue does this PR close?

Closes #790 by implementing a new design for group by hash. Built on #812, so it may be easier to review that one first.

This PR is an amazing collaborative effort and includes ideas from @Dandandan @jhorstmann @rdettai @jorgecarleitao and likely others I forgot.

Rationale for this change

  1. Regain performance lost when we added support for GROUP BY NULL; see #790 (Rework GroupByHash for faster performance and support grouping by nulls) for more details

What changes are included in this PR?

  1. Use a hash table to create the appropriate grouping, and use indexes into the group values rather than hash keys in many places
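As a rough illustration of the idea (not the actual DataFusion code — all names here are made up), grouping can resolve each key to a small integer group index once, and then address all per-group state through plain vectors indexed by that integer:

```rust
use std::collections::HashMap;

/// Toy group-by-sum: each distinct key gets a group index; per-group
/// state lives in flat vectors addressed by that index, so the hash
/// table is consulted once per input row rather than once per access.
fn group_by_sum(keys: &[Option<i64>], values: &[f64]) -> Vec<(Option<i64>, f64)> {
    // key -> index into `group_keys` / `sums`
    let mut map: HashMap<Option<i64>, usize> = HashMap::new();
    let mut group_keys: Vec<Option<i64>> = Vec::new();
    let mut sums: Vec<f64> = Vec::new();

    for (key, value) in keys.iter().zip(values) {
        // Look up (or create) the group index once; note that a NULL
        // key (None) forms its own group, mirroring GROUP BY NULL.
        let idx = *map.entry(*key).or_insert_with(|| {
            group_keys.push(*key);
            sums.push(0.0);
            group_keys.len() - 1
        });
        sums[idx] += value;
    }

    group_keys.into_iter().zip(sums).collect()
}
```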

Potential Follow On Work

Performance Summary

db-benchmark

The numbers @Dandandan measured are at #808 (comment)

DataFusion Synthetic Aggregate Benchmark (newly created for this work):

Full results, https://github.com/alamb/datafusion_aggregate_bench/blob/main/RESULTS.md . Summary:

| test | master | arrow-datafusion #808 | gby_null / master (less than 1 is better) |
|------|--------|-----------------------|-------------------------------------------|
| 100 Groups; 100M rows, int64_keys (10% nulls), f64 values (1% nulls) | 22.40s | 16.27s | 0.73 |
| 100 Groups; 100M rows, utf8_keys (10% nulls), f64 values (1% nulls) | 29.46s | 22.73s | 0.77 |
| 100 Groups; 100M rows, dictionary(utf8, int32) keys (10% nulls), f64 values (1% nulls) | 31.54s | 26.96s | 0.85 |

aggregate micro benchmarks

Tested via

cargo bench --bench aggregate_query_sql -- --save-baseline <test name>

Results

group                                                gby_null_new1                           master
-----                                                -------------                           ------
aggregate_query_group_by                             1.00      2.9±0.13ms        ? ?/sec     1.10      3.2±0.35ms        ? ?/sec
aggregate_query_group_by_u64 15 12                   1.00      3.1±0.16ms        ? ?/sec     1.00      3.1±0.08ms        ? ?/sec
aggregate_query_group_by_with_filter                 1.12      2.2±0.12ms        ? ?/sec     1.00  1969.5±41.36µs        ? ?/sec
aggregate_query_group_by_with_filter_u64 15 12       1.09      2.2±0.08ms        ? ?/sec     1.00      2.1±0.08ms        ? ?/sec
aggregate_query_no_group_by 15 12                    1.00  1195.8±70.21µs        ? ?/sec     1.02  1223.4±108.78µs        ? ?/sec
aggregate_query_no_group_by_count_distinct_narrow    1.00      5.5±0.23ms        ? ?/sec     1.09      6.0±0.90ms        ? ?/sec
aggregate_query_no_group_by_count_distinct_wide      1.09      8.0±0.59ms        ? ?/sec     1.00      7.4±0.30ms        ? ?/sec
aggregate_query_no_group_by_min_max_f64              1.08  1177.6±117.36µs        ? ?/sec    1.00  1092.1±29.93µs        ? ?/sec

Performance Source

This approach should improve speed because it:

  1. Avoids copying group values into a Vec in order to hash them, saving both time and space
  2. Avoids several hash table lookups (uses indexes into group_values instead)
  3. Uses vectorized hashing
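To illustrate point 3: "vectorized hashing" here means computing one hash per row in tight per-column loops over the whole batch, rather than materializing and hashing a per-row key tuple. A minimal sketch, using std's DefaultHasher and a naive hash-combine step (both are stand-ins for the real implementation over Arrow arrays):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Compute one hash per row, column by column, combining each column's
/// contribution into a single per-row hashes buffer.
fn hash_rows(col_a: &[i64], col_b: &[&str]) -> Vec<u64> {
    let mut hashes = vec![0u64; col_a.len()];
    // First column: seed each row's hash.
    for (h, v) in hashes.iter_mut().zip(col_a) {
        let mut s = DefaultHasher::new();
        v.hash(&mut s);
        *h = s.finish();
    }
    // Subsequent columns: fold into the existing per-row hash
    // (simple multiply-add combine, purely illustrative).
    for (h, v) in hashes.iter_mut().zip(col_b) {
        let mut s = DefaultHasher::new();
        v.hash(&mut s);
        *h = h.wrapping_mul(31).wrapping_add(s.finish());
    }
    hashes
}
```

Rows with identical key values produce identical hashes, so the result can be fed straight into the group lookup without rebuilding per-row keys.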

Are there any user-facing changes?

Faster performance

Notes:

I tried to keep the same names and structure as the existing hash algorithm (which I found easy to follow -- nice work @Dandandan and @andygrove), and I think that will make this easier to review.

Items completed

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Aug 1, 2021
@alamb alamb force-pushed the alamb/gby_null_new branch 3 times, most recently from 4265132 to 9ad6719 Compare August 5, 2021 19:03

alamb commented Aug 9, 2021

I am basically done with this PR. All that remains, in my mind, is to run some benchmarks; then I'll mark it as ready for review.


Dandandan commented Aug 9, 2021

On the db-benchmark aggregation queries:

PR vs. master:

| query | PR | master |
|-------|---------|---------|
| q1 | 33 ms | 37 ms |
| q2 | 377 ms | 325 ms |
| q3 | 986 ms | 1431 ms |
| q4 | 47 ms | 56 ms |
| q5 | 973 ms | 1287 ms |
| q7 | 932 ms | 1304 ms |
| q10 | 4040 ms | 9380 ms |

It looks like there's a small perf hit on q2, but I think the other queries more than compensate for it 🎉

@alamb alamb changed the title from "(WIP) Rework GroupByHash for faster performance and support grouping by nulls" to "Rework GroupByHash for faster performance and support grouping by nulls" Aug 9, 2021
@alamb alamb marked this pull request as ready for review August 9, 2021 19:54

@NGA-TRAN NGA-TRAN left a comment


I have skimmed #812 and this one. I do not claim to understand everything, but the code does what it is described to do. The tests and performance numbers look great.

/// scratch space used to collect indices for input rows in a
/// batch that have values to aggregate. Reset on each batch
indices: Vec<u32>,
}

Nice

eq_array_primitive!(array, index, IntervalDayTimeArray, val)
}
}
}

Nice

let u32_vals = make_typed_vec!(u8_vals, u32);
let u64_vals = make_typed_vec!(u8_vals, u64);

let str_vals = vec![Some("foo"), None, Some("bar")];

I wonder why the second value is always NULL. Would it be more general to have its position vary (first or third)?

Reply from @alamb (author):

The NULL is present to test null handling (which found a bug in my dictionary implementation, actually)

It is always the second entry because:

  1. I basically copy/pasted the tests
  2. I figured putting the validity bit in the middle (rather than at the ends) would be more likely to catch latent bugs (though your suggestion of varying its location is probably better). In theory, all the null edge cases should be handled in the underlying arrow code
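The reviewer's suggestion of varying the NULL position could be sketched as a small test helper that generates the same input with the NULL at each index (a hypothetical helper, using `Option<&str>` to stand in for a nullable Arrow string array):

```rust
/// Produce every placement of a single NULL among two non-null values,
/// so a grouping test can be run once per placement instead of always
/// putting the NULL in the middle.
fn null_permutations(vals: [&str; 2]) -> Vec<Vec<Option<&str>>> {
    (0..3)
        .map(|null_pos| {
            // Start from the non-null values, then insert the NULL
            // at position 0, 1, or 2.
            let mut row: Vec<Option<&str>> = vals.iter().copied().map(Some).collect();
            row.insert(null_pos, None);
            row
        })
        .collect()
}
```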

@Dandandan

I had a good look and think all looks GREAT!


@jorgecarleitao jorgecarleitao left a comment


:shipit: Amazing work and results 🥇. Thanks a lot for this, @alamb !


alamb commented Aug 10, 2021

Thanks @Dandandan and @jorgecarleitao -- I plan to merge #812 in first and leave this one open for another few days in case anyone else wants to comment.

@alamb alamb force-pushed the alamb/gby_null_new branch from 0051a85 to c5bc0c1 Compare August 11, 2021 08:46

alamb commented Aug 11, 2021

Rebased now that #812 has been merged


Dandandan commented Aug 11, 2021

A TPC-H query that got quite a bit faster is q13: on Parquet at SF=100, it went from 37.8s to 29.5s.

@Dandandan Dandandan merged commit fa3f099 into apache:master Aug 12, 2021
@Dandandan

Thanks @alamb ! 🎉 🎉 🎉 🎉


alamb commented Aug 12, 2021

Thanks everyone for all the help. This was a very cool experience of collaborative development for me

@alamb alamb deleted the alamb/gby_null_new branch August 12, 2021 19:51

alamb commented Aug 14, 2021 via email

@Dandandan

Hashbrown already implements many tricks like this, I believe; it's one of the fastest hash table implementations:
https://docs.rs/hashbrown/0.11.2/hashbrown/hash_map/index.html

There is also a nightly RawTable API to retrieve multiple values at once, get_each_mut, which might be a bit faster.

So far, in profiling results, I haven't seen the probing/hashmap itself being a very expensive part. AFAIK it's mostly other parts that could be optimized: updating the states/values, collision checks, converting to arrays, creating hash values, the actual sum over the array, etc.
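The "collision checks" mentioned above arise because two distinct keys can share a hash, so any probe that matches on hash must still verify full key equality before reusing a group. A toy sketch of that pattern (illustrative names only — this is neither hashbrown's nor DataFusion's code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal group table that stores (hash, group index) pairs and keeps
/// the full keys in a side vector. A real table would bucket the slots;
/// a linear scan keeps the collision check easy to see.
struct TinyGroupTable {
    slots: Vec<(u64, usize)>, // (key hash, index into `keys`)
    keys: Vec<String>,
}

impl TinyGroupTable {
    fn new() -> Self {
        Self { slots: Vec::new(), keys: Vec::new() }
    }

    fn hash(key: &str) -> u64 {
        let mut s = DefaultHasher::new();
        key.hash(&mut s);
        s.finish()
    }

    /// Return the group index for `key`, inserting a new group if absent.
    fn get_or_insert(&mut self, key: &str) -> usize {
        let h = Self::hash(key);
        for &(slot_hash, idx) in &self.slots {
            // Cheap hash comparison first; full key equality runs only
            // on a hash match -- that equality test is the "collision
            // check" whose cost shows up in profiles.
            if slot_hash == h && self.keys[idx] == key {
                return idx;
            }
        }
        let idx = self.keys.len();
        self.keys.push(key.to_string());
        self.slots.push((h, idx));
        idx
    }
}
```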

Labels
datafusion Changes in the datafusion crate performance Make DataFusion faster

Successfully merging this pull request may close these issues.

Rework GroupByHash for faster performance and support grouping by nulls
4 participants