-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Excessive loop unrolling and forced inlining causes performance issues in sort with custom comparators #754
Comments
Thanks for the detailed write up! I'll be expanding our benchmarking / performance regression suite over the next few releases. It sounds like we should benchmark the sort algorithms with a variety of comparators so we can tune this and monitor for regressions afterwards. |
We need to benchmark removing |
For posterity, I want to document that there is an easy way to work around this problem. You can annotate your custom comparator with a From the original example,
https://godbolt.org/z/Txe4Penc8 We should still explore more robust options for allowing users to configure and control this behavior, but I think this should unblock many who run into similar problems. |
In the libcudf component of RAPIDS we have a sort API that calls
thrust::sort
andthrust::stable_sort
using a custom comparator for columns of data.Reference libcudf calling sort/stable_sort: https://github.com/rapidsai/cudf/blob/branch-0.17/cpp/src/sort/sort_impl.cuh
As we add more column data types, the row-comparator gets a little more complex and the compile time and code size has increased dramatically. The problem appears to be the aggressive inlining of calls to the comparator as can be seen in this simple godbolt example: https://godbolt.org/z/hhachG (Note: I used
std::sqrt()
here only to illustrate how many times the comparator is inlined).Tracing through the source I found some
#pragma unroll
statements inthrust/system/cuda/sort.h
like the following: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L111-L116I believe the
ITEMS_PER_THREAD
value is ~10 normally but there are a couple other unrolls including this one: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L353-L354where
BLOCK_THREADS
I believe is > 100 (e.g. 128, 256, or 512 as far as I can tell).In general, small comparators here have no issue and these
#pragma unroll
statements likely provide a performance boost. But as the comparator size increases, the code size increases and performance actually starts to suffer.Using the godbolt example above, I created 10 different programs each with slightly bigger comparators (just adding more
std::sqrt()
calls). I measured compile time, file size and execution time with originalsort.h
against a modifiedsort.h
with some of unroll statements disabled (on just for-loops that called the comparator).The execution time was measured using
nsys
and captured the call tothrust::sort
followed by a call tocudaStreamSynchronize(0)
.Note that simple comparators will run faster (bottom-left of the last graph above) but even just a few extra statements in the comparator can cause it to run slower with unrolled for-loops.
The text was updated successfully, but these errors were encountered: