Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excessive loop unrolling and forced inlining causes performance issues in sort with custom comparators #754

Open
davidwendt opened this issue Nov 20, 2020 · 3 comments
Labels
thrust For all items related to Thrust.

Comments

@davidwendt
Copy link
Contributor

In the libcudf component of RAPIDS we have a sort API that calls thrust::sort and thrust::stable_sort using a custom comparator for columns of data.
Reference libcudf calling sort/stable_sort: https://github.com/rapidsai/cudf/blob/branch-0.17/cpp/src/sort/sort_impl.cuh

As we add more column data types, the row-comparator gets a little more complex and the compile time and code size has increased dramatically. The problem appears to be the aggressive inlining of calls to the comparator as can be seen in this simple godbolt example: https://godbolt.org/z/hhachG (Note: I used std::sqrt() here only to illustrate how many times the comparator is inlined).

Tracing through the source I found some #pragma unroll statements in thrust/system/cuda/sort.h like the following: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L111-L116

#pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
    {
      bool p = (keys2_beg < keys2_end) &&
               ((keys1_beg >= keys1_end) ||
                compare_op(key2,key1));
...

I believe the ITEMS_PER_THREAD value is ~10 normally but there are a couple other unrolls including this one: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L353-L354

#pragma unroll
        for (int coop = 2; coop <= BLOCK_THREADS; coop *= 2)

where BLOCK_THREADS I believe is > 100 (e.g. 128, 256, or 512 as far as I can tell).

In general, small comparators here have no issue and these #pragma unroll statements likely provide a performance boost. But as the comparator size increases, the code size increases and performance actually starts to suffer.

Using the godbolt example above, I created 10 different programs each with slightly bigger comparators (just adding more std::sqrt() calls). I measured compile time, file size and execution time with original sort.h against a modified sort.h with some of unroll statements disabled (on just for-loops that called the comparator).

unrolled_compile_time

unrolled_file_size

unrolled_measured

The execution time was measured using nsys and captured the call to thrust::sort followed by a call to cudaStreamSynchronize(0).

  nvtxRangePushA("mysort");
  thrust::sort( thrust::device, d_vin.begin(), d_vin.end(), comparator{});
  cudaStreamSynchronize(0);
  nvtxRangePop(); 

Note that simple comparators will run faster (bottom-left of the last graph above) but even just a few extra statements in the comparator can cause it to run slower with unrolled for-loops.

@alliepiper
Copy link
Collaborator

Thanks for the detailed write up! I'll be expanding our benchmarking / performance regression suite over the next few releases. It sounds like we should benchmark the sort algorithms with a variety of comparators so we can tune this and monitor for regressions afterwards.

@brycelelbach
Copy link
Collaborator

We need to benchmark removing __forceinline__ and #pragma unroll from everywhere in CUB.

@jrhemstad
Copy link
Collaborator

For posterity, I want to document that there is an easy way to work around this problem.

You can annotate your custom comparator with a __noinline__ which will circumvent CUB/Thrust's overuse of unrolling/forceinling.

From the original example,

struct comparator {
  __noinline__ __device__ bool operator()(double lhs, double rhs)
  {
    return std::sqrt(lhs+rhs) < 1.0;
  }
};

https://godbolt.org/z/Txe4Penc8

We should still explore more robust options for allowing users to configure and control this behavior, but I think this should unblock many who run into similar problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
thrust For all items related to Thrust.
Projects
Status: Todo
Development

Successfully merging a pull request may close this issue.

5 participants