Excessive loop unrolling and forced inlining causes performance issues in sort with custom comparators #754

davidwendt · 2020-11-20T21:13:51Z

In the libcudf component of RAPIDS we have a sort API that calls thrust::sort and thrust::stable_sort using a custom comparator for columns of data.
Reference libcudf calling sort/stable_sort: https://github.com/rapidsai/cudf/blob/branch-0.17/cpp/src/sort/sort_impl.cuh

As we add more column data types, the row comparator gets more complex, and the compile time and code size have increased dramatically. The problem appears to be the aggressive inlining of calls to the comparator, as can be seen in this simple godbolt example: https://godbolt.org/z/hhachG (Note: I used std::sqrt() here only to illustrate how many times the comparator is inlined).

Tracing through the source, I found some #pragma unroll statements in thrust/system/cuda/detail/sort.h like the following: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L111-L116

#pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
    {
      bool p = (keys2_beg < keys2_end) &&
               ((keys1_beg >= keys1_end) ||
                compare_op(key2,key1));
...

I believe the ITEMS_PER_THREAD value is ~10 normally but there are a couple other unrolls including this one: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L353-L354

#pragma unroll
        for (int coop = 2; coop <= BLOCK_THREADS; coop *= 2)

where BLOCK_THREADS I believe is > 100 (e.g. 128, 256, or 512 as far as I can tell).
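
As a quick check on how far this second loop can expand: coop doubles on every iteration, so the trip count is log2(BLOCK_THREADS) rather than BLOCK_THREADS itself. The helper below is a hypothetical illustration (its name is an assumption and it is not part of Thrust); it only reproduces the loop header quoted above and counts iterations.

```cpp
// Hypothetical helper, not part of Thrust. It counts the iterations of
//   for (int coop = 2; coop <= BLOCK_THREADS; coop *= 2)
// which is the most a complete unroll of that loop could expand to.
constexpr int coop_loop_trip_count(int block_threads)
{
  int trips = 0;
  for (int coop = 2; coop <= block_threads; coop *= 2)
  {
    ++trips;
  }
  return trips;
}

// Block sizes mentioned in the issue text.
static_assert(coop_loop_trip_count(128) == 7, "128 threads: 7 iterations");
static_assert(coop_loop_trip_count(256) == 8, "256 threads: 8 iterations");
static_assert(coop_loop_trip_count(512) == 9, "512 threads: 9 iterations");
```

So for the block sizes guessed above, this loop contributes on the order of 7 to 9 unrolled iterations, not hundreds. How many comparator call sites each iteration contains depends on the loop body, which this sketch does not model.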

In general, small comparators here have no issue and these #pragma unroll statements likely provide a performance boost. But as the comparator size increases, the code size increases and performance actually starts to suffer.
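
To make the code-size argument concrete, here is a host-only sketch of a serial merge with the same shape as the loop quoted earlier. It is not Thrust's implementation: the function name, signature, and bounds handling are assumptions chosen so the example runs as plain C++. The point is only that each iteration contains one compare_op call site, so a fully unrolled loop with a force-inlined comparator carries ITEMS copies of the comparator body.

```cpp
#include <cassert>

// Host-only sketch, NOT Thrust's code. Merges two sorted runs into exactly
// ITEMS outputs, one output (and one compare_op call site) per iteration.
template <int ITEMS, typename T, typename Compare>
void serial_merge_sketch(const T* keys1, int n1,
                         const T* keys2, int n2,
                         T (&output)[ITEMS], Compare compare_op)
{
  assert(n1 + n2 >= ITEMS);  // enough input to fill every output slot
  int i1 = 0;
  int i2 = 0;
  // In the CUDA code quoted above, "#pragma unroll" sits directly on top of
  // the equivalent loop.
  for (int item = 0; item < ITEMS; ++item)
  {
    // Take from the second run if it still has keys and either the first run
    // is exhausted or the second run's key orders before the first run's key.
    bool p = (i2 < n2) && ((i1 >= n1) || compare_op(keys2[i2], keys1[i1]));
    output[item] = p ? keys2[i2++] : keys1[i1++];
  }
}
```

With ITEMS around 10, a comparator that compiles to a few instructions is cheap to replicate, while a large row comparator is replicated just as many times, which is consistent with the code-size growth described here.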

Using the godbolt example above, I created 10 different programs, each with a slightly bigger comparator (just adding more std::sqrt() calls). I measured compile time, file size, and execution time with the original sort.h against a modified sort.h with some of the unroll statements disabled (only on the for-loops that call the comparator).

The execution time was measured using nsys, over a range that covers the call to thrust::sort followed by a call to cudaStreamSynchronize(0).

  nvtxRangePushA("mysort");
  thrust::sort( thrust::device, d_vin.begin(), d_vin.end(), comparator{});
  cudaStreamSynchronize(0);
  nvtxRangePop();

Note that simple comparators run faster with the unrolled loops (bottom-left of the last graph above), but even just a few extra statements in the comparator can cause the sort to run slower with unrolled for-loops.

alliepiper · 2020-11-20T21:40:33Z

Thanks for the detailed write-up! I'll be expanding our benchmarking / performance regression suite over the next few releases. It sounds like we should benchmark the sort algorithms with a variety of comparators so we can tune this and monitor for regressions afterwards.

brycelelbach · 2020-12-04T22:45:18Z

We need to benchmark removing __forceinline__ and #pragma unroll from everywhere in CUB.

jrhemstad · 2022-06-23T17:22:13Z

For posterity, I want to document that there is an easy way to work around this problem.

You can annotate your custom comparator with __noinline__, which will circumvent CUB/Thrust's overuse of unrolling and force-inlining.

From the original example,

struct comparator {
  __noinline__ __device__ bool operator()(double lhs, double rhs)
  {
    return std::sqrt(lhs+rhs) < 1.0;
  }
};

https://godbolt.org/z/Txe4Penc8
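
For anyone who wants to try the same annotation in code that is also built without nvcc, the following is a minimal sketch under stated assumptions. The SKETCH_* macros, the row_less_sketch functor, and sort_by_magnitude are hypothetical names introduced only for this illustration, and the host fallback assumes a GCC- or Clang-compatible compiler. The comparator body is a stand-in, not the one from the original example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumption: hypothetical helper macros so the same functor compiles with
// nvcc and with a host-only GCC/Clang build.
#if defined(__CUDACC__)
#define SKETCH_HOST_DEVICE __host__ __device__
#define SKETCH_NOINLINE __noinline__
#else
#define SKETCH_HOST_DEVICE
#define SKETCH_NOINLINE __attribute__((noinline))
#endif

// Comparator kept out of line: each call site in an unrolled loop should then
// be a function call rather than another copy of the comparator body.
struct row_less_sketch
{
  SKETCH_NOINLINE SKETCH_HOST_DEVICE bool operator()(double lhs, double rhs) const
  {
    return lhs * lhs < rhs * rhs;  // stand-in body: order by magnitude
  }
};

// Host-side usage with std::sort. On the device, the same functor would be
// passed as the comparator argument of thrust::sort, as in the issue's example.
inline std::vector<double> sort_by_magnitude(std::vector<double> values)
{
  std::sort(values.begin(), values.end(), row_less_sketch{});
  return values;
}
```

Whether the out-of-line call costs more at run time than it saves in code size depends on the comparator, so it is worth measuring both variants, as was done earlier in this issue.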

We should still explore more robust options for allowing users to configure and control this behavior, but I think this should unblock many who run into similar problems.

alliepiper changed the title ~~Performance issue with sort and stable_sort with custom comparators~~ Excessive loop unrolling and forced inlining causes performance issues in sort with custom comparators Dec 7, 2020

alliepiper mentioned this issue Dec 7, 2020

warning: loop not unrolled in dispatch_radix_sort.cuh NVIDIA/cub#246

Closed

davidwendt mentioned this issue Dec 9, 2020

[FEA] Disable pragma unroll in select places in thrust sort.h rapidsai/cudf#6955

Closed

alliepiper mentioned this issue Jan 21, 2021

(cudaErrorInvalidDevice) when trying to perform a thrust::reduce NVIDIA/thrust#1371

Closed

alliepiper assigned gevtushenko May 18, 2021

This was referenced May 25, 2021

Add custom comparator benchmark for thrust sort alliepiper/thrust_benchmark#1

Merged

Remove extra loop unrolling in merge sort NVIDIA/thrust#1441

Closed

alliepiper unassigned gevtushenko Feb 24, 2022

lroberts36 mentioned this issue Jun 6, 2022

test_unit_sort.cpp compilation hangs on some platforms parthenon-hpc-lab/parthenon#672

Closed

davidwendt mentioned this issue Aug 4, 2022

Workaround for groupyby-min/max compile-time issue with thrust-1.17 rapidsai/cudf#11467

Closed

jrhemstad mentioned this issue Feb 20, 2023

Remove pragma unroll from device radix sort, thread reduce, histogram, radix rank, select if and block exchange NVIDIA/cub#315

Closed

jrhemstad added the thrust For all items related to Thrust. label Feb 22, 2023

github-project-automation bot added this to CCCL Nov 8, 2023

github-project-automation bot moved this to Todo in CCCL Nov 8, 2023

jarmak-nv transferred this issue from NVIDIA/thrust Nov 8, 2023

gevtushenko mentioned this issue Jan 28, 2025

[FEA]: Redesign default tuning #3570

Open
