This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Remove pragma unroll from device radix sort, thread reduce, histogram, radix rank, select if and block exchange #315


gevtushenko
Collaborator

  1. For reduce with simple operators, there is no difference in performance or compilation time. With complex operators (256 sqrt calls), removing pragma unroll makes compilation up to 2.4x faster and gives a runtime speedup of about 2x.
  2. There is no difference in performance or compilation time for the radix sort.
  3. There is no difference in performance or compilation time for the histogram.
  4. Removing pragma unroll causes a 20% slowdown for scan with some complex operators, so it's better to leave pragma unroll there.
  5. For the select_if algorithm with simple operators, there is no difference in performance or compilation time. With complex operators (128 sqrt calls), removing pragma unroll gives a 3.2x compilation time improvement and a 1.6x runtime speedup.
  6. There is no difference in performance for the default block exchange algorithms. The time-sliced versions, however, slow down significantly (2x) without pragma unroll, so I've left pragma unroll in the time-sliced versions only.
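
For readers unfamiliar with the change, here is a minimal sketch of the kind of loop being compared above. It is not taken from the CUB sources: `reduce_unrolled`, `reduce_plain`, `SumOp`, and the `SKETCH_*` macros are hypothetical names used only for illustration. The macro emits `#pragma unroll` only for compilers assumed to accept it (nvcc, clang) and expands to nothing elsewhere, so the snippet also builds as plain host C++.

```cpp
// Hypothetical helper macros (not part of CUB).
#if defined(__CUDACC__) || defined(__clang__)
#  define SKETCH_PRAGMA_UNROLL _Pragma("unroll")
#else
#  define SKETCH_PRAGMA_UNROLL
#endif

#if defined(__CUDACC__)
#  define SKETCH_HOST_DEVICE __host__ __device__
#else
#  define SKETCH_HOST_DEVICE
#endif

// Illustrative "simple operator".
struct SumOp
{
  template <typename T>
  SKETCH_HOST_DEVICE T operator()(T a, T b) const
  {
    return a + b;
  }
};

// Per-thread reduction with an explicit unroll request on the loop.
template <int N, typename T, typename Op>
SKETCH_HOST_DEVICE T reduce_unrolled(const T (&items)[N], Op op)
{
  T acc = items[0];
  SKETCH_PRAGMA_UNROLL
  for (int i = 1; i < N; ++i)
  {
    acc = op(acc, items[i]);
  }
  return acc;
}

// Same loop without the pragma: the compiler's own heuristics decide
// whether (and how far) to unroll.
template <int N, typename T, typename Op>
SKETCH_HOST_DEVICE T reduce_plain(const T (&items)[N], Op op)
{
  T acc = items[0];
  for (int i = 1; i < N; ++i)
  {
    acc = op(acc, items[i]);
  }
  return acc;
}
```

Both functions compute the same result; the only difference is whether unrolling is requested explicitly. When the operator body is large, a forced unroll replicates it once per iteration, which is one plausible reason for the compile-time differences reported in the list.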

@gevtushenko gevtushenko requested a review from alliepiper June 1, 2021 11:08
@alliepiper
Collaborator

alliepiper commented Jun 1, 2021

LGTM -- can you add a comment to the pragmas that turned out to help performance, noting what situations they help with? This should be ready to test then.

@alliepiper alliepiper added this to the 1.13.0 milestone Jun 1, 2021
@alliepiper alliepiper modified the milestones: 1.13.0, 1.14.0 Jun 11, 2021
@alliepiper alliepiper removed this from the 1.14.0 milestone Aug 17, 2021
@jrhemstad
Collaborator

@senior-zero do you think we can close this PR?

From offline discussion re: NVIDIA/cccl#754, I believe we would like to come up with a more flexible user-controlled tuning API that would allow users to control the level of unrolling themselves.

@gevtushenko
Collaborator Author

@jrhemstad removing pragma unroll unconditionally leads to unpredictable performance changes in some corner cases. We would indeed like to go with a user-controlled unrolling API. Closing this PR.
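
As a rough illustration of what "user-controlled unrolling" could mean, the sketch below lets the caller pick a policy tag. This is not the API discussed in NVIDIA/cccl#754 and not an existing CUB interface: `unroll_default`, `unroll_full`, `thread_reduce_sketch`, and `PlusOp` are hypothetical names. It also swaps in a different technique from `#pragma unroll`: full unrolling is approximated with compile-time template recursion so that the example stays portable C++.

```cpp
// Hypothetical policy tags chosen by the caller.
struct unroll_default {}; // leave the decision to the compiler
struct unroll_full {};    // expand the loop at compile time

namespace detail
{
// Applies op to items[I], items[I + 1], ..., items[N - 1] by recursion.
template <int I, int N>
struct unrolled_step
{
  template <typename T, typename Op>
  static T apply(const T (&items)[N], T acc, Op op)
  {
    return unrolled_step<I + 1, N>::apply(items, op(acc, items[I]), op);
  }
};

// Recursion ends once I reaches N.
template <int N>
struct unrolled_step<N, N>
{
  template <typename T, typename Op>
  static T apply(const T (&)[N], T acc, Op)
  {
    return acc;
  }
};
} // namespace detail

// Illustrative operator.
struct PlusOp
{
  template <typename T>
  T operator()(T a, T b) const
  {
    return a + b;
  }
};

// Plain loop: unrolling is left to the compiler's heuristics.
template <int N, typename T, typename Op>
T thread_reduce_sketch(const T (&items)[N], Op op, unroll_default)
{
  T acc = items[0];
  for (int i = 1; i < N; ++i)
  {
    acc = op(acc, items[i]);
  }
  return acc;
}

// Recursive expansion: one op call per element, no runtime loop.
template <int N, typename T, typename Op>
T thread_reduce_sketch(const T (&items)[N], Op op, unroll_full)
{
  return detail::unrolled_step<1, N>::apply(items, items[0], op);
}
```

A caller would write `thread_reduce_sketch(items, PlusOp{}, unroll_full{})` or pass `unroll_default{}` instead, which keeps the choice at the call site rather than hard-coding it in the library.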
