[EPIC] Thrust large input support #49

jrhemstad · 2023-04-21T17:17:37Z

As a higher-level interface, Thrust should provide behavior that is safe by default. This means every Thrust algorithm should work with large inputs with no user intervention required.

This will likely involve querying the extent of the input range and dynamically dispatching to a code path that uses a 4B or 8B offset type.

The dynamic dispatch will require always instantiating at least two code paths for a 4B and 8B offset type. This will increase compile time, and therefore we should provide users with a means to explicitly choose an offset type.
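The dispatch described above could look roughly like the following host-only sketch. It is not Thrust's implementation: `required_offset_bytes`, `sum_impl`, and `sum_dispatch` are hypothetical names, and a plain loop stands in for a device kernel. It only illustrates why both offset-type instantiations must exist at compile time even though one is chosen at run time.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: offset width (in bytes) needed to index `num_items` elements.
inline int required_offset_bytes(std::int64_t num_items) {
  return num_items <= std::numeric_limits<std::int32_t>::max() ? 4 : 8;
}

// Hypothetical stand-in for a kernel templated on its offset type.
template <typename OffsetT, typename It>
long long sum_impl(It first, OffsetT n) {
  long long total = 0;
  for (OffsetT i = 0; i < n; ++i) {
    total += first[i];
  }
  return total;
}

// Queries the extent of the range and picks the code path at run time.
// Both sum_impl<int32_t> and sum_impl<int64_t> are instantiated here,
// which is the source of the extra compile time.
template <typename It>
long long sum_dispatch(It first, It last) {
  const std::int64_t n = static_cast<std::int64_t>(last - first);
  assert(n >= 0);
  if (required_offset_bytes(n) == 4) {
    return sum_impl<std::int32_t>(first, static_cast<std::int32_t>(n));
  }
  return sum_impl<std::int64_t>(first, n);
}
```

Small inputs take the 4B path; only ranges longer than `INT32_MAX` elements pay for 8B offsets.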

We should consider enabling users to configure the offset type at a global scope (e.g., through a preprocessor definition) as well as at a per-algorithm scope. For per-algorithm specification, a likely solution is to allow controlling the offset type via the execution policy, e.g., by adding an `offset_type` member type definition.
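One way the two configuration scopes might fit together is sketched below, assuming C++17 for `std::void_t`. Every name here is hypothetical and not part of Thrust: `MY_DEFAULT_OFFSET_TYPE` stands in for a global preprocessor definition, `fixed_offset_policy` for an execution policy carrying an `offset_type` member, and `offset_type_of_t` for the trait an algorithm would consult. For simplicity the fallback is a fixed type, whereas the design in this issue would presumably fall back to dynamic dispatch.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical global override; a user could define this before including the header.
#ifndef MY_DEFAULT_OFFSET_TYPE
#define MY_DEFAULT_OFFSET_TYPE std::int64_t
#endif

// Hypothetical policy without an explicit offset type.
struct default_policy {};

// Hypothetical policy that pins the offset type for a single algorithm call.
template <typename OffsetT>
struct fixed_offset_policy {
  using offset_type = OffsetT;
};

// Primary template: the policy has no offset_type member, so use the global default.
template <typename Policy, typename = void>
struct offset_type_of {
  using type = MY_DEFAULT_OFFSET_TYPE;
};

// Specialization: the policy provides offset_type, so the per-algorithm choice wins.
template <typename Policy>
struct offset_type_of<Policy, std::void_t<typename Policy::offset_type>> {
  using type = typename Policy::offset_type;
};

template <typename Policy>
using offset_type_of_t = typename offset_type_of<Policy>::type;
```

An algorithm that receives a policy of type `P` would then instantiate only the code path for `offset_type_of_t<P>`, avoiding the second instantiation.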

[FEA]: Add Thrust build option to disable dynamic offset type dispatch #1958
Identify Thrust algorithms that do/do not support large inputs
Determine and finalize design for large input support in Thrust

Existing Thrust issues:


@stonea

Addresses Step 1 in #23324

Resolves Cray/chapel-private#5513

This PR adds several `gpu*Reduce` functions to perform whole-array reductions for arrays that are allocated in GPU memory. The functions added cover Chapel's default reductions and they are named:

- `gpuSumReduce`
- `gpuMinReduce`
- `gpuMaxReduce`
- `gpuMinLocReduce`
- `gpuMaxLocReduce`

### NVIDIA implementation

This is done by wrapping CUB: https://nvlabs.github.io/cub/. CUB is a C++ header-only template library. We wrap it with some macro magic in the runtime. This currently increases runtime build time by quite a bit. We might consider wrapping the functions from the library in non-inline helpers that can help a bit.

### AMD implementation

AMD has hipCUB: https://rocm.docs.amd.com/projects/hipCUB/en/latest/ and a slightly lower-level, AMD-only rocPRIM: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/. I couldn't get either to work. I can't run a simple HIP reproducer based off of one of their tests. I might be doing something wrong in compilation, but what I am getting is a segfault in the launched kernel (or `hipLaunchKernel`). I filed ROCm/hipCUB#304, but haven't received a response quick enough to address in this PR.

This is really unfortunate, but maybe we'll have a better/native reduction support soon and we can cover AMD there, too.

### Implementation details:

- `chpl_gpu_X_reduce_Y` functions are added to the main runtime interface via macros. Here, X is the reduction kind, Y is the data type. This one prints debugging output, finds the stream to run the reduction on and calls its `impl` cousin.
- `chpl_gpu_impl_X_reduce_Y` are added to the implementation layer in a similar fashion.
- These functions are added to `gpu/Z/gpu-Z-reduce.cc` in the runtime where Z is either `nvidia` or `amd`
- AMD versions are mostly "implemented" the way I think they should work, but because of the segfaults that I was getting, they are in the `else` branch of an `#if 1` at the moment.
- The module code has a private `doGpuReduce` that calls the appropriate runtime function for any reduction type. This function has some similarities to how atomics are implemented. Unfortunately the interfaces are different enough that I can't come up with a good way to refactor some of the helpers. All the reduction helpers are nested in `doGpuReduce` to avoid confusion.
- To workaround a CUB limitation that prevents reducing arrays whose size is close to `max(int(32))`, the implementation runs the underlying CUB implementation with at most 2B elements at a time and stitches the result on the host, if it ends up calling the implementation multiple times. The underlying issue is captured in:
  - https://github.com/NVIDIA/thrust/issues/1271
  - NVIDIA/cccl#49

### Future work:

- Keep an eye on the AMD bug report
- Implement a fall back when we're ready to run an in-house reduction if the bug remains unresolved.

[Reviewed by @stonea]

### Test Status

- [x] nvidia
- [x] amd
- [x] flat `make check`
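The chunking workaround described in the implementation details above (running the size-limited reduction on a bounded number of elements at a time and combining the partial results on the host) can be sketched in plain C++ as follows. This is an illustration under stated assumptions, not the Chapel runtime code: `limited_sum` is a hypothetical stand-in for a library reduction that only accepts a 32-bit element count, and `chunked_sum` is a hypothetical host-side wrapper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>

// Hypothetical stand-in for a reduction whose element count is limited to 32 bits.
inline long long limited_sum(const int* data, std::int32_t count) {
  return std::accumulate(data, data + count, 0LL);
}

// Reduces `n` elements by calling the limited reduction on chunks of at most
// `max_chunk` elements and combining the partial sums on the host.
inline long long chunked_sum(const int* data, std::int64_t n, std::int32_t max_chunk) {
  assert(max_chunk > 0);
  long long total = 0;
  for (std::int64_t begin = 0; begin < n; begin += max_chunk) {
    const auto count =
        static_cast<std::int32_t>(std::min<std::int64_t>(max_chunk, n - begin));
    total += limited_sum(data + begin, count);
  }
  return total;
}
```

When the input fits in a single chunk the loop runs once, so no host-side combining happens. For min/max-with-location reductions the combining step would also need to translate each chunk-local index back into a global one.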

jrhemstad mentioned this issue Apr 21, 2023

[THEME] Universal 64-bit index type support in Thrust/CUB algorithms #47

Open

3 tasks

jrhemstad changed the title ~~Determine and finalize design for large input support in Thrust~~ Thrust large input support Apr 21, 2023

github-project-automation bot added this to CCCL Jun 28, 2023

github-project-automation bot moved this to Todo in CCCL Jun 28, 2023

e-kayrakli mentioned this issue Oct 26, 2023

Add initial support for whole-array reduction on NVIDIA GPUs chapel-lang/chapel#23689

Merged

3 tasks

elstehle changed the title ~~Thrust large input support~~ [EPIC] Thrust large input support Dec 3, 2024

elstehle mentioned this issue Jan 20, 2025

Uses unsigned offset types in thrust's sort algorithm calling into DispatchMergeSort #3437

Merged
