[FEA] JIT Compilation of CUDF Kernels #17399

Open
lamarrr opened this issue Nov 21, 2024 · 5 comments
Labels
feature request New feature or request

Comments

@lamarrr
Contributor

lamarrr commented Nov 21, 2024

Is your feature request related to a problem? Please describe.
As described in #15366, we intend to adopt JIT compilation for some of our kernels using JITify/NVRTC.
JITify presently supports compiling device code only, which means we can't mix host code with it as we would with NVCC. Some important headers are not supported, namely:

  • <stdexcept>
  • <atomic> (replace with cuda/std/atomic)

Describe the solution you'd like
We'd need to:

  • Separate headers required in kernel code into device-only and host-only code headers
  • Patch Thrust and RMM headers to separate their headers into device-only code headers
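
As a rough illustration of the split proposed above, the sketch below keeps a device-safe function free of host-only headers and moves exception handling into a host-only wrapper. It is shown as a single file for brevity. The names `sketch`, `SKETCH_HOST_DEVICE`, `checked_divide`, and `divide_or_throw` are hypothetical and not part of cudf, and in practice the two parts would live in separate headers.

```cpp
// ---- device-only part (would live in its own header) ----
// No <stdexcept>, <atomic>, or other host-only includes here, so the same
// text could be handed to a device-only compiler.
#if defined(__CUDACC__)
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE  // plain C++ build: no CUDA qualifiers
#endif

namespace sketch {

// Reports failure through the return value instead of throwing.
SKETCH_HOST_DEVICE inline bool checked_divide(int num, int den, int* out)
{
  if (den == 0) { return false; }
  *out = num / den;
  return true;
}

}  // namespace sketch

// ---- host-only part (would live in a separate header) ----
#include <stdexcept>

namespace sketch {

// Host wrapper: translates the device-safe status into an exception.
inline int divide_or_throw(int num, int den)
{
  int result = 0;
  if (!checked_divide(num, den, &result)) {
    throw std::invalid_argument("division by zero");
  }
  return result;
}

}  // namespace sketch
```

Only the first part would be visible to the JIT compiler; the wrapper stays on the host side and is free to include whatever it needs.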

Describe alternatives you've considered

  • Using macros to separate device-only code from host code. This doesn't work and would have been very brittle
  • Depending on NVCC for PTX and loading it into the driver at runtime
@lamarrr lamarrr added the feature request New feature or request label Nov 21, 2024
@jrhemstad
Contributor

> Patch Thrust and RMM headers to separate their headers into device-only code headers

Please no. RAPIDS is banned from applying any more patches to Thrust headers 🙂

You should only include RMM or Thrust headers in the host-only headers, so there shouldn't be any need to modify those.

@lamarrr
Contributor Author

lamarrr commented Nov 25, 2024

> Please no. RAPIDS is banned from applying any more patches to Thrust headers 🙂

Even locally within cudf?

> You should only include RMM or Thrust headers in the host-only headers, so there shouldn't be any need to modify those.

Alright, I'll investigate that

@jrhemstad
Contributor

> > Please no. RAPIDS is banned from applying any more patches to Thrust headers 🙂
>
> Even locally within cudf?

Yes. Local patches to CCCL code cause all sorts of problems, as the patch files need to be updated any time any line of code in the patched file changes. CCCL runs RAPIDS in its CI, and patches become a nightmare.

@vyasr
Contributor

vyasr commented Dec 2, 2024

For CCCL in particular, we are aiming to get all the patches we have historically needed upstreamed so that we can rely on CCCL's CI, as Jake mentioned. More generally, we no longer want to ship patched libraries, since doing so causes loads of unrelated potential packaging problems down the line.

@lamarrr
Contributor Author

lamarrr commented Jan 13, 2025

After meeting with @jrhemstad and @robertmaynard, we agreed on the following next steps:

Immediate Exploration: Driver PTX-JIT

We'll first explore driver PTX-JIT compilation (per-module lazy loading) and evaluate its performance overhead and startup cost. This isn't optimal, and we've previously avoided it because we'd be compiling for the lowest-common-denominator architecture (i.e. sm_60) and thus leaving some performance on the table. It would be quick to implement and would be done at the CMake and/or preprocessor level.
If the compilation and runtime overhead for mixed joins is unsatisfactory, we could also try separating the template instantiations into different translation units, then JIT-compiling and lazy-loading each instantiation.
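
The idea of giving each template instantiation its own translation unit can be sketched in standard C++ with explicit instantiation declarations and definitions. This is only an illustration in one file: `sketch::row_predicate` is a hypothetical stand-in, not the mixed-join kernel, and the comments mark where the file boundaries would fall. It covers the translation-unit split only, not the JIT or lazy-loading step.

```cpp
// ---- shared header: template definition plus extern declarations ----
namespace sketch {

template <typename T>
bool row_predicate(T lhs, T rhs)
{
  return lhs < rhs;
}

// Tell every includer not to instantiate these implicitly; each one is
// provided by exactly one translation unit.
extern template bool row_predicate<int>(int, int);
extern template bool row_predicate<double>(double, double);

}  // namespace sketch

// ---- would be row_predicate_int.cpp ----
namespace sketch {
template bool row_predicate<int>(int, int);
}  // namespace sketch

// ---- would be row_predicate_double.cpp ----
namespace sketch {
template bool row_predicate<double>(double, double);
}  // namespace sketch
```

With this layout each instantiation is compiled once, in isolation, so it could in principle be built and loaded independently of the others.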

Metrics to measure:

  • Driver PTX-JIT time
  • Throughput difference
  • Binary size difference
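
One possible way to collect the first of these metrics is to time the first call separately from later calls, since any one-time load or JIT work lands on the first call. The harness below is a minimal sketch in plain C++; `sketch::time_ms`, `sketch::jit_cost`, and `sketch::measure` are hypothetical names, and the callable passed in is assumed to launch the kernel and synchronize before returning.

```cpp
#include <chrono>
#include <utility>

namespace sketch {

// Wall-clock time, in milliseconds, taken by one call to `fn`.
template <typename Fn>
double time_ms(Fn&& fn)
{
  auto const start = std::chrono::steady_clock::now();
  std::forward<Fn>(fn)();
  auto const stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

struct jit_cost {
  double first_call_ms;   // includes any one-time load/JIT work
  double steady_call_ms;  // mean of the remaining calls
};

// Runs `fn` once cold, then `repeats` more times, and reports both timings.
template <typename Fn>
jit_cost measure(Fn&& fn, int repeats)
{
  jit_cost result{};
  result.first_call_ms = time_ms(fn);
  double total = 0.0;
  for (int i = 0; i < repeats; ++i) { total += time_ms(fn); }
  result.steady_call_ms = repeats > 0 ? total / repeats : 0.0;
  return result;
}

}  // namespace sketch
```

The difference between `first_call_ms` and `steady_call_ms` gives a rough estimate of the one-time cost; kernel launches are asynchronous, so without the synchronization assumed above the numbers would only reflect launch overhead.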

Future Exploration: LTO-IR

LTO-IR is described in the CUDA parallel developer overview and the NVIDIA Developer Blog. Using LTO-IR has the advantage that we can use the full instruction set of the target architecture while also avoiding the costs associated with driver PTX-JIT.
