[FEA] Expand JIT functionality in libcudf #18023

Open
GregoryKimball opened this issue Feb 16, 2025 · 1 comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments


GregoryKimball commented Feb 16, 2025

Is your feature request related to a problem? Please describe.
There are some areas where JIT-compiled kernels can provide performance improvements over existing libcudf functions.

Please note that this issue is focused on CUDA C++ features in libcudf that use Jitify and NVRTC, rather than cuDF-python features that use Numba to generate PTX from user-defined Python functions.

JIT transforms, JIT projection expressions
JIT transforms, or UDF (user-defined function) transforms, can fuse multiple binary ops or function calls into a single kernel. This eliminates the materialization of intermediates and, for complex expressions, can lead to significant speedups. We've written a custom "polynomials" benchmark in #17695 that shows >10x speedup for JIT-compiled kernels versus binary ops and AST (abstract syntax tree) implementations.
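To make the "materialization of intermediates" point concrete, here is a minimal host-side C++ sketch (not libcudf code, and not the benchmark from #17695). The function names `poly_unfused` and `poly_fused` are hypothetical. The first evaluates `a*x^2 + b*x + c` as a chain of separate element-wise passes, each writing a full-size temporary, the way a sequence of binary ops would. The second does the same arithmetic in one pass, which is what a fused JIT kernel would do per thread.

```cpp
#include <cstddef>
#include <vector>

// Unfused: every binary op is its own pass and writes a full-size temporary.
// `intermediates`, if provided, receives the number of temporaries created.
std::vector<double> poly_unfused(const std::vector<double>& x, double a, double b, double c,
                                 std::size_t* intermediates = nullptr) {
  const std::size_t n = x.size();
  std::vector<double> x2(n), ax2(n), bx(n), sum(n), out(n);
  for (std::size_t i = 0; i < n; ++i) x2[i] = x[i] * x[i];      // x * x
  for (std::size_t i = 0; i < n; ++i) ax2[i] = a * x2[i];       // a * (x * x)
  for (std::size_t i = 0; i < n; ++i) bx[i] = b * x[i];         // b * x
  for (std::size_t i = 0; i < n; ++i) sum[i] = ax2[i] + bx[i];  // a*x^2 + b*x
  for (std::size_t i = 0; i < n; ++i) out[i] = sum[i] + c;      // final result
  if (intermediates != nullptr) *intermediates = 4;             // x2, ax2, bx, sum
  return out;
}

// Fused: one pass, no temporaries; each element is read once and written once.
std::vector<double> poly_fused(const std::vector<double>& x, double a, double b, double c) {
  std::vector<double> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = a * (x[i] * x[i]) + b * x[i] + c;
  }
  return out;
}
```

On a GPU, each unfused pass would be a separate kernel launch plus a round trip through global memory for the temporary, so the gap should grow with the number of ops in the expression.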

JIT aggregation
JIT aggregations, or UDAFs (user defined aggregation functions), can be used to complete complex transformations on the groups of a groupby aggregation. libcudf supports both CUDA and PTX aggregation kinds.

Some examples of UDAFs could include "compute score" with additional flexibility for feature engineering. Here are some "compute score" examples from the archived TorchArrow project.

To support some of these functions, the user might create a struct column that contains a list of ids, a list of targets, and a score per target. Ref: https://pytorch.org/torcharrow/beta/functional.html

get_score_sum | Return the sum of all the scores in matching_id_scores that have a corresponding id in matching_ids that is also in input_ids.
get_score_min | Return the min among all the scores in matching_id_scores that have a corresponding id in matching_ids that is also in input_ids.
get_score_max | Return the max among all the scores in matching_id_scores that have a corresponding id in matching_ids that is also in input_ids.
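As a rough illustration of what such a UDAF would compute per row or per group, here is a host-side C++ sketch of the three descriptions above. It is not the TorchArrow implementation and not a libcudf API. `ScoreSummary` and `summarize_scores` are hypothetical names, and the handling of repeated ids (each occurrence counts) and of the no-match case (min and max are left at 0 and `matches` is 0) are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical result type: sum, min, and max of the matched scores.
// min and max are only meaningful when matches > 0.
struct ScoreSummary {
  float sum;
  float min;
  float max;
  std::size_t matches;
};

// A score at position i is "matched" when matching_ids[i] also appears in input_ids.
ScoreSummary summarize_scores(const std::vector<std::int64_t>& input_ids,
                              const std::vector<std::int64_t>& matching_ids,
                              const std::vector<float>& matching_id_scores) {
  const std::unordered_set<std::int64_t> wanted(input_ids.begin(), input_ids.end());
  ScoreSummary result{0.0f, 0.0f, 0.0f, 0};
  // Assumption: if the two lists differ in length, only the common prefix is used.
  const std::size_t n = std::min(matching_ids.size(), matching_id_scores.size());
  for (std::size_t i = 0; i < n; ++i) {
    if (wanted.count(matching_ids[i]) == 0) continue;  // id not requested, skip
    const float score = matching_id_scores[i];
    result.sum += score;
    result.min = (result.matches == 0) ? score : std::min(result.min, score);
    result.max = (result.matches == 0) ? score : std::max(result.max, score);
    ++result.matches;
  }
  return result;
}
```

Computing all three in one pass is the kind of fusion a JIT-compiled aggregation could offer, instead of three separate list-filter-then-reduce calls.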

JIT join
Currently libcudf uses mixed_join to fuse a hash join with a post-filter. Mixed joins accept an AST predicate that is applied thread-per-row when the probe table equality keys are found in the build table. Mixed joins have poor warp occupancy due to heavy register pressure, a result of combining hash join and AST expression functionality in a single kernel.

One alternative would be to use code gen to check the post-equality predicate and JIT-compile the resulting kernel.
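A minimal sketch of what the code-gen step could look like, in host-side C++. Everything here is hypothetical: the `Comparison` struct, the `generate_predicate_source` function, and the shape of the emitted `post_filter` device function (including the `Left`/`Right` types and `colN` members) are placeholders for illustration, not an existing libcudf or Jitify interface. The idea is that the predicate becomes straight-line source code that NVRTC can compile and inline, instead of an AST that the kernel interprets at runtime.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one post-equality comparison between a probe
// (left) column and a build (right) column, e.g. left.col0 < right.col1.
struct Comparison {
  int left_col;    // column index in the probe table
  std::string op;  // C++ comparison operator such as "<", ">=", "!="
  int right_col;   // column index in the build table
};

// Emit the source of a device function that ANDs all comparisons together.
// An empty list produces a predicate that always passes.
std::string generate_predicate_source(const std::vector<Comparison>& conjuncts) {
  std::string body;
  for (std::size_t i = 0; i < conjuncts.size(); ++i) {
    if (i > 0) body += " && ";
    body += "(left.col" + std::to_string(conjuncts[i].left_col) + "[l] " + conjuncts[i].op +
            " right.col" + std::to_string(conjuncts[i].right_col) + "[r])";
  }
  if (body.empty()) body = "true";
  return std::string("__device__ bool post_filter(Left const& left, int l, ") +
         "Right const& right, int r) {\n  return " + body + ";\n}\n";
}
```

The generated string would then be handed to the JIT compiler together with the join kernel template, so the compiler sees a fixed predicate and can allocate registers for it directly.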

Improving JIT infrastructure

As part of expanding JIT functionality in libcudf, we will need better tools for tracking JIT-compilation time (NVIDIA/jitify#137). We will also need better tools for JIT cache management such as clearing and pre-populating. Collaboration with Spark-RAPIDS and other partners will be critical for success.
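To sketch the kind of tooling meant here (compile-time tracking, clearing, and pre-populating), here is a small host-side C++ mock. It is not Jitify's cache and not a proposed libcudf interface. `KernelCache` and all of its members are hypothetical, the "compiler" is an injected callback standing in for NVRTC, and keying the cache on the full source string is an assumption made for brevity.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical in-memory JIT cache keyed by kernel source.
class KernelCache {
 public:
  using Compiler = std::function<std::string(const std::string&)>;

  explicit KernelCache(Compiler compile) : compile_(std::move(compile)) {}

  // Return the compiled artifact, compiling (and timing) only on a miss.
  const std::string& get(const std::string& source) {
    auto it = cache_.find(source);
    if (it != cache_.end()) {
      ++hits_;
      return it->second;
    }
    const auto start = std::chrono::steady_clock::now();
    std::string binary = compile_(source);
    compile_time_ += std::chrono::steady_clock::now() - start;
    ++misses_;
    return cache_.emplace(source, std::move(binary)).first->second;
  }

  // Warm the cache ahead of time, e.g. at application startup.
  void prepopulate(const std::vector<std::string>& sources) {
    for (const std::string& s : sources) get(s);
  }

  // Drop all compiled artifacts; counters and timing are kept.
  void clear() { cache_.clear(); }

  std::size_t hits() const { return hits_; }
  std::size_t misses() const { return misses_; }

  // Total wall-clock time spent inside the compiler callback, in seconds.
  double compile_seconds() const {
    return std::chrono::duration<double>(compile_time_).count();
  }

 private:
  Compiler compile_;
  std::unordered_map<std::string, std::string> cache_;
  std::chrono::steady_clock::duration compile_time_{};
  std::size_t hits_ = 0;
  std::size_t misses_ = 0;
};
```

A real version would also need persistence across processes and a key that includes the GPU architecture and compiler flags, which this mock leaves out.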


vyasr commented Feb 18, 2025

Note that we're already discussing large chunks of this idea in #17399 and #15366.
