Is your feature request related to a problem? Please describe.
There are some areas where JIT-compiled kernels can provide performance improvements over existing libcudf functions.
Please note that this issue is focused on CUDA C++ features in libcudf that use JITIFY and nvrtc, rather than cuDF-python features using Numba to generate PTX from user-defined Python functions.
JIT transforms, JIT projection expressions
JIT transforms, or UDF (user defined function) transforms, can be used to fuse together multiple binary ops or function calls within a single kernel. This eliminates the materialization of intermediates and for complex expressions can lead to significant speedup. We've written a custom "polynomials" benchmark in #17695 that shows >10x speedup for JIT-compiled kernels versus binary ops and AST (abstract syntax tree) implementations.
- compare `imbalanced_tree` benchmarks for JIT vs binary ops vs AST
- collect data on NDS and NDS-H runtime impact of JIT-compiled expressions
- support operators with string input and fixed-width output
- support operators with string input and string output
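To make the fusion idea concrete, here is a minimal CPU-side C++ sketch (the function names and the example polynomial are hypothetical, not libcudf APIs): the unfused path materializes one buffer per binary op, while the fused path computes the whole expression in a single pass, which is the shape a JIT-compiled transform kernel takes per thread on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: each binary op materializes an intermediate column.
// Evaluates p(x) = 3*x*x + 2*x + 1 element-wise.
std::vector<double> eval_unfused(std::vector<double> const& x) {
  std::vector<double> x2(x.size()), t3(x.size()), t2(x.size()), out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) x2[i] = x[i] * x[i];  // intermediate 1
  for (std::size_t i = 0; i < x.size(); ++i) t3[i] = 3.0 * x2[i];  // intermediate 2
  for (std::size_t i = 0; i < x.size(); ++i) t2[i] = 2.0 * x[i];   // intermediate 3
  for (std::size_t i = 0; i < x.size(); ++i) out[i] = t3[i] + t2[i] + 1.0;
  return out;
}

// Fused: one pass, no intermediate buffers -- analogous to a single
// JIT-compiled kernel evaluating the full expression thread-per-row.
std::vector<double> eval_fused(std::vector<double> const& x) {
  std::vector<double> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    double v = x[i];
    out[i] = 3.0 * v * v + 2.0 * v + 1.0;
  }
  return out;
}
```

The fused form reads and writes each element once; on the GPU this saves the global-memory traffic and allocations of the three intermediate columns.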
JIT aggregation
JIT aggregations, or UDAFs (user-defined aggregation functions), can be used to perform complex transformations on the groups of a groupby aggregation. libcudf supports both CUDA and PTX aggregation kinds. Some examples of UDAFs could include "compute score" functions with additional flexibility for feature engineering. Here are some "compute score" examples from the archived TorchArrow project (ref: https://pytorch.org/torcharrow/beta/functional.html). To support some of these functions, the user might create a struct column that contains a list of ids, a list of targets, and a score per target.
| Function | Description |
| --- | --- |
| get_score_sum | Return the sum of all the scores in matching_id_scores whose corresponding id in matching_ids is also in input_ids. |
| get_score_min | Return the minimum of all the scores in matching_id_scores whose corresponding id in matching_ids is also in input_ids. |
| get_score_max | Return the maximum of all the scores in matching_id_scores whose corresponding id in matching_ids is also in input_ids. |
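As a reference for the intended semantics, here is a plain C++ sketch of get_score_sum (a hypothetical host-side reference, not the UDAF kernel itself): it sums each score whose matching id appears in input_ids.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Reference semantics for get_score_sum: matching_ids[i] pairs with
// matching_id_scores[i]; a score counts only if its id is in input_ids.
double get_score_sum(std::vector<int> const& input_ids,
                     std::vector<int> const& matching_ids,
                     std::vector<double> const& matching_id_scores) {
  std::unordered_set<int> wanted(input_ids.begin(), input_ids.end());
  double sum = 0.0;
  for (std::size_t i = 0; i < matching_ids.size(); ++i) {
    if (wanted.count(matching_ids[i]) != 0) sum += matching_id_scores[i];
  }
  return sum;
}
```

get_score_min and get_score_max follow the same pattern with the accumulator swapped for a running minimum or maximum.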
JIT join
Currently libcudf uses `mixed_join` to fuse a hash join with a post-filter. Mixed joins accept an AST predicate that is applied thread-per-row when the probe table's equality keys are found in the build table. Mixed joins have poor warp occupancy due to heavy register pressure, a result of combining hash join and AST expression evaluation in a single kernel.
One alternative would be to use code gen to check the post-equality predicate and JIT-compile the resulting kernel.
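A minimal sketch of that code-gen idea, with hypothetical names (`make_predicate_source`, `row_view` are illustrations, not libcudf APIs): instead of interpreting an AST per row, emit CUDA C++ source for the post-equality predicate and hand it to NVRTC/Jitify for compilation into the join kernel.

```cpp
#include <cassert>
#include <string>

// Hypothetical code generator: turn a post-equality filter such as
// "probe.col1 > build.col2" into device source text. A real
// implementation would walk the expression tree and then compile the
// result with NVRTC/Jitify; here we just produce the source string.
std::string make_predicate_source(std::string const& probe_col,
                                  std::string const& op,
                                  std::string const& build_col) {
  return "__device__ bool predicate(row_view probe, row_view build) {\n"
         "  return probe." + probe_col + " " + op + " build." + build_col + ";\n"
         "}\n";
}
```

Because the predicate is compiled to straight-line device code, the kernel avoids the interpreter state and register pressure of per-row AST evaluation.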
Improving JIT infrastructure
As part of expanding JIT functionality in libcudf, we will need better tools for tracking JIT-compilation time (NVIDIA/jitify#137). We will also need better tools for JIT cache management such as clearing and pre-populating. Collaboration with Spark-RAPIDS and other partners will be critical for success.
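The cache-management piece can be sketched as follows (a hypothetical interface, not the existing Jitify cache): a program cache keyed by a hash of the kernel source, with the clearing and pre-populating hooks described above. A string stands in for the compiled PTX/cubin payload.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical JIT program cache. Keys would be hashes of the kernel
// source plus compile options; values would be compiled PTX/cubin blobs.
class jit_cache {
 public:
  bool contains(std::size_t key) const { return entries_.count(key) != 0; }
  // insert() is also the pre-population hook: load blobs compiled offline.
  void insert(std::size_t key, std::string blob) { entries_[key] = std::move(blob); }
  void clear() { entries_.clear(); }  // drop all cached programs
  std::size_t size() const { return entries_.size(); }

 private:
  std::unordered_map<std::size_t, std::string> entries_;
};
```

Exposing clear/insert like this would let Spark-RAPIDS and other consumers warm the cache at startup and reset it between workloads.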