[Feature]: Improve the compile times of gptq_marlin.cu #7317
Comments
It might be nice to do something like the following: define all the combinations in the header file and use them in both the dispatching section and the manual instantiation files. I don't know how fine-grained the instantiations need to be, so maybe a range is overkill; for example, maybe processing one tuple at a time is enough. Boost has a lot of nice preprocessor utilities to do things like this. Bringing in Boost is a bit overkill just for this, but it might be easy enough to extract the bits that are needed.
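For illustration, here is a minimal, self-contained X-macro sketch of that idea; all names here (`MARLIN_FOR_EACH_CONFIG`, `marlin_mm`, the config values) are hypothetical stand-ins, not the actual vLLM symbols:

```cpp
#include <cstdio>

// Stand-in for the real kernel launcher.
template <int WeightBits, bool HasZeroPoint>
void marlin_mm() {
  std::printf("marlin_mm<%d, %d>\n", WeightBits, HasZeroPoint ? 1 : 0);
}

// The single source of truth: one row per (weight-bits, zero-point) combo.
#define MARLIN_FOR_EACH_CONFIG(X) \
  X(4, false)                     \
  X(4, true)                      \
  X(8, false)

// Explicit instantiations (these would live in the split-out .cu files).
#define MARLIN_INSTANTIATE(BITS, HAS_ZP) template void marlin_mm<BITS, HAS_ZP>();
MARLIN_FOR_EACH_CONFIG(MARLIN_INSTANTIATE)
#undef MARLIN_INSTANTIATE

// Runtime dispatch generated from the same list (would live in gptq_marlin.cu).
void dispatch(int bits, bool has_zp) {
#define MARLIN_DISPATCH(BITS, HAS_ZP) \
  if (bits == BITS && has_zp == HAS_ZP) return marlin_mm<BITS, HAS_ZP>();
  MARLIN_FOR_EACH_CONFIG(MARLIN_DISPATCH)
#undef MARLIN_DISPATCH
  std::printf("unsupported config\n");
}

int main() { dispatch(4, true); }
```

Because the instantiation list and the dispatch table are generated from the same macro, adding or removing a config cannot leave the two out of sync.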
I think it should be sufficient to just break down by a combination of weight-type and zero-point support; for example, that would mean splitting the current monolithic set of instantiations along those two axes (a rough sketch follows).
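A minimal, self-contained sketch of dispatching on just those two axes; the enum values echo vLLM's scalar-type names, but everything here (function names, signatures) is a hypothetical stand-in:

```cpp
#include <stdexcept>

// Hypothetical stand-ins for the real weight types (vllm::kU4B8 etc.).
enum class WeightType { kU4B8, kU4, kU8B128 };

// Stand-in for the templated kernel launcher; under the proposed split,
// each specialization would be instantiated in its own .cu file.
template <WeightType W, bool HasZp>
void run_marlin() { /* kernel launch elided */ }

// Dispatch only on (weight type, zero-point support).
void gptq_marlin_gemm(WeightType w, bool has_zp) {
  if (w == WeightType::kU4B8 && !has_zp)
    return run_marlin<WeightType::kU4B8, false>();
  if (w == WeightType::kU4 && has_zp)
    return run_marlin<WeightType::kU4, true>();
  if (w == WeightType::kU8B128 && !has_zp)
    return run_marlin<WeightType::kU8B128, false>();
  throw std::invalid_argument("unsupported (weight type, zero point) pair");
}
```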
(Note: the rest of this comment assumes #7323 is merged.) Just like @tlrmchlsmth mentioned, I would think that …

Then the …

I suspect some macro magic like @bnellnm suggested would be appropriate here. Then …
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

@LucasWilkinson is there still stuff we can do here?

I think this could still be an improvement (especially now with HQQ being part of it), albeit lower priority given the addition of Python-only development paths and better arch-code handling over the last couple of months. I think part of this could also be unifying QQQ Marlin and Fp8 Marlin into whatever the new structure ends up being.
🚀 The feature, motivation and pitch
The compile times for the GPTQ Marlin kernels are quite long, and have become painful for developers.
`gptq_marlin.cu` is monolithic and heavily templated, and many cases of the Marlin kernel are instantiated. Compile times got particularly bad in #6612, where the number of cases being compiled approximately doubled. Namely, 320 kernels are defined in a single block of code in `gptq_marlin.cu`.

I think the best option for improving this right now is to split `gptq_marlin.cu` into multiple files so that compilation can be parallelized.

Details:
First, the kernel function and its dependencies should be moved into a file called something like `gptq_marlin_kernel.cuh`.

Next, we need to spread the instantiations of the template function across a sensible number of `.cu` files. Too many will likely be counter-productive, so some experimentation will be needed. Each new `.cu` file should include `gptq_marlin_kernel.cuh`, but to create a firewall, I think it's best if `gptq_marlin.cu` does not include `gptq_marlin_kernel.cuh`. Instead we can add a new file called `gptq_marlin.cuh` that just contains the declarations of the template specializations that have been defined.

Summary of the proposed file structure
I think `gptq_marlin.cu` should be broken down into something like the following (a rough sketch of how the pieces fit together follows this list):

- `gptq_marlin_kernel.cuh` -- the bulk of `gptq_marlin.cu` should go in here.
- `gptq_marlin_a.cu` through `gptq_marlin_d.cu` -- these should include `gptq_marlin_kernel.cuh` and each define some number of Marlin configs. Name them better than I did here.
- `gptq_marlin.cu` -- drastically smaller than it is currently. Should contain the dispatch logic and include `gptq_marlin.cuh`.
- `gptq_marlin.cuh` -- should only declare the Marlin configs that have been defined in the new files, i.e. `gptq_marlin_a.cu` through `gptq_marlin_d.cu`.
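To make the firewall concrete, here is a minimal sketch of how the files could relate; the `marlin_mm` signature and the specific template arguments are hypothetical placeholders, not the real kernel interface:

```cpp
// ---- gptq_marlin.cuh (sketch): declarations only, no kernel bodies ----
#pragma once

// Hypothetical launcher signature.
template <int WeightBits, bool HasZeroPoint>
void marlin_mm(const void* a, const void* b, void* c);

// Promise the compiler that these specializations are instantiated in some
// other translation unit, so including this header never rebuilds them.
extern template void marlin_mm<4, false>(const void*, const void*, void*);
extern template void marlin_mm<4, true>(const void*, const void*, void*);
extern template void marlin_mm<8, false>(const void*, const void*, void*);

// ---- gptq_marlin_a.cu (sketch): one slice of the instantiations ----
// #include "gptq_marlin_kernel.cuh"  // the full template definition
// template void marlin_mm<4, false>(const void*, const void*, void*);
// template void marlin_mm<4, true>(const void*, const void*, void*);

// ---- gptq_marlin.cu (sketch): dispatch only ----
// #include "gptq_marlin.cuh"  // declarations only, so this file stays cheap
```

With `extern template` declarations in `gptq_marlin.cuh`, only the `gptq_marlin_*.cu` files ever see the heavy kernel header, so the expensive instantiation work is spread across translation units that the build system can compile in parallel.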
Alternatives
If there are any template parameters that can be made dynamic without losing any performance, that is an even better option.
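As an illustration of this alternative (the parameter and kernel names are hypothetical, and whether the change is performance-neutral would need benchmarking):

```cpp
// Before: one kernel instantiation per supported group size, multiplying
// the total number of kernels that must be compiled.
template <int kGroupSize>
__global__ void marlin_kernel_static(/* tensor args elided */) {
  // kGroupSize is a compile-time constant here, so loops over it can be
  // fully unrolled by the compiler.
}

// After: group_size is a runtime argument, so this axis of the template
// parameter space disappears along with all of its instantiations.
__global__ void marlin_kernel_dynamic(int group_size /* , tensor args */) {
  // group_size now lives in a register; loop bounds that depended on it
  // can no longer be fully unrolled, which is the potential perf cost.
}
```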
Additional context
No response