[WIP] AWQ Faster Kernels #3289
Conversation
@WoosukKwon I have used the same shapes as referenced in the original implementation, yet it does not load in vLLM, for reasons I am unsure how to fix. If I add interleaving to the packed shards, nothing happens, as the interleaving and packed factor cancel each other out. See the WQLinear_GEMVFast class in AutoAWQ for reference. How should we proceed to implement weight loading for this new format?
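For readers following along, here is a rough sketch of the shape arithmetic behind that cancellation. The interleave and pack factors are assumptions chosen to illustrate a 4-bit GEMVFast-style layout, not values copied from AutoAWQ:

```python
# Rough sketch of a GEMVFast-style packed shape. Assumed constants: 4-bit
# weights, eight values per 32-bit word, interleave factor of 4; see
# WQLinear_GEMVFast in AutoAWQ for the authoritative layout.
INTERLEAVE = 4
PACK_FACTOR = 32 // 4  # eight 4-bit values per 32-bit word

def gemvfast_qweight_shape(out_features: int, in_features: int) -> tuple[int, int]:
    # The packed input dim shrinks by PACK_FACTOR but grows back by
    # INTERLEAVE, so applying interleaving on top of an already-packed
    # shard largely offsets the pack factor, as described above.
    return (out_features // INTERLEAVE,
            in_features // PACK_FACTOR * INTERLEAVE)

print(gemvfast_qweight_shape(4096, 4096))  # (1024, 2048)
```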
Hello, is there any progress?
@shiqingzhangCSU Currently there is no progress. If you have suggestions or fixes, please open a PR to my fork. I am hoping to have this feature in vLLM soon, but the weight loading is a blocker.
@casper-hansen Hi, I'm hitting the same issue. To unblock, would you mind sharing which previous version of AutoAWQ works with vLLM?
```python
qweight, {
    "input_dim": 1,
    "output_dim": 0,
    "packed_dim": 1,
```
Change this to `"packed_dim": 0`; the weight then loads.
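To make the suggestion concrete, here is a minimal sketch of why the declared `packed_dim` matters during sharded weight loading (simplified, with an assumed pack factor; this is not vLLM's actual loader):

```python
import torch

PACK_FACTOR = 8  # assumed: eight 4-bit values per int32

def slice_shard(qweight: torch.Tensor, start: int, size: int,
                shard_dim: int, packed_dim: int) -> torch.Tensor:
    """Slice a tensor-parallel shard, with offsets given in unpacked units."""
    if shard_dim == packed_dim:
        # The packed axis stores PACK_FACTOR logical values per element,
        # so offsets must be scaled down. Declaring the wrong packed_dim
        # skips (or misapplies) this scaling and slices the wrong region.
        start //= PACK_FACTOR
        size //= PACK_FACTOR
    return qweight.narrow(shard_dim, start, size)
```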
I have identified the source of the issue: there is faulty logic in the weight loading. Working on a fix that avoids breaking GPTQ.
@robertgshaw2-neuralmagic any luck with this patch? I benchmarked and those kernels are really something. Great boost on my internal tests!
@bratao I believe Rob has a branch over in the neuralmagic fork. We discussed how to solve the issues, and it seems there is a path forward for loading weights correctly. The forward pass also needs a modification from its current state in the referenced branch, similar to the PR I recently created in AutoAWQ. https://github.com/neuralmagic/nm-vllm/tree/awq_faster_kernel
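A rough sketch of the kind of forward-pass change being discussed: GEMV-style kernels typically only win at small batch sizes, so the layer dispatches on token count. The kernel entry points below are passed in as placeholders, since the real bindings live in the AWQ CUDA extension, and the threshold of 8 tokens is an assumption for illustration:

```python
import torch

def awq_forward(x: torch.Tensor, layer, gemv_kernel, dequantize):
    # gemv_kernel / dequantize stand in for the real CUDA bindings.
    num_tokens = x.numel() // x.shape[-1]
    if num_tokens <= 8:
        # Small-batch path: fused dequantize + GEMV kernel.
        return gemv_kernel(x, layer.qweight, layer.scales, layer.qzeros)
    # Larger batches: dequantize once and fall back to a dense GEMM.
    weight = dequantize(layer.qweight, layer.scales, layer.qzeros)
    return torch.matmul(x, weight)
```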
Fix gemv_fast model loading
I merged @chu-tianxiang's PR and made some more modifications to catch up to the main branch. I will abandon this PR for now and leave it as a draft for someone else to finish. Here is the list of issues I was facing:
This should be safe to close, since the optimized Marlin kernel has supported AWQ models for several months now (#6612).
New AWQ kernels have been introduced by the AWQ authors:
Testing Model:
casperhansen/mistral-instruct-v0.2-gemvfast-awq
This PR is currently implemented as a draft:
Benchmark (1x A100)
Planning some benchmarks:
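For anyone who wants to reproduce a quick number, here is a minimal throughput check using vLLM's Python API with the testing model above. The timing harness is ours, not vLLM's benchmark suite, and the prompt count and output length are arbitrary choices:

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="casperhansen/mistral-instruct-v0.2-gemvfast-awq",
          quantization="awq")
params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["Explain AWQ quantization in one paragraph."] * 32
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s")
```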