Q4_0_R4 on CUDA #127

Draft
wants to merge 3 commits into
base: main

Conversation

ikawrakow
Owner

With the massive improvements in prompt processing speed on the CPU achieved by interleaving 4 tensor rows (see #118, #119, #120, #121, #122, #123, #124), I was curious to see whether one can get a good implementation of the X_R4 quants on CUDA. This PR is a proof of concept (POC) that implements CUDA dequantization and matrix-vector multiplication for Q4_0_R4. It achieves the same token generation (TG) speed as Q4_0. It was disappointing not to get a speedup from row interleaving, but at least there is no performance regression. To make this a full PR, I should also implement quantized matrix-matrix multiplication for Q4_0_R4; here it is done by dequantizing to f16 and calling cuBLAS, so it is slower than Q4_0 MMQ.

2 participants