Q4_0_R4 on CUDA #127

ikawrakow · 2024-12-08T09:01:23Z

With the massive improvements in prompt processing speed on the CPU achieved via interleaving 4 tensor rows (see #118, #119, #120, #121, #122, #123, #124), I was curious to see whether one can get a good implementation for the X_R4 quants on CUDA. This PR is a proof of concept (POC) that implements CUDA dequantization and matrix x vector multiplication for Q4_0_R4. It achieves the same token generation (TG) speed as Q4_0. It is disappointing not to get a speedup from row interleaving, but at least there is no performance regression. To make this a full PR, I should also implement quantized matrix x matrix multiplication for Q4_0_R4 (here it is done via dequantization to f16 followed by cuBLAS, so it is slower than Q4_0 MMQ).
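For readers unfamiliar with row interleaving, here is a minimal CPU-side sketch of the idea in C++. It is not the code in this PR and not the actual Q4_0_R4 memory layout: `BlockQ4_0_R4`, `interleave4`, and the byte ordering `qs[4*j + r]` are hypothetical choices made for illustration, and the scale is stored as `float` rather than fp16 to keep the example short. Only the Q4_0 dequantization rule itself (value = d * (q - 8), low nibbles for the first 16 weights, high nibbles for the last 16) follows the standard format.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cmath>

constexpr int QK = 32;  // weights per Q4_0 block

// Plain Q4_0 block: one scale plus 32 4-bit quants packed into 16 bytes.
// Assumption: float scale instead of fp16, for brevity.
struct BlockQ4_0 {
    float d;
    uint8_t qs[QK / 2];
};

// Hypothetical interleaved block holding the same block index of 4 rows.
// Byte j of row r is stored at qs[4*j + r] (illustrative layout only).
struct BlockQ4_0_R4 {
    float d[4];
    uint8_t qs[4 * QK / 2];
};

// Pack four row blocks into one interleaved block.
inline BlockQ4_0_R4 interleave4(const BlockQ4_0 rows[4]) {
    BlockQ4_0_R4 out{};
    for (int r = 0; r < 4; ++r) {
        out.d[r] = rows[r].d;
        for (int j = 0; j < QK / 2; ++j) out.qs[4 * j + r] = rows[r].qs[j];
    }
    return out;
}

// Reference: dot product of one Q4_0 block with 32 activations.
inline float dot_q4_0(const BlockQ4_0& b, const float* x) {
    assert(x != nullptr);
    float sum = 0.0f;
    for (int j = 0; j < QK / 2; ++j) {
        sum += float((b.qs[j] & 0x0F) - 8) * x[j];           // low nibble
        sum += float((b.qs[j] >> 4) - 8) * x[j + QK / 2];    // high nibble
    }
    return b.d * sum;
}

// Interleaved version: each activation pair is loaded once and reused
// for all 4 rows, producing 4 output values per block.
inline std::array<float, 4> dot_q4_0_r4(const BlockQ4_0_R4& b, const float* x) {
    assert(x != nullptr);
    std::array<float, 4> sum{};
    for (int j = 0; j < QK / 2; ++j) {
        const float lo = x[j];
        const float hi = x[j + QK / 2];
        for (int r = 0; r < 4; ++r) {
            const uint8_t q = b.qs[4 * j + r];
            sum[r] += float((q & 0x0F) - 8) * lo + float((q >> 4) - 8) * hi;
        }
    }
    for (int r = 0; r < 4; ++r) sum[r] *= b.d[r];
    return sum;
}
```

In this sketch the interleaved dot product returns the same four values as calling `dot_q4_0` on each row separately; the only difference is the memory access pattern, which is the part that pays off on the CPU and, per the measurements above, does not change TG speed on CUDA.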


Kawrakow added 3 commits January 9, 2025 11:58

cuda q4_0_r4: dequantize works

85a0730

cuda q4_0_r4: dot product works

d9589d8

I get basically the same TG performance as Q4_0.

Adapt to iq4_nl_x4 -> iq4_nl_r4 change

8bc80e0

ikawrakow force-pushed the ik/cuda_q4_0_r4 branch from ca900d9 to 8bc80e0 Compare January 9, 2025 10:13
