
Adding IQ6_K #14

Merged
merged 10 commits into from Aug 9, 2024
Conversation

ikawrakow
Owner

This PR:

New IQ6_K

The graph below is a copy of the graph in #8 with the quantization error of the new IQ6_K non-linear quantization type added (cyan circle near 6.6 bpw). We observe a significant improvement compared to Q6_K (0.4% vs 0.65%). LLaMA-3.1-8B quantization error is better too (0.15% vs 0.26%), so I think this is a worthwhile addition.

[Figure: quantization error vs. bits per weight for LLaMA-3.1-70B (graph from #8 with IQ6_K added)]

Fixing the Zen4 implementation of IQ3_K, IQ4_K and IQ5_K

While working on IQ6_K, I noticed a problem with the Zen4 implementation of the IQ3_K, IQ4_K and IQ5_K quants. I was using the standard k-quants matrix multiplication template (mul_mat_qX_K_q8_K_AVX512). On Zen4, this template uses the _mm512_dpbusd_epi32 instruction to compute the dot product between the quants of the left matrix and the Q8_K quants of the right matrix, producing a SIMD vector of 32-bit integer results. For k-quants these 32-bit integers stay within int16_t range, so they are packed to 16 bits and then multiplied with the block scales. For the 3+ bit non-linear quants, however, the _mm512_dpbusd_epi32 results can fall outside the int16_t range, which leads to truncation and a wrong result. I have now corrected the implementation, at the cost of a small performance regression. The table below compares the original and the corrected Zen4 implementation of IQ3_K for LLaMA-3.1-8B on a Ryzen-7950X (16 threads for PP-512, 4 threads for TG-128).

|            | t/s (PP-512)  | t/s (TG-128) |
|------------|---------------|--------------|
| Before fix | 180.77 ± 0.62 | 16.10 ± 0.16 |
| After fix  | 167.69 ± 0.69 | 15.84 ± 0.33 |
| Ratio      | 0.940         | 0.984        |
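
To make the overflow argument concrete, here is a minimal, self-contained sketch (not the actual ik_llama.cpp kernel) of the worst-case per-lane sums. That _mm512_dpbusd_epi32 accumulates 4 unsigned-byte × signed-byte products into each 32-bit lane is real; the value ranges fed into it (Q6_K quants in [0, 63] with the -32 shift handled separately, and non-linear lookup values offset into roughly [0, 250] so the unsigned operand can be used) are assumptions for illustration.

```cpp
// Minimal sketch, not the actual mul_mat_qX_K_q8_K_AVX512 template.
// Each 32-bit lane of _mm512_dpbusd_epi32 accumulates 4 products of
// (unsigned 8-bit quant) x (signed 8-bit Q8_K value). The question is whether
// such a lane sum can safely be packed into int16_t before the block-scale
// multiplication. The value ranges below are illustrative assumptions.
#include <cstdio>
#include <limits>

static int worst_case_lane_sum(int quant_max_unsigned) {
    const int products_per_lane = 4;   // pairs summed into one 32-bit lane by dpbusd
    const int q8_max = 127;            // maximum |Q8_K| activation value
    return products_per_lane * quant_max_unsigned * q8_max;
}

int main() {
    const int i16_max = std::numeric_limits<int16_t>::max();

    // Q6_K: 6-bit quants stored in [0, 63]; assuming the kernel feeds them to
    // dpbusd as-is and applies the -32 shift separately, a lane sum fits int16_t.
    const int kquant = worst_case_lane_sum(63);    // 4 * 63 * 127 = 32004

    // Non-linear iqX_k (assumption): lookup values offset into roughly [0, 250]
    // so the unsigned dpbusd operand can be used; a lane sum can then exceed
    // int16_t, and packing it to 16 bits corrupts the dot product.
    const int iqk = worst_case_lane_sum(250);      // 4 * 250 * 127 = 127000

    printf("k-quant worst-case lane sum: %6d  fits int16_t: %s\n",
           kquant, kquant <= i16_max ? "yes" : "no");
    printf("iqX_k   worst-case lane sum: %6d  fits int16_t: %s\n",
           iqk, iqk <= i16_max ? "yes" : "no");
    return 0;
}
```

This is the situation the fix addresses: as the commit messages below put it, for the 4-, 5- and 6-bit iqX_k quants the dpbusd results cannot be packed into 16 bits before the block-scale multiplication.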

Kawrakow added 10 commits August 7, 2024 15:24
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K
(0.14% vs 0.26% quantization error).
90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.
We need to do 4 shuffles to get the non-uniform values, so this
makes it slower than other iqX_k quants.

And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies
with 16-bit block scales. But this can overflow for iq4_k, iq5_k, and
iq6_k. I guess I did not notice with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot be packing
the _mm512_dpbusds_epi32 result into 16 bit for the 4-, 5- and 6-bit
iqX_k quants.

For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
We now arrive at PP-512 = 147 t/s for LLaMA-3.1-8B.
TG-128 is 9.5 t/s. This is better than last commit,
but still kind of slow compared to Q6_K.

My last commit message is wrong: also iq3_k needs a fix
for overflow.
Respectable performance, only slightly slower than Q6_K.
About 4% slower than Q6_K for PP-512, but 10% faster for TG-128.
Someone has screwed up Q6_K TG performance on Metal? With the
continuous "improvements" in ggml I wouldn't be surprised.
Need to look into it later.
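
As a rough companion to the commit messages above, here is a hedged scalar sketch of the unsafe and the overflow-safe accumulation patterns. The function names, block size and value ranges are illustrative assumptions, and the SIMD lanes are replaced by a plain loop, so this is not the actual Zen4 template.

```cpp
// Scalar stand-in for the two accumulation patterns discussed above; not the
// actual mul_mat_qX_K_q8_K_AVX512 template. Block size, value ranges and
// function names are illustrative assumptions.
#include <cstdint>
#include <cstdio>
#include <vector>

// Unsafe pattern: pack the 32-bit dot-product sum into int16_t before applying
// the block scale (fine for k-quants, wraps around for wide non-linear values).
int32_t dot_packed_i16(const std::vector<int8_t>& quants,
                       const std::vector<int8_t>& q8, int16_t scale) {
    int32_t sum = 0;
    for (size_t i = 0; i < quants.size(); ++i) sum += int32_t(quants[i]) * q8[i];
    const int16_t packed = int16_t(sum);   // wraps if sum is outside int16_t range
    return int32_t(packed) * scale;
}

// Overflow-safe pattern: keep the sum in 32 bits and apply the scale there.
int32_t dot_i32(const std::vector<int8_t>& quants,
                const std::vector<int8_t>& q8, int16_t scale) {
    int32_t sum = 0;
    for (size_t i = 0; i < quants.size(); ++i) sum += int32_t(quants[i]) * q8[i];
    return sum * scale;
}

int main() {
    // 32 quants per sub-block (assumption), worst-case magnitudes for a 6-bit
    // non-linear quant whose lookup values can approach +/-127.
    const std::vector<int8_t> quants(32, 127), q8(32, 127);
    const int16_t scale = 3;

    printf("packed to 16 bits: %d\n", (int)dot_packed_i16(quants, q8, scale));
    printf("kept in 32 bits:   %d\n", (int)dot_i32(quants, q8, scale));
    return 0;
}
```

Running the sketch shows the packed variant returning a wrapped-around value while the 32-bit variant returns the correct scaled dot product, which is consistent with the too-large PPL difference described in the commits.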
ikawrakow merged commit f0d7a0d into main on Aug 9, 2024