
Adding IQ6_K #14

Merged
merged 10 commits into from Aug 9, 2024
Conversation

ikawrakow
Owner

This PR:

New IQ6_K

The graph below is a copy of the graph in #8 with the quantization error of the new IQ6_K non-linear quantization type added (cyan circle near 6.6 bpw). We observe a significant improvement compared to Q6_K (0.4% vs 0.65%). LLaMA-3.1-8B quantization error is better too (0.15% vs 0.26%), so I think this is a worthwhile addition.

[Figure: quantization error vs. bits per weight for LLaMA-3.1-70B (graph from #8 with IQ6_K added)]

Fixing the Zen4 implementation of IQ3_K, IQ4_K and IQ5_K

While working on IQ6_K, I noticed a problem with the Zen4 implementation of the IQ3_K, IQ4_K and IQ5_K quants. I was using the standard k-quants matrix multiplication template (mul_mat_qX_K_q8_K_AVX512). On Zen4, this template uses the _mm512_dpbusd_epi32 instruction to compute the dot product between the quants of the left matrix and the Q8_K quants of the right matrix, producing a SIMD vector of 32-bit integer results. For k-quants these 32-bit integers stay within int16_t range, so they are packed to 16 bits and then multiplied with the block scales. For the 3+ bit non-linear quants, however, the _mm512_dpbusd_epi32 results can fall outside the int16_t range, which leads to truncation and a wrong result. I have now corrected the implementation, at the cost of a small performance regression. The table below compares the original and the corrected Zen4 implementation of IQ3_K for LLaMA-3.1-8B on a Ryzen-7950X (16 threads for PP-512, 4 threads for TG-128).

|            | t/s (PP-512)  | t/s (TG-128) |
|------------|---------------|--------------|
| Before fix | 180.77 ± 0.62 | 16.10 ± 0.16 |
| After fix  | 167.69 ± 0.69 | 15.84 ± 0.33 |
| Ratio      | 0.940         | 0.984        |
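
To make the overflow argument concrete, here is a minimal, self-contained sketch (not the actual ik_llama.cpp kernel) of the worst-case per-lane sums. That _mm512_dpbusd_epi32 accumulates 4 unsigned-byte × signed-byte products into each 32-bit lane is real; the value ranges fed into it (Q6_K quants in [0, 63] with the -32 shift handled separately, and non-linear lookup values offset into roughly [0, 250] so the unsigned operand can be used) are assumptions for illustration.

```cpp
// Minimal sketch, not the actual mul_mat_qX_K_q8_K_AVX512 template.
// Each 32-bit lane of _mm512_dpbusd_epi32 accumulates 4 products of
// (unsigned 8-bit quant) x (signed 8-bit Q8_K value). The question is whether
// such a lane sum can safely be packed into int16_t before the block-scale
// multiplication. The value ranges below are illustrative assumptions.
#include <cstdio>
#include <limits>

static int worst_case_lane_sum(int quant_max_unsigned) {
    const int products_per_lane = 4;   // pairs summed into one 32-bit lane by dpbusd
    const int q8_max = 127;            // maximum |Q8_K| activation value
    return products_per_lane * quant_max_unsigned * q8_max;
}

int main() {
    const int i16_max = std::numeric_limits<int16_t>::max();

    // Q6_K: 6-bit quants stored in [0, 63]; assuming the kernel feeds them to
    // dpbusd as-is and applies the -32 shift separately, a lane sum fits int16_t.
    const int kquant = worst_case_lane_sum(63);    // 4 * 63 * 127 = 32004

    // Non-linear iqX_k (assumption): lookup values offset into roughly [0, 250]
    // so the unsigned dpbusd operand can be used; a lane sum can then exceed
    // int16_t, and packing it to 16 bits corrupts the dot product.
    const int iqk = worst_case_lane_sum(250);      // 4 * 250 * 127 = 127000

    printf("k-quant worst-case lane sum: %6d  fits int16_t: %s\n",
           kquant, kquant <= i16_max ? "yes" : "no");
    printf("iqX_k   worst-case lane sum: %6d  fits int16_t: %s\n",
           iqk, iqk <= i16_max ? "yes" : "no");
    return 0;
}
```

This is the situation the fix addresses: as the commit messages below put it, for the 4-, 5- and 6-bit iqX_k quants the dpbusd results cannot be packed into 16 bits before the block-scale multiplication.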

Kawrakow added 10 commits August 7, 2024 15:24
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K
(0.14% vs 0.26% quantization error).
90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.
We need to do 4 shuffles to get the non-uniform values, so this
makes it slower than other iqX_k quants.

And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies
with 16-bit block scales. But this can overflow for iq4_k, iq5_k, and
iq6_k. I guess I did not notice with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot be packing
the _mm512_dpbusds_epi32 result into 16 bit for the 4-, 5- and 6-bit
iqX_k quants.

For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
We now arrive at PP-512 = 147 t/s for LLaMA-3.1-8B.
TG-128 is 9.5 t/s. This is better than last commit,
but still kind of slow compared to Q6_K.

My last commit message is wrong: also iq3_k needs a fix
for overflow.
Respectable performance, only slightly slower than Q6_K.
About 4% slower than Q6_K for PP-512, but 10% faster for TG-128.
Someone has screwed up Q6_K TG performance on Metal? With the
continuous "improvements" in ggml I wouldn't be surprised.
Need to look into it later.
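
As a rough companion to the commit messages above, here is a hedged scalar sketch of the unsafe and the overflow-safe accumulation patterns. The function names, block size and value ranges are illustrative assumptions, and the SIMD lanes are replaced by a plain loop, so this is not the actual Zen4 template.

```cpp
// Scalar stand-in for the two accumulation patterns discussed above; not the
// actual mul_mat_qX_K_q8_K_AVX512 template. Block size, value ranges and
// function names are illustrative assumptions.
#include <cstdint>
#include <cstdio>
#include <vector>

// Unsafe pattern: pack the 32-bit dot-product sum into int16_t before applying
// the block scale (fine for k-quants, wraps around for wide non-linear values).
int32_t dot_packed_i16(const std::vector<int8_t>& quants,
                       const std::vector<int8_t>& q8, int16_t scale) {
    int32_t sum = 0;
    for (size_t i = 0; i < quants.size(); ++i) sum += int32_t(quants[i]) * q8[i];
    const int16_t packed = int16_t(sum);   // wraps if sum is outside int16_t range
    return int32_t(packed) * scale;
}

// Overflow-safe pattern: keep the sum in 32 bits and apply the scale there.
int32_t dot_i32(const std::vector<int8_t>& quants,
                const std::vector<int8_t>& q8, int16_t scale) {
    int32_t sum = 0;
    for (size_t i = 0; i < quants.size(); ++i) sum += int32_t(quants[i]) * q8[i];
    return sum * scale;
}

int main() {
    // 32 quants per sub-block (assumption), worst-case magnitudes for a 6-bit
    // non-linear quant whose lookup values can approach +/-127.
    const std::vector<int8_t> quants(32, 127), q8(32, 127);
    const int16_t scale = 3;

    printf("packed to 16 bits: %d\n", (int)dot_packed_i16(quants, q8, scale));
    printf("kept in 32 bits:   %d\n", (int)dot_i32(quants, q8, scale));
    return 0;
}
```

Running the sketch shows the packed variant returning a wrapped-around value while the 32-bit variant returns the correct scaled dot product, which is consistent with the too-large PPL difference described in the commits.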
ikawrakow merged commit f0d7a0d into main on Aug 9, 2024