This PR adds `IQ6_K` - see #8 (New quantization types IQ2_K, IQ3_K, IQ4_K, IQ5_K) for motivation. It also fixes the Zen4 implementation of `IQ3_K`, `IQ4_K` and `IQ5_K`.

### New IQ6_K
The graph below is a copy of the graph in #8 with the quantization error of the new `IQ6_K` non-linear quantization type added (cyan circle near 6.6 bpw). We observe a significant improvement compared to `Q6_K` (0.4% vs 0.65%). LLaMA-3.1-8B quantization error is better too (0.15% vs 0.26%), so I think this is a worthwhile addition.
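For context, "non-linear" here means the stored index selects a value from a non-uniform table rather than an equally spaced grid. A minimal sketch of the idea, with a made-up 4-bit table and block layout that do not reflect the actual `IQ6_K` format:

```c++
#include <cstdint>

// Linear quants reconstruct x = scale * q with equally spaced levels.
// Non-linear quants map the stored index through a non-uniform lookup
// table, spending more resolution where weight values cluster.
// This 16-entry table is illustrative only, not the IQ6_K codebook.
static const int8_t kValues[16] = {-127, -104, -83, -65, -49, -35, -22, -10,
                                      1,   13,   25,  38,  53,  69,  89, 113};

// Dequantize n 4-bit indices from a block that shares one float scale.
void dequantize_nl(const uint8_t *idx, float scale, int n, float *out) {
    for (int i = 0; i < n; ++i)
        out[i] = scale * kValues[idx[i] & 0xf];
}
```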
### Fixing the Zen4 implementation of IQ3_K, IQ4_K and IQ5_K
While working on `IQ6_K`, I noticed that there is a problem with the Zen4 implementation of the `IQ3,4,5_K` quants. I was using the standard k-quants matrix multiplication template (`mul_mat_qX_K_q8_K_AVX512`). On Zen4, this template uses the `_mm512_dpbusd_epi32` instruction to compute the dot product between the quants of the left matrix and the `Q8_K` quants of the right matrix, producing a SIMD vector of 32-bit integer results. For k-quants these 32-bit integers fall within `int16_t` range, so they are packed to 16 bits and then multiplied with the block scales. For the 3+ bit non-linear quants, however, the result of `_mm512_dpbusd_epi32` may fall outside the `int16_t` range, which leads to truncation and a wrong result. I have now corrected the implementation, at the cost of a small performance regression.
The table below shows a performance comparison for LLaMA-3.1-8B between the original and the corrected Zen4 implementation for `IQ3_K` on a Ryzen-7950X (using 16 threads for PP-512 and 4 threads for TG-128).