# New quantization types IQ2_K, IQ3_K, IQ4_K, IQ5_K
## Why?
I can hear what some are thinking: "Are you crazy? Even more quantization types? Doesn't `llama.cpp` already have enough?" That was what I was thinking too. Until LLaMA-3 came along, that is.
Quantization errors for LLaMA-3 models are much higher than they have been for all previous models I have experimented with. This is best illustrated with the graph below. LLaMA-3.1 is all the rage these days, but I don't have the ability to run LLaMA-3.1-405B, so I have settled for LLaMA-3.1-70B to generate the graph. We will measure the quantization error `QError` of a quantization `Q` using perplexity `PPL` as

$$\mathrm{QError} = \frac{\mathrm{PPL}(Q)}{\mathrm{PPL}(\mathrm{fp16})} - 1$$

As we are not evaluating model performance in language tasks, but are only interested in the performance of a quantized model compared to the same full precision model, there is no benefit from looking at commonly used language modeling / reasoning benchmarks, which a) are typically less sensitive to quantization errors than `PPL` and b) take much longer to evaluate.
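As a quick sanity check with made-up numbers: if the `fp16` model has `PPL = 4.00` and a quantized model has `PPL = 4.08`, then

$$\mathrm{QError} = \frac{4.08}{4.00} - 1 = 0.02,$$

i.e., a 2% quantization error.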
One could also use KL divergence, but KL divergence and `PPL` are closely related, and `PPL` is more convenient to calculate with `llama.cpp`, so `PPL` it is.

Blue symbols represent legacy quants (`Q4_0, Q4_1, Q5_0, Q5_1`), red symbols show results for k-quants, and i-quants are depicted in black. To show how much higher the quantization error of LLaMA-3.1-70B is, I have included results for LLaMA-v2-70B in brown (just for k-quants, as I have somehow lost the i-quants runs and did not feel like re-running the quite lengthy calculations). We see that there is basically about a 1 bit-per-weight (bpw) gap between LLaMA-v2-70B and LLaMA-3.1-70B. I.e., it looks like the additional tokens used for training LLaMA-3 have paid off, the model has "learned" more from the data, and the model parameters in LLaMA-3.1 contain about 1 bpw of extra information. This then results in a higher quantization error for a given bpw quantization budget.

We can now discuss the new quants shown with cyan circles. Please note that the y-axis is logarithmic, so the differences between the data points are quite large even if they look fairly close to each other. For instance, the blue point around 5.5 bpw (`Q5_0`), which looks quite close to the red point (`Q5_K_S`), has a quantization error of 2.9% vs 1.9%. The cyan point around 5.5 bpw is `IQ5_K`, with a quantization error of 1.4%, i.e., `IQ5_K` has a quantization error that is 2.1X lower compared to `Q5_0`, and 40% lower compared to `Q5_K_S`. The cyan point around 4.5 bpw (`IQ4_K`) has a 2.7X lower quantization error compared to `Q4_0`, and 40% lower compared to `Q4_K_S`. So, even though `IQ4_K` and `IQ5_K` don't come anywhere close to what we used to have for 4- and 5-bit quantization in the pre-LLaMA-3.1 days, they do give a nice improvement compared to the SOTA in the 4+ bpw range.

"But what about the cyan points around 3.5 and 2.4 bpw? They are basically the same as i-quants!" - I hear you asking. These two exist for two reasons:
### Curiosity
i-quants are much better than k-quants in the sub-4-bpw range. The sub-4-bpw i-quants all use "codebooks" that encode groups of 8 or 4 model weights on the E8 or D4 lattice. The "codebook" idea comes originally from QuIP# and is also being used in, e.g., AQLM. I have been curious for some time to what extent the use of a "codebook" contributes to the better quantization quality of i-quants compared to k-quants. The "codebook" certainly acts as a kind of regularization to avoid/reduce overfitting: only a subset of all possible lattice points is available in the "codebook" to represent a group of model weights, and hence the quantization algorithm cannot focus too much on individual quants, possibly missing more important model weights in the process. But is there more to it than it just being a regularization technique? I was curious and, as we can see in the above graph, it is indeed possible to match the quantization accuracy of i-quants with a non-linear quantization technique.
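For concreteness, here is a minimal scalar sketch of what such a "codebook" decode amounts to. The codebook size (512 entries) and its layout are hypothetical, not the actual E8/D4 grids used by the i-quants; the point is that every group of 8 weights is reduced to an index into a fixed table of pre-defined points.

```cpp
#include <cstdint>

// Hypothetical codebook: 512 entries, each packing a group of 8 INT8 weights
// into one uint64_t. The real i-quant grids (E8/D4 based) are different;
// this is only meant to illustrate the mechanism.
static const uint64_t kCodebook[512] = { /* fixed at quantization-design time */ };

// Decode one group of 8 weights from its 9-bit codebook index.
// Only these 512 pre-defined points are available to represent a group,
// which is the regularization effect described above.
static inline void decode_group(uint16_t index, int8_t out[8]) {
    const uint64_t packed = kCodebook[index & 511];
    for (int i = 0; i < 8; ++i) {
        out[i] = (int8_t)((packed >> (8 * i)) & 0xff);
    }
}
```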
### Performance
The use of a "codebook" requires a lookup in a fairly large table to convert the "codebook" index (which is stored in the quantized model) to actual quantized model weights when performing matrix multiplications. The lookup is handled quite OK by modern GPU's, but leads to a massive performance penalty on CPU's (and, from what I gather from
llama.cpp
user comments, also on older GPU's). The newIQK
quants use a non-linear mapping between the quantized value stored in the model data (0...15
for 4-bit quantization,0...7
for 3-bit, etc.) and the actual model weight, which also needs a lookup table. But these lookup tables are much smaller (4, 8, 16, 32INT8
values for 2-, 3-, 4-, 5-bit quantization), so they fit into 1 or 2 SIMD registers, and thus can be handled very efficiently with SIMD instructions (_mm256_shuffle_epi8
onAVX2
,vqtbl1q_s8
onARM_NEON
), resulting in a performance that is (nearly) the same as corresponding linear mapping between quants and model weights.Let's look how this translates into observed inference performance. We compare
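To illustrate why the small tables are so cheap, here is a minimal `AVX2` sketch (not the actual kernel in this repository) that maps 32 packed 4-bit quants through a 16-entry `INT8` table with a single `_mm256_shuffle_epi8`. The table values and the nibble layout are made up for illustration:

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical non-linear 4-bit -> INT8 mapping (16 non-uniformly spaced values).
// The actual IQ4_K table is different; this is just for illustration.
static const int8_t kValues[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113
};

// Map 32 packed 4-bit indices (16 bytes; low nibbles = first 16 quants,
// high nibbles = last 16 quants in this illustrative layout) to 32 INT8 weights.
static inline __m256i dequant32_nonlinear(const uint8_t * packed) {
    // The whole lookup table fits into a single 128-bit register; broadcast it
    // to both 128-bit lanes because _mm256_shuffle_epi8 shuffles per lane.
    const __m256i values = _mm256_broadcastsi128_si256(
                               _mm_loadu_si128((const __m128i *)kValues));

    // Split the 16 bytes into 32 4-bit indices.
    const __m128i bytes = _mm_loadu_si128((const __m128i *)packed);
    const __m128i ml    = _mm_set1_epi8(0x0f);
    const __m128i lo    = _mm_and_si128(bytes, ml);
    const __m128i hi    = _mm_and_si128(_mm_srli_epi16(bytes, 4), ml);

    // One byte shuffle maps all 32 indices to their non-linear INT8 values.
    return _mm256_shuffle_epi8(values, _mm256_set_m128i(hi, lo));
}
```

On `ARM_NEON` the analogous lookup for 16 indices is a single `vqtbl1q_s8`.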
Let's look at how this translates into observed inference performance. We compare `IQ2_K` to the matching `IQ2_XS`, and `IQ3_K` to the matching `IQ3_S` quants (matching in the sense that they use basically the same bpw and have very similar quantization accuracy). The following table shows performance in tokens per second (t/s) for prompt processing (`pp512`, so a prompt of 512 tokens) and token generation (`tg128`, so generating 128 tokens one-by-one) between matching quants on `AVX2` (Ryzen-7950X) and `ARM_NEON` (M2-Max CPU). I have also added mainline `llama.cpp` results. The two values in the `Speedup` column are the `t/s` ratios between the new `IQK` quants and the corresponding i-quant in `llama.cpp` and in this repository. For instance, if we look at `IQ3_S` on the Ryzen-7950X, we see that `IQ3_K` will perform prompt processing 6.45 times faster than `llama.cpp`, and token generation will be 2.37X faster!

## What are non-linear quants anyway?
Will add later.
## IQ6_K?
Before LLaMA-3, `Q6_K` quantization always had a quantization error in the 0.1-0.15% range, i.e., it was basically as good as the full precision model. But for LLaMA-3.1-70B the `Q6_K` quantization error is 0.65%! `Q8_0` does match the full precision model, but it uses 2 extra bpw. I have experimented with 6-bit non-linear quantization in the past, but the `Q6_K` quantization error was so low that it was basically not possible to see a benefit from the non-linearity. Given the much higher `Q6_K` quantization error for LLaMA-3 models, it may be worthwhile to resurrect 6-bit non-linear quantization.

**Update** See PR #14