# New quantization types IQ2_K, IQ3_K, IQ4_K, IQ5_K
## Why?
I can hear what some are thinking: "Are you crazy? Even more quantization types? Doesn't `llama.cpp` already have enough?" That was what I was thinking too. Until LLaMA-3 came along, that is.
Quantization errors for LLaMA-3 models are much higher than they have been for all previous models I have experimented with. This is best illustrated with the graph below. LLaMA-3.1 is all the rage these days, but I don't have the ability to run LLaMA-3.1-405B, so I have settled for LLaMA-3.1-70B to generate the graph. We will measure the quantization error `QError` of a quantization `Q` using perplexity `PPL` as

$$\mathrm{QError} = \frac{\mathrm{PPL}(Q)}{\mathrm{PPL}(\mathrm{fp16})} - 1$$

As we are not evaluating model performance in language tasks, but are only interested in the performance of a quantized model compared to the same full precision model, there is no benefit from looking at commonly used language modeling / reasoning benchmarks, which a) are typically less sensitive to quantization errors than `PPL` and b) take much longer to evaluate.
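As a quick sanity check with made-up numbers: if the `fp16` model has `PPL = 4.00` and a quantized model has `PPL = 4.08`, then

$$\mathrm{QError} = \frac{4.08}{4.00} - 1 = 0.02,$$

i.e., a 2% quantization error.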
One could also use KL divergence, but KL divergence and `PPL` are closely related, and `PPL` is more convenient to calculate with `llama.cpp`, so `PPL` it is.

Blue symbols represent legacy quants (`Q4_0, Q4_1, Q5_0, Q5_1`), red symbols show results for k-quants, and i-quants are depicted in black. To show how much higher the quantization error of LLaMA-3.1-70B is, I have included results for LLaMA-v2-70B in brown (just for k-quants, as I have somehow lost the i-quants runs and did not feel like re-running the quite lengthy calculations). We see that there is basically about a 1 bit-per-weight (bpw) gap between LLaMA-v2-70B and LLaMA-3.1-70B. I.e., it looks like the additional tokens used for training LLaMA-3 have paid off, the model has "learned" more from the data, and the model parameters in LLaMA-3.1 contain about 1 bpw of extra information. This then results in a higher quantization error for a given bpw quantization budget.

We can now discuss the new quants shown with cyan circles. Please note that the y-axis is logarithmic, so the differences between the data points are quite large even if they look fairly close to each other. For instance, the blue point around 5.5 bpw (`Q5_0`), which looks quite close to the red point (`Q5_K_S`), has a quantization error of 2.9% vs 1.9%. The cyan point around 5.5 bpw is `IQ5_K`, with a quantization error of 1.4%, i.e., `IQ5_K` has a quantization error that is 2.1X lower compared to `Q5_0`, and 40% lower compared to `Q5_K_S`. The cyan point around 4.5 bpw (`IQ4_K`) has a 2.7X lower quantization error compared to `Q4_0`, and 40% lower compared to `Q4_K_S`. So, even though `IQ4_K` and `IQ5_K` don't come anywhere close to what we used to have for 4- and 5-bit quantization in the pre-LLaMA-3.1 days, they do give a nice improvement compared to the SOTA in the 4+ bpw range.

"But what about the cyan points around 3.5 and 2.4 bpw? They are basically the same as i-quants!" - I hear you asking. These two exist for two reasons:
### Curiosity
i-quants are much better than k-quants in the sub-4-bpw range. The sub-4-bpw i-quants all use "codebooks" that encode groups of 8 or 4 model weights on the E8 or D4 lattice. The "codebook" idea comes originally from QuIP# and is also being used in, e.g., AQLM. I have been curious for some time to what extent the use of a "codebook" contributes to the better quantization quality of i-quants compared to k-quants. The "codebook" certainly acts as a kind of regularization to avoid/reduce overfitting: only a subset of all possible lattice points is available in the "codebook" to represent a group of model weights, and hence the quantization algorithm cannot focus too much on individual quants, possibly missing more important model weights in the process. But is there more to it than it just being a regularization technique? I was curious and, as we can see in the above graph, it is indeed possible to match the quantization accuracy of i-quants with a non-linear quantization technique.
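For concreteness, here is a minimal scalar sketch of what such a "codebook" decode amounts to. The codebook size (512 entries) and its layout are hypothetical, not the actual E8/D4 grids used by the i-quants; the point is that every group of 8 weights is reduced to an index into a fixed table of pre-defined points.

```cpp
#include <cstdint>

// Hypothetical codebook: 512 entries, each packing a group of 8 INT8 weights
// into one uint64_t. The real i-quant grids (E8/D4 based) are different;
// this is only meant to illustrate the mechanism.
static const uint64_t kCodebook[512] = { /* fixed at quantization-design time */ };

// Decode one group of 8 weights from its 9-bit codebook index.
// Only these 512 pre-defined points are available to represent a group,
// which is the regularization effect described above.
static inline void decode_group(uint16_t index, int8_t out[8]) {
    const uint64_t packed = kCodebook[index & 511];
    for (int i = 0; i < 8; ++i) {
        out[i] = (int8_t)((packed >> (8 * i)) & 0xff);
    }
}
```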
### Performance
The use of a "codebook" requires a lookup in a fairly large table to convert the "codebook" index (which is stored in the quantized model) to actual quantized model weights when performing matrix multiplications. The lookup is handled quite OK by modern GPU's, but leads to a massive performance penalty on CPU's (and, from what I gather from
llama.cpp
user comments, also on older GPU's). The newIQK
quants use a non-linear mapping between the quantized value stored in the model data (0...15
for 4-bit quantization,0...7
for 3-bit, etc.) and the actual model weight, which also needs a lookup table. But these lookup tables are much smaller (4, 8, 16, 32INT8
values for 2-, 3-, 4-, 5-bit quantization), so they fit into 1 or 2 SIMD registers, and thus can be handled very efficiently with SIMD instructions (_mm256_shuffle_epi8
onAVX2
,vqtbl1q_s8
onARM_NEON
), resulting in a performance that is (nearly) the same as corresponding linear mapping between quants and model weights.Let's look how this translates into observed inference performance. We compare
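To illustrate why the small tables are so cheap, here is a minimal `AVX2` sketch (not the actual kernel in this repository) that maps 32 packed 4-bit quants through a 16-entry `INT8` table with a single `_mm256_shuffle_epi8`. The table values and the nibble layout are made up for illustration:

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical non-linear 4-bit -> INT8 mapping (16 non-uniformly spaced values).
// The actual IQ4_K table is different; this is just for illustration.
static const int8_t kValues[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113
};

// Map 32 packed 4-bit indices (16 bytes; low nibbles = first 16 quants,
// high nibbles = last 16 quants in this illustrative layout) to 32 INT8 weights.
static inline __m256i dequant32_nonlinear(const uint8_t * packed) {
    // The whole lookup table fits into a single 128-bit register; broadcast it
    // to both 128-bit lanes because _mm256_shuffle_epi8 shuffles per lane.
    const __m256i values = _mm256_broadcastsi128_si256(
                               _mm_loadu_si128((const __m128i *)kValues));

    // Split the 16 bytes into 32 4-bit indices.
    const __m128i bytes = _mm_loadu_si128((const __m128i *)packed);
    const __m128i ml    = _mm_set1_epi8(0x0f);
    const __m128i lo    = _mm_and_si128(bytes, ml);
    const __m128i hi    = _mm_and_si128(_mm_srli_epi16(bytes, 4), ml);

    // One byte shuffle maps all 32 indices to their non-linear INT8 values.
    return _mm256_shuffle_epi8(values, _mm256_set_m128i(hi, lo));
}
```

On `ARM_NEON` the analogous lookup for 16 indices is a single `vqtbl1q_s8`.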
Let's look at how this translates into observed inference performance. We compare `IQ2_K` to the matching `IQ2_XS`, and `IQ3_K` to the matching `IQ3_S` quants (matching in the sense that they use basically the same bpw and have very similar quantization accuracy). The following table shows performance in tokens per second (t/s) for prompt processing (`pp512`, so a prompt of 512 tokens) and token generation (`tg128`, so generating 128 tokens one-by-one) between matching quants on `AVX2` (Ryzen-7950X) and `ARM_NEON` (M2-Max CPU). I have also added mainline `llama.cpp` results. The two values in the `Speedup` column are the `t/s` ratios between the new `IQK` quants and the corresponding i-quant in `llama.cpp` and in this repository. For instance, if we look at `IQ3_S` on the Ryzen-7950X, we see that `IQ3_K` will perform prompt processing 6.45 times faster than `llama.cpp`, and token generation will be 2.37X faster!

## What are non-linear quants anyway?
Will add later.
## IQ6_K?
Before LLaMA-3, `Q6_K` quantization always had a quantization error in the 0.1-0.15% range, i.e., it was basically as good as the full precision model. But for LLaMA-3.1-70B the `Q6_K` quantization error is 0.65%! `Q8_0` does match the full precision model, but it uses 2 extra bpw. I have experimented with 6-bit non-linear quantization in the past, but the `Q6_K` quantization error was so low that it was basically not possible to see a benefit from the non-linearity. Given the much higher `Q6_K` quantization error for LLaMA-3 models, it may be worthwhile to resurrect 6-bit non-linear quantization.

**Update** See PR #14