I Benchmarked 25 models at 16GB, 6.5GB, and 3.5GB sizes to find out whether a large model with smaller quant is better than a small model with bigger quant #11468
ZoontS started this conversation in Show and tell.
My Takeaway
You should use higher-parameter-count models if you can fit anything better than IQ3_XS quants; Q2 and Q1 quants are not worth it.
Personally, I would target IQ4_XS for GPU inference, and Q4_0 for CPU-only inference for the extra speed.
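To make the tradeoff concrete, here is a minimal sketch of how you might estimate which (parameter count, quant) combinations fit a given memory budget. The bits-per-weight figures are approximate values for llama.cpp-style quants and are assumptions for illustration only; real file sizes also include embeddings, output layers, and metadata, and you need extra room for the KV cache.

```python
# Rough memory-budget sketch. The bits-per-weight (bpw) numbers below are
# approximations (assumed for illustration), not exact llama.cpp figures.
BPW = {
    "Q2_K": 2.6,
    "IQ3_XS": 3.3,
    "IQ4_XS": 4.25,
    "Q4_0": 4.5,
    "Q8_0": 8.5,
}

def est_size_gib(params_billion: float, quant: str) -> float:
    """Estimated weight size in GiB: params * bits-per-weight / 8."""
    return params_billion * 1e9 * BPW[quant] / 8 / 2**30

def fits(params_billion: float, quant: str, budget_gib: float) -> bool:
    """True if the quantized weights alone fit under the budget."""
    return est_size_gib(params_billion, quant) <= budget_gib

for p in (7, 13, 32):
    for q in ("Q2_K", "IQ3_XS", "IQ4_XS"):
        print(f"{p}B {q}: ~{est_size_gib(p, q):.1f} GiB, "
              f"fits 16 GiB: {fits(p, q, 16)}")
```

By this estimate a 32B model at IQ4_XS lands just under 16 GiB of weights, which is why the larger-model-at-moderate-quant option is often on the table at all.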