This PR mentioned a while back that, since Llama 2 70B uses GQA, there is a specific k-quantization trick that allows it to be quantized at higher quality with only a marginal increase in model size:
Mistral 7B, a very popular model released after that PR was made, also uses Grouped Query Attention.
Checking whether a 7B model is a Mistral model (and therefore uses GQA) and applying the same treatment should theoretically provide similar gains, unless I am mistaken. A rough sketch of what such a check could look like is below.
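For illustration only, here is a minimal sketch of a GQA-aware type bump keyed off the head counts rather than a hard-coded model size. The helper name and enum are hypothetical and this is not llama.cpp's actual quantization code, just the shape of the idea: when `n_head_kv < n_head`, the V projection is proportionally smaller, so promoting it one k-quant step costs very little file size.

```cpp
// Hypothetical sketch: promote attn_v.weight to a higher-bit k-quant when the
// model uses grouped-query attention (n_head_kv < n_head). With GQA the K/V
// projections are a fraction of their usual size, so the extra bits are cheap.
#include <cstdint>
#include <string>

enum class kquant_type { Q3_K, Q4_K, Q5_K, Q6_K };

// n_head    : number of attention (query) heads
// n_head_kv : number of key/value heads (== n_head when GQA is not used)
kquant_type pick_attn_v_type(const std::string & tensor_name,
                             uint32_t n_head, uint32_t n_head_kv,
                             kquant_type default_type) {
    const bool is_attn_v = tensor_name.find("attn_v.weight") != std::string::npos;
    const bool uses_gqa  = n_head_kv > 0 && n_head_kv < n_head; // e.g. Mistral 7B: 32 query heads, 8 KV heads

    if (is_attn_v && uses_gqa) {
        // Promote one step: the tensor is smaller by a factor of n_head / n_head_kv
        // compared to a non-GQA model, so the size impact is marginal.
        switch (default_type) {
            case kquant_type::Q3_K: return kquant_type::Q4_K;
            case kquant_type::Q4_K: return kquant_type::Q5_K;
            case kquant_type::Q5_K: return kquant_type::Q6_K;
            default:                break;
        }
    }
    return default_type;
}
```

The point of keying off `n_head` vs `n_head_kv` instead of the model name is that it would cover Mistral 7B (and any future GQA model) automatically, rather than only the 70B case.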
I think quantization optimization is sorely overlooked in general; there is a lot of low-hanging fruit there, for sure.
kalomaze changed the title from "Llama 2 70b quantizes in way that's optimal for GQA; Mistral 7b is missing that optimization" to "Llama 2 70b quantizes in way that's superior for GQA; Mistral 7b is missing that optimization" on Nov 17, 2023.
The quantum mixtures currently available in llama.cpp have mostly been optimized for the original LLaMA models, and to some extent for Falcon. There is no guarantee that these mixtures are optimal for any other model or finetune.
I still think that the correct way to generate per-model quantum mixtures is via #2783, but I haven't gotten around to implementing it yet.