This PR mentioned a while back that, since Llama 2 70B uses GQA, there is a specific k-quantization trick that allows it to be quantized at higher quality with only a marginal increase in model size:
Mistral 7B, a very popular model released after that PR was made, also uses Grouped Query Attention.
Checking whether a 7B model is a Mistral model (and therefore uses GQA) and applying the same treatment should theoretically provide similar gains, unless I am mistaken. A rough sketch of what such a check could look like is below.
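For illustration only, here is a minimal sketch of a GQA-aware type bump keyed off the head counts rather than a hard-coded model size. The helper name and enum are hypothetical and this is not llama.cpp's actual quantization code, just the shape of the idea: when `n_head_kv < n_head`, the V projection is proportionally smaller, so promoting it one k-quant step costs very little file size.

```cpp
// Hypothetical sketch: promote attn_v.weight to a higher-bit k-quant when the
// model uses grouped-query attention (n_head_kv < n_head). With GQA the K/V
// projections are a fraction of their usual size, so the extra bits are cheap.
#include <cstdint>
#include <string>

enum class kquant_type { Q3_K, Q4_K, Q5_K, Q6_K };

// n_head    : number of attention (query) heads
// n_head_kv : number of key/value heads (== n_head when GQA is not used)
kquant_type pick_attn_v_type(const std::string & tensor_name,
                             uint32_t n_head, uint32_t n_head_kv,
                             kquant_type default_type) {
    const bool is_attn_v = tensor_name.find("attn_v.weight") != std::string::npos;
    const bool uses_gqa  = n_head_kv > 0 && n_head_kv < n_head; // e.g. Mistral 7B: 32 query heads, 8 KV heads

    if (is_attn_v && uses_gqa) {
        // Promote one step: the tensor is smaller by a factor of n_head / n_head_kv
        // compared to a non-GQA model, so the size impact is marginal.
        switch (default_type) {
            case kquant_type::Q3_K: return kquant_type::Q4_K;
            case kquant_type::Q4_K: return kquant_type::Q5_K;
            case kquant_type::Q5_K: return kquant_type::Q6_K;
            default:                break;
        }
    }
    return default_type;
}
```

The point of keying off `n_head` vs `n_head_kv` instead of the model name is that it would cover Mistral 7B (and any future GQA model) automatically, rather than only the 70B case.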
I think quantization optimization is sorely overlooked in general; there is a lot of low-hanging fruit there, for sure.
kalomaze changed the title from "Llama 2 70b quantizes in way that's optimal for GQA; Mistral 7b is missing that optimization" to "Llama 2 70b quantizes in way that's superior for GQA; Mistral 7b is missing that optimization" on Nov 17, 2023.
The quantum mixtures currently available in llama.cpp have mostly been optimized for the original LLaMA models, and to some extent for Falcon. There is no guarantee that these mixtures are optimal for any other model or finetune.
I still think that the correct way to generate per-model quantum mixtures is via #2783, but I haven't gotten around to implementing it yet.