Extend sgemm.cpp support for Q5_0 models #10010

Merged: 1 commit into ggerganov:master on Oct 25, 2024

Conversation

@Srihari-mcw (Contributor) commented Oct 23, 2024

This PR extends sgemm.cpp to support Q5_0 quantization alongside the existing Q4_0 and Q8_0 quantizations. Good prompt processing gains were observed on an AMD Raphael 7600X after the changes.
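For readers unfamiliar with the format, below is a minimal scalar sketch of how a Q5_0 block is laid out and dequantized, based on the standard ggml `block_q5_0` layout (32 weights per block: an fp16 scale, 4 bytes of high bits, and 16 bytes of packed low nibbles). The function and helper names (`dequantize_q5_0_block`, `fp16_to_fp32`) are illustrative only; the merged sgemm.cpp change presumably performs the equivalent unpacking with vector intrinsics inside the tile kernels, so treat this as a reference for the format rather than the code that was added.

```cpp
// Illustrative scalar dequantization of one Q5_0 block (hypothetical names).
#include <cstdint>
#include <cstring>

constexpr int QK5_0 = 32;          // weights per Q5_0 block

struct block_q5_0 {
    uint16_t d;                    // per-block scale, stored as IEEE fp16 bits
    uint8_t  qh[4];                // 5th (high) bit of each of the 32 weights
    uint8_t  qs[QK5_0 / 2];        // packed low nibbles: weight j and weight j+16
};

// Minimal fp16 -> fp32 conversion (normals and inf/NaN; subnormals flushed to zero).
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    const uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        bits = sign;                                   // zero / flushed subnormal
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (mant << 13);      // inf / NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Expand one block into 32 floats: each weight is a 5-bit unsigned value
// (low nibble from qs, high bit from qh), recentered to [-16, 15] and scaled by d.
static void dequantize_q5_0_block(const block_q5_0 &x, float *y) {
    const float d = fp16_to_fp32(x.d);

    uint32_t qh;
    std::memcpy(&qh, x.qh, sizeof(qh));

    for (int j = 0; j < QK5_0 / 2; ++j) {
        const uint32_t xh0 = ((qh >> (j +  0)) << 4) & 0x10;  // high bit of weight j
        const uint32_t xh1 = ((qh >> (j + 12))     ) & 0x10;  // high bit of weight j + 16

        const int32_t x0 = (int32_t)((x.qs[j] & 0x0F) | xh0) - 16;
        const int32_t x1 = (int32_t)((x.qs[j] >>   4) | xh1) - 16;

        y[j]             = x0 * d;
        y[j + QK5_0 / 2] = x1 * d;
    }
}
```

Presumably, once the weights are recentered to signed integers in this way, the Q5_0 path can reuse the same integer dot-product machinery that sgemm.cpp already applies to Q4_0 weights against Q8_0-quantized activations.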

GCC, Linux:

Q5_0 Model:

| model | size | params | backend | threads | test | t/s | speedup |
| ----- | ---- | ------ | ------- | ------- | ---- | --- | ------- |
| llama 7B Q5_0 | 4.33 GiB | 6.74 B | CPU | 6 | pp 512 | 26.10 ± 0.02 | |
| llama 7B Q5_0 | 4.33 GiB | 6.74 B | CPU | 6 | pp 512 | 53.26 ± 0.12 | 104.61% |
| llama 7B Q5_0 | 6.67 GiB | 6.74 B | CPU | 6 | tg 128 | 12.40 ± 0.01 | |
| llama 7B Q5_0 | 6.67 GiB | 6.74 B | CPU | 6 | tg 128 | 12.40 ± 0.00 | 0.00% |

GCC Version = 12.3

The PR was tested on an AMD Raphael 7600X, which supports the following flags by default:

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1|
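(The PR does not state the exact benchmark invocation; pp 512 / tg 128 figures at 6 threads like the ones above are typically produced with llama-bench, e.g. something along the lines of ./llama-bench -m ggml-model-q5_0.gguf -t 6 -p 512 -n 128, so the precise command here is an assumption.)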

Original Unquantized Models:

Llama2 7B : https://huggingface.co/meta-llama/Llama-2-7b

@ggerganov (Owner):

Can you show PPL results for a few chunks to make sure the results are correct?

Btw, the table header says Q8_0 Model, but the model in the first column is Q5_0.

@Srihari-mcw (Contributor, Author):

Hi @ggerganov,

Thanks for pointing out the typo. The readings are indeed for the Q5_0 model.

Also, the perplexity readings were found to be the same before and after the changes.

Perplexity was measured for models quantized from the Meta Llama 2 7B model with the following command:
./llama-perplexity -m ../test_models/ggml-model-q5_0.gguf -f wikitext-2-raw/wiki.test.raw --chunks 128

It calculated perplexity over 128 chunks:
perplexity: calculating perplexity over 128 chunks, n_ctx=512, batch_size=2048, n_seq=4

The perplexity results are tabulated as follows:

| model | perplexity (final estimate PPL) | commit id |
| ----- | ------------------------------- | --------- |
| llama 7B Q5_0 | 6.3549 +/- 0.08131 | d42b46bc8 |
| llama 7B Q5_0 | 6.3549 +/- 0.08131 | 6f1d9d71 |

@ggerganov merged commit 2f8bd2b into ggerganov:master on Oct 25, 2024
50 of 53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024