Extend sgemm.cpp support for Q5_0 models #10010

Merged: 1 commit into ggerganov:master on Oct 25, 2024

Conversation

@Srihari-mcw (Contributor) commented Oct 23, 2024

This PR extends sgemm.cpp to support Q5_0 quantization alongside the existing Q4_0 and Q8_0 quantizations. Good prompt processing gains were observed on an AMD Raphael 7600X after the changes.
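For readers unfamiliar with the format, below is a minimal scalar sketch of how a Q5_0 block is laid out and dequantized, based on the standard ggml `block_q5_0` layout (32 weights per block: an fp16 scale, 4 bytes of high bits, and 16 bytes of packed low nibbles). The function and helper names (`dequantize_q5_0_block`, `fp16_to_fp32`) are illustrative only; the merged sgemm.cpp change presumably performs the equivalent unpacking with vector intrinsics inside the tile kernels, so treat this as a reference for the format rather than the code that was added.

```cpp
// Illustrative scalar dequantization of one Q5_0 block (hypothetical names).
#include <cstdint>
#include <cstring>

constexpr int QK5_0 = 32;          // weights per Q5_0 block

struct block_q5_0 {
    uint16_t d;                    // per-block scale, stored as IEEE fp16 bits
    uint8_t  qh[4];                // 5th (high) bit of each of the 32 weights
    uint8_t  qs[QK5_0 / 2];        // packed low nibbles: weight j and weight j+16
};

// Minimal fp16 -> fp32 conversion (normals and inf/NaN; subnormals flushed to zero).
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    const uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        bits = sign;                                   // zero / flushed subnormal
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (mant << 13);      // inf / NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Expand one block into 32 floats: each weight is a 5-bit unsigned value
// (low nibble from qs, high bit from qh), recentered to [-16, 15] and scaled by d.
static void dequantize_q5_0_block(const block_q5_0 &x, float *y) {
    const float d = fp16_to_fp32(x.d);

    uint32_t qh;
    std::memcpy(&qh, x.qh, sizeof(qh));

    for (int j = 0; j < QK5_0 / 2; ++j) {
        const uint32_t xh0 = ((qh >> (j +  0)) << 4) & 0x10;  // high bit of weight j
        const uint32_t xh1 = ((qh >> (j + 12))     ) & 0x10;  // high bit of weight j + 16

        const int32_t x0 = (int32_t)((x.qs[j] & 0x0F) | xh0) - 16;
        const int32_t x1 = (int32_t)((x.qs[j] >>   4) | xh1) - 16;

        y[j]             = x0 * d;
        y[j + QK5_0 / 2] = x1 * d;
    }
}
```

Presumably, once the weights are recentered to signed integers in this way, the Q5_0 path can reuse the same integer dot-product machinery that sgemm.cpp already applies to Q4_0 weights against Q8_0-quantized activations.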

GCC, Linux:

Q5_0 Model:

| model | size | params | backend | threads | test | t/s | speedup |
| ----- | ---- | ------ | ------- | ------- | ---- | --- | ------- |
| llama 7B Q5_0 | 4.33 GiB | 6.74 B | CPU | 6 | pp 512 | 26.10 ± 0.02 | |
| llama 7B Q5_0 | 4.33 GiB | 6.74 B | CPU | 6 | pp 512 | 53.26 ± 0.12 | 104.61% |
| llama 7B Q5_0 | 6.67 GiB | 6.74 B | CPU | 6 | tg 128 | 12.40 ± 0.01 | |
| llama 7B Q5_0 | 6.67 GiB | 6.74 B | CPU | 6 | tg 128 | 12.40 ± 0.00 | 0.00% |

GCC Version = 12.3

The PR was tested on an AMD Raphael 7600X, which supports the following flags by default:

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1|
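(The PR does not state the exact benchmark invocation; pp 512 / tg 128 figures at 6 threads like the ones above are typically produced with llama-bench, e.g. something along the lines of ./llama-bench -m ggml-model-q5_0.gguf -t 6 -p 512 -n 128, so the precise command here is an assumption.)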

Original Unquantized Models:

Llama2 7B : https://huggingface.co/meta-llama/Llama-2-7b

@ggerganov (Owner):

Can you show PPL results for a few chunks to make sure the results are correct?

Btw, the table header says Q8_0 Model, but the model in the first column is Q5_0.

@Srihari-mcw (Contributor, Author):

Hi @ggerganov,

Thanks for pointing out the typo. The readings are indeed for the Q5_0 model.

Also, the perplexity readings were found to be the same before and after the changes.

Perplexity was measured for models quantized from the Meta Llama 2 7B model with the following command:
./llama-perplexity -m ../test_models/ggml-model-q5_0.gguf -f wikitext-2-raw/wiki.test.raw --chunks 128

It calculated perplexity over 128 chunks:
perplexity: calculating perplexity over 128 chunks, n_ctx=512, batch_size=2048, n_seq=4

The perplexity results are tabulated as follows:

| model | perplexity (final estimate PPL) | commit id |
| ----- | ------------------------------- | --------- |
| llama 7B Q5_0 | 6.3549 +/- 0.08131 | d42b46bc8 |
| llama 7B Q5_0 | 6.3549 +/- 0.08131 | 6f1d9d71 |

@ggerganov merged commit 2f8bd2b into ggerganov:master on Oct 25, 2024
50 of 53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024