
Add QK=128 q4_1 for GPTQ #937

Closed
wants to merge 11 commits into from

Conversation

@qwopqwop200 commented Apr 13, 2023

Currently, GPTQ models are converted to the q4_1 (QK=32) format. However, this is quite inefficient, since GPTQ generally recommends a group size (QK) of 128.
To solve this, I created a new q4_1 variant with QK=128, called q4_2.
Applying it gives roughly a 20% speed improvement and a 17% reduction in memory usage for GPTQ models.
I'm confident this implementation will make llama.cpp run faster and more robustly with GPTQ.

I don't currently have an ARM CPU, so the ARM code path is untested and may not work.
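
For reference, here is a minimal C sketch of how the two block layouts compare, assuming a q4_1-style block with a float scale and min followed by packed 4-bit quants (the struct names and exact layout are my assumption, modelled on ggml's existing q4_1; the actual q4_2 definition in this PR may differ):

```c
#include <stdint.h>
#include <stdio.h>

// Existing q4_1 block: 32 weights share one scale and one min.
#define QK_1 32
typedef struct {
    float   d;              // scale
    float   m;              // min
    uint8_t qs[QK_1 / 2];   // 32 x 4-bit quants packed as nibbles
} block_q4_1;               // 8 + 16 = 24 bytes per 32 weights -> 6.0 bits/weight

// Proposed q4_2-style block: 128 weights share one scale and one min,
// matching GPTQ's recommended group size of 128.
#define QK_2 128
typedef struct {
    float   d;
    float   m;
    uint8_t qs[QK_2 / 2];   // 128 x 4-bit quants
} block_q4_2;               // 8 + 64 = 72 bytes per 128 weights -> 4.5 bits/weight

int main(void) {
    printf("q4_1 (QK=32):  %.2f bits/weight\n", 8.0 * sizeof(block_q4_1) / QK_1);
    printf("q4_2 (QK=128): %.2f bits/weight\n", 8.0 * sizeof(block_q4_2) / QK_2);
    return 0;
}
```

With QK=128 the per-block scale/min overhead is amortized over four times as many weights, which is presumably where the reported memory saving comes from (the quantized tensors themselves shrink by about 25%; the ~17% figure would then be measured over the whole model).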

@qwopqwop200 (Author) commented Apr 13, 2023

Make a GPTQ model (act-order is not supported) using:
https://github.com/qwopqwop200/GPTQ-for-LLaMa
python llama.py ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --groupsize 128 --save llama7b-4bit-128g.pt

Use Python 3.10.

cd llama.cpp
make

# obtain the original LLaMA model weights and place them in ./models
ls ./models
tokenizer_checklist.chk tokenizer.model  llama7b-4bit-128g.pt

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the 7B model to ggml int4 format
python3 ./convert.py ./models/llama7b-4bit-128g.pt --outtype q4_2 --outfile ./models/llama7b-4bit-128g-ggjt.bin

# run the inference
./main -m ./models/llama7b-4bit-128g-ggjt.bin -n 128
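
For anyone curious how the format is used at inference time: dequantizing a QK=128 block works the same way as for q4_1, just over a larger group. Below is a minimal sketch assuming the block layout described above; this is not the actual code from the PR, and the packing order (low nibble first) is an assumption based on ggml's q4_1.

```c
#include <stdint.h>

#define QK_2 128
typedef struct {
    float   d;              // scale
    float   m;              // min
    uint8_t qs[QK_2 / 2];   // packed 4-bit quants
} block_q4_2;

// Dequantize one row of n values (n must be a multiple of QK_2).
static void dequantize_row_q4_2(const block_q4_2 *x, float *y, int n) {
    const int nb = n / QK_2;
    for (int i = 0; i < nb; i++) {
        const float d = x[i].d;
        const float m = x[i].m;
        for (int j = 0; j < QK_2 / 2; j++) {
            const uint8_t b = x[i].qs[j];
            y[i*QK_2 + 2*j + 0] = (b & 0x0F) * d + m;   // low nibble
            y[i*QK_2 + 2*j + 1] = (b >>   4) * d + m;   // high nibble
        }
    }
}
```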

do not force the prompt file to end with a new line (#908)
@ggerganov (Member) commented
Can you provide a perplexity comparison between the original Q4_1 quantization and the proposed one?

@qwopqwop200 (Author) commented
Some parts of my experiment appear to be wrong (ctx was set to 1024). In particular, the results of the current experiment are worse than those of Q4_1. Because of this, I will close this PR.

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request Dec 19, 2023
jeroen-mostert pushed a commit to jeroen-mostert/llama.cpp that referenced this pull request Aug 30, 2024