server : add speculative decoding support #10455

ggerganov · 2024-11-22T11:53:37Z

Initial implementation that enables speculative decoding in llama-server. Test with this command:

./bin/llama-server \
    -m  ../models/qwen2.5-32b-coder-instruct/ggml-model-q8_0.gguf \
    -md ../models/qwen2.5-0.5b-coder-instruct/ggml-model-q4_0.gguf \
    -ngl 99 -ngld 99 -fa --port 8033 -c 32768 \
    --draft-max 16 --draft-min 5

The --draft-max and --draft-min might need tuning
Use the built-in llama.cpp Web UI client
Set Top K = 1
With multiple GPUs, use the new -devd argument to put the draft model on only one of them (llama : accept a list of devices to use to offload a model #10497)

Feedback is appreciated.

TODO:

simplify
control draft context size
rename server.params to something else to avoid confusion
test multi-user
test offloading draft model with RPC

3Simplex · 2024-11-22T15:58:01Z

From what I have read the goal is faster inference while retaining quality of the larger model.

I am using rx6900xt with vulkan
Using Qwen2.5-Coder-7B-Instruct-Q8_0.gguf alone I see 50 t/s
Using Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf alone I see 230 t/s

I get about 10-12 t/s with an incorrect configuration.

.\llama-server.exe -m "...Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf" -md "...Qwen2.5-Coder-7B-Instruct-Q8_0.gguf" -ngl 99 -ngld 99 -fa --port 8080 -c 32768 --draft 10 --draft-min 5

Flipping the models increased speed and the output looks similar. This makes sense since the -md is the draft model which is supposed to be the smaller model.

I get about 16 t/s with the correct configuration.

.\llama-server.exe -m "...Qwen2.5-Coder-7B-Instruct-Q8_0.gguf" -md "...Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf" -ngl 99 -ngld 99 -fa --port 8080 -c 32768 --draft 10 --draft-min 5

With a lower context size of 2048, the server crashed when the context limit was reached.

ggerganov · 2024-11-24T15:50:17Z

@3Simplex What is the output of the following bench on your machine:

llama-bench.exe -m "...Qwen2.5-Coder-7B-Instruct-Q8_0.gguf" -p 1,1,2,3,4,5,6,7,8,12,16,32 -r 20 -n 0 -ngl 99 -fa 1

3Simplex · 2024-11-24T15:56:01Z

@ggerganov

.\llama-bench.exe -m "...\Qwen2.5-Coder-7B-Instruct-Q8_0.gguf" -p 1,1,2,3,4,5,6,7,8,12,16,32 -r 20 -n 0 -ngl 99 -fa 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6900 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64

model	size	params	backend	ngl	fa	test	t/s
ggml_vulkan: Compiling shaders..............................Done!
qwen2 7B Q8_0	7.54 GiB	7.62 B	Vulkan	99	1	pp1	37.79 ± 0.30
qwen2 7B Q8_0	7.54 GiB	7.62 B	Vulkan	99	1	pp1	37.81 ± 0.29
qwen2 7B Q8_0	7.54 GiB	7.62 B	Vulkan	99	1	pp2	16.14 ± 0.04
qwen2 7B Q8_0	7.54 GiB	7.62 B	Vulkan	99	1	pp3	23.40 ± 0.06
qwen2 7B Q8_0	7.54 GiB	7.62 B	Vulkan	99	1	pp4	31.10 ± 0.04
qwen2 7B Q8_0	7.54 GiB	7.62 B	Vulkan	99	1	pp5	37.39 ± 1.74
qwen2 7B Q8_0	7.54 GiB	7.62 B	Vulkan	99	1	pp6	45.52 ± 0.06
qwen2 7B Q8_0	7.54 GiB	7.62 B	Vulkan	99	1	pp7	51.53 ± 0.09
qwen2 7B Q8_0	7.54 GiB	7.62 B	Vulkan	99	1	pp8	58.57 ± 0.28
qwen2 7B Q8_0	7.54 GiB	7.62 B	Vulkan	99	1	pp12	80.38 ± 0.13
qwen2 7B Q8_0	7.54 GiB	7.62 B	Vulkan	99	1	pp16	105.83 ± 0.54
qwen2 7B Q8_0	7.54 GiB	7.62 B	Vulkan	99	1	pp32	202.53 ± 0.21

build: 0c74590 (4160)
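One way to read this table is as time per batch rather than tokens per second. The short sketch below converts the throughput figures copied from the output above; the function name and the interpretation are assumptions for illustration, not something llama-bench reports itself.

```python
# Throughput (t/s) per batch size, copied from the llama-bench output above.
bench_tps = {1: 37.79, 2: 16.14, 4: 31.10, 8: 58.57, 16: 105.83, 32: 202.53}


def batch_time_ms(batch_size, tokens_per_second):
    """Wall-clock time in milliseconds to process one batch of batch_size tokens."""
    return 1000.0 * batch_size / tokens_per_second


single = batch_time_ms(1, bench_tps[1])
for size, tps in bench_tps.items():
    cost = batch_time_ms(size, tps)
    # Ratio > 1 means the batch costs more than one ordinary decoding step.
    print(f"batch {size:2d}: {cost:6.1f} ms ({cost / single:.1f}x a single token)")
```

If this reading is right, any batch larger than one token costs several single-token steps on this Vulkan setup, so a verification pass only pays off when enough draft tokens are accepted.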

mostlygeek · 2024-11-24T19:54:49Z

I tried out commit e80f758 with my P40s, 3xP40s and 3090. These are the commands for the baselines and the tests.

Baseline (the same command with the -md model.gguf argument removed):

./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf -ngl 99 -ngld 99 -fa --port 9999 -c 4096 --draft-max 16 --draft-min 5

With speculative model:

./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf -md /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf -ngl 99 -ngld 99 -fa --port 9999 -c 4096 --draft-max 16 --draft-min 5

Tested it with curl using:

for n in `seq 1 5`; do curl http://10.0.1.50:9999/v1/chat/completions -N -v -d '{"messages":[{"role":"user","content":"write hello world in golang"}],"temperature":0.1, "stream":false,"max_tokens":1000, "model":"coder" }'; done

Data:

GPU	baseline pp	baseline eval	`-md ...` pp	`-md ...` eval
3090	299 tps	34 tps	300 tps	31 tps
P40	101 tps	11.22 tps	101 tps	10.52 tps
3xP40	91 tps	10.6 tps	90 tps	9.8 tps

ggerganov · 2024-11-24T20:13:22Z

Currently, it requires cache_prompt: true to be set to do speculation. This will be fixed in the next PRs. Using greedy sampling should improve things as well:

cache_prompt: true, top_k: 1, samplers: ["top_k"]

The biggest benefit from speculative sampling is when you have more grounding. For example, if you have enough memory for a bigger context, you can try something like this:

# get the llama.vim plugin source code
code=$(curl -s https://raw.githubusercontent.com/ggml-org/llama.vim/refs/heads/master/autoload/llama.vim | jq -sRr @json)

# ask qwen to implement something (speculative decoding disabled)
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg code "$code" \
  '{ messages: [{ role: "system", content: "You are an expert computer scientist. Respond only with code blocks. Do not add any other comments except code." }, { role: "user", content: "Suggest an improvement for the `chunk_sim` function using Levenstein distance: ```\($code)```" }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 0 }')" | jq -r .choices[0].message.content

# speculative decoding enabled
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg code "$code" \
  '{ messages: [{ role: "system", content: "You are an expert computer scientist. Respond only with code blocks. Do not add any other comments except code." }, { role: "user", content: "Suggest an improvement for the `chunk_sim` function using Levenstein distance: ```\($code)```" }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 16 }')" | jq -r .choices[0].message.content

With CUDA, you might want to try setting "speculative.n_min": 0 or 1 since I think it has efficient small-batch kernels for Q4_K, so no need to skip the small batches.
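For scripted testing, the same request can be built from Python. This is a minimal sketch using only the standard library; the field names and the port come from the curl commands above, while `build_payload` and `complete` are hypothetical helper names introduced here.

```python
import json
import urllib.request

# Assumption: the server started with the command above listens on this port.
SERVER_URL = "http://localhost:8033/v1/chat/completions"


def build_payload(user_prompt, n_max):
    """Build a request body with greedy sampling and a per-request draft limit."""
    return {
        "messages": [{"role": "user", "content": user_prompt}],
        "cache_prompt": True,        # currently required for speculation
        "top_k": 1,                  # greedy sampling
        "samplers": ["top_k"],
        "speculative.n_max": n_max,  # 0 disables drafting for this request
    }


def complete(user_prompt, n_max, url=SERVER_URL):
    """POST the payload and return the generated text (needs a running server)."""
    body = json.dumps(build_payload(user_prompt, n_max)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer no-key"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

Calling `complete(prompt, 0)` and then `complete(prompt, 16)` would mirror the two curl runs above.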

mostlygeek · 2024-11-24T21:28:34Z

Thank you for the guidance. Using d905266, I reran the tests.

Results look quite good.

GPU	`n_max:0`	`n_max:16`	change
P40	8.7 tps	39.4 tps	4.45x
3xP40 `-sm row`	12.70 tps	53 tps	4.17x
3090	29 tps	167 tps	5.73x

Server command:

./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf -md /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf -ngl 99 -ngld 99 -fa --port 9999 -c 10240 --draft-max 16 --draft-min 0 --host 0.0.0.0 2>&1 | grep 'eval time'

Kept this pretty consistent, except for the 3xP40 run where I added -sm row

Client side:

$ code=$(curl -s https://raw.githubusercontent.com/ggml-org/llama.vim/refs/heads/master/autoload/llama.vim | jq -sRr @json)

$ for n in `seq 1 5`; \
do \
    curl --request POST --url http://10.0.1.50:9999/v1/chat/completions \
        -H "Content-Type: application/json" -H "Authorization: Bearer no-key" \
        -d "$(jq -n --arg code "$code" '{ messages: [{ role: "system", content: "You are an expert computer scientist. Respond only with code blocks. Do not add any other comments except code." }, { role: "user", content: "Suggest an improvement for the `chunk_sim` function using Levenstein distance: ```\($code)```" }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 16 }')" | jq -r .choices[0].message.content; \
done

For the client side curl, I changed speculative.n_max between 0 and 16 to get the different timings.

Here are the raw results. Some observations first:

with n_max: 0, 437 tokens were generated. With n_max: 16, 440 tokens were generated.
the server was restarted between tests to clear the cache
the code generated was identical (ran it through a diff)

3090 data

# speculative.n_max: 0
prompt eval time =    8032.34 ms /  8318 tokens (    0.97 ms per token,  1035.56 tokens per second)
       eval time =   14975.84 ms /   437 tokens (   34.27 ms per token,    29.18 tokens per second)
prompt eval time =      37.56 ms /     1 tokens (   37.56 ms per token,    26.62 tokens per second)
       eval time =   14988.71 ms /   437 tokens (   34.30 ms per token,    29.16 tokens per second)
prompt eval time =      37.15 ms /     1 tokens (   37.15 ms per token,    26.92 tokens per second)
       eval time =   15005.60 ms /   437 tokens (   34.34 ms per token,    29.12 tokens per second)
prompt eval time =      37.27 ms /     1 tokens (   37.27 ms per token,    26.83 tokens per second)
       eval time =   15017.94 ms /   437 tokens (   34.37 ms per token,    29.10 tokens per second)
prompt eval time =      37.49 ms /     1 tokens (   37.49 ms per token,    26.67 tokens per second)
       eval time =   15026.50 ms /   437 tokens (   34.39 ms per token,    29.08 tokens per second)

# speculative.n_max: 16
prompt eval time =    7915.24 ms /  8318 tokens (    0.95 ms per token,  1050.88 tokens per second)
       eval time =    9432.51 ms /   440 tokens (   21.44 ms per token,    46.65 tokens per second)
prompt eval time =      38.44 ms /     1 tokens (   38.44 ms per token,    26.02 tokens per second)
       eval time =    2626.82 ms /   440 tokens (    5.97 ms per token,   167.50 tokens per second)
prompt eval time =      37.93 ms /     1 tokens (   37.93 ms per token,    26.37 tokens per second)
       eval time =    2629.31 ms /   440 tokens (    5.98 ms per token,   167.34 tokens per second)
prompt eval time =      37.91 ms /     1 tokens (   37.91 ms per token,    26.38 tokens per second)
       eval time =    2628.70 ms /   440 tokens (    5.97 ms per token,   167.38 tokens per second)
prompt eval time =      38.20 ms /     1 tokens (   38.20 ms per token,    26.18 tokens per second)
       eval time =    2637.09 ms /   440 tokens (    5.99 ms per token,   166.85 tokens per second)

single P40

# speculative.n_max: 0
prompt eval time =   55669.14 ms /  8318 tokens (    6.69 ms per token,   149.42 tokens per second)
       eval time =   50050.73 ms /   437 tokens (  114.53 ms per token,     8.73 tokens per second)
prompt eval time =     114.98 ms /     1 tokens (  114.98 ms per token,     8.70 tokens per second)
       eval time =   50075.91 ms /   437 tokens (  114.59 ms per token,     8.73 tokens per second)
prompt eval time =     113.24 ms /     1 tokens (  113.24 ms per token,     8.83 tokens per second)
       eval time =   50097.56 ms /   437 tokens (  114.64 ms per token,     8.72 tokens per second)
       
# speculative.n_max: 16
prompt eval time =   55362.42 ms /  8318 tokens (    6.66 ms per token,   150.25 tokens per second)
       eval time =   29859.49 ms /   440 tokens (   67.86 ms per token,    14.74 tokens per second)
prompt eval time =     113.02 ms /     1 tokens (  113.02 ms per token,     8.85 tokens per second)
       eval time =   11146.53 ms /   440 tokens (   25.33 ms per token,    39.47 tokens per second)
prompt eval time =     113.75 ms /     1 tokens (  113.75 ms per token,     8.79 tokens per second)
       eval time =   11142.33 ms /   440 tokens (   25.32 ms per token,    39.49 tokens per second)
prompt eval time =     113.19 ms /     1 tokens (  113.19 ms per token,     8.83 tokens per second)
       eval time =   11175.47 ms /   440 tokens (   25.40 ms per token,    39.37 tokens per second)
prompt eval time =     112.65 ms /     1 tokens (  112.65 ms per token,     8.88 tokens per second)
       eval time =   11159.70 ms /   440 tokens (   25.36 ms per token,    39.43 tokens per second)

3xP40 (-sm row)

# speculative.n_max: 0
prompt eval time =   36909.28 ms /  8318 tokens (    4.44 ms per token,   225.36 tokens per second)
       eval time =   34412.92 ms /   437 tokens (   78.75 ms per token,    12.70 tokens per second)
prompt eval time =      79.49 ms /     1 tokens (   79.49 ms per token,    12.58 tokens per second)
       eval time =   34414.53 ms /   437 tokens (   78.75 ms per token,    12.70 tokens per second)
prompt eval time =      79.40 ms /     1 tokens (   79.40 ms per token,    12.60 tokens per second)
       eval time =   34413.66 ms /   437 tokens (   78.75 ms per token,    12.70 tokens per second)

# speculative.n_max: 16
prompt eval time =   36858.25 ms /  8318 tokens (    4.43 ms per token,   225.68 tokens per second)
       eval time =   27168.81 ms /   440 tokens (   61.75 ms per token,    16.20 tokens per second)
prompt eval time =      79.72 ms /     1 tokens (   79.72 ms per token,    12.54 tokens per second)
       eval time =    8290.25 ms /   440 tokens (   18.84 ms per token,    53.07 tokens per second)
prompt eval time =      79.73 ms /     1 tokens (   79.73 ms per token,    12.54 tokens per second)
       eval time =    8295.16 ms /   440 tokens (   18.85 ms per token,    53.04 tokens per second)
prompt eval time =      79.99 ms /     1 tokens (   79.99 ms per token,    12.50 tokens per second)
       eval time =    8295.91 ms /   440 tokens (   18.85 ms per token,    53.04 tokens per second)
prompt eval time =      79.88 ms /     1 tokens (   79.88 ms per token,    12.52 tokens per second)
       eval time =    8301.95 ms /   440 tokens (   18.87 ms per token,    53.00 tokens per second)

Code generated:

function! s:chunk_sim(c0, c1)
    let l:lines0 = join(a:c0, "\n")
    let l:lines1 = join(a:c1, "\n")

    let l:distance = levenshtein(l:lines0, l:lines1)

    return 1 - (l:distance / max([strlen(l:lines0), strlen(l:lines1)]))
endfunction

function! levenshtein(s1, s2)
    let l:len1 = strlen(a:s1)
    let l:len2 = strlen(a:s2)

    if l:len1 == 0
        return l:len2
    endif

    if l:len2 == 0
        return l:len1
    endif

    let l:dp = []
    for i in range(l:len1 + 1)
        call add(l:dp, [])
        for j in range(l:len2 + 1)
            call add(l:dp[i], 0)
        endfor
    endfor

    for i in range(l:len1 + 1)
        let l:dp[i][0] = i
    endfor

    for j in range(l:len2 + 1)
        let l:dp[0][j] = j
    endfor

    for i in range(1, l:len1 + 1)
        for j in range(1, l:len2 + 1)
            let l:cost = (strcharpart(a:s1, i - 1, 1) == strcharpart(a:s2, j - 1, 1)) ? 0 : 1
            let l:dp[i][j] = min([l:dp[i - 1][j] + 1, l:dp[i][j - 1] + 1, l:dp[i - 1][j - 1] + l:cost])
        endfor
    endfor

    return l:dp[l:len1][l:len2]
endfunction

mostlygeek · 2024-11-24T21:36:05Z

Also, is 0 and 16 the only valid values for speculative.n_max? I tried it with 4, 12, and got this error: common/common.cpp:1480: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed

ggerganov · 2024-11-24T21:47:19Z

Thanks for the detailed tests. The results are inflated because there is one tricky side effect from the caching - consecutive runs with the same prompt will reuse the previous draft context which combined with greedy sampling would make the drafting instantaneous. So basically, in the following data for example, only the first result is relevant:

# speculative.n_max: 16
prompt eval time =    7915.24 ms /  8318 tokens (    0.95 ms per token,  1050.88 tokens per second)
       eval time =    9432.51 ms /   440 tokens (   21.44 ms per token,    46.65 tokens per second)    <--- only this is relevant
prompt eval time =      38.44 ms /     1 tokens (   38.44 ms per token,    26.02 tokens per second)
       eval time =    2626.82 ms /   440 tokens (    5.97 ms per token,   167.50 tokens per second)
prompt eval time =      37.93 ms /     1 tokens (   37.93 ms per token,    26.37 tokens per second)
       eval time =    2629.31 ms /   440 tokens (    5.98 ms per token,   167.34 tokens per second)
prompt eval time =      37.91 ms /     1 tokens (   37.91 ms per token,    26.38 tokens per second)
       eval time =    2628.70 ms /   440 tokens (    5.97 ms per token,   167.38 tokens per second)
prompt eval time =      38.20 ms /     1 tokens (   38.20 ms per token,    26.18 tokens per second)
       eval time =    2637.09 ms /   440 tokens (    5.99 ms per token,   166.85 tokens per second)

i.e. 46.65 t/s. The next runs are reusing the drafts and are not representative.
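To compare only the representative runs, the first generation line of each log can be picked out programmatically. A small sketch, assuming the log format shown above; the helper name `first_eval_tps` is made up for this example, and the log excerpts are copied from the 3090 data.

```python
import re

# Matches generation lines ("eval time"), but not "prompt eval time" lines.
EVAL_LINE = re.compile(r"^\s*eval time\s*=\s*([\d.]+) ms /\s*(\d+) tokens")


def first_eval_tps(log_text):
    """Tokens per second of the first generation line in a server log."""
    for line in log_text.splitlines():
        match = EVAL_LINE.match(line)
        if match:
            milliseconds, tokens = float(match.group(1)), int(match.group(2))
            return tokens / (milliseconds / 1000.0)
    raise ValueError("no 'eval time' line found")


baseline_log = """
prompt eval time =    8032.34 ms /  8318 tokens (    0.97 ms per token,  1035.56 tokens per second)
       eval time =   14975.84 ms /   437 tokens (   34.27 ms per token,    29.18 tokens per second)
"""

speculative_log = """
prompt eval time =    7915.24 ms /  8318 tokens (    0.95 ms per token,  1050.88 tokens per second)
       eval time =    9432.51 ms /   440 tokens (   21.44 ms per token,    46.65 tokens per second)
prompt eval time =      38.44 ms /     1 tokens (   38.44 ms per token,    26.02 tokens per second)
       eval time =    2626.82 ms /   440 tokens (    5.97 ms per token,   167.50 tokens per second)
"""

speedup = first_eval_tps(speculative_log) / first_eval_tps(baseline_log)
print(f"first-run speedup: {speedup:.2f}x")
```

Only the first run after a server restart is used, so the cached-draft runs do not inflate the ratio.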

> Also, is 0 and 16 the only valid values for speculative.n_max? I tried it with 4, 12, and got this error: common/common.cpp:1480: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed

This was a bug - it is fixed now. You should be able to change n_max to any value. Btw, for CUDA it might make sense to set n_min to 0 or 1 and keep n_max ~ 16. But feel free to experiment.

Btw, here is another fun test that I came up with which uses less context and is suitable for speculation:

# get top 10 stories from Hacker News
hn=$(curl -s https://hacker-news.firebaseio.com/v0/topstories.json | jq -r '.[:10] | @tsv' | tr '\t' '\n' | xargs -I {} curl -s "https://hacker-news.firebaseio.com/v0/item/{}.json" | jq -sRr @json)

# make a Markdown table based on some criteria
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg hn "$hn" \
  '{ messages: [{ role: "system", content: "You are a helpful text-editing assistant. Respond only with the requested text. Do not add any other comments to your response." }, { role: "user", content: "Extract a Markdown table that contains only stories about software engineering, AI or machine learning from the front-page of HN. The table should include: author, title, score, comments and an URL to the story: ```\($hn)```." }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 16 }')" | jq -r .choices[0].message.content

mostlygeek · 2024-11-24T22:14:22Z

> i.e. 46.65 t/s. The next runs are reusing the drafts and are not representative.

Thanks. That seems a lot more realistic.

I did some tests with a much shorter prompt: "write snake game in swift"

GPU	`n_max:0`	`n_max:16`	change
P40	10.54 tps	17.11 tps	1.62x
3xP40 `-sm row`	16.22 tps	22.80 tps	1.4x
3090	34.78 tps	51.31 tps	1.47x

curl --request POST --url http://10.0.1.50:9999/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg code "$code" '{ messages: [{ role: "system", content: "You are an expert computer scientist. Respond only with code blocks. Do not add any other comments except code." }, { role: "user", content: "write snake game in swift"}], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 0 }')" | jq -r .choices[0].message.content;

ggerganov · 2024-11-24T22:23:35Z

These numbers look reasonable. The speedup can vary in both ways based on the inputs, but enabling speculative should almost never result in slower than normal decoding.

3Simplex · 2024-11-24T22:34:50Z

> These numbers look reasonable. The speedup can vary in both ways based on the inputs, but enabling speculative should almost never result in slower than normal decoding.

With this build I am up to 25t/s on first run generation with speculative decoding using 15/5 draft tokens.

mostlygeek · 2024-11-25T03:56:46Z

A bit of data with llama-3.1 70B and llama-3.2 1B as the draft model. Prompt: "write a story about the natural resources in Canada".

GPU	`n_max:0`	`n_max:16`	change
3xP40 `-sm row`	9.80 tps	12.27 tps	1.25x

Server:

$ ./llama-server -m /mnt/nvme/models/Meta-Llama-3.1-70B-Instruct-Q4_K_L.gguf \
-md /mnt/nvme/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
-ngl 99 -ngld 99 -fa --port 9999 -c 10240 --draft-max 16 --draft-min 1 \
--host 0.0.0.0 -sm row

client (changed speculative.n_max between 0 and 16)

$ curl --request POST --url http://10.0.1.50:9999/v1/chat/completions \
-d "$(jq -n --arg code "$code" '{ messages: [{ role: "system", content: "You are a helpful AI."}, {role: "user",content: "write a story about the natural resources in Canada"}], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 0 }')" \
| jq -r .choices[0].message.content;

ggerganov · 2024-11-25T06:58:19Z

Note that I am not very sure what happens with multiple GPUs, but it is possible that the draft model gets split across them, which is not desired (see the logs if that is the case). You would want to keep the draft model fully on one GPU.

sorasoras · 2024-11-25T08:42:38Z

> Note that I am not very sure what happens with multiple GPUs, but it is possible that the draft model gets split across them, which is not desired (see the logs if that is the case). You would want to keep the draft model fully on one GPU.

I wonder if it is possible to load the draft and main models onto different backends, e.g. a 7900xtx and a P40 in a -cb process.

webbigdata-jp · 2024-12-01T14:04:50Z

@dagbdagb
Hello.
Thank you for your reply.

What models/model sizes/quants are you using

-m  .\gemma-2-27B-it-Q4_K_M-fp16.gguf ^ [18 GB]
-md .\gemma-2-2b-it-IQ3_XXS.gguf ^ [1.18 GB]

repository
https://huggingface.co/dahara1/gemma-2-2b-it-gguf-japanese-imatrix
https://huggingface.co/dahara1/gemma-2-27b-it-gguf-japanese-imatrix

What CPU/memory do you have?

AMD Ryzen 9 7940HS w/ Radeon 780M Graphics
Windows 11 Pro (not docker, not WSL)

How much memory?

Total 32.0 GB, 27.8 GB available (4 GB for iGPU)
I visually check during execution that no swapping is occurring.

What are your prompts for testing?

extract specific data from around 1500 tokens of text in Japanese (repeat 26 times)

Results with/without a draft model

(1) b4219 (llama.cpp official binary)

..\llama-b4219-bin-win-avx512-x64\llama-server.exe ^
    -m  .\gemma-2-27B-it-Q4_K_M-fp16.gguf ^
    -md .\gemma-2-2b-it-IQ3_XXS.gguf ^
    -e --temp 0 -c 4096 ^
    --draft-max 16 --draft-min 5

5764.07 seconds

..\llama-b4219-bin-win-avx512-x64\llama-server.exe ^
    -m  .\gemma-2-27B-it-Q4_K_M-fp16.gguf ^
    -e --temp 0 -c 4096

4968.42 seconds

(2) Locally built by myself (b4227)

..\llama.cpp\build\bin\Release\llama-server.exe ^
    -m  .\gemma-2-27B-it-Q4_K_M-fp16.gguf ^
    -md .\gemma-2-2b-it-IQ3_XXS.gguf ^
    -e --temp 0 -c 4096 ^
    --draft-max 16 --draft-min 5

5807.13 seconds

..\llama.cpp\build\bin\Release\llama-server.exe ^
    -m  .\gemma-2-27B-it-Q4_K_M-fp16.gguf ^
    -e --temp 0 -c 4096

5003.03 seconds

(3)ROCm (b4215)

set HSA_OVERRIDE_GFX_VERSION=gfx1103 && .\llama-server.exe ^
    -m  .\gemma-2-27B-it-Q4_K_M-fp16.gguf ^
    -md .\gemma-2-2b-it-IQ3_XXS.gguf ^
    -ngl 10 -ngld 10 -e --temp 0 -c 4096 ^
    --draft-max 16 --draft-min 5

1576.67 seconds

I feel that the 2B model may not be able to run fast enough on the CPU, which causes a bottleneck.

mybyte · 2024-12-02T20:48:52Z

Tried it with Qwen-2.5 on my 2x 3090s. No performance improvement whatsoever with the 72b split across both GPUs; I actually lost some performance. I ran a bunch of experiments using different hints I picked up here. Still no gains when running the 14b variant on the same GPU as the draft model (tried 0.5b, 1.5b, 3b) or on the other GPU, with any permutation of draft-p-min and speculative.n_max. The best I got was 2-3 tps more (around 56 tps) compared to the ~54 tps I get without the draft model.

llama-server -m /models/qwen2.5-14b-instruct-q6_k.gguf -md /models/qwen2.5-0.5b-instruct-q5_0.gguf -ngl 99 -ngld 99 -c 32000 --n-gpu-layers 99 --host 0.0.0.0 -fa --draft-max 16 --draft-min 5 -devd CUDA1 -dev CUDA1 --n-gpu-layers-draft 99 --draft-p-min 0.5 --top-k 1 --n-gpu-layers-draft 99 -cd 32000

Maybe I'm missing something obvious, but no clue how other folks got such huge performance gains.

JeroenAdam · 2024-12-02T21:47:05Z

I tested b4240 and got a 150% speed bump, which must be an optimal use case given my non-optimal hardware (16GB P5000 + 8GB RTX 2070 Max-Q). These tweaks contributed to that: draft-min 0, draft-p-min 0.5 and temperature 0.1.
That said, I lose 30% of that optimal speed when using ctk q8_0 and ctv q8_0 with 32K context instead of 16K context unquantized.

llama-server -m Qwen2.5-Coder-32B-Instruct-IQ4_XS.gguf -md Qwen2.5-Coder-0.5B-Instruct-Q4_0.gguf -ngl 99 -ngld 99 -fa -c 32768 -ctk q8_0 -ctv q8_0 --draft-max 16 --draft-min 0 --draft-p-min 0.5 --device-draft CUDA0 -ts 0.4,1
eval time = 70247.85 ms / 1077 tokens ( 65.23 ms per token, 15.33 tokens per second)
eval time = 128839.95 ms / 1439 tokens ( 89.53 ms per token, 11.17 tokens per second)
eval time = 153567.05 ms / 1805 tokens ( 85.08 ms per token, 11.75 tokens per second)

llama-server -m Qwen2.5-Coder-32B-Instruct-IQ4_XS.gguf -md Qwen2.5-Coder-0.5B-Instruct-Q4_0.gguf -ngl 99 -ngld 99 -fa -c 16384 --draft-max 16 --draft-min 0 --draft-p-min 0.5 --device-draft CUDA0 -ts 0.4,1
eval time = 60899.25 ms / 1077 tokens ( 56.55 ms per token, 17.68 tokens per second)
eval time = 74965.36 ms / 1180 tokens ( 63.53 ms per token, 15.74 tokens per second)
eval time = 87163.09 ms / 1695 tokens ( 51.42 ms per token, 19.45 tokens per second)

ggerganov · 2024-12-03T13:07:53Z

@mybyte Did you remember to set Top K = 1 in the UI?

ggerganov · 2024-12-03T13:09:01Z

> That said, I lose 30% of that optimal speed when using ctk q8_0 and ctv q8_0 with 32K context instead of 16K context unquantized.

@JeroenAdam This should be fixed now (#10586)

mybyte · 2024-12-03T13:16:00Z

> @mybyte Did you remember to set Top K = 1 in the UI?

Yup. Tried both, UI and my app via REST.

I wonder if a potential culprit is the model choice?

David-AU-github · 2024-12-09T01:19:53Z

First - wow ; is this some upgrade- Hats off.
My testing is a different direction:
How does this affect creative generation?

So I "mixed" two versions of the same model (llama3, 8B) , one reg quant, one imatrix quant (draft):
(release b4291)

```
./llama-server --model 'd:/llm/L3-Dark-Planet-8B-D_AU-IQ4_XS.gguf' `
--ctx-size 2048 `
--n-gpu-layers 99 `
--model-draft 'd:/llm/L3-Dark-Planet-8B-NEO-IMAT-D_AU-IQ2_S-imat.gguf' `
--ctx-size-draft 2048 `
--n-gpu-layers-draft 99 `
--draft-p-min 0.5 `
--draft-min 3 `
--draft-max 16
```
Dark Planet is a model I created; Neo Imatrix is a dataset I also created (specifically SI-FI).
This setup takes 7.5 GB of VRAM.

Using a custom sampler profile (includes use of DRY, temp .84, and penalty samplers) ... here is part output from a ONE LINE prompt:

_I stood frozen at my desk, staring into the abyss of the office park outside our high-rise building. My fingers were still poised above the keyboard, the cursor blinking mockingly at me like some malignant eye. The air was thick with tension, the kind that precedes an announcement no one wants to hear. But this was different. This was personal.

As I gazed out over the sea of cubicles and fluorescent lights, something began to rumble beneath my feet, like thunder growing nearer. A low, ominous hum filled my ears, making my teeth ache. Suddenly, the sky scraper swayed, as if buffeted by a storm force wind. I grabbed onto the edge of my desk for support, my heart racing with alarm.

My coworkers gasped and screamed as we all stumbled about, our chairs scraping against tile floors. Glass shattered somewhere nearby, and a panicked shout rose up from below. It felt like a seismic event, a catastrophic shift in the earth's crust that had nothing to do with geology.

Then came the blast._
(full output is just over 1000 tokens)

You don't get this level of detail at 8B. Or prose, or metaphors.
Just does not happen.
Lucky if you can get this with a 35B... or higher.

This is like a MOE on steroids. A lot of steroids.
And on my card (4060TI geforce, 16GB): 30 t/s. (a minus 20 t/s hit from IQ4XS , well worth it)
thank you - this is fantastic.

Mushoz · 2024-12-09T03:37:00Z

@David-AU-github I don't think you quite understand how speculative decoding works. It will generate identical results to the non-speculative decoding case and will always generate what the main model would have generated on its own. It's only useful as a speed boost, it will not alter the output at all.

David-AU-github · 2024-12-09T06:18:53Z

@Mushoz
From what I read speculative decoding takes tokens from the draft model and either accepts or rejects them.
Please clarify if this is off base / incorrect.
The information on this is not clear.

Also, what about the number of "drafts"? There is great variance in output here.
I read all the charts and comments - still not sure.
This model has been tested a lot, and I have tested a lot of 8B models ; this level of detail is unheard of.

I also tested this model - both main and draft - separately to see if I could replicate this level of detail.
Going to try again and see what happens.

Note: The two models, even though they are the same model, differ in that one is the imatrix version and the other the non-imatrix version. In static tests (temp=0) each model will output different content from the same prompt.

Mushoz · 2024-12-09T06:37:55Z

Normally a model can only predict one token at a time, because the token at position N depends on all previous tokens 0 through N-1. It would be much quicker if a model could predict not only token at position N, but also (depending on number of drafts) N+1, N+2, N+3. The reason why this is much faster, is because all the weight data of the big slow model only needs to be retrieved once for all 4 tokens, and LLMs are generally memory bandwidth limited. More calculations need to be done, but GPUs are extremely good at parallel computations which is what this is. But this cannot normally be done, because you need all previous tokens to be able to generate the next.

What the draft model does, is generate a sequence of draft tokens N, N+1, N+2. The big model then assumes these to be true and generates 1 token ahead of each of these draft token, so it can do multiple at the same time. That means that despite the draft model generating N, N+1, N+2, the big model still generates these as well to verify them, but is able to do so in parallel (fast) instead of in sequence as is done in normal generation.

If the base model generates a different token than what the draft model predicted, all subsequent tokens are discarded and the drafting is started all over again. This means that tokens are only retained if the draft model predicted the token the base model generated, which is why the output in speculative decoding is identical to what the base model would have generated in non speculative decoding. And this is also why a speed up is only observed if the predictions are good enough, because if not, all the extra work is simply discarded.
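The draft-then-verify loop described above can be sketched with two toy "models". This is an illustration under greedy sampling only, not the llama.cpp implementation: `target_next` and `draft_next` are made-up deterministic functions standing in for the big and draft models, and the batched verification pass is emulated with an ordinary loop.

```python
def target_next(context):
    """Toy stand-in for the big model: deterministic (greedy) next token."""
    return (sum(context) * 31 + len(context)) % 10


def draft_next(context):
    """Toy stand-in for the draft model: agrees with the target most of the time."""
    guess = target_next(context)
    return (guess + 1) % 10 if len(context) % 4 == 3 else guess


def generate_plain(prompt, n_tokens):
    """Ordinary decoding: one target call per generated token."""
    context = list(prompt)
    for _ in range(n_tokens):
        context.append(target_next(context))
    return context[len(prompt):]


def generate_speculative(prompt, n_tokens, n_draft=4):
    """Draft n_draft tokens, verify them, keep the matching prefix."""
    context = list(prompt)
    target_passes = 0
    while len(context) - len(prompt) < n_tokens:
        # 1. The draft model proposes n_draft tokens, one at a time.
        draft = []
        for _ in range(n_draft):
            draft.append(draft_next(context + draft))
        # 2. The target predicts the token after every draft prefix.
        #    A real implementation would do this in one batched pass.
        verified = [target_next(context + draft[:i]) for i in range(n_draft + 1)]
        target_passes += 1
        # 3. Keep target tokens while the draft matched; the first mismatch
        #    keeps the target's token and discards everything after it.
        accepted = []
        for i in range(n_draft):
            accepted.append(verified[i])
            if draft[i] != verified[i]:
                break
        else:
            accepted.append(verified[n_draft])  # every draft matched: one extra token
        context.extend(accepted)
    return context[len(prompt):][:n_tokens], target_passes


tokens, passes = generate_speculative([1, 2, 3], 20)
print(tokens == generate_plain([1, 2, 3], 20), passes)
```

Every kept token is one the target itself predicted from an already-verified prefix, which is why the two outputs match while the number of target passes drops.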

David-AU-github · 2024-12-09T07:17:08Z

@Mushoz

Thank you.

To clarify: you get a speed increase if the vast majority of draft sequence tokens are "in agreement" between the draft and main model.

The "draft" min / max is the number of tokens to generate for sequence? -> that is the min/max size of sequence?
or is that the min/max to accept ?

If the draft sequence / token(s) are not in agreement do both models "go back to the drawing board" and both models "redraft" a sequence of tokens?

If this point is true - specifically both models - , that explains what I am observing and I can work with that.
One of the core issues with creative use of a model is poor token choice(s).

It sounds like when I am err... using speculative decoding in this way it is forcing different choices to occur than would otherwise happen. Almost like a strange version of temp and/or a "light" rep pen sampler ? or adding a random element into the generation?

I have tested this method with other models / archs too and observing an increase in generational quality, with a decrease in T/S.
I do understand the intent of speculative decoding is a net speed increase.
I am just looking at alt uses... because you just never know.

Mushoz · 2024-12-09T07:33:34Z

The maximum number of tokens to draft is just that: how long a sequence the draft model will draft. The higher this value is, the higher the potential speed increase (up to a maximum, where you become compute bound), as long as the predictions are correct. For longer sequences, the draft model will almost certainly generate something different from the main model at some point. That means there is a sweet spot somewhere. Set it too high and you're just wasting work that is discarded, leading to slowdowns.

The draft min is a variable that will tune how long your draft sequence needs to be at a minimum before the main model uses the draft predictions to do the final predictions. Some GPUs might not be terribly efficient at certain batch sizes, so it might be better to force them to higher batch sizes where the kernels are better optimized for batch processing.

When the main model is in disagreement, all draft tokens are discarded and all tokens generated by the main model that were BASED ON THE INCORRECT DRAFT TOKEN(S) are discarded as well. Importantly, the token that was generated by the main model that proved the draft wrong is NOT discarded, and the main model essentially falls back to the normal non-speculative decoding case.

Again, the main model will generate the exact same tokens with speculative decoding on vs off. The differences you are observing are purely due to sampler settings. Speculative decoding does not alter the output in any way, and anything you believe you are seeing is merely placebo.
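
To make the "exact same tokens" claim concrete, here is a minimal toy sketch. It is not llama.cpp code: the two "models" are made-up deterministic functions, greedy sampling is assumed, and all names are hypothetical. It shows that a draft-then-verify loop emits the same sequence as plain greedy decoding, no matter how often the draft is wrong.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a context to its greedy next token


def target(ctx: List[int]) -> int:
    """Toy stand-in for the main model (deterministic)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft(ctx: List[int]) -> int:
    """Toy draft model: agrees with the main model except when len(ctx) % 4 == 0."""
    t = target(ctx)
    return t if len(ctx) % 4 else (t + 1) % 50


def greedy(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain decoding: one token per model call."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out[len(prompt):]


def speculative(main: Model, small: Model, prompt: List[int],
                n_new: int, n_draft: int = 16) -> List[int]:
    """Draft n_draft tokens, then keep only what the main model confirms."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1) The draft model proposes a sequence.
        proposal: List[int] = []
        for _ in range(n_draft):
            proposal.append(small(out + proposal))
        # 2) The main model checks each position (a real server does this in one batch).
        step: List[int] = []
        for tok in proposal:
            t = main(out + step)
            step.append(t)      # the main model's token is always kept
            if t != tok:
                break           # the rest of the draft is discarded
        else:
            step.append(main(out + step))  # extra token after a fully accepted draft
        out.extend(step)
    return out[len(prompt):goal]


if __name__ == "__main__":
    prompt = [1, 2, 3]
    print(speculative(target, draft, prompt, 40) == greedy(target, prompt, 40))
```

Every token that ends up in the output was produced by the main model for the exact context that precedes it, which is why the draft model can only change the speed and not the text.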

David-AU-github · 2024-12-09T07:42:11Z

@Mushoz

Excellent. thank you.
One last question:

RE:
Importantly, the token that was generated by the main model that proved the draft wrong is NOT discarded, and the main model essentially falls back to the normal non-speculative decoding case.

How long does the model fall back into "normal non-speculative decoding" operation?

Until the next sequence of draft tokens from the draft model?
or is this a hard fall back - like the draft model is ignored from this point forward until generation is complete?

Mushoz · 2024-12-09T07:53:40Z

It doesn't really fall back in the literal sense. What I mean is that the draft tokens that were generated incorrectly are simply ignored as if speculative decoding had never been turned on in the first place. Speculative decoding will remain effective in the sense that the draft model will immediately generate a new draft sequence after getting corrected by the main model and the main model will then again use that sequence to do the validation, just as it had been doing before.

Mushoz · 2024-12-09T07:57:29Z

Note that this only applies to the incorrect draft token itself, and all subsequent tokens (as they are based on an incorrect preceding token). All correct draft tokens before the incorrect one are retained of course. If a draft sequence is 16 tokens long, it's perfectly possible the first 8 tokens are correct (which are retained) and the 9th is incorrect, which means tokens 9 through 16 of the sequence are discarded.
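
The 16-token example can be written out as a small sketch. The function name and the token values are made up for illustration, and greedy sampling is assumed; the server's actual implementation may differ.

```python
from typing import List, Optional, Tuple


def verify_draft(draft_tokens: List[int],
                 main_tokens: List[int]) -> Tuple[List[int], List[int], Optional[int]]:
    """Split a draft into the kept prefix and the discarded tail.

    main_tokens[i] is the main model's own choice at draft position i,
    computed with the draft prefix as context. Entries after the first
    mismatch are based on a wrong prefix and are therefore ignored.
    """
    n_ok = 0
    for d, m in zip(draft_tokens, main_tokens):
        if d != m:
            break
        n_ok += 1
    kept = draft_tokens[:n_ok]
    discarded = draft_tokens[n_ok:]
    # The main model's token at the first mismatch is kept as the correction.
    correction = main_tokens[n_ok] if n_ok < len(main_tokens) else None
    return kept, discarded, correction


# 16 draft tokens; the main model disagrees at the 9th position (index 8).
draft_tokens = list(range(1, 17))
main_tokens = list(draft_tokens)
main_tokens[8] = 99

kept, discarded, correction = verify_draft(draft_tokens, main_tokens)
print(len(kept), len(discarded), correction)
```

Here the first 8 draft tokens are kept, draft tokens 9 through 16 are dropped, and the main model's own 9th token (99 in this made-up example) continues the sequence.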

David-AU-github · 2024-12-11T23:27:17Z

@Mushoz Thank you for your help with this.
I think there is something there - for an off use case - just not sure what.

There is an interesting divergence for creative use with very low bit quants VS mid/high which may benefit or be a benefit (this is separate and apart from spec decoding). Hmmm.

Never mind two different models altogether (with same vocab)... hmmm 2x.

MOEs... raise even more questions.

firelex · 2024-12-18T21:13:41Z

I can't seem to get any performance gains on my Mac. I ordered a brand new M4 Max 128GB to get the most out of it. I'm running a Q4_K_M version of L3.3 70bn as the verification model and have tried a Q4_K_M version of L3.1 8bn, L3.2 3bn and L3.2 1bn as the drafting model, and in no constellation do I get any benefit. We've seen significant benefits with Nvidia hard- and software, but I'd like to see the same on Metal.

I'm using this command on the server:

./build/bin/llama-server -m ../models/verification.gguf -md /Users/mattsinalco/.cache/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct-GGUF/snapshots/a5594fb18df5dfc6b43281423fcce6750cd92de5/Llama-3.2-1B-Instruct-Q4_K_M.gguf -ngl 99 -ngld 90 --port 8033 -c 4096 --draft-min 5 --draft-max 16 --temp 0.0 --draft-p-min 0.5

And this on the client:

"cache_prompt": true,
"top_k": 1,
"samplers": ["top_k"],
"speculative.n_min": 0,
"speculative.n_max": 16
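
For anyone who wants to reproduce this, the client side could look roughly like the sketch below. The sampling fields are the ones listed above. The `/completion` path, the `prompt` and `n_predict` values and the helper names are assumptions added for illustration, and port 8033 is taken from the server command.

```python
import json
import urllib.request


def build_completion_request(prompt: str, n_predict: int = 128) -> dict:
    """Build a request body using the sampling fields from the comment above."""
    return {
        "prompt": prompt,          # assumption: any test prompt
        "n_predict": n_predict,    # assumption: arbitrary output length
        "cache_prompt": True,
        "top_k": 1,
        "samplers": ["top_k"],
        "speculative.n_min": 0,
        "speculative.n_max": 16,
    }


def post_completion(payload: dict,
                    url: str = "http://localhost:8033/completion") -> dict:
    """POST the payload to a running llama-server and return the parsed JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the body; call post_completion(payload) once the server is up.
    payload = build_completion_request("Write a haiku about caches.")
    print(json.dumps(payload, indent=2))
```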

With just the 70bn verification model, I'm getting 8.7 t/s, but with the setup described here I get 7.99, and it doesn't really matter which drafting model I use.

Is there a problem with Metal, @ggerganov ?

firelex · 2024-12-19T09:07:56Z

The missing speedup I described above could be related to this: #10581 (speculative decoding not yielding benefits for quantized models).

ggerganov · 2024-12-19T09:56:06Z

Are you using the latest llama.cpp? You need to have #10581 to get improved performance for Q4_K verification/target models.

This is the config I am using and it works pretty well on M2 Ultra:

./build-chat/bin/llama-server \
    -m  ./models/qwen2.5-32b-coder-instruct/ggml-model-q8_0.gguf \
    -md ./models/qwen2.5-1.5b-coder-instruct/ggml-model-q4_0.gguf \
    --log-file ./service-chat.log \
    --host 0.0.0.0 --port 8013 \
    --ctx-size 0 \
    --cache-reuse 256 \
    -ub 4096 -b 4096 -ngl 99 -ngld 99 -fa -dt 0.1 -lv 1 -t 1 --draft-max 16 --draft-min 5

firelex · 2024-12-20T16:51:09Z

Thanks, @ggerganov. I'm on the latest build.

I use this command:

./build/bin/llama-server \
    -m  ../models/verification.gguf \
    -md ../models/drafting.gguf \
    --port 8033 \
    --ctx-size 4096 \
    --cache-reuse 256 \
    -ub 4096 -b 4096 -ngl 99 -ngld 99 -fa -lv 1 -dt 0.1 -t 1 --draft-max 16 --draft-min 5

And this on the client:

"top_k": 1,
"samplers": ["top_k"]

And the result varies quite a bit - from slower to same to a little faster than just running the Q4 K-M 70bn directly. I copy some of the logging below. I can't read the logs, but from what I can tell, the speculative decoding part is working, but the draft_candidates have a lot of 0s and 1s, which I assume are probabilities - is this how it should be?

slot process_toke: id 0 | task 118 | n_decoded = 1, n_remaining = -1, next token: 5018 '{"'
slot update_slots: id 0 | task 118 | max possible draft: 16
common_speculative_gen_draft: reuse_i = 0, reuse_n = 6, prompt = 3809
common_speculative_gen_draft: n_past = 2104

draft candidate 0, pos 0: 1723 ( 1.000) 'function'
draft candidate 1, pos 0: 2900 ( 0.000) 'func'
draft candidate 2, pos 0: 22124 ( 0.000) 'functions'
draft candidate 0, pos 1: 794 ( 1.000) '":'
draft candidate 1, pos 1: 23118 ( 0.000) '":{"'
draft candidate 2, pos 1: 1 ( 0.000) '"'
draft candidate 0, pos 2: 5324 ( 1.000) ' {"'
draft candidate 1, pos 2: 62853 ( 0.000) ' [{"'
draft candidate 2, pos 2: 5473 ( 0.000) ' {''
draft candidate 0, pos 3: 609 ( 1.000) 'name'
draft candidate 1, pos 3: 14105 ( 0.000) 'parameters'
draft candidate 2, pos 3: 12682 ( 0.000) 'nam'
draft candidate 0, pos 4: 794 ( 1.000) '":'
draft candidate 1, pos 4: 3332 ( 0.000) '":"'
draft candidate 2, pos 4: 1232 ( 0.000) '':'
draft candidate 0, pos 5: 330 ( 1.000) ' "'
draft candidate 1, pos 5: 7492 ( 0.000) ' "",'
draft candidate 2, pos 5: 9177 ( 0.000) ' "_'
draft candidate 0, pos 6: 24396 ( 1.000) 'extract'
draft candidate 1, pos 6: 38458 ( 0.000) '.extract'
draft candidate 2, pos 6: 8819 ( 0.000) ' extract'
draft candidate 0, pos 7: 9351 ( 1.000) '_email'
draft candidate 1, pos 7: 30648 ( 0.000) '_EMAIL'
draft candidate 2, pos 7: 4886 ( 0.000) 'Email'
draft candidate 0, pos 8: 23012 ( 1.000) '_metadata'
draft candidate 1, pos 8: 18103 ( 0.000) 'metadata'
draft candidate 2, pos 8: 97531 ( 0.000) '-metadata'
draft candidate 0, pos 9: 498 ( 1.000) '",'
draft candidate 1, pos 9: 2247 ( 0.000) '","'
draft candidate 2, pos 9: 761 ( 0.000) '",
'
draft candidate 0, pos 10: 330 ( 1.000) ' "'
draft candidate 1, pos 10: 220 ( 0.000) ' '
draft candidate 2, pos 10: 720 ( 0.000) '
'
draft candidate 0, pos 11: 14105 ( 1.000) 'parameters'
draft candidate 1, pos 11: 28427 ( 0.000) '.parameters'
draft candidate 2, pos 11: 3603 ( 0.000) 'params'
draft candidate 0, pos 12: 794 ( 1.000) '":'
draft candidate 1, pos 12: 1232 ( 0.000) '':'
draft candidate 2, pos 12: 13320 ( 0.000) ' ":'
draft candidate 0, pos 13: 5324 ( 1.000) ' {"'
draft candidate 1, pos 13: 314 ( 0.000) ' {'
draft candidate 2, pos 13: 62853 ( 0.000) ' [{"'
draft candidate 0, pos 14: 2386 ( 1.000) 'email'
draft candidate 1, pos 14: 3646 ( 0.000) 'most'
draft candidate 2, pos 14: 14922 ( 0.000) 'eam'
draft candidate 0, pos 15: 6886 ( 1.000) '_address'
draft candidate 1, pos 15: 42733 ( 0.000) '-address'
draft candidate 2, pos 15: 4383 ( 0.000) 'Address'
slot update_slots: id 0 | task 118 | decoding speculative batch, size = 17
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 1723 'function'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 794 '":'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 5324 ' {"'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 609 'name'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 794 '":'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 330 ' "'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 24396 'extract'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 9351 '_email'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 23012 '_metadata'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 498 '",'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 330 ' "'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 14105 'parameters'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 794 '":'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 5324 ' {"'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 2386 'email'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 6886 '_address'
slot process_toke: id 0 | task 118 | n_decoded = 18, n_remaining = -1, next token: 3659 '_of'
slot update_slots: id 0 | task 118 | accepted 16/16 draft tokens, new n_past = 2121
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 119
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 120, front = 0
slot update_slots: id 0 | task 118 | slot decode token, n_ctx = 4096, n_past = 2122, n_cache_tokens = 2122, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
slot process_toke: id 0 | task 118 | n_decoded = 19, n_remaining = -1, next token: 16454 '_the'
slot update_slots: id 0 | task 118 | max possible draft: 16
common_speculative_gen_draft: reuse_i = 0, reuse_n = 2120, prompt = 2120
common_speculative_gen_draft: n_past = 2122
draft candidate 0, pos 0: 24309 ( 1.000) '_person'
draft candidate 1, pos 0: 7508 ( 0.000) ' Person'
draft candidate 2, pos 0: 10909 ( 0.000) 'Person'
draft candidate 0, pos 1: 62 ( 1.000) '_'
draft candidate 1, pos 1: 62139 ( 0.000) '_ul'
draft candidate 2, pos 1: 67343 ( 0.000) 'Ultimately'
draft candidate 0, pos 2: 495 ( 1.000) 'ult'
draft candidate 1, pos 2: 67666 ( 0.000) 'ultimate'
draft candidate 2, pos 2: 7213 ( 0.000) 'ulti'
draft candidate 0, pos 3: 7253 ( 1.000) 'imately'
draft candidate 1, pos 3: 3509 ( 0.000) 'imate'
draft candidate 2, pos 3: 318 ( 0.000) 'im'
draft candidate 0, pos 4: 8052 ( 1.000) '_request'
draft candidate 1, pos 4: 45908 ( 0.000) '-request'
draft candidate 2, pos 4: 14793 ( 0.000) '_REQUEST'
draft candidate 0, pos 5: 287 ( 1.000) 'ing'
draft candidate 1, pos 5: 1753 ( 0.000) 'ING'
draft candidate 2, pos 5: 76019 ( 0.000) 'ingt'
draft candidate 0, pos 6: 16454 ( 1.000) '_the'
draft candidate 1, pos 6: 33188 ( 0.000) ' the'
draft candidate 2, pos 6: 12461 ( 0.000) '_task'
draft candidate 0, pos 7: 12461 ( 1.000) '_task'
draft candidate 1, pos 7: 53679 ( 0.000) '-task'
draft candidate 2, pos 7: 6396 ( 0.000) 'Task'
draft candidate 0, pos 8: 2401 ( 1.000) '_to'
draft candidate 1, pos 8: 32809 ( 0.000) ' to'
draft candidate 2, pos 8: 8820 ( 0.000) '_TO'
draft candidate 0, pos 9: 21960 ( 1.000) '_be'
draft candidate 1, pos 9: 8809 ( 0.000) '.be'
draft candidate 2, pos 9: 387 ( 0.000) ' be'
draft candidate 0, pos 10: 5796 ( 1.000) '_per'
draft candidate 1, pos 10: 15200 ( 0.000) 'Performed'
draft candidate 2, pos 10: 10659 ( 0.000) '_pre'
draft candidate 0, pos 11: 10365 ( 1.000) 'formed'
draft candidate 1, pos 11: 55857 ( 0.000) 'forming'
draft candidate 2, pos 11: 630 ( 0.000) 'form'
draft candidate 0, pos 12: 794 ( 1.000) '":'
draft candidate 1, pos 12: 1232 ( 0.000) '':'
draft candidate 2, pos 12: 38151 ( 0.000) '"):'
draft candidate 0, pos 13: 330 ( 1.000) ' "'
draft candidate 1, pos 13: 7492 ( 0.000) ' "",'
draft candidate 2, pos 13: 9177 ( 0.000) ' "_'
draft candidate 0, pos 14: 329 ( 1.000) 'ad'
draft candidate 1, pos 14: 64 ( 0.000) 'a'
draft candidate 2, pos 14: 67 ( 0.000) 'd'
draft candidate 0, pos 15: 23156 ( 1.000) 'avis'
draft candidate 1, pos 15: 2749 ( 0.000) 'vis'
draft candidate 2, pos 15: 99816 ( 0.000) 'AVIS'
slot update_slots: id 0 | task 118 | decoding speculative batch, size = 17
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 24309 '_person'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 62 '_'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 495 'ult'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 7253 'imately'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 8052 '_request'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 287 'ing'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 16454 '_the'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 12461 '_task'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 2401 '_to'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 21960 '_be'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 5796 '_per'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 10365 'formed'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 794 '":'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 330 ' "'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 329 'ad'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 23156 'avis'
slot process_toke: id 0 | task 118 | n_decoded = 36, n_remaining = -1, next token: 31 '@'
slot update_slots: id 0 | task 118 | accepted 16/16 draft tokens, new n_past = 2139
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 120
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 121, front = 0
slot update_slots: id 0 | task 118 | slot decode token, n_ctx = 4096, n_past = 2140, n_cache_tokens = 2140, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
slot process_toke: id 0 | task 118 | n_decoded = 37, n_remaining = -1, next token: 73216 'bright'
slot update_slots: id 0 | task 118 | max possible draft: 16
common_speculative_gen_draft: reuse_i = 0, reuse_n = 2138, prompt = 2138
common_speculative_gen_draft: n_past = 2140
draft candidate 0, pos 0: 21733 ( 1.000) 'future'
draft candidate 1, pos 0: 25184 ( 0.000) 'Future'
draft candidate 2, pos 0: 60840 ( 0.000) '_future'
draft candidate 0, pos 1: 916 ( 1.000) '.com'
draft candidate 1, pos 1: 6973 ( 0.000) '.co'
draft candidate 2, pos 1: 498 ( 0.000) '",'
draft candidate 0, pos 2: 498 ( 1.000) '",'
draft candidate 1, pos 2: 2247 ( 0.000) '","'
draft candidate 2, pos 2: 3755 ( 0.000) ' ",'
draft candidate 0, pos 3: 330 ( 1.000) ' "'
draft candidate 1, pos 3: 220 ( 0.000) ' '
draft candidate 2, pos 3: 720 ( 0.000) '
'
draft candidate 0, pos 4: 5094 ( 1.000) 'project'
draft candidate 1, pos 4: 20489 ( 0.000) 'reason'
draft candidate 2, pos 4: 3646 ( 0.000) 'most'
draft candidate 0, pos 5: 28441 ( 1.000) '_database'
draft candidate 1, pos 5: 46610 ( 0.000) '_DATABASE'
draft candidate 2, pos 5: 12494 ( 0.000) 'database'
draft candidate 0, pos 6: 8237 ( 1.000) '_ids'
draft candidate 1, pos 6: 12990 ( 0.000) 'Ids'
draft candidate 2, pos 6: 14483 ( 0.000) ' ids'
draft candidate 0, pos 7: 1311 ( 1.000) '_re'
draft candidate 1, pos 7: 74486 ( 0.000) '_refer'
draft candidate 2, pos 7: 1351 ( 0.000) '.re'
draft candidate 0, pos 8: 5671 ( 1.000) 'ferred'
draft candidate 1, pos 8: 809 ( 0.000) 'fer'
draft candidate 2, pos 8: 3018 ( 0.000) 'ffer'
draft candidate 0, pos 9: 2401 ( 1.000) '_to'
draft candidate 1, pos 9: 32809 ( 0.000) ' to'
draft candidate 2, pos 9: 4791 ( 0.000) '-to'
draft candidate 0, pos 10: 1265 ( 1.000) '_in'
draft candidate 1, pos 10: 3502 ( 0.000) '-in'
draft candidate 2, pos 10: 17886 ( 0.000) ' in'
draft candidate 0, pos 11: 9351 ( 1.000) '_email'
draft candidate 1, pos 11: 30648 ( 0.000) '_EMAIL'
draft candidate 2, pos 11: 8463 ( 0.000) ' Email'
draft candidate 0, pos 12: 31683 ( 1.000) '_chain'
draft candidate 1, pos 12: 66286 ( 0.000) '-chain'
draft candidate 2, pos 12: 53141 ( 0.000) '.chain'
draft candidate 0, pos 13: 794 ( 1.000) '":'
draft candidate 1, pos 13: 1232 ( 0.000) '':'
draft candidate 2, pos 13: 1680 ( 0.000) '):'
draft candidate 0, pos 14: 10277 ( 1.000) ' [],'
draft candidate 1, pos 14: 4482 ( 0.000) ' ["'
draft candidate 2, pos 14: 3132 ( 0.000) ' []'
draft candidate 0, pos 15: 330 ( 1.000) ' "'
draft candidate 1, pos 15: 1 ( 0.000) '"'
draft candidate 2, pos 15: 220 ( 0.000) ' '
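
For reading logs like these, the `accepted N/M draft tokens` lines are the quickest indicator of how well the draft model matches the main model. A minimal sketch for tallying them, assuming the log format shown above (the helper name is made up):

```python
import re

# Matches lines such as:
#   slot update_slots: id 0 | task 118 | accepted 16/16 draft tokens, new n_past = 2121
ACCEPTED_RE = re.compile(r"accepted\s+(\d+)/(\d+)\s+draft tokens")


def acceptance_stats(log_text: str):
    """Return (accepted, drafted, acceptance_rate) summed over the whole log."""
    accepted = 0
    drafted = 0
    for match in ACCEPTED_RE.finditer(log_text):
        accepted += int(match.group(1))
        drafted += int(match.group(2))
    rate = accepted / drafted if drafted else 0.0
    return accepted, drafted, rate


sample = """\
slot update_slots: id 0 | task 118 | accepted 16/16 draft tokens, new n_past = 2121
slot update_slots: id 0 | task 118 | accepted 16/16 draft tokens, new n_past = 2139
"""

print(acceptance_stats(sample))  # (32, 32, 1.0)
```

An acceptance rate near 1.0, as in this excerpt, means the drafts themselves are good, so a missing speedup is more likely caused by the cost of drafting and batched verification than by poor predictions.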

firelex · 2024-12-20T16:57:46Z

BTW, @ggerganov: From what I can tell, llama-server doesn't support multi-slot KV caches right now, allowing different prompts to maintain separate caches simultaneously. Is that right? This feature would go a long way to speeding up function calling on Metal. Potentially with disk offloading, although I suspect most agentic apps should get by on a dozen different base prompts. I've seen references to this for llama-cli, but I don't think llama-server supports this. I've asked one of my team members to look into this and make a proposal for adding it.

firelex · 2024-12-20T17:23:08Z

Hmmm... I got the zero/one probabilities when I ran a 4-bit 8bn drafting model that was fine-tuned on the same data as the verification model. If I use a non-finetuned 1bn or 3bn model, the probabilities vary more, but there is still no sustained speed-up (just sometimes). Next step: fine-tune the 1bn model. That should give me more speed than the 8bn, but then hopefully with better probability distributions.

firelex · 2024-12-21T13:53:55Z

Okay, so the multi-slot KV cache question is already addressed (and solved) here: #9135

firelex · 2024-12-25T14:33:55Z

@ggerganov - I've finally got some good news to report. As reported previously, speculative decoding had almost no effect on my M4 Max. I reliably got 8.7 t/s for Llama 3.3 70bn with a 3bn drafting model. But when I started to crank up the -np value, t/s went up significantly. It maxed out at -np 7, when I reliably got 16+ t/s (a doubling of what I had originally, which is amazing). When I went to -np 8, performance cratered and I ended up with 7 t/s.

Cranking up the -np value when I run JUST the 70bn model without a drafting model has no effect, so it's definitely an enabler for speculative decoding, at least on my setup.

Next up: Getting KV caching to work. With no -np parameter, the system prompt gets cached. With the -np parameter, I see no such effect. I will now try to force the server to use pre-populated KV slots. Will report back when I have that working.

firelex · 2024-12-25T17:23:12Z

I'm confused. Shouldn't the number of slots available for KV caching be independent of whatever causes spec. decoding to speed up when I set -np to 6 or 7?

github-actions bot added examples server labels Nov 22, 2024

ggerganov force-pushed the gg/speculative-server branch from 1973399 to 7dc6ae5 Compare November 22, 2024 14:12

ggerganov force-pushed the gg/speculative-server branch 5 times, most recently from c5ddee2 to e80f758 Compare November 24, 2024 15:09

ggerganov marked this pull request as ready for review November 24, 2024 15:11

ggerganov mentioned this pull request Nov 24, 2024

speculative : refactor and add a simpler example #10362

Merged

ggerganov force-pushed the gg/speculative-server branch from e80f758 to d905266 Compare November 24, 2024 19:59

Base automatically changed from gg/speculative-refactor to master November 25, 2024 07:58

server : add speculative decoding support

156aa6d

ggml-ci

ggerganov force-pushed the gg/speculative-server branch from c277c4d to 156aa6d Compare November 25, 2024 08:05

server : add helper function slot.can_speculate()

0ba40c3

ggml-ci

ggerganov merged commit 9ca2e67 into master Nov 25, 2024
62 checks passed

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024

server : add speculative decoding support (ggml-org#10455)

1cb813b

* server : add speculative decoding support ggml-ci * server : add helper function slot.can_speculate() ggml-ci

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024

server : add more information about error (ggml-org#10455)

acbea67

ggerganov mentioned this pull request Jan 12, 2025

Support speculative decoding in server example #5877

Closed

4 tasks

kerthcet mentioned this pull request Jan 16, 2025

Support speculative decoding with llama.cpp InftyAI/llmaz#240

Open

3 tasks

rndmcnlly mentioned this pull request Feb 24, 2025

[Feat]: draft model for speculative decoding a-ghorbani/pocketpal-ai#226

Open
