Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 #3582
Conversation
Great work! |
Could you please clarify if I understand correctly that speculative decoding does not increase throughput, and even decreases it under high load? How can I properly find the optimal load point? |
I wonder if MTP supports bf16? |
FYI you can use these checkpoints for V3 NextN and R1 NextN instead of exporting them yourself. Cheers! https://huggingface.co/SGLang/DeepSeek-V3-NextN |
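For anyone wiring this up, here is a minimal launch sketch using that published checkpoint as the draft model. The paths, the --tp value, the speculative settings, and the assumption that --speculative-draft accepts a Hugging Face repo id the way --model-path does are mine, not from this comment:
```bash
# Illustrative sketch: launch a DeepSeek-V3 server with the published NextN draft checkpoint.
# Paths, --tp, and the speculative settings are assumptions; tune them for your hardware.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 --trust-remote-code \
  --speculative-algo NEXTN \
  --speculative-draft SGLang/DeepSeek-V3-NextN \
  --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4
```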
During handling of the above exception, another exception occurred: Traceback (most recent call last): ... Disable CUDA graph with --disable-cuda-graph. |
@lambert0312 Which bf16 model did you use and what GPU did you use? It seems the checkpoint is not correct. Maybe you can try to convert it with this guide. |
Speculative decoding methods can provide a speedup at small batch sizes but are not designed for high load. However, I think the NextN method can still get a speedup at larger batch sizes, since its higher accept rate lets us use fewer draft steps and draft tokens while keeping good performance.
Maybe you can run the benchmark with different request rates and check the throughput. |
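As a concrete illustration, one way to do that sweep is with sglang.bench_serving against an already-running server; the request rates and prompt count below are placeholder assumptions:
```bash
# Sketch: sweep the request rate against a running SGLang server and compare throughput.
# The specific rates and --num-prompts are placeholders; pick values that match your workload.
for rate in 1 2 4 8 16; do
  python3 -m sglang.bench_serving --backend sglang \
    --num-prompts 200 --request-rate "$rate"
done
```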
I use 4x A800 GPUs and converted the bf16 MTP NextN model. @ispobock |
Hmm, but then #3466 has merged... so maybe there's just some glue missing for the non-CUDA-graph path? |
hmmmmmmmmmmmmmmmm |
|
Hi @lambert0312 , have you fixed this problem on 4*A100 nodes? I met this problem too. |
Not yet, trying @ehuaa |
These two parameters are very confusing to me. What exactly do they mean? Why can't it be as simple as vLLM, which has only one parameter, --num_speculative_tokens, i.e. how many tokens to predict? |
Hello, could MTP be combined with quantization for deployment on a single machine with 8*H20? |
I used the latest code and got an error with 8*H20:
[2025-02-20 10:32:44 TP7] Scheduler hit an exception: Traceback (most recent call last): |
@YosanHo Maybe you need to adjust the --mem-fraction-static parameter. |
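For example (a sketch only; the right value depends on your model and GPUs), lowering --mem-fraction-static leaves more headroom for CUDA graph capture and activations:
```bash
# Sketch: append --mem-fraction-static to the existing launch command; 0.75-0.85 is a reasonable range to try.
python3 -m sglang.launch_server --model-path ./DeepSeek-R1/ --tp 16 --trust-remote-code \
  --speculative-algo NEXTN --speculative-draft ./DeepSeek-V3-NextN/ \
  --mem-fraction-static 0.8
```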
I ran the benchmark provided by @ispobock on 2 nodes of 8*H800, but MTP speculative decoding is much slower than normal decoding. I'm not sure if this is expected.
|
I benchmarked NextN for R1 on 2 nodes of 8*H20 and got up to 200% or more higher throughput. Batch size 1: from 17 t/s to 52 t/s. But it is strange that the speed does not stay high; it keeps dropping slowly. In the beginning it was 500 t/s, and over 2-3 hours it dropped roughly linearly to 150 t/s or less. My start command is |
Hi, may I ask which version of SGLang you use and the accept length in your test? I use 0.4.3.post2; while MTP gives double the speed at bs=1, the speed is almost the same at bs=8. |
0.4.3.post2, same as you. I don't know what the accept length is; you can see all the arguments in my command. |
|
@Zhou-sx Sorry, I just saw the message. I have already started running on 4 A800 nodes. However, our scenario is long-context. Currently, chunked_prefill is turned off in NEXTN mode, so OOM often occurs. |
thanks. |
Did you succeed? |
When I try to run on 2 nodes of 8x H100 using the docker image lmsysorg/sglang:v0.4.3.post2-cu125-srt,
it gets stuck at
If I add --disable-cuda-graph it starts, but output throughput is only 15 tokens/s.
If I run with
it obtains ~30 output tokens/s.
|
Hi, is NextN compatible with bench_one_batch? I tried DeepSeek R1 on 8*H200 with
|
the same problem with 8*H200 |
I ran successfully with mem-fraction-static at 0.87 and modified the source code in model_runner.py (line 280) to skip the validation, but the performance is very poor. |
Hi @lambert0312, how did you fix the problem on the 4*A800 nodes? I'm still stuck here. Is it caused by chunked_prefill?
|
@ehuaa What version are you using? |
I did a benchmark test with bench_serving.py on 2 x 8 x H800, and here is my startup script with MTP:
python -m sglang.launch_server --model-path ./DeepSeek-R1/ --tp 16 --dist-init-addr $IP_PORT --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --speculative-algo NEXTN --speculative-draft ./DeepSeek-V3-NextN/ --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --mem-fraction-static 0.75
A very strange phenomenon is:
1. When isl/osl=1k/1k, the speedup brought by MTP is 1.6X (bs 1) and 1.4X (bs 8)
2. But when isl is increased to 8K, MTP brings almost no speedup at bs 1, and starts to show negative growth at bs 16
Why does MTP become less effective when isl becomes longer? |
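For reference, a hedged sketch of how such an isl/osl comparison could be driven with sglang.bench_serving's random dataset; the flag values below are illustrative assumptions, not the exact benchmark commands used above:
```bash
# Sketch: compare a 1k/1k workload against an 8k/1k workload on the running server.
python3 -m sglang.bench_serving --backend sglang --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 --num-prompts 64

python3 -m sglang.bench_serving --backend sglang --dataset-name random \
  --random-input-len 8192 --random-output-len 1024 --num-prompts 64
```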
Could you share what output tokens per second and latency you got for each of those tests? Many thanks. |
That's pretty interesting. Where did you get the idea to use flashinfer-mla? Shouldn't that be automatic, as shown by "MLA optimization is turned on. Use triton backend." in the logs? |
@pipul My guess is that speculative-num-steps indicates how many times you forward the draft model (each time you select the top-k tree paths from root to leaf and get k new nodes), and num_speculative_tokens represents the number of nodes in the draft tree, according to the EAGLE-2 paper. |
I saw that log, and I think it is telling me the triton backend is used instead of flashinfer 😂. |
Motivation
We implemented NextN (MTP) speculative decoding for DeepSeek-V3/R1 based on EAGLE-2 on the Triton backend (#3466) and achieved a 1.76x speedup, with CUDA Graph and torch.compile compatibility. In the current benchmark, we achieved 77 tokens/s output throughput at batch size 1.
In our implementation, we use only the single MTP module (NextN layer) from the official model checkpoint. We found it can also be used for autoregressive drafting like EAGLE. The accept rate of the MTP module is very high (~1.9 average accept length when drafting 2 tokens, e.g.
--speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
). We tried using it to draft more tokens and achieved a better speedup (2.5~3 average accept length when drafting 4 tokens over 2 steps, e.g.
--speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4
).
Best practices should be further investigated through additional experiments, as predicting more tokens can increase overhead and impact throughput, especially for large batch sizes. A careful trade-off between latency and throughput is necessary to determine the optimal number of speculative tokens.
Benchmark Results
Usage
Option 1: Export NextN weights manually
scripts/export_deepseek_nextn.py
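A minimal usage sketch for that script, assuming it takes --input-dir and --output-dir arguments; check its --help output for the exact flags:
```bash
# Sketch: export the NextN (MTP) layer from a local DeepSeek-V3/R1 checkpoint.
# The flag names here are assumptions; verify with: python3 scripts/export_deepseek_nextn.py --help
python3 scripts/export_deepseek_nextn.py \
  --input-dir /path/to/DeepSeek-V3 \
  --output-dir /path/to/DeepSeek-V3-NextN
```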
Option 2: Use the exported NextN weights directly
Ref: #3582 (comment)
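Either way, a launch sketch with the exported (or downloaded) NextN weights, mirroring the flags used in the benchmarks above; the paths and --tp value are placeholders for your own setup:
```bash
# Sketch: serve DeepSeek-R1 with NextN speculative decoding using a local NextN draft directory.
python3 -m sglang.launch_server --model-path ./DeepSeek-R1/ --tp 16 --trust-remote-code \
  --speculative-algo NEXTN --speculative-draft ./DeepSeek-V3-NextN/ \
  --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4
```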