Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 #3582

Merged
merged 11 commits into sgl-project:main on Feb 14, 2025

Conversation

@ispobock
Collaborator

ispobock commented Feb 14, 2025

Motivation

We implemented NextN (MTP) speculative decoding for DeepSeek-V3/R1 based on EAGLE 2 on the Triton backend (#3466) and achieved a 1.76x speedup with CUDA Graph and torch.compile compatibility. In the current benchmark, we achieved 77 token/s output throughput at batch size 1.

In our implementation, we use only the single MTP module (NextN layer) from the official model checkpoint. We found that it can also be used for autoregressive drafting like EAGLE. The accept rate of the MTP module is very high (~1.9 average accept length when drafting 2 tokens, e.g. --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2). We therefore tried drafting more tokens and achieved a better speedup (2.5~3 average accept length when drafting 4 tokens over 2 steps, e.g. --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4).

Best practices should be further investigated through additional experiments, as predicting more tokens can increase overhead and impact throughput, especially for large batch sizes. A careful trade-off between latency and throughput is necessary to determine the optimal number of speculative tokens.
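One simple way to explore this trade-off is to keep the server running and sweep batch sizes with the one-batch benchmark used below, comparing each speculative configuration against the non-speculative baseline. A minimal sketch (just a loop around the benchmark command from this PR, not a tuned procedure; adjust the URL and lengths to your setup):

# sweep batch sizes against a running server to see where speculation stops paying off
for bs in 1 2 4 8 16 32; do
  python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:30000 --batch-size $bs --input-len 256 --output-len 256
done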

Benchmark Results

# benchmark
python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:30000 --batch-size 1 --input-len 256 --output-len 256

# baseline on main branch
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --trust-remote --tp 8

batch size: 1
latency: 6.70 s
output throughput: 38.19 token/s
(input + output) throughput: 76.39 token/s

# w/ nextn speculative decoding
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft /sgl-workspace/DeepSeek-V3-nextn --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --trust-remote --tp 8

batch size: 1
latency: 3.77 s
output throughput: 67.93 token/s
(input + output) throughput: 135.87 token/s

# w/ nextn speculative decoding + Torch.compile
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft /sgl-workspace/DeepSeek-V3-nextn --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --tp 8 --enable-torch-compile --torch-compile-max-bs 1

batch size: 1
latency: 3.29 s
output throughput: 77.73 token/s
(input + output) throughput: 155.45 token/s

Usage

Option 1: Export NextN weights manually

  1. Export the weights of the NextN layer with the script scripts/export_deepseek_nextn.py:
python3 export_deepseek_nextn.py --input-dir /path/to/DeepSeek-V3 --output-dir /path/to/DeepSeek-V3-NextN
  2. Use the NextN layer as the draft model and launch the server:
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft /path/to/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --tp 8

Option 2: Use the exported NextN weights directly

Ref: #3582 (comment)

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft SGLang/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --tp 8
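Once the server is up, a quick sanity check (a minimal example, assuming the default port 30000 as in the benchmark command above and SGLang's native /generate endpoint):

curl -s http://127.0.0.1:30000/generate -H "Content-Type: application/json" -d '{"text": "The capital of France is", "sampling_params": {"temperature": 0, "max_new_tokens": 32}}'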

@zhyncs zhyncs merged commit 862dd76 into sgl-project:main Feb 14, 2025
2 of 17 checks passed
@freeliuzc

Great work!
Regarding the tokens for Draft 1, what is the average accepted length?
Thanks

@Swipe4057

Could you please clarify if I understand correctly that speculative decoding does not increase throughput, and even decreases it under high load? How can I properly find the optimal load point?

@lambert0312

I wonder if MTP supports bf16?

chongli-uw pushed a commit to chongli-uw/sglang that referenced this pull request Feb 15, 2025
@zhyncs
Member

zhyncs commented Feb 15, 2025

FYI you can use these checkpoints for V3 NextN and R1 NextN instead of exporting them yourself. Cheers!

https://huggingface.co/SGLang/DeepSeek-V3-NextN
https://huggingface.co/SGLang/DeepSeek-R1-NextN

@lambert0312

  1. I use the bf16 model and export the weights of the NextN layer
    python3 export_deepseek_nextn.py --input-dir /path/to/DeepSeek-V3-bf16 --output-dir /path/to/DeepSeek-V3-NextN-bf16

  2. Use the nextn layer as draft model and launch the server
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-bf16 --speculative-algo NEXTN --speculative-draft /path/to/DeepSeek-V3-NextN-bf16 --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --tp 8

  3. The log is as follows:
    [2025-02-15 06:17:26 TP3] Scheduler hit an exception: Traceback (most recent call last):
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 80, in init
    self.capture()
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 101, in capture
    CudaGraphRunner.capture(self)
    File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 304, in capture
    ) = self.capture_one_batch_size(bs, forward)
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 164, in capture_one_batch_size
    run_once()
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 154, in run_once
    ret = self.eagle_worker.draft_forward(forward_batch)
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 260, in draft_forward
    logits_output = self.model_runner.model.forward(
    File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_nextn.py", line 140, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_nextn.py", line 96, in forward
    hidden_states, residual = self.decoder(
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 770, in forward
    hidden_states = self.self_attn(
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 528, in forward
    return self.forward_absorb(positions, hidden_states, forward_batch)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 620, in forward_absorb
    attn_output = self.attn_mqa(q_input, k_input, v_input, forward_batch)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 67, in forward
    return forward_batch.attn_backend.forward(
    File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/init.py", line 67, in forward
    return self.forward_decode(q, k, v, layer, forward_batch, save_kv_cache)
    File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 441, in forward_decode
    forward_batch.token_to_kv_pool.set_kv_buffer(
    File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 288, in set_kv_buffer
    self.k_buffer[layer_id][loc] = cache_k
    RuntimeError: shape mismatch: value tensor of shape [4, 1, 576] cannot be broadcast to indexing result of shape [4, 4, 56]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 252, in init
self.draft_worker = EAGLEWorker(
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 99, in init
self.init_cuda_graphs()
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 110, in init_cuda_graphs
self.cuda_graph_runner = EAGLEDraftCudaGraphRunner(self)
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 82, in init
raise Exception(
Exception: Capture cuda graph failed: shape mismatch: value tensor of shape [4, 1, 576] cannot be broadcast to indexing result of shape [4, 4, 56]
Possible solutions:

disable cuda graph by --disable-cuda-graph
set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
disable torch compile by not using --enable-torch-compile
specify --dtype to the same dtype (e.g. bfloat16)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

@ispobock
Collaborator Author

@lambert0312 Which bf16 model did you use and what GPU did you use? It seems the checkpoint is not correct. Maybe you can try to convert it with this guide.

@ispobock
Collaborator Author

Regarding the tokens for Draft 1, what is the average accepted length?

Currently, --speculative-num-steps must be at least 2. We will support a single draft step in a following update. I think the accept length can match the result in the paper.
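For now, the closest thing to single-token drafting is the chain configuration mentioned in the PR description (topk 1, 2 draft tokens), e.g.:

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft /path/to/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 --disable-radix --tp 8

which measured ~1.9 average accept length in the numbers above.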

@ispobock
Collaborator Author

Could you please clarify if I understand correctly that speculative decoding does not increase throughput, and even decreases it under high load?

Speculative decoding methods can provide a speedup at small batch sizes but are not designed for high load. However, I think the NextN method can still get a speedup at larger batch sizes, since its higher accept rate lets us use fewer draft steps and draft tokens to get good performance.

How can I properly find the optimal load point?

Maybe you can run the benchmark with different request rates and check the throughput.
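A rough sketch of such a sweep (assuming sglang.bench_serving with its default dataset; flag names may differ slightly between versions):

for rate in 1 2 4 8 16; do
  python3 -m sglang.bench_serving --backend sglang --num-prompts 200 --request-rate $rate
done

Then compare the reported output throughput (and the accept length printed in the server logs) across request rates.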

@lambert0312

lambert0312 commented Feb 16, 2025

@lambert0312 Which bf16 model did you use and what GPU did you use? It seems the checkpoint is not correct. Maybe you can try to convert it with this guide.

I use 4x A800 GPUs and converted the bf16 MTP NextN model. @ispobock

@tot0

tot0 commented Feb 18, 2025

Upon setting --mem-fraction-static to 0.7:

^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_utils.py", line 194, in create
    build_tree_kernel(
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/build_eagle_tree.py", line 149, in build_tree_kernel
    sgl_build_tree_kernel(
    ^^^^^^^^^^^^^^^^^^^^^
NameError: name 'sgl_build_tree_kernel' is not defined. Did you mean: 'build_tree_kernel'?

[2025-02-18 14:31:01] Received sigquit from a child proces. It usually means the child failed.
[2025-02-18 14:31:01 TP3] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1827, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 478, in event_loop_normal
    result = self.run_batch(batch)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1089, in run_batch
    ) = self.draft_worker.forward_batch_speculative_generation(batch)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 116, in forward_batch_speculative_generation
    spec_info: EagleVerifyInput = self.draft(batch)
                                  ^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 199, in draft
    ret = EagleVerifyInput.create(
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_utils.py", line 194, in create
    build_tree_kernel(
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/build_eagle_tree.py", line 149, in build_tree_kernel
    sgl_build_tree_kernel(
    ^^^^^^^^^^^^^^^^^^^^^
NameError: name 'sgl_build_tree_kernel' is not defined. Did you mean: 'build_tree_kernel'?

[2025-02-18 14:31:01] Received sigquit from a child proces. It usually means the child failed.
Killed
root@aaaa5ace6a62:/sgl-workspace#

It would appear that the EAGLE tree kernels have only been implemented for CUDA natively in sgl-kernel, not via triton, so ROCm based GPUs aren't supported yet. Seems like this is tracked by #2940

Hmm, but then #3466 has merged... so maybe there's just some glue missing for the non-CUDA path?

@ShivamB25

hmmmmmmmmmmmmmmmm

@tot0

tot0 commented Feb 18, 2025

#3670

@ehuaa

ehuaa commented Feb 19, 2025

When I use the v0.4.3.post1-cu124 version of the Docker image, I use the following commands to start 4 A800 nodes:

python3 -m sglang.launch_server --model-path deepseek-a/DeepSeek-V3-bf16 --dtype bfloat16 --trust-remote-code --host 0.0.0.0 --port 30000 --grammar-backend xgrammar --mem-fraction-static 0.85 --context-length 131072 --max-running-requests 128 --disable-overlap --speculative-algo NEXTN --speculative-draft deepseek-a/DeepSeek-V3-NextN-bf16 --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --tp 32 --dist-init-addr x.x.x.x:20000 --nnodes 4 --node-rank 0

python3 -m sglang.launch_server --model-path deepseek-a/DeepSeek-V3-bf16 --dtype bfloat16 --trust-remote-code --host 0.0.0.0 --port 30000 --grammar-backend xgrammar --mem-fraction-static 0.85 --context-length 131072 --max-running-requests 128 --disable-overlap --speculative-algo NEXTN --speculative-draft deepseek-a/DeepSeek-V3-NextN-bf16 --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --tp 32 --dist-init-addr x.x.x.x:20000 --nnodes 4 --node-rank 1

python3 -m sglang.launch_server --model-path deepseek-a/DeepSeek-V3-bf16 --dtype bfloat16 --trust-remote-code --host 0.0.0.0 --port 30000 --grammar-backend xgrammar --mem-fraction-static 0.85 --context-length 131072 --max-running-requests 128 --disable-overlap --speculative-algo NEXTN --speculative-draft deepseek-a/DeepSeek-V3-NextN-bf16 --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --tp 32 --dist-init-addr x.x.x.x:20000 --nnodes 4 --node-rank 2

python3 -m sglang.launch_server --model-path deepseek-a/DeepSeek-V3-bf16 --dtype bfloat16 --trust-remote-code --host 0.0.0.0 --port 30000 --grammar-backend xgrammar --mem-fraction-static 0.85 --context-length 131072 --max-running-requests 128 --disable-overlap --speculative-algo NEXTN --speculative-draft deepseek-a/DeepSeek-V3-NextN-bf16 --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --tp 32 --dist-init-addr x.x.x.x:20000 --nnodes 4 --node-rank 3

The following error occurs when running batch verification:

ERROR: invalid eagle tree!!! Detected a token with no parent token selected. Check the logprob. The token will be dropped. (this message is repeated many times)

[rank1]:[E218 05:25:48.030937973 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd4756b9446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd4756636e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd4757a5a18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fd42b625726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fd42b62a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fd42b631b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fd42b63361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #7: + 0x145c0 (0x7fd4772a55c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so) frame #8: + 0x94ac3 (0x7fd47812aac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #9: clone + 0x44 (0x7fd4781bba04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

, terminate called after throwing an instance of 'markupsafe._speedupsc10::DistBackendError' , PIL._imagingterminate called after throwing an instance of 'c10::DistBackendError' , PIL._imagingft, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, setproctitle[rank3]:[E218 05:25:48.031546640 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f36f1d6c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f36f1d166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f36f214da18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f36a7c25726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f36a7c2a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f36a7c31b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f36a7c3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #7: + 0x145c0 (0x7f36f39465c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so) frame #8: + 0x94ac3 (0x7f36f47cbac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #9: clone + 0x44 (0x7f36f485ca04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

, zmq.backend.cython._zmqterminate called after throwing an instance of ', regex._regexc10::DistBackendError' , , cuda_utilsmsgspec._core, __triton_launcher, yaml._yaml, multidict._multidict (total: 52) , yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, markupsafe._speedups, PIL._imaging what(): [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9038f6c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f9038f166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f903939ca18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f8feee25726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f8feee2a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f8feee31b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8feee3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #7: + 0x145c0 (0x7f903ab955c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so) frame #8: + 0x94ac3 (0x7f903ba1aac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #9: clone + 0x44 (0x7f903baaba04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9038f6c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7f8feeaa071b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7f903ab955c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7f903ba1aac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #4: clone + 0x44 (0x7f903baaba04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

, PIL._imagingft, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, msgpack._cmsgpack, google._upb._message, ray._raylet what(): [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff0ed6b9446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7ff0ed6636e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7ff0ed7a5a18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7ff0a3625726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7ff0a362a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7ff0a3631b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ff0a363361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #7: + 0x145c0 (0x7ff0ef2f75c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so) frame #8: + 0x94ac3 (0x7ff0f017cac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #9: clone + 0x44 (0x7ff0f020da04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff0ed6b9446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7ff0a32a071b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7ff0ef2f75c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7ff0f017cac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #4: clone + 0x44 (0x7ff0f020da04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007fd83bfff640 (most recent call first): File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread File "/usr/lib/python3.10/threading.py", line 953 in run File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007ff0f00e74c0 (most recent call first): File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/init.py", line 146 in patched_isinstance File "/usr/lib/python3.10/inspect.py", line 288 in isfunction File "/usr/lib/python3.10/inspect.py", line 299 in _has_code_flag File "/usr/lib/python3.10/inspect.py", line 321 in isasyncgenfunction File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 540 in _inject_tracing_into_class File "/usr/local/lib/python3.10/dist-packages/ray/actor.py", line 1742 in _make_actor File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 3181 in _make_remote File "/usr/local/lib/python3.10/dist-packages/ray/experimental/channel/cpu_communicator.py", line 17 in File "", line 241 in _call_with_frames_removed File "", line 883 in exec_module File "", line 688 in _load_unlocked File "", line 1006 in _find_and_load_unlocked File "", line 1027 in _find_and_load File "/usr/local/lib/python3.10/dist-packages/ray/experimental/channel/torch_tensor_nccl_channel.py", line 13 in File "", line 241 in _call_with_frames_removed File "", line 883 in exec_module File "", line 688 in _load_unlocked File "", line 1006 in , _find_and_load_unlockedsentencepiece._sentencepiece File "", line 1027 in _find_and_load File "/usr/local/lib/python3.10/dist-packages/ray/experimental/channel/init.py", line 21 in File "", line 241 in _call_with_frames_removed File "", line 883 in exec_module File "", line 688 in _load_unlocked File "", line 1006 in _find_and_load_unlocked File "", line 1027 in _find_and_load File "", line 241 in _call_with_frames_removed File "", line 992 in _find_and_load_unlocked File "", line 1027 in _find_and_load File "/usr/local/lib/python3.10/dist-packages/ray/dag/dag_node.py", line 2 in File "", line 241 in _call_with_frames_removed File "", line 883 in exec_module File "", line 688 in _load_unlocked File "", line 1006 in _find_and_load_unlocked File "", line 1027 in _find_and_load File "/usr/local/lib/python3.10/dist-packages/ray/dag/init.py", line 1 in File "", line 241 in _call_with_frames_removed File " what(): "[PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f36f1d6c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f36f1d166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f36f214da18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f36a7c25726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f36a7c2a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f36a7c31b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f36a7c3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #7: + 0x145c0 (0x7f36f39465c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so) frame #8: + 0x94ac3 (0x7f36f47cbac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #9: clone + 0x44 (0x7f36f485ca04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f36f1d6c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7f36a78a071b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7f36f39465c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7f36f47cbac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #4: clone + 0x44 (0x7f36f485ca04 in /usr/lib/x86_64-linux-gnu/libc.so.6) , line 883 in exec_module File "", line 688 in _load_unlocked File "", line 1006 in _find_and_load_unlocked File "", line 1027 in _find_and_load File "", line 241 in _call_with_frames_removed File "", line 992 in _find_and_load_unlocked File "", line 1027 in _find_and_load File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 1903 in shutdown File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103 in wrapper , regex._regex, cuda_utils, __triton_launcher, msgspec._core (total: 52)

Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, uvloop.loop, yarl._quoting_c, propcache._helpers_c, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, Killed

Is there something wrong with my configuration? @zhyncs @ispobock

Hi @lambert0312 , have you fixed this problem on 4*A100 nodes? I met this problem too.

@lambert0312

lambert0312 commented Feb 19, 2025

Hi @lambert0312 , have you fixed this problem on 4*A100 nodes? I met this problem too.

Not yet, trying @ehuaa

@pipul

pipul commented Feb 19, 2025

       parser.add_argument(
            "--speculative-num-steps",
            type=int,
            help="The number of steps sampled from draft model in Speculative Decoding.",
            default=ServerArgs.speculative_num_steps,
        )
        parser.add_argument(
            "--speculative-num-draft-tokens",
            type=int,
            help="The number of token sampled from draft model in Speculative Decoding.",
            default=ServerArgs.speculative_num_draft_tokens,
        )

These two parameters confuse me. What are their specific meanings? Why can't it be as simple as vLLM, which has only one parameter, --num_speculative_tokens, specifying how many tokens to predict?

@liweiqing1997

Hello, could MTP be combined with quantization for deployment on a single machine with 8*H20?

@YosanHo

YosanHo commented Feb 20, 2025

I use the latest code and an error occurs with 8*H20:

python -m sglang.launch_server --model-path /opt/model/DeepSeek-R1 --trust-remote-code --served-model-name deepseek-r1 --enable-metrics --speculative-algo NEXTN --speculative-draft /opt/model/DeepSeek-R1-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --mem-fraction-static 0.9 --tp 8

[2025-02-20 10:32:44 TP7] Scheduler hit an exception: Traceback (most recent call last):
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 252, in init
self.draft_worker = EAGLEWorker(
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/speculative/eagle_worker.py", line 47, in init
super().init(
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 68, in init
self.model_runner = ModelRunner(
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 187, in init
min_per_gpu_memory = self.init_torch_distributed()
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 280, in init_torch_distributed
raise ValueError(
ValueError: The memory capacity is unbalanced. Some GPUs may be occupied by other processes.

@lambert0312

mem-fraction-static

@YosanHo Maybe you need to adjust the mem-fraction-static parameter
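For example (just a possible starting point to try, not a verified setting), lowering it from 0.9 in your command:

python -m sglang.launch_server --model-path /opt/model/DeepSeek-R1 --trust-remote-code --served-model-name deepseek-r1 --enable-metrics --speculative-algo NEXTN --speculative-draft /opt/model/DeepSeek-R1-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --mem-fraction-static 0.8 --tp 8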

@cermeng
Contributor

cermeng commented Feb 20, 2025

I ran the benchmark provided by @ispobock on 2 nodes of 8*H800, but MTP speculative decoding is much slower than normal decoding. I'm not sure if this is expected.

# mtp
batch size: 1
latency: 13.54 s
output throughput: 18.91 token/s
(input + output) throughput: 37.82 token/s

# w/o mtp(normal)
batch size: 1
latency: 8.53 s
output throughput: 30.02 token/s
(input + output) throughput: 60.04 token/s

@caseylai

I benchmarked NextN on 2 nodes of 8*H20 for R1 and got up to 200% or more higher throughput.

batch size 1: from 17 t/s to 52 t/s
batch size 30: from 160 t/s to 500 t/s

But it is strange that the speed does not stay high; it drops slowly. In the beginning it was 500 t/s, and over 2-3 hours it dropped roughly linearly to 150 t/s or less.

my start command is
python3 -m sglang.launch_server \ --model-path /mnt/disk01/model/deepseek/DeepSeek-R1 \ --host 0.0.0.0 \ --port 8000 \ --tp 16 \ --nccl-init $master_node:7749 --nnodes 2 --node-rank $node_rank \ --trust-remote-code \ --enable-torch-compile \ --torch-compile-max-bs 8 \ --speculative-algo NEXTN \ --speculative-draft /mnt/disk01/model/deepseek/DeepSeek-R1-NextN \ --speculative-num-steps 2 \ --speculative-eagle-topk 4 \ --speculative-num-draft-tokens 4 \ --disable-radix

@lishicheng1996

I benchmarked NextN on 2 nodes of 8 * H20 for R1, got up to 200% or more larger throughput.

batch size 1: from 17 t/s to 52 t/s batch size 30: from 160 t/s to 500 t/s

But, it is strange that the speed can not keep high steady, which is dropping slowly. In the beginning it was 500 t/s, for 2-3 hours, it dropped to 150 t/s or less by linear.

my start command is python3 -m sglang.launch_server \ --model-path /mnt/disk01/model/deepseek/DeepSeek-R1 \ --host 0.0.0.0 \ --port 8000 \ --tp 16 \ --nccl-init $master_node:7749 --nnodes 2 --node-rank $node_rank \ --trust-remote-code \ --enable-torch-compile \ --torch-compile-max-bs 8 \ --speculative-algo NEXTN \ --speculative-draft /mnt/disk01/model/deepseek/DeepSeek-R1-NextN \ --speculative-num-steps 2 \ --speculative-eagle-topk 4 \ --speculative-num-draft-tokens 4 \ --disable-radix

Hi, may I ask which version of SGLang you use and the accept length in your test? I use 0.4.3.post2; while MTP has double the speed at bs=1, the speed is almost the same at bs=8.

@caseylai

I benchmarked NextN on 2 nodes of 8 * H20 for R1, got up to 200% or more larger throughput.
batch size 1: from 17 t/s to 52 t/s batch size 30: from 160 t/s to 500 t/s
But, it is strange that the speed can not keep high steady, which is dropping slowly. In the beginning it was 500 t/s, for 2-3 hours, it dropped to 150 t/s or less by linear.
my start command is python3 -m sglang.launch_server \ --model-path /mnt/disk01/model/deepseek/DeepSeek-R1 \ --host 0.0.0.0 \ --port 8000 \ --tp 16 \ --nccl-init $master_node:7749 --nnodes 2 --node-rank $node_rank \ --trust-remote-code \ --enable-torch-compile \ --torch-compile-max-bs 8 \ --speculative-algo NEXTN \ --speculative-draft /mnt/disk01/model/deepseek/DeepSeek-R1-NextN \ --speculative-num-steps 2 \ --speculative-eagle-topk 4 \ --speculative-num-draft-tokens 4 \ --disable-radix

Hi, may I ask which version of SGLang you use and the accept length in your test? I use 0.4.3.post2; while MTP has double the speed at bs=1, the speed is almost the same at bs=8.

0.4.3.post2, same as you. I don't know what accept length is; you can see all arguments in my command.

@lishicheng1996

I benchmarked NextN on 2 nodes of 8 * H20 for R1, got up to 200% or more larger throughput.
batch size 1: from 17 t/s to 52 t/s batch size 30: from 160 t/s to 500 t/s
But, it is strange that the speed can not keep high steady, which is dropping slowly. In the beginning it was 500 t/s, for 2-3 hours, it dropped to 150 t/s or less by linear.
my start command is python3 -m sglang.launch_server \ --model-path /mnt/disk01/model/deepseek/DeepSeek-R1 \ --host 0.0.0.0 \ --port 8000 \ --tp 16 \ --nccl-init $master_node:7749 --nnodes 2 --node-rank $node_rank \ --trust-remote-code \ --enable-torch-compile \ --torch-compile-max-bs 8 \ --speculative-algo NEXTN \ --speculative-draft /mnt/disk01/model/deepseek/DeepSeek-R1-NextN \ --speculative-num-steps 2 \ --speculative-eagle-topk 4 \ --speculative-num-draft-tokens 4 \ --disable-radix

Hi, may I ask which version of SGLang you use and the accept length in your test? I use 0.4.3.post2; while MTP has double the speed at bs=1, the speed is almost the same at bs=8.

0.4.3.post2, same as you. I don't know what accept length is; you can see all arguments in my command.

Thanks very much for your reply! We can see the accept length in the SGLang log. It's the number of accepted tokens among the draft tokens, and it determines the speed gain of MTP. In my test the accept length is about 2.3.
(screenshot of the SGLang decode log showing the accept length)
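As a rough rule of thumb (a back-of-the-envelope estimate that ignores overhead): each verify step emits about accept-length tokens instead of 1, so an accept length of ~2.3 puts the ceiling at roughly 2.3x plain decoding, and the cost of the extra draft forward passes then eats into that.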

@Zhou-sx

Zhou-sx commented Feb 21, 2025

@lambert0312 Did you succeed? I'm trying to deploy on 8*H20, too.

@lambert0312

Did you succeed? I'm trying to deploy on 8*H20, too.

@Zhou-sx Sorry, I just saw the message. I have already started running on 4 A800 nodes. However, our scenario is a long context. Currently, chunked_prefill is turned off in NEXTN mode, so OOM often occurs.

@Zhou-sx

Zhou-sx commented Feb 21, 2025

Did you succeed? I'm trying to deploy on 8*H20, too.

@Zhou-sx Sorry, I just saw the message. I have already started running on 4 A800 nodes. However, our scenario is a long context. Currently, chunked_prefill is turned off in NEXTN mode, so OOM often occurs.

thanks.

@Zhou-sx

Zhou-sx commented Feb 21, 2025

I use the latest code and an error occurs with 8*H20:

python -m sglang.launch_server --model-path /opt/model/DeepSeek-R1 --trust-remote-code --served-model-name deepseek-r1 --enable-metrics --speculative-algo NEXTN --speculative-draft /opt/model/DeepSeek-R1-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --mem-fraction-static 0.9 --tp 8

[2025-02-20 10:32:44 TP7] Scheduler hit an exception: Traceback (most recent call last): File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank) File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 252, in init self.draft_worker = EAGLEWorker( File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/speculative/eagle_worker.py", line 47, in init super().init( File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 68, in init self.model_runner = ModelRunner( File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 187, in init min_per_gpu_memory = self.init_torch_distributed() File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 280, in init_torch_distributed raise ValueError( ValueError: The memory capacity is unbalanced. Some GPUs may be occupied by other processes.

Did you succeed?

@victorserbu2709

When I try to run on 2 nodes of 8x H100 using the Docker image lmsysorg/sglang:v0.4.3.post2-cu125-srt

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --tp 16 --dist-init-addr 172.16.1.68:5000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0  --enable-cache-report --enable-metrics --watchdog-timeout=3000 --speculative-algo NEXTN --speculative-draft SGLang/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix 

it gets stuck at

0%| | 0/34 [00:00<?, ?it/s][2025-02-21 12:51:48 TP6] Capture cuda graph begin. This can take up to several minutes.

If I add --disable-cuda-graph it starts, but the output throughput is only 15 token/s:

[2025-02-21 13:11:34 TP0] Decode batch. #running-req: 1, #token: 1435, token usage: 0.00, accept len: 2.15, gen throughput (token/s): 14.30, #queue-req: 0
[2025-02-21 13:11:40 TP0] Decode batch. #running-req: 1, #token: 1525, token usage: 0.01, accept len: 2.25, gen throughput (token/s): 15.06, #queue-req: 0

If I run with

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --tp 16 --dist-init-addr 172.16.1.68:5000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0  --enable-cache-report --enable-metrics  --enable-flashinfer-mla  --watchdog-timeout=3000

it obtains ~30 output tokens/s

[2025-02-21 13:34:54 TP0] Decode batch. #running-req: 1, #token: 184, token usage: 0.00, gen throughput (token/s): 29.53, #queue-req: 0

@yuqie

yuqie commented Feb 22, 2025

Hi, is NextN compatible with bench_one_batch? I tried DeepSeek R1 on 8*H200 with python3 -m sglang.bench_one_batch --trust-remote-code --run-name DeepSeekR1 --model-path /mnt/model/ --batch-size 2 --speculative-algo NEXTN --speculative-draft /mnt/huggingface/DeepSeek-R1-NextN/ --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --input-len 1000 --output-len 1 --tensor-parallel-size 8 --disable-radix and encountered the “tensor size does not match” error as follows:

max_total_num_tokens=480079
Warmup ...
[2025-02-22 01:58:01 TP2] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-02-22 01:58:01 TP7] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-02-22 01:58:01 TP6] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-02-22 01:58:01 TP3] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-02-22 01:58:01 TP4] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-02-22 01:58:01 TP1] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-02-22 01:58:01 TP0] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-02-22 01:58:01 TP5] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
Prefill. latency: 8.30952 s, throughput:    240.69 token/s
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 432, in latency_test
    latency_test_run_once(
  File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 370, in latency_test_run_once
    next_token_ids, _ = decode(next_token_ids, batch, model_runner)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 254, in decode
    logits_output = model_runner.forward(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 791, in forward
    return self.cuda_graph_runner.replay(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 423, in replay
    self.input_ids[:raw_num_token].copy_(forward_batch.input_ids)
RuntimeError: The size of tensor a (8) must match the size of tensor b (2) at non-singleton dimension 0

@jifa513

jifa513 commented Feb 22, 2025

I use the latest code and an error occurs with 8*H20:

python -m sglang.launch_server --model-path /opt/model/DeepSeek-R1 --trust-remote-code --served-model-name deepseek-r1 --enable-metrics --speculative-algo NEXTN --speculative-draft /opt/model/DeepSeek-R1-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --mem-fraction-static 0.9 --tp 8

[2025-02-20 10:32:44 TP7] Scheduler hit an exception: Traceback (most recent call last): File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank) File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 252, in init self.draft_worker = EAGLEWorker( File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/speculative/eagle_worker.py", line 47, in init super().init( File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 68, in init self.model_runner = ModelRunner( File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 187, in init min_per_gpu_memory = self.init_torch_distributed() File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 280, in init_torch_distributed raise ValueError( ValueError: The memory capacity is unbalanced. Some GPUs may be occupied by other processes.

Did you succeed?

The same problem with 8*H200.

@YosanHo

YosanHo commented Feb 22, 2025

mem-fraction-static

@YosanHo Maybe you need to adjust the mem-fraction-static parameter

I ran successfully with --mem-fraction-static at 0.87 and modified the source code in model_runner.py (line 280) to skip the validation, but the performance is very poor.

@ehuaa

ehuaa commented Feb 23, 2025

Hi @lambert0312 , have you fixed this problem on 4*A100 nodes? I met this problem too.

Not yet, trying @ehuaa

Did you succeed? I'm trying to deploy on 8*H20, too.

@Zhou-sx Sorry, I just saw the message. I have already started running on 4 A800 nodes. However, our scenario is a long context. Currently, chunked_prefill is turned off in NEXTN mode, so OOM often occurs.

Hi @lambert0312, how did you fix the problem on 4*A800 nodes? I'm still stuck here. Is it caused by chunked_prefill?

@Zhou-sx

Zhou-sx commented Feb 24, 2025

mem-fraction-static

@YosanHo Maybe you need to adjust the mem-fraction-static parameter

I ran successfully with --mem-fraction-static at 0.87 and modified the source code in model_runner.py (line 280) to skip the validation, but the performance is very poor.

Why can modifying mem-fraction-static solve the problem of unbalanced memory capacity?

@lambert0312

Hi @lambert0312, how did you fix the problem on 4*A800 nodes? I'm still stuck here. Is it caused by chunked_prefill?

@ehuaa What version are you using?

@kimlee1874

kimlee1874 commented Feb 26, 2025

I did a benchmark test with bench_serving.py on 2 x 8 x H800, and here is my startup script with MTP (0.4.3.post2):
python -m sglang.launch_server --model-path ./DeepSeek-R1/ --tp 16 --dist-init-addr $IP_PORT --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --speculative-algo NEXTN --speculative-draft ./DeepSeek-R1-NextN/ --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --mem-fraction-static 0.75

A very strange phenomenon is:

  1. When the isl/osl=1k/1k, the speed increase brought by MTP is 1.6X (bs 1) and 1.4X (bs 8)
  2. But when isl is increased to 8K, MTP has almost no speed increase from bs 1, and starts to show negative growth at bs 16

Why does MTP become less effective when isl becomes longer?

@RonanKMcGovern

RonanKMcGovern commented Feb 26, 2025

--nnodes 2

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --tp 16 --dist-init-addr 172.16.1.68:5000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0  --enable-cache-report --enable-metrics  --enable-flashinfer-mla  --watchdog-timeout=3000

it obtains ~30 output tokens/s

That's pretty interesting. Where did you get the idea to use flashinfer-mla? Shouldn't that be automatic, as shown by "MLA optimization is turned on. Use triton backend." in the logs?

@jokerwyt

jokerwyt commented Feb 27, 2025

       parser.add_argument(
            "--speculative-num-steps",
            type=int,
            help="The number of steps sampled from draft model in Speculative Decoding.",
            default=ServerArgs.speculative_num_steps,
        )
        parser.add_argument(
            "--speculative-num-draft-tokens",
            type=int,
            help="The number of token sampled from draft model in Speculative Decoding.",
            default=ServerArgs.speculative_num_draft_tokens,
        )

These two parameters confuse me. What are their specific meanings? Why can't it be as simple as vLLM, which has only one parameter, --num_speculative_tokens, specifying how many tokens to predict?

@pipul My guess is that speculative-num-steps indicates how many times you run the draft model forward (each time you select the top-k tree paths from root to leaf and get k new nodes), and speculative-num-draft-tokens represents the number of nodes in the draft tree, according to the EAGLE-2 paper.
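Concretely, with the settings used in this PR (--speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4), my reading is: the NextN module runs 2 draft passes, expanding the top-4 candidates at each pass, and the best 4 draft tokens overall form the tree that is sent to the target model for verification. This is only my interpretation of the EAGLE-2 scheme, not something I have verified in the code.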

@jokerwyt

--nnodes 2

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --tp 16 --dist-init-addr 172.16.1.68:5000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0  --enable-cache-report --enable-metrics  --enable-flashinfer-mla  --watchdog-timeout=3000

it obtains ~30 output tokens/s

That's pretty interesting, where did you get that idea from to use flashinfer-mla? Shouldn't that be automatic as shown by "MLA optimization is turned on. Use triton backend." in the logs?

I saw that log, and I think it is telling me the Triton backend is used instead of FlashInfer 😂.
