Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 #3582
Conversation
Great work! |
Could you please clarify if I understand correctly that speculative decoding does not increase throughput, and even decreases it under high load? How can I properly find the optimal load point? |
I wonder if MTP supports bf16? |
FYI you can use these checkpoints for V3 NextN and R1 NextN instead of exporting them yourself. Cheers! https://huggingface.co/SGLang/DeepSeek-V3-NextN |
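For anyone wiring this up, here is a minimal launch sketch using that published checkpoint as the draft model. The paths, the --tp value, the speculative settings, and the assumption that --speculative-draft accepts a Hugging Face repo id the way --model-path does are mine, not from this comment:
```bash
# Illustrative sketch: launch a DeepSeek-V3 server with the published NextN draft checkpoint.
# Paths, --tp, and the speculative settings are assumptions; tune them for your hardware.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 --trust-remote-code \
  --speculative-algo NEXTN \
  --speculative-draft SGLang/DeepSeek-V3-NextN \
  --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4
```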
During handling of the above exception, another exception occurred: Traceback (most recent call last): ... Disable CUDA graph with --disable-cuda-graph. |
@lambert0312 Which bf16 model did you use and what GPU did you use? It seems the checkpoint is not correct. Maybe you can try to convert it with this guide. |
Speculative decoding methods can provide a speedup at small batch sizes but are not designed for high load. However, I think the NextN method can still get a speedup at larger batch sizes, since its higher accept rate lets us use fewer draft steps and draft tokens while keeping good performance.
Maybe you can run the benchmark with different request rates and check the throughput. |
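As a concrete illustration, one way to do that sweep is with sglang.bench_serving against an already-running server; the request rates and prompt count below are placeholder assumptions:
```bash
# Sketch: sweep the request rate against a running SGLang server and compare throughput.
# The specific rates and --num-prompts are placeholders; pick values that match your workload.
for rate in 1 2 4 8 16; do
  python3 -m sglang.bench_serving --backend sglang \
    --num-prompts 200 --request-rate "$rate"
done
```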
I use 4x A800 GPUs and converted the bf16 MTP NextN model. @ispobock |
Hmm, but then #3466 has merged... so maybe there's just some glue missing for the non-CUDA-graph path? |
hmmmmmmmmmmmmmmmm |
|
Hi @lambert0312 , have you fixed this problem on 4*A100 nodes? I met this problem too. |
Not yet, trying @ehuaa |
These two parameters are very confusing to me. What exactly do they mean? Why can't it be as simple as vLLM, which has only one parameter, --num_speculative_tokens, i.e. how many tokens to predict? |
Hello, could MTP be combined with quantization for deployment on a single machine with 8*H20? |
I used the latest code and got an error with 8*H20:
[2025-02-20 10:32:44 TP7] Scheduler hit an exception: Traceback (most recent call last): |
@YosanHo Maybe you need to adjust the --mem-fraction-static parameter. |
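For example (a sketch only; the right value depends on your model and GPUs), lowering --mem-fraction-static leaves more headroom for CUDA graph capture and activations:
```bash
# Sketch: append --mem-fraction-static to the existing launch command; 0.75-0.85 is a reasonable range to try.
python3 -m sglang.launch_server --model-path ./DeepSeek-R1/ --tp 16 --trust-remote-code \
  --speculative-algo NEXTN --speculative-draft ./DeepSeek-V3-NextN/ \
  --mem-fraction-static 0.8
```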
I ran the benchmark provided by @ispobock on 2 nodes of 8*H800, but MTP speculative decoding is much slower than normal decoding. I'm not sure if this is expected.
|
I benchmarked NextN for R1 on 2 nodes of 8*H20 and got up to 200% or more higher throughput. Batch size 1: from 17 t/s to 52 t/s. But it is strange that the speed does not stay high; it keeps dropping slowly. In the beginning it was 500 t/s, and over 2-3 hours it dropped roughly linearly to 150 t/s or less. My start command is |
Hi, may I ask which version of SGLang you use and the accept length in your test? I use 0.4.3.post2; while MTP gives double the speed at bs=1, the speed is almost the same at bs=8. |
0.4.3.post2, same as you. I don't know what the accept length is; you can see all the arguments in my command. |
|
@Zhou-sx Sorry, I just saw the message. I have already started running on 4 A800 nodes. However, our scenario is long-context. Currently, chunked_prefill is turned off in NEXTN mode, so OOM often occurs. |
thanks. |
Did you succeed? |
When I try to run on 2 nodes of 8x H100 using the docker image lmsysorg/sglang:v0.4.3.post2-cu125-srt,
it gets stuck at
If I add --disable-cuda-graph it starts, but output throughput is only 15 tokens/s.
If I run with
it obtains ~30 output tokens/s.
|
Hi, is NextN compatible with bench_one_batch? I tried DeepSeek R1 on 8*H200 with
|
the same problem with 8*H200 |
I ran successfully with mem-fraction-static at 0.87 and modified the source code in model_runner.py (line 280) to skip the validation, but the performance is very poor. |
Hi @lambert0312, how did you fix the problem on the 4*A800 nodes? I'm still stuck here. Is it caused by chunked_prefill?
|
@ehuaa What version are you using? |
I did a benchmark test with bench_serving.py on 2 x 8 x H800, and here is my startup script with MTP:
python -m sglang.launch_server --model-path ./DeepSeek-R1/ --tp 16 --dist-init-addr $IP_PORT --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --speculative-algo NEXTN --speculative-draft ./DeepSeek-V3-NextN/ --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --mem-fraction-static 0.75
A very strange phenomenon is:
1. When isl/osl=1k/1k, the speedup brought by MTP is 1.6X (bs 1) and 1.4X (bs 8)
2. But when isl is increased to 8K, MTP brings almost no speedup at bs 1, and starts to show negative growth at bs 16
Why does MTP become less effective when isl becomes longer? |
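For reference, a hedged sketch of how such an isl/osl comparison could be driven with sglang.bench_serving's random dataset; the flag values below are illustrative assumptions, not the exact benchmark commands used above:
```bash
# Sketch: compare a 1k/1k workload against an 8k/1k workload on the running server.
python3 -m sglang.bench_serving --backend sglang --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 --num-prompts 64

python3 -m sglang.bench_serving --backend sglang --dataset-name random \
  --random-input-len 8192 --random-output-len 1024 --num-prompts 64
```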
Could you share what output tokens per second and latency you got for each of those tests? Many thanks. |
That's pretty interesting. Where did you get the idea to use flashinfer-mla? Shouldn't that be automatic, as shown by "MLA optimization is turned on. Use triton backend." in the logs? |
@pipul My guess is that speculative-num-steps indicates how many times you forward the draft model (each time you select the top-k tree paths from root to leaf and get k new nodes), and num_speculative_tokens represents the number of nodes in the draft tree, according to the EAGLE-2 paper. |
I saw that log, and I think it is telling me the triton backend is used instead of flashinfer 😂. |
Motivation
We implemented NextN (MTP) speculative decoding for DeepSeek-V3/R1 based on EAGLE-2 on the Triton backend (#3466) and achieved a 1.76x speedup, with CUDA Graph and torch.compile compatibility. In the current benchmark, we achieved 77 tokens/s output throughput at batch size 1.
In our implementation, we use only the single MTP module (NextN layer) from the official model checkpoint. We found it can also be used for autoregressive drafting like EAGLE. The accept rate of the MTP module is very high (~1.9 average accept length when drafting 2 tokens, e.g.
--speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
). We tried using it to draft more tokens and achieved a better speedup (2.5~3 average accept length when drafting 4 tokens over 2 steps, e.g.
--speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4
).
Best practices should be further investigated through additional experiments, as predicting more tokens can increase overhead and impact throughput, especially for large batch sizes. A careful trade-off between latency and throughput is necessary to determine the optimal number of speculative tokens.
Benchmark Results
Usage
Option 1: Export NextN weights manually
scripts/export_deepseek_nextn.py
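A minimal usage sketch for that script, assuming it takes --input-dir and --output-dir arguments; check its --help output for the exact flags:
```bash
# Sketch: export the NextN (MTP) layer from a local DeepSeek-V3/R1 checkpoint.
# The flag names here are assumptions; verify with: python3 scripts/export_deepseek_nextn.py --help
python3 scripts/export_deepseek_nextn.py \
  --input-dir /path/to/DeepSeek-V3 \
  --output-dir /path/to/DeepSeek-V3-NextN
```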
Option 2: Use the exported NextN weights directly
Ref: #3582 (comment)
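Either way, a launch sketch with the exported (or downloaded) NextN weights, mirroring the flags used in the benchmarks above; the paths and --tp value are placeholders for your own setup:
```bash
# Sketch: serve DeepSeek-R1 with NextN speculative decoding using a local NextN draft directory.
python3 -m sglang.launch_server --model-path ./DeepSeek-R1/ --tp 16 --trust-remote-code \
  --speculative-algo NEXTN --speculative-draft ./DeepSeek-V3-NextN/ \
  --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4
```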