
[Bug] sglang crashes when using enable_dp_attention to run DeepSeekV3 on 2x8xH100 #3658

ToughK opened this issue Feb 18, 2025 · 22 comments

ToughK commented Feb 18, 2025

server.log

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Following the dp-attention performance & usage guide, I enabled it with --enable-dp-attention when launching DeepSeek V3 on 2x8xH100. My command is as follows:
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1
When I ran my test scripts, the server crashed:

[2025-02-18 06:32:23 DP7 TP7] Prefill batch. #new-seq: 8, #new-token: 4096, #cached-token: 8, cache hit rate: 0.23%, token usage: 0.00, #running-req: 2, #queue-req: 0
[2025-02-18 06:32:23 DP6 TP6] Prefill batch. #new-seq: 8, #new-token: 2265, #cached-token: 8, cache hit rate: 0.38%, token usage: 0.00, #running-req: 2, #queue-req: 0
[rank2]:[E218 06:32:24.828983440 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3545f6c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f3545f166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f354633ea18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f34fbe25726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f34fbe2a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f34fbe31b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f34fbe3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7f3547b375c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #8: + 0x94ac3 (0x7f35489c0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7f3548a52850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[2025-02-18 06:32:24 DP2 TP2] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
self.forward_thread_func_()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func

logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 795, in forward
return self.forward_extend(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 760, in forward_extend
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 868, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 829, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 781, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 177, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 589, in forward
final_hidden_states = self.quant_method.apply(
File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 820, in apply
return fused_experts(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 851, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in call
return self._op(*args, **(kwargs or {}))
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 731, in inplace_fused_experts
fused_experts_impl(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1057, in fused_experts_impl
torch.sum(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 108, in forward_thread_func
with torch.get_device_module(self.device).stream(self.forward_stream):
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 595, in exit
torch.cuda.set_stream(self.src_prev_stream) # type: ignore[arg-type]
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 962, in _patched_set_stream
prev_set_stream(stream)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 636, in set_stream
_set_stream_by_id(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 618, in _set_stream_by_id
torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

terminate called after throwing an instance of 'c10::DistBackendError'
terminate called recursively
Fatal Python error: Aborted

Thread 0x00007f2fe8afc640 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f1f035fe640 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 113 in forward_thread_func
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 what(): in [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I tried changing the value of --mem-fraction-static, but it did not help.
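As the error message itself suggests, a re-run with CUDA_LAUNCH_BLOCKING=1 would give a more accurate stack trace for the illegal memory access. A rough sketch of how it could be added to the launch command above (only the extra -e CUDA_LAUNCH_BLOCKING=1 is new; "..." stands for the remaining, unchanged flags):

# Sketch only: enable synchronous CUDA error reporting to localize the faulting kernel.
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged \
    -e CUDA_LAUNCH_BLOCKING=1 \
    -e NCCL_DEBUG=TRACE \
    --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 \
    -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ \
    --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 \
    --trust-remote-code --enable-dp-attention ...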

Reproduction

On node 1:
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1

On node 2:
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1
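The test scripts themselves are not attached; for reference, traffic can be sent to the launched server with a plain OpenAI-compatible request like the one below (a hypothetical stand-in, not the actual test script; the "model" field is a placeholder and the host should be node 1's address):

curl http://sgl-master:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "default", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 100}'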

Environment

Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.183.06
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.49.0
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.23.0
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.2
tiktoken: 0.9.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE PIX NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE PIX NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE PIX SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS PIX NODE NODE NODE 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS NODE PIX NODE NODE 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS NODE NODE PIX NODE 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS NODE NODE NODE PIX 48-95,144-191 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE NODE SYS SYS SYS SYS
NIC1 NODE NODE NODE NODE SYS SYS SYS SYS NODE X PIX NODE NODE NODE SYS SYS SYS SYS
NIC2 NODE NODE NODE NODE SYS SYS SYS SYS NODE PIX X NODE NODE NODE SYS SYS SYS SYS
NIC3 NODE PIX NODE NODE SYS SYS SYS SYS NODE NODE NODE X NODE NODE SYS SYS SYS SYS
NIC4 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE NODE NODE X NODE SYS SYS SYS SYS
NIC5 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE NODE NODE X SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS X NODE NODE NODE
NIC7 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS SYS SYS NODE X NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS SYS SYS NODE NODE X NODE
NIC9 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS SYS SYS NODE NODE NODE X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9

ulimit soft: 1048576

hariag commented Feb 18, 2025

Same issue on a single 8*H200 server.

ispobock (Collaborator)

@hariag could you share the commands for 8*H200?

hariag commented Feb 18, 2025

server:
ulimit -n 4096000
python3 -m sglang.launch_server --model DeepSeek-R1 --tp 8 --trust-remote-code --port 8000 --watchdog-timeout 3600 --enable-dp-attention

test:
ulimit -n 4096000
evalscope perf --url 'http://127.0.0.1:8000/v1/chat/completions' --parallel 2048 --model 'DeepSeek-R1' --api-key EMPTY --number 20480 --api openai --stream --temperature 0.6 --log-every-n-query 1024 --max-tokens 100 --max-prompt-length 100 --read-timeout 600 --connect-timeout 600 --prompt "hello"

By the way, if I remove the --enable-dp-attention option, it works perfectly but is much slower.

ispobock (Collaborator)

Could you try to add --disable-overlap-schedule and test it again?

hariag commented Feb 19, 2025

Adding --disable-overlap-schedule does not help.

I have attached the server-side log; please check it.

debug.log

yuqie commented Feb 19, 2025

I also have the same issue on a single 8*H200 server. Adding --disable-overlap-schedule does not help.

python3 -m sglang.launch_server --model-path /mnt/model/  --tensor-parallel-size 8 --trust-remote-code --enable-torch-compile  --disable-cuda-graph --enable-dp-attention

python3 -m sglang.bench_serving \
        --backend sglang \
        --dataset-name random \
        --random-range-ratio 1 \
        --num-prompt 300 \
        --request-rate 8 \
        --random-input 1024 \
        --random-output 1024 |tee -a SGLang_${model_name}_${input_len}_${output_len}_rps${i}_${DATETIME}_servering.log
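(The shell variables in the tee filename come from my benchmark wrapper script and are not important for reproducing the crash; purely illustrative values that make the snippet self-contained would be:)

# Placeholder values only - the real ones come from the wrapper script.
model_name=DeepSeek-V3
input_len=1024
output_len=1024
i=8                        # matches --request-rate above
DATETIME=$(date +%Y%m%d_%H%M%S)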

ispobock (Collaborator)

I have attached the server-side log; please check it.
debug.log

I checked the log; it seems to be an issue with sgl_kernels.fp8_blockwise_scaled_mm.
cc: @zhyncs @yizhang2077

changqingla

I have the same problem.

yizhang2077 (Collaborator)

Hi @changqingla, could you check whether this still happens when you remove the --enable-dp-attention option?

hiyforever

I meet "output tensor size must be equal to world_size times input tensor size" error when add --enable-dp-attention option in two 8*H800, and without it everything ok. Is anyone try it on deepseek r1,can give me help

lshmouse

Hi @changqingla, could you check whether this still happens when you remove the --enable-dp-attention option?

In my environment, inference succeeds after removing this option. What is wrong with --enable-dp-attention? It is recommended in the docs: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended

yizhang2077 (Collaborator) commented Feb 20, 2025

I posted a hotfix for this case in #3727; could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!

lshmouse

I posted a hotfix for this case in #3727; could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!

OK, let me build a hotfix image and test it.
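(For anyone who wants to try it before it lands in a release, what I plan to do is roughly the following inside the v0.4.3 container. This is only a sketch: the PR-ref fetch is standard GitHub, and the editable install follows the install-from-source instructions, which may differ for your version.)

# Check out PR #3727 on top of a source checkout and install it in editable mode.
git clone https://github.com/sgl-project/sglang.git && cd sglang
git fetch origin pull/3727/head:pr-3727 && git checkout pr-3727
pip install -e "python[all]"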

YEXINGZHE54

Marking this: same problem with tp 16 on 2x8xH800, source version 3c7bfd7.

Lzhang-hub (Contributor)

I have the same problem without dp attention.

ispobock (Collaborator)

@Lzhang-hub Did you try the latest main branch?

dwq370 commented Feb 21, 2025

I hit another problem when launching a DeepSeek-R1 server with --enable-dp-attention --dp-size 16 --tp 16 on 2x8xH100. The rank 1 node threw a segmentation fault.

[2025-02-21 02:29:22 DP11 TP11] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP14 TP14] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP10 TP10] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP13 TP13] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP15 TP15] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP9 TP9] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP12 TP12] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP8 TP8] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:40000 (Press CTRL+C to quit)
[worker0:268  :0:36540] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x16)
==== backtrace (tid:  36540) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000000494f4 uploadProxyOps()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1131
 2 0x0000000000051a7f hostStreamPlanTask()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1163
 3 0x0000000000051bd9 hostStreamPlanCallback()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1175
 4 0x0000000000253ffd cuEGLApiInit()  ???:0
 5 0x0000000000263373 cuEGLApiInit()  ???:0
 6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
 7 0x0000000000126850 __xmknodat()  ???:0
=================================
Fatal Python error: Segmentation fault

Thread 0x00007f1755ffc640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f17567fd640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py", line 88 in replay
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 449 in replay
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 791 in forward
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f27f27d0640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f27f17ce640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f2d322ca4c0 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/streams.py", line 225 in synchronize
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 170 in resolve_batch_result
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1123 in process_batch_result
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 519 in event_loop_overlap
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1825 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

Lzhang-hub (Contributor) commented Feb 21, 2025

@Lzhang-hub Did you try the latest main branch?

@ispobock I am using commit 32b44d2fcac; I will try the latest main branch.

lshmouse commented Feb 21, 2025

I posted a hotfix for this case in #3727; could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!

OK, let me build a hotfix image and test it.

@yizhang2077 I tested sglang:v0.4.3 with PR #3727 applied; sglang no longer crashes. But I found that TTFT increases hugely with --enable-dp-attention.

The serving benchmark result without --enable-dp-attention.

(benchmark screenshot attached)

The serving benchmark result with --enable-dp-attention.

(benchmark screenshot attached)

ToughK (Author) commented Feb 21, 2025

I posted a hotfix for this case in #3727; could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!

@yizhang2077 Thanks, sglang no longer crashes, but throughput drops significantly with --enable-dp-attention, even when the QPS is only about 2 req/s.

For high QPS scenarios, add the --enable-dp-attention argument to boost throughput

Lzhang-hub (Contributor)

@Lzhang-hub Did you try the latest main branch?

@ispobock I am using commit 32b44d2fcac; I will try the latest main branch.

Update: I tried the latest main branch and got the same error as #3424.

ispobock (Collaborator)

@lshmouse @ToughK DP attention is aimed at improving throughput for large batch sizes (>128). Its latency is higher than plain TP.
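For example, a run in the regime where it is expected to help would use a much larger concurrent load than the ~2 req/s test above; an illustrative sketch reusing the bench_serving flags from earlier in this thread (the numbers are only indicative):

python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --random-range-ratio 1 \
    --num-prompt 2000 \
    --request-rate 64 \
    --random-input 1024 \
    --random-output 1024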
