[Bug] sglang crashed when use enable_dp_attention running DeepSeekV3 on 2x8xH100 #3658
Comments
Same issue on a single 8*H200 server.
@hariag could you share the commands for 8*H200?
server: test: By the way, if I remove the --enable-dp-attention option, it works perfectly but is much slower.
Could you try to add
Adding --disable-overlap-schedule does not help. I attached the server-side log; please check it.
I also have the same issue on a single server with 8*H200. Add
I checked the log; it seems to be an issue for
I have the same problem.
Hi @changqingla, could you check whether this case still happens when you remove the options
I get an "output tensor size must be equal to world_size times input tensor size" error when I add the --enable-dp-attention option on two 8*H800 nodes; without it everything is OK. Has anyone tried this on DeepSeek R1 and can help?
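The size-mismatch error above comes from a gather-style collective whose contract is that the output buffer holds exactly world_size copies of the input shape. A minimal sketch (plain Python, not sglang code; the function names here are illustrative, not from the codebase) of that invariant, and of why ranks with unequal local batch sizes under dp-attention need padding before the gather:

```python
# Sketch of the size contract behind the error
# "output tensor size must be equal to world_size times input tensor size".
# This mimics the check a torch.distributed gather collective performs;
# it is NOT the actual sglang or PyTorch implementation.

def check_all_gather_sizes(input_numel: int, output_numel: int, world_size: int) -> None:
    """Raise if the output buffer cannot hold world_size copies of the input."""
    if output_numel != world_size * input_numel:
        raise ValueError(
            "output tensor size must be equal to world_size times input tensor size"
        )

def pad_local_batch(local_tokens: list, max_tokens: int, pad_id: int = 0) -> list:
    """With dp-attention, ranks can hold different local batch sizes, so each
    rank pads its input to the global maximum length before the all-gather."""
    return local_tokens + [pad_id] * (max_tokens - len(local_tokens))

# Two dp ranks with unequal local batches: 3 tokens vs. 5 tokens.
rank_batches = [[1, 2, 3], [4, 5, 6, 7, 8]]
world_size = len(rank_batches)
max_len = max(len(b) for b in rank_batches)

padded = [pad_local_batch(b, max_len) for b in rank_batches]
for p in padded:
    # After padding, every rank satisfies the collective's size contract.
    check_all_gather_sizes(len(p), world_size * max_len, world_size)
```

If one rank skipped the padding step (e.g. passed its 3-token batch against an output sized for 2*5 elements), the check would raise exactly this error, which is consistent with it appearing only when dp-attention creates uneven per-rank batches.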
In my env, inference succeeds after removing this option. What is wrong with the --enable-dp-attention option? It is recommended in the doc: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended
I posted a hot fix for this case in #3727; could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!
OK, let me make a hotfix image and test it~ |
Mark! The same problem with tp 16 on 2x8xH800, source version 3c7bfd7.
I have the same problem without dp attention.
@Lzhang-hub Did you try the latest main branch? |
I met another problem when launching a deepseek-r1 model server with the arguments --enable-dp-attention --dp-size 16 --tp 16 on 2x8xH100. The rank 1 node threw a segmentation fault.
@ispobock I used commit 32b44d2fcac; I will try the latest main branch.
@yizhang2077 Tested sglang:v0.4.3: sglang no longer crashes with PR 3727. But I found that TTFT increased hugely with --enable-dp-attention. The serving benchmark result without --enable-dp-attention; the serving benchmark result with --enable-dp-attention.
@yizhang2077 Thanks, sglang won't crash now, but the throughput is decreased significantly with
Update: I tried the latest main branch; the error I got is the same as in 3424.
server.log
Checklist
Describe the bug
According to the dp-attention performance & usage notes, I turned it on via --enable-dp-attention when launching DeepSeek V3 on 2x8xH100. My command is as below:
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1
When I ran the test scripts, the server crashed.
I tried to change the value of mem-fraction-static, but it doesn't work.
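For context on what that flag trades off, here is a back-of-the-envelope sketch (my own illustration with assumed numbers, not sglang internals): --mem-fraction-static sets the fraction of GPU memory reserved for static allocations (model weights plus KV-cache pool), leaving the remainder for runtime activations.

```python
# Illustrative arithmetic only; the 80 GB figure is the nominal H100 capacity
# and the split below is an assumption about what the flag governs.
GPU_MEM_GB = 80
frac = 0.78  # the --mem-fraction-static value used in this report

static_gb = GPU_MEM_GB * frac      # reserved for weights + KV cache
dynamic_gb = GPU_MEM_GB - static_gb  # left for activations / CUDA graphs

print(f"static: {static_gb:.1f} GB, dynamic: {dynamic_gb:.1f} GB")
```

Lowering the fraction gives more headroom for activations (which can matter when dp-attention changes per-rank batch shapes), at the cost of a smaller KV-cache pool.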
Reproduction
on node 1
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1
on node 2
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1
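The exact test scripts that triggered the crash were not shared. A minimal client sketch that could stand in for them, assuming the server's default OpenAI-compatible /v1/chat/completions endpoint on port 8000 (the endpoint path and model name here are assumptions, not taken from the report):

```python
import json
import urllib.request

def build_request(prompt: str, host: str = "http://127.0.0.1:8000") -> urllib.request.Request:
    """Build a hypothetical smoke-test request against the server launched above."""
    payload = {
        "model": "deepseek-ai/DeepSeek-V3",  # assumed model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Hello")
# To actually send it (requires the server to be up):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```

Sending a batch of such requests concurrently is what exercises the dp-attention batching path that crashes here.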
Environment
Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.183.06
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.49.0
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.23.0
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.2
tiktoken: 0.9.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU AffinityNUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE NODE NODE SYS SYS SYS SYS 0-47,96-1430N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE PIX NODE NODE SYS SYS SYS SYS 0-47,96-1430N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE PIX NODE SYS SYS SYS SYS 0-47,96-1430N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE PIX SYS SYS SYS SYS 0-47,96-1430N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS PIX NODE NODE NODE 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS NODE PIX NODE NODE 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS NODE NODE PIX NODE 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS NODE NODE NODE PIX 48-95,144-191 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE NODE SYS SYS SYS SYS
NIC1 NODE NODE NODE NODE SYS SYS SYS SYS NODE X PIX NODE NODE NODE SYS SYS SYS SYS
NIC2 NODE NODE NODE NODE SYS SYS SYS SYS NODE PIX X NODE NODE NODE SYS SYS SYS SYS
NIC3 NODE PIX NODE NODE SYS SYS SYS SYS NODE NODE NODE X NODE NODE SYS SYS SYS SYS
NIC4 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE NODE NODE X NODE SYS SYS SYS SYS
NIC5 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE NODE NODE X SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS X NODE NODE NODE
NIC7 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS SYS SYS NODE X NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS SYS SYS NODE NODE X NODE
NIC9 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS SYS SYS NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
ulimit soft: 1048576