
[Bug] sglang crashes when using enable_dp_attention to run DeepSeekV3 on 2x8xH100 #3658

ToughK opened this issue Feb 18, 2025 · 22 comments

ToughK commented Feb 18, 2025

server.log

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Following the dp-attention performance & usage guide, I enabled it with --enable-dp-attention when launching DeepSeek V3 on 2x8xH100. My command is as follows:
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1
When I ran my test scripts, the server crashed:

[2025-02-18 06:32:23 DP7 TP7] Prefill batch. #new-seq: 8, #new-token: 4096, #cached-token: 8, cache hit rate: 0.23%, token usage: 0.00, #running-req: 2, #queue-req: 0
[2025-02-18 06:32:23 DP6 TP6] Prefill batch. #new-seq: 8, #new-token: 2265, #cached-token: 8, cache hit rate: 0.38%, token usage: 0.00, #running-req: 2, #queue-req: 0
[rank2]:[E218 06:32:24.828983440 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3545f6c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f3545f166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f354633ea18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f34fbe25726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f34fbe2a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f34fbe31b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f34fbe3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7f3547b375c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #8: + 0x94ac3 (0x7f35489c0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7f3548a52850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[2025-02-18 06:32:24 DP2 TP2] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
self.forward_thread_func_()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func

logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 795, in forward
return self.forward_extend(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 760, in forward_extend
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 868, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 829, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 781, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 177, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 589, in forward
final_hidden_states = self.quant_method.apply(
File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 820, in apply
return fused_experts(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 851, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in call
return self._op(*args, **(kwargs or {}))
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 731, in inplace_fused_experts
fused_experts_impl(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1057, in fused_experts_impl
torch.sum(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 108, in forward_thread_func
with torch.get_device_module(self.device).stream(self.forward_stream):
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 595, in exit
torch.cuda.set_stream(self.src_prev_stream) # type: ignore[arg-type]
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 962, in _patched_set_stream
prev_set_stream(stream)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 636, in set_stream
_set_stream_by_id(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 618, in _set_stream_by_id
torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

terminate called after throwing an instance of 'c10::DistBackendError'
terminate called recursively
Fatal Python error: Aborted

Thread 0x00007f2fe8afc640 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f1f035fe640 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 113 in forward_thread_func
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 what(): in [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I tried changing the value of --mem-fraction-static, but it did not help.
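As the error message itself suggests, a re-run with CUDA_LAUNCH_BLOCKING=1 would give a more accurate stack trace for the illegal memory access. A rough sketch of how it could be added to the launch command above (only the extra -e CUDA_LAUNCH_BLOCKING=1 is new; "..." stands for the remaining, unchanged flags):

# Sketch only: enable synchronous CUDA error reporting to localize the faulting kernel.
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged \
    -e CUDA_LAUNCH_BLOCKING=1 \
    -e NCCL_DEBUG=TRACE \
    --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 \
    -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ \
    --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 \
    --trust-remote-code --enable-dp-attention ...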

Reproduction

On node 1:
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1

On node 2:
docker run --gpus all -d --entrypoint=python3 --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -e NCCL_IB_QPS_PER_CONNECTION=2 -e NCCL_IB_ADAPTIVE_ROUTING=1 -e NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -e NCCL_NVLS_ENABLE=0 -e NCCL_IB_GID_INDEX=3 -e NCCL_DEBUG=TRACE --network=host --ipc=host lmsysorg/sglang:v0.4.3-cu124 -m sglang.launch_server --model-path /sgl-workspace/deepseek-ai/DeepSeekV3/ --tp 16 --nccl-init-addr sgl-master:50001 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 8000 --watchdog-timeout 3600 --kv-cache-dtype fp8_e5m2 --enable-dp-attention --mem-fraction-static 0.78 2>&1
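The test scripts themselves are not attached; for reference, traffic can be sent to the launched server with a plain OpenAI-compatible request like the one below (a hypothetical stand-in, not the actual test script; the "model" field is a placeholder and the host should be node 1's address):

curl http://sgl-master:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "default", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 100}'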

Environment

Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.183.06
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.49.0
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.23.0
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.2
tiktoken: 0.9.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE PIX NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE PIX NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE PIX SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS PIX NODE NODE NODE 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS NODE PIX NODE NODE 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS NODE NODE PIX NODE 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS NODE NODE NODE PIX 48-95,144-191 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE NODE SYS SYS SYS SYS
NIC1 NODE NODE NODE NODE SYS SYS SYS SYS NODE X PIX NODE NODE NODE SYS SYS SYS SYS
NIC2 NODE NODE NODE NODE SYS SYS SYS SYS NODE PIX X NODE NODE NODE SYS SYS SYS SYS
NIC3 NODE PIX NODE NODE SYS SYS SYS SYS NODE NODE NODE X NODE NODE SYS SYS SYS SYS
NIC4 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE NODE NODE X NODE SYS SYS SYS SYS
NIC5 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE NODE NODE X SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS X NODE NODE NODE
NIC7 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS SYS SYS NODE X NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS SYS SYS NODE NODE X NODE
NIC9 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS SYS SYS NODE NODE NODE X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9

ulimit soft: 1048576

hariag commented Feb 18, 2025

Same issue on a single 8*H200 server.

ispobock (Collaborator)

@hariag could you share the commands for 8*H200?

hariag commented Feb 18, 2025

server:
ulimit -n 4096000
python3 -m sglang.launch_server --model DeepSeek-R1 --tp 8 --trust-remote-code --port 8000 --watchdog-timeout 3600 --enable-dp-attention

test:
ulimit -n 4096000
evalscope perf --url 'http://127.0.0.1:8000/v1/chat/completions' --parallel 2048 --model 'DeepSeek-R1' --api-key EMPTY --number 20480 --api openai --stream --temperature 0.6 --log-every-n-query 1024 --max-tokens 100 --max-prompt-length 100 --read-timeout 600 --connect-timeout 600 --prompt "hello"

By the way, if I remove the --enable-dp-attention option, it works perfectly but is much slower.

ispobock (Collaborator)

Could you try to add --disable-overlap-schedule and test it again?

hariag commented Feb 19, 2025

Adding --disable-overlap-schedule does not help.

I have attached the server-side log; please check it.

debug.log

yuqie commented Feb 19, 2025

I also have the same issue on a single 8*H200 server. Adding --disable-overlap-schedule does not help.

python3 -m sglang.launch_server --model-path /mnt/model/  --tensor-parallel-size 8 --trust-remote-code --enable-torch-compile  --disable-cuda-graph --enable-dp-attention

python3 -m sglang.bench_serving \
        --backend sglang \
        --dataset-name random \
        --random-range-ratio 1 \
        --num-prompt 300 \
        --request-rate 8 \
        --random-input 1024 \
        --random-output 1024 |tee -a SGLang_${model_name}_${input_len}_${output_len}_rps${i}_${DATETIME}_servering.log
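(The shell variables in the tee filename come from my benchmark wrapper script and are not important for reproducing the crash; purely illustrative values that make the snippet self-contained would be:)

# Placeholder values only - the real ones come from the wrapper script.
model_name=DeepSeek-V3
input_len=1024
output_len=1024
i=8                        # matches --request-rate above
DATETIME=$(date +%Y%m%d_%H%M%S)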

ispobock (Collaborator)

I have attached the server-side log; please check it.
debug.log

I checked the log; it seems to be an issue with sgl_kernels.fp8_blockwise_scaled_mm.
cc: @zhyncs @yizhang2077

changqingla

I have the same problem.

yizhang2077 (Collaborator)

Hi @changqingla, could you check whether this still happens when you remove the --enable-dp-attention option?

hiyforever

I meet "output tensor size must be equal to world_size times input tensor size" error when add --enable-dp-attention option in two 8*H800, and without it everything ok. Is anyone try it on deepseek r1,can give me help

lshmouse

Hi @changqingla, could you check whether this still happens when you remove the --enable-dp-attention option?

In my environment, inference succeeds after removing this option. What is wrong with --enable-dp-attention? It is recommended in the docs: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended

yizhang2077 (Collaborator) commented Feb 20, 2025

I posted a hotfix for this case in #3727; could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!

lshmouse

I posted a hotfix for this case in #3727; could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!

OK, let me build a hotfix image and test it.
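(For anyone who wants to try it before it lands in a release, what I plan to do is roughly the following inside the v0.4.3 container. This is only a sketch: the PR-ref fetch is standard GitHub, and the editable install follows the install-from-source instructions, which may differ for your version.)

# Check out PR #3727 on top of a source checkout and install it in editable mode.
git clone https://github.com/sgl-project/sglang.git && cd sglang
git fetch origin pull/3727/head:pr-3727 && git checkout pr-3727
pip install -e "python[all]"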

YEXINGZHE54

Marking this: same problem with tp 16 on 2x8xH800, source version 3c7bfd7.

Lzhang-hub (Contributor)

I have the same problem without dp attention.

ispobock (Collaborator)

@Lzhang-hub Did you try the latest main branch?

dwq370 commented Feb 21, 2025

I hit another problem when launching a DeepSeek-R1 server with --enable-dp-attention --dp-size 16 --tp 16 on 2x8xH100. The rank 1 node threw a segmentation fault.

[2025-02-21 02:29:22 DP11 TP11] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP14 TP14] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP10 TP10] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP13 TP13] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP15 TP15] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP9 TP9] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP12 TP12] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-02-21 02:29:22 DP8 TP8] max_total_num_tokens=66295, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:40000 (Press CTRL+C to quit)
[worker0:268  :0:36540] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x16)
==== backtrace (tid:  36540) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000000494f4 uploadProxyOps()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1131
 2 0x0000000000051a7f hostStreamPlanTask()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1163
 3 0x0000000000051bd9 hostStreamPlanCallback()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1175
 4 0x0000000000253ffd cuEGLApiInit()  ???:0
 5 0x0000000000263373 cuEGLApiInit()  ???:0
 6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
 7 0x0000000000126850 __xmknodat()  ???:0
=================================
Fatal Python error: Segmentation fault

Thread 0x00007f1755ffc640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f17567fd640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py", line 88 in replay
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 449 in replay
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 791 in forward
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f27f27d0640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f27f17ce640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f2d322ca4c0 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/streams.py", line 225 in synchronize
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 170 in resolve_batch_result
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1123 in process_batch_result
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 519 in event_loop_overlap
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1825 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

Lzhang-hub (Contributor) commented Feb 21, 2025

@Lzhang-hub Did you try the latest main branch?

@ispobock I am using commit 32b44d2fcac; I will try the latest main branch.

lshmouse commented Feb 21, 2025

I posted a hotfix for this case in #3727; could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!

OK, let me build a hotfix image and test it.

@yizhang2077 I tested sglang:v0.4.3 with PR #3727 applied; sglang no longer crashes. But I found that TTFT increases hugely with --enable-dp-attention.

The serving benchmark result without --enable-dp-attention.

(benchmark screenshot attached)

The serving benchmark result with --enable-dp-attention.

(benchmark screenshot attached)

ToughK (Author) commented Feb 21, 2025

I posted a hotfix for this case in #3727; could you try it again? @lshmouse @hariag @ToughK @hiyforever Thank you!

@yizhang2077 Thanks, sglang no longer crashes, but throughput drops significantly with --enable-dp-attention, even when the QPS is only about 2 req/s.

For high QPS scenarios, add the --enable-dp-attention argument to boost throughput

Lzhang-hub (Contributor)

@Lzhang-hub Did you try the latest main branch?

@ispobock I am using commit 32b44d2fcac; I will try the latest main branch.

Update: I tried the latest main branch and got the same error as #3424.

ispobock (Collaborator)

@lshmouse @ToughK DP attention is aimed at improving throughput for large batch sizes (>128). Its latency is higher than plain TP.
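For example, a run in the regime where it is expected to help would use a much larger concurrent load than the ~2 req/s test above; an illustrative sketch reusing the bench_serving flags from earlier in this thread (the numbers are only indicative):

python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --random-range-ratio 1 \
    --num-prompt 2000 \
    --request-rate 64 \
    --random-input 1024 \
    --random-output 1024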
