[Bug] H20 8 GPU x 2 with --enable-dp-attention hits CUDA error: an illegal memory access #3892

Closed
mahaocong90 opened this issue Feb 26, 2025 · 5 comments

@mahaocong90
Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I am using two H20 nodes, each with 8 GPUs and 8 IB devices (mlx5_1 to mlx5_8), to test DeepSeek-R1 with --enable-dp-attention.
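The launch commands below leave NCCL to auto-select the IB HCAs. If explicit pinning is needed to reproduce, a minimal sketch using standard NCCL environment variables could look like the following (the socket interface name eth0 is a placeholder, not a value from the actual run):

export NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8
export NCCL_SOCKET_IFNAME=eth0  # placeholder control-plane NIC; adjust to the real interface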

Error log

[2025-02-26 13:18:19 DP3 TP3] Prefill batch. #new-seq: 1, #new-token: 4096, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 7
[2025-02-26 13:18:19 DP0 TP0] Prefill batch. #new-seq: 1, #new-token: 4096, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 7
[2025-02-26 13:18:19 DP1 TP1] Prefill batch. #new-seq: 1, #new-token: 4096, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.03, #running-req: 0, #queue-req: 8
[2025-02-26 13:18:21 DP3 TP3] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
self.forward_thread_func_()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func

logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 796, in forward
return self.forward_extend(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 761, in forward_extend
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 874, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 835, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 773, in forward
hidden_states = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 527, in forward
and forward_batch.extend_prefix_lens.sum() == 0
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 108, in forward_thread_func
with torch.get_device_module(self.device).stream(self.forward_stream):
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 595, in exit
torch.cuda.set_stream(self.src_prev_stream) # type: ignore[arg-type]
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 962, in _patched_set_stream
prev_set_stream(stream)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 636, in set_stream
_set_stream_by_id(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 618, in _set_stream_by_id
torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank3]:[E226 13:18:21.063621914 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f03ad6b9446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f03ad6636e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f03ad7a5a18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0363625726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f036362a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f0363631b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f036363361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7f03af2c65c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #8: + 0x94ac3 (0x7f03b014bac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7f03b01dd850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f03ad6b9446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f03ad6636e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f03ad7a5a18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0363625726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f036362a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f0363631b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f036363361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7f03af2c65c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #8: + 0x94ac3 (0x7f03b014bac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7f03b01dd850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f03ad6b9446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7f03632a071b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7f03af2c65c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7f03b014bac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x7f03b01dd850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007efe63fc7640 (most recent call first):
[2025-02-26 13:18:21 DP4 TP4] TpModelWorkerClient hit an exception: (identical traceback to the DP3 TP3 worker above, ending in the same "CUDA error: an illegal memory access was encountered" at deepseek_v2.py line 527 and "CUDA error: CUDA-capable device(s) is/are busy or unavailable" while restoring the stream)

Reproduction

node0 server:
python3 -m sglang.launch_server --model-path /mnt/model/DeepSeek-R1/ --tp 16 --dist-init-addr 192.168.81.54:20000 --nnodes 2 --node-rank 0 --trust-remote-code --enable-dp-attention --host 0.0.0.0 --port 40000

node1 server:
python3 -m sglang.launch_server --model-path /mnt/model/DeepSeek-R1/ --tp 16 --dist-init-addr 192.168.81.54:20000 --nnodes 2 --node-rank 1 --trust-remote-code --enable-dp-attention --host 0.0.0.0 --port 40000

test cmd:
python3 -m sglang.bench_serving --backend sglang --dataset-path /workspace/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random --random-range-ratio 1 --num-prompt 128 --max-concurrency 128 --random-input 16384 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl
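As the error log suggests, the asynchronous CUDA failure can be localized by re-running with synchronous kernel launches. A sketch of the node0 launch with that variable added (otherwise identical to the command above; expect much lower throughput, so use it only for debugging):

CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.launch_server --model-path /mnt/model/DeepSeek-R1/ --tp 16 --dist-init-addr 192.168.81.54:20000 --nnodes 2 --node-rank 0 --trust-remote-code --enable-dp-attention --host 0.0.0.0 --port 40000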

Environment

lmsysorg/sglang:v0.4.3.post2-cu124

CUDA Driver Version: 535.216.03
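Fuller environment details (torch/CUDA/NCCL/flashinfer versions) can be collected inside the container with sglang's environment checker, assuming the module referenced by the issue template is available in this image:

python3 -m sglang.check_env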

Fridge003 self-assigned this on Feb 26, 2025
@Fjallraven-hc

With a similar setup (2 nodes x 8 H100 GPUs, --enable-dp-attention) running DeepSeek-R1, I found that the client never receives the request output.

@Fridge003
Collaborator

cc @ispobock

@ispobock
Collaborator

Could you try the latest main branch? Some bugs are fixed after 127998c.
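For reference, a sketch of updating the container's checkout to the latest main, assuming the image ships the repo at /sgl-workspace/sglang as the tracebacks indicate and that an editable source install is acceptable:

cd /sgl-workspace/sglang
git pull origin main
pip install -e "python[all]"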

@TangChangcheng

I encountered this error during recomputation, but not when limiting the total request tokens to below max_num_total_tokens. Could there be an issue with request retraction?
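To illustrate the workaround above, a hedged variant of the benchmark command that keeps per-request token budgets well below the KV-cache capacity so that retraction is unlikely to trigger (the reduced input/output lengths and concurrency are placeholders, not tuned values):

python3 -m sglang.bench_serving --backend sglang --dataset-path /workspace/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random --random-range-ratio 1 --num-prompt 128 --max-concurrency 32 --random-input 4096 --random-output 512 --host 0.0.0.0 --port 40000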

@mahaocong90
Author

> Could you try the latest main branch? Some bugs are fixed after 127998c.

OK, I'll try.
