[Bug] Model Stuck at Prefill and then Throws "Watchdog Timeout" Error After Idle Period (Deepseek-r1:671b on two H100*8) #3836
Comments
+1 We also encountered the same problem: stuck at prefill. |
Met the same issue too. |
1 similar comment
Met the same issue too. |
Same issue with H100. |
Met the same issue too. |
Same problem too, NCCL timeout. |
While the failure itself is one issue, another problem is that the health status reported by SGLang does not change. In such situations we restart the pod and things start working again, but since the health status is still 200 we are unable to automate this. I am open to debugging and contributing a fix if anyone can provide a direction. |
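A minimal sketch of what such an external probe could look like, assuming the server exposes the native /generate endpoint on port 30000 and runs in a container named sglang_node_1 as in the commands later in this thread; this is a workaround idea, not an official SGLang feature, so adjust the probe URL and restart action to your deployment:

#!/bin/bash
# External liveness probe: /health may keep returning 200 while prefill is stuck,
# so send a real 1-token generation request and restart when it stalls.
while true; do
  if ! curl -sf --max-time 60 http://127.0.0.1:30000/generate \
        -H "Content-Type: application/json" \
        -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 1}}' > /dev/null; then
    echo "$(date) generation probe failed or timed out, restarting the server"
    docker restart sglang_node_1   # assumed container name; with --rm or in Kubernetes, recreate the container/pod instead
  fi
  sleep 300
done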
same problem here! |
Same issue on H20 * 2, any suggestions? |
Every time, it gets stuck at the attn stage. |
same problem here! |
Logs after setting --watchdog_timeout=3600:
terminate called after throwing an instance of 'c10::DistBackendError' Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007ee49bffe640 (most recent call first): Thread 0x00007ee49ffff640 (most recent call first): Thread 0x00007ee61bfff640 (most recent call first): Thread 0x00007eebedfff640 (most recent call first): Thread 0x00007efcf96a5480 (most recent call first): Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, zmq.backend.cython._zmq[rank4]:[E226 11:16:37.108069950 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5653, OpType=ALLREDUCE, NumelIn=58720256, NumelOut=58720256, Timeout(ms)=600000) ran for 600011 milliseconds before timing out. terminate called after throwing an instance of 'c10::DistBackendError' Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007f376fffe640 (most recent call first): Thread 0x00007f3773fff640 (most recent call first): Thread 0x00007f390bfff640 (most recent call first): Thread 0x00007f3edbfff640 (most recent call first): Thread 0x00007f4fe54b0480 (most recent call first): terminate called after throwing an instance of 'c10::DistBackendError' Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007fb5b7ffe640 (most recent call first): Thread 0x00007fb5bbfff640 (most recent call first): Thread 0x00007fb785fff640 (most recent call first): Thread 0x00007fbd2bfff640 (most recent call first): Thread 0x00007fce47ae1480 (most recent call first): 5] -> 6[6] via P2P/IPC (total: 52) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007f82b3ffe640 (most recent call first): Thread 0x00007f82b7fff640 (most recent call first): Thread 0x00007f844ffff640 (most recent call first): Thread 0x00007f8a1ffff640 (most recent call first): Thread 0x00007f9b29d14480 (most recent call first): Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, 
torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, zmq.backend.cython._zmq, yaml._yaml, markupsafe._speedups, PIL._imaging, PIL._imagingft, msgspec._core, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, msgpack._cmsgpack, google._upb._message, ray._raylet, sentencepiece._sentencepiece, regex._regex, cuda_utils, __triton_launcher (total: 52) terminate called after throwing an instance of 'c10::DistBackendError' Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007f922bffe640 (most recent call first): Thread 0x00007f922ffff640 (most recent call first): Thread 0x00007f93bffff640 (most recent call first): Thread 0x00007f9991fff640 (most recent call first): Thread 0x00007faa9d5c9480 (most recent call first): Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, zmq.backend.cython._zmq, yaml._yaml, markupsafe._speedups, PIL._imaging, PIL._imagingft, msgspec._core, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, msgpack._cmsgpack, google._upb._message, ray._raylet, sentencepiece._sentencepiece, regex._regex, cuda_utils, __triton_launcher (total: 52) terminate called after throwing an instance of 'c10::DistBackendError' Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007ef94dffe640 (most recent call first): Thread 0x00007ef951fff640 (most recent call first): Thread 0x00007efabffff640 (most recent call first): Thread 0x00007f0091fff640 (most recent call first): Thread 0x00007f119cd3a480 (most recent call first): Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007fd79fffe640 (most recent call first): Thread 0x00007fd7a3fff640 (most recent call first): Thread 0x00007fd91bfff640 (most recent call first): Thread 0x00007fdeebfff640 (most recent call first): Thread 0x00007feff4fd9480 (most recent call first): terminate called after throwing an instance of 'c10::DistBackendError' Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007fa0b9ffe640 (most recent call first): Thread 0x00007fa0bdfff640 (most recent call first): Thread 0x00007fa233fff640 (most recent 
call first): Thread 0x00007fa803fff640 (most recent call first): Thread 0x00007fb90d364480 (most recent call first): Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, zmq.backend.cython._zmq, yaml._yaml, markupsafe._speedups, PIL._imaging, PIL._imagingft, msgspec._core, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, msgpack._cmsgpack, google._upb._message, ray._raylet, sentencepiece._sentencepiece, regex._regex, cuda_utils, __triton_launcher (total: 52) |
Can you install sglang from source and try the latest commit? |
Can you push an image of the latest commit to Docker Hub? |
@verigle The official Docker image will only be built for official releases, but I think you can update the repo inside the Docker container like this:
# fetch the latest commit
git fetch origin
git rebase origin/main
# reinstall from source
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python |
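After reinstalling, a quick sanity check that the updated build is the one being imported (a sketch; this assumes the installed package exposes __version__):

python3 -c "import sglang; print(sglang.__version__)"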
Should I also add the "--dist-timeout" parameter? |
You can try without it first; if the timeout still occurs, then add it. |
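For reference, a minimal sketch of what adding the flag could look like, reusing the paths and flags from the launch command shown later in this thread; the 3600 value mirrors what others here use and is assumed to be in seconds:

python3 -m sglang.launch_server \
--model-path /root/deepseek-r1 \
--dist-init-addr 10.0.251.17:50000 \
--tp 16 \
--nnodes 2 \
--node-rank 0 \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--dist-timeout 3600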
I'm trying this approach, and the first batch of tests (8000 input tokens, 1000 output tokens, 5 concurrent requests) found no problems. Stress testing will take a little more time. |
Downloading https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.2/flashinfer_python-0.2.2%2Bcu124torch2.5-cp38-abi3-linux_x86_64.whl is very slow. Is there any plan to move the package to PyPI or another mirror? |
@echozyr2001 great! that's nice to hear! |
If you are in China, you can use:
|
Thank you, but flashinfer-python alone is still not fast. |
I think this issue has been solved in the latest commit, will close for now. |
When will the image be released to Docker Hub? I have not been able to download flashinfer-python from GitHub because it is too slow. |
I use 2 * 8 * H20. The startup commands are as follows.

Launch Docker:

docker run --gpus all \
--rm -it \
--name sglang_node_1 \
-v /data/deepseek-r1:/root/deepseek-r1 \
-v /data/torchcache:/root/torchcache \
--privileged \
--env "GLOO_SOCKET_IFNAME=ens12f0np0" \
--env "NCCL_SOCKET_IFNAME=ens12f0np0" \
--env "NCCL_IB_HCA=ibp14s0,ibp71s0,ibp134s0,ibp195s0" \
--env "NCCL_IB_CUDA_SUPPORT=1" \
--env "NCCL_IB_ALLOW=1" \
--env "NCCL_IB_DISABLE=0" \
--env "NCCL_IB_RETRY_CNT=10" \
--env "NCCL_P2P_LEVEL=NVL" \
--env "NCCL_IB_GID_INDEX=3" \
--env "NCCL_DEBUG=TRACE" \
--env "TORCHINDUCTOR_CACHE_DIR=/root/torchcache" \
--ipc=host \
--network=host \
--shm-size 32g \
sglang:latest

Then update sglang:

git fetch origin
git rebase origin/main
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

Launch the server on node rank 0 (a sketch of the node rank 1 command follows this comment):

python3 -m sglang.launch_server \
--model-path /root/deepseek-r1 \
--dist-init-addr 10.0.251.17:50000 \
--tp 16 \
--nnodes 2 \
--node-rank 0 \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code

I ran a total of 150 requests, each with 3000 input tokens and 1000 output tokens, 5 requests per batch. The single-request throughput is around 20 t/s, and the TTFT is around 1 second. I haven't encountered any prefill-stuck issues yet, but I hope more people can continue testing to make sure the problem does not occur again. |
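For completeness, a sketch of the corresponding command on the second node, assuming the only change is the node rank (the dist-init address still points at node 0):

python3 -m sglang.launch_server \
--model-path /root/deepseek-r1 \
--dist-init-addr 10.0.251.17:50000 \
--tp 16 \
--nnodes 2 \
--node-rank 1 \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code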
It works normally at first, but after a few days this error appears again. Is there a way to restart automatically? |
Hi guys, a silly question: what's the difference between --dist-timeout 3600 and --watchdog_timeout 3600, and if we add --dist-timeout 3600, will the server hang for 3599 s at worst? |
Hello, could you please tell me which specific commit contains the fix? |
Describe the bug
I am currently using SGLang to deploy the deepseek-r1:671b model across two 8-GPU H100 nodes. However, I have encountered a persistent issue when the system remains idle for some time: upon resuming usage, even with simple prompts such as "Hello," the model gets stuck during the prefill stage and the system then throws a "watchdog timeout" error.
Following this error, the GPU resources are released, and any subsequent attempts to interact with the model fail to reload it. The only way to restore functionality is by restarting the service entirely.
Reproduction
Deploy the deepseek-r1:671b model using SGLang on two 8-GPU H100 nodes.
It works at the beginning.
Leave the system idle for a period of time (exact duration may vary).
Attempt to send a simple query like "Hello" after the idle period.
Observe that the model gets stuck at the Prefill stage.
Encounter the "watchdog timeout" error, followed by the release of GPU resources.
Note that further queries do not reload the model, necessitating a service restart.
Environment
Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.144.03
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.23.0
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.2
tiktoken: 0.9.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE PIX SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS PIX NODE NODE NODE 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS NODE PIX NODE NODE 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS NODE NODE PIX NODE 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS NODE NODE NODE PIX 48-95,144-191 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE SYS SYS SYS SYS
NIC1 NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE SYS SYS SYS SYS
NIC2 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE X SYS SYS SYS SYS
NIC3 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS X NODE NODE NODE
NIC4 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS NODE X NODE NODE
NIC5 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS NODE NODE X NODE
NIC6 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
ulimit soft: 1048576