[Bug] Model Stuck at Prefill and then Throws "Watchdog Timeout" Error After Idle Period (Deepseek-r1:671b on two H100*8) #3836
Comments
+1 We also encountered the same problem: stuck at prefill. |
Met the same issue too. |
1 similar comment
Met the same issue too. |
Same issue with H100. |
Met the same issue too. |
Same problem too, NCCL timeout. |
While the failure itself is one issue, another problem is that the health status reported by SGLang does not change. In such situations we restart the pod and things start working again, but since the health status is still 200 we are unable to automate this. I am open to debugging and contributing a fix if anyone can provide a direction. |
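A minimal sketch of what such an external probe could look like, assuming the server exposes the native /generate endpoint on port 30000 and runs in a container named sglang_node_1 as in the commands later in this thread; this is a workaround idea, not an official SGLang feature, so adjust the probe URL and restart action to your deployment:

#!/bin/bash
# External liveness probe: /health may keep returning 200 while prefill is stuck,
# so send a real 1-token generation request and restart when it stalls.
while true; do
  if ! curl -sf --max-time 60 http://127.0.0.1:30000/generate \
        -H "Content-Type: application/json" \
        -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 1}}' > /dev/null; then
    echo "$(date) generation probe failed or timed out, restarting the server"
    docker restart sglang_node_1   # assumed container name; with --rm or in Kubernetes, recreate the container/pod instead
  fi
  sleep 300
done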
same problem here! |
Same issue on H20 * 2, any suggestions? |
Every time, it gets stuck at the attn stage. |
same problem here! |
Logs after setting --watchdog_timeout=3600:
terminate called after throwing an instance of 'c10::DistBackendError' Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007ee49bffe640 (most recent call first): Thread 0x00007ee49ffff640 (most recent call first): Thread 0x00007ee61bfff640 (most recent call first): Thread 0x00007eebedfff640 (most recent call first): Thread 0x00007efcf96a5480 (most recent call first): Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, zmq.backend.cython._zmq[rank4]:[E226 11:16:37.108069950 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5653, OpType=ALLREDUCE, NumelIn=58720256, NumelOut=58720256, Timeout(ms)=600000) ran for 600011 milliseconds before timing out. terminate called after throwing an instance of 'c10::DistBackendError' Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007f376fffe640 (most recent call first): Thread 0x00007f3773fff640 (most recent call first): Thread 0x00007f390bfff640 (most recent call first): Thread 0x00007f3edbfff640 (most recent call first): Thread 0x00007f4fe54b0480 (most recent call first): terminate called after throwing an instance of 'c10::DistBackendError' Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007fb5b7ffe640 (most recent call first): Thread 0x00007fb5bbfff640 (most recent call first): Thread 0x00007fb785fff640 (most recent call first): Thread 0x00007fbd2bfff640 (most recent call first): Thread 0x00007fce47ae1480 (most recent call first): 5] -> 6[6] via P2P/IPC (total: 52) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007f82b3ffe640 (most recent call first): Thread 0x00007f82b7fff640 (most recent call first): Thread 0x00007f844ffff640 (most recent call first): Thread 0x00007f8a1ffff640 (most recent call first): Thread 0x00007f9b29d14480 (most recent call first): Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, 
torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, zmq.backend.cython._zmq, yaml._yaml, markupsafe._speedups, PIL._imaging, PIL._imagingft, msgspec._core, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, msgpack._cmsgpack, google._upb._message, ray._raylet, sentencepiece._sentencepiece, regex._regex, cuda_utils, __triton_launcher (total: 52) terminate called after throwing an instance of 'c10::DistBackendError' Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007f922bffe640 (most recent call first): Thread 0x00007f922ffff640 (most recent call first): Thread 0x00007f93bffff640 (most recent call first): Thread 0x00007f9991fff640 (most recent call first): Thread 0x00007faa9d5c9480 (most recent call first): Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, zmq.backend.cython._zmq, yaml._yaml, markupsafe._speedups, PIL._imaging, PIL._imagingft, msgspec._core, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, msgpack._cmsgpack, google._upb._message, ray._raylet, sentencepiece._sentencepiece, regex._regex, cuda_utils, __triton_launcher (total: 52) terminate called after throwing an instance of 'c10::DistBackendError' Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007ef94dffe640 (most recent call first): Thread 0x00007ef951fff640 (most recent call first): Thread 0x00007efabffff640 (most recent call first): Thread 0x00007f0091fff640 (most recent call first): Thread 0x00007f119cd3a480 (most recent call first): Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007fd79fffe640 (most recent call first): Thread 0x00007fd7a3fff640 (most recent call first): Thread 0x00007fd91bfff640 (most recent call first): Thread 0x00007fdeebfff640 (most recent call first): Thread 0x00007feff4fd9480 (most recent call first): terminate called after throwing an instance of 'c10::DistBackendError' Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): Fatal Python error: Aborted Thread 0x00007fa0b9ffe640 (most recent call first): Thread 0x00007fa0bdfff640 (most recent call first): Thread 0x00007fa233fff640 (most recent 
call first): Thread 0x00007fa803fff640 (most recent call first): Thread 0x00007fb90d364480 (most recent call first): Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, zmq.backend.cython._zmq, yaml._yaml, markupsafe._speedups, PIL._imaging, PIL._imagingft, msgspec._core, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, msgpack._cmsgpack, google._upb._message, ray._raylet, sentencepiece._sentencepiece, regex._regex, cuda_utils, __triton_launcher (total: 52) |
Can you install sglang from source and try the latest commit? |
Can you push an image of the latest commit to Docker Hub? |
@verigle The official Docker image will only be built for official releases, but I think you can update the repo inside the Docker container like this:
# fetch the latest commit
git fetch origin
git rebase origin/main
# reinstall from source
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python |
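After reinstalling, a quick sanity check that the updated build is the one being imported (a sketch; this assumes the installed package exposes __version__):

python3 -c "import sglang; print(sglang.__version__)"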
Should I also add the "--dist-timeout" parameter? |
You can try without it first; if the timeout still occurs, then add it. |
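For reference, a minimal sketch of what adding the flag could look like, reusing the paths and flags from the launch command shown later in this thread; the 3600 value mirrors what others here use and is assumed to be in seconds:

python3 -m sglang.launch_server \
--model-path /root/deepseek-r1 \
--dist-init-addr 10.0.251.17:50000 \
--tp 16 \
--nnodes 2 \
--node-rank 0 \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--dist-timeout 3600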
I'm trying this approach, and the first batch of tests (8000 input tokens, 1000 output tokens, 5 concurrent requests) found no problems. Stress testing will take a little more time. |
Downloading https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.2/flashinfer_python-0.2.2%2Bcu124torch2.5-cp38-abi3-linux_x86_64.whl is very slow. Is there any plan to move the package to PyPI or another mirror? |
@echozyr2001 great! that's nice to hear! |
If you are in China, you can use:
|
Thank you, but flashinfer-python alone is still not fast. |
I think this issue has been solved in the latest commit, will close for now. |
When will the image be released to Docker Hub? I have not been able to download flashinfer-python from GitHub because it is too slow. |
I use 2 * 8 * H20. The startup commands are as follows.

Launch Docker:

docker run --gpus all \
--rm -it \
--name sglang_node_1 \
-v /data/deepseek-r1:/root/deepseek-r1 \
-v /data/torchcache:/root/torchcache \
--privileged \
--env "GLOO_SOCKET_IFNAME=ens12f0np0" \
--env "NCCL_SOCKET_IFNAME=ens12f0np0" \
--env "NCCL_IB_HCA=ibp14s0,ibp71s0,ibp134s0,ibp195s0" \
--env "NCCL_IB_CUDA_SUPPORT=1" \
--env "NCCL_IB_ALLOW=1" \
--env "NCCL_IB_DISABLE=0" \
--env "NCCL_IB_RETRY_CNT=10" \
--env "NCCL_P2P_LEVEL=NVL" \
--env "NCCL_IB_GID_INDEX=3" \
--env "NCCL_DEBUG=TRACE" \
--env "TORCHINDUCTOR_CACHE_DIR=/root/torchcache" \
--ipc=host \
--network=host \
--shm-size 32g \
sglang:latest

Then update sglang:

git fetch origin
git rebase origin/main
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

Launch the server on node rank 0 (a sketch of the node rank 1 command follows this comment):

python3 -m sglang.launch_server \
--model-path /root/deepseek-r1 \
--dist-init-addr 10.0.251.17:50000 \
--tp 16 \
--nnodes 2 \
--node-rank 0 \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code

I ran a total of 150 requests, each with 3000 input tokens and 1000 output tokens, 5 requests per batch. The single-request throughput is around 20 t/s, and the TTFT is around 1 second. I haven't encountered any prefill-stuck issues yet, but I hope more people can continue testing to make sure the problem does not occur again. |
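For completeness, a sketch of the corresponding command on the second node, assuming the only change is the node rank (the dist-init address still points at node 0):

python3 -m sglang.launch_server \
--model-path /root/deepseek-r1 \
--dist-init-addr 10.0.251.17:50000 \
--tp 16 \
--nnodes 2 \
--node-rank 1 \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code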
It works normally at first, but after a few days this error appears again. Is there a way to restart automatically? |
Hi guys, a silly question: what's the difference between --dist-timeout 3600 and --watchdog_timeout 3600, and if we add --dist-timeout 3600, will the server hang for 3599 s at worst? |
Hello, could you please tell me which specific commit contains the fix? |
Describe the bug
I am currently using SGLang to deploy the deepseek-r1:671b model across two 8-GPU H100 nodes. However, I have encountered a persistent issue when the system remains idle for some time: upon resuming usage, even with simple prompts such as "Hello," the model gets stuck during the prefill stage and the system then throws a "watchdog timeout" error.
Following this error, the GPU resources are released, and any subsequent attempts to interact with the model fail to reload it. The only way to restore functionality is by restarting the service entirely.
Reproduction
Deploy the deepseek-r1:671b model using SGLang on two 8-GPU H100 nodes.
It works at the beginning.
Leave the system idle for a period of time (exact duration may vary).
Attempt to send a simple query like "Hello" after the idle period.
Observe that the model gets stuck at the Prefill stage.
Encounter the "watchdog timeout" error, followed by the release of GPU resources.
Note that further queries do not reload the model, necessitating a service restart.
Environment
Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.144.03
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.23.0
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.2
tiktoken: 0.9.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE PIX SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS PIX NODE NODE NODE 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS NODE PIX NODE NODE 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS NODE NODE PIX NODE 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS NODE NODE NODE PIX 48-95,144-191 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE SYS SYS SYS SYS
NIC1 NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE SYS SYS SYS SYS
NIC2 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE X SYS SYS SYS SYS
NIC3 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS X NODE NODE NODE
NIC4 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS NODE X NODE NODE
NIC5 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS NODE NODE X NODE
NIC6 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
ulimit soft: 1048576