Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
The model loads successfully and the model service starts on 4 nodes with 8x A100 GPUs each.
Requests to http://0.0.0.0:50050/generate return output as expected.
Requests to http://0.0.0.0:50050/v1/chat/completions block and never return a response.
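For reference, the requests below illustrate the behavior; the exact payloads are assumptions (the served model name "atom" is taken from the node1 launch command, everything else is a guess at a minimal request):

# /generate responds normally
curl http://0.0.0.0:50050/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 16}}'

# /v1/chat/completions blocks until the NCCL watchdog times out
curl http://0.0.0.0:50050/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "atom", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'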
The error is:
[rank16]:[E214 06:46:21.341483897 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 16] Exception (either an error or timeout) detected by watchdog at work: 9301, last enqueued NCCL work: 9301, last completed NCCL work: 9300.
[rank16]:[E214 06:46:21.341531487 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 16] Timeout at NCCL work: 9301, last enqueued NCCL work: 9301, last completed NCCL work: 9300.
[rank16]:[E214 06:46:21.341568027 ProcessGroupNCCL.cpp:630] [Rank 16] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank16]:[E214 06:46:21.341649967 ProcessGroupNCCL.cpp:636] [Rank 16] To avoid data inconsistency, we are taking the entire process down.
[rank16]:[E214 06:46:21.343994164 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 16] Process group watchdog thread terminated with exception: [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9301, OpType=ALLREDUCE, NumelIn=129024, NumelOut=129024, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8f6d56c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f8f2342a772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8f23431bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8f2343361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f8f6fd1d5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7f9012012ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f90120a3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 16] Process group watchdog thread terminated with exception: [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9301, OpType=ALLREDUCE, NumelIn=129024, NumelOut=129024, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8f6d56c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f8f2342a772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8f23431bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8f2343361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f8f6fd1d5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7f9012012ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f90120a3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8f6d56c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7f8f230a071b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7f8f6fd1d5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7f9012012ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f90120a3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Fatal Python error: Aborted
Thread 0x00007f88ddfff640 (most recent call first):
Reproduction
node1:
NCCL_P2P_LEVEL=NVL python3 -m sglang.launch_server --model-path /models/ --tp 32 --dist-init-addr 172.17.120.15:50030 --nnodes 4 --node-rank 0 --trust-remote-code --watchdog-timeout 21600 --mem-fraction-static 0.9 --host 0.0.0.0 --port 50050 --served-model-name atom --disable-cuda-graph
node2:
NCCL_DEBUG=TRACE NCCL_P2P_LEVEL=NVL python3 -m sglang.launch_server --model-path /models/ --tp 32 --dist-init-addr 172.17.120.15:50030 --nnodes 4 --node-rank 1 --trust-remote-code --watchdog-timeout 21600 --disable-cuda-graph --host 0.0.0.0 --port 50050
node3:
node4:
NCCL_DEBUG=TRACE NCCL_P2P_LEVEL=NVL python3 -m sglang.launch_server --model-path /models/ --tp 32 --dist-init-addr 172.17.120.15:50030 --nnodes 4 --node-rank 3 --trust-remote-code --watchdog-timeout 21600 --disable-cuda-graph --host 0.0.0.0 --port 50050
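Not part of the original report, but one possible way to gather more detail before the timeout is to relaunch with verbose NCCL logging; the NCCL_DEBUG_SUBSYS value is an assumption about which subsystems are useful here, and all other flags are copied from the node1 command above:

# assumed debugging relaunch of node1 (rank 0) with NCCL INFO logging enabled
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL,NET NCCL_P2P_LEVEL=NVL python3 -m sglang.launch_server \
  --model-path /models/ --tp 32 --dist-init-addr 172.17.120.15:50030 \
  --nnodes 4 --node-rank 0 --trust-remote-code --watchdog-timeout 21600 \
  --mem-fraction-static 0.9 --host 0.0.0.0 --port 50050 \
  --served-model-name atom --disable-cuda-graph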
Environment
root@a100-15:/workspace# python3 -m sglang.check_env
WARNING 02-14 08:00:33 cuda.py:23] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING:mistral_common.tokens.tokenizers.multimodal:Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'. Please follow the instructions at opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
warnings.warn(message, UserWarning)
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.90.07
PyTorch: 2.5.1+cu124
sglang: 0.4.2
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.2
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.9.3
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 23.2
psutil: 5.9.4
pydantic: 2.10.6
multipart: 0.0.20
zmq: 25.1.2
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.60.2
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 NODE NODE PXB PXB SYS SYS SYS SYS 0-23,48-71 0 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 NODE NODE PXB PXB SYS SYS SYS SYS 0-23,48-71 0 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 PXB PXB NODE NODE SYS SYS SYS SYS 0-23,48-71 0 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 PXB PXB NODE NODE SYS SYS SYS SYS 0-23,48-71 0 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS NODE NODE PXB PXB 24-47,72-95 1 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS NODE NODE PXB PXB 24-47,72-95 1 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS PXB PXB NODE NODE 24-47,72-95 1 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS PXB PXB NODE NODE 24-47,72-95 1 N/A
NIC0 NODE NODE PXB PXB SYS SYS SYS SYS X PIX NODE NODE SYS SYS SYS SYS
NIC1 NODE NODE PXB PXB SYS SYS SYS SYS PIX X NODE NODE SYS SYS SYS SYS
NIC2 PXB PXB NODE NODE SYS SYS SYS SYS NODE NODE X PIX SYS SYS SYS SYS
NIC3 PXB PXB NODE NODE SYS SYS SYS SYS NODE NODE PIX X SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS X PIX NODE NODE
NIC5 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS PIX X NODE NODE
NIC6 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS NODE NODE X PIX
NIC7 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS NODE NODE PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
ulimit soft: 1048576