[Bug] DeepSeek-R1-BF16 can't output via /v1/chat/completions on 4 nodes × 8 A100 #3572

Open · 2 of 5 tasks · zhaotyer opened this issue Feb 14, 2025 · 1 comment

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

The model loads successfully and the service starts on 4 nodes × 8 A100.
http://0.0.0.0:50050/generate returns output.
http://0.0.0.0:50050/v1/chat/completions blocks and never returns.
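
For reference, a minimal way to exercise both endpoints (a sketch, not taken from the report: the request bodies assume sglang's standard native and OpenAI-compatible schemas, and the model name "atom" comes from --served-model-name in the launch command below):

# Native endpoint -- returns output:
curl http://0.0.0.0:50050/generate -H "Content-Type: application/json" -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 32}}'
# OpenAI-compatible endpoint -- blocks until the NCCL watchdog fires:
curl http://0.0.0.0:50050/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "atom", "messages": [{"role": "user", "content": "Hello"}]}'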

The error is:
[rank16]:[E214 06:46:21.341483897 ProcessGroupNCCL.cpp:1785] [PG ID 2 PG GUID 3 Rank 16] Exception (either an error or timeout) detected by watchdog at work: 9301, last enqueued NCCL work: 9301, last completed NCCL work: 9300.
[rank16]:[E214 06:46:21.341531487 ProcessGroupNCCL.cpp:1834] [PG ID 2 PG GUID 3 Rank 16] Timeout at NCCL work: 9301, last enqueued NCCL work: 9301, last completed NCCL work: 9300.
[rank16]:[E214 06:46:21.341568027 ProcessGroupNCCL.cpp:630] [Rank 16] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank16]:[E214 06:46:21.341649967 ProcessGroupNCCL.cpp:636] [Rank 16] To avoid data inconsistency, we are taking the entire process down.
[rank16]:[E214 06:46:21.343994164 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 16] Process group watchdog thread terminated with exception: [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9301, OpType=ALLREDUCE, NumelIn=129024, NumelOut=129024, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8f6d56c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f8f2342a772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8f23431bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8f2343361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f8f6fd1d5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7f9012012ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f90120a3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 16] Process group watchdog thread terminated with exception: [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9301, OpType=ALLREDUCE, NumelIn=129024, NumelOut=129024, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8f6d56c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f8f2342a772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8f23431bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8f2343361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f8f6fd1d5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7f9012012ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f90120a3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8f6d56c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7f8f230a071b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7f8f6fd1d5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7f9012012ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f90120a3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007f88ddfff640 (most recent call first):

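Note that Timeout(ms)=600000 in the trace appears to be PyTorch's default 10-minute process-group timeout; sglang's --watchdog-timeout flag (set to 21600 below) governs sglang's own scheduler watchdog, so it would not raise this NCCL timeout.

As a diagnostic sketch (not part of the original report), rerunning a node with NCCL and desync debugging enabled can show which ranks are stuck in which collective. NCCL_DEBUG and TORCH_NCCL_DESYNC_DEBUG are standard NCCL/PyTorch environment variables; the remaining flags are as in the Reproduction section below:

# Hypothetical debug rerun for node 1 (rank 0); adjust --node-rank per node.
NCCL_DEBUG=INFO TORCH_NCCL_DESYNC_DEBUG=1 NCCL_P2P_LEVEL=NVL python3 -m sglang.launch_server --model-path /models/ --tp 32 --dist-init-addr 172.17.120.15:50030 --nnodes 4 --node-rank 0 --trust-remote-code --disable-cuda-graph --host 0.0.0.0 --port 50050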
Reproduction

node1:
NCCL_P2P_LEVEL=NVL python3 -m sglang.launch_server --model-path /models/ --tp 32 --dist-init-addr 172.17.120.15:50030 --nnodes 4 --node-rank 0 --trust-remote-code --watchdog-timeout 21600 --mem-fraction-static 0.9 --host 0.0.0.0 --port 50050 --served-model-name atom --disable-cuda-graph
node2:
NCCL_DEBUG=TRACE NCCL_P2P_LEVEL=NVL python3 -m sglang.launch_server --model-path /models/ --tp 32 --dist-init-addr 172.17.120.15:50030 --nnodes 4 --node-rank 1 --trust-remote-code --watchdog-timeout 21600 --disable-cuda-graph --host 0.0.0.0 --port 50050
node3:

node4:
NCCL_DEBUG=TRACE NCCL_P2P_LEVEL=NVL python3 -m sglang.launch_server --model-path /models/ --tp 32 --dist-init-addr 172.17.120.15:50030 --nnodes 4 --node-rank 3 --trust-remote-code --watchdog-timeout 21600 --disable-cuda-graph --host 0.0.0.0 --port 50050

Environment

root@a100-15:/workspace# python3 -m sglang.check_env
WARNING 02-14 08:00:33 cuda.py:23] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING:mistral_common.tokens.tokenizers.multimodal:Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'. Please follow the instructions at opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
* 'fields' has been removed
  warnings.warn(message, UserWarning)
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.90.07
PyTorch: 2.5.1+cu124
sglang: 0.4.2
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.2
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.9.3
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 23.2
psutil: 5.9.4
pydantic: 2.10.6
multipart: 0.0.20
zmq: 25.1.2
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.60.2
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 NODE NODE PXB PXB SYS SYS SYS SYS 0-23,48-71 0 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 NODE NODE PXB PXB SYS SYS SYS SYS 0-23,48-71 0 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 PXB PXB NODE NODE SYS SYS SYS SYS 0-23,48-71 0 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 PXB PXB NODE NODE SYS SYS SYS SYS 0-23,48-71 0 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS NODE NODE PXB PXB 24-47,72-95 1 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS NODE NODE PXB PXB 24-47,72-95 1 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS PXB PXB NODE NODE 24-47,72-95 1 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS PXB PXB NODE NODE 24-47,72-95 1 N/A
NIC0 NODE NODE PXB PXB SYS SYS SYS SYS X PIX NODE NODE SYS SYS SYS SYS
NIC1 NODE NODE PXB PXB SYS SYS SYS SYS PIX X NODE NODE SYS SYS SYS SYS
NIC2 PXB PXB NODE NODE SYS SYS SYS SYS NODE NODE X PIX SYS SYS SYS SYS
NIC3 PXB PXB NODE NODE SYS SYS SYS SYS NODE NODE PIX X SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS X PIX NODE NODE
NIC5 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS PIX X NODE NODE
NIC6 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS NODE NODE X PIX
NIC7 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS NODE NODE PIX X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7

ulimit soft: 1048576

minleminzui self-assigned this Feb 14, 2025
zhyncs (Member) commented Feb 14, 2025:

Please use the latest version, v0.4.3:
https://docs.sglang.ai/start/install.html#method-1-with-pip
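
For reference, the suggested upgrade is roughly the following (a sketch based on the linked install docs; the flashinfer wheel index is an assumption matching the CUDA 12.4 / torch 2.5 environment above):

# Upgrade sglang; the --find-links index supplies matching flashinfer wheels.
pip install --upgrade "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/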
