NCCL hanging during inference #2770

Closed
flexwang opened this issue Feb 5, 2024 · 6 comments
Labels
bug Something isn't working

Comments


flexwang commented Feb 5, 2024

With vLLM v0.2.7, I saw NCCL hanging on an all-reduce:

(RayWorkerVllm pid=5085) [E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20518, OpType=ALLREDUCE, NumelIn=106496, NumelOut=106496, Timeout(ms)=1800000) ran for 1800270 milliseconds before timing out.

After switching to v0.3.0 (with custom all-reduce), it hangs on a gather instead:

(RayWorkerVllm pid=4775) [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=369526, OpType=GATHER, NumelIn=4000, NumelOut=0, Timeout(ms)=1800000) ran for 1800252 milliseconds before timing out.

flexwang commented Feb 6, 2024

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/cudagraph.html

> Having multiple outstanding NCCL operations that are any combination of graph-captured or non-captured is supported. There is a caveat that the mechanism NCCL uses internally to accomplish this has been seen to cause CUDA to deadlock when the graphs of multiple communicators are cudaGraphLaunch()’d from the same thread. To disable this mechanism see the environment variable NCCL_GRAPH_MIXING_SUPPORT.

Looks like there is an issue between NCCL and CUDA graphs. I used enforce_eager and it seems to be fixed.
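
For reference, a minimal sketch of the workaround, assuming vLLM's enforce_eager flag and the NCCL_GRAPH_MIXING_SUPPORT variable from the docs above (model name and tensor-parallel size are placeholders; with Ray workers the env var may also need to be set in each worker's environment):

```python
import os

# Disable the graph-mixing mechanism described in the NCCL docs above.
# Must be set before any NCCL communicator is created; with Ray workers
# it may also need to be exported on each worker process.
os.environ["NCCL_GRAPH_MIXING_SUPPORT"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder model
    tensor_parallel_size=8,             # placeholder TP size
    enforce_eager=True,                 # skip CUDA graph capture entirely
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```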

WoosukKwon added the bug label Feb 6, 2024

flexwang commented Feb 7, 2024

> Looks like there is an issue between NCCL and CUDA graphs. I used enforce_eager and it seems to be fixed.

More updates: even after disabling CUDA graphs by setting enforce_eager to true, we still see the NCCL hanging issue.

(RayWorkerVllm pid=4850) [E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1494, OpType=GATHER, NumelIn=76000, NumelOut=0, Timeout(ms)=1800000) ran for 1800855 milliseconds before timing out.

NikolaBorisov (Contributor) commented:

I reported the same issue in #2731. You need enforce_eager and disable_custom_all_reduce to fix it. But the real issue is how to fix CUDA graphs and custom all-reduce so they work together.
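
A sketch of that combination, assuming both flags are exposed on the LLM entrypoint at this version (model and tensor-parallel size are placeholders):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder
    tensor_parallel_size=8,             # placeholder
    enforce_eager=True,                 # no CUDA graph capture
    disable_custom_all_reduce=True,     # fall back to NCCL all-reduce
)
```

If I'm reading the engine args correctly, the OpenAI-compatible server should accept the equivalent --enforce-eager and --disable-custom-all-reduce CLI flags.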

WoosukKwon (Collaborator) commented:

Closed as #2811 fixes this. Please feel free to re-open the issue if you find the bug persists.


LokiLiu commented Feb 21, 2024

@WoosukKwon I met the same problem, but after upgrading to PyTorch 2.2.0 it was resolved.
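
In case it helps others, a quick sanity check of the installed versions (the upgrade command in the comment is only illustrative; pick the wheel matching your CUDA version):

```python
import torch
import vllm

print(torch.__version__)   # expect 2.2.0 or newer
print(torch.version.cuda)  # CUDA version the torch wheel was built against
print(vllm.__version__)
# Illustrative upgrade: pip install --upgrade "torch>=2.2.0" vllm
```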

WangErXiao (Contributor) commented:

> @WoosukKwon I met the same problem, but after upgrading to PyTorch 2.2.0 it was resolved.

Hi, do you use v0.2.7 or v0.3.0?
