NCCL hanging during inference #2770

Closed
flexwang opened this issue Feb 5, 2024 · 6 comments
Labels
bug Something isn't working

Comments


flexwang commented Feb 5, 2024

With vLLM v0.2.7, I saw NCCL hanging on an all-reduce:

(RayWorkerVllm pid=5085) [E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20518, OpType=ALLREDUCE, NumelIn=106496, NumelOut=106496, Timeout(ms)=1800000) ran for 1800270 milliseconds before timing out.

After switching to v0.3.0 (with custom all-reduce), it hangs on a gather instead:

(RayWorkerVllm pid=4775) [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=369526, OpType=GATHER, NumelIn=4000, NumelOut=0, Timeout(ms)=1800000) ran for 1800252 milliseconds before timing out.

flexwang commented Feb 6, 2024

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/cudagraph.html

> Having multiple outstanding NCCL operations that are any combination of graph-captured or non-captured is supported. There is a caveat that the mechanism NCCL uses internally to accomplish this has been seen to cause CUDA to deadlock when the graphs of multiple communicators are cudaGraphLaunch()’d from the same thread. To disable this mechanism see the environment variable NCCL_GRAPH_MIXING_SUPPORT.

Looks like there is an issue between NCCL and CUDA graphs. I used enforce_eager and it seems to be fixed.
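
For reference, a minimal sketch of the workaround, assuming vLLM's enforce_eager flag and the NCCL_GRAPH_MIXING_SUPPORT variable from the docs above (model name and tensor-parallel size are placeholders; with Ray workers the env var may also need to be set in each worker's environment):

```python
import os

# Disable the graph-mixing mechanism described in the NCCL docs above.
# Must be set before any NCCL communicator is created; with Ray workers
# it may also need to be exported on each worker process.
os.environ["NCCL_GRAPH_MIXING_SUPPORT"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder model
    tensor_parallel_size=8,             # placeholder TP size
    enforce_eager=True,                 # skip CUDA graph capture entirely
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```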

WoosukKwon added the bug label Feb 6, 2024

flexwang commented Feb 7, 2024

> Looks like there is an issue between NCCL and CUDA graphs. I used enforce_eager and it seems to be fixed.

More updates: even after disabling CUDA graphs by setting enforce_eager to true, we still see the NCCL hanging issue.

(RayWorkerVllm pid=4850) [E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1494, OpType=GATHER, NumelIn=76000, NumelOut=0, Timeout(ms)=1800000) ran for 1800855 milliseconds before timing out.

NikolaBorisov (Contributor) commented:

I reported the same issue in #2731. You need enforce_eager and disable_custom_all_reduce to fix it. But the real issue is how to fix CUDA graphs and custom all-reduce so they work together.
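
A sketch of that combination, assuming both flags are exposed on the LLM entrypoint at this version (model and tensor-parallel size are placeholders):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder
    tensor_parallel_size=8,             # placeholder
    enforce_eager=True,                 # no CUDA graph capture
    disable_custom_all_reduce=True,     # fall back to NCCL all-reduce
)
```

If I'm reading the engine args correctly, the OpenAI-compatible server should accept the equivalent --enforce-eager and --disable-custom-all-reduce CLI flags.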

WoosukKwon (Collaborator) commented:

Closed as #2811 fixes this. Please feel free to re-open the issue if you find the bug persists.


LokiLiu commented Feb 21, 2024

@WoosukKwon I met the same problem, but after upgrading to PyTorch 2.2.0 it was resolved.
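
In case it helps others, a quick sanity check of the installed versions (the upgrade command in the comment is only illustrative; pick the wheel matching your CUDA version):

```python
import torch
import vllm

print(torch.__version__)   # expect 2.2.0 or newer
print(torch.version.cuda)  # CUDA version the torch wheel was built against
print(vllm.__version__)
# Illustrative upgrade: pip install --upgrade "torch>=2.2.0" vllm
```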

WangErXiao (Contributor) commented:

> @WoosukKwon I met the same problem, but after upgrading to PyTorch 2.2.0 it was resolved.

Hi, do you use v0.2.7 or v0.3.0?
