vLLM getting stuck. Nothing is generated while requests are running and pending. #2731
Comments
Do you have details of the model and hardware so we can try our best to repro this?
I have the same issue when chatting with the backend after applying vLLM in FastChat. A ticket was opened on the FastChat side but has had no response: lm-sys/FastChat#3003
Same as #2728?
We are trying to find a simple way to reproduce. It happened to 3 instances running on A100 SXM. One of them got into this state after 5-10 minutes; the other one worked fine for an hour before getting there. Might be related to #2728, but I don't think quantization is the issue, since we got it stuck with no quantization. It also got stuck on Llama 70B and Mixtral.
@NikolaBorisov Can you try if adding
Have not had luck reproducing this reliably. Will run more experiments and update here.
This worked for me! Thx!
We got it to reproduce. Here is the stack trace. I think #1889 is the cause.
Another stack trace: this one is stuck in the all-reduce. Maybe it is not the sampler.
Added some prints in sampler.py:_random_sample:

```python
logger.info("random_samples: %s", random_samples.shape)
# Find the maximum best_of value of the prompt phase requests.
random_samples = random_samples.cpu()
logger.info("random_samples: %s in cpu", random_samples.shape)
```

It is getting stuck during the random_samples.cpu() call.
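For anyone else trying to capture a stack trace from a hung engine process, here is a minimal sketch using Python's standard faulthandler module. Where you register the handler and the choice of SIGUSR1 are assumptions, not something from this thread; `py-spy dump --pid <pid>` is an alternative that needs no code change.

```python
import faulthandler
import signal
import sys

# Register a handler so that sending SIGUSR1 to the vLLM process dumps the
# Python stack of every thread to stderr. Add this near the top of whatever
# entrypoint you launch (assumption: you can edit that file).
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# When the server appears stuck, run from a shell:
#   kill -USR1 <pid-of-vllm-process>
# and look for the per-thread tracebacks in the server's stderr.
# Note: this only shows Python frames; a hang inside a CUDA/NCCL call will
# appear as the frame that issued that call.
```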
I tried with
cc @WoosukKwon @zhuohan123. If
Very strange. I traced the issue back in time and it started happening after CUDA graphs got added, between 0.2.5 and 0.2.6. At 0.2.6 with enforce_eager I cannot get it to hang. But at 0.3.0, even with enforce_eager, it hangs. However, at 0.3.0 with enforce_eager and disable_custom_all_reduce it stops hanging. To reproduce I do this: It also crashes with llama2-70b, but I was testing with codellama.
On 4xA100 I just send 100 requests, one every 2 seconds, and it usually gets stuck around request 20-30.
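For reference, a minimal load-generation sketch matching the setup described above (100 requests, one every 2 seconds, against the OpenAI-compatible server). The endpoint URL, model name, prompt, and max_tokens are placeholders/assumptions, not taken from this thread.

```python
import threading
import time

import requests

URL = "http://localhost:8000/v1/completions"  # assumption: default api_server port
MODEL = "codellama/CodeLlama-34b-hf"          # assumption: placeholder model name

def send_one(i: int) -> None:
    # Ask for a longish completion so several requests overlap, which is
    # when the hang was observed in this thread.
    resp = requests.post(
        URL,
        json={"model": MODEL,
              "prompt": f"Request {i}: write a long story.",
              "max_tokens": 512},
        timeout=600,
    )
    print(i, resp.status_code)

threads = []
for i in range(100):
    t = threading.Thread(target=send_one, args=(i,))
    t.start()
    threads.append(t)
    time.sleep(2)  # one request every 2 seconds, as described above

for t in threads:
    t.join()
```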
Similar to #2770 @NikolaBorisov
I haven't tested for hours, but with both of those options I cannot get it stuck with 500 requests, while without them it gets stuck quickly. So you managed to get it stuck on 0.3.0 with enforce_eager and disable_custom_all_reduce?
Ah, I just tried this combo; it seems to be working fine for now. Will keep looking.
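For anyone arriving at this thread, a minimal sketch of the combination being discussed, using the offline LLM API. The model name, tensor_parallel_size, and sampling settings are placeholders, and whether disable_custom_all_reduce is accepted as a keyword depends on your vLLM version.

```python
from vllm import LLM, SamplingParams

# Workaround discussed in this thread: run without CUDA graphs and without
# the custom all-reduce kernel. Both options trade some speed for stability.
llm = LLM(
    model="codellama/CodeLlama-34b-hf",  # placeholder model
    tensor_parallel_size=4,
    enforce_eager=True,               # run eagerly instead of capturing CUDA graphs
    disable_custom_all_reduce=True,   # fall back to NCCL all-reduce
)

outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The equivalent server-side flags are --enforce-eager and --disable-custom-all-reduce.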
@NikolaBorisov @flexwang Can you both try #2760 with
@hanzhi713 But even vLLM 0.2.7 (without your custom_all_reduce) has NCCL hangs.
Hi @NikolaBorisov @flexwang, sorry for the bug. The bug occurs when using CUDA graphs (i.e.,
@WoosukKwon Is #2811 ready to try? Should I give it a go?
@WoosukKwon Thanks for the info. However, I looked at
@NikolaBorisov Yes. We just merged the PR. Please try it! @flexwang We observed that the hanging issue was resolved when using CuPy. However, the safest way will still be to use
@WoosukKwon Seems to work. I really want #2845 because the Docker builds are broken.
@NikolaBorisov Thanks for the confirmation! @flexwang Please re-open the issue if the bug persists.
Have the same problem with deepseek-r1-awq on 2 * 8 * A100 (40G).
Key packages:
Start command:
If --disable-custom-all-reduce is not set, it gets stuck within a few minutes.
We are seeing the latest version of vLLM getting stuck randomly after some minutes of work, sometimes after an hour.
The server still receives new requests and can reply to health and metrics, but no tokens are generated and no requests complete.
The server keeps printing the status every 5 seconds, but no tokens are generated, as if the main loop is stuck.
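Since /health keeps responding even while no tokens are produced, a liveness probe that only hits /health will not catch this state. A rough watchdog sketch along those lines: it periodically sends a tiny completion request with a client-side timeout and treats a timeout as a hang. The base URL, model name, and thresholds are assumptions.

```python
import time

import requests

BASE = "http://localhost:8000"        # assumption: default server address
MODEL = "codellama/CodeLlama-34b-hf"  # assumption: placeholder model name

def engine_is_generating(timeout_s: float = 30.0) -> bool:
    """Return True if the server can still produce at least one token."""
    try:
        resp = requests.post(
            f"{BASE}/v1/completions",
            json={"model": MODEL, "prompt": "ping", "max_tokens": 1},
            timeout=timeout_s,
        )
        return resp.ok
    except requests.exceptions.RequestException:
        return False

while True:
    try:
        healthy = requests.get(f"{BASE}/health", timeout=5).ok
    except requests.exceptions.RequestException:
        healthy = False
    if healthy and not engine_is_generating():
        print("Server answers /health but generates no tokens -- likely hung.")
        # e.g. alert or restart the server here
    time.sleep(60)
```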