vLLM getting stuck. Nothing is generated while requests are running and pending. #2731

Closed
NikolaBorisov opened this issue Feb 3, 2024 · 27 comments

Comments

@NikolaBorisov
Contributor

NikolaBorisov commented Feb 3, 2024

We are seeing the latest version of vLLM get stuck randomly after a few minutes of work, sometimes after an hour.

The server still receives new requests and can reply to /health and /metrics, but no tokens are generated and no requests complete.
The server keeps printing its status every 5 seconds, but no tokens are generated, as if the engine loop is stuck.

INFO 02-01 06:36:05 llm_engine.py:921] Avg prompt throughput: 382.6 tokens/s, Avg generation throughput: 118.5 tokens/s, Max iteration time: 386.7 ms, Avg time/tok:149.4 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 115 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO 02-01 06:36:05 async_llm_engine.py:110] Finished request cmpl-50c32d7a66084c3f9980d2bf06d79900-0.
INFO 02-01 06:36:05 async_llm_engine.py:110] Finished request cmpl-90590e17ce6b4fa4b19f0812c0c98446-0.
INFO:     10.244.5.235:41834 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     10.244.5.237:53262 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-01 06:36:05 async_llm_engine.py:436] Received request cmpl-a523b4f84b1b491d9f61ddc4558f532b-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:05 async_llm_engine.py:436] Received request cmpl-25d5d0f7555c46f588570cc83d3a0f81-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO:     10.244.6.107:43538 - "GET /metrics HTTP/1.1" 200 OK
INFO 02-01 06:36:05 async_llm_engine.py:436] Received request cmpl-cf7f4d9b34144b3f8efc55498f75c782-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:07 async_llm_engine.py:436] Received request cmpl-fe5c335b41654d2b9e1141819f92e762-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:08 async_llm_engine.py:436] Received request cmpl-07d703ba4be6400d95704ae748e9c752-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:10 async_llm_engine.py:436] Received request cmpl-61ddd4fa2c074d108048d88e884f5bef-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:10 llm_engine.py:921] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 7.2 tokens/s, Max iteration time: 107.8 ms, Avg time/tok:107.8 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 116 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO:     10.244.39.1:46396 - "GET /health HTTP/1.1" 200 OK
INFO 02-01 06:36:11 async_llm_engine.py:436] Received request cmpl-37d1f11755354de88177e21d466f9ae4-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:12 async_llm_engine.py:436] Received request cmpl-bfd5818ad84548fdb8fbd3ed075d8a00-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:12 async_llm_engine.py:436] Received request cmpl-cd834640585140fabf9f9f5342d08617-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.5, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['USER:', 'ASSISTANT:', 'Reference(s):', 'Note:'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=250, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:13 async_llm_engine.py:436] Received request cmpl-d0a32735719f4425ac7bcc47d73e4c6a-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO:     10.244.41.46:34592 - "GET /health HTTP/1.1" 200 OK
INFO 02-01 06:36:15 async_llm_engine.py:436] Received request cmpl-28df8dd0e22140d09d2eb497eabb2ae6-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:16 async_llm_engine.py:436] Received request cmpl-0891eb16b2734aa0be69ac560ce76262-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:18 async_llm_engine.py:436] Received request cmpl-bfe0ccd536e84b858fa4a6455cc3c84e-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:18 async_llm_engine.py:436] Received request cmpl-fa92185c03d64100a87b055f8de9ebec-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:18 async_llm_engine.py:436] Received request cmpl-75fbe42838b44889b2365a8f896769d9-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:20 async_llm_engine.py:436] Received request cmpl-9a7be2f4394747bc86b924fff8729e53-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO:     10.244.6.107:41448 - "GET /metrics HTTP/1.1" 200 OK
INFO 02-01 06:36:20 llm_engine.py:921] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Max iteration time: 0.0 ms, Avg time/tok:0.0 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 116 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO:     10.244.39.1:50926 - "GET /health HTTP/1.1" 200 OK
INFO:     10.244.41.46:51998 - "GET /health HTTP/1.1" 200 OK
INFO 02-01 06:36:28 async_llm_engine.py:436] Received request cmpl-2ad4077818664575a61e96651cc8ff02-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:28 async_llm_engine.py:436] Received request cmpl-ee0f799a0def4b3abda0e6e3b782fc9a-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:28 async_llm_engine.py:436] Received request cmpl-452bd3a0cb2143feb93ff7c253350954-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:30 llm_engine.py:921] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Max iteration time: 0.0 ms, Avg time/tok:0.0 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 116 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO:     10.244.39.1:46878 - "GET /health HTTP/1.1" 200 OK
INFO:     10.244.41.46:54412 - "GET /health HTTP/1.1" 200 OK
@simon-mo
Collaborator

simon-mo commented Feb 3, 2024

Do you have details of the model and hardware so we can try our best to repro this?

@hanswang1

hanswang1 commented Feb 3, 2024

I have the same issue when chatting with the back-end after applying vLLM to FastChat.
When vLLM is not used, FastChat shows no issue.

A ticket was opened on the FastChat side but has had no response: lm-sys/FastChat#3003

@SebastianBodza

Same as #2728?

@NikolaBorisov
Contributor Author

We are trying to find a simple way to reproduce this. It happened to 3 instances running on A100 SXM. One of them got into this state after 5-10 min; another worked fine for an hour before getting there. Might be related to #2728, but I don't think quantization is the issue, since we got it stuck with no quantization. It also got stuck on Llama 70B and Mixtral.

@hanzhi713
Contributor

@NikolaBorisov Can you try whether adding disable_custom_all_reduce=True helps?
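(For anyone trying the same workaround: disable_custom_all_reduce is a regular engine argument, so a minimal sketch with the offline API could look like the lines below. The model name is only a placeholder, and the OpenAI server exposes the same switch as the --disable-custom-all-reduce flag mentioned later in this thread.)

from vllm import LLM, SamplingParams

# Hedged sketch, not from the original report: pass the suggested flag through
# the offline LLM API. The model here is a placeholder, not the one under test.
llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,           # the reported hangs involve tensor parallelism
    disable_custom_all_reduce=True,   # the workaround being suggested
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)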

@NikolaBorisov
Contributor Author

Have not had luck reproducing this reliably. Will run more experiments and update here.

@MrWaterZhou

@NikolaBorisov Can you try whether adding disable_custom_all_reduce=True helps?

This worked for me! Thx!

@NikolaBorisov
Contributor Author

We got it to reproduce. Here is the stack trace. I think #1889 is the cause.

Thread 0x00007f6cdc9f4640 (most recent call first):
  File "/workspace/vllm/model_executor/layers/sampler.py", line 263 in _random_sample
  File "/workspace/vllm/model_executor/layers/sampler.py", line 411 in _sample
  File "/workspace/vllm/model_executor/layers/sampler.py", line 108 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/workspace/vllm/model_executor/models/llama.py", line 314 in sample
  File "/workspace/vllm/worker/model_runner.py", line 542 in execute_model
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/workspace/vllm/worker/worker.py", line 213 in execute_model
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58 in run
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 83 in _worker
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

@NikolaBorisov
Contributor Author

Another stack trace: this one is stuck in the all-reduce. Maybe it is not the sampler.

  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2050 in all_reduce
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47 in wrapper
  File "/workspace/vllm/model_executor/parallel_utils/communication_op.py", line 24 in tensor_model_parallel_all_reduce
  File "/workspace/vllm/model_executor/layers/linear.py", line 548 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/workspace/vllm/model_executor/models/llama.py", line 78 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/workspace/vllm/model_executor/models/llama.py", line 218 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/workspace/vllm/model_executor/models/llama.py", line 254 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/workspace/vllm/model_executor/models/llama.py", line 286 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
...

@NikolaBorisov
Contributor Author

I added some prints in sampler.py's _random_sample:

    logger.info("random_samples: %s", random_samples.shape)
    # Find the maximum best_of value of the prompt phase requests.
    random_samples = random_samples.cpu()
    logger.info("random_samples: %s in cpu", random_samples.shape)

It is getting stuck during the .cpu() operation. This is with the custom all-reduce enabled; when it is disabled, it gets stuck in the PyTorch all-reduce instead.

INFO 02-07 01:07:59 sampler.py:264] random_samples: torch.Size([4, 1])
INFO 02-07 01:07:59 sampler.py:267] random_samples: torch.Size([4, 1]) in cpu
INFO 02-07 01:07:59 sampler.py:264] random_samples: torch.Size([4, 1])
INFO 02-07 01:07:59 sampler.py:267] random_samples: torch.Size([4, 1]) in cpu
INFO 02-07 01:07:59 sampler.py:264] random_samples: torch.Size([4, 1])
INFO 02-07 01:08:03 async_llm_engine.py:436] Received request cmpl-e3925e2097d849d8918ac292502d604d-0: ...
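(A note on why the stall surfaces at .cpu(): CUDA work is launched asynchronously, so a kernel or collective that never completes only becomes visible at the first call that has to wait for the device, and a host copy is such a call. A minimal illustration, not vLLM code:)

import torch

x = torch.randn(4, 1, device="cuda")
y = x.cpu()                # blocks until all prior work on the stream has finished
# roughly the same as synchronizing explicitly before copying:
torch.cuda.synchronize()
y = x.to("cpu")

(This is consistent with the later observation that the actual hang is in the all-reduce / CUDA graph path rather than in the sampler itself.)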

@NikolaBorisov
Contributor Author

I tried with disable_custom_all_reduce=True and it still gets stuck. It seems to be stuck in PyTorch distributed.

Thread 0x00007f86d99fc640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2050 in all_reduce
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47 in wrapper
  File "/workspace/vllm/model_executor/parallel_utils/communication_op.py", line 34 in tensor_model_parallel_all_reduce
  File "/workspace/vllm/model_executor/layers/linear.py", line 548 in forward

@simon-mo
Collaborator

simon-mo commented Feb 7, 2024

cc @WoosukKwon @zhuohan123. If disable_custom_all_reduce=True doesn't solve it, then the problem lies somewhere in torch NCCL and CUDA graphs. @NikolaBorisov, can you try enforce_eager next?
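(For reference, a minimal sketch of the enforce_eager suggestion with the offline API is below; enforce_eager=True skips CUDA graph capture and runs the model eagerly. The OpenAI server exposes the same option as the --enforce-eager flag. The model name is taken from the repro in the next comment.)

from vllm import LLM

# Hedged sketch: disable CUDA graphs, which is the code path under suspicion.
llm = LLM(
    model="codellama/CodeLlama-70b-Instruct-hf",
    tensor_parallel_size=4,
    enforce_eager=True,
)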

@NikolaBorisov
Contributor Author

Very strange. I traced the issue back in time, and it started happening after CUDA graphs were added, between 0.2.5 and 0.2.6. At 0.2.6 with enforce_eager I cannot get it to hang, but at 0.3.0 it hangs even with enforce_eager. However, at 0.3.0 with both enforce_eager and disable_custom_all_reduce it stops hanging.

To reproduce, I do this (it also gets stuck with llama2-70b, but I was testing with codellama):

docker run -it --rm --gpus='"device=4,5,6,7"' -p 8000:8000 --shm-size=40g -v /data/tgi-data:/data vllm:test1 --model codellama/CodeLlama-70b-Instruct-hf --download-dir=/data --tensor-parallel-size=4

on 4xA100

I just send 100 requests, one every 2 seconds, and it usually gets stuck around request 20-30.
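(A rough sketch of such a load generator, assuming the OpenAI-compatible server started by the docker command above; the prompt text and the use of the requests library are only for illustration.)

import threading
import time

import requests  # any HTTP client works

URL = "http://localhost:8000/v1/completions"

def send_one(i: int) -> None:
    payload = {
        "model": "codellama/CodeLlama-70b-Instruct-hf",
        "prompt": f"Write a detailed explanation of topic number {i}.",
        "max_tokens": 512,
    }
    resp = requests.post(URL, json=payload, timeout=600)
    print(i, resp.status_code)

threads = []
for i in range(100):           # 100 requests, one every 2 seconds
    t = threading.Thread(target=send_one, args=(i,))
    t.start()
    threads.append(t)
    time.sleep(2)
for t in threads:
    t.join()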

@gyin94

gyin94 commented Feb 9, 2024

@flexwang

flexwang commented Feb 9, 2024

similar to #2770

@NikolaBorisov "However at 0.3.0 with enforce_eager and disable_custom_all_reduce it stops hanging." Is it working 100% or just luck? On my side, it sometimes works fine for a few hours, but it just works until it doesn't.

@NikolaBorisov
Contributor Author

similar to #2770

@NikolaBorisov "However at 0.3.0 with enforce_eager and disable_custom_all_reduce it stops hanging." Is it working 100% or just luck? On my side, it sometimes works fine for a few hours, but it just works until it doesn't.

I haven't tested for hours, but with both of those options I cannot get it stuck with 500 requests, while without them it gets stuck quickly. So you managed to get it stuck with 0.3.0 with enforce_eager and disable_custom_all_reduce?

@flexwang

flexwang commented Feb 9, 2024

So you managed to get it stuck with 0.3.0 with enforce_eager and disable_custom_all_reduce

Ah, I just tried this combo; it seems to be working fine for now. Will keep looking.

@hanzhi713
Contributor

hanzhi713 commented Feb 10, 2024

@NikolaBorisov @flexwang Can you both try #2760 with enforce_eager=True and disable_custom_all_reduce=False? I'm hoping that some synchronization fixes would be enough.

ali-firstparty added a commit to firstpartyinc/vllm that referenced this issue Feb 13, 2024
@flexwang

@hanzhi713 But even vLLM 0.2.7 (without your custom_all_reduce) has NCCL hangs.

@WoosukKwon
Collaborator

Hi @NikolaBorisov @flexwang, sorry for the bug. The bug occurs when using CUDA graphs (i.e., enforce_eager=False) regardless of the custom all reduce kernels. It will be fixed by #2811.

@NikolaBorisov
Contributor Author

@hanzhi713 but even vLLM 0.2.7 (without your custom_all_reduce) has NCCL hangs.

There is one hang in CUDA graphs and one in custom_all_reduce. @hanzhi713 is trying to fix the custom all-reduce. I was going to try #2760, but another bug in the Dockerfile stopped me from trying it.

@WoosukKwon is #2811 ready to try? Should I give it a go?

@flexwang

Hi @NikolaBorisov @flexwang, sorry for the bug. The bug occurs when using CUDA graphs (i.e., enforce_eager=False) regardless of the custom all reduce kernels. It will be fixed by #2811.

@WoosukKwon Thanks for the info. However, I looked at top when the hang happens, and the memory seems fine. Are we sure this is due to a memory leak?

@WoosukKwon
Collaborator

@NikolaBorisov Yes. We just merged the PR. Please try it!

@flexwang We observed that the hanging issue was resolved when using Cupy. However, the safest way will still be to use enforce_eager=True.

@NikolaBorisov
Contributor Author

@WoosukKwon Seems to work. I really want #2845 because the Docker builds are broken.

@WoosukKwon
Collaborator

@NikolaBorisov Thanks for the confirmation!

@flexwang Please re-open the issue if the bug persists.

@Muttermal

Hi, I also encountered the same problem. I use the model for local inference, and when the inference is almost complete, vLLM gets stuck.
[screenshot: generation progress bar stuck at the last 7%]
The last 7% has been stuck for a long time. When I interrupt the program, the script is internally stuck at:
[screenshot: traceback showing where the script is blocked]

My inference script is as follows:

from vllm import LLM, SamplingParams  # import implied by the snippet

llm = LLM(
    model=model_path,              # local model path (defined elsewhere in my script)
    tensor_parallel_size=4,
    max_model_len=12288,
    enforce_eager=True,
    disable_custom_all_reduce=False,
    # trust_remote_code=True
)
sampling_params = SamplingParams(
    temperature=temperature,       # sampling settings defined elsewhere
    top_k=top_k,
    top_p=top_p,
    repetition_penalty=1.05,
    # stop_token_ids=model_conv.stop_token_ids,
    max_tokens=12288,
)
llm_outputs = llm.generate(
    all_conversations,             # my data
    sampling_params,
    use_tqdm=True,
)

The GPUs I use are 4 × A40 (48 GB), my CUDA driver version is 12.2, and my environment is as follows:

vllm                          0.4.0
torch                         2.1.2
ray                           2.9.3
flash-attn                    2.5.6
transformers                  4.38.2

Is there any solution?

@huang-junhong

I have the same problem with deepseek-r1-awq on 2 * 8 * A100 (40G).

Key packages:
vllm 0.7.2
torch 2.5.1
ray 0.42.0

Start command:
VLLM_LOGGING_LEVEL=DEBUG python -m vllm.entrypoints.openai.api_server --model /path/to/model --served-model-name name --port port --pipeline-parallel-size 2 --tensor-parallel-size 8 --max_model_len 8192 --enable-reasoning --reasoning-parser deepseek_r1 --dtype float16 --trust-remote-code --gpu-memory-utilization 0.8 --quantization moe_wna16 --max-num-seqs 1 --disable-custom-all-reduce

If --disable-custom-all-reduce is not set, it gets stuck within a few minutes. With --disable-custom-all-reduce set, it stays alive longer, but after some time it still gets stuck.
