vLLM getting stuck. Nothing is generated while requests are running and pending. #2731

Closed
NikolaBorisov opened this issue Feb 3, 2024 · 27 comments

Comments

@NikolaBorisov
Contributor

NikolaBorisov commented Feb 3, 2024

We are seeing the latest version of vLLM get stuck randomly after a few minutes of work, sometimes after an hour.

The server still receives new requests and can reply to /health and /metrics, but no tokens are generated and no requests complete.
The server keeps printing its status every 5 seconds, but no tokens are generated, as if the engine loop is stuck.

INFO 02-01 06:36:05 llm_engine.py:921] Avg prompt throughput: 382.6 tokens/s, Avg generation throughput: 118.5 tokens/s, Max iteration time: 386.7 ms, Avg time/tok:149.4 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 115 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO 02-01 06:36:05 async_llm_engine.py:110] Finished request cmpl-50c32d7a66084c3f9980d2bf06d79900-0.
INFO 02-01 06:36:05 async_llm_engine.py:110] Finished request cmpl-90590e17ce6b4fa4b19f0812c0c98446-0.
INFO:     10.244.5.235:41834 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     10.244.5.237:53262 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-01 06:36:05 async_llm_engine.py:436] Received request cmpl-a523b4f84b1b491d9f61ddc4558f532b-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:05 async_llm_engine.py:436] Received request cmpl-25d5d0f7555c46f588570cc83d3a0f81-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO:     10.244.6.107:43538 - "GET /metrics HTTP/1.1" 200 OK
INFO 02-01 06:36:05 async_llm_engine.py:436] Received request cmpl-cf7f4d9b34144b3f8efc55498f75c782-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:07 async_llm_engine.py:436] Received request cmpl-fe5c335b41654d2b9e1141819f92e762-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:08 async_llm_engine.py:436] Received request cmpl-07d703ba4be6400d95704ae748e9c752-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:10 async_llm_engine.py:436] Received request cmpl-61ddd4fa2c074d108048d88e884f5bef-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:10 llm_engine.py:921] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 7.2 tokens/s, Max iteration time: 107.8 ms, Avg time/tok:107.8 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 116 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO:     10.244.39.1:46396 - "GET /health HTTP/1.1" 200 OK
INFO 02-01 06:36:11 async_llm_engine.py:436] Received request cmpl-37d1f11755354de88177e21d466f9ae4-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:12 async_llm_engine.py:436] Received request cmpl-bfd5818ad84548fdb8fbd3ed075d8a00-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:12 async_llm_engine.py:436] Received request cmpl-cd834640585140fabf9f9f5342d08617-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.5, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['USER:', 'ASSISTANT:', 'Reference(s):', 'Note:'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=250, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:13 async_llm_engine.py:436] Received request cmpl-d0a32735719f4425ac7bcc47d73e4c6a-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO:     10.244.41.46:34592 - "GET /health HTTP/1.1" 200 OK
INFO 02-01 06:36:15 async_llm_engine.py:436] Received request cmpl-28df8dd0e22140d09d2eb497eabb2ae6-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:16 async_llm_engine.py:436] Received request cmpl-0891eb16b2734aa0be69ac560ce76262-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:18 async_llm_engine.py:436] Received request cmpl-bfe0ccd536e84b858fa4a6455cc3c84e-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:18 async_llm_engine.py:436] Received request cmpl-fa92185c03d64100a87b055f8de9ebec-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:18 async_llm_engine.py:436] Received request cmpl-75fbe42838b44889b2365a8f896769d9-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:20 async_llm_engine.py:436] Received request cmpl-9a7be2f4394747bc86b924fff8729e53-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO:     10.244.6.107:41448 - "GET /metrics HTTP/1.1" 200 OK
INFO 02-01 06:36:20 llm_engine.py:921] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Max iteration time: 0.0 ms, Avg time/tok:0.0 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 116 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO:     10.244.39.1:50926 - "GET /health HTTP/1.1" 200 OK
INFO:     10.244.41.46:51998 - "GET /health HTTP/1.1" 200 OK
INFO 02-01 06:36:28 async_llm_engine.py:436] Received request cmpl-2ad4077818664575a61e96651cc8ff02-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:28 async_llm_engine.py:436] Received request cmpl-ee0f799a0def4b3abda0e6e3b782fc9a-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:28 async_llm_engine.py:436] Received request cmpl-452bd3a0cb2143feb93ff7c253350954-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:30 llm_engine.py:921] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Max iteration time: 0.0 ms, Avg time/tok:0.0 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 116 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO:     10.244.39.1:46878 - "GET /health HTTP/1.1" 200 OK
INFO:     10.244.41.46:54412 - "GET /health HTTP/1.1" 200 OK
@simon-mo
Collaborator

simon-mo commented Feb 3, 2024

Do you have details of the model and hardware so we can try our best to repro this?

@hanswang1

hanswang1 commented Feb 3, 2024

I have the same issue when chatting with the back-end after applying vLLM to FastChat.
When vLLM is not used, FastChat shows no issue.

A ticket was opened on the FastChat side but has had no response: lm-sys/FastChat#3003

@SebastianBodza

Same as #2728?

@NikolaBorisov
Contributor Author

We are trying to find a simple way to reproduce this. It happened to 3 instances running on A100 SXM. One of them got into this state after 5-10 min; another worked fine for an hour before getting there. Might be related to #2728, but I don't think quantization is the issue, since we got it stuck with no quantization. It also got stuck on Llama 70B and Mixtral.

@hanzhi713
Contributor

@NikolaBorisov Can you try whether adding disable_custom_all_reduce=True helps?
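(For anyone trying the same workaround: disable_custom_all_reduce is a regular engine argument, so a minimal sketch with the offline API could look like the lines below. The model name is only a placeholder, and the OpenAI server exposes the same switch as the --disable-custom-all-reduce flag mentioned later in this thread.)

from vllm import LLM, SamplingParams

# Hedged sketch, not from the original report: pass the suggested flag through
# the offline LLM API. The model here is a placeholder, not the one under test.
llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,           # the reported hangs involve tensor parallelism
    disable_custom_all_reduce=True,   # the workaround being suggested
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)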

@NikolaBorisov
Contributor Author

Have not had luck reproducing this reliably. Will run more experiments and update here.

@MrWaterZhou

@NikolaBorisov Can you try whether adding disable_custom_all_reduce=True helps?

This worked for me! Thx!

@NikolaBorisov
Contributor Author

We got it to reproduce. Here is the stack trace. I think #1889 is the cause.

Thread 0x00007f6cdc9f4640 (most recent call first):
  File "/workspace/vllm/model_executor/layers/sampler.py", line 263 in _random_sample
  File "/workspace/vllm/model_executor/layers/sampler.py", line 411 in _sample
  File "/workspace/vllm/model_executor/layers/sampler.py", line 108 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/workspace/vllm/model_executor/models/llama.py", line 314 in sample
  File "/workspace/vllm/worker/model_runner.py", line 542 in execute_model
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/workspace/vllm/worker/worker.py", line 213 in execute_model
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58 in run
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 83 in _worker
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

@NikolaBorisov
Contributor Author

Another stack trace: this one is stuck in the all-reduce. Maybe it is not the sampler.

  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2050 in all_reduce
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47 in wrapper
  File "/workspace/vllm/model_executor/parallel_utils/communication_op.py", line 24 in tensor_model_parallel_all_reduce
  File "/workspace/vllm/model_executor/layers/linear.py", line 548 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/workspace/vllm/model_executor/models/llama.py", line 78 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/workspace/vllm/model_executor/models/llama.py", line 218 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/workspace/vllm/model_executor/models/llama.py", line 254 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/workspace/vllm/model_executor/models/llama.py", line 286 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
...

@NikolaBorisov
Contributor Author

I added some prints in sampler.py's _random_sample:

    logger.info("random_samples: %s", random_samples.shape)
    # Find the maximum best_of value of the prompt phase requests.
    random_samples = random_samples.cpu()
    logger.info("random_samples: %s in cpu", random_samples.shape)

It is getting stuck during the .cpu() operation. This is with the custom all-reduce enabled; when it is disabled, it gets stuck in the PyTorch all-reduce instead.

INFO 02-07 01:07:59 sampler.py:264] random_samples: torch.Size([4, 1])
INFO 02-07 01:07:59 sampler.py:267] random_samples: torch.Size([4, 1]) in cpu
INFO 02-07 01:07:59 sampler.py:264] random_samples: torch.Size([4, 1])
INFO 02-07 01:07:59 sampler.py:267] random_samples: torch.Size([4, 1]) in cpu
INFO 02-07 01:07:59 sampler.py:264] random_samples: torch.Size([4, 1])
INFO 02-07 01:08:03 async_llm_engine.py:436] Received request cmpl-e3925e2097d849d8918ac292502d604d-0: ...
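(A note on why the stall surfaces at .cpu(): CUDA work is launched asynchronously, so a kernel or collective that never completes only becomes visible at the first call that has to wait for the device, and a host copy is such a call. A minimal illustration, not vLLM code:)

import torch

x = torch.randn(4, 1, device="cuda")
y = x.cpu()                # blocks until all prior work on the stream has finished
# roughly the same as synchronizing explicitly before copying:
torch.cuda.synchronize()
y = x.to("cpu")

(This is consistent with the later observation that the actual hang is in the all-reduce / CUDA graph path rather than in the sampler itself.)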

@NikolaBorisov
Contributor Author

I tried with disable_custom_all_reduce=True and it still gets stuck. It seems to be stuck in PyTorch distributed.

Thread 0x00007f86d99fc640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2050 in all_reduce
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47 in wrapper
  File "/workspace/vllm/model_executor/parallel_utils/communication_op.py", line 34 in tensor_model_parallel_all_reduce
  File "/workspace/vllm/model_executor/layers/linear.py", line 548 in forward

@simon-mo
Collaborator

simon-mo commented Feb 7, 2024

cc @WoosukKwon @zhuohan123. If disable_custom_all_reduce=True doesn't solve it, then the problem lies somewhere in torch NCCL and CUDA graphs. @NikolaBorisov, can you try enforce_eager next?
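(For reference, a minimal sketch of the enforce_eager suggestion with the offline API is below; enforce_eager=True skips CUDA graph capture and runs the model eagerly. The OpenAI server exposes the same option as the --enforce-eager flag. The model name is taken from the repro in the next comment.)

from vllm import LLM

# Hedged sketch: disable CUDA graphs, which is the code path under suspicion.
llm = LLM(
    model="codellama/CodeLlama-70b-Instruct-hf",
    tensor_parallel_size=4,
    enforce_eager=True,
)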

@NikolaBorisov
Contributor Author

Very strange. I traced the issue back in time, and it started happening after CUDA graphs were added, between 0.2.5 and 0.2.6. At 0.2.6 with enforce_eager I cannot get it to hang, but at 0.3.0 it hangs even with enforce_eager. However, at 0.3.0 with both enforce_eager and disable_custom_all_reduce it stops hanging.

To reproduce, I do this (it also gets stuck with llama2-70b, but I was testing with codellama):

docker run -it --rm --gpus='"device=4,5,6,7"' -p 8000:8000 --shm-size=40g -v /data/tgi-data:/data vllm:test1 --model codellama/CodeLlama-70b-Instruct-hf --download-dir=/data --tensor-parallel-size=4

on 4xA100

I just send 100 requests, one every 2 seconds, and it usually gets stuck around request 20-30.
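(A rough sketch of such a load generator, assuming the OpenAI-compatible server started by the docker command above; the prompt text and the use of the requests library are only for illustration.)

import threading
import time

import requests  # any HTTP client works

URL = "http://localhost:8000/v1/completions"

def send_one(i: int) -> None:
    payload = {
        "model": "codellama/CodeLlama-70b-Instruct-hf",
        "prompt": f"Write a detailed explanation of topic number {i}.",
        "max_tokens": 512,
    }
    resp = requests.post(URL, json=payload, timeout=600)
    print(i, resp.status_code)

threads = []
for i in range(100):           # 100 requests, one every 2 seconds
    t = threading.Thread(target=send_one, args=(i,))
    t.start()
    threads.append(t)
    time.sleep(2)
for t in threads:
    t.join()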

@gyin94

gyin94 commented Feb 9, 2024

@flexwang

flexwang commented Feb 9, 2024

similar to #2770

@NikolaBorisov "However at 0.3.0 with enforce_eager and disable_custom_all_reduce it stops hanging." Is it working 100% or just luck? On my side, it sometimes works fine for a few hours, but it just works until it doesn't.

@NikolaBorisov
Contributor Author

similar to #2770

@NikolaBorisov "However at 0.3.0 with enforce_eager and disable_custom_all_reduce it stops hanging." Is it working 100% or just luck? On my side, it sometimes works fine for a few hours, but it just works until it doesn't.

I haven't tested for hours, but with both of those options I cannot get it stuck with 500 requests, while without them it gets stuck quickly. So you managed to get it stuck with 0.3.0 with enforce_eager and disable_custom_all_reduce?

@flexwang

flexwang commented Feb 9, 2024

So you managed to get it stuck with 0.3.0 with enforce_eager and disable_custom_all_reduce

Ah, I just tried this combo; it seems to be working fine for now. Will keep looking.

@hanzhi713
Contributor

hanzhi713 commented Feb 10, 2024

@NikolaBorisov @flexwang Can you both try #2760 with enforce_eager=True and disable_custom_all_reduce=False? I'm hoping that some synchronization fixes would be enough.

ali-firstparty added a commit to firstpartyinc/vllm that referenced this issue Feb 13, 2024
@flexwang

@hanzhi713 But even vLLM 0.2.7 (without your custom_all_reduce) has NCCL hangs.

@WoosukKwon
Collaborator

Hi @NikolaBorisov @flexwang, sorry for the bug. The bug occurs when using CUDA graphs (i.e., enforce_eager=False) regardless of the custom all reduce kernels. It will be fixed by #2811.

@NikolaBorisov
Contributor Author

@hanzhi713 but even vLLM 0.2.7 (without your custom_all_reduce) has NCCL hangs.

There is one hang in CUDA graphs and one in custom_all_reduce. @hanzhi713 is trying to fix the custom all-reduce. I was going to try #2760, but another bug in the Dockerfile stopped me from trying it.

@WoosukKwon is #2811 ready to try? Should I give it a go?

@flexwang

Hi @NikolaBorisov @flexwang, sorry for the bug. The bug occurs when using CUDA graphs (i.e., enforce_eager=False) regardless of the custom all reduce kernels. It will be fixed by #2811.

@WoosukKwon Thanks for the info. However, I looked at top when the hang happens, and the memory seems fine. Are we sure this is due to a memory leak?

@WoosukKwon
Collaborator

@NikolaBorisov Yes. We just merged the PR. Please try it!

@flexwang We observed that the hanging issue was resolved when using Cupy. However, the safest way will still be to use enforce_eager=True.

@NikolaBorisov
Contributor Author

@WoosukKwon Seems to work. I really want #2845 because the Docker builds are broken.

@WoosukKwon
Collaborator

@NikolaBorisov Thanks for the confirmation!

@flexwang Please re-open the issue if the bug persists.

@Muttermal

Hi, I also encountered the same problem. I use the model for local inference, and when the inference is almost complete, vLLM gets stuck.
[screenshot: generation progress bar stuck at the last 7%]
The last 7% has been stuck for a long time. When I interrupt the program, the script is internally stuck at:
[screenshot: traceback showing where the script is blocked]

My inference script is as follows:

from vllm import LLM, SamplingParams  # import implied by the snippet

llm = LLM(
    model=model_path,              # local model path (defined elsewhere in my script)
    tensor_parallel_size=4,
    max_model_len=12288,
    enforce_eager=True,
    disable_custom_all_reduce=False,
    # trust_remote_code=True
)
sampling_params = SamplingParams(
    temperature=temperature,       # sampling settings defined elsewhere
    top_k=top_k,
    top_p=top_p,
    repetition_penalty=1.05,
    # stop_token_ids=model_conv.stop_token_ids,
    max_tokens=12288,
)
llm_outputs = llm.generate(
    all_conversations,             # my data
    sampling_params,
    use_tqdm=True,
)

The GPUs I use are 4 × A40 (48 GB), my CUDA driver version is 12.2, and my environment is as follows:

vllm                          0.4.0
torch                         2.1.2
ray                           2.9.3
flash-attn                    2.5.6
transformers                  4.38.2

Is there any solution?

@huang-junhong

I have the same problem with deepseek-r1-awq on 2 * 8 * A100 (40G).

Key packages:
vllm 0.7.2
torch 2.5.1
ray 0.42.0

Start command:
VLLM_LOGGING_LEVEL=DEBUG python -m vllm.entrypoints.openai.api_server --model /path/to/model --served-model-name name --port port --pipeline-parallel-size 2 --tensor-parallel-size 8 --max_model_len 8192 --enable-reasoning --reasoning-parser deepseek_r1 --dtype float16 --trust-remote-code --gpu-memory-utilization 0.8 --quantization moe_wna16 --max-num-seqs 1 --disable-custom-all-reduce

If --disable-custom-all-reduce is not set, it gets stuck within a few minutes. With --disable-custom-all-reduce set, it stays alive longer, but after some time it still gets stuck.
