dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Connected all rings
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Connected all trees
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO NVLS comm 0x55cffbc5c320 headRank 0 nHeads 8 buffSize 1048576 memSize 2097152 nvlsPerRankSize 100663296 nvlsTotalSize 805306368
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 16 p2p channels per peer
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO NCCL_LAUNCH_MODE set by environment to GROUP
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO ncclCommInitRank comm 0x55cffbc5c320 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0x408de440a1405b93 - Init COMPLETE
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Using non-device net plugin version 0
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Using network Socket
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO ncclCommInitRank comm 0x55d0094174e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 19000 commId 0xa4338029cd1bb6db - Init START
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555,55555555,55555555,55555555,55555555
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO comm 0x55d0094174e0 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Channel 00/02 : 0 1
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Channel 01/02 : 0 1
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO P2P Chunksize set to 131072
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Socket/0
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/Socket/0
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Connected all rings
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO Connected all trees
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
dsw-1056-75bdc9d5c9-hp6cz:540539:540539 [0] NCCL INFO ncclCommInitRank comm 0x55d0094174e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 19000 commId 0xa4338029cd1bb6db - Init COMPLETE
dsw-1056-75bdc9d5c9-hp6cz:540539:551755 [0] NCCL INFO Using non-device net plugin version 0
dsw-1056-75bdc9d5c9-hp6cz:540539:551755 [0] NCCL INFO Using network Socket
dsw-1056-75bdc9d5c9-hp6cz:540539:551755 [0] NCCL INFO ncclCommInitRank comm 0x7f647c0203e0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0xd346c3c43ef7f12f - Init START
dsw-1056-75bdc9d5c9-hp6cz:540539:551755 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,5555555
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574] Error executing method 'start_worker_execution_loop'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574] Traceback (most recent call last):
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]   File "/opt/vllm/vllm/worker/worker_base.py", line 566, in execute_method
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]     return run_method(target, method, args, kwargs)
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]   File "/opt/vllm/vllm/utils.py", line 2220, in run_method
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]     return func(*args, **kwargs)
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]   File "/opt/vllm/vllm/worker/worker_base.py", line 93, in start_worker_execution_loop
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]     output = self.execute_model(execute_model_req=None)
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]   File "/opt/vllm/vllm/worker/worker_base.py", line 406, in execute_model
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]     get_pp_group().recv_tensor_dict(
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]   File "/opt/vllm/vllm/distributed/parallel_state.py", line 747, in recv_tensor_dict
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]     recv_metadata_list = self.recv_object(src=src)
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]   File "/opt/vllm/vllm/distributed/parallel_state.py", line 561, in recv_object
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]     rank_size = torch.distributed.recv(size_tensor,
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]   File "/opt/py3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]     return func(*args, **kwargs)
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]   File "/opt/py3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2203, in recv
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574]     pg.recv([tensor], group_src_rank, tag).wait()
(RayWorkerWrapper pid=286715, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
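The `Timed out waiting 1800000ms` in that RuntimeError is torch.distributed's default 30-minute process-group timeout for the gloo backend: the pipeline-parallel rank sat in a blocking `recv` that was never matched by a send from the driver. A minimal standalone sketch of the same failure mode (not vLLM code; the rendezvous setup, ranks, and timeout value are illustrative assumptions):

```python
# Sketch: an unmatched gloo recv hits the process-group timeout
# (default 30 minutes = 1800000 ms) and raises the same RuntimeError
# seen in the traceback above. Assumes MASTER_ADDR/MASTER_PORT are set.
from datetime import timedelta

import torch
import torch.distributed as dist

def init_gloo(rank: int, world_size: int) -> None:
    dist.init_process_group(
        backend="gloo",
        rank=rank,
        world_size=world_size,
        timeout=timedelta(minutes=30),  # the 1800000 ms window in the error
    )

def blocked_recv() -> None:
    size_tensor = torch.empty(1, dtype=torch.long)
    # If rank 0 never sends (crashed, or stuck in its own engine loop),
    # this call blocks for the full timeout and then raises RuntimeError.
    dist.recv(size_tensor, src=0)
```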
(RayWorkerWrapper pid=286711, ip=10.130.199.172) INFO 02-13 13:59:44 worker.py:267] Memory profiling takes 8.69 seconds [repeated 14x across cluster]
(RayWorkerWrapper pid=286711, ip=10.130.199.172) INFO 02-13 13:59:44 worker.py:267] the current vLLM instance can use total_gpu_memory (79.11GiB) x gpu_memory_utilization (0.90) = 71.20GiB [repeated 14x across cluster]
(RayWorkerWrapper pid=286711, ip=10.130.199.172) INFO 02-13 13:59:44 worker.py:267] model weights take 44.71GiB; non_torch_memory takes 4.48GiB; PyTorch activation peak memory takes 1.25GiB; the rest of the memory reserved for KV Cache is 20.75GiB. [repeated 14x across cluster]
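The memory breakdown in those INFO lines is purely additive, so the reported KV-cache budget can be checked directly from the other numbers (values copied from the log; rounding explains the 20.75 vs 20.76 GiB difference):

```python
# Re-derive the worker's reported memory split from the logged values.
total_gpu_memory = 79.11        # GiB, as reported above
gpu_memory_utilization = 0.90   # the configured --gpu-memory-utilization

usable = total_gpu_memory * gpu_memory_utilization  # ~71.20 GiB
model_weights = 44.71           # GiB
non_torch_memory = 4.48         # GiB
activation_peak = 1.25          # GiB

kv_cache = usable - model_weights - non_torch_memory - activation_peak
print(f"KV cache budget ~ {kv_cache:.2f} GiB")  # ~20.76 GiB vs 20.75 GiB logged
```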
(RayWorkerWrapper pid=286711, ip=10.130.199.172) ERROR 02-13 15:44:24 worker_base.py:574] Error executing method 'execute_model'. This might cause deadlock in distributed execution.
INFO 02-14 02:10:29 async_llm_engine.py:211] Added request chatcmpl-2e196a12b6e9431b8f0fcc238cccb333.
INFO 02-14 02:10:29 async_llm_engine.py:211] Added request chatcmpl-2b52dd1202a94e8bb58c2c0fa7e0b5bf.
DEBUG 02-14 02:10:29 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
ERROR 02-14 02:10:29 async_llm_engine.py:839] Engine iteration timed out. This should never happen!
ERROR 02-14 02:10:29 async_llm_engine.py:68] Engine background task failed
ERROR 02-14 02:10:29 async_llm_engine.py:68] Traceback (most recent call last):
ERROR 02-14 02:10:29 async_llm_engine.py:68] File "/opt/vllm/vllm/engine/async_llm_engine.py", line 819, in run_engine_loop
ERROR 02-14 02:10:29 async_llm_engine.py:68] done, _ = await asyncio.wait(
ERROR 02-14 02:10:29 async_llm_engine.py:68] File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 02-14 02:10:29 async_llm_engine.py:68] return await _wait(fs, timeout, return_when, loop)
ERROR 02-14 02:10:29 async_llm_engine.py:68] File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 02-14 02:10:29 async_llm_engine.py:68] await waiter
ERROR 02-14 02:10:29 async_llm_engine.py:68] asyncio.exceptions.CancelledError
ERROR 02-14 02:10:29 async_llm_engine.py:68]
ERROR 02-14 02:10:29 async_llm_engine.py:68] During handling of the above exception, another exception occurred:
ERROR 02-14 02:10:29 async_llm_engine.py:68]
ERROR 02-14 02:10:29 async_llm_engine.py:68] Traceback (most recent call last):
ERROR 02-14 02:10:29 async_llm_engine.py:68] File "/opt/vllm/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
ERROR 02-14 02:10:29 async_llm_engine.py:68] return_value = task.result()
ERROR 02-14 02:10:29 async_llm_engine.py:68] File "/opt/vllm/vllm/engine/async_llm_engine.py", line 818, in run_engine_loop
ERROR 02-14 02:10:29 async_llm_engine.py:68] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 02-14 02:10:29 async_llm_engine.py:68] File "/opt/vllm/vllm/engine/async_timeout.py", line 97, in __aexit__
ERROR 02-14 02:10:29 async_llm_engine.py:68] self._do_exit(exc_type)
ERROR 02-14 02:10:29 async_llm_engine.py:68] File "/opt/vllm/vllm/engine/async_timeout.py", line 180, in _do_exit
ERROR 02-14 02:10:29 async_llm_engine.py:68] raise asyncio.TimeoutError
ERROR 02-14 02:10:29 async_llm_engine.py:68] asyncio.exceptions.TimeoutError
Exception in callback functools.partial(<function _log_task_completion at 0x7fada490aef0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fad87b387f0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7fada490aef0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fad87b387f0>>)>
Traceback (most recent call last):
File "/opt/vllm/vllm/engine/async_llm_engine.py", line 819, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/vllm/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
return_value = task.result()
File "/opt/vllm/vllm/engine/async_llm_engine.py", line 818, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/opt/vllm/vllm/engine/async_timeout.py", line 97, in __aexit__
self._do_exit(exc_type)
File "/opt/vllm/vllm/engine/async_timeout.py", line 180, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/opt/vllm/vllm/engine/async_llm_engine.py", line 70, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
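The `AsyncEngineDeadError` at the end is a consequence rather than the root cause: each engine iteration runs under an asyncio timeout (`ENGINE_ITERATION_TIMEOUT_S`, visible in the traceback), so once the pipeline-parallel recv above stalls, the iteration times out, the background task dies, and every queued request is failed through this callback. A minimal sketch of that pattern (hypothetical names and timeout value; uses Python 3.11's `asyncio.timeout`, whereas the traceback shows vLLM's own backport for Python 3.10):

```python
# Sketch of the failure chain in the traceback: a per-iteration timeout
# around a step that never completes turns into TimeoutError, which the
# completion callback then re-raises as a fatal "engine dead" error.
import asyncio

ENGINE_ITERATION_TIMEOUT_S = 60  # hypothetical value for illustration

class EngineDeadError(RuntimeError):
    """Stand-in for vLLM's AsyncEngineDeadError."""

async def engine_step() -> None:
    # Simulates an iteration stuck waiting on a remote worker that will
    # never answer (the gloo recv timeout seen earlier in the log).
    await asyncio.sleep(3600)

async def run_engine_loop() -> None:
    try:
        async with asyncio.timeout(ENGINE_ITERATION_TIMEOUT_S):
            await engine_step()
    except TimeoutError as exc:
        raise EngineDeadError(
            "Engine iteration timed out; background task finished unexpectedly."
        ) from exc

if __name__ == "__main__":
    asyncio.run(run_engine_loop())
```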
Your current environment
The error log of `python collect_env.py`
🐛 Describe the bug
The engine failed after running for some time. I found the similar issue #5084, but that was supposed to be fixed in v0.5.1.
command
error log