[Core] add an option to log every function call to for debugging hang/crash in distributed inference #4079

youkaichao · 2024-04-15T04:59:07Z

We often receive issue reports from users whose programs hang or crash in certain scenarios, and these are very difficult to debug. This PR adds an option, enabled with export VLLM_TRACE_FUNCTION=1, that logs every function call in vllm. With it, we can see the last function called and the call stack before the program hangs or crashes, which makes it much easier to locate the bug.
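As a rough illustration of the idea (not the code in this PR), per-call logging can be built on Python's sys.settrace hook. The function name enable_call_logging and its root_dir parameter are hypothetical, chosen only for this sketch. Note that sys.settrace applies to the calling thread only.

```python
import os
import sys
from datetime import datetime
from typing import Optional


def enable_call_logging(log_path: str, root_dir: Optional[str] = None) -> None:
    """Log every Python function call and return in this thread to log_path.

    Hypothetical helper for illustration. If root_dir is given, only
    functions whose source file lies under it are logged.
    """
    log_file = open(log_path, "a")
    prefix = os.path.abspath(root_dir) if root_dir is not None else None

    def tracer(frame, event, arg):
        if event in ("call", "return"):
            code = frame.f_code
            filename = code.co_filename
            if prefix is None or os.path.abspath(filename).startswith(prefix):
                log_file.write(f"{datetime.now()} {event} {code.co_name} "
                               f"in {filename}:{frame.f_lineno}\n")
                # Flush on every event so the last line survives a hang or crash.
                log_file.flush()
        # Returning the tracer keeps tracing active inside the new scope.
        return tracer

    sys.settrace(tracer)
```

After a hang, the last line of the log names the function that was entered most recently, which is the information the PR description is after. Tracing every call is slow, so a hook like this is only suitable for debugging runs.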

Sometimes the bug is in an unexpected place: e.g. #4027 found that the hang was caused by a slow s3 bucket read, and #3916 found that the core dump was due to a corrupted libnccl.so.

Hopefully, this PR will help with debugging in the future. It might also help with #4019.

TODO

Ideally we should have some launch_id, as described in [RFC]: Interface and Abstraction for Distributed Inference Environment #3587, and place these logs under /tmp/launch_id/. Users can then zip the whole directory so that we can easily debug it. (This is VLLM_INSTANCE_ID; the default is vllm-instance-{random_uuid()}.)
We should update the issue templates to instruct users to use this option when they report bugs about hangs or crashes.
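The instance-id idea in the first TODO item could look roughly like the following. This is a sketch under assumptions: get_instance_id and get_log_dir are hypothetical names, uuid.uuid4().hex stands in for vllm's random_uuid(), and tempfile.gettempdir() stands in for the hard-coded /tmp.

```python
import os
import tempfile
import uuid
from functools import lru_cache


@lru_cache(maxsize=None)
def get_instance_id() -> str:
    """Return VLLM_INSTANCE_ID if set, else a generated id.

    Cached so that every call within one process sees the same id.
    """
    default_id = f"vllm-instance-{uuid.uuid4().hex}"  # stand-in for random_uuid()
    return os.environ.get("VLLM_INSTANCE_ID", default_id)


def get_log_dir() -> str:
    """Create and return the per-instance log directory (hypothetical layout)."""
    log_dir = os.path.join(tempfile.gettempdir(), get_instance_id())
    os.makedirs(log_dir, exist_ok=True)
    return log_dir
```

Because all logs for one launch end up in a single directory, a user can archive that directory and attach it to a bug report.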

youkaichao · 2024-04-18T20:49:57Z

Currently this functionality is only enabled for RayGPUExecutor, the only tensor-parallel executor.

We can enable it for all the executors, but I would like to hear opinions on whether the effort is worthwhile.

youkaichao · 2024-04-18T21:33:31Z

I'm going to set this as a release blocker for v0.4.1, because we will introduce a vllm-managed nccl library in this release. That change is tricky, and we need more debugging functionality to support it.

vllm/utils.py

vllm/worker/worker_base.py

simon-mo · 2024-04-18T21:46:17Z

vllm/worker/worker_base.py

+                (f"VLLM_TRACE_FUNCTION_for_process_{os.getpid()}"
+                 f"_thread_{threading.get_ident()}_"
+                 f"at_{datetime.datetime.now()}.log").replace(" ", "_"))
+            os.makedirs(os.path.dirname(log_path), exist_ok=True)


This should be moved into enable_trace_function_call.

I separated and simplified the logic in enable_trace_function_call so that it can be tested standalone. The caller is responsible for creating the log file path.
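The split described in this reply can be sketched as follows: the caller builds the log path (using the same file name pattern as the diff above), and the tracing function only receives a ready-made path. The helpers build_trace_log_path and maybe_enable_tracing, the enable_fn parameter, and the exact "== 1" check on VLLM_TRACE_FUNCTION are assumptions made for this example, not the PR's code.

```python
import datetime
import os
import threading
from typing import Callable, Optional


def build_trace_log_path(base_dir: str) -> str:
    """Build a per-process, per-thread log file path and create its directory."""
    filename = (f"VLLM_TRACE_FUNCTION_for_process_{os.getpid()}"
                f"_thread_{threading.get_ident()}_"
                f"at_{datetime.datetime.now()}.log").replace(" ", "_")
    log_path = os.path.join(base_dir, filename)
    os.makedirs(os.path.dirname(log_path), exist_ok=True)
    return log_path


def maybe_enable_tracing(enable_fn: Callable[[str], None],
                         base_dir: str) -> Optional[str]:
    """Call enable_fn(log_path) only when VLLM_TRACE_FUNCTION=1 is set.

    enable_fn stands in for a function such as enable_trace_function_call
    that takes a log file path and knows nothing about how it was built.
    """
    if os.environ.get("VLLM_TRACE_FUNCTION", "0") != "1":
        return None
    log_path = build_trace_log_path(base_dir)
    enable_fn(log_path)
    return log_path
```

Keeping path construction out of the tracing function means a test can pass any temporary file path to it directly, without touching environment variables or process ids.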

Co-authored-by: Simon Mo <[email protected]>


youkaichao · 2024-04-18T22:06:22Z

@simon-mo thanks for the quick and detailed review!


youkaichao added 4 commits April 14, 2024 21:29

add frame tracer

430bfd3

add frame logging

00656a8

only log vllm function call

d9874cf

fix return call

2c4e1ed

agt mentioned this pull request Apr 15, 2024

[Bug]: LLM is not getting loaded on multiple GPUs but works fine on a single GPU #3974

Closed

youkaichao mentioned this pull request Apr 18, 2024

Bump version of 0.4.1 #4177

Merged

youkaichao added 12 commits April 18, 2024 11:15

Merge branch 'main' into trace_frame

af67e00

separate enable_trace_function_call

9748054

add tests

bb35129

fix ruff

5a565a4

add get_vllm_instance_id

26c850e

broadcast VLLM_INSTANCE_ID to workers

65d9375

refine warning condition

b2980cd

enable trace_function_call

91735e9

broadcast VLLM_TRACE_FUNCTION to workers

0dca0a6

verbose VLLM_INSTANCE_ID

38bc9cc

explain instance id

844d8c9

add issue templates

3a06f61

simon-mo mentioned this pull request Apr 18, 2024

v0.4.1 Release Tracker #4181

Closed

9 tasks

simon-mo approved these changes Apr 18, 2024


youkaichao and others added 5 commits April 18, 2024 14:56

cache get_vllm_instance_id inside one process

4636490

Update vllm/worker/worker_base.py

509f7aa

Co-authored-by: Simon Mo <[email protected]>

use global import

6e2609c

Merge branch 'trace_frame' of github.com:youkaichao/vllm into trace_frame

c2d8275

improve temp directory finding

e9600f0

simon-mo merged commit 8a7a3e4 into vllm-project:main Apr 18, 2024
44 of 46 checks passed

youkaichao deleted the trace_frame branch April 18, 2024 23:19

xjpang pushed a commit to xjpang/vllm that referenced this pull request Apr 19, 2024

[Core] add an option to log every function call to for debugging hang…

e3c478c

…/crash in distributed inference (vllm-project#4079) Co-authored-by: Simon Mo <[email protected]>

This was referenced Apr 19, 2024

[Bug]: async llm engine failed unexpectedly (using mixtral-8x7b with tp=4) #4135

Closed

[RFC]: Interface and Abstraction for Distributed Inference Environment #3587

Closed

robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 21, 2024

[Core] add an option to log every function call to for debugging hang…

b1983ad

…/crash in distributed inference (vllm-project#4079) Co-authored-by: Simon Mo <[email protected]>

robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 26, 2024

[Core] add an option to log every function call to for debugging hang…

6928163

…/crash in distributed inference (vllm-project#4079) Co-authored-by: Simon Mo <[email protected]>

alexeykondrat pushed a commit to alexeykondrat/ci-vllm that referenced this pull request May 1, 2024

[Core] add an option to log every function call to for debugging hang…

bec8ffd

…/crash in distributed inference (vllm-project#4079) Co-authored-by: Simon Mo <[email protected]>

z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024

[Core] add an option to log every function call to for debugging hang…

d7e2c90

…/crash in distributed inference (vllm-project#4079) Co-authored-by: Simon Mo <[email protected]>

dtrifiro mentioned this pull request May 15, 2024

bump ubi base image tag opendatahub-io/vllm#24

Merged
