[Core] add an option to log every function call for debugging hang/crash in distributed inference #4079
Currently this functionality is only enabled for We can enable it for all the executors, but would like to hear opinions on whether the effort is worthwhile.
I'm going to set this as a release blocker for
(f"VLLM_TRACE_FUNCTION_for_process_{os.getpid()}"
 f"_thread_{threading.get_ident()}_"
 f"at_{datetime.datetime.now()}.log").replace(" ", "_"))
os.makedirs(os.path.dirname(log_path), exist_ok=True)
this should be moved into enable_trace_function_call
I separated and simplified the logic in enable_trace_function_call so that it can be tested in a standalone way. The caller should be responsible for creating the log file path.
Co-authored-by: Simon Mo <[email protected]>
@simon-mo thanks for the quick and detailed review!
…/crash in distributed inference (vllm-project#4079) Co-authored-by: Simon Mo <[email protected]>
We often receive issue reports from users that programs hang or crash in certain scenarios. These are very difficult to debug. This PR adds an option to run with export VLLM_TRACE_FUNCTION=1, which will log every function call in vllm. This way, we can know the last function and call stack before the program hangs or crashes, so we can more easily tell what the bug is.
Sometimes the bug might happen in unexpected places, e.g. #4027 finds the hang is because s3 bucket read is too slow; #3916 finds the core dump is due to a corrupted libnccl.so.
Hopefully, this PR will help debugging in the future. It might also help #4019.
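For readers unfamiliar with how per-call logging can work, here is a minimal sketch built on Python's standard sys.settrace hook. It is not the code from this PR: the helper name enable_call_tracing, its root_dir filter, and the log line format are all assumptions made for illustration. Each record is flushed immediately so that the last line written is still on disk if the process hangs or dies.

```python
import os
import sys


def enable_call_tracing(log_path, root_dir=None):
    """Hypothetical helper: log every function call and return to log_path.

    If root_dir is given, only functions defined in files under that
    directory are logged. The caller is responsible for choosing log_path.
    """
    if root_dir is not None:
        root_dir = os.path.abspath(root_dir)
    log_file = open(log_path, "a")

    def tracer(frame, event, arg):
        code = frame.f_code
        if event == "call" and root_dir is not None:
            # Returning None disables tracing inside frames we do not care about.
            if not os.path.abspath(code.co_filename).startswith(root_dir):
                return None
        if event in ("call", "return"):
            log_file.write(
                f"{event} {code.co_name} in {code.co_filename}:{frame.f_lineno}\n"
            )
            # Flush on every record so the last call survives a hang or crash.
            log_file.flush()
        return tracer

    # sys.settrace only affects the current thread; threads started later
    # would need threading.settrace(tracer) as well.
    sys.settrace(tracer)
    return log_file
```

After calling enable_call_tracing(path), the final lines of the log show which function was entered last. Tracing every call is slow, so a switch like this is only suitable for debugging runs.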
TODO
launch_id, as described in [RFC]: Interface and Abstraction for Distributed Inference Environment #3587, and we should place these logs under /tmp/launch_id/. Then users can zip the whole directory so that we can easily debug it. (This is VLLM_INSTANCE_ID, and the default is vllm-instance-{random_uuid()}.)
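The directory layout in this TODO could look roughly like the sketch below. It is an illustration only: the helpers make_instance_log_dir and archive_log_dir are hypothetical, reading VLLM_INSTANCE_ID from the environment is an assumption, and uuid.uuid4() stands in for random_uuid().

```python
import os
import shutil
import tempfile
import uuid


def make_instance_log_dir(instance_id=None, base_dir=None):
    """Hypothetical helper: create and return <base_dir>/<instance_id>/."""
    if instance_id is None:
        # Assumption: the instance id comes from an environment variable,
        # falling back to a random id (stand-in for random_uuid()).
        instance_id = os.environ.get("VLLM_INSTANCE_ID") or (
            f"vllm-instance-{uuid.uuid4().hex}"
        )
    base_dir = base_dir or tempfile.gettempdir()
    log_dir = os.path.join(base_dir, instance_id)
    os.makedirs(log_dir, exist_ok=True)
    return log_dir


def archive_log_dir(log_dir):
    """Zip the whole log directory and return the path of the archive."""
    return shutil.make_archive(log_dir, "zip", root_dir=log_dir)
```

Keeping all trace logs of one launch under a single directory means a user can attach one zip file to an issue instead of collecting per-process files by hand.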