[Bugfix] Fix broadcasting logic for multi_modal_kwargs #6836
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI, as it is required for merging (or just use auto-merge).
Reopening to address the comment from @youkaichao.
Thanks for the change! Will take a look tomorrow.
Hmm, does this have a performance impact for `multi_modal_kwargs`? Given a CPU tensor, how do the two options compare: (1) broadcasting it directly as a CPU object, or (2) moving it to GPU first and broadcasting it there? I feel like on a single node with high-bandwidth NCCL, option 2 may be faster?
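A minimal sketch of the two options, assuming PyTorch's `torch.distributed` collectives; the helper names are illustrative, not vLLM's actual API:

```python
import torch
import torch.distributed as dist


def cpu_object_broadcast(obj, src: int = 0):
    # Option 1 (illustrative): pickle the CPU payload and broadcast it as an
    # opaque Python object; no flattening or device copy is needed.
    buf = [obj] if dist.get_rank() == src else [None]
    dist.broadcast_object_list(buf, src=src)
    return buf[0]


def gpu_nccl_broadcast(tensor: torch.Tensor, src: int = 0):
    # Option 2 (illustrative): copy the tensor to the local GPU, then broadcast
    # it over NCCL. Non-source ranks must already know the shape to allocate a
    # matching receive buffer, which is where the nested-metadata handling in
    # the current code path comes from.
    gpu_tensor = tensor.cuda(non_blocking=True)
    dist.broadcast(gpu_tensor, src=src)
    return gpu_tensor
```

For a small payload, option 1 avoids the host-to-device copy and the shape bookkeeping; option 2 would likely only pay off once the data is large enough for NCCL bandwidth to dominate.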
LGTM, thanks for the refactor to make our life easier!
@xwjiang2010 The tensor to broadcast here is small, and NCCL does not provide a win; it only adds code complexity.
Yeah, while I am happy that I don't have to think of a way to make nested lists of nested dicts of nested lists work, I am still not sure the payload is always small. For example, is it small when there are dozens of queries in a batch? Or multiple images per query? Or when we are sending in embeddings (in the future)? Or when factors like these compound?
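A hedged micro-benchmark sketch one could run to answer the size question; the payload shape, iteration count, and setup here are arbitrary assumptions, not measurements from this PR:

```python
import os
import time

import torch
import torch.distributed as dist

# Assumes launch via torchrun, which sets LOCAL_RANK and the rendezvous env vars.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))


def time_broadcast(fn, iters: int = 50) -> float:
    # Synchronize all ranks, run the broadcast `iters` times, and return the
    # mean wall-clock time per call in seconds.
    dist.barrier()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    dist.barrier()
    return (time.perf_counter() - start) / iters


# Hypothetical multi-modal payload: dozens of queries, one image each.
payload = {"pixel_values": torch.randn(32, 3, 336, 336)}
gpu_buf = payload["pixel_values"].cuda()

# Path 1: broadcast the whole dict as a pickled CPU object.
t_obj = time_broadcast(lambda: dist.broadcast_object_list([payload], src=0))
# Path 2: broadcast an already-allocated GPU tensor over NCCL.
t_nccl = time_broadcast(lambda: dist.broadcast(gpu_buf, src=0))

if dist.get_rank() == 0:
    print(f"object broadcast: {t_obj * 1e3:.2f} ms, NCCL broadcast: {t_nccl * 1e3:.2f} ms")
```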
In the future, we will only broadcast CPU data, without sending any GPU data via broadcast. GPU data should live in the process that produces it.
And this also applies to things like input_ids, right?
Yes, in the future.
Currently, `multi_modal_kwargs` are placed on GPU before the distributed broadcast. Since the broadcast operation only allows flat dictionaries of tensors, custom logic is required to handle nested tensors. This logic currently handles only singly-nested dictionaries and fails to consider nested lists of tensors, and it would be complicated and even inefficient to extend it to arbitrarily nested structures such as interleaved nested dictionaries and lists.
This PR addresses the root issue by keeping the tensors on CPU before the broadcast operation and only moving them to GPU right before they are passed to the model. This lets us broadcast them as opaque Python objects, removing the need to handle nested structures.
FIX #5905 (comment)
FIX #6946
cc @xwjiang2010 @youkaichao
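A minimal sketch of the approach described above, assuming PyTorch's `torch.distributed`; the helper names are hypothetical and do not mirror vLLM's actual internals:

```python
from typing import Any

import torch
import torch.distributed as dist


def broadcast_multi_modal_kwargs(mm_kwargs: Any, src: int = 0) -> Any:
    # Broadcast the arbitrarily nested structure (dicts/lists of CPU tensors)
    # as a single pickled object, so no flattening logic is needed.
    buf = [mm_kwargs] if dist.get_rank() == src else [None]
    dist.broadcast_object_list(buf, src=src)
    return buf[0]


def move_to_device(obj: Any, device: torch.device) -> Any:
    # Recursively move tensors to the target device right before they are
    # passed to the model; non-tensor leaves are returned unchanged.
    if isinstance(obj, torch.Tensor):
        return obj.to(device, non_blocking=True)
    if isinstance(obj, dict):
        return {k: move_to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, list):
        return [move_to_device(v, device) for v in obj]
    if isinstance(obj, tuple):
        return tuple(move_to_device(v, device) for v in obj)
    return obj


# On every rank, right before the forward pass:
# mm_kwargs = broadcast_multi_modal_kwargs(mm_kwargs, src=0)
# mm_kwargs = move_to_device(mm_kwargs, torch.device("cuda"))
```

Because the payload crosses the broadcast as one pickled object, interleaved dicts and lists need no special handling, and in this sketch the host-to-device copy happens once per rank, right before the forward pass.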