[core][distributed] support layer size undividable by pp size in pipeline parallel inference #6115
Conversation
There's a subtle point here:
vllm/vllm/executor/distributed_gpu_executor.py, lines 38 to 46 at commit 966fe72:
```python
num_blocks = self._run_workers("determine_num_available_blocks")
# Since we use a shared centralized controller, we take the minimum
# number of blocks across all workers to make sure all the memory
# operators can be applied to all workers.
num_gpu_blocks = min(b[0] for b in num_blocks)
num_cpu_blocks = min(b[1] for b in num_blocks)
return num_gpu_blocks, num_cpu_blocks
```
We take the min of blocks across all workers; as a result, the GPU memory utilization of ranks 0...n-2 will be slightly lower than expected.
I don't think it's a big deal, just something to be aware of.
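To make the effect concrete, here is a minimal sketch of the interaction, not the PR's actual implementation: when the layer count is not divisible by the pp size, some ranks hold an extra layer, report fewer available KV cache blocks, and the `min()` above then caps every rank at that smaller number. The helper names (`partition_layers`, `blocks_for_rank`) and the memory figures are hypothetical, chosen only for illustration.

```python
def partition_layers(num_layers: int, pp_size: int) -> list[int]:
    """Split num_layers across pp_size ranks; remainder layers go to
    the last ranks, so some ranks hold one extra layer."""
    base, rem = divmod(num_layers, pp_size)
    return [base + (1 if rank >= pp_size - rem else 0)
            for rank in range(pp_size)]

def blocks_for_rank(layers_on_rank: int, free_gpu_mem: int,
                    block_bytes_per_layer: int) -> int:
    """Each KV cache block must hold an entry for every layer on the
    rank, so more layers means fewer blocks fit in the same memory."""
    return free_gpu_mem // (block_bytes_per_layer * layers_on_rank)

layers = partition_layers(num_layers=13, pp_size=4)   # [3, 3, 3, 4]
per_rank = [blocks_for_rank(n, free_gpu_mem=1 << 30,
                            block_bytes_per_layer=1 << 20)
            for n in layers]                           # [341, 341, 341, 256]

# Taking the min caps ranks 0..2 at 256 blocks even though they could
# each fit 341, which is the slight memory waste discussed above.
num_gpu_blocks = min(per_rank)
```

Under these toy numbers, ranks 0..2 leave roughly a quarter of their KV cache budget unused; the heavier the imbalance, the larger the gap.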
Good point, I also noticed this. Some GPU memory is wasted, but that is better than the case not being supported at all.
LGTM
Signed-off-by: Alvant <[email protected]>
fixes #6114