[Usage]: How do I configure Phi-3-vision for high throughput? #7751
Comments
Have you deployed any vision language model across two machines, e.g. with pipeline parallelism? Thanks if you can suggest something on that. Also, how do I send an API request to the vision model? I need to send both an image and a prompt. Does vLLM currently support text only?
vLLM's server supports image input via the OpenAI Chat Completions API. Please refer to OpenAI's docs for more details.
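For reference, a minimal sketch of sending an image together with a text prompt through the OpenAI-compatible server. The base URL, API key, model name, and image URL below are placeholder assumptions, not values taken from this issue:

```python
# Minimal sketch: image + text prompt via the OpenAI Chat Completions API
# against a vLLM OpenAI-compatible server. Endpoint, key, model name, and
# image URL are placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",  # assumed model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```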
I don't think either of these is relevant to my issue. I am using a single NVIDIA L4, not a multi-GPU setup.
I suggest profiling the code to see where the bottleneck is. It's possible that most of the execution time is taken up by the model forward pass, in which case there can hardly be any improvement from adjusting the batching params.
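As an illustration only, one rough way to profile is to run the model offline and wrap generation in cProfile. The model name, engine arguments, and prompts below are placeholder assumptions, not taken from this issue:

```python
# Rough profiling sketch: wrap an offline vLLM generation call in cProfile.
# Model name, engine arguments, and prompts are placeholders.
import cProfile
import pstats

from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",  # assumed model
    trust_remote_code=True,
    max_model_len=4096,  # assumed limit for a 24 GB GPU
)
sampling_params = SamplingParams(max_tokens=64)

prompts = ["Describe the rules of chess."] * 8  # stand-in batch

profiler = cProfile.Profile()
profiler.enable()
llm.generate(prompts, sampling_params)
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```

If most of the time shows up inside the model forward pass (CUDA kernels), then, as noted above, adjusting batching parameters is unlikely to help much.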
@youkaichao @ywang96 perhaps you have a better idea of this?
Definitely, it needs profiling first.
@hommayushi3 Can you share information on how you currently set up the workload, including
Without this information, we can't really help you optimize for your workload.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
How would you like to use vllm
I want to run Phi-3-vision with vLLM to support parallel calls with high throughput. In my setup (an OpenAI-compatible vLLM 0.5.4 server on HuggingFace Inference Endpoints with an NVIDIA L4 24 GB GPU), I have set up Phi-3-vision with the following parameters:
I am running into the issue that no matter what settings I use, adding more concurrent calls increases the total inference time linearly; the batching parallelism is not working. For example, running 4 concurrent requests takes 12 seconds, but a single request by itself takes 3 seconds.
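For illustration, a minimal sketch of how such a measurement could be taken against the server. The endpoint URL, model name, prompt, and token limit are placeholder assumptions; the original benchmark code is not shown in this issue:

```python
# Sketch: compare wall-clock time of 1 request vs. 4 concurrent requests.
# Endpoint, model name, prompt, and max_tokens are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_request() -> None:
    await client.chat.completions.create(
        model="microsoft/Phi-3-vision-128k-instruct",  # assumed model
        messages=[{"role": "user", "content": "Describe a chessboard."}],
        max_tokens=128,
    )


async def timed(n: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(n)))
    return time.perf_counter() - start


async def main() -> None:
    print(f"1 request:  {await timed(1):.1f}s")
    print(f"4 requests: {await timed(4):.1f}s")


asyncio.run(main())
```

If batching were kicking in, the 4-request run would be expected to take well under 4x the single-request latency.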
The logs show:
Questions: