[Usage]: How do I configure Phi-3-vision for high throughput? #7751
Comments
Have you deployed any vision language model across two machines, e.g. with pipeline parallelism? Thanks if you can suggest something on that. Also, how do I send an API request to the vision model? I need to send both an image and a prompt. Does vLLM currently support text only?
vLLM's server supports image input via the OpenAI Chat Completions API. Please refer to OpenAI's docs for more details.
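For reference, a minimal sketch of sending an image together with a text prompt through the OpenAI-compatible server. The base URL, API key, model name, and image URL below are placeholder assumptions, not values taken from this issue:

```python
# Minimal sketch: image + text prompt via the OpenAI Chat Completions API
# against a vLLM OpenAI-compatible server. Endpoint, key, model name, and
# image URL are placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",  # assumed model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```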
I don't think either of these is relevant to my issue. I am using a single NVIDIA L4, not a multi-GPU setup.
I suggest profiling the code to see where the bottleneck is. It's possible that most of the execution time is taken up by the model forward pass, in which case there can hardly be any improvement from adjusting the batching params.
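As an illustration only, one rough way to profile is to run the model offline and wrap generation in cProfile. The model name, engine arguments, and prompts below are placeholder assumptions, not taken from this issue:

```python
# Rough profiling sketch: wrap an offline vLLM generation call in cProfile.
# Model name, engine arguments, and prompts are placeholders.
import cProfile
import pstats

from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",  # assumed model
    trust_remote_code=True,
    max_model_len=4096,  # assumed limit for a 24 GB GPU
)
sampling_params = SamplingParams(max_tokens=64)

prompts = ["Describe the rules of chess."] * 8  # stand-in batch

profiler = cProfile.Profile()
profiler.enable()
llm.generate(prompts, sampling_params)
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```

If most of the time shows up inside the model forward pass (CUDA kernels), then, as noted above, adjusting batching parameters is unlikely to help much.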
@youkaichao @ywang96 perhaps you have a better idea of this?
Definitely, it needs profiling first.
@hommayushi3 Can you share information on how you currently set up the workload, including
Without this information, we can't really help you optimize for your workload.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
How would you like to use vllm
I want to run Phi-3-vision with vLLM to support parallel calls with high throughput. In my setup (an OpenAI-compatible vLLM 0.5.4 server on HuggingFace Inference Endpoints with an NVIDIA L4 24 GB GPU), I have set up Phi-3-vision with the following parameters:
I am running into the issue that no matter what settings I use, adding more concurrent calls increases the total inference time linearly; the batching parallelism is not working. For example, running 4 concurrent requests takes 12 seconds, but a single request by itself takes 3 seconds.
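For illustration, a minimal sketch of how such a measurement could be taken against the server. The endpoint URL, model name, prompt, and token limit are placeholder assumptions; the original benchmark code is not shown in this issue:

```python
# Sketch: compare wall-clock time of 1 request vs. 4 concurrent requests.
# Endpoint, model name, prompt, and max_tokens are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_request() -> None:
    await client.chat.completions.create(
        model="microsoft/Phi-3-vision-128k-instruct",  # assumed model
        messages=[{"role": "user", "content": "Describe a chessboard."}],
        max_tokens=128,
    )


async def timed(n: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(n)))
    return time.perf_counter() - start


async def main() -> None:
    print(f"1 request:  {await timed(1):.1f}s")
    print(f"4 requests: {await timed(4):.1f}s")


asyncio.run(main())
```

If batching were kicking in, the 4-request run would be expected to take well under 4x the single-request latency.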
The logs show:
Questions: