[Usage]: Execution speed of non-Lora requests #8368
Comments
I haven't done precise testing, but I think your scenario is as expected.
Oh, thank you. But do you have any hypotheses about why this is so? What could be causing the slowdown? After all, these are requests that do not use adapters, and I still have enough free memory for the caches.
It could be that there is overhead that comes with un-applying the LoRA adapter to the model before processing the request.
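One way to check this hypothesis would be to time identical base-model requests against two otherwise identical instances, one started with LoRA enabled and one without. A rough sketch with the openai client is below; the ports, model name, prompt, and request count are placeholders, not the exact setup from this issue.

```python
# Rough latency comparison: send the same base-model request to a LoRA-enabled
# vLLM server and to a plain one, then compare mean per-request latency.
# Ports, model name, and prompt are placeholders for illustration only.
import time
from statistics import mean

from openai import OpenAI

ENDPOINTS = {
    "with_lora": "http://localhost:8000/v1",     # instance started with LoRA enabled
    "without_lora": "http://localhost:8001/v1",  # instance started without LoRA
}
MODEL = "openchat-3.6"  # whatever served model name your server exposes
PROMPT = "Write one sentence about the weather."

for label, base_url in ENDPOINTS.items():
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    latencies = []
    for _ in range(20):
        start = time.perf_counter()
        client.completions.create(model=MODEL, prompt=PROMPT, max_tokens=64)
        latencies.append(time.perf_counter() - start)
    print(f"{label}: mean latency {mean(latencies) * 1000:.1f} ms")
```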
Sorry for the delayed feedback.
Are you talking about some settings through vLLM debug, or about arbitrary profilers?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Your current environment
I'm using vllm/vllm-openai:v0.6.0
How would you like to use vllm
I already use vLLM for inference of some models and everything is fine. I also have load tests for my usage scenario. Recently I wanted to add some LoRA models. After running my load tests (which make requests to the base model, not to the LoRA) on an instance with LoRA enabled, I noticed that latency increased by about 5-10% (vs. an instance without LoRA).
My base model is openchat-3.6 (a finetune of Llama 2), and the LoRA has r=16 on the ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] layers.
I run vLLM (for the base model only) with:
and for LoRA with:
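Roughly, the two setups differ only in the LoRA-related engine arguments. A minimal sketch of the LoRA-enabled case using vLLM's offline LLM API is below; the model id, adapter path, and LoRA limits are placeholder assumptions, not the exact configuration used here.

```python
# Minimal sketch of the scenario with vLLM's offline API: a LoRA-enabled engine
# serving both plain base-model requests and LoRA requests.
# The model id, adapter path, and LoRA limits are placeholder assumptions.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

params = SamplingParams(max_tokens=64)

# LoRA-enabled engine; the "base model only" configuration is the same call
# without the three LoRA-related arguments.
llm = LLM(
    model="openchat/openchat-3.6-8b",  # placeholder model id
    enable_lora=True,
    max_loras=1,
    max_lora_rank=16,  # the adapter in this issue uses r=16
)

# Non-LoRA request: no lora_request is passed, so only the base weights are used.
llm.generate(["Hello!"], params)

# LoRA request, for comparison (adapter path is a placeholder).
llm.generate(
    ["Hello!"],
    params,
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
```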
I understand that using LoRA consumes additional GPU memory, which may reduce the amount of memory available for the KV cache, but my
GPU KV cache usage:
is far from 100%. I found an issue which was fixed, but I didn't understand from the PR with the fix whether it is now expected that non-LoRA requests to a vLLM instance with LoRA will slow down.
Is it normal that I am facing a slowdown in this scenario?
Before submitting a new issue...