curl -X POST localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lora8",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'
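The same request can also be issued from Python, which makes it easy to time the first pass. This is a sketch assuming the server and adapter name (`lora8`) from the curl example above; `build_chat_request` and `time_first_pass` are hypothetical helper names, not part of vLLM.

```python
import json
import time
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 100) -> urllib.request.Request:
    """Build the same /v1/chat/completions request as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


def time_first_pass(req: urllib.request.Request) -> float:
    """Send the request and return wall-clock seconds until the full reply arrives."""
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:  # blocks until the pass finishes
        resp.read()
    return time.perf_counter() - start
```

Calling `time_first_pass(build_chat_request("lora8", "Write a short story about a magical forest."))` against a running server prints nothing by itself but returns the latency of that first pass, which is how the timings below were observed.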
First inference request took too long between LoRA ranks:
When running the rank-8 LoRA module (first pass), the request completes quickly, in under 5 seconds.
However, when running the rank-16 LoRA module (first pass), it becomes significantly slower, taking around 3 minutes to complete.
Inability to run LoRA module (first pass) & base model simultaneously:
First, a query that runs the LoRA adapter (its first pass):
## Assume the LoRA adapter was already loaded through the /load_lora_adapter endpoint; this is its first inference pass.
curl -X POST localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lora8",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'
At the same time, try to run a request against the base model:
curl -X POST localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "base-model",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'
Now the base-model request will not execute until the first request has completed. The same thing happens with multiple first passes of loaded LoRA modules.
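One plausible mechanism for this serialization, assuming the first-pass LoRA work (loading and optimizing the adapter weights) runs synchronously on the server's asyncio event loop rather than in a worker thread: any CPU-bound call on the loop stalls every other pending request. A minimal sketch of the effect (the function names are hypothetical, not vLLM's):

```python
import asyncio
import time


async def first_lora_pass() -> None:
    # Stand-in for synchronous, CPU-bound adapter setup: it never awaits,
    # so the whole event loop is blocked for its duration.
    time.sleep(0.5)


async def base_model_timestamp() -> float:
    # Records when the loop actually got around to running this task.
    return time.perf_counter()


async def main() -> float:
    start = time.perf_counter()
    lora = asyncio.create_task(first_lora_pass())
    base = asyncio.create_task(base_model_timestamp())
    await asyncio.gather(lora, base)
    return base.result() - start  # how long the "base model" task waited


delay = asyncio.run(main())
print(f"base-model coroutine waited {delay:.2f}s")
```

If this is what is happening, offloading the blocking call via `loop.run_in_executor(...)` would let the base-model request proceed immediately instead of waiting out the LoRA first pass.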
Some findings
I noticed the slowdown occurs in two functions:
for lora in loras.values():
    lora.optimize()

and

loras[module_name].lora_b = tensor.to(device=device,
                                      dtype=dtype).t()
assert embedding_padding_modules is not None
if any(name in module_name
       for name in embedding_padding_modules
       ) and target_embedding_padding is not None:
    lora_b = loras[module_name].lora_b
    assert target_embedding_padding >= lora_b.shape[1]
    addition = target_embedding_padding - lora_b.shape[1]
    loras[module_name].lora_b = torch.nn.functional.pad(
        lora_b, (0, addition))
if pin_memory:
    loras[module_name].lora_b = loras[
        module_name].lora_b.pin_memory()
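I haven't traced optimize() itself, but if it does the usual LoRA bookkeeping — folding the scalar scaling = alpha / rank into lora_b once so later matmuls can skip the multiply — the per-module work looks like this pure-Python sketch (the class and attribute layout here are hypothetical, not vLLM's actual code):

```python
class LoRAWeightsSketch:
    """Toy stand-in for one module's LoRA weights (hypothetical names)."""

    def __init__(self, lora_b, alpha: float, rank: int):
        self.lora_b = list(lora_b)   # stand-in for the lora_b tensor
        self.scaling = alpha / rank  # e.g. alpha=16, rank=8 -> 2.0

    def optimize(self):
        # Fold the scalar into lora_b once, so inference skips the multiply.
        if self.scaling != 1:
            self.lora_b = [w * self.scaling for w in self.lora_b]
            self.scaling = 1
        return self


w = LoRAWeightsSketch([0.5, -1.0], alpha=16, rank=8)
w.optimize()
print(w.lora_b, w.scaling)  # [1.0, -2.0] 1
```

Work of this shape scales with the number of elements in lora_b, so a rank-16 adapter costs roughly twice a rank-8 one here; that alone would not explain a 5-second vs. 3-minute gap, which is why the device transfer, padding, and pin_memory() calls in the second snippet are also suspect.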
Your current environment
How would you like to use vllm
Hi, I’ve encountered a couple of issues while trying the new feature, and I’m hoping to get clarification or assistance.
VLLM container: vllm/vllm-openai:latest
lora rank 8 weight: https://huggingface.co/Akchacha/meta-llama-Meta-Llama-3-8B-Instruct-1726391523/blob/main/adapter_config.json
lora rank 16 weight: https://huggingface.co/Akchacha/meta-llama-Meta-Llama-3-8B-Instruct-1725954636/blob/main/adapter_config.json
server launch cmd