
[Usage]: Behavior with LoRA Ranks dynamic loading #8559

Closed
zhao-lun opened this issue Sep 18, 2024 · 2 comments
Labels: stale, usage (How to use vllm)

Comments


zhao-lun commented Sep 18, 2024

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

Hi, I’ve encountered a couple of issues while trying the new dynamic LoRA loading feature, and I’m hoping to get clarification or assistance.

vLLM container: vllm/vllm-openai:latest
LoRA rank 8 weights: https://huggingface.co/Akchacha/meta-llama-Meta-Llama-3-8B-Instruct-1726391523/blob/main/adapter_config.json
LoRA rank 16 weights: https://huggingface.co/Akchacha/meta-llama-Meta-Llama-3-8B-Instruct-1725954636/blob/main/adapter_config.json
Server launch command:

python3 -m vllm.entrypoints.openai.api_server --port 8080 \
    --model /mnt/inference/models/Meta-Llama-3-8B-Instruct \
    --served-model-name base-model --enable-lora --max-lora-rank=64 --max-loras=60

The first inference request after loading a LoRA adapter takes too long, and the latency differs between LoRA ranks:

  1. Loading/unloading LoRA adapters works fine:

curl -X POST http://localhost:8080/v1/load_lora_adapter \
    -H "Content-Type: application/json" \
    -d '{"lora_name": "lora8", "lora_path": "/mnt/test/test-lora8"}'

  2. The first forward pass takes too much time:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lora8",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'

When running a LoRA module with rank 8 (first pass), the request completes quickly (under 5 seconds).
However, when running a LoRA module with rank 16 (first pass), it becomes significantly slower, taking around 3 minutes to complete.

Example log

INFO 09-17 23:09:25 logger.py:36] Received request chat-07fb51bb258443939f26c3a8bc0b22a1: prompt: '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nhello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 882, 128007, 271, 15339, 128009, 128006, 78191, 128007, 271], lora_request: LoRARequest(lora_name='lora64', lora_int_id=4, lora_path='/mnt/pvc/samples/lora64', lora_local_path=None, long_lora_max_len=None), prompt_adapter_request: None.
INFO 09-17 23:09:25 async_llm_engine.py:201] Added request chat-07fb51bb258443939f26c3a8bc0b22a1.
DEBUG 09-17 23:09:25 async_llm_engine.py:716] Got new requests!
INFO 09-17 23:09:30 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:09:44 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:09:56 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:10 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:22 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:35 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:46 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:59 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:11:11 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 09-17 23:11:15 models.py:634] Adding lora. Model id: 4, int id: 4, scaling factor: None
DEBUG 09-17 23:11:15 models.py:370] Activating LoRA. int id: 4, slot index: 3
INFO 09-17 23:11:16 metrics.py:351] Avg prompt throughput: 1.9 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:11:17 async_llm_engine.py:169] Finished request chat-07fb51bb258443939f26c3a8bc0b22a1.
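
For reference, the per-adapter first-pass latency can be measured with a small script against the OpenAI-compatible endpoints shown above. This is only a reproduction sketch: the base URL, adapter names, and adapter paths are placeholders modeled on this report, not verified values.

# Reproduction sketch (assumptions: server from the launch command above at
# http://localhost:8080; adapter names/paths below are placeholders).
import time

import requests

BASE_URL = "http://localhost:8080"

ADAPTERS = {
    "lora8": "/mnt/test/test-lora8",
    "lora16": "/mnt/test/test-lora16",   # placeholder path for the rank-16 adapter
}


def time_first_pass(name: str, path: str) -> float:
    """Load an adapter, then time the first chat completion routed to it."""
    requests.post(f"{BASE_URL}/v1/load_lora_adapter",
                  json={"lora_name": name, "lora_path": path}).raise_for_status()
    payload = {
        "model": name,
        "messages": [{"role": "user",
                      "content": "Write a short story about a magical forest."}],
        "max_tokens": 100,
    }
    start = time.perf_counter()
    requests.post(f"{BASE_URL}/v1/chat/completions", json=payload).raise_for_status()
    return time.perf_counter() - start


if __name__ == "__main__":
    for name, path in ADAPTERS.items():
        print(f"{name}: first pass took {time_first_pass(name, path):.1f}s")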

Inability to run a LoRA module (first pass) and the base model simultaneously:

  1. First query runs the LoRA adapter (first pass):

# Assume the LoRA adapter was already loaded through the /load_lora_adapter endpoint
# and this is its first inference pass.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lora8",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'

  2. At the same time, try to run the base model:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "base-model",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'

  3. The base-model request does not execute until request 1 has finished. The same thing happens with multiple first passes of loaded LoRA modules (see the reproduction sketch below).
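
A concurrent reproduction makes the blocking easier to see. The following is a hedged sketch, assuming the same server, an adapter named lora8 that has been loaded but not yet used, and the base model served as base-model; if the base-model request is blocked behind the LoRA first pass, both timings come out close to the first-pass latency.

# Concurrency sketch (assumptions: server at http://localhost:8080, adapter "lora8"
# loaded but not yet used, base model served under the name "base-model").
import threading
import time

import requests

BASE_URL = "http://localhost:8080"


def chat(model: str, results: dict) -> None:
    """Send one chat completion and record how long it took."""
    payload = {
        "model": model,
        "messages": [{"role": "user",
                      "content": "Write a short story about a magical forest."}],
        "max_tokens": 100,
    }
    start = time.perf_counter()
    requests.post(f"{BASE_URL}/v1/chat/completions", json=payload)
    results[model] = time.perf_counter() - start


if __name__ == "__main__":
    results: dict = {}
    lora_thread = threading.Thread(target=chat, args=("lora8", results))
    base_thread = threading.Thread(target=chat, args=("base-model", results))
    lora_thread.start()
    time.sleep(1)  # let the LoRA first pass start before the base-model request
    base_thread.start()
    lora_thread.join()
    base_thread.join()
    print(results)  # expect both latencies to be roughly equal if the base request was blocked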

Some findings

I noticed the slowdown occurs in two places:

# Excerpt 1:
for lora in loras.values():
    lora.optimize()

# Excerpt 2:
loras[module_name].lora_b = tensor.to(device=device, dtype=dtype).t()
assert embedding_padding_modules is not None
if any(name in module_name
       for name in embedding_padding_modules
       ) and target_embedding_padding is not None:
    lora_b = loras[module_name].lora_b
    assert target_embedding_padding >= lora_b.shape[1]
    addition = target_embedding_padding - lora_b.shape[1]
    loras[module_name].lora_b = torch.nn.functional.pad(
        lora_b, (0, addition))
if pin_memory:
    loras[module_name].lora_b = loras[module_name].lora_b.pin_memory()
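
To isolate where the time goes in the second excerpt, the same tensor operations can be timed on a standalone tensor of comparable shape, outside vLLM. This is only a sketch: the tensor shape, dtype, and padded width below are made up for illustration, and pin_memory() needs a CUDA-capable runtime.

# Standalone timing sketch mirroring the second excerpt (assumptions: adapter weights
# staged on CPU; the rank-16 B-matrix shape and padded width below are made up).
import time

import torch


def timed(label, fn):
    start = time.perf_counter()
    out = fn()
    print(f"{label}: {time.perf_counter() - start:.4f}s")
    return out


tensor = torch.randn(4096, 16)          # hypothetical rank-16 LoRA B tensor
target_embedding_padding = 4096 + 64    # made-up padded width for illustration

lora_b = timed("to(dtype).t()",
               lambda: tensor.to(device="cpu", dtype=torch.float16).t())
addition = target_embedding_padding - lora_b.shape[1]
lora_b = timed("pad", lambda: torch.nn.functional.pad(lora_b, (0, addition)))
lora_b = timed("pin_memory", lambda: lora_b.pin_memory())  # requires CUDA runtime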

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
zhao-lun added the usage (How to use vllm) label on Sep 18, 2024
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Dec 19, 2024
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions bot closed this as not planned on Jan 19, 2025