
[Usage]: Behavior with LoRA Ranks dynamic loading #8559

Closed
zhao-lun opened this issue Sep 18, 2024 · 2 comments
Labels: stale, usage (How to use vllm)

Comments


zhao-lun commented Sep 18, 2024

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

Hi, I’ve encountered a couple of issues while trying the new dynamic LoRA loading feature, and I’m hoping to get clarification or assistance.

vLLM container: vllm/vllm-openai:latest
LoRA rank 8 weights: https://huggingface.co/Akchacha/meta-llama-Meta-Llama-3-8B-Instruct-1726391523/blob/main/adapter_config.json
LoRA rank 16 weights: https://huggingface.co/Akchacha/meta-llama-Meta-Llama-3-8B-Instruct-1725954636/blob/main/adapter_config.json
Server launch command:

python3 -m vllm.entrypoints.openai.api_server --port 8080 \
    --model /mnt/inference/models/Meta-Llama-3-8B-Instruct \
    --served-model-name base-model --enable-lora --max-lora-rank=64 --max-loras=60

The first inference request after loading a LoRA adapter takes too long, and the latency differs between LoRA ranks:

  1. Loading/unloading LoRA adapters works fine:

curl -X POST http://localhost:8080/v1/load_lora_adapter \
    -H "Content-Type: application/json" \
    -d '{"lora_name": "lora8", "lora_path": "/mnt/test/test-lora8"}'

  2. The first forward pass takes too much time:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lora8",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'

When running a LoRA module with rank 8 (first pass), the request completes quickly (under 5 seconds).
However, when running a LoRA module with rank 16 (first pass), it becomes significantly slower, taking around 3 minutes to complete.

Example log

INFO 09-17 23:09:25 logger.py:36] Received request chat-07fb51bb258443939f26c3a8bc0b22a1: prompt: '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nhello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 882, 128007, 271, 15339, 128009, 128006, 78191, 128007, 271], lora_request: LoRARequest(lora_name='lora64', lora_int_id=4, lora_path='/mnt/pvc/samples/lora64', lora_local_path=None, long_lora_max_len=None), prompt_adapter_request: None.
INFO 09-17 23:09:25 async_llm_engine.py:201] Added request chat-07fb51bb258443939f26c3a8bc0b22a1.
DEBUG 09-17 23:09:25 async_llm_engine.py:716] Got new requests!
INFO 09-17 23:09:30 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:09:44 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:09:56 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:10 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:22 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:35 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:46 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:59 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:11:11 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 09-17 23:11:15 models.py:634] Adding lora. Model id: 4, int id: 4, scaling factor: None
DEBUG 09-17 23:11:15 models.py:370] Activating LoRA. int id: 4, slot index: 3
INFO 09-17 23:11:16 metrics.py:351] Avg prompt throughput: 1.9 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:11:17 async_llm_engine.py:169] Finished request chat-07fb51bb258443939f26c3a8bc0b22a1.
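
For reference, the per-adapter first-pass latency can be measured with a small script against the OpenAI-compatible endpoints shown above. This is only a reproduction sketch: the base URL, adapter names, and adapter paths are placeholders modeled on this report, not verified values.

# Reproduction sketch (assumptions: server from the launch command above at
# http://localhost:8080; adapter names/paths below are placeholders).
import time

import requests

BASE_URL = "http://localhost:8080"

ADAPTERS = {
    "lora8": "/mnt/test/test-lora8",
    "lora16": "/mnt/test/test-lora16",   # placeholder path for the rank-16 adapter
}


def time_first_pass(name: str, path: str) -> float:
    """Load an adapter, then time the first chat completion routed to it."""
    requests.post(f"{BASE_URL}/v1/load_lora_adapter",
                  json={"lora_name": name, "lora_path": path}).raise_for_status()
    payload = {
        "model": name,
        "messages": [{"role": "user",
                      "content": "Write a short story about a magical forest."}],
        "max_tokens": 100,
    }
    start = time.perf_counter()
    requests.post(f"{BASE_URL}/v1/chat/completions", json=payload).raise_for_status()
    return time.perf_counter() - start


if __name__ == "__main__":
    for name, path in ADAPTERS.items():
        print(f"{name}: first pass took {time_first_pass(name, path):.1f}s")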

Inability to run a LoRA module (first pass) and the base model simultaneously:

  1. First query runs the LoRA adapter (first pass):

# Assume the LoRA adapter was already loaded through the /load_lora_adapter endpoint
# and this is its first inference pass.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lora8",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'

  2. At the same time, try to run the base model:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "base-model",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'

  3. The base-model request does not execute until request 1 has finished. The same thing happens with multiple first passes of loaded LoRA modules (see the reproduction sketch below).
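
A concurrent reproduction makes the blocking easier to see. The following is a hedged sketch, assuming the same server, an adapter named lora8 that has been loaded but not yet used, and the base model served as base-model; if the base-model request is blocked behind the LoRA first pass, both timings come out close to the first-pass latency.

# Concurrency sketch (assumptions: server at http://localhost:8080, adapter "lora8"
# loaded but not yet used, base model served under the name "base-model").
import threading
import time

import requests

BASE_URL = "http://localhost:8080"


def chat(model: str, results: dict) -> None:
    """Send one chat completion and record how long it took."""
    payload = {
        "model": model,
        "messages": [{"role": "user",
                      "content": "Write a short story about a magical forest."}],
        "max_tokens": 100,
    }
    start = time.perf_counter()
    requests.post(f"{BASE_URL}/v1/chat/completions", json=payload)
    results[model] = time.perf_counter() - start


if __name__ == "__main__":
    results: dict = {}
    lora_thread = threading.Thread(target=chat, args=("lora8", results))
    base_thread = threading.Thread(target=chat, args=("base-model", results))
    lora_thread.start()
    time.sleep(1)  # let the LoRA first pass start before the base-model request
    base_thread.start()
    lora_thread.join()
    base_thread.join()
    print(results)  # expect both latencies to be roughly equal if the base request was blocked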

Some findings

I noticed the slowdown occurs in two places:

# Excerpt 1:
for lora in loras.values():
    lora.optimize()

# Excerpt 2:
loras[module_name].lora_b = tensor.to(device=device, dtype=dtype).t()
assert embedding_padding_modules is not None
if any(name in module_name
       for name in embedding_padding_modules
       ) and target_embedding_padding is not None:
    lora_b = loras[module_name].lora_b
    assert target_embedding_padding >= lora_b.shape[1]
    addition = target_embedding_padding - lora_b.shape[1]
    loras[module_name].lora_b = torch.nn.functional.pad(
        lora_b, (0, addition))
if pin_memory:
    loras[module_name].lora_b = loras[module_name].lora_b.pin_memory()
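
To isolate where the time goes in the second excerpt, the same tensor operations can be timed on a standalone tensor of comparable shape, outside vLLM. This is only a sketch: the tensor shape, dtype, and padded width below are made up for illustration, and pin_memory() needs a CUDA-capable runtime.

# Standalone timing sketch mirroring the second excerpt (assumptions: adapter weights
# staged on CPU; the rank-16 B-matrix shape and padded width below are made up).
import time

import torch


def timed(label, fn):
    start = time.perf_counter()
    out = fn()
    print(f"{label}: {time.perf_counter() - start:.4f}s")
    return out


tensor = torch.randn(4096, 16)          # hypothetical rank-16 LoRA B tensor
target_embedding_padding = 4096 + 64    # made-up padded width for illustration

lora_b = timed("to(dtype).t()",
               lambda: tensor.to(device="cpu", dtype=torch.float16).t())
addition = target_embedding_padding - lora_b.shape[1]
lora_b = timed("pad", lambda: torch.nn.functional.pad(lora_b, (0, addition)))
lora_b = timed("pin_memory", lambda: lora_b.pin_memory())  # requires CUDA runtime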

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
zhao-lun added the usage (How to use vllm) label on Sep 18, 2024
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Dec 19, 2024
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions bot closed this as not planned on Jan 19, 2025