
Merging LoRA weights into a quantized model is not supported #2795

Closed
1 task done
sunjunlishi opened this issue Mar 12, 2024 · 7 comments
Labels
wontfix This will not be worked on

Comments

@sunjunlishi

Reminder

  • I have read the README and searched the existing issues.

Reproduction

python src/export_model.py
--model_name_or_path ../../../workspace/Llama/Qwen-14B-Chat-Int4
--adapter_name_or_path ../LLaMA-Factory-main-bk/path_to_sft14bint4_checkpoint/checkpoint-7000
--template default
--finetuning_type lora
--export_dir export_sft14bint4
--export_size 2
--export_legacy_format False
ValueError: Cannot merge adapters to a quantized model

1. Indeed: "Cannot merge adapters to a quantized model".
2. However, with python src/web_demo.py
--model_name_or_path ../../../workspace/Llama/Qwen-14B-Chat-Int4
--adapter_name_or_path path_to_sft14bint4_checkpoint/checkpoint-7000/
--template qwen
--finetuning_type lora
your web_demo.py can load the quantized model together with the LoRA fine-tuned adapter. I have tested this myself and it works, but it is slow.
3. So your overall pipeline does work: you can train on a quantized model and then load the result. What remains is solving the speed problem.
a. Here is my thought: vLLM can load a quantized model on its own, so could vLLM also be used to implement your combined loading?
b. Wouldn't your combined loading effectively sidestep the "Cannot merge adapters to a quantized model" problem?

Why do I keep pressing this issue? You have already implemented training on a quantized model and loading the result, but it is slow. If the speed problem can be solved, roughly 70% of users could train with a quantized 14B model, or even larger quantized models, and load them afterwards, which would be a big benefit for real-world deployment. I will keep studying your code.
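For context on the error itself: I assume the merge step amounts to folding the LoRA deltas back into the base weights (something like peft's merge_and_unload), which only works when the base model is loaded in full precision. A rough, untested sketch, not LLaMA-Factory's actual export code, with a hypothetical adapter path:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Full-precision base model (assumption: the unquantized Qwen-14B-Chat checkpoint)
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-14B-Chat", trust_remote_code=True)
# Hypothetical adapter path, for illustration only
model = PeftModel.from_pretrained(base, "path_to_sft_checkpoint")
merged = model.merge_and_unload()  # folds W + B*A back into the Linear weights
merged.save_pretrained("merged_model")

On a GPTQ-quantized base the packed int4 weights cannot be updated in place, which I assume is why the ValueError is raised.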

Expected behavior

No response

System Info

No response

Others

No response

@sunjunlishi
Author

The code here in LLaMA-Factory-main/src/llmtuner/model/adapter.py is what ties the fine-tuned adapter and the base model together; it is quite impressive.

if adapter_to_resume is not None:  # resume lora training
    print('to resume....')
    model = PeftModel.from_pretrained(model, adapter_to_resume, is_trainable=is_trainable)
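A minimal sketch (not the project's web_demo.py; paths taken from this issue) of what that line amounts to at inference time: the LoRA adapter is attached on top of the already-quantized base, nothing is merged.

from transformers import AutoModelForCausalLM
from peft import PeftModel

# GPTQ-quantized base model from this issue
base = AutoModelForCausalLM.from_pretrained(
    "../../../workspace/Llama/Qwen-14B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
)
# Attach the LoRA checkpoint; the quantized weights stay untouched and
# the adapter's A/B matrices run alongside them at inference time.
model = PeftModel.from_pretrained(
    base,
    "path_to_sft14bint4_checkpoint/checkpoint-7000",
    is_trainable=False,
)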

@hiyouga hiyouga added the pending This problem is yet to be addressed label Mar 12, 2024
@sunjunlishi
Author

sunjunlishi commented Mar 12, 2024

@hiyouga Why train with a quantized model? Because training on the quantized version of a large model gives good accuracy, the loss drops quickly, and it uses much less GPU memory. Right now the whole pipeline works, both training and running the demo; the only issue is that it is somewhat slow.
Training on a 4-bit quantized 14B model gives better results than training on an unquantized 7B model.

@sunjunlishi
Author

sunjunlishi commented Mar 15, 2024

can vllm support loading lora?

vllm-project/vllm#2710

jvmncs commented on Feb 6
Have a look at this example: https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py

@simon-mo
Collaborator
simon-mo commented 2 weeks ago
And documentation here: https://docs.vllm.ai/en/latest/models/lora.html

Using LoRA adapters
This document shows you how to use LoRA adapters with vLLM on top of a base model. Adapters can be efficiently served on a per request basis with minimal overhead. First we download the adapter(s) and save them locally with

from huggingface_hub import snapshot_download

sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
Then we instantiate the base model and pass in the enable_lora=True flag:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
We can now submit the prompts and call llm.generate with the lora_request parameter. The first parameter of LoRARequest is a human identifiable name, the second parameter is a globally unique ID for the adapter and the third parameter is the path to the LoRA adapter.

sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"]
)

prompts = [
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]

outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, sql_lora_path)
)
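For completeness (this part is not in the quoted docs), the returned RequestOutput objects can then be inspected like this:

for output in outputs:
    # each RequestOutput carries the original prompt and the generated completions
    print(output.prompt)
    print(output.outputs[0].text)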

@sunjunlishi
Author

vllm-project/vllm#2828

vLLM: Support LoRAs on quantized models
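If that lands, loading the GPTQ base together with the LoRA adapter in vLLM might look roughly like this. This is an untested sketch; whether quantization and enable_lora can be combined depends on the vLLM version, and the paths are the ones from this issue.

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="../../../workspace/Llama/Qwen-14B-Chat-Int4",  # GPTQ-quantized base from this issue
    quantization="gptq",
    enable_lora=True,
    trust_remote_code=True,
)

outputs = llm.generate(
    ["hello"],  # placeholder prompt
    SamplingParams(temperature=0, max_tokens=128),
    lora_request=LoRARequest(
        "sft_adapter",  # human-readable adapter name
        1,              # globally unique adapter id
        "path_to_sft14bint4_checkpoint/checkpoint-7000",  # LoRA checkpoint from this issue
    ),
)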

@hiyouga hiyouga added wontfix This will not be worked on and removed pending This problem is yet to be addressed labels Mar 25, 2024
@hiyouga hiyouga closed this as not planned Mar 25, 2024
@sunjunlishi
Author

@hiyouga
Merging LoRA weights into a quantized model is not supported.
I see that QLoRA can train a quantized model, so, author, can a QLoRA adapter then be merged into the quantized model? I would train with QLoRA and then merge.

@world2025

But QLoRA training only supports a single GPU.

@lebronjamesking

@hiyouga Merging LoRA weights into a quantized model is not supported. I see that QLoRA can train a quantized model, so, author, can a QLoRA adapter then be merged into the quantized model? I would train with QLoRA and then merge.

Same question here: how can a GPTQ-quantized model be merged?
