Support loras on quantized models #2828
Conversation
Looking good! Could we add an integration test similar to what's in tests/lora/test_llama.py?
@Yard1 which subset of tests is acceptable to you?
@fmmoret I think as long as we can be sure that there are no exceptions and the output is correct (even if it's not stable), it should be fine. You can limit it to stable prompts or reduce the number of tokens that are generated to see if that helps.
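A hedged sketch of what such a test might look like, written against the public vllm API rather than the fixtures in tests/lora/test_llama.py; the quantized checkpoint names and the adapter path below are placeholders, and greedy decoding with a short output is used to keep results as stable as possible:

```python
import pytest
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest


@pytest.mark.parametrize("quantization", ["awq", "gptq"])
def test_lora_on_quantized_model(quantization):
    # Placeholder checkpoints: substitute real quantized models and a real
    # LoRA adapter directory when wiring this into CI.
    model_by_quant = {
        "awq": "TheBloke/Llama-2-7B-AWQ",
        "gptq": "TheBloke/Llama-2-7B-GPTQ",
    }
    llm = LLM(
        model=model_by_quant[quantization],
        quantization=quantization,
        enable_lora=True,
        max_lora_rank=8,
    )
    lora = LoRARequest("sql-lora", 1, "/path/to/sql-lora-adapter")  # placeholder path
    params = SamplingParams(temperature=0.0, max_tokens=16)  # greedy, short output
    outputs = llm.generate(["Write a SQL query that counts users:"],
                           params,
                           lora_request=lora)
    # Main goal per the discussion above: no exceptions and non-empty output.
    assert outputs and outputs[0].outputs[0].text.strip()
```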
Is anyone planning to take this forward?
I'm out on vacation -- I might be able to find the time sometime next week to pull this over the finish line without SqueezeLLM support. IIRC the SqueezeLLM impl was slightly different / needed some more changes. For the rest of them, I just need to find some completion examples that won't be flaky across different devices / driver versions. My local GPU was not always matching remote CI for some of the completions.
@fmmoret Is it okay if I take a crack at it and try some other examples here? I need this urgently for a project.
Yep
@fmmoret Any way I can help? This would be an awesome feature, so I'd hate to see its inclusion falter.
Could someone explain the idea behind handling different dtypes for the base and LoRA weights?
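Not vLLM's actual implementation, but as a general illustration of the pattern: the LoRA A/B matrices live in lora_dtype (which defaults to the model dtype per the hunk below), the low-rank matmul runs in that dtype, and the resulting delta is cast back to the dtype of the base layer's output before being added:

```python
import torch


def add_lora_delta(base_out: torch.Tensor,
                   x: torch.Tensor,
                   lora_a: torch.Tensor,
                   lora_b: torch.Tensor,
                   scaling: float = 1.0) -> torch.Tensor:
    """Illustrative only: apply a LoRA update on top of a (possibly quantized)
    base layer's output, handling a dtype mismatch between base and LoRA."""
    # x: [num_tokens, in_features], lora_a: [in_features, rank],
    # lora_b: [rank, out_features]. lora_a/lora_b may be fp16/bf16 while the
    # dequantized base output is in the model's activation dtype.
    delta = (x.to(lora_a.dtype) @ lora_a) @ lora_b
    return base_out + scaling * delta.to(base_out.dtype)
```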
@@ -568,9 +568,6 @@ def verify_with_model_config(self, model_config: ModelConfig):
             self.lora_dtype = model_config.dtype
         elif isinstance(self.lora_dtype, str):
             self.lora_dtype = getattr(torch, self.lora_dtype)
-        if model_config.quantization is not None:
-            raise ValueError(
-                "LoRA is not supported with quantized models yet.")
|
This should probably raise for untested quant types; we should have an allowlist.
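A minimal sketch of that allowlist idea; the set of methods below is only an assumption about which combinations have been exercised, not the project's official list:

```python
from typing import Optional

# Assumed-tested quantization methods; anything else keeps the old hard failure.
_QUANT_METHODS_TESTED_WITH_LORA = {"awq", "gptq"}


def verify_lora_quantization(quantization: Optional[str]) -> None:
    """Raise unless the quantization method has been validated with LoRA."""
    if quantization is None:
        return
    if quantization.lower() not in _QUANT_METHODS_TESTED_WITH_LORA:
        raise ValueError(
            f"LoRA is not yet validated with {quantization!r} quantized models; "
            f"tested methods: {sorted(_QUANT_METHODS_TESTED_WITH_LORA)}")
```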
As far as I've seen, it has worked since early February with my first commits. The tests take a long time to run and many are flaky. I think all the tests should actually pass at this point on this PR, but the LoRA suite times out, and other test suites that I think are unrelated often flake. If someone else wants to branch off and finish, feel free. I haven't had the time to contribute (I don't need the change myself -- I just saw that it shouldn't take much to make it work).
@fmmoret I can confirm that I was able to run a quantized model + adapters using your branch, and the results were good.
                 lora_config.max_lora_rank,
             ),
             dtype=lora_config.lora_dtype,
-            device=self.base_layer.weight.device,
+            device=device,
         )
         self.indices: Optional[torch.Tensor] = None
         self.indices_len: Optional[List[int]] = None
@fmmoret Thank you for your contribution. I have tested this pull request, and the result looked good. However, one issue that needs to be considered is tensor parallelism.
Just change self.base_layer.input_size to self.base_layer.input_size_per_partition for the RowParallelLinear module, and change self.base_layer.output_size to self.base_layer.output_size_per_partition for ColumnParallelLinearWithLoRA. With those changes, we could run in a tensor-parallel setting using this PR.
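A small sketch of what that change amounts to, assuming the parallel linear layers expose the usual per-partition size attributes; this is an illustration, not the exact diff that landed:

```python
def lora_a_input_size(base_layer) -> int:
    # RowParallelLinear shards the input dimension across tensor-parallel ranks,
    # so the LoRA A matrix must be sized to the per-partition input.
    return getattr(base_layer, "input_size_per_partition", base_layer.input_size)


def lora_b_output_size(base_layer) -> int:
    # ColumnParallelLinear shards the output dimension, so the LoRA B matrix
    # must be sized to the per-partition output.
    return getattr(base_layer, "output_size_per_partition", base_layer.output_size)
```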
Also, a NoneType error might be raised when tp_size > 1 and either lora_b[0] or lora_b[1] is None. It's still a bug in the main branch:
if self.tp_size > 1:
    tensor_model_parallel_rank = get_tensor_model_parallel_rank()
    shard_size = self.output_dim
    start_idx = tensor_model_parallel_rank * shard_size
    end_idx = (tensor_model_parallel_rank + 1) * shard_size
    # Old slicing (fails with a NoneType error if either element is None):
    # lora_b = lora_b[0][:,
    #                    start_idx:end_idx], lora_b[1][:,
    #                                                  start_idx:end_idx]
    if lora_b[0] is not None:
        lora_b[0] = lora_b[0][:, start_idx:end_idx]
    if lora_b[1] is not None:
        lora_b[1] = lora_b[1][:, start_idx:end_idx]
Which branch?
@jeejeelee pulled this one over the finish line -- thank you very much. This feature looks very popular with the community.
#2821
Tried https://huggingface.co/yard1/llama-2-7b-sql-lora-test/blob/main/adapter_config.json against AWQ, GPTQ, and SqueezeLLM by editing https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py
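For reference, a condensed sketch of that kind of run; it follows the engine loop from multilora_inference.py, but the AWQ checkpoint name, local adapter path, and prompt are placeholders rather than the exact script that was used:

```python
from vllm import EngineArgs, LLMEngine, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder AWQ checkpoint; the same shape of script can be pointed at
# GPTQ or SqueezeLLM checkpoints to exercise those paths too.
engine_args = EngineArgs(model="TheBloke/Llama-2-7B-AWQ",
                         quantization="awq",
                         enable_lora=True,
                         max_loras=1,
                         max_lora_rank=8)
engine = LLMEngine.from_engine_args(engine_args)

# Local clone of the yard1/llama-2-7b-sql-lora-test adapter (placeholder path).
lora = LoRARequest("sql-lora", 1, "/path/to/llama-2-7b-sql-lora-test")
engine.add_request("req-0",
                   "Write a SQL query that lists all tables.",
                   SamplingParams(temperature=0.0, max_tokens=64),
                   lora_request=lora)

# Drain the engine and print the finished completion.
while engine.has_unfinished_requests():
    for request_output in engine.step():
        if request_output.finished:
            print(request_output.outputs[0].text)
```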