Support loras on quantized models #2828
Conversation
Looking good! Could we add an integration test similar to what's in tests/lora/test_llama.py?
@Yard1 which subset of tests is acceptable to you?
@fmmoret I think as long as we can be sure that there are no exceptions and the output is correct (even if it's not stable), it should be fine. You can limit it to stable prompts or reduce the number of tokens that are generated to see if that helps.
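A hedged sketch of what such a test might look like, written against the public vllm API rather than the fixtures in tests/lora/test_llama.py; the quantized checkpoint names and the adapter path below are placeholders, and greedy decoding with a short output is used to keep results as stable as possible:

```python
import pytest
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest


@pytest.mark.parametrize("quantization", ["awq", "gptq"])
def test_lora_on_quantized_model(quantization):
    # Placeholder checkpoints: substitute real quantized models and a real
    # LoRA adapter directory when wiring this into CI.
    model_by_quant = {
        "awq": "TheBloke/Llama-2-7B-AWQ",
        "gptq": "TheBloke/Llama-2-7B-GPTQ",
    }
    llm = LLM(
        model=model_by_quant[quantization],
        quantization=quantization,
        enable_lora=True,
        max_lora_rank=8,
    )
    lora = LoRARequest("sql-lora", 1, "/path/to/sql-lora-adapter")  # placeholder path
    params = SamplingParams(temperature=0.0, max_tokens=16)  # greedy, short output
    outputs = llm.generate(["Write a SQL query that counts users:"],
                           params,
                           lora_request=lora)
    # Main goal per the discussion above: no exceptions and non-empty output.
    assert outputs and outputs[0].outputs[0].text.strip()
```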
Is anyone planning to take this forward?
I'm out on vacation -- I might be able to find the time sometime next week to pull this over the finish line without SqueezeLLM support. IIRC the SqueezeLLM impl was slightly different / needed some more changes. For the rest of them, I just need to find some completion examples that won't be flaky across different devices / driver versions. My local GPU was not always matching remote CI for some of the completions.
@fmmoret Is it okay if I take a crack at it and try some other examples here? I need this urgently for a project.
Yep
@fmmoret Any way I can help? This would be an awesome feature, so I'd hate to see its inclusion falter.
Could someone explain the idea behind handling different dtypes for the base and LoRA weights?
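Not vLLM's actual implementation, but as a general illustration of the pattern: the LoRA A/B matrices live in lora_dtype (which defaults to the model dtype per the hunk below), the low-rank matmul runs in that dtype, and the resulting delta is cast back to the dtype of the base layer's output before being added:

```python
import torch


def add_lora_delta(base_out: torch.Tensor,
                   x: torch.Tensor,
                   lora_a: torch.Tensor,
                   lora_b: torch.Tensor,
                   scaling: float = 1.0) -> torch.Tensor:
    """Illustrative only: apply a LoRA update on top of a (possibly quantized)
    base layer's output, handling a dtype mismatch between base and LoRA."""
    # x: [num_tokens, in_features], lora_a: [in_features, rank],
    # lora_b: [rank, out_features]. lora_a/lora_b may be fp16/bf16 while the
    # dequantized base output is in the model's activation dtype.
    delta = (x.to(lora_a.dtype) @ lora_a) @ lora_b
    return base_out + scaling * delta.to(base_out.dtype)
```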
@@ -568,9 +568,6 @@ def verify_with_model_config(self, model_config: ModelConfig):
             self.lora_dtype = model_config.dtype
         elif isinstance(self.lora_dtype, str):
             self.lora_dtype = getattr(torch, self.lora_dtype)
-        if model_config.quantization is not None:
-            raise ValueError(
-                "LoRA is not supported with quantized models yet.")
|
This should probably raise for untested quant types; we should have an allowlist.
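A minimal sketch of that allowlist idea; the set of methods below is only an assumption about which combinations have been exercised, not the project's official list:

```python
from typing import Optional

# Assumed-tested quantization methods; anything else keeps the old hard failure.
_QUANT_METHODS_TESTED_WITH_LORA = {"awq", "gptq"}


def verify_lora_quantization(quantization: Optional[str]) -> None:
    """Raise unless the quantization method has been validated with LoRA."""
    if quantization is None:
        return
    if quantization.lower() not in _QUANT_METHODS_TESTED_WITH_LORA:
        raise ValueError(
            f"LoRA is not yet validated with {quantization!r} quantized models; "
            f"tested methods: {sorted(_QUANT_METHODS_TESTED_WITH_LORA)}")
```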
As far as I've seen, it has worked since early February with my first commits. The tests take a long time to run and many are flaky. I think all the tests should actually pass at this point on this PR, but the LoRA suite times out, and other test suites that I think are unrelated often flake. If someone else wants to branch off and finish, feel free. I haven't had the time to contribute (I don't need the change myself -- I just saw that it shouldn't take much to make it work).
@fmmoret I can confirm that I was able to run a quantized model + adapters using your branch, and the results were good.
                 lora_config.max_lora_rank,
             ),
             dtype=lora_config.lora_dtype,
-            device=self.base_layer.weight.device,
+            device=device,
         )
         self.indices: Optional[torch.Tensor] = None
         self.indices_len: Optional[List[int]] = None
@fmmoret Thank you for your contribution. I have tested this pull request, and the result looked good. However, one issue that needs to be considered is tensor parallelism.
Just change self.base_layer.input_size to self.base_layer.input_size_per_partition for the RowParallelLinear module, and change self.base_layer.output_size to self.base_layer.output_size_per_partition for ColumnParallelLinearWithLoRA. With those changes, we could run in a tensor-parallel setting using this PR.
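A small sketch of what that change amounts to, assuming the parallel linear layers expose the usual per-partition size attributes; this is an illustration, not the exact diff that landed:

```python
def lora_a_input_size(base_layer) -> int:
    # RowParallelLinear shards the input dimension across tensor-parallel ranks,
    # so the LoRA A matrix must be sized to the per-partition input.
    return getattr(base_layer, "input_size_per_partition", base_layer.input_size)


def lora_b_output_size(base_layer) -> int:
    # ColumnParallelLinear shards the output dimension, so the LoRA B matrix
    # must be sized to the per-partition output.
    return getattr(base_layer, "output_size_per_partition", base_layer.output_size)
```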
Also, a NoneType error might be raised when tp_size > 1 and either lora_b[0] or lora_b[1] is None. It's still a bug in the main branch:
if self.tp_size > 1:
    tensor_model_parallel_rank = get_tensor_model_parallel_rank()
    shard_size = self.output_dim
    start_idx = tensor_model_parallel_rank * shard_size
    end_idx = (tensor_model_parallel_rank + 1) * shard_size
    # Old slicing (fails with a NoneType error if either element is None):
    # lora_b = lora_b[0][:,
    #                    start_idx:end_idx], lora_b[1][:,
    #                                                  start_idx:end_idx]
    if lora_b[0] is not None:
        lora_b[0] = lora_b[0][:, start_idx:end_idx]
    if lora_b[1] is not None:
        lora_b[1] = lora_b[1][:, start_idx:end_idx]
Which branch?
@jeejeelee pulled this one over the finish line -- thank you very much. This feature looks very popular with the community.
#2821
Tried https://huggingface.co/yard1/llama-2-7b-sql-lora-test/blob/main/adapter_config.json against AWQ, GPTQ, and SqueezeLLM by editing https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py
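For reference, a condensed sketch of that kind of run; it follows the engine loop from multilora_inference.py, but the AWQ checkpoint name, local adapter path, and prompt are placeholders rather than the exact script that was used:

```python
from vllm import EngineArgs, LLMEngine, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder AWQ checkpoint; the same shape of script can be pointed at
# GPTQ or SqueezeLLM checkpoints to exercise those paths too.
engine_args = EngineArgs(model="TheBloke/Llama-2-7B-AWQ",
                         quantization="awq",
                         enable_lora=True,
                         max_loras=1,
                         max_lora_rank=8)
engine = LLMEngine.from_engine_args(engine_args)

# Local clone of the yard1/llama-2-7b-sql-lora-test adapter (placeholder path).
lora = LoRARequest("sql-lora", 1, "/path/to/llama-2-7b-sql-lora-test")
engine.add_request("req-0",
                   "Write a SQL query that lists all tables.",
                   SamplingParams(temperature=0.0, max_tokens=64),
                   lora_request=lora)

# Drain the engine and print the finished completion.
while engine.has_unfinished_requests():
    for request_output in engine.step():
        if request_output.finished:
            print(request_output.outputs[0].text)
```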