
Support loras on quantized models #2828

Closed

Conversation

@fmmoret commented Feb 9, 2024

@fmmoret marked this pull request as ready for review February 9, 2024 21:10
@Yard1 (Collaborator) left a comment


Looking good! Could we add an integration test similar to what's in tests/lora/test_llama.py?

@fmmoret marked this pull request as draft February 9, 2024 23:27
@fmmoret (Author) commented Feb 12, 2024

@Yard1 Which subset of tests would be acceptable to you?
I am finding that the code paths that go through the dequant kernels are not deterministic -- back-to-back runs locally sometimes produce slightly different samplings for your snapshot-style test.

@Yard1 (Collaborator) commented Feb 12, 2024

@fmmoret I think as long as we can be sure that there are no exceptions and the output is correct (even if it's not stable), it should be fine. You can limit it to stable prompts or reduce the number of tokens generated to see if that helps.
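(For reference, a minimal smoke test along these lines might look like the sketch below. The model name, adapter path, and quantization method are placeholder assumptions and not taken from this PR; the test only checks that generation completes and returns non-empty text, rather than comparing exact snapshots.)

# Hypothetical smoke test: short, greedy generations so nondeterministic dequant
# kernels have little room to diverge, and no exact-match snapshot assertions.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

def test_quantized_model_with_lora_smoke():
    llm = LLM(
        model="TheBloke/Llama-2-7B-AWQ",   # placeholder quantized model
        quantization="awq",                # placeholder quant method
        enable_lora=True,
        max_lora_rank=16,
    )
    prompts = ["Hello, my name is", "The capital of France is"]
    params = SamplingParams(temperature=0.0, max_tokens=16)  # few tokens, greedy
    outputs = llm.generate(
        prompts,
        params,
        lora_request=LoRARequest("test-adapter", 1, "/path/to/lora-adapter"),
    )
    # Only assert that generation succeeded and produced text; exact completions
    # may differ slightly across devices / driver versions.
    assert len(outputs) == len(prompts)
    assert all(out.outputs[0].text.strip() for out in outputs)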

@samadkoita

Is anyone planning to take this forward?

@fmmoret (Author) commented Mar 6, 2024

> Is anyone planning to take this forward?

I'm out on vacation -- I might be able to find the time sometime next week to pull this over the finish line without squeezellm support. IIRC the squeezellm implementation was slightly different / needed some more changes.

For the rest of them, I just need to find some completion examples that are not going to be flaky across different devices / driver versions. My local GPU was not always matching remote CI for some of the completions.

@samadkoita

@fmmoret Is it okay if I take a crack at it & try some other examples here? Need this urgently for a project.

@fmmoret (Author) commented Mar 6, 2024

> @fmmoret Is it okay if I take a crack at it & try some other examples here? Need this urgently for a project.

Yep

@Peter-Devine

@fmmoret Any way I can help? This would be an awesome feature, so I'd hate to see its inclusion falter.

@thincal commented Mar 24, 2024

Could someone explain the idea behind handling different dtypes for the base and LoRA weights?

@flexorRegev

@fmmoret @Yard1 From the looks of this PR, there isn't anything inherent that prevents quantized models + LoRA from being supported right now -- it just wasn't tested?

@@ -568,9 +568,6 @@ def verify_with_model_config(self, model_config: ModelConfig):
             self.lora_dtype = model_config.dtype
         elif isinstance(self.lora_dtype, str):
             self.lora_dtype = getattr(torch, self.lora_dtype)
-        if model_config.quantization is not None:
-            raise ValueError(
-                "LoRA is not supported with quantized models yet.")

@fmmoret (Author) left a comment:

This probably should still throw for untested quant types. It should have an allowlist.
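(A minimal sketch of such an allowlist check, with the set of tested quantization methods purely assumed for illustration; the function name is hypothetical and not from this PR.)

from typing import Optional

# Hypothetical allowlist: only quantization methods that have been exercised
# with LoRA pass; anything else keeps raising, as the removed check did.
ALLOWED_QUANT_METHODS_WITH_LORA = {"awq", "gptq"}  # assumed set, for illustration

def verify_quantization_with_lora(quantization: Optional[str]) -> None:
    if quantization is not None and quantization not in ALLOWED_QUANT_METHODS_WITH_LORA:
        raise ValueError(
            f"LoRA is not tested with quantization method {quantization!r} yet.")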

@fmmoret (Author) commented Mar 24, 2024

> @fmmoret @Yard1 From the looks of this PR, there isn't anything inherent that prevents quantized models + LoRA from being supported right now -- it just wasn't tested?

As far as I've seen, it has worked since early Feb with my first commits. The tests take a long time to run & many are flaky.

I think all the tests should actually pass on this PR at this point, but the LoRA suite times out, and other test suites that I think are unrelated often flake.

If someone else wants to branch off and finish, feel free.

I haven't had the time to contribute (I don't need the change -- just saw that it shouldn't take much to make it work).

@hanan9m commented Mar 27, 2024

@fmmoret I can confirm that I was able to run a quantized model + adapters using your branch, and the results were good.

                 lora_config.max_lora_rank,
             ),
             dtype=lora_config.lora_dtype,
-            device=self.base_layer.weight.device,
+            device=device,
         )
         self.indices: Optional[torch.Tensor] = None
         self.indices_len: Optional[List[int]] = None
A collaborator left the following review comments:

@fmmoret Thank you for your contribution. I have tested this pull request, and the result looked good. However, one issue that needs to be considered is tensor parallelism.


Just change self.base_layer.input_size to self.base_layer.input_size_per_partition for the RowParallelLinear module, and change self.base_layer.output_size to self.base_layer.output_size_per_partition for ColumnParallelLinearWithLoRA. With that change, we could run with a tensor-parallelism setting using this PR.
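(A rough sketch of the suggested per-partition sizing. The helper functions, argument names, and exact buffer layout here are assumptions for illustration; only the input_size_per_partition / output_size_per_partition swap comes from the comment above.)

import torch

def alloc_row_parallel_lora_a(base_layer, lora_config, max_loras, device):
    # RowParallelLinear: size the stacked LoRA A buffer against this rank's shard
    # of the input dimension rather than the full input_size.
    return torch.zeros(
        (max_loras, 1, lora_config.max_lora_rank,
         base_layer.input_size_per_partition),  # was: base_layer.input_size
        dtype=lora_config.lora_dtype,
        device=device,
    )

def alloc_column_parallel_lora_b(base_layer, lora_config, max_loras, device):
    # ColumnParallelLinearWithLoRA: size the stacked LoRA B buffer against this
    # rank's shard of the output dimension rather than the full output_size.
    return torch.zeros(
        (max_loras, 1, base_layer.output_size_per_partition,  # was: base_layer.output_size
         lora_config.max_lora_rank),
        dtype=lora_config.lora_dtype,
        device=device,
    )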


Also, a NoneType error may be raised when tp_size > 1 and either lora_b[0] or lora_b[1] is None. It's still a bug in the main branch:

        if self.tp_size > 1:
            tensor_model_parallel_rank = get_tensor_model_parallel_rank()
            shard_size = self.output_dim
            start_idx = tensor_model_parallel_rank * shard_size
            end_idx = (tensor_model_parallel_rank + 1) * shard_size
            # The original slicing assumed both halves of lora_b are present:
            # lora_b = lora_b[0][:, start_idx:end_idx], lora_b[1][:, start_idx:end_idx]
            # Guard against either half being None:
            if lora_b[0] is not None:
                lora_b[0] = lora_b[0][:, start_idx:end_idx]
            if lora_b[1] is not None:
                lora_b[1] = lora_b[1][:, start_idx:end_idx]

@sunjunlishi

> @fmmoret I can confirm that I was able to run a quantized model + adapters using your branch, and the results were good.

Which branch?

@fmmoret (Author) commented Apr 12, 2024

@jeejeelee pulled this one over the finish line -- thank you very much. This feature looks to be very popular with the community.

@fmmoret closed this Apr 12, 2024