[RFC][Refactor] Generalize linear_method to be quant_method #4342

comaniac · 2024-04-24T23:00:36Z

Motivation: In most quantization methodologies, we only focus on linear layer quantization. Thus, current vLLM has linear_method, which allows each quantization method to customize how weights are created and applied. However, things get more complicated with W8A8 and FP8 quantization, because we may want to cover more modules.

In this PR: This PR and RFC proposes an approach that lets all modules, not only linear ones, customize the logic of 1) weight loading and 2) kernel dispatching for different quantization methods, including FP8.

Specifically, this PR makes the following changes:

Instead of passing linear_method, we now pass quant_config when initializing the model.
[During Initialization] Each module, such as Linear, can optionally use quant_config.get_quantize_method(self) to get its quantization method, and use quant_method.create_weights to create weights.
[After Initialization] The loader goes through the quant_method of each module and invokes process_weights_after_loading to post-process weights for the different methods (e.g., transpose for GEMM in FP8).
[Forward] Each module could use self.quant_method.apply(self, ...) to dispatch the computation to the configured method.
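
The three phases above can be sketched in plain Python, with nested lists standing in for tensors. This is only an illustration under stated assumptions: quant_config, get_quantize_method, create_weights, apply and the post-load hook are the names used in this description, while the concrete classes, the `transpose` flag and `load_weights` are hypothetical stand-ins and not vLLM's actual code.

```python
class QuantMethod:
    """Hypothetical base class: how one module creates and applies weights."""

    def create_weights(self, layer, in_features, out_features):
        raise NotImplementedError

    def process_weights_after_loading(self, layer):
        # Default: nothing to post-process.
        pass

    def apply(self, layer, x):
        raise NotImplementedError


class UnquantizedLinearMethod(QuantMethod):
    def create_weights(self, layer, in_features, out_features):
        # Weight stored as [out_features][in_features], zero-initialized.
        layer.weight = [[0.0] * in_features for _ in range(out_features)]

    def apply(self, layer, x):
        return [sum(w * v for w, v in zip(row, x)) for row in layer.weight]


class TransposedLinearMethod(UnquantizedLinearMethod):
    """Stand-in for a method that needs a transpose after loading."""

    def process_weights_after_loading(self, layer):
        layer.weight = [list(col) for col in zip(*layer.weight)]

    def apply(self, layer, x):
        # Weight is now [in_features][out_features].
        n_out = len(layer.weight[0])
        return [
            sum(layer.weight[i][j] * x[i] for i in range(len(x)))
            for j in range(n_out)
        ]


class QuantConfig:
    def __init__(self, transpose=False):
        self.transpose = transpose  # hypothetical knob for this sketch

    def get_quantize_method(self, layer):
        # Modules the config does not cover get no quantization method.
        if isinstance(layer, Linear):
            if self.transpose:
                return TransposedLinearMethod()
            return UnquantizedLinearMethod()
        return None


class Linear:
    def __init__(self, in_features, out_features, quant_config=None):
        # [During Initialization] ask the config for a method, create weights.
        method = quant_config.get_quantize_method(self) if quant_config else None
        self.quant_method = method or UnquantizedLinearMethod()
        self.quant_method.create_weights(self, in_features, out_features)

    def forward(self, x):
        # [Forward] dispatch the computation to the configured method.
        return self.quant_method.apply(self, x)


def load_weights(modules, state):
    for name, module in modules.items():
        module.weight = [list(row) for row in state[name]]
    # [After Initialization] let every method post-process its weights.
    for module in modules.values():
        module.quant_method.process_weights_after_loading(module)


state = {"fc": [[1, 2], [3, 4], [5, 6]]}
plain = Linear(2, 3, QuantConfig())
transposed = Linear(2, 3, QuantConfig(transpose=True))
load_weights({"fc": plain}, state)
load_weights({"fc": transposed}, state)
```

Both layers compute the same product; only the stored layout differs, which is the kind of detail the post-load hook is meant to hide from the module.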

Note that this approach offers additional flexibility: quant_config can be easily extended to give finer-grained control over each module. For example, we can use quant_config to control the output data type of each module to avoid duplicated quantization/de-quantization overheads. This will be implemented in follow-up PRs.

I've tested this PR with Mistral-7B (TP1 and TP8) and Mixtral-7B (TP8). I didn't evaluate the performance as it shouldn't have any impact, but I'll still check it once the design is aligned.

cc @pcmoritz @robertgshaw2-neuralmagic @tlrmchlsmth @simon-mo @WoosukKwon @zhuohan123

pcmoritz · 2024-04-25T00:30:52Z

I love the overall design, great job! Some comments:

I noticed there is an asymmetry in that the quantized linear layers live in vllm/model_executor/layers/quantization/{scheme}.py, whereas the quantized MoE layer is in vllm/model_executor/layers/fused_moe/quant_methods.py. It seemed odd when I first saw it, but it actually makes sense, because the linear layers are used as building blocks everywhere (e.g. QKV multiplications in attention, in MoE), whereas the MoE layers are a lot more specialized and are most likely modified by the people who work on MoE improvements.

If others agree on this design, in order to get it merged, we should split the PR into the linear_method -> quant_config refactor (which accounts for most of the changes but should hopefully be trivial to verify) and the MoE changes, and possibly also split out the change to use scaled_fp8_quant (which I think we can merge quickly regardless of this PR).

comaniac · 2024-04-25T00:38:05Z

Makes sense. I also spent some time thinking about where to put fused_moe/quant_methods.py, because all files under layers/ are modules, but fused_moe has kernels. My original plan was to have layers/moe.py, which would do similar things to linear.py, but then I found fused_moe...

I also agree to revert the MoE changes in this PR to make it concise once the design is aligned. The reason I included the MoE change is to demonstrate (and test) how this mechanism can be used for modules other than linear layers.

robertgshaw2-redhat

Thanks for this.

+1 to @pcmoritz on splitting this up

A) Question

Is the motivation for this that we would ultimately use the QuantConfig in the constructor of, for instance, the activation function of MLP as opposed to just passing it along to RowParallelLinear and ColumnParallelLinear?

The reason I ask is that LinearMethod already has the QuantizationConfig as a member, so this refactor does nothing in the current state. But I see why passing the config around could be useful if we start passing it to other types (since passing a LinearMethod to an activation function constructor would be funny :)

B) Suggestion

I think we need to pass the name of the layer to get_quantize_method().

This would enable the QuantizationConfig to make different decisions for how to configure the LinearMethod on a layer-by-layer basis. This would enable:

Supporting different output datatypes (e.g. if we wanted to dequantize after the SiLUandMul for MLP, but wanted to dequantize after the GEMM in the QKV)
Supporting non-uniform quantization (e.g. channelwise in only the most sensitive layers)

Currently, we have no way to express this. If we pass around the layer_name from the state_dict, the QuantizationConfig can decide how a specific layer should be set up. Here's an example of this in our W8A8 prototype
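
To make the suggestion concrete, here is a minimal sketch of a config whose get_quantize_method() takes a layer name and returns a per-layer decision. Everything except the get_quantize_method name is an assumption made for illustration: `LayerScheme`, `NameAwareQuantConfig`, the substring matching and the example layer names are hypothetical and are not taken from the W8A8 prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerScheme:
    """Hypothetical per-layer decision returned by the config."""
    weight_granularity: str  # "per_tensor" or "per_channel"
    output_dtype: str        # "fp16": dequantize after the GEMM; "int8": stay quantized


class NameAwareQuantConfig:
    def __init__(self, default, overrides):
        self.default = default
        # Maps a substring of the layer name to a scheme; first match wins,
        # so more specific keys should be listed first.
        self.overrides = overrides

    def get_quantize_method(self, layer_name):
        for key, scheme in self.overrides.items():
            if key in layer_name:
                return scheme
        return self.default


config = NameAwareQuantConfig(
    default=LayerScheme("per_tensor", "fp16"),
    overrides={
        # Non-uniform quantization: channelwise only in a sensitive layer.
        "layers.0.mlp.down_proj": LayerScheme("per_channel", "fp16"),
        # Keep the MLP input projection quantized so that dequantization
        # can happen after SiLUAndMul instead of after the GEMM.
        "mlp.gate_up_proj": LayerScheme("per_tensor", "int8"),
    },
)

qkv = config.get_quantize_method("model.layers.3.self_attn.qkv_proj")
gate_up = config.get_quantize_method("model.layers.3.mlp.gate_up_proj")
sensitive = config.get_quantize_method("model.layers.0.mlp.down_proj")
```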

C) Other Idea (for future PR)

I think we should also refactor how the weight_loading logic works.

Right now, we put a bunch of data in a dictionary attached to the weight, but there is no consistent interface and things are getting very hacky (packed_dim is an example of this).

This might be a good time to revisit the interface between the Parameters and the Layers. Happy to take this on.

comaniac · 2024-04-25T16:54:23Z

> Thanks for this.
>
> +1 to @pcmoritz on splitting this up

Thanks for the review and valuable feedback. Since there's no strong objection to the current design, I'll split this PR as suggested and make it ready for review. Other add-ons can be covered in future PRs.

A) Question

> Is the motivation for this that we would ultimately use the QuantConfig in the constructor of, for instance, the activation function of MLP as opposed to just passing it along to RowParallelLinear and ColumnParallelLinear?
>
> The reason I ask is that LinearMethod already has the QuantizationConfig as a member, so this refactor does nothing in the current state. But I see why passing the config around could be useful if we start passing it to other types (since passing a LinearMethod to an activation function constructor would be funny :)

Your understanding is 100% correct: passing the config to all modules, instead of just linear modules, is exactly the motivation of this PR. Meanwhile, this refactor is compatible with the current linear_method, meaning that it changes nothing (good or bad) in the current state.

B) Suggestion

> I think we need to pass the name of the layer to get_quantize_method().
>
> This would enable the QuantizationConfig to make different decisions for how to configure the LinearMethod on a layer-by-layer basis. This would enable:
>
> Supporting different output datatypes (e.g. if we wanted to dequantize after the SiLUandMul for MLP, but wanted to dequantize after the GEMM in the QKV)
>
> Supporting non-uniform quantization (e.g. channelwise in only the most sensitive layers)
>
> Currently, we have no way to express this. If we pass around the layer_name from the state_dict, the QuantizationConfig can decide how a specific layer should be set up. Here's an example of this in our W8A8 prototype

This is a point I'm aware of. For example, in Mixtral MoE we want to disable FP8 in the QKV and gate linears for now due to a performance regression, but would like to enable FP8 in fused MoE. The current way is ad hoc, so I was hoping to do the same thing you suggested. For example, we could have something like decode.moe: fp8 in quant_config.json to enable FP8 only for MoE. However, as you pointed out, I haven't found a good way to obtain this information. Passing the layer name all the way from the top level seems a bit tedious for development and maintenance. I will keep thinking about a better solution.
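
As a rough illustration of such a config entry, the sketch below resolves a quantization scheme for a module from rules stored as JSON. The schema (`default`, `rules`, `pattern`, `scheme`), the `resolve_scheme` helper and the module names are all assumptions made for this example; no quant_config.json format is defined in this PR, and the question of how a module learns its own name is left open here, as it is above.

```python
import json
from fnmatch import fnmatchcase

# Hypothetical contents of a quant_config.json that enables FP8 only
# for the fused MoE experts and leaves every other module unquantized.
QUANT_CONFIG_JSON = """
{
  "default": null,
  "rules": [
    {"pattern": "*.block_sparse_moe.experts", "scheme": "fp8"}
  ]
}
"""


def resolve_scheme(config, module_name):
    """Return the scheme of the first rule whose glob pattern matches."""
    for rule in config["rules"]:
        if fnmatchcase(module_name, rule["pattern"]):
            return rule["scheme"]
    return config["default"]


config = json.loads(QUANT_CONFIG_JSON)

# Illustrative module names for a Mixtral-style decoder layer.
moe = resolve_scheme(config, "model.layers.0.block_sparse_moe.experts")
gate = resolve_scheme(config, "model.layers.0.block_sparse_moe.gate")
qkv = resolve_scheme(config, "model.layers.0.self_attn.qkv_proj")
```

With these rules, only the experts resolve to FP8, while the gate and QKV linears fall back to the unquantized default.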

C) Other Idea (for future PR)

> I think we should also refactor how the weight_loading logic works.
>
> Right now, we put a bunch of data in a dictionary attached to the weight, but there is no consistent interface and things are getting very hacky (packed_dim is an example of this).
>
> This might be a good time to revisit the interface between the Parameters and the Layers. Happy to take this on.

Agreed on this point. From my personal perspective, it would be much better to refactor the weight loading logic so that we can load all weights of a module (e.g., q, k and v) at the same time. That way we would no longer need process_weights_after_loading, because we could process the weights directly in the corresponding weight_loader function.
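
A small sketch of that idea, using nested lists in place of tensors: when a loader receives q, k and v together, it can fuse and post-process them in one step, so no separate post-load hook is needed. The `load_qkv` function, its `postprocess` argument and the fused layout are hypothetical; this is not how vLLM's weight_loader is currently written.

```python
from types import SimpleNamespace


def transpose(weight):
    """Example post-processing step: swap rows and columns."""
    return [list(col) for col in zip(*weight)]


def load_qkv(layer, shards, postprocess=None):
    """Load q, k and v at once, fuse them along the output dimension,
    and apply any method-specific post-processing immediately."""
    missing = [name for name in ("q", "k", "v") if name not in shards]
    if missing:
        raise KeyError(f"missing shards: {missing}")
    fused = [list(row) for name in ("q", "k", "v") for row in shards[name]]
    layer.weight = postprocess(fused) if postprocess else fused


layer = SimpleNamespace(weight=None)
shards = {"q": [[1, 2]], "k": [[3, 4]], "v": [[5, 6]]}
load_qkv(layer, shards, postprocess=transpose)
```

Because all three shards are visible at once, the transpose here runs on the complete fused weight rather than on partially loaded data.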

comaniac · 2024-04-26T23:37:14Z

Closed as #4373 is merged. MoE refactoring will be in another follow-up PR.

[Refactor] Generalize linear_method to be quant_method

f2c0758

comaniac mentioned this pull request Apr 25, 2024

[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales #4343

Merged

robertgshaw2-redhat reviewed Apr 25, 2024

comaniac mentioned this pull request Apr 25, 2024

[Misc][Refactor] Generalize linear_method to be quant_method #4373

Merged

comaniac closed this Apr 26, 2024

chu-tianxiang mentioned this pull request May 24, 2024

GPTQ & AWQ Fused MOE #2761

Closed

3 tasks

comaniac deleted the linear_method branch July 15, 2024 23:41
