[RFC][Refactor] Generalize linear_method to be quant_method #4342
Conversation
I love the overall design, great job! Some comments: I noticed there is an asymmetry in where the quantized linear layers live.
If others agree on this design, then in order to get it merged, we should split the PR up into the linear_method -> quant_config refactor (which hopefully will be trivial to verify, even though most of the changes are related to it) and then the MoE changes, and also possibly split out the change to use
Makes sense. I also spent some time thinking about where to put them. I also agree to revert the MoE changes in this PR to keep it concise once the design is aligned. The reason I included the MoE change is to demonstrate (and test) how this mechanism can be used beyond linear layers.
Thanks for this.
+1 to @pcmoritz on splitting this up
A) Question
Is the motivation for this that we would ultimately use the QuantConfig in the constructor of, for instance, the activation function of the MLP, as opposed to just passing it along to RowParallelLinear and ColumnParallelLinear?
The reason I ask is that LinearMethod already has the QuantizationConfig as a member, so this refactor does nothing in the current state. But I see why passing the config around could be useful if we start passing it to other types (since passing a LinearMethod to an activation function constructor would be funny :)
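To illustrate the contrast, here is a rough sketch (not the vLLM API; the class names and signatures below are simplified stand-ins) of a non-linear module taking the config and asking it for a method, rather than taking a LinearMethod:

```python
from typing import Optional

import torch
import torch.nn as nn


class QuantConfig:
    """Illustrative stand-in for a quantization config object."""

    def get_quantize_method(self, layer: nn.Module):
        # Decide how (or whether) this particular module should be quantized.
        return None  # None here means "run unquantized"


class SiluAndMul(nn.Module):
    """Activation module that takes the config, not a LinearMethod."""

    def __init__(self, quant_config: Optional[QuantConfig] = None):
        super().__init__()
        # The module asks the config how it should behave; a LinearMethod
        # would make no sense in this constructor.
        self.quant_method = (
            quant_config.get_quantize_method(self) if quant_config else None
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = x.shape[-1] // 2
        return nn.functional.silu(x[..., :d]) * x[..., d:]
```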
B) Suggestion
I think we need to pass the name of the layer to get_quantize_method(). This would allow the QuantizationConfig to make different decisions about how to configure the LinearMethod on a layer-by-layer basis, which would enable:
- Supporting different output datatypes (e.g. if we wanted to dequantize after the SiLUandMul for MLP, but wanted to dequantize after the GEMM in the QKV)
- Supporting non-uniform quantization (e.g. channelwise in only the most sensitive layers)
At present, we would have no way to express this. If we pass around the layer_name from the state_dict, the QuantizationConfig can make a decision about how a specific layer should be set up. Here's an example of this in our W8A8 prototype.
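A rough sketch of the suggested signature (W8A8Config, the two method classes, and the example layer name are illustrative assumptions, not the linked prototype):

```python
from typing import Set

import torch.nn as nn


class TensorwiseW8A8LinearMethod:
    """Placeholder: a single scale per tensor."""


class ChannelwiseW8A8LinearMethod:
    """Placeholder: one scale per output channel (more accurate, costlier)."""


class W8A8Config:
    """Hypothetical config that makes per-layer decisions by name."""

    def __init__(self, sensitive_layers: Set[str]):
        self.sensitive_layers = sensitive_layers

    def get_quantize_method(self, layer: nn.Module, layer_name: str):
        # The extra layer_name argument lets the config vary the scheme
        # (or the output dtype) for specific layers in the state_dict.
        if layer_name in self.sensitive_layers:
            return ChannelwiseW8A8LinearMethod()
        return TensorwiseW8A8LinearMethod()


# Example: keep channelwise scales only in the most sensitive projection.
config = W8A8Config(sensitive_layers={"model.layers.0.self_attn.qkv_proj"})
```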
C) Other Idea (for future PR)
I think we should also refactor how the weight_loading logic works.
Right now, we put a bunch of data in a dictionary on the weight, but there is no consistent interface and things are getting very hacky (packed_dim is an example of this). This might be a good time to revisit the interface between the Parameters and the Layers. Happy to take this on.
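One possible shape for that interface, sketched with hypothetical names (QuantParameter and its fields are illustrative, not existing vLLM code):

```python
from typing import Optional

import torch
import torch.nn as nn


class QuantParameter(nn.Parameter):
    """Hypothetical parameter that carries its own loading metadata,
    instead of stashing fields like packed_dim in ad-hoc attributes."""

    def __new__(cls,
                data: torch.Tensor,
                packed_dim: Optional[int] = None,
                output_dim: Optional[int] = None):
        param = super().__new__(cls, data, requires_grad=False)
        param.packed_dim = packed_dim
        param.output_dim = output_dim
        return param

    def load(self, loaded_weight: torch.Tensor) -> None:
        # One consistent entry point for the weight loader, so the layer
        # no longer needs per-parameter special cases.
        self.data.copy_(loaded_weight)


# Usage: the layer declares intent once, the loader just calls .load().
weight = QuantParameter(torch.empty(8, 4), packed_dim=0)
weight.load(torch.ones(8, 4))
```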
Thanks for the review and valuable feedback. Since there's no strong objection to the current design, I'll split this PR as suggested and make it ready for review. Other add-ons can be covered in future PRs.
Your understanding is 100% correct, and passing the quant config to other module types, not just linear modules, is exactly the motivation of this PR. Meanwhile, this refactor is compatible with the current linear_method, meaning that it does nothing (neither good nor bad) in the current state.
This is a point I'm aware of. For example, in Mixtral MoE we want to disable FP8 in the QKV and gate linears for now due to a performance regression, but we would like to enable FP8 in the fused MoE. The current way of doing this is ad hoc, so I was hoping to do the same thing you suggested.
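For example, we could have something like the sketch below, where the config decides per layer whether to apply FP8 (Fp8Config and the suffix list here are illustrative placeholders, not the actual implementation):

```python
import torch.nn as nn


class UnquantizedLinearMethod:
    """Placeholder: plain FP16/BF16 GEMM."""


class Fp8LinearMethod:
    """Placeholder: FP8 GEMM."""


class Fp8Config:
    """Hypothetical config that skips FP8 for selected submodules."""

    # Submodules where FP8 currently regresses performance.
    _SKIP_SUFFIXES = ("qkv_proj", "gate")

    def get_quantize_method(self, layer: nn.Module, layer_name: str):
        if layer_name.endswith(self._SKIP_SUFFIXES):
            return UnquantizedLinearMethod()  # keep these in FP16/BF16
        return Fp8LinearMethod()              # e.g. the fused MoE weights
```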
Agree on this point. From my personal perspective, it would be much better to refactor the weight loading logic so that we can load all weights (e.g., q, k and v) of a module at the same time. In this way, we don't need
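A tiny sketch of that load-everything-at-once idea (load_fused_qkv and the state-dict key pattern are hypothetical, not existing vLLM code):

```python
from typing import Dict

import torch
import torch.nn as nn


def load_fused_qkv(layer: nn.Module, state_dict: Dict[str, torch.Tensor],
                   prefix: str) -> None:
    """Hypothetical loader that hands a module all of its q/k/v shards at
    once, instead of three separate weight_loader calls with shard ids."""
    shards = [state_dict[f"{prefix}.{name}.weight"]
              for name in ("q_proj", "k_proj", "v_proj")]
    # The module decides how to fuse them; here, simple concatenation
    # along the output dimension.
    layer.weight.data.copy_(torch.cat(shards, dim=0))
```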
Closed as #4373 is merged. MoE refactoring will be in another follow-up PR.
Motivation: In most quantization methodologies, we only focus on linear layer quantization. Thus, current vLLM has linear_method, which allows each quantization method to customize the behavior of creating and applying weights. However, things get more complicated with W8A8 and FP8 quantization, because we may want to cover more modules.

In this PR: This PR and RFC propose an approach that lets not only linear layers but all modules customize the logic of 1) weight loading and 2) kernel dispatching for different quantization methods, including FP8.

Specifically, this PR makes the following changes:
- Instead of linear_method, we now pass quant_config when initializing the model.
- Each module calls quant_config.get_quantize_method(self) to get its quantization method, and uses quant_method.create_weights to create weights.
- The quant_method is kept in each module, and process_weight_after_loading is invoked to post-process weights for the different methods (e.g., transpose for GEMM in FP8).
- Each module calls self.quant_method.apply(self, ...) to dispatch the computation to the configured method.

Note that this approach has another benefit in flexibility: quant_config can easily be extended to give more fine-grained control over each module. For example, we can use quant_config to control the output data type of each module to avoid duplicated quantization/de-quantization overheads. This will be implemented in follow-up PRs.

I've tested this PR with Mistral-7B (TP1 and TP8) and Mixtral-7B (TP8). I didn't evaluate the performance as it shouldn't have any impact, but I'll still check it once the design is aligned.
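The flow above, condensed into a runnable sketch (signatures are simplified and the classes are stand-ins for the ones in the PR; process_weight_after_loading would normally be called by the model loader):

```python
from typing import Optional

import torch
import torch.nn as nn


class QuantizeMethodBase:
    """Simplified method object: one per module."""

    def create_weights(self, layer: nn.Module, input_size: int,
                       output_size: int) -> None:
        layer.weight = nn.Parameter(torch.empty(output_size, input_size),
                                    requires_grad=False)

    def process_weight_after_loading(self, layer: nn.Module) -> None:
        # e.g. transpose/repack the weight for an FP8 GEMM kernel.
        pass

    def apply(self, layer: nn.Module, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, layer.weight)


class QuantConfig:
    """Simplified config: hands out a quantize method for any module."""

    def get_quantize_method(self, layer: nn.Module) -> QuantizeMethodBase:
        return QuantizeMethodBase()


class LinearLayer(nn.Module):
    def __init__(self, input_size: int, output_size: int,
                 quant_config: Optional[QuantConfig] = None):
        super().__init__()
        # 1) The module asks the config for its own quantize method.
        self.quant_method = (quant_config or QuantConfig()).get_quantize_method(self)
        # 2) The method creates the (possibly quantized) weights on the module.
        self.quant_method.create_weights(self, input_size, output_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 3) The method dispatches the computation for the configured scheme.
        return self.quant_method.apply(self, x)
```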
cc @pcmoritz @robertgshaw2-neuralmagic @tlrmchlsmth @simon-mo @WoosukKwon @zhuohan123
PR Checklist
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- [Bugfix] for bug fixes.
- [CI/Build] for build or continuous integration improvements.
- [Doc] for documentation fixes and improvements.
- [Model] for adding a new model or improving an existing model. Model name should appear in the title.
- [Frontend] for changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.).
- [Kernel] for changes affecting CUDA kernels or other compute kernels.
- [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.).
- [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
- [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:
- Use format.sh to format your code.
- Add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:
- An action-required label will be put on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!