[Bug] KeyError: 'lm_head.weight' when loading quantized llama 3.2 3B and 1B models #2935
Comments
I am able to load the model by changing the following in the model/config.json file:
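Roughly, this means flipping the tied-embeddings flag so that a separate lm_head parameter gets defined; the exact edit is my reconstruction:

```json
{
  "tie_word_embeddings": false
}
```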
Yeah. This is normal for llama models.
I think that even after quantization, "tie_word_embeddings" should still be true for the model, and that the extra weight is due to an error during quantization. Are you still facing this? I think this parameter should not be changed.
Yes, I agree that the lm_head weights should be tied, but most libraries save both; for example, both AutoGPTQ and llm-compressor do. Publicly available models like neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 also have this. vLLM supports this by skipping lm_head. I don't know what the right approach here is.
Strange. We have a Slack channel #quantization in slack.sglang.ai. Could you join and discuss it there?
Hi @zhaochenyang20, we hit the same error while trying to serve an FP8 Qwen model.
Thanks! We are going to find someone to work on this!
Hi @zhaochenyang20, I'd like to take this.
Hi @Hongbosherlock, I currently set
@guoyaol I think small models should tie it, but I don't know how bad it would be if we don't tie it. You can give it a try, thanks.
Hi @zhaochenyang20, I think these PRs (#3777, #3766) have resolved the problem, and maybe the version of SGLang in the Docker image is not the latest.
Thanks. We need to update the Docker image to pick up these upstream changes.
Checklist
Describe the bug
The issue arises when I try to load quantized versions of the llama 3.2 3B and 1B models. This does not happen with the llama 3.1 8B model. When I launch the quantized model "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8" using the sglang Docker image, the following error is raised. The same model loads properly in vLLM.
The cause seems to be that in the llama 3.2 3B and 1B models, the lm_head weight and the embed_tokens weight are tied, but the quantization libraries store a copy of lm_head during quantization (I tried both AutoGPTQ and llm-compressor). When such a model is loaded, sglang tries to load lm_head.weight, but that parameter does not exist in the model definition because of the tied weights. This raises the error that lm_head.weight is present in the state_dict but not among the defined model parameters.
I have found a related issue in vLLM:
vllm-project/vllm#3553
The following code in vLLM handles this use case:
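A condensed sketch of that check, paraphrased from memory rather than copied from vLLM's Llama weight loader, is below; the real loader does considerably more (stacked-parameter mapping, per-weight loaders), but the relevant guard looks roughly like this:

```python
# Paraphrased sketch of the guard in vLLM's Llama load_weights (not the exact code):
# when the embeddings are tied, the duplicated lm_head.weight found in the
# checkpoint is skipped instead of being looked up as a model parameter.
def load_weights(self, weights):
    params_dict = dict(self.named_parameters())
    for name, loaded_weight in weights:
        if "lm_head.weight" in name and self.config.tie_word_embeddings:
            # No separate lm_head parameter exists; it shares storage with
            # embed_tokens.weight, so the checkpoint copy is ignored.
            continue
        param = params_dict[name]
        param.data.copy_(loaded_weight)
```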
To run the model on sglang:
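For example, something along these lines (the host and port here are illustrative, not the exact command I used):

python3 -m sglang.launch_server --model-path neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 --host 0.0.0.0 --port 30000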
To run the same model on vLLM:
vllm serve neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8
Thanks for the great repo.
Reproduction
Environment
I am using the latest sglang docker image to run the models.