Fix Gemma loading after quantization or LoRA.

lm_head is not used in vLLM because its weight is tied to embed_tokens. Tools that rebuild the model structure, such as quantization or LoRA, sometimes add a duplicate lm_head layer to the checkpoint. To avoid the resulting load error, skip lm_head.weight when loading weights.
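
To illustrate why a separate lm_head is redundant when weights are tied, here is a minimal sketch in plain Python. It is a toy model, not vLLM code: the matrix values and the helper `compute_logits` are hypothetical, chosen only to show that the output projection can reuse the embedding matrix.

```python
# Toy illustration of weight tying; values and helper names are made up.
# Each row is the embedding vector of one vocabulary token.
embed_tokens = [
    [1.0, 0.0],  # token 0
    [0.0, 1.0],  # token 1
    [1.0, 1.0],  # token 2
]


def compute_logits(hidden):
    """Score every token by the dot product of its embedding row with `hidden`.

    With tied weights this is the whole output projection, so no separate
    lm_head matrix has to be stored or loaded.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in embed_tokens]


logits = compute_logits([2.0, 3.0])
```

Because the projection reads `embed_tokens` directly, an `lm_head.weight` entry in a checkpoint would carry no information the model needs.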
vllm-project · Mar 21, 2024 · 959ceb9
1 parent 4c07dd2
Showing 1 changed file with 5 additions and 0 deletions.
diff --git a/vllm/model_executor/models/gemma.py b/vllm/model_executor/models/gemma.py
@@ -340,6 +340,11 @@ def load_weights(self,
                 weight_loader(param, loaded_weight, shard_id)
                 break
             else:
+                # lm_head is not used in vLLM because its weight is tied to embed_tokens.
+                # Quantization, LoRA, etc. can rebuild the model structure and add a
+                # duplicate lm_head layer, so skip lm_head.weight to avoid a load error.
+                if "lm_head.weight" in name:
+                    continue
                 # Skip loading extra bias for GPTQ models.
                 if name.endswith(".bias") and name not in params_dict:
                     continue
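
The effect of the added check can be sketched with a simplified, standalone loader. This is an assumption-laden toy, not the actual vLLM `load_weights`: the function `load_weights_sketch`, the plain-dict `params`, and the sample checkpoint names are hypothetical, and values are lists rather than tensors.

```python
def load_weights_sketch(params, checkpoint):
    """Copy checkpoint entries into `params`, skipping redundant ones.

    `params` maps parameter names to values; `checkpoint` is an iterable of
    (name, value) pairs. Both are simplified stand-ins for real tensors.
    """
    loaded = []
    for name, value in checkpoint:
        # Tied weights: the model has no lm_head parameter, so skip it.
        if "lm_head.weight" in name:
            continue
        # Skip extra biases that the model does not define.
        if name.endswith(".bias") and name not in params:
            continue
        if name not in params:
            # Any other unknown name is treated as an error in this sketch.
            raise KeyError(name)
        params[name] = value
        loaded.append(name)
    return loaded


params = {"model.embed_tokens.weight": None}
checkpoint = [
    ("model.embed_tokens.weight", [1.0, 2.0]),
    ("lm_head.weight", [1.0, 2.0]),  # duplicate added by a conversion tool
    ("model.layers.0.self_attn.o_proj.bias", [0.0]),  # extra bias
]
loaded = load_weights_sketch(params, checkpoint)
```

Without the `lm_head.weight` check, the second checkpoint entry would reach the name lookup and raise `KeyError` in this sketch, which mirrors the kind of failure the commit message describes.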