cannot load some models via vllm #1268
models such as
I do see vLLM support
I have also encountered this issue. I'm trying to run a LoRA adapter on top of the base model (unsloth/Qwen2.5-7B-bnb-4bit), but it doesn't seem to work. I was led to believe vLLM was the way to go for multi-LoRA on small models.
should I always set
Yes, for bitsandbytes models use:

```python
from vllm import LLM
import torch

# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
```
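For reference, a quick sanity check on the resulting `llm` object might look like the sketch below (the prompt and sampling settings are illustrative, not from the thread):

```python
from vllm import SamplingParams

# Illustrative only: generate a short completion to confirm the model loaded.
sampling = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], sampling)
print(outputs[0].outputs[0].text)
```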
@danielhanchen Hi, I have to reopen this issue; could you take a look at it? Thanks. Here is my loading code:
```python
import torch
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# `args` comes from argparse (llm_name, sft_path, sft_max_len, quant,
# gpu_memory_utilization); parsing is omitted here.
llm = LLM(
    model=args.llm_name,
    dtype="float16",
    max_model_len=args.sft_max_len if args.sft_max_len else None,
    tensor_parallel_size=torch.cuda.device_count(),
    # pipeline_parallel_size=torch.cuda.device_count(),
    gpu_memory_utilization=args.gpu_memory_utilization,
    # seed=None,
    trust_remote_code=True,
    quantization="bitsandbytes" if args.quant or "bnb-4bit" in args.llm_name else None,
    load_format="bitsandbytes" if args.quant or "bnb-4bit" in args.llm_name else "auto",
    enforce_eager=True,
    enable_lora=bool(args.sft_path),
    tokenizer_mode="mistral" if args.llm_name.startswith("mistralai") else "auto",
    cpu_offload_gb=0 if args.quant or "bnb-4bit" in args.llm_name else 16,
    swap_space=16,
)
```
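The snippet above enables LoRA but does not show how the adapter is passed at generation time. A minimal sketch, assuming `args.sft_path` points to a saved adapter directory (the adapter name "sft_adapter" and the sampling settings are illustrative):

```python
# Illustrative sketch: attach the adapter per request via LoRARequest.
# "sft_adapter" and the integer id are arbitrary labels, not from the thread.
sampling = SamplingParams(temperature=0.0, max_tokens=128)
lora_req = LoRARequest("sft_adapter", 1, args.sft_path) if args.sft_path else None
outputs = llm.generate(["Hello, how are you?"], sampling, lora_request=lora_req)
print(outputs[0].outputs[0].text)
```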
Yes, I always get this issue on the Qwen models too. It's also present on the unsloth/Qwen2.5-3B-bnb-4bit version as well as the 7B; I can confirm the error across all the Qwen models when trying to run inference on vLLM. I'm assuming I'm on the same version, since I'm working from a fresh install via "pip install vllm". For now I've had to switch to the unsloth/Llama-3.2-3B-bnb-4bit model, as I couldn't find a fix. If anyone finds a way to get it to work, please let me know! I'd love to be able to switch back to the fine-tuned LoRA on top of the 4-bit Qwen 2.5 base model.
@JJEccles Maybe you can use the original model. @danielhanchen, correct me if I am wrong.
I will look into it, thanks!
@yananchen1989 @JJEccles BitsAndBytes Qwen2.5 models are not supported on the latest vLLM release (v0.6.3.post1) as of today. However, they will function correctly once a new version is released; you can refer to this already-merged pull request.

If you want to use BitsAndBytes with Qwen2.5 immediately, you can install the latest vLLM version using the following command:

pip install git+https://github.com/vllm-project/vllm.git

I have tested both Qwen2.5 7B Instruct and Qwen 14B Instruct with BNB quantization, and they worked correctly. If you're using the Docker
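For what it's worth, a minimal sketch of loading one of the Qwen2.5 BnB checkpoints once a build containing that fix is installed (the model name comes from this thread; the other settings mirror earlier comments and are assumptions, not a verified configuration):

```python
from vllm import LLM, SamplingParams

# Assumes a vLLM build that includes the merged Qwen2.5 BitsAndBytes support.
llm = LLM(
    model="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",  # pre-quantized checkpoint from the thread
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    enforce_eager=True,  # mirrors the settings used earlier in the thread
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```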
@danielhanchen I am facing a similar issue with the Llama 3.2 Vision Instruct model; I have opened an issue in vLLM here. The expected weight shapes do not match the loaded weights. I fine-tuned using unsloth 2025.1.1 and vLLM 0.6.6.
Here is the summary:

- unsloth/mistral-7b-v0.3-bnb-4bit, with error: KeyError: 'layers.0.mlp.down_proj.weight'
- unsloth/Qwen2.5-7B-Instruct-bnb-4bit, with error: KeyError: 'layers.0.mlp.down_proj.weight'
- unsloth/Llama-3.2-1B-Instruct-bnb-4bit, with error: KeyError: 'layers.0.mlp.down_proj.weight'

Here is the code:

Here is the environment info: