Your current environment

```text
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.26

Python version: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0] (64-bit runtime)
Python platform: Linux-5.10.215-203.850.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: N/A
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA L4
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7R13 Processor
Stepping:            1
CPU MHz:             3655.646
BogoMIPS:            5300.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid

Versions of relevant libraries:
[pip3] numpy==1.26.4
[conda] numpy 1.26.4 py310hb13e2d6_0 conda-forge
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0     X              0-3              0            N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
🐛 Describe the bug

I quantized Phi-3 mini to 4-bit and to 8-bit GPTQ, but neither of them worked with vLLM 0.5.1. The quantization was done like this:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"
quantization_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset="c4",
    desc_act=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
quant_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map='auto',
)
```
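After quantization finishes, the checkpoint can be written out (and optionally pushed to the Hub). A minimal sketch with a hypothetical output path and repo name, included only to connect this step to the Hub checkpoints loaded below:

```python
# Hypothetical output path and repo name; shown only for illustration.
output_dir = "Phi-3-mini-4k-instruct-gptq-4bit"
quant_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
# quant_model.push_to_hub("<your-username>/Phi-3-mini-4k-instruct-gptq-4bit")
# tokenizer.push_to_hub("<your-username>/Phi-3-mini-4k-instruct-gptq-4bit")
```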
If a quantized checkpoint is loaded with Transformers:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)
model = AutoModelForCausalLM.from_pretrained(
    "kaitchup/Phi-3-mini-4k-instruct-gptq-4bit",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("kaitchup/Phi-3-mini-4k-instruct-gptq-4bit")
```
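Loaded this way the model generates text normally (this is the "worked via model.generate" mentioned below). A minimal smoke test; the prompt and generation settings are illustrative, not the exact ones used:

```python
# Illustrative smoke test; the actual prompt and settings may differ.
messages = [{"role": "user", "content": "Summarize GPTQ quantization in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```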
If you take a look inside one of the quantized layers:
```python
# model.model.layers[0].self_attn.o_proj.__dict__
{'training': False,
 '_parameters': OrderedDict(),
 '_buffers': OrderedDict([
     ('qweight', tensor([[ 1773757815,  1768328279, -1464370249, ..., -2039838327,  2022213767,  2040039560],
                         [-1821869179,  2055439289,  2022094682, ..., -1734768743, -2004252536, -2005370488],
                         [-1490392886,  1783199093, -1737979752, ..., -1701213575, -2005305208, -1736939382],
                         ...,
                         [ 2020075661,  1488582826,  1469745272, ...,  2031857540, -2056668821,  2006354234],
                         [-1468552842,  2011772828,  1251699099, ..., -1431862614, -2055685992,  1704302774],
                         [-1231591193, -1696096940, -1984251797, ..., -1969973080,  1989630649,  1773565820]],
                        device='cuda:0', dtype=torch.int32)),
     ('qzeros', tensor([[2004318071, 2004318071, 2004318071, ..., 2004318071, 2004318071, 2004318071],
                        [2004318071, 2004318071, 2004318071, ..., 2004318071, 2004318071, 2004318071],
                        [2004318071, 2004318071, 2004318071, ..., 2004318071, 2004318071, 2004318071],
                        ...,
                        [2004318071, 2004318071, 2004318071, ..., 2004318071, 2004318071, 2004318071],
                        [2004318071, 2004318071, 2004318071, ..., 2004318071, 2004318071, 2004318071],
                        [2004318071, 2004318071, 2004318071, ..., 2004318071, 2004318071, 2004318071]],
                       device='cuda:0', dtype=torch.int32)),
     ('scales', tensor([[0.0036, 0.0041, 0.0069, ..., 0.0090, 0.0193, 0.0094],
                        [0.0062, 0.0057, 0.0051, ..., 0.0055, 0.0103, 0.0052],
                        [0.0079, 0.0066, 0.0082, ..., 0.0058, 0.0221, 0.0192],
                        ...,
                        [0.0082, 0.0104, 0.0082, ..., 0.0107, 0.0102, 0.0122],
                        [0.0087, 0.0071, 0.0105, ..., 0.0082, 0.0081, 0.0068],
                        [0.0089, 0.0058, 0.0119, ..., 0.0086, 0.0095, 0.0078]],
                       device='cuda:0', dtype=torch.float16)),
     ('g_idx', tensor([ 0,  0,  0, ..., 23, 23, 23], device='cuda:0', dtype=torch.int32))]),
 '_non_persistent_buffers_set': set(),
 '_backward_pre_hooks': OrderedDict(),
 '_backward_hooks': OrderedDict(),
 '_is_full_backward_hook': None,
 '_forward_hooks': OrderedDict(),
 '_forward_hooks_with_kwargs': OrderedDict(),
 '_forward_hooks_always_called': OrderedDict(),
 '_forward_pre_hooks': OrderedDict(),
 '_forward_pre_hooks_with_kwargs': OrderedDict(),
 '_state_dict_hooks': OrderedDict(),
 '_state_dict_pre_hooks': OrderedDict(),
 '_load_state_dict_pre_hooks': OrderedDict(),
 '_load_state_dict_post_hooks': OrderedDict(),
 '_modules': OrderedDict(),
 'infeatures': 3072,
 'outfeatures': 3072,
 'bits': 4,
 'group_size': 128,
 'maxq': 15,
 'bias': None,
 'half_indim': 1536,
 'use_cuda_fp16': False,
 'wf': tensor([[ 0,  4,  8, 12, 16, 20, 24, 28]], dtype=torch.int32),
 'kernel_switch_threshold': 128,
 'autogptq_cuda_available': False,
 'autogptq_cuda': None,
 'trainable': False,
 'device': device(type='meta'),
 '_is_hf_initialized': True}
```
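The dump shows the usual AutoGPTQ layout: 4-bit weights packed eight per int32 in qweight (wf holds the bit offsets 0, 4, ..., 28), packed per-group zero points in qzeros, fp16 scales per group of 128 input channels, and g_idx mapping each input channel to its group (0 to 23 for 3072 inputs). A rough reference sketch of how such tensors could be unpacked and dequantized, written from the dump above and not taken from vLLM's or AutoGPTQ's actual kernels:

```python
import torch

def dequantize_gptq(qweight, qzeros, scales, g_idx, bits=4):
    """Reference unpacking of int32-packed GPTQ tensors to an fp16 weight matrix.

    Returns a [in_features, out_features] matrix (the transpose of the usual
    nn.Linear weight layout). Sketch only, not an optimized kernel.
    """
    shifts = torch.arange(0, 32, bits, device=qweight.device, dtype=torch.int32)  # [0, 4, ..., 28]
    mask = (1 << bits) - 1  # 0xF for 4-bit

    # qweight: [in_features // (32 // bits), out_features] -> [in_features, out_features]
    w = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & mask
    w = w.reshape(-1, qweight.shape[1])

    # qzeros: [n_groups, out_features // (32 // bits)] -> [n_groups, out_features]
    # (AutoGPTQ historically stores zero points minus one, hence the +1;
    #  0x77777777 in the dump unpacks to 7, i.e. a zero point of 8.)
    z = (qzeros.unsqueeze(2) >> shifts.view(1, 1, -1)) & mask
    z = z.reshape(qzeros.shape[0], -1) + 1

    g = g_idx.long()  # group index per input channel
    return (w.to(torch.float16) - z[g].to(torch.float16)) * scales[g]
```

For the 3072x3072 o_proj above, this turns qweight [384, 3072], qzeros [24, 384] and scales [24, 3072] back into a [3072, 3072] fp16 matrix.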
It works via model.generate, but when the same checkpoint is loaded with vLLM, it fails during weight loading:
```text
# llm = LLM(model="StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit", trust_remote_code=True )
config.json: 0%| | 0.00/1.58k [00:00<?, ?B/s]
INFO 07-08 15:25:48 gptq_marlin.py:141] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 07-08 15:25:48 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit', speculative_config=None, tokenizer='StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit, use_v2_block_manager=False, enable_prefix_caching=False)
tokenizer_config.json: 0%| | 0.00/3.17k [00:00<?, ?B/s]
tokenizer.model: 0%| | 0.00/500k [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/1.84M [00:00<?, ?B/s]
added_tokens.json: 0%| | 0.00/293 [00:00<?, ?B/s]
special_tokens_map.json: 0%| | 0.00/569 [00:00<?, ?B/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
generation_config.json: 0%| | 0.00/172 [00:00<?, ?B/s]
INFO 07-08 15:25:50 weight_utils.py:218] Using model weights format ['*.safetensors']
model.safetensors: 0%| | 0.00/4.11G [00:00<?, ?B/s]
INFO 07-08 15:27:30 weight_utils.py:261] No model.safetensors.index.json found in remote.
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[5], line 1
----> 1 llm = LLM(model="StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit", trust_remote_code=True )

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/entrypoints/llm.py:149, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
    127     raise TypeError(
    128         "There is no need to pass vision-related arguments anymore.")
    129 engine_args = EngineArgs(
    130     model=model,
    131     tokenizer=tokenizer,
   (...)
    147     **kwargs,
    148 )
--> 149 self.llm_engine = LLMEngine.from_engine_args(
    150     engine_args, usage_context=UsageContext.LLM_CLASS)
    151 self.request_counter = Counter()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/engine/llm_engine.py:414, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
    411     executor_class = GPUExecutor
    413 # Create the LLM engine.
--> 414 engine = cls(
    415     **engine_config.to_dict(),
    416     executor_class=executor_class,
    417     log_stats=not engine_args.disable_log_stats,
    418     usage_context=usage_context,
    419 )
    420 return engine

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/engine/llm_engine.py:243, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config, decoding_config, observability_config, executor_class, log_stats, usage_context, stat_loggers)
    237 self.generation_config_fields = _load_generation_config_dict(
    238     model_config)
    240 self.input_processor = INPUT_REGISTRY.create_input_processor(
    241     self.model_config)
--> 243 self.model_executor = executor_class(
    244     model_config=model_config,
    245     cache_config=cache_config,
    246     parallel_config=parallel_config,
    247     scheduler_config=scheduler_config,
    248     device_config=device_config,
    249     lora_config=lora_config,
    250     multimodal_config=multimodal_config,
    251     speculative_config=speculative_config,
    252     load_config=load_config,
    253 )
    255 if not self.model_config.embedding_mode:
    256     self._initialize_kv_caches()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/executor/executor_base.py:42, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config)
     39 self.multimodal_config = multimodal_config
     40 self.speculative_config = speculative_config
---> 42 self._init_executor()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/executor/gpu_executor.py:24, in GPUExecutor._init_executor(self)
     22 self.driver_worker = self._create_worker()
     23 self.driver_worker.init_device()
---> 24 self.driver_worker.load_model()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/worker/worker.py:133, in Worker.load_model(self)
    132 def load_model(self):
--> 133     self.model_runner.load_model()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/worker/model_runner.py:243, in GPUModelRunnerBase.load_model(self)
    241 def load_model(self) -> None:
    242     with CudaMemoryProfiler() as m:
--> 243         self.model = get_model(
    244             model_config=self.model_config,
    245             device_config=self.device_config,
    246             load_config=self.load_config,
    247             lora_config=self.lora_config,
    248             multimodal_config=self.multimodal_config,
    249             parallel_config=self.parallel_config,
    250             scheduler_config=self.scheduler_config,
    251             cache_config=self.cache_config,
    252         )
    254     self.model_memory_usage = m.consumed_memory
    255     logger.info("Loading model weights took %.4f GB",
    256                 self.model_memory_usage / float(2**30))

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py:21, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, multimodal_config, cache_config)
     14 def get_model(*, model_config: ModelConfig, load_config: LoadConfig,
     15               device_config: DeviceConfig, parallel_config: ParallelConfig,
     16               scheduler_config: SchedulerConfig,
     17               lora_config: Optional[LoRAConfig],
     18               multimodal_config: Optional[MultiModalConfig],
     19               cache_config: CacheConfig) -> nn.Module:
     20     loader = get_model_loader(load_config)
---> 21     return loader.load_model(model_config=model_config,
     22                              device_config=device_config,
     23                              lora_config=lora_config,
     24                              multimodal_config=multimodal_config,
     25                              parallel_config=parallel_config,
     26                              scheduler_config=scheduler_config,
     27                              cache_config=cache_config)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:270, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, multimodal_config, parallel_config, scheduler_config, cache_config)
    266 with torch.device(device_config.device):
    267     model = _initialize_model(model_config, self.load_config,
    268                               lora_config, multimodal_config,
    269                               cache_config)
--> 270     model.load_weights(
    271         self._get_weights_iterator(model_config.model,
    272                                    model_config.revision,
    273                                    fall_back_to_pt=getattr(
    274                                        model,
    275                                        "fall_back_to_pt_during_load",
    276                                        True)), )
    278 for _, module in model.named_modules():
    279     quant_method = getattr(module, "quant_method", None)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:486, in LlamaForCausalLM.load_weights(self, weights)
    483     param = params_dict[name]
    484     weight_loader = getattr(param, "weight_loader",
    485                             default_weight_loader)
--> 486     weight_loader(param, loaded_weight)
    487 except KeyError:
    488     pass

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py:391, in MergedColumnParallelLinear.weight_loader(self, param, loaded_weight, loaded_shard_id)
    389 if output_dim is None:
    390     if needs_scalar_to_array is not None:
--> 391         param_data, loaded_weight = adjust_scalar_to_fused_array(
    392             param_data, loaded_weight, 0)
    394 assert param_data.shape == loaded_weight.shape
    395 param_data.copy_(loaded_weight)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py:61, in adjust_scalar_to_fused_array(param, loaded_weight, shard_id)
     58 # AutoFP8 scales do not have a shape
     59 # compressed-tensors scales do have a shape
     60 if len(loaded_weight.shape) != 0:
---> 61     assert loaded_weight.shape[0] == 1
     62     loaded_weight = loaded_weight[0]
     64 return param[shard_id], loaded_weight

AssertionError:
```
Is it because of the "No model.safetensors.index.json found in remote" message? Is this a bug, or am I using it the wrong way?
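For what it's worth, model.safetensors.index.json only exists for sharded checkpoints, so its absence alone is expected for a single-file model; the failure itself happens inside MergedColumnParallelLinear.weight_loader / adjust_scalar_to_fused_array while loading a fused layer, per the traceback. One way to see which checkpoint tensors reach that path is to print the shapes of the fused-projection tensors straight from the safetensors file. The local path below is hypothetical (download the repo first, e.g. with huggingface_hub.snapshot_download):

```python
# Hypothetical local path to the downloaded checkpoint.
from safetensors import safe_open

path = "Phi-3-mini-4k-instruct-GPTQ-8bit/model.safetensors"
with safe_open(path, framework="pt") as f:
    for name in sorted(f.keys()):
        # Phi-3 uses fused qkv_proj / gate_up_proj modules; those are the ones
        # routed through the merged-column weight loader in the traceback.
        if "qkv_proj" in name or "gate_up_proj" in name:
            print(name, tuple(f.get_tensor(name).shape))
```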
This is a bug - I will put up a patch
Fixed by #6238