[Bug]: issue with Phi3 mini GPTQ 4Bit/8Bit #6217

Closed

gm3000 opened this issue Jul 8, 2024 · 2 comments

Labels: bug (Something isn't working)

gm3000 commented Jul 8, 2024

Your current environment

PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.26

Python version: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0] (64-bit runtime)
Python platform: Linux-5.10.215-203.850.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: N/A
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA L4
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7R13 Processor
Stepping:            1
CPU MHz:             3655.646
BogoMIPS:            5300.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid

Versions of relevant libraries:
[pip3] numpy==1.26.4
[conda] numpy                     1.26.4          py310hb13e2d6_0    conda-forge
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-3     0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I quantized Phi-3 mini with GPTQ to 4-bit and to 8-bit, but neither quantization works with vLLM 0.5.1. The quantization was done like this:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"

quantization_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset="c4",
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
quant_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, device_map='auto')
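
For completeness, the quantized checkpoint was then saved with the standard transformers save_pretrained API (a minimal sketch; the output directory name is just an example):

# Save the quantized weights and the tokenizer to the same directory
save_dir = "Phi-3-mini-4k-instruct-gptq-4bit"
quant_model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)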

If I load the quantized model with transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0) 
model = AutoModelForCausalLM.from_pretrained(
    "kaitchup/Phi-3-mini-4k-instruct-gptq-4bit",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("kaitchup/Phi-3-mini-4k-instruct-gptq-4bit")
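
Generation with transformers works, e.g. (a minimal sketch; the prompt is only an example):

# Build a chat prompt and generate with the quantized model
messages = [{"role": "user", "content": "What is GPTQ quantization?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))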

If you take a look inside one of the quantized layers:

# model.model.layers[0].self_attn.o_proj.__dict__   
{'training': False,
 '_parameters': OrderedDict(),
 '_buffers': OrderedDict([('qweight',
               tensor([[ 1773757815,  1768328279, -1464370249,  ..., -2039838327,
                         2022213767,  2040039560],
                       [-1821869179,  2055439289,  2022094682,  ..., -1734768743,
                        -2004252536, -2005370488],
                       [-1490392886,  1783199093, -1737979752,  ..., -1701213575,
                        -2005305208, -1736939382],
                       ...,
                       [ 2020075661,  1488582826,  1469745272,  ...,  2031857540,
                        -2056668821,  2006354234],
                       [-1468552842,  2011772828,  1251699099,  ..., -1431862614,
                        -2055685992,  1704302774],
                       [-1231591193, -1696096940, -1984251797,  ..., -1969973080,
                         1989630649,  1773565820]], device='cuda:0', dtype=torch.int32)),
              ('qzeros',
               tensor([[2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
                        2004318071],
                       [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
                        2004318071],
                       [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
                        2004318071],
                       ...,
                       [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
                        2004318071],
                       [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
                        2004318071],
                       [2004318071, 2004318071, 2004318071,  ..., 2004318071, 2004318071,
                        2004318071]], device='cuda:0', dtype=torch.int32)),
              ('scales',
               tensor([[0.0036, 0.0041, 0.0069,  ..., 0.0090, 0.0193, 0.0094],
                       [0.0062, 0.0057, 0.0051,  ..., 0.0055, 0.0103, 0.0052],
                       [0.0079, 0.0066, 0.0082,  ..., 0.0058, 0.0221, 0.0192],
                       ...,
                       [0.0082, 0.0104, 0.0082,  ..., 0.0107, 0.0102, 0.0122],
                       [0.0087, 0.0071, 0.0105,  ..., 0.0082, 0.0081, 0.0068],
                       [0.0089, 0.0058, 0.0119,  ..., 0.0086, 0.0095, 0.0078]],
                      device='cuda:0', dtype=torch.float16)),
              ('g_idx',
               tensor([ 0,  0,  0,  ..., 23, 23, 23], device='cuda:0', dtype=torch.int32))]),
 '_non_persistent_buffers_set': set(),
 '_backward_pre_hooks': OrderedDict(),
 '_backward_hooks': OrderedDict(),
 '_is_full_backward_hook': None,
 '_forward_hooks': OrderedDict(),
 '_forward_hooks_with_kwargs': OrderedDict(),
 '_forward_hooks_always_called': OrderedDict(),
 '_forward_pre_hooks': OrderedDict(),
 '_forward_pre_hooks_with_kwargs': OrderedDict(),
 '_state_dict_hooks': OrderedDict(),
 '_state_dict_pre_hooks': OrderedDict(),
 '_load_state_dict_pre_hooks': OrderedDict(),
 '_load_state_dict_post_hooks': OrderedDict(),
 '_modules': OrderedDict(),
 'infeatures': 3072,
 'outfeatures': 3072,
 'bits': 4,
 'group_size': 128,
 'maxq': 15,
 'bias': None,
 'half_indim': 1536,
 'use_cuda_fp16': False,
 'wf': tensor([[ 0,  4,  8, 12, 16, 20, 24, 28]], dtype=torch.int32),
 'kernel_switch_threshold': 128,
 'autogptq_cuda_available': False,
 'autogptq_cuda': None,
 'trainable': False,
 'device': device(type='meta'),
 '_is_hf_initialized': True}
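
The packed GPTQ buffers themselves look sane; here is a quick (illustrative) way to print their shapes and dtypes for one layer of the model loaded above:

# Inspect the packed GPTQ buffers of a single projection layer
layer = model.model.layers[0].self_attn.o_proj
for name in ("qweight", "qzeros", "scales", "g_idx"):
    buf = getattr(layer, name)
    print(name, tuple(buf.shape), buf.dtype)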

Generation works via model.generate, but loading the model with vLLM fails during weight loading:

# llm = LLM(model="StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit", trust_remote_code=True )
config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]
INFO 07-08 15:25:48 gptq_marlin.py:141] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 07-08 15:25:48 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit', speculative_config=None, tokenizer='StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit, use_v2_block_manager=False, enable_prefix_caching=False)
tokenizer_config.json:   0%|          | 0.00/3.17k [00:00<?, ?B/s]
tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]
added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/569 [00:00<?, ?B/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
generation_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]
INFO 07-08 15:25:50 weight_utils.py:218] Using model weights format ['*.safetensors']
model.safetensors:   0%|          | 0.00/4.11G [00:00<?, ?B/s]
INFO 07-08 15:27:30 weight_utils.py:261] No model.safetensors.index.json found in remote.
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[5], line 1
----> 1 llm = LLM(model="StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit", trust_remote_code=True )

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/entrypoints/llm.py:149, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
    127     raise TypeError(
    128         "There is no need to pass vision-related arguments anymore.")
    129 engine_args = EngineArgs(
    130     model=model,
    131     tokenizer=tokenizer,
   (...)
    147     **kwargs,
    148 )
--> 149 self.llm_engine = LLMEngine.from_engine_args(
    150     engine_args, usage_context=UsageContext.LLM_CLASS)
    151 self.request_counter = Counter()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/engine/llm_engine.py:414, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
    411     executor_class = GPUExecutor
    413 # Create the LLM engine.
--> 414 engine = cls(
    415     **engine_config.to_dict(),
    416     executor_class=executor_class,
    417     log_stats=not engine_args.disable_log_stats,
    418     usage_context=usage_context,
    419 )
    420 return engine

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/engine/llm_engine.py:243, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config, decoding_config, observability_config, executor_class, log_stats, usage_context, stat_loggers)
    237 self.generation_config_fields = _load_generation_config_dict(
    238     model_config)
    240 self.input_processor = INPUT_REGISTRY.create_input_processor(
    241     self.model_config)
--> 243 self.model_executor = executor_class(
    244     model_config=model_config,
    245     cache_config=cache_config,
    246     parallel_config=parallel_config,
    247     scheduler_config=scheduler_config,
    248     device_config=device_config,
    249     lora_config=lora_config,
    250     multimodal_config=multimodal_config,
    251     speculative_config=speculative_config,
    252     load_config=load_config,
    253 )
    255 if not self.model_config.embedding_mode:
    256     self._initialize_kv_caches()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/executor/executor_base.py:42, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config)
     39 self.multimodal_config = multimodal_config
     40 self.speculative_config = speculative_config
---> 42 self._init_executor()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/executor/gpu_executor.py:24, in GPUExecutor._init_executor(self)
     22 self.driver_worker = self._create_worker()
     23 self.driver_worker.init_device()
---> 24 self.driver_worker.load_model()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/worker/worker.py:133, in Worker.load_model(self)
    132 def load_model(self):
--> 133     self.model_runner.load_model()

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/worker/model_runner.py:243, in GPUModelRunnerBase.load_model(self)
    241 def load_model(self) -> None:
    242     with CudaMemoryProfiler() as m:
--> 243         self.model = get_model(
    244             model_config=self.model_config,
    245             device_config=self.device_config,
    246             load_config=self.load_config,
    247             lora_config=self.lora_config,
    248             multimodal_config=self.multimodal_config,
    249             parallel_config=self.parallel_config,
    250             scheduler_config=self.scheduler_config,
    251             cache_config=self.cache_config,
    252         )
    254     self.model_memory_usage = m.consumed_memory
    255     logger.info("Loading model weights took %.4f GB",
    256                 self.model_memory_usage / float(2**30))

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py:21, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, multimodal_config, cache_config)
     14 def get_model(*, model_config: ModelConfig, load_config: LoadConfig,
     15               device_config: DeviceConfig, parallel_config: ParallelConfig,
     16               scheduler_config: SchedulerConfig,
     17               lora_config: Optional[LoRAConfig],
     18               multimodal_config: Optional[MultiModalConfig],
     19               cache_config: CacheConfig) -> nn.Module:
     20     loader = get_model_loader(load_config)
---> 21     return loader.load_model(model_config=model_config,
     22                              device_config=device_config,
     23                              lora_config=lora_config,
     24                              multimodal_config=multimodal_config,
     25                              parallel_config=parallel_config,
     26                              scheduler_config=scheduler_config,
     27                              cache_config=cache_config)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:270, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, multimodal_config, parallel_config, scheduler_config, cache_config)
    266 with torch.device(device_config.device):
    267     model = _initialize_model(model_config, self.load_config,
    268                               lora_config, multimodal_config,
    269                               cache_config)
--> 270 model.load_weights(
    271     self._get_weights_iterator(model_config.model,
    272                                model_config.revision,
    273                                fall_back_to_pt=getattr(
    274                                    model,
    275                                    "fall_back_to_pt_during_load",
    276                                    True)), )
    278 for _, module in model.named_modules():
    279     quant_method = getattr(module, "quant_method", None)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:486, in LlamaForCausalLM.load_weights(self, weights)
    483     param = params_dict[name]
    484     weight_loader = getattr(param, "weight_loader",
    485                             default_weight_loader)
--> 486     weight_loader(param, loaded_weight)
    487 except KeyError:
    488     pass

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py:391, in MergedColumnParallelLinear.weight_loader(self, param, loaded_weight, loaded_shard_id)
    389 if output_dim is None:
    390     if needs_scalar_to_array is not None:
--> 391         param_data, loaded_weight = adjust_scalar_to_fused_array(
    392             param_data, loaded_weight, 0)
    394     assert param_data.shape == loaded_weight.shape
    395     param_data.copy_(loaded_weight)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py:61, in adjust_scalar_to_fused_array(param, loaded_weight, shard_id)
     58 # AutoFP8 scales do not have a shape
     59 # compressed-tensors scales do have a shape
     60 if len(loaded_weight.shape) != 0:
---> 61     assert loaded_weight.shape[0] == 1
     62     loaded_weight = loaded_weight[0]
     64 return param[shard_id], loaded_weight

AssertionError: 

Is this because of the "No model.safetensors.index.json found in remote" message?
Or is this a bug, or am I using it the wrong way?
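
For reference, the minimal call that reproduces the failure (the same one shown in the log above):

from vllm import LLM

# Fails with the AssertionError above on vLLM 0.5.1 (gptq_marlin kernel)
llm = LLM(model="StefanKrsteski/Phi-3-mini-4k-instruct-GPTQ-8bit", trust_remote_code=True)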

gm3000 added the bug label Jul 8, 2024
robertgshaw2-redhat (Collaborator) commented Jul 8, 2024

This is a bug - I will put up a patch

robertgshaw2-redhat (Collaborator) commented

Fixed by #6238

gm3000 closed this as completed Jul 11, 2024