[Bug] AutoAWQ quantization - Cannot open ndarray-cache.json
#1732
Comments
I believe for AWQ you'd still need to go through
This is mainly because these quantizations are somewhat experimental in MLC. I will add the steps to the docs!
OK, I tried a model quantized with llm-awq instead (as in #1229), but I get an error during weight conversion:

$ mlc_chat convert_weight \
models/Llama-2-7b-chat-hf \
--quantization q4f16_autoawq \
--source /data/models/awq/Llama-2-7b-chat-hf/w4-g128-awq.pt \
--source-format awq \
--output Llama-2-7b-chat-hf-q4f16_autoawq
[2024-02-10 05:46:10] INFO auto_config.py:115: Found model configuration: models/Llama-2-7b-chat-hf/config.json
[2024-02-10 05:46:11] INFO auto_device.py:76: Found device: cuda:0
[2024-02-10 05:46:12] INFO auto_device.py:85: Not found device: rocm:0
[2024-02-10 05:46:13] INFO auto_device.py:85: Not found device: metal:0
[2024-02-10 05:46:14] INFO auto_device.py:85: Not found device: vulkan:0
[2024-02-10 05:46:15] INFO auto_device.py:85: Not found device: opencl:0
[2024-02-10 05:46:15] INFO auto_device.py:33: Using device: cuda:0
[2024-02-10 05:46:15] INFO auto_weight.py:70: Finding weights in: /data/models/awq/Llama-2-7b-chat-hf/w4-g128-awq.pt
[2024-02-10 05:46:15] INFO auto_config.py:153: Found model type: llama. Use `--model-type` to override.
Weight conversion with arguments:
--config models/Llama-2-7b-chat-hf/config.json
--quantization AWQQuantize(name='q4f16_autoawq', kind='awq', group_size=128, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', num_elem_per_storage=8, num_storage_per_group=16, max_int_value=7, prebuilt_quantize_func={})
--model-type llama
--device cuda:0
--source /data/models/awq/Llama-2-7b-chat-hf/w4-g128-awq.pt
--source-format awq
--output Llama-2-7b-chat-hf-q4f16_autoawq
[2024-02-10 05:46:15] INFO llama_model.py:51: context_window_size not found in config.json. Falling back to max_position_embeddings (4096)
[2024-02-10 05:46:15] INFO llama_model.py:71: prefill_chunk_size defaults to context_window_size (4096)
[2024-02-10 05:46:23] INFO huggingface_loader.py:182: Loading HF parameters from: /data/models/awq/Llama-2-7b-chat-hf/w4-g128-awq.pt
[2024-02-10 05:46:27] INFO huggingface_loader.py:172: [Not quantized] Parameter: "model.embed_tokens.weight", shape: (32000, 4096), dtype: float16
[2024-02-10 05:46:27] INFO huggingface_loader.py:172: [Not quantized] Parameter: "model.layers.0.self_attn.qkv_proj.qweight", shape: (4096, 1536), dtype: uint32
[2024-02-10 05:46:27] INFO huggingface_loader.py:172: [Not quantized] Parameter: "model.layers.0.self_attn.qkv_proj.qzeros", shape: (4096, 12), dtype: uint32
0% 2/451 [00:00<01:34, 4.73it/s]
Traceback (most recent call last):
File "/usr/local/bin/mlc_chat", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/mlc_chat/__main__.py", line 28, in main
cli.main(sys.argv[2:])
File "/usr/local/lib/python3.10/dist-packages/mlc_chat/cli/convert_weight.py", line 87, in main
convert_weight(
File "/usr/local/lib/python3.10/dist-packages/mlc_chat/interface/convert_weight.py", line 169, in convert_weight
_convert_args(args)
File "/usr/local/lib/python3.10/dist-packages/mlc_chat/interface/convert_weight.py", line 124, in _convert_args
_check_param(name, param)
File "/usr/local/lib/python3.10/dist-packages/mlc_chat/interface/convert_weight.py", line 102, in _check_param
raise ValueError(
ValueError: Parameter model.layers.0.self_attn.qkv_proj.qzeros has shape (4096, 12), but expected [32, 1536]

Are you able to run it? I can provide the AWQ model if that's helpful.
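For anyone hitting a similar shape mismatch, here is a minimal sketch (added for readers, not part of the original report) of how one could inspect the packed tensors inside the llm-awq checkpoint before handing it to `mlc_chat convert_weight`; the checkpoint path is taken from the log above.

```python
# Sketch: print the shapes of the quantization-specific tensors in an llm-awq
# checkpoint. llm-awq typically saves a flat state dict via torch.save; adjust
# if your checkpoint nests the tensors differently.
import torch

ckpt_path = "/data/models/awq/Llama-2-7b-chat-hf/w4-g128-awq.pt"
state_dict = torch.load(ckpt_path, map_location="cpu")

for name, tensor in state_dict.items():
    # Only the packed AWQ tensors matter for the shape check in convert_weight.
    if any(key in name for key in ("qweight", "qzeros", "scales")):
        print(f"{name}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}")
```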
Aha, thanks @CharlieFRuan! I am able to run the flow from #1362. However, it is very slow, only ~13.7 tokens/sec output (whereas Llama-2-7b with q4f16_ft quantization produces 45 tokens/sec). Meanwhile the original llm-awq project runs at ~30 tokens/sec on the same hardware. Any ideas what may be happening?
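For reference, one way to reproduce this kind of decode-speed measurement with the mlc_chat Python API of that era is sketched below; the model name and library path are placeholders, and the exact API may differ between versions.

```python
# Rough benchmark sketch using the mlc_chat ChatModule API (early 2024).
# Both paths below are assumptions; point them at your own compiled outputs.
from mlc_chat import ChatModule

cm = ChatModule(
    model="Llama-2-7b-chat-hf-q4f16_autoawq",                    # assumed gen_config output dir
    model_lib_path="Llama-2-7b-chat-hf-q4f16_autoawq-cuda.so",   # assumed compiled model library
)

# benchmark_generate skips the conversation template and simply decodes tokens,
# which makes it convenient for measuring raw decode throughput.
output = cm.benchmark_generate(prompt="What is the meaning of life?", generate_length=256)
print(output)
print(cm.stats())  # reports prefill/decode tokens per second
```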
I see! Thanks for the report; we'll test the degradation this week!
@dusty-nv I think for AWQ's layout we indeed haven't optimized it well (unlike FasterTransformer, where the issue was due to a disparity with the old flow). We will come back to optimize it in the future.
Hi @dusty-nv
🐛 Bug
When trying to run a llama-7B that was compiled with mlc_chat and q4f16_autoawq quantization, it never makes the ndarray-cache.json file and produces an error when trying to load it at runtime.

This is the script that I used to convert the original Llama-2-7b-chat-hf model to AWQ format first (using the AutoAWQ library):
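The script itself wasn't captured above, but a minimal AutoAWQ conversion along these lines would look roughly like the following; the quant_config values are assumptions inferred from the w4-g128 naming, and the output path is a placeholder.

```python
# Rough sketch of an AutoAWQ 4-bit, group-size-128 conversion (not the original
# script from this report). Paths and quant_config values are assumptions.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "models/Llama-2-7b-chat-hf"       # original HF checkpoint
quant_path = "models/Llama-2-7b-chat-hf-awq"   # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration/quantization, then save the quantized weights and tokenizer.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```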
And then I used mlc_chat convert_weight / gen_config / compile with --quantization=q4f16_autoawq, which was successful. However, it doesn't load at runtime due to the aforementioned issue. I didn't find docs for mlc_chat about using AWQ, so it's unclear if I missed any steps or what the model directory structure should be (or where the weights reside).

This is what the directory of the AutoAWQ model looks like:
And this is what the directory of the compiled model from mlc_chat looks like:
So unlike the models compiled with q4f16_1 or q4f16_ft quantization, the q4f16_autoawq model doesn't have the weights within it or that ndarray-cache.json file. Is there something I am doing wrong?
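A quick way to confirm the difference (a sketch added here for readers, not from the original report) is to check the converted output directory for the artifacts that the working quantizations normally produce; the directory name matches the --output argument used above.

```python
# Quick sanity check: does the q4f16_autoawq output contain the usual weight
# artifacts (ndarray-cache.json plus params_shard_*.bin shards)?
from pathlib import Path

out_dir = Path("Llama-2-7b-chat-hf-q4f16_autoawq")  # matches --output used above
print("ndarray-cache.json present:", (out_dir / "ndarray-cache.json").exists())
shards = sorted(out_dir.glob("params_shard_*.bin"))
print(f"weight shard files found: {len(shards)}")
```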
Environment

- How you installed MLC-LLM (conda, source): source 16aaa30
- How you installed TVM-Unity (pip, source): source https://github.com/mlc-ai/relax/tree/292137088115ac81779607ca223bbbd9ad40cb55
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):