[Bug] SLM - `mlc_chat convert_weight` has errors with q4f16_ft quantization #1723
Comments
Hi @dusty-nv, thanks for reporting the issue! Was able to reproduce it on my end; I'm looking into it and will update a fix soon. Meanwhile, as a workaround for
Regarding the runtime issue, I commented on the other issue. If you are using the prebuilt pip package for TVM, updating it should suffice. If you are building from source, updating to head and rebuilding it should work (to be safe, you can …)
Thanks @CharlieFRuan, will try this!
The issue for …
Thanks @CharlieFRuan! q4f16_ft is working again with #1731. However, the q4f16_ft model built with mlc_chat benchmarks ~15% slower than the same one built with mlc_llm.build (measured using …). Do you see similar perf regressions with the mlc_chat SLM model compiler vs mlc_llm.build? It doesn't seem specific to q4f16_ft quantization; the other methods compare similarly with the newer SLM flow.
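For anyone trying to reproduce the comparison: the benchmark tool used above isn't named in this thread, but a minimal sketch of one way to compare the two builds with the mlc_chat Python API of this era (`ChatModule`, `generate`, `stats`) is shown below; the model directory names are assumptions, not paths from the report.

```python
# Minimal throughput-comparison sketch (assumed paths; not the benchmark used in this thread).
# ChatModule.stats() reports prefill/decode tok/s for the most recent generation.
from mlc_chat import ChatModule

MODELS = [
    "./dist/Llama-2-7b-chat-hf-q4f16_ft",      # built with the newer SLM mlc_chat flow (assumed path)
    "./dist/Llama-2-7b-chat-hf-q4f16_ft-old",  # built with the legacy mlc_llm.build flow (assumed path)
]

for model in MODELS:
    cm = ChatModule(model=model)
    cm.generate(prompt="Write a short story about a robot learning to paint.")
    print(model, "->", cm.stats())
```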
Really appreciate reporting the regression @dusty-nv! We'll look into this.
Bumping on this! We have also noticed several performance hits. Another issue is that the Mixtral model build process fails in the same environment (Jetson Orin, AARCH64).
This PR reverts the change introduced in #1731, where we skipped quantizing `lm_head` to avoid the dynamic vocab-size issue; that change led to the performance degradation pointed out in #1723. Instead, we fall back to `GroupQuantizeLinear` for `lm_head`, which preserves performance and still avoids the dynamic vocab-size issue. Performance change on RTX 4090: Prefill throughput **950.792 --> 973.859 tok/s** (7 total tokens); Decode throughput **214.372 --> 223.491 tok/s** (256 total tokens).
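For intuition, the fix boils down to a per-layer choice: keep the FasterTransformer q4f16_ft path for ordinary linear layers, but send `lm_head` (whose output dimension is the vocab size and may be dynamic) through group quantization. A self-contained toy sketch of that dispatch follows; the names other than `GroupQuantizeLinear` are placeholders, not mlc_llm's real quantization API.

```python
# Toy sketch of the per-layer dispatch described in the PR text above.
# FTQuantizeLinear is a stand-in for the FasterTransformer q4f16_ft path;
# only the GroupQuantizeLinear name comes from the PR, and this is not mlc_llm's API.
class FTQuantizeLinear:
    """Stand-in for the FT-quantized linear path (needs static output dims)."""
    def __init__(self, linear):
        self.linear = linear

class GroupQuantizeLinear:
    """Stand-in for the group-quantized fallback (tolerates a dynamic vocab size)."""
    def __init__(self, linear):
        self.linear = linear

def quantize_linear(name: str, linear):
    # lm_head's out_features equals the vocab size, which may be a dynamic shape
    # variable, so it cannot use the FT kernel path; fall back to group quantization.
    if name.endswith("lm_head"):
        return GroupQuantizeLinear(linear)
    return FTQuantizeLinear(linear)

print(type(quantize_linear("model.lm_head", object())).__name__)                 # GroupQuantizeLinear
print(type(quantize_linear("model.layers.0.mlp.down_proj", object())).__name__)  # FTQuantizeLinear
```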
Hi @dusty-nv @srikanthsrnvs! With these 2 PRs that just got merged:
We were able to increase decode from 214 tok/s to ~227 tok/s on RTX 4090 for Llama 2 7B q4f16_ft. We have aligned most of the performance, but there is still a minor gap with the old flow because SLM does not yet have CUDA graph support. We will follow up on that.
Thanks @CharlieFRuan, I've confirmed with
So yeah, that is close, thank you! Please update this topic when the CUDA graphs have been added to … P.S. In this latest build I have started having errors with …
Thanks @CharlieFRuan. I'm curious to know what performance we can expect from Mixtral 8x7B on an Orin vs Mistral 7B. The Mixtral model gives me performance closer to a 13B model when using non-MLC implementations. However, right now I am only getting 7 tokens/s on Mixtral, vs 45 tokens/s with a 7B model. Curious if the CUDA graph is the reason behind this, but I suspect that is just a minor 5% improvement. Any ideas? @dusty-nv, perhaps you've tested Mixtral as well?
Hi @CharlieFRuan, just bumping on this: is Mixtral fully optimized in MLC, or are there still optimizations to be made? Would love it if you could point me in the right direction so I could take a look myself!
Hi, @srikanthsrnvs! Unfortunately, I am not too familiar with Mixtral in MLC. Feel free to open a separate issue for this so that people with more knowledge can respond. Thank you!
Hi @srikanthsrnvs, Mixtral on CUDA should be mostly optimized. Here is another piece we observed recently that might be helpful for performance: #1866 -- we can leverage Thrust sort for the expert selection. After that gets merged, Mixtral should be well optimized, I think. We are happy to dig in more if there is any perf regression or if perf issues pop up.
How do I actually use Thrust? #1866 (comment)
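For reference while the details live in #1866: Thrust is only available when TVM itself was built with Thrust support (the `USE_THRUST` CMake option). A quick way to check whether an installed TVM exposes the Thrust sort kernel is sketched below; the registration name is an assumption based on upstream TVM, so treat #1866 as authoritative.

```python
# Check whether this TVM build registers the Thrust-backed sort
# (the kernel #1866 wants to leverage for expert selection).
# "tvm.contrib.thrust.sort" is the name registered when TVM is built with
# USE_THRUST=ON -- an assumption based on upstream TVM, not stated in this thread.
import tvm

has_thrust = tvm.get_global_func("tvm.contrib.thrust.sort", allow_missing=True) is not None
print("Thrust sort available:", has_thrust)
```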
Thank you @srikanthsrnvs for following up. Given that Thrust is not the focus of this issue, I'm going to close this one. We can follow up on the discussion in #1866.
🐛 Bug
When using the `mlc_chat convert_weight` method on llama-2-7b-chat-hf with q4f16_ft quantization, it throws an exception. The same model/install works with the previous `mlc_llm.build` method using `q4f16_ft`. And `q4f16_1` quantization works with the SLM `mlc_chat` model builder, but later encounters issue #1551 (comment) during runtime.
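Since the exact command line did not survive the copy, here is a reproduction sketch; the checkout path, output directory, and the `--quantization`/`--output` flags are assumptions about the `mlc_chat convert_weight` CLI rather than values taken from this report.

```python
# Reproduction sketch (assumed paths and flags, not copied from the report):
# run the SLM weight-conversion step with q4f16_ft on a local Llama-2-7b-chat-hf checkout.
import subprocess

subprocess.run(
    [
        "mlc_chat", "convert_weight",
        "./Llama-2-7b-chat-hf",                            # local HF model directory (assumed)
        "--quantization", "q4f16_ft",                      # the mode that triggers the exception
        "--output", "./dist/Llama-2-7b-chat-hf-q4f16_ft",  # output directory (assumed)
    ],
    check=True,
)
```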
Environment

- How you installed MLC-LLM (`conda`, source): Source (commit `d840de5`)
- How you installed TVM-Unity (`pip`, source): Source
- TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models):