【Bug Report】llama v3 70B int4 reasoning abnormal #1638
Comments
Do you encounter the same issue on LLaMA 2-70B?
The current test is on llama 3; llama-2-70B has not been tested before. Could this be related to int4/AWQ? FP16 is normal.
We don't observe such an issue on llama-2-70B with int4-AWQ, and we don't have a llama-3-70B checkpoint right now, so we'd like a baseline model as a reference to help reproduce. Could you give llama-2-70B a try?
I can confirm: I'm seeing the same behavior as @vip-china with the llama-3-70B checkpoint.
I think it makes sense. I already mentioned it in issue #1470 and my comment there.
FYI, here is my script with the vanilla model; it is also public on my Hugging Face (Hugging Face link). TRT-LLM 0.11, main branch, commit f430a4b. Attached: AWQ script, trtllm-build script, run.py, and output.
Hi @vip-china, do you still have any further issue or question? If not, we'll close this soon.
System Info
GPU name: NVIDIA A6000
TensorRT-LLM version: v0.9.0 (main)
transformers version: 0.41.0
Who can help?
@nc
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. Quantization
python3 convert_checkpoint.py \
    --model_dir ./dolphin-2.9-llama3-70b \
    --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4 \
    --tp_size 4 \
    --pp_size 1
1.1 Or, using int4 AWQ:
python ../quantization/quantize.py \
    --model_dir ./dolphin-2.9-llama3-70b \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 \
    --calib_size 32 \
    --tp_size 4
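Before building the engine, it can help to confirm which quantization settings actually landed in the converted checkpoint. Below is a minimal sketch, assuming the checkpoint directory contains a config.json with dtype/quantization/mapping entries (the layout used by recent TRT-LLM checkpoints; the exact keys may differ between releases):

```python
import json
from pathlib import Path

# Checkpoint directory produced by convert_checkpoint.py / quantize.py above.
ckpt_dir = Path("./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4")

# TRT-LLM checkpoints keep their metadata in config.json; the key names below
# are an assumption and may vary across versions.
config = json.loads((ckpt_dir / "config.json").read_text())

print("dtype:        ", config.get("dtype"))
print("quantization: ", config.get("quantization"))   # expect a weight-only int4 / int4 AWQ algo here
print("tp_size:      ", config.get("mapping", {}).get("tp_size"))
```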
2. Build
trtllm-build --checkpoint_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-by --gemm_plugin float16 --max_batch_size 8 --use_custom_all_reduce disable --max_input_len 8192 --max_output_len 4096
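If the build succeeds, it is still worth double-checking that the requested limits were baked into the engine. A small sketch, assuming the engine directory contains a config.json describing the build (the schema differs between TRT-LLM releases, so the keys below are assumptions):

```python
import json

# Engine directory produced by the trtllm-build step above.
with open("./dolphin-2.9-llama3-70b-new-ljf-int4-by/config.json") as f:
    engine_cfg = json.load(f)

# Newer releases nest these under "build_config"; older ones keep them top level.
build_cfg = engine_cfg.get("build_config", engine_cfg)
for key in ("max_batch_size", "max_input_len", "max_output_len"):
    print(key, "=", build_cfg.get(key))
```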
3. Inference
python3 run.py --engine_dir ./llama/dolphin-2.9-llama3-70b-new-ljf-int4-by --tokenizer_dir /tensorrtllm_backend/TensorRT-LLM/examples/llama/dolphin-2.9-llama3-70b --max_output_len 20 --input_text "I lovefrench quiche"
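For reference, the core of what run.py does can be reproduced in a few lines of Python. This is only a hedged sketch based on the ModelRunner API used by examples/run.py around v0.9/v0.10; argument names may differ in other releases, and a tp_size=4 engine must be launched with mpirun just like run.py:

```python
import torch
from transformers import AutoTokenizer

import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner

# For a tp_size=4 engine, launch this with `mpirun -n 4 python3 <script>.py`.
tokenizer = AutoTokenizer.from_pretrained("./dolphin-2.9-llama3-70b")
runner = ModelRunner.from_dir(
    engine_dir="./dolphin-2.9-llama3-70b-new-ljf-int4-by",
    rank=tensorrt_llm.mpi_rank(),
)

prompt = "I love french quiche"
input_ids = torch.tensor(tokenizer.encode(prompt), dtype=torch.int32)

# end_id/pad_id decide when generation stops; a ChatML-tuned model may need
# the <|im_end|> token id here instead of the tokenizer's default EOS.
outputs = runner.generate(
    [input_ids],
    max_new_tokens=20,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
    return_dict=True,
)
# output_ids is [batch, beam, seq] and still contains the prompt tokens.
print(tokenizer.decode(outputs["output_ids"][0][0].tolist(), skip_special_tokens=True))
```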
Expected behavior
A correct, coherent answer is expected.
Actual behavior
When quantized to 4 bits, the answer becomes garbled.


With fp16 precision the answer is not garbled, but generation does not stop.
Additional notes
prompt:
<|im_start|>system
You are gpt, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
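The "unable to stop" symptom at fp16 is consistent with generation stopping on the tokenizer's default EOS token while the model is tuned to emit ChatML's <|im_end|>. Below is a small sketch (plain transformers, no TRT-LLM specifics, with a hypothetical user prompt) for building the prompt in the template above and looking up the id that would need to be passed as the end/stop token:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./dolphin-2.9-llama3-70b")

# Build the prompt exactly as in the ChatML template above.
user_prompt = "I love french quiche"
prompt = (
    "<|im_start|>system\n"
    "You are gpt, a helpful AI assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    f"{user_prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# If this id differs from tokenizer.eos_token_id, generation will not stop
# unless <|im_end|> is supplied as the end/stop token.
im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
print("eos_token_id:", tokenizer.eos_token_id, "| <|im_end|> id:", im_end_id)
```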