
[Bug Report] llama v3 70B int4 reasoning abnormal #1638

Closed
3 of 4 tasks
vip-china opened this issue May 21, 2024 · 6 comments
Labels
bug Something isn't working stale

Comments

@vip-china

System Info

GPU name (NVIDIA A6000)
TensorRT-LLM tag (v0.9.0 main)
transformers version (0.41.0)

Who can help?

@nc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

1. quantization
python3 convert_checkpoint.py --model_dir ./dolphin-2.9-llama3-70b \
    --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4 \
    --tp_size 4 \
    --pp_size 1

1.1 or
python ../quantization/quantize.py --model_dir ./dolphin-2.9-llama3-70b \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 \
    --calib_size 32 \
    --tp_size 4
2. build
trtllm-build --checkpoint_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-by --gemm_plugin float16 --max_batch_size 8 --use_custom_all_reduce disable --max_input_len 8192 --max_output_len 4096

3. inference
python3 run.py --engine_dir ./llama/dolphin-2.9-llama3-70b-new-ljf-int4-by --tokenizer_dir /tensorrtllm_backend/TensorRT-LLM/examples/llama/dolphin-2.9-llama3-70b --max_output_len 20 --input_text "I lovefrench quiche"

Expected behavior

A correct answer is expected.

actual behavior

When quantized to 4 bits, the answer becomes garbled:
[screenshots showing garbled int4 output]

With FP16 precision the output is not garbled, but generation does not stop.

additional notes

prompt:
<|im_start|>system
You are gpt, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
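
For reference, a minimal sketch in Python (assuming the Hugging Face transformers AutoTokenizer API, and assuming <|im_end|> is registered as a special token in this tokenizer) of how the ChatML template above is assembled and how the <|im_end|> token id can be looked up, e.g. to pass it as an end/stop token so that the FP16 run stops instead of generating until max_output_len:

from transformers import AutoTokenizer

MODEL_DIR = "./dolphin-2.9-llama3-70b"  # model path from the reproduction steps above

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

def build_chatml_prompt(user_prompt: str) -> str:
    # Same ChatML template as shown in the additional notes.
    return (
        "<|im_start|>system\n"
        "You are gpt, a helpful AI assistant.<|im_end|>\n"
        "<|im_start|>user\n"
        f"{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("I lovefrench quiche")
# Assumption: <|im_end|> exists in this tokenizer's vocabulary as a special token.
im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
print(prompt)
print("eos token id:", tokenizer.eos_token_id, "| <|im_end|> id:", im_end_id)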

vip-china added the bug label on May 21, 2024
@byshiue
Collaborator

byshiue commented May 23, 2024

Do you encounter the same issue on LLaMA 2-70B?

@vip-china
Author

Do you encounter the same issue on LLaMA 2-70B?

The current test is llama3; llama-2-70B has not been tested yet. Is this related to int4/awq? FP16 is normal.

@byshiue
Collaborator

byshiue commented May 27, 2024

We don't observe this issue on llama-2-70B with int4-awq, and we don't have a llama-3-70B checkpoint right now, so we hope to use a baseline model as a reference to help with reproducing. Could you give llama-2-70B a try?

@smehta2000

I can confirm that I'm seeing the same behavior as @vip-china with the llama-3-70B checkpoint.

@matichon-vultureprime
Contributor

I think it makes sense.
This is my script with vanilla Llama3-70b.

I already mentioned this in issue #1470 and my comment there.

Quantization for Llama3 is a bit different. Since the model was trained on a huge number of tokens (>15T), it seems like RTN no longer works.

So, I found a strange phenomenon where RTN-int8 yielded worse output than AWQ (W4A16); FP8 also performed better than RTN-int8.

The llama.cpp community also raised the same problem when quantizing the Llama3 model around 4 days ago.
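
To illustrate why, here is a minimal sketch (plain PyTorch, an illustration only, not TensorRT-LLM code) of naive per-channel round-to-nearest weight-only quantization (int4 shown here): a few outlier channels inflate the absmax scale, so the remaining weights collapse onto only a handful of the 16 int4 levels, which is the kind of error AWQ's activation-aware per-channel rescaling is designed to reduce.

import torch

def rtn_int4_quantize(w: torch.Tensor) -> torch.Tensor:
    # Per-output-channel absmax scale mapped onto the symmetric int4 range [-8, 7].
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), min=-8, max=7)
    return q * scale  # dequantized (fake-quantized) weights

torch.manual_seed(0)
w = torch.randn(1024, 1024) * 0.02
w[:, :4] *= 50.0  # a few outlier columns inflate every row's absmax scale
w_q = rtn_int4_quantize(w)
rel_err = (w - w_q).norm() / w.norm()
print(f"relative RTN int4 error: {rel_err:.4f}")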

FYI, this is my script with the vanilla model, and it is public on my Hugging Face as well (Huggingface link).

TRT-LLM 0.11 Main branch. f430a4b
DGX-H100

AWQ script.

python  ../quantization/quantize.py --model_dir /root/.cache/huggingface/hub/models--casperhansen--llama-3-70b-fp16/snapshots/c8647dcc2296eb8d763645645ebda784da16141a \
                                         --dtype float16 \
                                         --qformat int4_awq \
                                         --awq_block_size 64 \
                                         --output_dir ./quantized-llama3-70b-awq-w4a16-gs64 \
                                         --batch_size 32 \
                                         --tp_size 4 \
                                         --calib_size 512

trtllm-build script.

trtllm-build --checkpoint_dir ./quantized-llama3-70b-awq-w4a16-gs64 \
             --output_dir ./llama3-70b-awq-bs128 \
             --gpt_attention_plugin float16 \
             --max_batch_size 32 \
             --max_input_len 4096 \
             --max_output_len 4096 \
             --context_fmha enable \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --multi_block_mode enable \
             --use_paged_context_fmha enable \
             --tokens_per_block 64 \
             --workers 4 \
             --gemm_plugin auto

run.py

mpirun -n 4 --allow-run-as-root --oversubscribe python3 ../run.py --engine_dir ./llama3-70b-awq-bs128 --tokenizer_dir /code/tensorrt_llm/models--casperhansen--llama-3-70b-fp16/snapshots/c8647dcc2296eb8d763645645ebda784da16141a --max_output_len 20 --input_text "I lovefrench quiche"

Output.

Input [Text 0]: "I lovefrench quiche"
Output [Text 0 Beam 0]: " and this one looks so delicious. I love the addition of the spinach and the cheese. I am"

@nv-guomingz
Collaborator

Hi @vip-china, do you still have any further issues or questions? If not, we'll close this soon.
