
[Bug Report] llama v3 70B int4 reasoning abnormal #1638

Closed
3 of 4 tasks
vip-china opened this issue May 21, 2024 · 6 comments
Labels
bug Something isn't working stale

Comments

@vip-china

System Info

GPU name (NVIDIA A6000)
TensorRT-LLM tag (v0.9.0 main)
transformers version (0.41.0)

Who can help?

@nc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

1. quantization
python3 convert_checkpoint.py --model_dir ./dolphin-2.9-llama3-70b \
    --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4 \
    --tp_size 4 \
    --pp_size 1

1.1 or
python ../quantization/quantize.py --model_dir ./dolphin-2.9-llama3-70b \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 \
    --calib_size 32 \
    --tp_size 4
2. build
trtllm-build --checkpoint_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-0521-tp4 --output_dir ./dolphin-2.9-llama3-70b-new-ljf-int4-by --gemm_plugin float16 --max_batch_size 8 --use_custom_all_reduce disable --max_input_len 8192 --max_output_len 4096

3. inference
python3 run.py --engine_dir ./llama/dolphin-2.9-llama3-70b-new-ljf-int4-by --tokenizer_dir /tensorrtllm_backend/TensorRT-LLM/examples/llama/dolphin-2.9-llama3-70b --max_output_len 20 --input_text "I lovefrench quiche"

Expected behavior

A correct answer is expected.

actual behavior

When quantized to 4 bits, the answer becomes garbled:
[screenshots showing garbled int4 output]

With FP16 precision the output is not garbled, but generation does not stop.

additional notes

prompt:
<|im_start|>system
You are gpt, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
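
For reference, a minimal sketch in Python (assuming the Hugging Face transformers AutoTokenizer API, and assuming <|im_end|> is registered as a special token in this tokenizer) of how the ChatML template above is assembled and how the <|im_end|> token id can be looked up, e.g. to pass it as an end/stop token so that the FP16 run stops instead of generating until max_output_len:

from transformers import AutoTokenizer

MODEL_DIR = "./dolphin-2.9-llama3-70b"  # model path from the reproduction steps above

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

def build_chatml_prompt(user_prompt: str) -> str:
    # Same ChatML template as shown in the additional notes.
    return (
        "<|im_start|>system\n"
        "You are gpt, a helpful AI assistant.<|im_end|>\n"
        "<|im_start|>user\n"
        f"{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("I lovefrench quiche")
# Assumption: <|im_end|> exists in this tokenizer's vocabulary as a special token.
im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
print(prompt)
print("eos token id:", tokenizer.eos_token_id, "| <|im_end|> id:", im_end_id)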

vip-china added the bug label on May 21, 2024
@byshiue
Collaborator

byshiue commented May 23, 2024

Do you encounter the same issue on LLaMA 2-70B?

@vip-china
Author

Do you encounter the same issue on LLaMA 2-70B?

The current test is llama3; llama-2-70B has not been tested yet. Is this related to int4/awq? FP16 is normal.

@byshiue
Collaborator

byshiue commented May 27, 2024

We don't observe this issue on llama-2-70B with int4-awq, and we don't have a llama-3-70B checkpoint right now, so we hope to use a baseline model as a reference to help with reproducing. Could you give llama-2-70B a try?

@smehta2000

I can confirm that I'm seeing the same behavior as @vip-china with the llama-3-70B checkpoint.

@matichon-vultureprime
Contributor

I think it makes sense.
This is my script with vanilla Llama3-70b.

I already mentioned this in issue #1470 and my comment there.

Quantization for Llama3 is a bit different. Since the model was trained on a huge number of tokens (>15T), it seems like RTN no longer works.

So, I found a strange phenomenon where RTN-int8 yielded worse output than AWQ (W4A16); FP8 also performed better than RTN-int8.

The llama.cpp community also raised the same problem when quantizing the Llama3 model around 4 days ago.
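
To illustrate why, here is a minimal sketch (plain PyTorch, an illustration only, not TensorRT-LLM code) of naive per-channel round-to-nearest weight-only quantization (int4 shown here): a few outlier channels inflate the absmax scale, so the remaining weights collapse onto only a handful of the 16 int4 levels, which is the kind of error AWQ's activation-aware per-channel rescaling is designed to reduce.

import torch

def rtn_int4_quantize(w: torch.Tensor) -> torch.Tensor:
    # Per-output-channel absmax scale mapped onto the symmetric int4 range [-8, 7].
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), min=-8, max=7)
    return q * scale  # dequantized (fake-quantized) weights

torch.manual_seed(0)
w = torch.randn(1024, 1024) * 0.02
w[:, :4] *= 50.0  # a few outlier columns inflate every row's absmax scale
w_q = rtn_int4_quantize(w)
rel_err = (w - w_q).norm() / w.norm()
print(f"relative RTN int4 error: {rel_err:.4f}")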

FYI, this is my script with the vanilla model, and it is public on my Hugging Face as well (Huggingface link).

TRT-LLM 0.11 Main branch. f430a4b
DGX-H100

AWQ script.

python  ../quantization/quantize.py --model_dir /root/.cache/huggingface/hub/models--casperhansen--llama-3-70b-fp16/snapshots/c8647dcc2296eb8d763645645ebda784da16141a \
                                         --dtype float16 \
                                         --qformat int4_awq \
                                         --awq_block_size 64 \
                                         --output_dir ./quantized-llama3-70b-awq-w4a16-gs64 \
                                         --batch_size 32 \
                                         --tp_size 4 \
                                         --calib_size 512

trtllm-build script.

trtllm-build --checkpoint_dir ./quantized-llama3-70b-awq-w4a16-gs64 \
             --output_dir ./llama3-70b-awq-bs128 \
             --gpt_attention_plugin float16 \
             --max_batch_size 32 \
             --max_input_len 4096 \
             --max_output_len 4096 \
             --context_fmha enable \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --multi_block_mode enable \
             --use_paged_context_fmha enable \
             --tokens_per_block 64 \
             --workers 4 \
             --gemm_plugin auto

run.py

mpirun -n 4 --allow-run-as-root --oversubscribe python3 ../run.py --engine_dir ./llama3-70b-awq-bs128 --tokenizer_dir /code/tensorrt_llm/models--casperhansen--llama-3-70b-fp16/snapshots/c8647dcc2296eb8d763645645ebda784da16141a --max_output_len 20 --input_text "I lovefrench quiche"

Output.

Input [Text 0]: "I lovefrench quiche"
Output [Text 0 Beam 0]: " and this one looks so delicious. I love the addition of the spinach and the cheese. I am"

@nv-guomingz
Collaborator

Hi @vip-china, do you still have any further issues or questions? If not, we'll close this soon.
