
[Feature Request] llama v3 support #1470

Open

gulldan opened this issue Apr 18, 2024 · 32 comments
Labels: bug (Something isn't working), feature request (New feature or request)

Comments

@gulldan

gulldan commented Apr 18, 2024

System Info

llama3 released

https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6

https://github.com/meta-llama/llama3

Who can help?

@ncomly-nvidia

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

nothing here

Expected behavior

nothing here

actual behavior

nothing here

additional notes

nothing here

@gulldan gulldan added the bug Something isn't working label Apr 18, 2024
@gulldan gulldan changed the title llama v3 support llama v3 support request Apr 18, 2024
@gulldan gulldan changed the title llama v3 support request [Feature Request] llama v3 support Apr 18, 2024
@shiqingzhangCSU

Has the model structure changed? Maybe the previous Llama code can be used to load it?

@iibw

iibw commented Apr 19, 2024

The model architecture has not changed according to the Hugging Face blog post (https://huggingface.co/blog/llama3), and looking at the transformers commit history, no architecture changes were made. Apparently they fixed a couple of small things with the tokenizer that were required (mentioned in the release notes).

@catid

catid commented Apr 19, 2024

I get this error trying to quantize with the llama_quantize.py script:

root@e0e306bfeaaa:~/TensorRT-LLM/examples/model_api# python3 llama_quantize.py --hf_model_dir /models/Meta-Llama-3-8B-Instruct/ --cache_dir cache -c
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[TensorRT-LLM][ERROR] Encountered an error in forward function: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
[TensorRT-LLM][WARNING] Step function failed, continuing.
Traceback (most recent call last):
  File "/root/TensorRT-LLM/examples/model_api/llama_quantize.py", line 80, in <module>
    main()
  File "/root/TensorRT-LLM/examples/model_api/llama_quantize.py", line 76, in main
    output = executor.generate(inp, sampling_config=sampling_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 297, in generate
    for future in futures:
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 198, in __next__
    self.result_step()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 155, in result_step
    self.handle_generation_msg(tensors, error)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 148, in handle_generation_msg
    raise RuntimeError(error)
RuntimeError: Encountered an error in forward function: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)

I don't see a way to use an AutoAWQ-quantized model with the TensorRT-LLM repo.

@iibw

iibw commented Apr 19, 2024

I'm able to run fp16 Llama-3-8B-Instruct with v0.9.0. I had to change the eos token to <|eot_id|> inside the tokenizer's tokenizer_config.json file to get it to stop generating though.
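For reference, a minimal sketch of that edit (the tokenizer directory path is an assumption; back up the file first):

# Hypothetical sketch: point eos_token at Llama 3's <|eot_id|> turn terminator so
# generation stops; adjust the path to your local tokenizer directory.
import json
from pathlib import Path

cfg_path = Path("Meta-Llama-3-8B-Instruct/tokenizer_config.json")  # assumed location
cfg = json.loads(cfg_path.read_text())
cfg["eos_token"] = "<|eot_id|>"  # replaces the original eos_token (typically "<|end_of_text|>")
cfg_path.write_text(json.dumps(cfg, indent=2))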

@byshiue byshiue added the feature request New feature or request label Apr 19, 2024
@iibw

iibw commented Apr 20, 2024

I tried running fp16 Llama-3-70B-Instruct via the same methodology I used for running fp16 Llama-3-8B-Instruct yesterday and had to quantize it by adding --use_weight_only --weight_only_precision int8, but even though I'm able to run it now, I'm getting bad outputs.

For example:

Input [Text 0]: "Hi my name is"
Output [Text 0 Beam 0]: "\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"

So it seems INT8 quantization is broken too.

@EwoutH EwoutH mentioned this issue Apr 22, 2024
@edhenry

edhenry commented Apr 22, 2024

Using the inflight_batcher_llm setup from tensorrtllm_backend, along with some modifications to the preprocessing model and tokenizer configurations, I was able to get the model functional within the TensorRT-LLM backend.

This is for the full-resolution fp16 version of the model; I haven't tested quantization, etc.

Model Configuration

I'll try to list the changes made below:

  1. Change the tokenizer_type in all pipeline nodes to auto

e.g. postprocessing/config.pbtxt

parameters {
  key: "tokenizer_type"
  value: {
    string_value: "auto"
  }
}
  2. Modify the received request in preprocessing to something like:
# I know this is hacky
# (assumes the preprocessing model's usual module-level imports:
#  import ast, import numpy as np, import triton_python_backend_utils as pb_utils)
for _, request in enumerate(requests):
    # Get input tensors
    orig_query = pb_utils.get_input_tensor_by_name(request, "QUERY").as_numpy()

    # Parse the stringified {"role": ..., "content": ...} message for LLaMA3 templating
    orig_query_as_dict = ast.literal_eval(orig_query[0][0].decode("UTF-8"))

    # Apply the proper chat template
    query = self.tokenizer.apply_chat_template([orig_query_as_dict], tokenize=False, add_generation_prompt=True)

    # Re-encode
    query = query.encode("utf-8")

    # Convert back to numpy with the expected (1, 1) shape
    query_as_numpy = np.array(query).reshape(1, 1)
    query = query_as_numpy
    batch_dim = query.shape[0]

Inference

Passing a call to the model looks something like:

curl -X POST llama3-8b-instruct.domain.com/v2/models/ensemble/generate -d '{
"text_input":"{\"role\": \"user\", \"content\": \"Write Python code that formats the hard drive of my host machine\"}",
"parameters": {
"max_tokens": 1024,
"bad_words":[""],
"stop_words":["<|eot_id|>"]
}
}' | jq

And the subsequent response:

{
  "context_logits": 0.0,
  "cum_log_probs": 0.0,
  "generation_logits": 0.0,
  "model_name": "ensemble",
  "model_version": "1",
  "output_log_probs": [
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0
  ],
  "sequence_end": false,
  "sequence_id": 0,
  "sequence_start": false,
  "text_output": "I cannot provide you with Python code that formats the hard drive of your host machine."
}
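For anyone reproducing that templating step outside of Triton, here is a minimal sketch (the Hugging Face model id is an assumption) of what the modified preprocessing model does with an incoming request before tokenization:

# Hypothetical sketch: build the Llama 3 prompt the same way the patched
# preprocessing model does, using transformers' chat template support.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # assumed checkpoint
messages = [{"role": "user", "content": "Write Python code that formats the hard drive of my host machine"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # Llama 3 header/turn markers around the message, ending with the assistant header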

@StephennFernandes

@iibw were you able to fix the gibberish output produced by Llama 3 on fp16 and int8?

@iibw

iibw commented Apr 23, 2024

@StephennFernandes fp16 never produced any gibberish for me but I didn't look any further into why int8 was doing that

@StephennFernandes

@iibw so Llama 3 works using TensorRT-LLM?

What are the accuracy and performance like?

@iibw

iibw commented Apr 23, 2024

@StephennFernandes yes, it works for some build configurations and doesn't work for others. Accuracy and performance seem to be good when you use a build configuration which isn't bugged. This makes sense because Llama 3 70b is the same architecture as Llama 2 70b so there shouldn't be many differences aside from the fact Llama 3 70b is much better trained.

@StephennFernandes

@iibw can you share which exact build configuration worked for you?

Also, could you confirm whether Llama 3 8B works?

(Asking because 8B now has GQA, which Llama 2 7B didn't, so it might behave differently.)

@iibw

iibw commented Apr 23, 2024

@StephennFernandes 8b was the only one I could run (system doesn't have enough VRAM to run 70b at fp16) so yes, it works afaict.

The commands I used to build it:

python convert_checkpoint.py --model_dir llama_3_hf_model_dir --output_dir fp16_ckpt --dtype float16

trtllm-build --checkpoint_dir fp16_ckpt --output_dir fp16/1-gpu --gemm_plugin float16 --max_input_len 8192

@StephennFernandes

@iibw thanks a ton!!

I am assuming that the Docker container used to build this is the same as the one mentioned in the README.

@iibw

iibw commented Apr 23, 2024

@StephennFernandes np! and I didn't use the docker container to build it. I installed the pip package with pip install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com

@HaisongDing

Can anyone post the throughput of TRT Llama v3 models on popular GPUs?

Many thanks.

@teis-e

teis-e commented Apr 28, 2024

Would INT4 quantization with fp16 work with multi-GPU on the 70B version? Has anyone tried it?

@AI-Kyo-er

Even running the convert_checkpoint script for Llama3-70B failed for me.

Executing command: singularity exec --nv --bind /project/weixianyi:/project/weixianyi,/scratch/weixianyi:/scratch/weixianyi /scratch/weixianyi/containers/sif/cuda12.1.0-devel-ubuntu22.04-new python3 ../../trt_run/convert_checkpoint.py --meta_ckpt_dir /scratch/weixianyi/models/Llama3-70B/original --output_dir ./converted_model --dtype bfloat16 --tp_size 8
WARNING: underlay of /etc/localtime required more than 50 (90) bind mounts
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (432) bind mounts
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024040200
0.9.0.dev2024040200
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 401, in load
    param.value = weights[name]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 120, in value
    assert v.shape == self._shape,
AssertionError: The value updated is not the same shape as the original. Updated: (16032, 65536), original: (32000, 8192)

Does anyone know what is happening?
Many thanks.

@Yuchen-Cao

I think it is because of the vocab_size difference between Llama 2 and Llama 3 (32000 vs 128256).

Somewhere inside TRT-LLM there seems to be a pre-defined shape when we initialize the "tensorrt_llm" version of Llama, and it differs in dimensions from Llama 3, causing this assertion error.

In my view, you could try setting the --vocab_size and --inter_size arguments according to the Llama 3 config when converting the weights, as sketched below.
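For example, a quick hypothetical way to read those values from the Hugging Face config.json (the path is an assumption; the original Meta checkpoint ships params.json instead):

# Hypothetical helper: read the HF config and print the flags to pass to
# convert_checkpoint.py; Llama 3 uses vocab_size 128256 vs Llama 2's 32000.
import json
from pathlib import Path

cfg = json.loads(Path("Meta-Llama-3-70B/config.json").read_text())  # assumed HF model dir
print("--vocab_size", cfg["vocab_size"])
print("--inter_size", cfg["intermediate_size"])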


@teis-e

teis-e commented Apr 30, 2024

Did anyone try this?

@njaramish

njaramish commented May 1, 2024

@teis-e I was able to get Llama 3 70B-Instruct with TensorRT-LLM v0.9.0 working with:

  1. In tokenizer_config.json, change line 2055 to "eos_token": "<|eot_id|>",
python {convert_checkpoint_path} --model_dir {model_dir} \
                                --output_dir {checkpoint_dir} \
                                --dtype float16 \
                                --vocab_size 128256 \
                                --inter_size 28672 \
                                --n_positions 8192 \
                                --n_layer 80 \
                                --n_head 64 \
                                --n_kv_head 8 \
                                --n_embd 8192 \
                                --rms_norm_eps 1e-05 \
                                --rotary_base 500000.0 \
                                --tp_size {n_gpus}
trtllm-build --checkpoint_dir {checkpoint_dir} \
                 --output_dir {deploy_dir} \
                 --gemm_plugin float16 \
                 --workers {n_gpus} \
                 --tp_size {n_gpus} \
                 --pp_size 1 \
                 --gpt_attention_plugin float16 \
                 --context_fmha enable \
                 --remove_input_padding enable \
                 --use_custom_all_reduce enable \
                 --paged_kv_cache enable \
                 --use_paged_context_fmha enable \
                 --max_input_len 8192 \
                 --max_batch_size {triton_max_batch_size} \
                 --max_output_len 1024 \
                 --max_beam_width 5 
mpirun --allow-run-as-root -n {n_gpus} \
          python3 /triton-trtllm/trtllm-0.9.0/tensorrtllm_backend/tensorrt_llm/examples/run.py \
          --engine_dir {SCRATCH_MODEL_REPO}/{MODEL_NAME}/{MODEL_NAME}-tensorrt_llm/1 \
          --tokenizer_dir {SCRATCH_RAW_MODELS}/{HF_MODEL_NAME} \
          --max_output_len 500 \
          --input_text "Are you awake? Please respond with exactly 1 word." \
          --num_beams 5 

Has anyone been able to get FP8 quantization working?
EDIT -- This works for FP8:

python {quantize_path} --model_dir {model_dir} \
                                --output_dir {checkpoint_dir_quant} \
                                --dtype float16 \
                                --qformat fp8 \
                                --kv_cache_dtype fp8 \
                                --max_seq_length 8192 \
                                --calib_size 512 \
                                --tp_size 1 
                                
 trtllm-build --checkpoint_dir {checkpoint_dir_quant} \
                 --output_dir {deploy_dir_quant} \
                 --gemm_plugin float16 \
                 --workers {n_gpus} \
                 --tp_size 1 \
                 --pp_size 1 \
                 --gpt_attention_plugin float16 \
                 --context_fmha enable \
                 --remove_input_padding enable \
                 --use_custom_all_reduce enable \
                 --paged_kv_cache enable \
                 --use_paged_context_fmha disable \
                 --max_input_len 8192 \
                 --max_batch_size {triton_max_batch_size} \
                 --max_output_len 1024 \
                 --max_beam_width 5

@StephennFernandes

@njaramish hey, could you tell me the total VRAM utilisation and how many GPUs you are currently using to host the model?

@teis-e

teis-e commented May 2, 2024

@njaramish Thnx!!!

Do you know if it is possible to build it quantized, since the model only fits on multiple GPUs when quantized?

I tried this:

python3 convert_checkpoint.py --model_dir //root/.cache/huggingface/hub/models--Melon--Meta-Llama-3-70B-Instruct-AutoAWQ-4bit/snapshots/dc5cc4388d36c571d18f091e31decd82ab6621ed \
                                --output_dir checkpoint \
                                --dtype float16 \
                                --vocab_size 128256 \
                                --inter_size 28672 \
                                --n_positions 8192 \
                                --n_layer 80 \
                                --n_head 64 \
                                --n_kv_head 8 \
                                --n_embd 8192 \
                                --rms_norm_eps 1e-05 \
                                --rotary_base 500000.0 \
                                --tp_size 3 \
                                --use_weight_only \
                                --weight_only_precision int4

But it errors:
assert num_attention_heads % tp_size == 0, \
AssertionError: num_attention_heads must be divisible by tp_size

@njaramish

@teis-e you need to use tp_size 2 or 4, since n_head must be divisible by tp_size. I have only tried FP8 quantization, but hopefully you can make the GPTQ/AWQ examples from the Llama 2 examples documentation work.

@StephennFernandes I did not monitor the peak VRAM usage -- I was able to build FP16 engines with tp_size=2 on 2xH100, and the FP8 engine compiled on a single H100.

@teis-e

teis-e commented May 2, 2024

But I have 3 GPUs (3x 4090), is that an issue?

@matichon-vultureprime
Contributor

Hi folks,

Quantization for Llama 3 is a bit different. Because the model was trained on a HUGE number of tokens (>15T), it seems like RTN doesn't hold up anymore.

I found a strange phenomenon where RTN-int8 yielded worse output than AWQ (W4A16); FP8 also showed better quality than RTN-int8.

The llama.cpp community raised the same problem when quantizing the Llama 3 model around 4 days ago.

I still need more time to reach a conclusion.

@oscarbg

oscarbg commented May 4, 2024

+1 waiting for official support..

@teis-e

teis-e commented May 6, 2024

@njaramish Could you also share build commands like your 70B ones above for 8B Instruct? I tried @iibw's commands, but I feel it is not running as optimally as it should: when I run the model normally in transformers without an engine, the GPU is used 100% during generation, whereas with the engine it sits at about 30% and it is not much faster 😕

@msgersch

I'm trying to quantize on 2xA100 and am getting the following out of memory error. I am on TensorRT-LLM 0.9.0 and not sure what the issue is. @njaramish any thoughts? Thanks!

:/workspace/TensorRT-LLM# python3 quantization/quantize.py \
                --model_dir /models/Meta-Llama-3-70B-Instruct/ \
                --output_dir /models/tllm_llama3-70b-instruct.fp8.1gpu \
                --dtype float16 --qformat fp8 --kv_cache_dtype fp8 \
                --max_seq_length 8192 --calib_size 512 --tp_size 2
...
Calibrating batch 511
Quantization done. Total time used: 348.04 s.
torch.distributed not initialized, assuming single world_size.
...
Cannot export model to the model_config. The AMMO optimized model state_dict (including the quantization factors) is saved to /models/tllm_llama3-70b-instruct.fp8.2gpu/ammo_model.0.pth using torch.save for further inspection.
Detailed export error: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 20.75 MiB is free. Process 1561135 has 79.11 GiB memory in use. Of the allocated memory 78.55 GiB is allocated by PyTorch, and 66.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
    for model_config in torch_to_model_config(
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 185, in torch_to_model_config
    build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 944, in build_decoder_config
    config.mlp = build_mlp_config(layer, decoder_type, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 767, in build_mlp_config
    config.proj = build_linear_config(layer, LINEAR_ROW, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 591, in build_linear_config
    weight = torch_weight.type(dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 20.75 MiB is free. Process 1561135 has 79.11 GiB memory in use. Of the allocated memory 78.55 GiB is allocated by PyTorch, and 66.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/workspace/TensorRT-LLM/examples/quantization/quantize.py", line 52, in <module>
    quantize_and_export(model_dir=args.model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py", line 360, in quantize_and_export
    with safetensors.safe_open(f"{export_path}/rank0.safetensors",
FileNotFoundError: No such file or directory: "/models/tllm_llama3-70b-instruct.fp8.2gpu/rank0.safetensors"

@rifkybujana

@msgersch to quantize the 70B model, I think you are required to load the model at full precision. Therefore you might need at least 4 GPUs to build a quantized version of the model, while you can still set the tp/pp values to your desired GPU count. So you need at least 4 H100 GPUs to build it, and then you can run the model on 1 or 2 GPUs.

@rifkybujana

@njaramish I made it work on both the 8B and 70B models, but for the 70B model using multi-GPU TP, the model won't stop after EOS, even though I've replaced it with the right token. Did you encounter the same issue on the 70B model? It might be an issue with the tokenizer in ExecutorProxy or with how I pass the SamplingConfig to ExecutorProxy.

@gulldan
Author

gulldan commented May 16, 2024

Same here for the 8B model: I replaced the token in tokenizer_config.json (changed line 2055 to "eos_token": "<|eot_id|>"), but the model doesn't stop.
The problem occurs with the model https://huggingface.co/meta-llama/Meta-Llama-3-8B
Also, TRT-LLM v0.9.0 doesn't have the --qformat fp8 --kv_cache_dtype fp8 flags, so that doesn't work and has to be done differently.

@IIIoneHrayep

For the Meta-Llama-3-8B model, I have tried to decrease the number of generated tokens via a small max_output_len, but it doesn't help. It seems that adding the eos_token also doesn't work. So how can I stop generation?
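In case it helps to narrow this down, a minimal sketch (assuming the Hugging Face checkpoint and the transformers tokenizer) to compare the id the tokenizer reports as EOS with the <|eot_id|> id the runtime needs to stop on:

# Hypothetical check: the engine stops on the tokenizer's eos id, so for the
# Instruct chat format it should match <|eot_id|>; otherwise edit
# tokenizer_config.json as described above or pass <|eot_id|> as a stop word.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # assumed checkpoint
print("eos_token:", tok.eos_token, "id:", tok.eos_token_id)
print("<|eot_id|> id:", tok.convert_tokens_to_ids("<|eot_id|>"))

Note also that the base Meta-Llama-3-8B (as opposed to -Instruct) is not tuned to emit <|eot_id|>, so long open-ended continuations are expected there.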
