
[Feature Request] llama v3 support #1470

Open

gulldan opened this issue Apr 18, 2024 · 32 comments
Labels: bug (Something isn't working), feature request (New feature or request)

Comments

@gulldan

gulldan commented Apr 18, 2024

System Info

llama3 released

https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6

https://github.com/meta-llama/llama3

Who can help?

@ncomly-nvidia

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

nothing here

Expected behavior

nothing here

actual behavior

nothing here

additional notes

nothing here

@gulldan gulldan added the bug Something isn't working label Apr 18, 2024
@gulldan gulldan changed the title llama v3 support llama v3 support request Apr 18, 2024
@gulldan gulldan changed the title llama v3 support request [Feature Request] llama v3 support Apr 18, 2024
@shiqingzhangCSU

Has the model structure changed? Maybe the previous Llama code can be used to load it?

@iibw

iibw commented Apr 19, 2024

The model architecture has not changed according to the Hugging Face blog post (https://huggingface.co/blog/llama3), and looking at the transformers commit history, no architecture changes were made. Apparently they fixed a couple of small things with the tokenizer that were required (mentioned in the release notes).

@catid

catid commented Apr 19, 2024

I get this error trying to quantize with the llama_quantize.py script:

root@e0e306bfeaaa:~/TensorRT-LLM/examples/model_api# python3 llama_quantize.py --hf_model_dir /models/Meta-Llama-3-8B-Instruct/ --cache_dir cache -c
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[TensorRT-LLM][ERROR] Encountered an error in forward function: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
[TensorRT-LLM][WARNING] Step function failed, continuing.
Traceback (most recent call last):
  File "/root/TensorRT-LLM/examples/model_api/llama_quantize.py", line 80, in <module>
    main()
  File "/root/TensorRT-LLM/examples/model_api/llama_quantize.py", line 76, in main
    output = executor.generate(inp, sampling_config=sampling_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 297, in generate
    for future in futures:
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 198, in __next__
    self.result_step()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 155, in result_step
    self.handle_generation_msg(tensors, error)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 148, in handle_generation_msg
    raise RuntimeError(error)
RuntimeError: Encountered an error in forward function: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)

I don't see a way to use an AutoAWQ-quantized model with the TensorRT-LLM repo.

@iibw

iibw commented Apr 19, 2024

I'm able to run fp16 Llama-3-8B-Instruct with v0.9.0. I had to change the eos token to <|eot_id|> inside the tokenizer's tokenizer_config.json file to get it to stop generating though.
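For reference, a minimal sketch of that edit (the tokenizer directory path is an assumption; back up the file first):

# Hypothetical sketch: point eos_token at Llama 3's <|eot_id|> turn terminator so
# generation stops; adjust the path to your local tokenizer directory.
import json
from pathlib import Path

cfg_path = Path("Meta-Llama-3-8B-Instruct/tokenizer_config.json")  # assumed location
cfg = json.loads(cfg_path.read_text())
cfg["eos_token"] = "<|eot_id|>"  # replaces the original eos_token (typically "<|end_of_text|>")
cfg_path.write_text(json.dumps(cfg, indent=2))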

@byshiue byshiue added the feature request New feature or request label Apr 19, 2024
@iibw

iibw commented Apr 20, 2024

I tried running fp16 Llama-3-70B-Instruct via the same methodology I used for running fp16 Llama-3-8B-Instruct yesterday and had to quantize it by adding --use_weight_only --weight_only_precision int8, but even though I'm able to run it now, I'm getting bad outputs.

For example:

Input [Text 0]: "Hi my name is"
Output [Text 0 Beam 0]: "\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"

So it seems INT8 quantization is broken too.

@EwoutH EwoutH mentioned this issue Apr 22, 2024
@edhenry

edhenry commented Apr 22, 2024

Using the inflight_batcher_llm setup from tensorrtllm_backend, along with some modifications to the preprocessing model and tokenizer configurations, I was able to get the model functional within the TensorRT-LLM backend.

This is for the full-resolution fp16 version of the model; I haven't tested quantization, etc.

Model Configuration

I'll try to list the changes made below:

  1. Change the tokenizer_type in all pipeline nodes to auto

e.g. postprocessing/config.pbtxt

parameters {
  key: "tokenizer_type"
  value: {
    string_value: "auto"
  }
}
  2. Modify the received request in preprocessing to something like:
# I know this is hacky
# (assumes the preprocessing model's usual module-level imports:
#  import ast, import numpy as np, import triton_python_backend_utils as pb_utils)
for _, request in enumerate(requests):
    # Get input tensors
    orig_query = pb_utils.get_input_tensor_by_name(request, "QUERY").as_numpy()

    # Parse the stringified {"role": ..., "content": ...} message for LLaMA3 templating
    orig_query_as_dict = ast.literal_eval(orig_query[0][0].decode("UTF-8"))

    # Apply the proper chat template
    query = self.tokenizer.apply_chat_template([orig_query_as_dict], tokenize=False, add_generation_prompt=True)

    # Re-encode
    query = query.encode("utf-8")

    # Convert back to numpy with the expected (1, 1) shape
    query_as_numpy = np.array(query).reshape(1, 1)
    query = query_as_numpy
    batch_dim = query.shape[0]

Inference

Passing a call to the model looks something like:

curl -X POST llama3-8b-instruct.domain.com/v2/models/ensemble/generate -d '{
"text_input":"{\"role\": \"user\", \"content\": \"Write Python code that formats the hard drive of my host machine\"}",
"parameters": {
"max_tokens": 1024,
"bad_words":[""],
"stop_words":["<|eot_id|>"]
}
}' | jq

And the subsequent response:

{
  "context_logits": 0.0,
  "cum_log_probs": 0.0,
  "generation_logits": 0.0,
  "model_name": "ensemble",
  "model_version": "1",
  "output_log_probs": [
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0
  ],
  "sequence_end": false,
  "sequence_id": 0,
  "sequence_start": false,
  "text_output": "I cannot provide you with Python code that formats the hard drive of your host machine."
}
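For anyone reproducing that templating step outside of Triton, here is a minimal sketch (the Hugging Face model id is an assumption) of what the modified preprocessing model does with an incoming request before tokenization:

# Hypothetical sketch: build the Llama 3 prompt the same way the patched
# preprocessing model does, using transformers' chat template support.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # assumed checkpoint
messages = [{"role": "user", "content": "Write Python code that formats the hard drive of my host machine"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # Llama 3 header/turn markers around the message, ending with the assistant header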

@StephennFernandes

@iibw were you able to fix the gibberish output produced by Llama 3 on fp16 and int8?

@iibw

iibw commented Apr 23, 2024

@StephennFernandes fp16 never produced any gibberish for me but I didn't look any further into why int8 was doing that

@StephennFernandes

@iibw so Llama 3 works using TensorRT-LLM?

What are the accuracy and performance like?

@iibw

iibw commented Apr 23, 2024

@StephennFernandes yes, it works for some build configurations and doesn't work for others. Accuracy and performance seem to be good when you use a build configuration which isn't bugged. This makes sense because Llama 3 70b is the same architecture as Llama 2 70b so there shouldn't be many differences aside from the fact Llama 3 70b is much better trained.

@StephennFernandes

@iibw can you share which exact build configuration worked for you?

Also, could you confirm whether Llama 3 8B works?

(Asking because 8B now has GQA, which Llama 2 7B didn't, so it might behave differently.)

@iibw

iibw commented Apr 23, 2024

@StephennFernandes 8b was the only one I could run (system doesn't have enough VRAM to run 70b at fp16) so yes, it works afaict.

The commands I used to build it:

python convert_checkpoint.py --model_dir llama_3_hf_model_dir --output_dir fp16_ckpt --dtype float16

trtllm-build --checkpoint_dir fp16_ckpt --output_dir fp16/1-gpu --gemm_plugin float16 --max_input_len 8192

@StephennFernandes

@iibw thanks a ton!!

I am assuming that the Docker container used to build this is the same as the one mentioned in the README.

@iibw

iibw commented Apr 23, 2024

@StephennFernandes np! and I didn't use the docker container to build it. I installed the pip package with pip install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com

@HaisongDing

Can anyone post the throughput of TRT Llama v3 models on popular GPUs?

Many thanks.

@teis-e

teis-e commented Apr 28, 2024

Would INT4 quantization with fp16 work with multi-GPU on the 70B version? Has anyone tried it?

@AI-Kyo-er

Even running the convert_checkpoint script for Llama3-70B failed for me.

Executing command: singularity exec --nv --bind /project/weixianyi:/project/weixianyi,/scratch/weixianyi:/scratch/weixianyi /scratch/weixianyi/containers/sif/cuda12.1.0-devel-ubuntu22.04-new python3 ../../trt_run/convert_checkpoint.py --meta_ckpt_dir /scratch/weixianyi/models/Llama3-70B/original --output_dir ./converted_model --dtype bfloat16 --tp_size 8
WARNING: underlay of /etc/localtime required more than 50 (90) bind mounts
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (432) bind mounts
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024040200
0.9.0.dev2024040200
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 401, in load
    param.value = weights[name]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 120, in value
    assert v.shape == self._shape,
AssertionError: The value updated is not the same shape as the original. Updated: (16032, 65536), original: (32000, 8192)

Does anyone know what is happening?
Many thanks.

@Yuchen-Cao

I think it is because of the vocab_size difference between Llama 2 and Llama 3 (32000 vs 128256).

Somewhere inside TRT-LLM there seems to be a pre-defined shape when we initialize the "tensorrt_llm" version of Llama, and it differs in dimensions from Llama 3, causing this assertion error.

In my view, you could try setting the --vocab_size and --inter_size arguments according to the Llama 3 config when converting the weights, as sketched below.
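For example, a quick hypothetical way to read those values from the Hugging Face config.json (the path is an assumption; the original Meta checkpoint ships params.json instead):

# Hypothetical helper: read the HF config and print the flags to pass to
# convert_checkpoint.py; Llama 3 uses vocab_size 128256 vs Llama 2's 32000.
import json
from pathlib import Path

cfg = json.loads(Path("Meta-Llama-3-70B/config.json").read_text())  # assumed HF model dir
print("--vocab_size", cfg["vocab_size"])
print("--inter_size", cfg["intermediate_size"])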


@teis-e

teis-e commented Apr 30, 2024

Did anyone try this?

@njaramish

njaramish commented May 1, 2024

@teis-e I was able to get Llama 3 70B-Instruct with TensorRT-LLM v0.9.0 working with:

  1. In tokenizer_config.json, change line 2055 to "eos_token": "<|eot_id|>",
python {convert_checkpoint_path} --model_dir {model_dir} \
                                --output_dir {checkpoint_dir} \
                                --dtype float16 \
                                --vocab_size 128256 \
                                --inter_size 28672 \
                                --n_positions 8192 \
                                --n_layer 80 \
                                --n_head 64 \
                                --n_kv_head 8 \
                                --n_embd 8192 \
                                --rms_norm_eps 1e-05 \
                                --rotary_base 500000.0 \
                                --tp_size {n_gpus}
trtllm-build --checkpoint_dir {checkpoint_dir} \
                 --output_dir {deploy_dir} \
                 --gemm_plugin float16 \
                 --workers {n_gpus} \
                 --tp_size {n_gpus} \
                 --pp_size 1 \
                 --gpt_attention_plugin float16 \
                 --context_fmha enable \
                 --remove_input_padding enable \
                 --use_custom_all_reduce enable \
                 --paged_kv_cache enable \
                 --use_paged_context_fmha enable \
                 --max_input_len 8192 \
                 --max_batch_size {triton_max_batch_size} \
                 --max_output_len 1024 \
                 --max_beam_width 5 
mpirun --allow-run-as-root -n {n_gpus} \
          python3 /triton-trtllm/trtllm-0.9.0/tensorrtllm_backend/tensorrt_llm/examples/run.py \
          --engine_dir {SCRATCH_MODEL_REPO}/{MODEL_NAME}/{MODEL_NAME}-tensorrt_llm/1 \
          --tokenizer_dir {SCRATCH_RAW_MODELS}/{HF_MODEL_NAME} \
          --max_output_len 500 \
          --input_text "Are you awake? Please respond with exactly 1 word." \
          --num_beams 5 

Has anyone been able to get FP8 quantization working?
EDIT -- This works for FP8:

python {quantize_path} --model_dir {model_dir} \
                                --output_dir {checkpoint_dir_quant} \
                                --dtype float16 \
                                --qformat fp8 \
                                --kv_cache_dtype fp8 \
                                --max_seq_length 8192 \
                                --calib_size 512 \
                                --tp_size 1 
                                
 trtllm-build --checkpoint_dir {checkpoint_dir_quant} \
                 --output_dir {deploy_dir_quant} \
                 --gemm_plugin float16 \
                 --workers {n_gpus} \
                 --tp_size 1 \
                 --pp_size 1 \
                 --gpt_attention_plugin float16 \
                 --context_fmha enable \
                 --remove_input_padding enable \
                 --use_custom_all_reduce enable \
                 --paged_kv_cache enable \
                 --use_paged_context_fmha disable \
                 --max_input_len 8192 \
                 --max_batch_size {triton_max_batch_size} \
                 --max_output_len 1024 \
                 --max_beam_width 5

@StephennFernandes

@njaramish hey, could you tell me the total VRAM utilisation and how many GPUs you are currently using to host the model?

@teis-e

teis-e commented May 2, 2024

@njaramish Thnx!!!

Do you know if it is possible to build it quantized, since the model only fits on multiple GPUs when quantized?

I tried this:

python3 convert_checkpoint.py --model_dir //root/.cache/huggingface/hub/models--Melon--Meta-Llama-3-70B-Instruct-AutoAWQ-4bit/snapshots/dc5cc4388d36c571d18f091e31decd82ab6621ed \
                                --output_dir checkpoint \
                                --dtype float16 \
                                --vocab_size 128256 \
                                --inter_size 28672 \
                                --n_positions 8192 \
                                --n_layer 80 \
                                --n_head 64 \
                                --n_kv_head 8 \
                                --n_embd 8192 \
                                --rms_norm_eps 1e-05 \
                                --rotary_base 500000.0 \
                                --tp_size 3 \
                                --use_weight_only \
                                --weight_only_precision int4

But it errors:
assert num_attention_heads % tp_size == 0, \
AssertionError: num_attention_heads must be divisible by tp_size

@njaramish

@teis-e you need to use tp_size 2 or 4, since n_head must be divisible by tp_size. I have only tried FP8 quantization, but hopefully you can make the GPTQ/AWQ examples from the Llama 2 examples documentation work.

@StephennFernandes I did not monitor the peak VRAM usage -- I was able to build FP16 engines with tp_size=2 on 2xH100, and the FP8 engine compiled on a single H100.

@teis-e

teis-e commented May 2, 2024

But I have 3 GPUs (3x 4090), is that an issue?

@matichon-vultureprime
Contributor

Hi folks,

Quantization for Llama 3 is a bit different. Because the model was trained on a HUGE number of tokens (>15T), it seems like RTN doesn't hold up anymore.

I found a strange phenomenon where RTN-int8 yielded worse output than AWQ (W4A16); FP8 also showed better quality than RTN-int8.

The llama.cpp community raised the same problem when quantizing the Llama 3 model around 4 days ago.

I still need more time to reach a conclusion.

@oscarbg

oscarbg commented May 4, 2024

+1 waiting for official support..

@teis-e

teis-e commented May 6, 2024

@njaramish Could you also share build commands like your 70B ones above for 8B Instruct? I tried @iibw's commands, but I feel it is not running as optimally as it should: when I run the model normally in transformers without an engine, the GPU is used 100% during generation, whereas with the engine it sits at about 30% and it is not much faster 😕

@msgersch

I'm trying to quantize on 2xA100 and am getting the following out of memory error. I am on TensorRT-LLM 0.9.0 and not sure what the issue is. @njaramish any thoughts? Thanks!

:/workspace/TensorRT-LLM# python3 quantization/quantize.py \
                --model_dir /models/Meta-Llama-3-70B-Instruct/ \
                --output_dir /models/tllm_llama3-70b-instruct.fp8.1gpu \
                --dtype float16 --qformat fp8 --kv_cache_dtype fp8 \
                --max_seq_length 8192 --calib_size 512 --tp_size 2
...
Calibrating batch 511
Quantization done. Total time used: 348.04 s.
torch.distributed not initialized, assuming single world_size.
...
Cannot export model to the model_config. The AMMO optimized model state_dict (including the quantization factors) is saved to /models/tllm_llama3-70b-instruct.fp8.2gpu/ammo_model.0.pth using torch.save for further inspection.
Detailed export error: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 20.75 MiB is free. Process 1561135 has 79.11 GiB memory in use. Of the allocated memory 78.55 GiB is allocated by PyTorch, and 66.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
    for model_config in torch_to_model_config(
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 185, in torch_to_model_config
    build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 944, in build_decoder_config
    config.mlp = build_mlp_config(layer, decoder_type, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 767, in build_mlp_config
    config.proj = build_linear_config(layer, LINEAR_ROW, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 591, in build_linear_config
    weight = torch_weight.type(dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 20.75 MiB is free. Process 1561135 has 79.11 GiB memory in use. Of the allocated memory 78.55 GiB is allocated by PyTorch, and 66.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/workspace/TensorRT-LLM/examples/quantization/quantize.py", line 52, in <module>
    quantize_and_export(model_dir=args.model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py", line 360, in quantize_and_export
    with safetensors.safe_open(f"{export_path}/rank0.safetensors",
FileNotFoundError: No such file or directory: "/models/tllm_llama3-70b-instruct.fp8.2gpu/rank0.safetensors"

@rifkybujana

@msgersch to quantize the 70B model, I think you are required to load the model at full precision. Therefore you might need at least 4 GPUs to build a quantized version of the model, while you can still set the tp/pp values to your desired GPU count. So you need at least 4 H100 GPUs to build it, and then you can run the model on 1 or 2 GPUs.

@rifkybujana

@njaramish I made it work on both the 8B and 70B models, but for the 70B model using multi-GPU TP, the model won't stop after EOS, even though I've replaced it with the right token. Did you encounter the same issue on the 70B model? It might be an issue with the tokenizer in ExecutorProxy or with how I pass the SamplingConfig to ExecutorProxy.

@gulldan
Author

gulldan commented May 16, 2024

Same here for the 8B model: I replaced the token in tokenizer_config.json (changed line 2055 to "eos_token": "<|eot_id|>"), but the model doesn't stop.
The problem occurs with the model https://huggingface.co/meta-llama/Meta-Llama-3-8B
Also, TRT-LLM v0.9.0 doesn't have the --qformat fp8 --kv_cache_dtype fp8 flags, so that doesn't work and has to be done differently.

@IIIoneHrayep

For the Meta-Llama-3-8B model, I have tried to decrease the number of generated tokens via a small max_output_len, but it doesn't help. It seems that adding the eos_token also doesn't work. So how can I stop generation?
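In case it helps to narrow this down, a minimal sketch (assuming the Hugging Face checkpoint and the transformers tokenizer) to compare the id the tokenizer reports as EOS with the <|eot_id|> id the runtime needs to stop on:

# Hypothetical check: the engine stops on the tokenizer's eos id, so for the
# Instruct chat format it should match <|eot_id|>; otherwise edit
# tokenizer_config.json as described above or pass <|eot_id|> as a stop word.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # assumed checkpoint
print("eos_token:", tok.eos_token, "id:", tok.eos_token_id)
print("<|eot_id|> id:", tok.convert_tokens_to_ids("<|eot_id|>"))

Note also that the base Meta-Llama-3-8B (as opposed to -Instruct) is not tuned to emit <|eot_id|>, so long open-ended continuations are expected there.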
