[Feature Request] llama v3 support #1470
Comments
Has the model structure changed? Maybe we can use the previous Llama code to load it?
Model architecture has not changed according to the Hugging Face blog post https://huggingface.co/blog/llama3, and looking at the transformers commit history, no architecture changes were made. Apparently, they fixed a couple of small things with the tokenizer that were required (mentioned in the release notes).
I get this error trying to quantize with the llama_quantize.py script:
I don't see a way to use an AutoAWQ-quantized model with the TensorRT-LLM repo.
I'm able to run fp16 Llama-3-8B-Instruct with v0.9.0. I had to change the eos token to <|eot_id|>.
I tried running fp16 Llama-3-70B-Instruct via the same methodology I used for fp16 Llama-3-8B-Instruct yesterday, but had to quantize it (weight-only INT8) to make it fit, and the example outputs it produced were gibberish. So it seems INT8 quantization is broken too.
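(For reference, a weight-only INT8 conversion with the v0.9 examples/llama scripts would look roughly like the sketch below; the paths, tp_size, and sizing are hypothetical placeholders, not the exact command used above.)

python3 convert_checkpoint.py \
    --model_dir ./Meta-Llama-3-70B-Instruct \
    --output_dir ./llama3-70b-int8-ckpt \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --tp_size 2

trtllm-build \
    --checkpoint_dir ./llama3-70b-int8-ckpt \
    --output_dir ./llama3-70b-int8-engine \
    --gemm_plugin float16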
This is for the full-resolution fp16 version of the model; I've tested it without quantization, etc.

Model Configuration

I'll try and list the changes made below, e.g.:

parameters {
  key: "tokenizer_type"
  value: {
    string_value: "auto"
  }
}
# I know this is hacky
# (assumes the usual Triton Python backend imports: ast, numpy as np,
#  and triton_python_backend_utils as pb_utils)
for _, request in enumerate(requests):
    # Get input tensors
    orig_query = pb_utils.get_input_tensor_by_name(request, "QUERY").as_numpy()
    # Apply templating and formatting for LLaMA3
    orig_query_as_dict = ast.literal_eval(orig_query[0][0].decode("UTF-8"))
    # Apply the proper chat templates
    query = self.tokenizer.apply_chat_template([orig_query_as_dict], tokenize=False, add_generation_prompt=True)
    # Re-encode
    query = query.encode("utf-8")
    # Convert back to numpy
    query_as_numpy = np.array(query).reshape(1, 1)
    query = query_as_numpy
    batch_dim = query.shape[0]

Inference

Passing a call to the model looks something like:

curl -X POST llama3-8b-instruct.domain.com/v2/models/ensemble/generate -d '{
"text_input":"{\"role\": \"user\", \"content\": \"Write Python code that formats the hard drive of my host machine\"}",
"parameters": {
"max_tokens": 1024,
"bad_words":[""],
"stop_words":["<|eot_id|>"]
}
}' | jq

And the subsequent response:

{
"context_logits": 0.0,
"cum_log_probs": 0.0,
"generation_logits": 0.0,
"model_name": "ensemble",
"model_version": "1",
"output_log_probs": [
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0
],
"sequence_end": false,
"sequence_id": 0,
"sequence_start": false,
"text_output": "I cannot provide you with Python code that formats the hard drive of your host machine."
}
@iibw were you able to fix the gibberish output produced by Llama 3 on fp16 and int8?
@StephennFernandes fp16 never produced any gibberish for me, but I didn't look any further into why int8 was doing that.
@iibw so Llama 3 works using TensorRT-LLM? What are the accuracy and performance like?
@StephennFernandes yes, it works for some build configurations and doesn't work for others. Accuracy and performance seem to be good when you use a build configuration that isn't bugged. This makes sense, because Llama 3 70B is the same architecture as Llama 2 70B, so there shouldn't be many differences aside from the fact that Llama 3 70B is much better trained.
@iibw can you share which exact build configuration worked for you? Also, could you confirm whether Llama 3 8B works? (Asking because 8B now has GQA, which Llama 2 7B didn't have, so it might behave differently.)
@StephennFernandes 8B was the only one I could run (my system doesn't have enough VRAM to run 70B at fp16), so yes, it works afaict. The commands I used to build it:
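(The exact commands didn't survive in this thread. As a rough, hypothetical sketch, an fp16 Llama-3-8B-Instruct build with TensorRT-LLM v0.9.0 typically follows the examples/llama flow; the model paths and max_* sizing values below are placeholders.)

python3 convert_checkpoint.py \
    --model_dir ./Meta-Llama-3-8B-Instruct \
    --output_dir ./llama3-8b-ckpt \
    --dtype float16

trtllm-build \
    --checkpoint_dir ./llama3-8b-ckpt \
    --output_dir ./llama3-8b-engine \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 4096 \
    --max_output_len 1024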
@iibw thanks a ton!! I'm assuming the Docker container used to build this is the same as the one mentioned in the README.
@StephennFernandes np! And I didn't use the Docker container to build it; I installed the pip package instead.
Can anyone post the throughput of TRT Llama 3 models on popular GPUs? Many thanks.
Would INT4 quantization with fp16 work multi-GPU on the 70B version? Has anyone tried it?
Running the convert_checkpoint script for Llama3-70B also failed for me. Executing command:

singularity exec --nv --bind /project/weixianyi:/project/weixianyi,/scratch/weixianyi:/scratch/weixianyi /scratch/weixianyi/containers/sif/cuda12.1.0-devel-ubuntu22.04-new python3 ../../trt_run/convert_checkpoint.py --meta_ckpt_dir /scratch/weixianyi/models/Llama3-70B/original --output_dir ./converted_model --dtype bfloat16 --tp_size 8

Does anyone know what's happening?
I think it is because of the vocab_size difference between Llama 2 and Llama 3 (32000 vs 128256). Somehow inside TRT-LLM there seems to be a pre-defined shape when we initialize the "tensorrt_llm" version of Llama, and it has a dimension mismatch with Llama 3, causing this assertion error. In my view, you could try setting the --vocab_size and --inter_size arguments according to the Llama 3 config when converting the weights; see the sketch below.
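(A hypothetical version of the conversion above with those dimensions passed explicitly. The vocab_size of 128256 is mentioned above; the inter_size of 28672 is assumed here from the Llama 3 70B config; other flags follow the original command.)

python3 convert_checkpoint.py \
    --meta_ckpt_dir /scratch/weixianyi/models/Llama3-70B/original \
    --output_dir ./converted_model \
    --dtype bfloat16 \
    --tp_size 8 \
    --vocab_size 128256 \
    --inter_size 28672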
Did anyone try this?
@teis-e I was able to get Llama 3 70B-Instruct with TensorRT-LLM v0.9.0 working with:
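(The exact commands aren't captured here; below is a hypothetical sketch of that flow, assuming the same convert_checkpoint.py / trtllm-build steps as the 8B example earlier in the thread, a tensor-parallel split across 2 GPUs, and placeholder paths.)

python3 convert_checkpoint.py \
    --model_dir ./Meta-Llama-3-70B-Instruct \
    --output_dir ./llama3-70b-ckpt \
    --dtype float16 \
    --tp_size 2

trtllm-build \
    --checkpoint_dir ./llama3-70b-ckpt \
    --output_dir ./llama3-70b-engine \
    --gemm_plugin float16 \
    --workers 2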
Has anyone been able to get FP8 quantization working?
@njaramish hey, could you tell me the total VRAM utilisation and how many GPUs you are currently using to host the model?
@njaramish Thnx!!! Do you know if it is possible to build it quantized, since the model only fits on multiple GPUs when quantized? I tried this:

But it errors:
@teis-e you need to use tp_size 2 or 4, since n_head must be divisible by tp_size. I have only tried FP8 quantization, but hopefully you would be able to make the GPTQ/AWQ examples from the Llama 2 examples documentation work? @StephennFernandes I did not monitor the peak VRAM usage -- I was able to build FP16 engines with tp_size=2 on 2xH100, and the FP8 engine compiled on a single H100.
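(For the FP8 path, a hypothetical sketch using the examples/quantization script shipped with v0.9; the paths and calib_size are placeholders, not the exact commands used above.)

python3 ../quantization/quantize.py \
    --model_dir ./Meta-Llama-3-70B-Instruct \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./llama3-70b-fp8-ckpt \
    --calib_size 512

trtllm-build \
    --checkpoint_dir ./llama3-70b-fp8-ckpt \
    --output_dir ./llama3-70b-fp8-engine \
    --gemm_plugin float16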
But I have 3 GPUs, is that an issue? 3x 4090.
Hi folks. Quantization for Llama 3 is a bit different. While the model was trained on a HUGE number of tokens (>15T), it seems like RTN doesn't work anymore. I found a strange phenomenon where RTN-int8 yielded worse output than AWQ (W4A16), and FP8 also showed better performance than RTN-int8. The llama.cpp community also raised the same problem when quantizing the Llama 3 model around 4 days ago. I still need more time before drawing conclusions from this.
+1, waiting for official support.
Could you also make these commands for 8B Instruct? I tried @iibw's commands but I feel like it is not as optimal as it should be: when I run it normally in transformers without the engine, the GPU gets used 100% during generation, but when using the engine it is about 30% and it is not so much faster 😕
I'm trying to quantize on 2xA100 and am getting the following out of memory error. I am on TensorRT-LLM 0.9.0 and not sure what the issue is. @njaramish any thoughts? Thanks!
@msgersch to quantize the 70B model, I think it is required to run the model at full precision. Therefore you might need at least 4 GPUs to build a quantized version of the model, while you can still set the tp/pp values to your desired GPU count. So you would need at least 4 H100 GPUs to build it, and then you can run the model on 1 or 2 GPUs.
I made it work on both the 8B and 70B models, but for the 70B model using multi-GPU with TP, the model won't stop after eos, even though I've replaced it with the right token. Did you encounter the same issue on the 70B model? It might be an issue with the tokenizer on ExecutorProxy, or with how I pass the SamplingConfig on ExecutorProxy.
Same for the 8B model, even after replacing the token.
For the Meta-Llama-3-8B model, I have tried to decrease the number of generated tokens via a small max_output_len, but it doesn't help. It seems that adding an eos_token also doesn't work. So how can I stop generation?
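(One approach shown earlier in this thread, at least for the Instruct variant served behind the Triton ensemble, is to pass <|eot_id|> as a stop word in the request; a minimal sketch with a placeholder host:)

curl -X POST llama3.example.com/v2/models/ensemble/generate -d '{
  "text_input": "...",
  "parameters": {
    "max_tokens": 256,
    "stop_words": ["<|eot_id|>"]
  }
}'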
System Info
llama3 released
https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6
https://github.com/meta-llama/llama3
Who can help?
@ncomly-nvidia
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
nothing here
Expected behavior
nothing here
actual behavior
nothing here
additional notes
nothing here