
[Usage]: Llama 3 8B Instruct Inference #4180

Closed
aliozts opened this issue Apr 18, 2024 · 19 comments
Labels
usage How to use vllm

Comments

@aliozts

aliozts commented Apr 18, 2024

Your current environment

Using the latest version of vLLM on 2 L4 GPUs.

How would you like to use vllm

I was trying to use vLLM to deploy the meta-llama/Meta-Llama-3-8B-Instruct model behind the OpenAI-compatible server with the latest Docker image. When I did, generation kept going for a long time when max_tokens=None. I saw that the model generates the <|eot_id|> token, which is apparently its EOS token, but in its tokenizer_config and other configs the EOS token is <|end_of_text|>.

I can fix this by setting the eos_token parameter in tokenizer_config.json to <|eot_id|>, or by passing stop_token_ids in my request:

from openai import OpenAI

# Client pointed at the vLLM OpenAI-compatible server (adjust base_url/api_key to your deployment).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user",
               "content": "Write a function for fibonacci sequence. Use LRUCache"}],
    max_tokens=700,
    stream=False,
    extra_body={"stop_token_ids": [128009]},  # 128009 == <|eot_id|>
)

I wanted to ask for the optimal way to solve this problem.

There is an existing discussion/PR in their repo that updates generation_config.json, but unless I clone the model myself, vLLM does not seem to pick up the generation_config.json file. I also tried with this revision, but it still did not stop generating after <|eot_id|>. I tried with this revision as well, and it did not stop generating either.

tl;dr: the Llama-3-8B-Instruct model does not stop generation because of the EOS token mismatch.

  • Updating generation_config.json does not work.
  • Updating config.json also does not work.
  • Updating tokenizer_config.json works but it overwrites the existing eos_token. Is this problematic or is there a more elegant way to solve this?

May I ask what the optimal way to solve this issue is?

@aliozts aliozts added the usage How to use vllm label Apr 18, 2024
@simon-mo
Collaborator

What you are doing here with stop_token_ids is the right temporary fix. I'll send a PR to respect generation_config.json, and once meta-llama/Meta-Llama-3-8B-Instruct is updated on the Hub it should work out of the box.

The generation config supports multiple EOS tokens.
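
For reference, the generation_config.json update under discussion would look roughly like this (a sketch; per this thread, 128001 is <|end_of_text|> and 128009 is <|eot_id|>):

{
  "bos_token_id": 128000,
  "eos_token_id": [128001, 128009]
}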

@njhill
Member

njhill commented Apr 18, 2024

It does not appear that the -instruct models will output the configured EOS token, so I think it's also safe to change it in the config to <|eot_id|> (128009). The base / pre-trained models should keep the existing <|end_of_text|> EOS token, though.
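
To double-check the two IDs involved, here is a quick sketch with the Hugging Face tokenizer (assuming access to the gated repo; the expected values follow from the discussion above):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(tok.eos_token, tok.eos_token_id)          # <|end_of_text|> 128001 with the original config
print(tok.convert_tokens_to_ids("<|eot_id|>"))  # 128009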

@aliozts
Author

aliozts commented Apr 18, 2024

thank you so much!

@haqishen

haqishen commented Apr 19, 2024

Based on the sample code on the HF page for LLaMA3 (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), it is necessary to manually add a new EOS token as follows.

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

Therefore, I think this might not be an issue that the vllm team needs to address, but rather something that requires manually adding this EOS token when using vllm to generate with LLaMA3.

Here's sample code for handling it in batch inference:

from vllm import LLM, SamplingParams

llm = LLM(
    model=name,
    trust_remote_code=True,
    tensor_parallel_size=2,
)
tokenizer = llm.get_tokenizer()

conversations = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': 'hi, how are you?'}],
    tokenize=False,
)

outputs = llm.generate(
    [conversations],
    SamplingParams(
        temperature=0.5,
        top_p=0.9,
        max_tokens=1024,
        stop_token_ids=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],  # KEYPOINT HERE
    )
)
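
For completeness, the generated text can then be read from the returned RequestOutput objects, e.g. (a minimal sketch):

for output in outputs:
    print(output.outputs[0].text)  # generated completion for each prompt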

@joshuachak
> (quoting @haqishen's batch-inference example above)

Yes, the README says we need to add the terminators ([tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]) manually to stop inference. I tried it (in the mlx-lm library, though) and it works well.

@eav-solution

How do I use this with vllm-openai? Please advise.

@aliozts
Author

aliozts commented Apr 19, 2024

@eav-solution to use it with the Docker image, I think you need to wait for the 0.4.1 release and also for the update to Meta's Hugging Face repo, as far as I can tell.

The respective PR #4182 has been merged, but it will only be released with 0.4.1. Alternatively, you can build a Docker image and install vLLM from source; you will also need your own copy of the model, because you have to change generation_config.json so that eos_token -> [128001, 128009].

Another solution, also confirmed by @njhill, is to change tokenizer_config.json and set eos_token to <|eot_id|>. Either way, you currently need to install from source and modify the model files accordingly, as far as I can tell. I hope this helps, but please take this with a pinch of salt and confirm it by looking at this issue and the model's repo files/discussions on HF.

@eav-solution

That's great! Thanks!!

@ericg108

@aliozts Hi Ali, I'm doing the same thing, but I got unexpected answers containing <|im_end|> and <|im_start|>.

I guess I'm using the wrong chat template. Can you share your chat template, if I may ask? Thanks a lot.

@aliozts
Author

aliozts commented Apr 19, 2024

@ericg108 I'm not using a custom chat template, so I wouldn't want to misinform you about it.

@Saigut

Saigut commented Apr 21, 2024

> (quoting @aliozts's reply above about the 0.4.1 release and modifying generation_config.json / tokenizer_config.json)

With vLLM v0.3.0:
Changing eos_token to <|eot_id|> in tokenizer_config.json works.
Changing generation_config.json so that eos_token -> [128001, 128009] has no effect; the model will not stop.

@MoGuGuai-hzr
> (quoting @haqishen's batch-inference example above)

This is fantastic, the code works perfectly with this.

Do you know how I should modify it if I want to deploy the API using vllm.entrypoints.openai.api_server? I have no idea.

@john-theo
> (quoting @haqishen's batch-inference example and @MoGuGuai-hzr's question above about deploying with vllm.entrypoints.openai.api_server)

@MoGuGuai-hzr You won't need to alter any lines for the server setup. Just add the stop_token_ids line in every request to patch the issue for now:

{
    "model": "/your/path/to/meta-llama/Meta-Llama-3-70B-Instruct",
    "messages": [{
        "role": "user",
        "content": "Hello!"
    }],
    "stop_token_ids": [128001, 128009] // THIS LINE
}

Note that these two IDs correspond to tokenizer.eos_token_id and tokenizer.convert_tokens_to_ids("<|eot_id|>"), respectively.
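
As a usage sketch, a body like the above can be POSTed to the OpenAI-compatible server, e.g. with Python requests (assuming the server runs locally on the default port 8000 and was started with the 8B Instruct model):

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stop_token_ids": [128001, 128009],  # <|end_of_text|>, <|eot_id|>
    },
)
print(resp.json()["choices"][0]["message"]["content"])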

@MoGuGuai-hzr
> (quoting @john-theo's server-request example above)

Nice, the perfect solution. Thank you so much.

@hongyinjie

hongyinjie commented Apr 22, 2024

Changing eot_id to end_of_text in the chat_template of tokenizer_config.json does not work.

Changing eos_token to <|eot_id|> in tokenizer_config.json works.

It works!

@linzm1007

linzm1007 commented Apr 23, 2024

> Change the file tokenizer_config.json: chat_template eot_id ==> end_of_text
>
> It works!

[image]

Why didn't it work for me?

@krishna-cactus

krishna-cactus commented Apr 23, 2024

A noob here. Would using the stop parameter of openai chat.completions.create with the value ["<|eot_id|>"] not solve this problem?

@simon-mo
Collaborator

The new version of vLLM (https://github.com/vllm-project/vllm/releases/tag/v0.4.1) has been released, which is now compatible with Llama 3's new end-of-turn stop token.

@Zg-Serein
> (quoting @hongyinjie's comment above: changing eos_token to <|eot_id|> in tokenizer_config.json works)

Hello, may I ask: my eos_token should be correct, since I am using the latest llama3-instruct configuration files, consistent with the modifications in this GitHub issue, but generation still does not stop correctly. What might be the cause?

My vLLM version is 0.4.2.

Here are my configuration files.

generation_config.json:

{
  "bos_token_id": 128000,
  "eos_token_id": [128001, 128009],
  "do_sample": true,
  "temperature": 0.6,
  "max_length": 4096,
  "top_p": 0.9,
  "transformers_version": "4.40.0.dev0"
}

tokenizer_config.json (excerpt):

"bos_token": "<|begin_of_text|>",
"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}",
"clean_up_tokenization_spaces": true,
"eos_token": "<|eot_id|>",
