
[Usage]: Llama 3 8B Instruct Inference #4180

Closed
aliozts opened this issue Apr 18, 2024 · 19 comments
Labels
usage How to use vllm

Comments

@aliozts

aliozts commented Apr 18, 2024

Your current environment

Using the latest version of vLLM on 2 L4 GPUs.

How would you like to use vllm

I was trying to use vLLM to deploy the meta-llama/Meta-Llama-3-8B-Instruct model behind the OpenAI-compatible server with the latest Docker image. When I did, generation kept going for a long time when max_tokens=None. I saw that the model generates the <|eot_id|> token, which is apparently its EOS token, but in its tokenizer_config and other configs the EOS token is <|end_of_text|>.

I can fix this by setting the eos_token parameter in tokenizer_config.json to <|eot_id|>, or by passing stop_token_ids in my request:

from openai import OpenAI

# Client pointed at the vLLM OpenAI-compatible server (adjust base_url/api_key to your deployment).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user",
               "content": "Write a function for fibonacci sequence. Use LRUCache"}],
    max_tokens=700,
    stream=False,
    extra_body={"stop_token_ids": [128009]},  # 128009 == <|eot_id|>
)

I wanted to ask for the optimal way to solve this problem.

There is an existing discussion/PR in their repo that updates generation_config.json, but unless I clone the model myself, vLLM does not seem to pick up the generation_config.json file. I also tried with this revision, but it still did not stop generating after <|eot_id|>. I tried with this revision as well, and it did not stop generating either.

tl;dr: the Llama-3-8B-Instruct model does not stop generation because of the EOS token mismatch.

  • Updating generation_config.json does not work.
  • Updating config.json also does not work.
  • Updating tokenizer_config.json works but it overwrites the existing eos_token. Is this problematic or is there a more elegant way to solve this?

May I ask what the optimal way to solve this issue is?

@aliozts aliozts added the usage How to use vllm label Apr 18, 2024
@simon-mo
Collaborator

What you are doing here with stop_token_ids is the right temporary fix. I'll send a PR to respect generation_config.json, and once meta-llama/Meta-Llama-3-8B-Instruct is updated on the Hub it should work out of the box.

The generation config supports multiple EOS tokens.
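
For reference, the generation_config.json update under discussion would look roughly like this (a sketch; per this thread, 128001 is <|end_of_text|> and 128009 is <|eot_id|>):

{
  "bos_token_id": 128000,
  "eos_token_id": [128001, 128009]
}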

@njhill
Member

njhill commented Apr 18, 2024

It does not appear that the -instruct models will output the configured EOS token, so I think it's also safe to change it in the config to <|eot_id|> (128009). The base / pre-trained models should keep the existing <|end_of_text|> EOS token, though.
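
To double-check the two IDs involved, here is a quick sketch with the Hugging Face tokenizer (assuming access to the gated repo; the expected values follow from the discussion above):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(tok.eos_token, tok.eos_token_id)          # <|end_of_text|> 128001 with the original config
print(tok.convert_tokens_to_ids("<|eot_id|>"))  # 128009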

@aliozts
Author

aliozts commented Apr 18, 2024

thank you so much!

@haqishen

haqishen commented Apr 19, 2024

Based on the sample code on the HF page for LLaMA3 (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), it is necessary to manually add a new EOS token as follows.

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

Therefore, I think this might not be an issue that the vllm team needs to address, but rather something that requires manually adding this EOS token when using vllm to generate with LLaMA3.

Here's sample code for handling it in batch inference:

from vllm import LLM, SamplingParams

llm = LLM(
    model=name,
    trust_remote_code=True,
    tensor_parallel_size=2,
)
tokenizer = llm.get_tokenizer()

conversations = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': 'hi, how are you?'}],
    tokenize=False,
)

outputs = llm.generate(
    [conversations],
    SamplingParams(
        temperature=0.5,
        top_p=0.9,
        max_tokens=1024,
        stop_token_ids=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],  # KEYPOINT HERE
    )
)
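
For completeness, the generated text can then be read from the returned RequestOutput objects, e.g. (a minimal sketch):

for output in outputs:
    print(output.outputs[0].text)  # generated completion for each prompt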

@joshuachak
> (quoting @haqishen's batch-inference example above)

Yes, the README says we need to add the terminators ([tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]) manually to stop inference. I tried it (in the mlx-lm library, though) and it works well.

@eav-solution

How do I use this with vllm-openai? Please advise.

@aliozts
Author

aliozts commented Apr 19, 2024

@eav-solution to use it with the Docker image, I think you need to wait for the 0.4.1 release and also for the update to Meta's Hugging Face repo, as far as I can tell.

The respective PR #4182 has been merged, but it will only be released with 0.4.1. Alternatively, you can build a Docker image and install vLLM from source; you will also need your own copy of the model, because you have to change generation_config.json so that eos_token -> [128001, 128009].

Another solution, also confirmed by @njhill, is to change tokenizer_config.json and set eos_token to <|eot_id|>. Either way, you currently need to install from source and modify the model files accordingly, as far as I can tell. I hope this helps, but please take this with a pinch of salt and confirm it by looking at this issue and the model's repo files/discussions on HF.

@eav-solution

That's great! Thanks!!

@ericg108

@aliozts Hi Ali, I'm doing the same thing, but I got unexpected answers containing <|im_end|> and <|im_start|>.

I guess I'm using the wrong chat template. Can you share your chat template, if I may ask? Thanks a lot.

@aliozts
Author

aliozts commented Apr 19, 2024

@ericg108 I'm not using a custom chat template, so I wouldn't want to misinform you about it.

@Saigut

Saigut commented Apr 21, 2024

> (quoting @aliozts's reply above about the 0.4.1 release and modifying generation_config.json / tokenizer_config.json)

With vLLM v0.3.0:
Changing eos_token to <|eot_id|> in tokenizer_config.json works.
Changing generation_config.json so that eos_token -> [128001, 128009] has no effect; the model will not stop.

@MoGuGuai-hzr
> (quoting @haqishen's batch-inference example above)

This is fantastic, the code works perfectly with this.

Do you know how I should modify it if I want to deploy the API using vllm.entrypoints.openai.api_server? I have no idea.

@john-theo
> (quoting @haqishen's batch-inference example and @MoGuGuai-hzr's question above about deploying with vllm.entrypoints.openai.api_server)

@MoGuGuai-hzr You won't need to alter any lines for the server setup. Just add the stop_token_ids line in every request to patch the issue for now:

{
    "model": "/your/path/to/meta-llama/Meta-Llama-3-70B-Instruct",
    "messages": [{
        "role": "user",
        "content": "Hello!"
    }],
    "stop_token_ids": [128001, 128009] // THIS LINE
}

Note that these two IDs correspond to tokenizer.eos_token_id and tokenizer.convert_tokens_to_ids("<|eot_id|>"), respectively.
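
As a usage sketch, a body like the above can be POSTed to the OpenAI-compatible server, e.g. with Python requests (assuming the server runs locally on the default port 8000 and was started with the 8B Instruct model):

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stop_token_ids": [128001, 128009],  # <|end_of_text|>, <|eot_id|>
    },
)
print(resp.json()["choices"][0]["message"]["content"])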

@MoGuGuai-hzr
> (quoting @john-theo's server-request example above)

Nice, the perfect solution. Thank you so much.

@hongyinjie

hongyinjie commented Apr 22, 2024

Changing eot_id to end_of_text in the chat_template of tokenizer_config.json does not work.

Changing eos_token to <|eot_id|> in tokenizer_config.json works.

It works!

@linzm1007

linzm1007 commented Apr 23, 2024

> Change the file tokenizer_config.json: chat_template eot_id ==> end_of_text
>
> It works!

[image]

Why didn't it work for me?

@krishna-cactus

krishna-cactus commented Apr 23, 2024

A noob here. Would using the stop parameter of openai chat.completions.create with the value ["<|eot_id|>"] not solve this problem?

@simon-mo
Collaborator

The new version of vLLM (https://github.com/vllm-project/vllm/releases/tag/v0.4.1) has been released, which is now compatible with Llama 3's new end-of-turn stop token.

@Zg-Serein
> (quoting @hongyinjie's comment above: changing eos_token to <|eot_id|> in tokenizer_config.json works)

Hello, may I ask: my eos_token should be correct, since I am using the latest llama3-instruct configuration files, consistent with the modifications in this GitHub issue, but generation still does not stop correctly. What might be the cause?

My vLLM version is 0.4.2.

Here are my configuration files.

generation_config.json:

{
  "bos_token_id": 128000,
  "eos_token_id": [128001, 128009],
  "do_sample": true,
  "temperature": 0.6,
  "max_length": 4096,
  "top_p": 0.9,
  "transformers_version": "4.40.0.dev0"
}

tokenizer_config.json (excerpt):

"bos_token": "<|begin_of_text|>",
"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}",
"clean_up_tokenization_spaces": true,
"eos_token": "<|eot_id|>",
