
How does SFTTrainer handle instruction formatted datasets when a tokenizer has no chat_template? #1233

Closed
JohnGiorgi opened this issue Jan 16, 2024 · 5 comments


@JohnGiorgi
Contributor

Hi! I am interested in using the SFTTrainer for instruction-tuning. Following the docs, I can see that I can provide examples in the following format to have the trainer format things for me:

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

The docs also say:

The SFTTrainer will then format the dataset for you using the defined format from the model’s tokenizer with the apply_chat_template method.

My question and confusion is, what does the trainer do if the tokenizer has no chat_template, as is the case with the base llama model?
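For context, the dataset format from the docs above is plain JSON Lines: one object per line with exactly a `prompt` and a `completion` key. A minimal sketch of building and round-tripping that format with the standard library (the example texts are made up):

```python
import json

# Hypothetical instruction-tuning examples in the prompt/completion
# format described in the TRL docs (the texts here are invented).
examples = [
    {"prompt": "Summarize: The cat sat on the mat.", "completion": "A cat sat on a mat."},
    {"prompt": "Translate to French: Hello", "completion": "Bonjour"},
]

# Serialize to JSON Lines: one JSON object per line.
jsonl = "\n".join(json.dumps(ex) for ex in examples)

# Round-trip: each line parses back to a dict with exactly these two keys.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert all(set(ex) == {"prompt", "completion"} for ex in parsed)
```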

@JohnGiorgi JohnGiorgi changed the title How does SFTTrainer handle instruction formatted datasets when a tokenizer has no template? How does SFTTrainer handle instruction formatted datasets when a tokenizer has no chat_template? Jan 16, 2024
@younesbelkada
Contributor

Hi @JohnGiorgi
Thanks for the issue! Per my understanding, if there is no chat template, that feature is simply not supported - correct, @philschmid?

@younesbelkada
Contributor

TL;DR: if you need to use a chat dataset, you need to use a model that supports chat templating. If you want to use that model anyway, I think you can clone it and manually add a chat template to it.

@philschmid
Contributor

philschmid commented Jan 17, 2024

Hey @JohnGiorgi,

There are fallback templates in transformers. If the loaded model/tokenizer does not have a chat_template, transformers falls back to the class-specific template; if there is no class-specific template, it falls back to the base chat template, which is the ChatML format.
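For reference, a minimal sketch of what that ChatML layout looks like, built with plain string formatting (this is an illustration, not the transformers implementation):

```python
def to_chatml(messages):
    """Render a list of {role, content} dicts in the ChatML layout:
    each turn wrapped in <|im_start|>role ... <|im_end|> markers."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

chat = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello, how can I help?"},
]
print(to_chatml(chat))
# <|im_start|>user
# Hi!<|im_end|>
# <|im_start|>assistant
# Hello, how can I help?<|im_end|>
```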

My question and confusion is, what does the trainer do if the tokenizer has no chat_template, as is the case with the base llama model?

If you do not define a chat_template, it will automatically use the Llama template. But I would recommend explicitly setting your template and making sure it is included in the tokenizer after saving. You can check out this guide: https://huggingface.co/docs/transformers/main/en/chat_templating#how-do-i-create-a-chat-template
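As a sketch of what "explicitly setting your template" means, the linked guide describes assigning a Jinja template string to the tokenizer; a ChatML-style template would look roughly like this (an illustration following that guide, not copied from this thread):

```jinja
{% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
```

This string would be assigned via `tokenizer.chat_template = template` and then persisted with `tokenizer.save_pretrained(...)` so the template travels with the saved tokenizer.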

@JohnGiorgi
Contributor Author

Thanks all! I understand better now and agree that it is probably best to explicitly define a template.

@philschmid
Contributor

#1242 might be interesting for you.
