
How does SFTTrainer handle instruction formatted datasets when a tokenizer has no chat_template? #1233

Closed
JohnGiorgi opened this issue Jan 16, 2024 · 5 comments


@JohnGiorgi
Contributor

Hi! I am interested in using the SFTTrainer for instruction-tuning. Following the docs, I can see that I can provide examples in the following format to have the trainer format things for me:

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

The docs also say:

The SFTTrainer will then format the dataset for you using the defined format from the model’s tokenizer with the apply_chat_template method.

My question and confusion is, what does the trainer do if the tokenizer has no chat_template, as is the case with the base llama model?
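For context, the dataset format from the docs above is plain JSON Lines: one object per line with exactly a `prompt` and a `completion` key. A minimal sketch of building and round-tripping that format with the standard library (the example texts are made up):

```python
import json

# Hypothetical instruction-tuning examples in the prompt/completion
# format described in the TRL docs (the texts here are invented).
examples = [
    {"prompt": "Summarize: The cat sat on the mat.", "completion": "A cat sat on a mat."},
    {"prompt": "Translate to French: Hello", "completion": "Bonjour"},
]

# Serialize to JSON Lines: one JSON object per line.
jsonl = "\n".join(json.dumps(ex) for ex in examples)

# Round-trip: each line parses back to a dict with exactly these two keys.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert all(set(ex) == {"prompt", "completion"} for ex in parsed)
```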

@JohnGiorgi JohnGiorgi changed the title How does SFTTrainer handle instruction formatted datasets when a tokenizer has no template? How does SFTTrainer handle instruction formatted datasets when a tokenizer has no chat_template? Jan 16, 2024
@younesbelkada
Contributor

Hi @JohnGiorgi
Thanks for the issue! Per my understanding, if there is no chat template, that feature is simply not supported - correct, @philschmid?

@younesbelkada
Contributor

TL;DR: if you need to use a chat dataset, you need to use a model that supports chat templating. If you want to use that model anyway, I think you can clone it and manually add a chat template to it.

@philschmid
Contributor

philschmid commented Jan 17, 2024

Hey @JohnGiorgi,

There are fallback templates in transformers. If the loaded model/tokenizer does not have a chat_template, transformers falls back to the class-specific template; if there is no class-specific template, it falls back to the base chat template, which is the ChatML format.
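For reference, a minimal sketch of what that ChatML layout looks like, built with plain string formatting (this is an illustration, not the transformers implementation):

```python
def to_chatml(messages):
    """Render a list of {role, content} dicts in the ChatML layout:
    each turn wrapped in <|im_start|>role ... <|im_end|> markers."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

chat = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello, how can I help?"},
]
print(to_chatml(chat))
# <|im_start|>user
# Hi!<|im_end|>
# <|im_start|>assistant
# Hello, how can I help?<|im_end|>
```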

My question and confusion is, what does the trainer do if the tokenizer has no chat_template, as is the case with the base llama model?

If you do not define a chat_template, it will automatically use the Llama template. But I would recommend explicitly setting your template and making sure it is included in the tokenizer after saving. You can check out this guide: https://huggingface.co/docs/transformers/main/en/chat_templating#how-do-i-create-a-chat-template
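As a sketch of what "explicitly setting your template" means, the linked guide describes assigning a Jinja template string to the tokenizer; a ChatML-style template would look roughly like this (an illustration following that guide, not copied from this thread):

```jinja
{% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
```

This string would be assigned via `tokenizer.chat_template = template` and then persisted with `tokenizer.save_pretrained(...)` so the template travels with the saved tokenizer.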

@JohnGiorgi
Contributor Author

Thanks all! I understand better now and agree that it is probably best to explicitly define a template.

@philschmid
Contributor

#1242 might be interesting for you.
