[SFT Trainer] precompute packed iterable into a dataset #979
Conversation
Great fix @lvwerra - very clean and elegant 🔥!
Have you run the SFT example script to verify that the training steps are correct and the learning rate scheduler etc. all look OK?
for i in constant_length_iterator:
    yield i

try:
It would be nice to have a code comment that explains why you need to create the packed dataset via from_generator and data_generator
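For context, a rough sketch of what such a comment might explain, assuming the datasets library's Dataset.from_generator API; constant_length_iterator stands in for TRL's packed iterable (e.g. ConstantLengthDataset) and is assumed to already exist:

from datasets import Dataset

def data_generator(constant_length_iterator):
    # Dataset.from_generator expects a generator *function* it can call,
    # not an iterator instance, so we wrap the iterator in this thin
    # generator and pass the iterator itself via gen_kwargs.
    for sample in constant_length_iterator:
        yield sample

# constant_length_iterator is assumed to yield packed, tokenized samples.
packed_dataset = Dataset.from_generator(
    data_generator,
    gen_kwargs={"constant_length_iterator": constant_length_iterator},
)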
Same question here
Looking great, thanks @lvwerra!
I tested this PR with the script below:
from transformers import AutoModelForCausalLM, TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer
# train[:5%] of imdb is ~1,250 reviews, which should pack into ~700 examples
dataset = load_dataset("imdb", split="train[:5%]")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
args = TrainingArguments(
output_dir="./",
per_device_train_batch_size=2,
num_train_epochs=1,
logging_steps=1,
lr_scheduler_type="cosine",
)
trainer = SFTTrainer(
model,
args=args,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=512,
packing=True,
)
trainer.train()
Commits:
* precompute packed iterable into a dataset
* add generator function
* fix typo
* fix style
* fix test
* fix style
* add test
* minor refactor
* fix test
* Apply suggestions from code review (Co-authored-by: lewtun <[email protected]>)
* style

Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: younesbelkada <[email protected]>
With this PR we precompute the packed iterable into a regular dataset. That makes sure the epoch estimate is right and we have a working progress bar.
Fixes #1008 #1004
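For context, a minimal sketch (not the PR's actual code) of why materializing the iterable helps, assuming the datasets library; the toy gen() stands in for the packed iterable:

from datasets import Dataset

# A generator-backed iterable has no __len__, so the Trainer cannot
# compute steps per epoch, and the progress bar / LR schedule drift.
def gen():
    for i in range(5):
        yield {"input_ids": [i]}

# Materializing once into a map-style Dataset gives it a known length.
materialized = Dataset.from_generator(gen)
print(len(materialized))  # 5 -> steps per epoch can now be computed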