
[SFT Trainer] precompute packed iterable into a dataset #979

Merged · 11 commits merged into main on Dec 4, 2023

Conversation

@lvwerra (Member) commented on Nov 10, 2023

With this PR we precompute the packed iterable dataset and materialize it into a regular dataset. That makes sure the epoch estimate is right and we have a working progress bar.

Fixes #1008, #1004
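
For context, a minimal sketch of the approach (assumed shape, not the exact PR diff; ConstantLengthDataset is TRL's packing iterator, and the wrapper name data_generator follows the review discussion below):

from datasets import Dataset, load_dataset
from transformers import AutoTokenizer
from trl.trainer import ConstantLengthDataset

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
raw_dataset = load_dataset("imdb", split="train[:1%]")

# The packing iterator streams fixed-length token blocks, but the number of
# blocks it will yield is not known up front, so epoch estimates based on it
# were wrong.
constant_length_iterator = ConstantLengthDataset(
    tokenizer, raw_dataset, dataset_text_field="text", seq_length=512
)

def data_generator(constant_length_iterator):
    yield from constant_length_iterator

# Materializing the iterable into a map-style Dataset gives the Trainer a
# real length, fixing the epoch estimate and the progress bar.
packed_dataset = Dataset.from_generator(
    data_generator, gen_kwargs={"constant_length_iterator": constant_length_iterator}
)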

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@lewtun (Member) left a comment


Great fix @lvwerra, very clean and elegant 🔥!

Have you run the SFT example script to verify that the training steps are correct and that the learning rate scheduler etc. all look OK?

def data_generator(constant_length_iterator):
    for i in constant_length_iterator:
        yield i

try:

It would be nice to have a code comment that explains why you need to create the packed dataset via from_generator and data_generator


Same question here
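
For reference, the kind of comment being asked for might read roughly like this (illustrative wording only, not the comment that actually landed):

def data_generator(constant_length_iterator):
    # ConstantLengthDataset streams packed token blocks, but the number of
    # blocks it yields is only known after full iteration. from_generator
    # expects a generator function, so we wrap the iterator here; passing it
    # via gen_kwargs lets datasets fingerprint and cache the materialized
    # result. The resulting map-style dataset has a real length, which fixes
    # the Trainer's epoch estimate and progress bar.
    yield from constant_length_iterator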

@younesbelkada (Contributor) left a comment


Looking great, thanks @lvwerra!
I tested this PR with the script below:

from transformers import AutoModelForCausalLM, TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer

# train[:5%] of the IMDB train split (25,000 examples) is 1,250 examples
dataset = load_dataset("imdb", split="train[:5%]")

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

args = TrainingArguments(
    output_dir="./",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_steps=1,
    lr_scheduler_type="cosine",
)

trainer = SFTTrainer(
    model,
    args=args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    packing=True,
)

trainer.train()

And it gives me correct LR curves:
[Screenshot: cosine learning-rate curve from the run above]
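
As a further check (hypothetical, not part of the PR): once packing is precomputed, the train dataset is map-style with a concrete length, so the expected step count can be derived directly (assuming a single device and no gradient accumulation):

num_packed = len(trainer.train_dataset)  # number of packed 512-token blocks
steps_per_epoch = num_packed // args.per_device_train_batch_size
print(f"{num_packed} packed sequences -> {steps_per_epoch} optimizer steps per epoch")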

@younesbelkada younesbelkada merged commit f06f357 into main Dec 4, 2023
@younesbelkada younesbelkada deleted the precompute-packing branch December 4, 2023 12:13
lapp0 pushed a commit to lapp0/trl that referenced this pull request on May 10, 2024: [SFT Trainer] precompute packed iterable into a dataset (#979)

* precompute packed iterable into a dataset

* add generator function

* fix typo

* fix style

* fix test

* fix style

* add test

* minor refactor

* fix test

* Apply suggestions from code review

Co-authored-by: lewtun <[email protected]>

* style

---------

Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: younesbelkada <[email protected]>
Labels: 🏋 SFT (Related to SFT)

Successfully merging this pull request may close these issues: SFTTrainer training stops early?

5 participants