
[SFT Trainer] precompute packed iterable into a dataset #979

Merged · 11 commits merged into main on Dec 4, 2023

Conversation

@lvwerra (Member) commented on Nov 10, 2023

With this PR we precompute the packed iterable dataset and materialize it into a regular dataset. That makes sure the epoch estimate is right and we have a working progress bar.

Fixes #1008, #1004
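
For context, a minimal sketch of the approach (assumed shape, not the exact PR diff; ConstantLengthDataset is TRL's packing iterator, and the wrapper name data_generator follows the review discussion below):

from datasets import Dataset, load_dataset
from transformers import AutoTokenizer
from trl.trainer import ConstantLengthDataset

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
raw_dataset = load_dataset("imdb", split="train[:1%]")

# The packing iterator streams fixed-length token blocks, but the number of
# blocks it will yield is not known up front, so epoch estimates based on it
# were wrong.
constant_length_iterator = ConstantLengthDataset(
    tokenizer, raw_dataset, dataset_text_field="text", seq_length=512
)

def data_generator(constant_length_iterator):
    yield from constant_length_iterator

# Materializing the iterable into a map-style Dataset gives the Trainer a
# real length, fixing the epoch estimate and the progress bar.
packed_dataset = Dataset.from_generator(
    data_generator, gen_kwargs={"constant_length_iterator": constant_length_iterator}
)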

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@lewtun (Member) left a comment


Great fix @lvwerra, very clean and elegant 🔥!

Have you run the SFT example script to verify that the training steps are correct and that the learning rate scheduler etc. all look OK?

def data_generator(constant_length_iterator):
    for i in constant_length_iterator:
        yield i

try:

It would be nice to have a code comment that explains why you need to create the packed dataset via from_generator and data_generator


Same question here
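
For reference, the kind of comment being asked for might read roughly like this (illustrative wording only, not the comment that actually landed):

def data_generator(constant_length_iterator):
    # ConstantLengthDataset streams packed token blocks, but the number of
    # blocks it yields is only known after full iteration. from_generator
    # expects a generator function, so we wrap the iterator here; passing it
    # via gen_kwargs lets datasets fingerprint and cache the materialized
    # result. The resulting map-style dataset has a real length, which fixes
    # the Trainer's epoch estimate and progress bar.
    yield from constant_length_iterator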

@younesbelkada (Contributor) left a comment


Looking great, thanks @lvwerra!
I tested this PR with the script below:

from transformers import AutoModelForCausalLM, TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer

# train[:5%] of the IMDB train split (25,000 examples) is 1,250 examples
dataset = load_dataset("imdb", split="train[:5%]")

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

args = TrainingArguments(
    output_dir="./",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_steps=1,
    lr_scheduler_type="cosine",
)

trainer = SFTTrainer(
    model,
    args=args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    packing=True,
)

trainer.train()

And it gives me correct LR curves:
[Screenshot: cosine learning-rate curve from the run above]
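
As a further check (hypothetical, not part of the PR): once packing is precomputed, the train dataset is map-style with a concrete length, so the expected step count can be derived directly (assuming a single device and no gradient accumulation):

num_packed = len(trainer.train_dataset)  # number of packed 512-token blocks
steps_per_epoch = num_packed // args.per_device_train_batch_size
print(f"{num_packed} packed sequences -> {steps_per_epoch} optimizer steps per epoch")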

@younesbelkada younesbelkada merged commit f06f357 into main Dec 4, 2023
@younesbelkada younesbelkada deleted the precompute-packing branch December 4, 2023 12:13
lapp0 pushed a commit to lapp0/trl that referenced this pull request on May 10, 2024: [SFT Trainer] precompute packed iterable into a dataset (#979)

* precompute packed iterable into a dataset

* add generator function

* fix typo

* fix style

* fix test

* fix style

* add test

* minor refactor

* fix test

* Apply suggestions from code review

Co-authored-by: lewtun <[email protected]>

* style

---------

Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: younesbelkada <[email protected]>
Labels: 🏋 SFT (Related to SFT)

Successfully merging this pull request may close these issues: SFTTrainer training stops early?

5 participants