
Allow streaming (datasets.IterableDataset) #1468

Merged: 6 commits merged into huggingface:main on Apr 11, 2024
Conversation

BramVanroy (Contributor) commented on Mar 22, 2024

The motivation for this PR is given in this issue: #1455 (comment)

Currently, using IterableDatasets in the SFTTrainer is not feasible because of two issues:

1. a datasets.IterableDataset is a subclass of a torch dataset, and is therefore returned by the early check below before packing is ever considered:

    if isinstance(dataset, (torch.utils.data.IterableDataset, torch.utils.data.Dataset, ConstantLengthDataset)):
        return dataset

2. when packing, the given dataset is exhausted and loaded fully into memory with Dataset.from_generator, which defeats the purpose of having an IterableDataset in the first place:

    packed_dataset = Dataset.from_generator(
        data_generator, gen_kwargs={"constant_length_iterator": constant_length_iterator}
    )

This PR remedies both issues by explicitly checking whether the dataset is a datasets.IterableDataset: for point 1 the early return no longer applies to it, and for point 2 the packed ConstantLengthDataset is returned as-is (it is itself an IterableDataset) instead of being materialized into memory.
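For illustration, here is a minimal sketch of the resulting dispatch. It is not taken from the PR; the function and argument names are invented, and only the two checks mirror the behaviour described above.

    import datasets
    import torch


    def prepare_dataset_sketch(dataset, packing, constant_length_iterator=None):
        """Illustrative stand-in for the trainer's dataset preparation, not the trl code."""
        is_streaming = isinstance(dataset, datasets.IterableDataset)

        # Point 1: a datasets.IterableDataset is also a torch IterableDataset,
        # so it must be excluded from this early return explicitly.
        if not is_streaming and isinstance(
            dataset, (torch.utils.data.IterableDataset, torch.utils.data.Dataset)
        ):
            return dataset

        if packing:
            # `constant_length_iterator` stands in for the ConstantLengthDataset
            # that the trainer builds from `dataset`.
            if is_streaming:
                # Point 2: return the packed iterable as-is; it is itself an
                # IterableDataset, so nothing is loaded into memory.
                return constant_length_iterator

            def data_generator(constant_length_iterator):
                yield from constant_length_iterator

            return datasets.Dataset.from_generator(
                data_generator,
                gen_kwargs={"constant_length_iterator": constant_length_iterator},
            )

        return dataset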

These changes should make the SFTTrainer more compatible with streaming datasets. The motivation is that in the alignment handbook we also use the SFTTrainer for continued pretraining, where massive (streamed) datasets should be supported.

Note: it seems that the epoch calculation is not done correctly when max_steps is given. With a batch size of 2, 8 gradient accumulation steps, 7,153 optimization steps, and 16 GPUs, I get this strange calculation for the number of examples and epochs:

***** Running training *****
Num examples = 1,831,168
Num Epochs = 9,223,372,036,854,775,807
Instantaneous batch size per device = 2
Total train batch size (w. parallel, distributed & accumulation) = 256
Gradient Accumulation steps = 8
Total optimization steps = 7,153
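Side note, not part of the original comment: the huge epoch count appears to be sys.maxsize, the placeholder the transformers Trainer reports when training for a fixed max_steps on a dataset whose length it cannot determine, so it reads as a sentinel rather than a real epoch calculation. A quick check:

    import sys

    # On 64-bit Python, sys.maxsize equals the "Num Epochs" value logged above.
    print(sys.maxsize == 9_223_372_036_854_775_807)  # True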

closes #1455

younesbelkada (Contributor) left a comment:

Thanks a lot @BramVanroy for the detailed work on this! I think there should be no harm in supporting this in SFTTrainer. Can you run the styling checks (make precommit)? Then we can merge, imo.

HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

BramVanroy (Contributor, Author) replied on Apr 10, 2024:

Do you have any thoughts on the incorrect reporting of the number of epochs?

younesbelkada (Contributor) replied:

@BramVanroy hmmm, not sure what we can do here to be honest :/

BramVanroy (Contributor, Author) replied:

Hm yeah, maybe let's keep it like that for now then.

younesbelkada (Contributor) left a comment:

Thanks again!

younesbelkada merged commit e667550 into huggingface:main on Apr 11, 2024 (9 checks passed).
BramVanroy (Contributor, Author) commented:

@younesbelkada I probably should have included it in this PR too, but there is another useful (imo) change of just three lines of code that could make the SFTTrainer even more flexible. I added it in a separate PR: #1520

lapp0 pushed a commit to lapp0/trl that referenced this pull request on May 10, 2024:

* safe-guard iterabledatasets
* import datasets
* reference the correct IterableDataset
* make pre-commit
snow-kartikbagalore commented on Oct 12, 2024:

@BramVanroy @younesbelkada Hello guys, amazing work on this repo. I wanted to know whether it is possible to pass num_train_epochs now, or whether max_steps is still the recommended way to go. I am working with a very large dataset and am streaming it by passing it as a ConstantLengthDataset.
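Not an answer from the maintainers, but a rough sketch of the streaming setup being asked about. The model and dataset names, the text column, and the exact keyword arguments are assumptions (these options have moved between TrainingArguments, SFTConfig, and SFTTrainer across trl versions); the key point is that a streamed dataset has no known length, so the Trainer needs max_steps rather than num_train_epochs to decide when to stop.

    from datasets import load_dataset
    from transformers import AutoTokenizer, TrainingArguments
    from trl import SFTTrainer
    from trl.trainer import ConstantLengthDataset

    # Hypothetical model and dataset names, for illustration only.
    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    streamed = load_dataset("your_org/your_large_corpus", split="train", streaming=True)

    # Pack the streamed text into fixed-length blocks; this object is itself an
    # IterableDataset and is passed to the trainer as-is.
    train_dataset = ConstantLengthDataset(
        tokenizer,
        streamed,
        dataset_text_field="text",   # assumed column name
        seq_length=1024,
        infinite=True,               # keep cycling; max_steps decides when to stop
    )

    args = TrainingArguments(
        output_dir="out",
        max_steps=7_153,             # required: the stream has no known length
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
    )

    trainer = SFTTrainer(
        model=model_name,
        args=args,
        train_dataset=train_dataset,
    )
    trainer.train()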
