Set limits for fetcher.done #18441

Merged: 85 commits, Sep 7, 2023


@awaelchli awaelchli commented Aug 30, 2023

What does this PR do?

Follow-up to #18376: makes the dataloader_iter respect the limits set in the Trainer (for example limit_train_batches and limit_val_batches).
Fixes #18334
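
To illustrate the idea of an iterator that respects a batch limit and reports it through a done flag, here is a minimal, framework-free sketch. It is not Lightning's implementation: the class name LimitedFetcher and its attributes limit and fetched are hypothetical, and unlike a prefetching fetcher it only learns that the underlying iterable is exhausted when a fetch fails.

```python
from typing import Any, Iterable, Optional


class LimitedFetcher:
    """Hypothetical wrapper: yields at most `limit` items and tracks `done`."""

    def __init__(self, iterable: Iterable[Any], limit: Optional[int] = None) -> None:
        self._iterator = iter(iterable)
        self.limit = limit  # None means "no limit"
        self.fetched = 0
        self.done = limit == 0  # a limit of zero is exhausted from the start

    def __iter__(self) -> "LimitedFetcher":
        return self

    def __next__(self) -> Any:
        if self.done:
            raise StopIteration
        try:
            item = next(self._iterator)
        except StopIteration:
            # The underlying iterable ran out before the limit was reached.
            self.done = True
            raise
        self.fetched += 1
        # Flip `done` as soon as the limit is hit, so callers can check it
        # before attempting another fetch.
        if self.limit is not None and self.fetched >= self.limit:
            self.done = True
        return item


fetcher = LimitedFetcher(range(8), limit=4)
consumed = list(fetcher)
print(consumed, fetcher.done)  # [0, 1, 2, 3] True
```

In this sketch a step that pulls several items per call can consult done before fetching and still has to handle StopIteration, which is the pattern the debugging script in this description exercises.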

Debugging script to compare iterations to master branch (demonstrates NeMo use case):

import torch
from torch.utils.data import DataLoader, Dataset

from lightning.pytorch import LightningModule, Trainer

global_batch_size = 4
micro_batch_size = 2
assert global_batch_size % micro_batch_size == 0


class RandomDataset(Dataset):
    def __init__(self, length):
        self.len = length
        self.data = torch.randn(length, 32)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.val_fetched = 0
        self.val_iter_raised = False
        self.val_iter_done = False
        self.val_step_entered = 0

        self.train_fetched = 0
        self.train_iter_raised = False
        self.train_iter_done = False
        self.train_step_entered = 0

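    def training_step(self, dataloader_iter, batch_idx):
        self.train_step_entered += 1
        # Record whether the fetcher already reports exhaustion when the step starts.
        self.train_iter_done = dataloader_iter.done
        # Pull one global batch worth of micro-batches from the iterator.
        for _ in range(global_batch_size // micro_batch_size):
            try:
                batch = next(dataloader_iter)
            except StopIteration:
                self.train_iter_raised = True
                return None
            self.train_fetched += 1
        # Only the last micro-batch contributes to the loss in this debugging script.
        return self.layer(batch).sum()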

    def validation_step(self, dataloader_iter, batch_idx):
        self.val_step_entered += 1
        # Record whether the fetcher already reports exhaustion when the step starts.
        self.val_iter_done = dataloader_iter.done
        # Pull one global batch worth of micro-batches from the iterator.
        for _ in range(global_batch_size // micro_batch_size):
            try:
                batch = next(dataloader_iter)
            except StopIteration:
                self.val_iter_raised = True
                return
            self.val_fetched += 1
            self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


train_data = DataLoader(RandomDataset(length=16), batch_size=micro_batch_size)
val_data = DataLoader(RandomDataset(length=16), batch_size=micro_batch_size)

model = BoringModel()
trainer = Trainer(
    # limit_train_batches=3,
    limit_val_batches=4,
    num_sanity_val_steps=0,
    # max_steps=2,
    max_epochs=1,
    accelerator="cpu",
)
trainer.fit(model, train_data, val_data)
# trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)

print("train fetched", model.train_fetched)
print("train step entered", model.train_step_entered)
print("train iter exhausted", model.train_iter_raised)

print("val fetched", model.val_fetched)
print("val step entered", model.val_step_entered)
print("val iter exhausted", model.val_iter_raised)

cc @Borda @justusschock @awaelchli @carmocca

@github-actions github-actions bot added the pl label Aug 30, 2023
@awaelchli awaelchli force-pushed the dataloader-iter/via-loader-length branch from 0353163 to 49ce20f Compare August 30, 2023 22:11
@awaelchli awaelchli changed the title from "Set limits for fetcher.done V2" to "WIP: (v2) Set limits for fetcher.done" Aug 31, 2023
@awaelchli awaelchli marked this pull request as ready for review August 31, 2023 12:06
github-actions bot commented Aug 31, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 1.11) success
pl-cpu (macOS-11, lightning, 3.9, 1.12) success
pl-cpu (macOS-11, lightning, 3.10, 1.13) success
pl-cpu (macOS-11, lightning, 3.10, 2.0) success
pl-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11) success
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
pl-cpu (windows-2022, lightning, 3.8, 1.11) success
pl-cpu (windows-2022, lightning, 3.9, 1.12) success
pl-cpu (windows-2022, lightning, 3.10, 1.13) success
pl-cpu (windows-2022, lightning, 3.10, 2.0) success
pl-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
pl-cpu (macOS-11, pytorch, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13) success
pl-cpu (windows-2022, pytorch, 3.8, 1.13) success
pl-cpu (macOS-12, pytorch, 3.11, 2.0) success
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.0) success
pl-cpu (windows-2022, pytorch, 3.11, 2.0) success

These checks are required after the changes to src/lightning/pytorch/loops/evaluation_loop.py, src/lightning/pytorch/loops/fetchers.py, src/lightning/pytorch/loops/fit_loop.py, src/lightning/pytorch/loops/prediction_loop.py, src/lightning/pytorch/loops/training_epoch_loop.py, src/lightning/pytorch/trainer/trainer.py, src/lightning/pytorch/utilities/combined_loader.py, tests/tests_pytorch/loops/test_evaluation_loop.py, tests/tests_pytorch/loops/test_fetchers.py, tests/tests_pytorch/loops/test_loops.py, tests/tests_pytorch/strategies/test_single_device.py, tests/tests_pytorch/trainer/properties/test_estimated_stepping_batches.py, tests/tests_pytorch/trainer/test_dataloaders.py, tests/tests_pytorch/trainer/test_trainer.py, tests/tests_pytorch/utilities/test_combined_loader.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
[pytorch-lightning (GPUs) (testing Lightning latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=173284&view=logs&jobId=47e66f3c-897a-5428-da11-bf5c7745762e) success
[pytorch-lightning (GPUs) (testing PyTorch latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=173284&view=logs&jobId=3f274fac-2e11-54ca-487e-194c91f3ae9f) success

These checks are required after the changes to src/lightning/pytorch/loops/evaluation_loop.py, src/lightning/pytorch/loops/fetchers.py, src/lightning/pytorch/loops/fit_loop.py, src/lightning/pytorch/loops/prediction_loop.py, src/lightning/pytorch/loops/training_epoch_loop.py, src/lightning/pytorch/trainer/trainer.py, src/lightning/pytorch/utilities/combined_loader.py, tests/tests_pytorch/loops/test_evaluation_loop.py, tests/tests_pytorch/loops/test_fetchers.py, tests/tests_pytorch/loops/test_loops.py, tests/tests_pytorch/strategies/test_single_device.py, tests/tests_pytorch/trainer/properties/test_estimated_stepping_batches.py, tests/tests_pytorch/trainer/test_dataloaders.py, tests/tests_pytorch/trainer/test_trainer.py, tests/tests_pytorch/utilities/test_combined_loader.py.

🟢 pytorch_lightning: Benchmarks
Check ID Status
lightning.Benchmarks success

These checks are required after the changes to src/lightning/pytorch/loops/evaluation_loop.py, src/lightning/pytorch/loops/fetchers.py, src/lightning/pytorch/loops/fit_loop.py, src/lightning/pytorch/loops/prediction_loop.py, src/lightning/pytorch/loops/training_epoch_loop.py, src/lightning/pytorch/trainer/trainer.py, src/lightning/pytorch/utilities/combined_loader.py.

🟢 pytorch_lightning: Docs
Check ID Status
docs-make (pytorch, doctest) success
docs-make (pytorch, html) success

These checks are required after the changes to src/lightning/pytorch/loops/evaluation_loop.py, src/lightning/pytorch/loops/fetchers.py, src/lightning/pytorch/loops/fit_loop.py, src/lightning/pytorch/loops/prediction_loop.py, src/lightning/pytorch/loops/training_epoch_loop.py, src/lightning/pytorch/trainer/trainer.py, src/lightning/pytorch/utilities/combined_loader.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/pytorch/loops/evaluation_loop.py, src/lightning/pytorch/loops/fetchers.py, src/lightning/pytorch/loops/fit_loop.py, src/lightning/pytorch/loops/prediction_loop.py, src/lightning/pytorch/loops/training_epoch_loop.py, src/lightning/pytorch/trainer/trainer.py, src/lightning/pytorch/utilities/combined_loader.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.11) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.11) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.11) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.11) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.11) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.11) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.11) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.11) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.11) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.11) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.11) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.11) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.11) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.11) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.11) success

These checks are required after the changes to src/lightning/pytorch/loops/evaluation_loop.py, src/lightning/pytorch/loops/fetchers.py, src/lightning/pytorch/loops/fit_loop.py, src/lightning/pytorch/loops/prediction_loop.py, src/lightning/pytorch/loops/training_epoch_loop.py, src/lightning/pytorch/trainer/trainer.py, src/lightning/pytorch/utilities/combined_loader.py.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

@awaelchli awaelchli requested a review from carmocca September 5, 2023 14:21
@mergify mergify bot added the has conflicts label Sep 5, 2023

@carmocca carmocca left a comment


Tests are thorough, good job

Review threads (all resolved):
src/lightning/pytorch/loops/fetchers.py (outdated)
src/lightning/pytorch/loops/fit_loop.py
src/lightning/pytorch/loops/fit_loop.py (outdated)
src/lightning/pytorch/loops/fit_loop.py
src/lightning/pytorch/loops/training_epoch_loop.py (outdated)
src/lightning/pytorch/utilities/combined_loader.py (outdated)
src/lightning/pytorch/utilities/combined_loader.py (outdated)
tests/tests_pytorch/loops/test_loops.py (outdated)
tests/tests_pytorch/loops/test_loops.py (outdated)
tests/tests_pytorch/utilities/test_combined_loader.py (outdated)
@mergify mergify bot removed the has conflicts label Sep 5, 2023
@awaelchli awaelchli requested a review from carmocca September 6, 2023 01:28
@awaelchli awaelchli added the ready label Sep 6, 2023
Labels
data handling (Generic data-related topic)
feature (Is an improvement or enhancement)
loops (Related to the Loop API)
pl (Generic label for PyTorch Lightning package)
ready (PRs ready to be merged)

Successfully merging this pull request may close these issues.

training_step(dataloader_iter) does not consider limit_train_batches properly
3 participants