
Replace DataLoader sampler once for IPUs #8858

Merged: 16 commits into master, Aug 16, 2021

Conversation

@carmocca (Contributor) commented Aug 11, 2021

What does this PR do?

  • Fixes the current IPU CI failures.
  • Shortens the IPU CI run.

We were getting the following error:

        if batch_sampler is not None:
            # auto_collation with custom batch_sampler
            if batch_size != 1 or shuffle or sampler is not None or drop_last:
                raise ValueError('batch_sampler option is mutually exclusive '
                                 'with batch_size, shuffle, sampler, and '
                                 'drop_last')
            batch_size = None
            drop_last = False
        elif batch_size is None:
            # no auto_collation
            if drop_last:
>               raise ValueError('batch_size=None option disables auto-batching '
                                 'and is mutually exclusive with drop_last')
E               ValueError: batch_size=None option disables auto-batching and is mutually exclusive with drop_last

/root/miniconda3/envs/lightning/lib/python3.8/site-packages/torch/utils/data/dataloader.py:251: ValueError

The cause was that we first called replace_sampler (which re-creates the dataloader with the correct sampler) and then ran _convert_to_poptorch_loader as a second post-processing step on the dataloaders, which re-creates the dataloader again.

However, during the first re-creation a batch_sampler is passed, which, by DataLoader's internal design, sets batch_size = None:

        if batch_sampler is not None:
            # auto_collation with custom batch_sampler
            if batch_size != 1 or shuffle or sampler is not None or drop_last:
                raise ValueError('batch_sampler option is mutually exclusive '
                                 'with batch_size, shuffle, sampler, and '
                                 'drop_last')
            batch_size = None

This means that the second re-creation runs with batch_size = None and drop_last = True, which raises the ValueError observed above.
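The failure mode can be sketched in pure Python. The `validate` function below is a simplified, illustrative stand-in for the argument validation inside `torch.utils.data.DataLoader.__init__` (it is not the real API), reproducing the two quoted branches:

```python
# Simplified stand-in for DataLoader's argument validation (illustrative only).
def validate(batch_size=1, shuffle=False, sampler=None,
             batch_sampler=None, drop_last=False):
    if batch_sampler is not None:
        if batch_size != 1 or shuffle or sampler is not None or drop_last:
            raise ValueError("batch_sampler option is mutually exclusive "
                             "with batch_size, shuffle, sampler, and drop_last")
        # DataLoader internally disables auto-batching here:
        batch_size, drop_last = None, False
    elif batch_size is None and drop_last:
        raise ValueError("batch_size=None option disables auto-batching "
                         "and is mutually exclusive with drop_last")
    return batch_size, drop_last

# First re-creation: a batch_sampler is passed, so batch_size becomes None.
batch_size, drop_last = validate(batch_sampler=object())
assert batch_size is None

# Second re-creation: reading batch_size=None off the first loader while the
# user's original drop_last=True is still applied reproduces the ValueError.
try:
    validate(batch_size=None, drop_last=True)
except ValueError as err:
    print(err)
```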

To fix this, we avoid the double re-creation and instead monkey-patch the function so that poptorch.DataLoader is used directly. This required splitting replace_sampler into two static methods to avoid code duplication.

Even though monkey-patching like this is hacky, this solution was chosen because the IPU plugin is highly experimental and subject to change, so we did not want to add an abstraction to the Accelerator/TrainingTypePlugin to support it.

Finally, some other bug fixes were required to pass the correct RunningStage around.

Does your PR introduce any breaking changes? If yes, please list them.

Accelerator API:

  • The on_reset_{train,val,test,predict}_dataloader hooks have been removed, as they are no longer necessary: after this PR we no longer process the dataloader twice.

Trainer.request_dataloader now takes a RunningStage instead of a str. It is unclear how public this function is.
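For callers, the signature change means passing an enum member instead of a plain string. A minimal sketch, using an illustrative stand-in for Lightning's RunningStage enum (the member names and `request_dataloader` body here are assumptions for demonstration):

```python
from enum import Enum

class RunningStage(Enum):
    """Illustrative stand-in for pytorch_lightning's RunningStage."""
    TRAINING = "train"
    VALIDATING = "validate"
    TESTING = "test"
    PREDICTING = "predict"

def request_dataloader(stage: RunningStage) -> str:
    # After this PR the stage is an enum member, not a plain string.
    if not isinstance(stage, RunningStage):
        raise TypeError(f"expected RunningStage, got {type(stage).__name__}")
    return f"{stage.value}_dataloader"

# Before: request_dataloader("test") — after: pass the enum member.
print(request_dataloader(RunningStage.TESTING))  # → test_dataloader
```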

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [n/a] Did you make sure to update the documentation with your changes? (if necessary)
  • [n/a] Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

codecov bot commented Aug 11, 2021

Codecov Report

Merging #8858 (50ebb60) into master (037a86c) will increase coverage by 0%.
The diff coverage is 83%.

@@           Coverage Diff           @@
##           master   #8858    +/-   ##
=======================================
  Coverage      93%     93%            
=======================================
  Files         172     172            
  Lines       14114   14234   +120     
=======================================
+ Hits        13091   13234   +143     
+ Misses       1023    1000    -23     

@carmocca carmocca self-assigned this Aug 11, 2021
@Borda Borda added the ci Continuous Integration label Aug 11, 2021
@carmocca carmocca added this to the v1.5 milestone Aug 14, 2021
@carmocca carmocca changed the title [WIP] Fix IPU CI Replace DataLoader sampler once for IPUs Aug 14, 2021
@carmocca carmocca marked this pull request as ready for review August 14, 2021 02:57
@carmocca carmocca added bug Something isn't working and removed ci Continuous Integration labels Aug 14, 2021
We don't have special tests, and only one IPU machine is available, which is too slow (CPU-wise) for the full test suite.
@mergify mergify bot added the ready PRs ready to be merged label Aug 16, 2021
@tchaton (Contributor) left a comment:

LGTM!

@tchaton tchaton merged commit 93ab24d into master Aug 16, 2021
@tchaton tchaton deleted the ci/ipu-dataloader-fix branch August 16, 2021 09:28
four4fish pushed a commit to four4fish/pytorch-lightning that referenced this pull request Aug 16, 2021
Labels
bug (Something isn't working) · ready (PRs ready to be merged) · refactor
6 participants