CombinedLoader changes sampling in DDP #7013

Closed
lukashermann opened this issue Apr 14, 2021 · 1 comment · Fixed by #7102
Labels: bug, data handling, distributed, help wanted, priority: 2

Comments

@lukashermann

🐛 Bug

The sampling behavior of the validation dataloader changes when you use a CombinedLoader with DDP compared to using a single dataloader. The CombinedLoader does not split and distribute the validation dataset across the GPUs; instead, every GPU receives the full validation set. The problem is resolved when you explicitly pass a DistributedSampler to each dataloader, as sketched below.
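A minimal sketch of that workaround, assuming two validation datasets combined in a dict (the `val_dataset_a`/`val_dataset_b` attributes and the batch size are illustrative, not taken from the gist):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from pytorch_lightning.trainer.supporters import CombinedLoader


def val_dataloader(self):
    # Hypothetical datasets; in the real repro these come from the gist below.
    dataset_a, dataset_b = self.val_dataset_a, self.val_dataset_b

    # Workaround: create the DistributedSampler explicitly so that each rank
    # only sees its own shard. With num_replicas/rank left as None, the
    # sampler reads them from the initialized process group.
    sampler_a = DistributedSampler(dataset_a, shuffle=False)
    sampler_b = DistributedSampler(dataset_b, shuffle=False)

    return CombinedLoader({
        "a": DataLoader(dataset_a, batch_size=2, sampler=sampler_a),
        "b": DataLoader(dataset_b, batch_size=2, sampler=sampler_b),
    })
```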

Please reproduce using the BoringModel

https://gist.github.com/lukashermann/b19964ba32c9bde241be3e54deea01ad

To Reproduce

To reproduce, run the file and check the command-line output.
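For context, a stripped-down, illustrative sketch of the kind of setup that triggers this (not the actual gist; training-related hooks are omitted). Each sample is just its own index, so the print in `validation_step` shows which part of the dataset each GPU receives:

```python
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule
from pytorch_lightning.trainer.supporters import CombinedLoader


class IndexDataset(Dataset):
    """Dataset that returns its own index, so we can see which samples each GPU gets."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return idx


class Model(LightningModule):
    def val_dataloader(self):
        # No explicit sampler: under DDP, Lightning is expected to insert a
        # DistributedSampler into each of these loaders, as it does for a
        # single validation dataloader.
        return CombinedLoader({
            "a": DataLoader(IndexDataset(), batch_size=64),
            "b": DataLoader(IndexDataset(), batch_size=64),
        })

    def validation_step(self, batch, batch_idx):
        # With two GPUs, each rank should print half of the indices 0..63,
        # but with CombinedLoader every rank prints the full list.
        print("device", self.device, batch["a"].tolist())
```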

Expected behavior

Single dataloader:
device cuda:1 [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63]
device cuda:0 [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62]

Combined dataloader:
device cuda:0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]
device cuda:1 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]

Is this behavior intended?

Environment

  • CUDA:
    - GPU:
    - GeForce RTX 2080 Ti
    - GeForce RTX 2080 Ti
    - available: True
    - version: 11.1
  • Packages:
    - numpy: 1.19.2
    - pyTorch_debug: False
    - pyTorch_version: 1.8.0
    - pytorch-lightning: 1.3.0rc1
    - tqdm: 4.53.0
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.8.5
    - version: #78-Ubuntu SMP Fri Mar 19 13:29:52 UTC 2021
@lukashermann added the bug and help wanted labels on Apr 14, 2021
@awaelchli
Contributor

Can we support CombinedLoader in validation directly?
The logic in training vs. validation is quite different here. In validation we get a list of dataloaders and replace the sampler on each of them, but the check at https://github.com/PyTorchLightning/pytorch-lightning/blob/71b4611c64059d7589e4d80115209fd2c89e8bdb/pytorch_lightning/trainer/data_loading.py#L106 only handles DataLoader instances, so a CombinedLoader passes through untouched.
In training, by contrast, we apply the sampler to each loader in the list first, before wrapping them.
@justusschock how do you see this?
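Roughly, the direction could look like this sketch (not the actual Lightning code; `replace_sampler` stands in for whatever the Trainer uses internally to re-create a DataLoader with a DistributedSampler, and the `loaders` attribute access is an assumption about how CombinedLoader stores its wrapped loaders):

```python
from torch.utils.data import DataLoader
from pytorch_lightning.trainer.supporters import CombinedLoader


def prepare_eval_loader(loader, replace_sampler):
    """Apply the sampler replacement to plain DataLoaders and, recursively,
    to every DataLoader wrapped inside a CombinedLoader."""
    if isinstance(loader, CombinedLoader):
        # Assumption: CombinedLoader keeps its wrapped loaders in `loader.loaders`
        # (a dict, a list/tuple, or a single DataLoader depending on construction).
        if isinstance(loader.loaders, dict):
            loader.loaders = {k: prepare_eval_loader(v, replace_sampler)
                              for k, v in loader.loaders.items()}
        elif isinstance(loader.loaders, (list, tuple)):
            loader.loaders = [prepare_eval_loader(v, replace_sampler)
                              for v in loader.loaders]
        else:
            loader.loaders = prepare_eval_loader(loader.loaders, replace_sampler)
        return loader
    if isinstance(loader, DataLoader):
        return replace_sampler(loader)
    return loader
```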
