DDP doesn't work properly with CUDA_VISIBLE_DEVICES #3422

Closed
rebryk opened this issue Sep 9, 2020 · 1 comment · Fixed by #3819
Labels: bug, distributed, help wanted, priority: 0 (high priority task)

Comments

rebryk commented Sep 9, 2020

🐛 Bug

The DDP trainer doesn't work properly when CUDA_VISIBLE_DEVICES is set.

To Reproduce

Steps to reproduce the behavior:

  1. Set CUDA_VISIBLE_DEVICES=1,2.
  2. Run the DDP trainer with 2 GPUs.
  3. The main process uses available_gpus[self.trainer.local_rank], which equals 1.
  4. The second process uses process_idx as its GPU index, which also equals 1.
  5. As a result, both processes run on the same GPU instead of one GPU each.
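The steps above can be modelled without any GPU. The sketch below is a simplified stand-in for the selection logic as this report describes it, not the actual Lightning code; the function names `gpu_index_as_reported` and `gpu_index_proposed` are hypothetical and exist only for illustration.

```python
def gpu_index_as_reported(visible_devices: str, process_idx: int, is_master: bool) -> int:
    """Device index each process picks, as described in the steps above (assumed model)."""
    available_gpus = visible_devices.split(",")
    if is_master:
        # The master reads its index from the CUDA_VISIBLE_DEVICES *values*.
        # Its local rank is 0, so it picks the first listed value.
        return int(available_gpus[0])
    # Spawned processes use their process index directly.
    return process_idx


def gpu_index_proposed(process_idx: int) -> int:
    """Suggested alternative: every process uses its own process index."""
    return process_idx


visible = "1,2"
reported = [
    gpu_index_as_reported(visible, 0, is_master=True),
    gpu_index_as_reported(visible, 1, is_master=False),
]
proposed = [gpu_index_proposed(0), gpu_index_proposed(1)]

print(reported)  # [1, 1] -> both processes target the same device index
print(proposed)  # [0, 1] -> one device index per process
```

With `CUDA_VISIBLE_DEVICES=0,1` the two variants happen to agree, which may be why the problem only shows up when the list does not start at 0.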

Expected behavior

Training should use both GPUs.

Suggestion

Looks like the following if-statement causes the problem:
https://github.com/PyTorchLightning/pytorch-lightning/blob/5b4db52851000d5e4eca8c680d851bcdaafc3a80/pytorch_lightning/accelerators/ddp_backend.py#L204

Why do we use int(available_gpus[self.trainer.local_rank]) instead of simply process_idx?
As far as I understand, the master process should always use GPU 0, which corresponds to the first GPU in the CUDA_VISIBLE_DEVICES list. Please correct me if I am wrong.
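To illustrate the point about GPU 0: when CUDA_VISIBLE_DEVICES is set, CUDA renumbers the visible devices starting from 0, so a process addresses them by position in the list rather than by the listed value. Here is a minimal sketch of that mapping, assuming a comma-separated list of integer IDs; the helper `physical_gpu_for_ordinal` is hypothetical and not part of any library.

```python
def physical_gpu_for_ordinal(visible_devices: str, ordinal: int) -> str:
    """Map a CUDA device ordinal (0-based, as seen by the process)
    to the entry it refers to in CUDA_VISIBLE_DEVICES."""
    ids = [d.strip() for d in visible_devices.split(",") if d.strip()]
    if not 0 <= ordinal < len(ids):
        raise IndexError(f"ordinal {ordinal} is not visible; only {len(ids)} device(s) listed")
    return ids[ordinal]


# With CUDA_VISIBLE_DEVICES=1,2 the process sees two devices, ordinals 0 and 1.
print(physical_gpu_for_ordinal("1,2", 0))  # 1
print(physical_gpu_for_ordinal("1,2", 1))  # 2
```

Under this mapping, using the listed value 1 as a device index selects the second visible device (physical GPU 2), not physical GPU 1.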

rebryk added the bug and help wanted labels on Sep 9, 2020

github-actions bot commented Sep 9, 2020

Hi! Thanks for your contribution! Great first issue!
