DDP doesn't work properly with CUDA_VISIBLE_DEVICES #3422
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 0 (High priority task)
🐛 Bug
The DDP trainer doesn't work properly when CUDA_VISIBLE_DEVICES is set.
To Reproduce
Steps to reproduce the behavior:
1. Set `CUDA_VISIBLE_DEVICES=1,2` and launch DDP training on two GPUs.
2. The master process selects its device from `available_gpus[self.trainer.local_rank]`, which is equal to `1`.
3. The second process selects its device from `process_idx`, which is again equal to `1`, so both processes end up on the same GPU (see the sketch below).
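For illustration, here is a minimal sketch of the index arithmetic described above (the variable names mirror the backend; the single-node, two-process setup is an assumption):

```python
import os

# Restrict training to physical GPUs 1 and 2. CUDA renumbers them, so
# inside each process they appear as cuda:0 and cuda:1.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

available_gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")  # ['1', '2']

# Master process: local_rank 0, device taken from the physical ID list.
master_gpu = int(available_gpus[0])  # -> 1

# Second process: device taken from its process index.
worker_gpu = 1                       # process_idx -> 1

# Both processes select cuda:1, so one visible GPU sits idle.
assert master_gpu == worker_gpu
print(f"master -> cuda:{master_gpu}, worker -> cuda:{worker_gpu}")
```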
Expected behavior
Training should use both GPUs.
Suggestion
Looks like the following if-statement causes the problem:
https://github.com/PyTorchLightning/pytorch-lightning/blob/5b4db52851000d5e4eca8c680d851bcdaafc3a80/pytorch_lightning/accelerators/ddp_backend.py#L204
Why do we use `int(available_gpus[self.trainer.local_rank])` instead of simply `process_idx`? As far as I understand, the master process should always use GPU 0, which is the first GPU in the `CUDA_VISIBLE_DEVICES` list. Please correct me if I am wrong.
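For reference, a sketch of the suggested change under the assumptions above (the helper name `select_gpu_idx` is hypothetical, not Lightning's API; the current behavior is shown commented out):

```python
import os
import torch

def select_gpu_idx(process_idx: int, local_rank: int) -> int:
    """Pick the GPU index for one DDP process (illustrative sketch only)."""
    available_gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")

    # Current behavior for the master process: index into the *physical*
    # IDs, which CUDA has already renumbered to 0..len(available_gpus)-1.
    # gpu_idx = int(available_gpus[local_rank])  # -> 1 for CUDA_VISIBLE_DEVICES=1,2

    # Suggested behavior: process_idx already enumerates the visible
    # devices from 0, so the master process lands on the first visible GPU.
    return process_idx

# Example: the master process (process_idx=0, local_rank=0) gets cuda:0,
# i.e. physical GPU 1 when CUDA_VISIBLE_DEVICES=1,2.
if torch.cuda.is_available():
    torch.cuda.set_device(select_gpu_idx(process_idx=0, local_rank=0))
```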