DDP doesn't work properly with CUDA_VISIBLE_DEVICES #3422
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 0 (High priority task)
🐛 Bug
The DDP trainer doesn't work properly when CUDA_VISIBLE_DEVICES is set.
To Reproduce
Steps to reproduce the behavior:
1. Set `CUDA_VISIBLE_DEVICES=1,2` and launch DDP training on two GPUs.
2. The master process selects its device from `available_gpus[self.trainer.local_rank]`, which is equal to `1`.
3. The second process selects its device from `process_idx`, which is again equal to `1`, so both processes end up on the same GPU (see the sketch below).
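For illustration, here is a minimal sketch of the index arithmetic described above (the variable names mirror the backend; the single-node, two-process setup is an assumption):

```python
import os

# Restrict training to physical GPUs 1 and 2. CUDA renumbers them, so
# inside each process they appear as cuda:0 and cuda:1.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

available_gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")  # ['1', '2']

# Master process: local_rank 0, device taken from the physical ID list.
master_gpu = int(available_gpus[0])  # -> 1

# Second process: device taken from its process index.
worker_gpu = 1                       # process_idx -> 1

# Both processes select cuda:1, so one visible GPU sits idle.
assert master_gpu == worker_gpu
print(f"master -> cuda:{master_gpu}, worker -> cuda:{worker_gpu}")
```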
Expected behavior
Training should use both GPUs.
Suggestion
Looks like the following if-statement causes the problem:
https://github.com/PyTorchLightning/pytorch-lightning/blob/5b4db52851000d5e4eca8c680d851bcdaafc3a80/pytorch_lightning/accelerators/ddp_backend.py#L204
Why do we use `int(available_gpus[self.trainer.local_rank])` instead of simply `process_idx`? As far as I understand, the master process should always use GPU 0, which is the first GPU in the `CUDA_VISIBLE_DEVICES` list. Please correct me if I am wrong.
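For reference, a sketch of the suggested change under the assumptions above (the helper name `select_gpu_idx` is hypothetical, not Lightning's API; the current behavior is shown commented out):

```python
import os
import torch

def select_gpu_idx(process_idx: int, local_rank: int) -> int:
    """Pick the GPU index for one DDP process (illustrative sketch only)."""
    available_gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")

    # Current behavior for the master process: index into the *physical*
    # IDs, which CUDA has already renumbered to 0..len(available_gpus)-1.
    # gpu_idx = int(available_gpus[local_rank])  # -> 1 for CUDA_VISIBLE_DEVICES=1,2

    # Suggested behavior: process_idx already enumerates the visible
    # devices from 0, so the master process lands on the first visible GPU.
    return process_idx

# Example: the master process (process_idx=0, local_rank=0) gets cuda:0,
# i.e. physical GPU 1 when CUDA_VISIBLE_DEVICES=1,2.
if torch.cuda.is_available():
    torch.cuda.set_device(select_gpu_idx(process_idx=0, local_rank=0))
```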