DDP: Multiple processes try to create the logger directory tree #6364
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 1 (Medium priority task)
🐛 Bug
A user from our supercomputing center ran into an issue which I think turned out to be a bug in PyTorch Lightning.
When using the DDP accelerator together with a logger, multiple processes try to create the logger directory tree, causing errors about already existing directories or files.
Troubleshooting
PyTorch Lightning makes extensive use of the rank_zero_only function to ensure that some actions are performed only by the process with rank 0:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/utilities/distributed.py#L35-L42
The default value of rank_zero_only.rank is set there:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/utilities/distributed.py#L45-L46
but it can also be set from other modules, for example, in our case, by the DDP plugin:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/plugins/training_type/ddp.py#L227-L228
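For context, the pattern behind those two snippets is roughly the following (my paraphrase of the linked file, not a verbatim copy):

```python
import os
from functools import wraps


def rank_zero_only(fn):
    """Run the wrapped function only on the process with rank 0."""
    @wraps(fn)
    def wrapped_fn(*args, **kwargs):
        if rank_zero_only.rank == 0:
            return fn(*args, **kwargs)
    return wrapped_fn


# Default: take the rank from the environment, without overwriting it if it was
# already set; plugins such as DDP are expected to assign the real global rank later.
rank_zero_only.rank = getattr(rank_zero_only, "rank", int(os.environ.get("LOCAL_RANK", 0)))
```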
Unfortunately, it seems that the initialization done by the DDP plugin happens too late, I think because of commit da6dbc8:
self.setup_trainer(model) gets called on line 467, effectively initializing the logger and creating the logger directory tree, while rank_zero_only.rank only gets its correct value at line 477, when self.training_type_plugin.pre_training() is called.
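As a stopgap on the user side, I think one can set the rank manually before the Trainer is constructed, so that only rank 0 creates the logger directory tree. A minimal, untested sketch, assuming the launcher exposes the global rank through SLURM_PROCID (or LOCAL_RANK outside Slurm):

```python
import os

from pytorch_lightning.utilities.distributed import rank_zero_only

# Workaround sketch (assumption, not a confirmed fix): force rank_zero_only.rank
# to the real global rank before the Trainer, and therefore the logger, is created.
# SLURM_PROCID is an assumption about the launcher; adapt to your environment.
rank_zero_only.rank = int(os.environ.get("SLURM_PROCID", os.environ.get("LOCAL_RANK", 0)))
```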
To Reproduce
I have attached the code the user provided, together with the Slurm script: only_rank_zero.tar.gz.
I understand that you would prefer a BoringModel- and Colab-based reproducer, but I am from the HPC world and I am not used to those. Let me know if I can help in any other way; I hope that my own digging into the code helps.
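For what it's worth, here is a rough, untested sketch of what a BoringModel-style reproducer might look like; it assumes a node with at least two GPUs and is my guess at the user's setup, not their exact script:

```python
import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    trainer = Trainer(
        default_root_dir="./lightning_logs",  # every rank tries to create this tree
        accelerator="ddp",
        gpus=2,
        max_epochs=1,
    )
    trainer.fit(model, DataLoader(RandomDataset(32, 64), batch_size=2))
```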
Environment (probably not relevant in this case)
I installed PyTorch Lightning with conda and tried the latest version available on conda, but I also tested installing the current master branch from source, and the behavior is still the same.