DDP: Multiple processes try to create the logger directory tree #6364
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 1 (Medium priority task)
🐛 Bug
A user from our supercomputing center ran into an issue which I think turned out to be a bug in PyTorch Lightning.
When using the DDP accelerator together with a logger, multiple processes try to create the logger directory tree, causing errors about already existing directories or files.
Troubleshooting
PyTorch Lightning makes extensive use of the rank_zero_only function to ensure that some actions are performed only by the process with rank 0:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/utilities/distributed.py#L35-L42
The default value of rank_zero_only.rank is set there:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/utilities/distributed.py#L45-L46
but it can also be set from other modules, for example, in our case, by the DDP plugin:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/plugins/training_type/ddp.py#L227-L228
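For context, the pattern behind those two snippets is roughly the following (my paraphrase of the linked file, not a verbatim copy):

```python
import os
from functools import wraps


def rank_zero_only(fn):
    """Run the wrapped function only on the process with rank 0."""
    @wraps(fn)
    def wrapped_fn(*args, **kwargs):
        if rank_zero_only.rank == 0:
            return fn(*args, **kwargs)
    return wrapped_fn


# Default: take the rank from the environment, without overwriting it if it was
# already set; plugins such as DDP are expected to assign the real global rank later.
rank_zero_only.rank = getattr(rank_zero_only, "rank", int(os.environ.get("LOCAL_RANK", 0)))
```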
Unfortunately, it seems that the initialization done by the DDP plugin happens too late, I think because of commit da6dbc8:
self.setup_trainer(model) gets called on line 467, effectively initializing the logger and creating the logger directory tree, while rank_zero_only.rank only gets its correct value at line 477, when self.training_type_plugin.pre_training() is called.
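As a stopgap on the user side, I think one can set the rank manually before the Trainer is constructed, so that only rank 0 creates the logger directory tree. A minimal, untested sketch, assuming the launcher exposes the global rank through SLURM_PROCID (or LOCAL_RANK outside Slurm):

```python
import os

from pytorch_lightning.utilities.distributed import rank_zero_only

# Workaround sketch (assumption, not a confirmed fix): force rank_zero_only.rank
# to the real global rank before the Trainer, and therefore the logger, is created.
# SLURM_PROCID is an assumption about the launcher; adapt to your environment.
rank_zero_only.rank = int(os.environ.get("SLURM_PROCID", os.environ.get("LOCAL_RANK", 0)))
```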
To Reproduce
I have attached the code the user provided, together with the Slurm script: only_rank_zero.tar.gz.
I understand that you would prefer a BoringModel- and Colab-based reproducer, but I am from the HPC world and I am not used to those. Let me know if I can help in any other way; I hope that my own digging into the code helps.
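For what it's worth, here is a rough, untested sketch of what a BoringModel-style reproducer might look like; it assumes a node with at least two GPUs and is my guess at the user's setup, not their exact script:

```python
import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    trainer = Trainer(
        default_root_dir="./lightning_logs",  # every rank tries to create this tree
        accelerator="ddp",
        gpus=2,
        max_epochs=1,
    )
    trainer.fit(model, DataLoader(RandomDataset(32, 64), batch_size=2))
```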
Environment (probably not relevant in this case)
I installed PyTorch Lightning with conda and tried the latest version available on conda, but I also tested installing the current master branch from source, and the behavior is still the same.