
DDP: Multiple processes try to create the logger directory tree #6364

Closed
RemiLacroix-IDRIS opened this issue Mar 5, 2021 · 2 comments · Fixed by #6380
Assignees
Labels
bug Something isn't working distributed Generic distributed-related topic help wanted Open to be worked on priority: 1 Medium priority task

Comments

@RemiLacroix-IDRIS

🐛 Bug

A user from our supercomputing center ran into an issue which I think turned out to be a bug in PyTorch-Lightning.

When using the DDP accelerator together with a logger, multiple processes try to create the logger directory tree, causing errors about already existing directories or files.
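The failure mode can be reproduced in isolation. A minimal sketch (the directory layout here is illustrative, not Lightning's actual one):

```python
import os
import tempfile

# Minimal illustration of the race: several DDP processes call makedirs on
# the same logger directory tree; whichever process gets there later crashes.
base = tempfile.mkdtemp()
log_dir = os.path.join(base, "lightning_logs", "version_0")

os.makedirs(log_dir)        # the first process succeeds
try:
    os.makedirs(log_dir)    # any later process hits the already-existing tree
except FileExistsError:
    print("FileExistsError: the error the non-zero-rank processes crash with")
```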

Troubleshooting

PyTorch-Lightning makes extensive use of the rank_zero_only decorator to ensure that some actions are performed only by the process with rank 0:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/utilities/distributed.py#L35-L42
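For reference, the helper behind that link looks roughly like the following (a simplified sketch, not the verbatim source):

```python
from functools import wraps

def rank_zero_only(fn):
    """Run the wrapped function only on the process with rank 0."""
    @wraps(fn)
    def wrapped_fn(*args, **kwargs):
        if rank_zero_only.rank == 0:
            return fn(*args, **kwargs)
    return wrapped_fn

# The guard reads a function attribute, so whichever module assigns
# rank_zero_only.rank (and when it does so) decides which process acts.
rank_zero_only.rank = 0

@rank_zero_only
def say_hello():
    return "hello from rank 0"

print(say_hello())          # runs: rank is 0
rank_zero_only.rank = 1
print(say_hello())          # prints None: the call is skipped
```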

The default value of rank_zero_only.rank is set here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/utilities/distributed.py#L45-L46
but it can be overridden by other modules, in our case the DDP plugin:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b3b8f95e2a1ac040f6ff8f848542a1e5a27edfee/pytorch_lightning/plugins/training_type/ddp.py#L227-L228

Unfortunately, it seems that the initialization done by the DDP plugin happens too late, I believe because of commit da6dbc8:

  • self.setup_trainer(model) is called on line 467, effectively initializing the logger and creating the logger directory tree.
  • DDP initialization, and thus the assignment of the correct value to rank_zero_only.rank, only happens later, on line 477, when self.training_type_plugin.pre_training() is called.
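The consequence of that ordering can be shown in a single process. This is a hedged sketch; init_logger_dirs is a made-up stand-in for the logger setup, not a Lightning API:

```python
from functools import wraps

def rank_zero_only(fn):
    # Simplified version of the rank-0 guard discussed above.
    @wraps(fn)
    def wrapped_fn(*args, **kwargs):
        if rank_zero_only.rank == 0:
            return fn(*args, **kwargs)
    return wrapped_fn

rank_zero_only.rank = 0     # the default every process starts with

created = []

@rank_zero_only
def init_logger_dirs():     # stand-in for the logger creating its directories
    created.append("lightning_logs/version_0")

# Buggy ordering: the guarded call runs while every process still sees rank 0.
init_logger_dirs()          # executes on ALL ranks -> directory-creation race
rank_zero_only.rank = 3     # DDP assigns the real rank only afterwards
init_logger_dirs()          # correctly skipped from now on

print(created)              # a non-zero rank still created the tree once
```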

To Reproduce

I have attached the code the user provided together with the Slurm script: only_rank_zero.tar.gz.

I understand that you would prefer a BoringModel- and Colab-based reproducer, but I am from the HPC world and I am not used to those. Let me know if I can help in any other way. I hope that my own digging into the code will help.

Environment (probably not relevant in this case)

  • PyTorch Version: 1.7.1
  • OS: Linux (Red Hat 8.1)
  • How you installed PyTorch: conda. I tried the latest version of PyTorch-Lightning available on conda, and also tested installing the current master branch from source; the behavior is the same.
  • Python version: 3.7.10
  • CUDA/cuDNN version: 11.0.221/8.0.5
  • GPU models and configuration: NVIDIA V100
@RemiLacroix-IDRIS RemiLacroix-IDRIS added bug Something isn't working help wanted Open to be worked on labels Mar 5, 2021
@tchaton tchaton added the priority: 1 Medium priority task label Mar 5, 2021
@awaelchli awaelchli added the distributed Generic distributed-related topic label Mar 5, 2021
@awaelchli awaelchli self-assigned this Mar 5, 2021
@awaelchli
Contributor

@RemiLacroix-IDRIS thanks for the report. I can confirm this issue with Lightning versions >= 1.2.
Your observations are correct.
I prepared a fix, #6380, which delays the initial calls to the logger until after the rank information is correctly available.

If you wish to test the fix you can install from my branch like this:
pip install git+https://github.com/PyTorchLightning/pytorch-lightning@bugfix/logger-init

@RemiLacroix-IDRIS
Author

Thanks for the fix @awaelchli!
