
Code freezes before validation sanity check when using DDP #7336

Closed
notprime opened this issue May 3, 2021 · 3 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on), working as intended

Comments

@notprime

notprime commented May 3, 2021

🐛 Bug

Greetings from Italy!
I recently moved to PyTorch, and a friend of mine introduced me to PL.
I'm coding an autoencoder (whose architecture is still pretty simple) with a custom loss function
that works on the hidden layer's output. The link below leads to the GitHub repo:

https://github.com/notprime/custom_autoencoder/blob/main/autoenc_torch.ipynb

I read the documentation on multi-GPU training, so I used 'ddp' as the accelerator
and gpus = -1 to select all the GPUs.
However, when I launch the script, the code freezes here:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

I waited 10-15 minutes, but nothing happened.
If I use 'dp' as the accelerator instead, everything works fine and the script doesn't freeze.
The documentation says ddp is preferred over dp because it's faster:
is there something I did wrong? I really don't know why the code gets stuck when I use ddp!

Thanks in advance!

  • PyTorch Version: 1.8.1
  • OS: Ubuntu 18.04
  • How you installed PyTorch: 'conda'
  • Python version: 3.8
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: 4 x TITAN Xp 12GB
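For reference, the Trainer setup described above would look roughly like the following sketch (argument names follow the Lightning 1.x API in use at the time; this is a reconstruction from the report and its log output, not the reporter's exact code):

```python
import pytorch_lightning as pl

# Settings as described in the report:
# all visible GPUs, DDP accelerator, native 16-bit precision
# (matching "Using native 16bit precision." in the log).
trainer = pl.Trainer(
    gpus=-1,            # use every visible GPU
    accelerator="ddp",  # the setting that hangs in this report
    precision=16,
)
```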
@notprime notprime added bug Something isn't working help wanted Open to be worked on labels May 3, 2021
@notprime notprime changed the title Code stuck before validation sanity check when using DDP Code freezes before validation sanity check when using DDP May 3, 2021
@awaelchli
Contributor

awaelchli commented May 3, 2021

Hi,

You can't use the ddp accelerator in notebooks. Use ddp_spawn or dp for multi-GPU training.
We recently added a check so the user gets informed that they are using an unsupported accelerator in the notebook: #5970

Hope this helps.
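Context for readers: the script-launch ddp mode re-executes the running Python script to create one process per GPU, which cannot work inside a notebook kernel (there is no script file to relaunch); ddp_spawn instead spawns workers from the running interpreter. A minimal, hypothetical sketch of a guard in this spirit (not Lightning's actual code from #5970) could look like:

```python
def in_interactive_session():
    """Heuristic: detect an IPython/Jupyter session, where re-launching
    the training script (as script-launch DDP does) cannot work."""
    try:
        get_ipython  # noqa: F821 -- this name exists only inside IPython
        return True
    except NameError:
        return False


def pick_multi_gpu_accelerator(requested):
    # Hypothetical helper: refuse script-launch "ddp" in notebooks and
    # point the user at "ddp_spawn" or "dp" instead.
    if requested == "ddp" and in_interactive_session():
        raise RuntimeError(
            "'ddp' is not supported in interactive environments; "
            "use 'ddp_spawn' or 'dp' instead."
        )
    return requested
```

In a plain script (no IPython), the guard passes 'ddp' through unchanged; inside a notebook it raises instead of silently hanging.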

@awaelchli awaelchli added the working as intended Working as intended label May 3, 2021
@notprime
Author

notprime commented May 8, 2021

@awaelchli I tried to run the code again on PyCharm, using accelerator = 'ddp' and gpus = 2,
this time the code freezes here:


COMET INFO: Experiment is live on comet.ml link

CometLogger will be initialized in offline mode
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
COMET INFO: Experiment is live on comet.ml link

CometLogger will be initialized in offline mode
Using native 16bit precision.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : link
COMET INFO:   Uploads:
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     installed packages  : 1
COMET INFO:     os packages         : 1
COMET INFO:     source_code         : 1
COMET INFO: ---------------------------
COMET WARNING: Empty mapping given to log_params({}); ignoring

  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 1.3 M 
1 | decoder | Sequential | 1.3 M 
---------------------------------------
2.6 M     Trainable params
0         Non-trainable params
2.6 M     Total params
10.524    Total estimated model params size (MB)

The same thing happens if I use gpus=-1 to use all 4 GPUs.
Also, if I run nvidia-smi in a terminal, only the first GPU shows any activity, while the other 3 sit idle:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:05:00.0  On |                  N/A |
| 31%   50C    P2    83W / 250W |   1114MiB / 12194MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:06:00.0 Off |                  N/A |
| 24%   47C    P2    60W / 250W |     37MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:09:00.0 Off |                  N/A |
| 24%   46C    P2    64W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   40C    P2    63W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

EDIT: if I use 'dp' as the accelerator, everything works. Maybe 'ddp' isn't supported by PyCharm either?
What should I use that supports 'ddp'?

EDIT 2: I also tried to launch the script from the terminal; the code still freezes.

@awaelchli
Contributor

awaelchli commented May 15, 2021

Is it maybe because of Comet? Have you tried turning off the logger? Not sure what's going on.
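For anyone trying this suggestion: disabling the logger is a one-line change on the Trainer. A config sketch using the standard logger=False switch (not the reporter's exact code):

```python
import pytorch_lightning as pl

# Rule out the CometLogger as the cause of the hang by disabling logging:
trainer = pl.Trainer(gpus=2, accelerator="ddp", logger=False)
```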

2 participants