
Code freezes before validation sanity check when using DDP #7336

Closed
notprime opened this issue May 3, 2021 · 3 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on), working as intended

Comments

@notprime

notprime commented May 3, 2021

🐛 Bug

Greetings from Italy!
I recently moved to PyTorch, and a friend of mine introduced me to PL.
I'm coding an autoencoder (whose architecture is still pretty simple) with a custom loss function
that works on the hidden layer's output. The link below leads to the GitHub repo:

https://github.com/notprime/custom_autoencoder/blob/main/autoenc_torch.ipynb

I read the documentation on multi-GPU training, so I used 'ddp' as the accelerator
and gpus = -1 to select all the GPUs.
However, when I launch the script, the code freezes here:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

I waited 10-15 minutes, but nothing happened.
If I use 'dp' as the accelerator instead, everything works fine and the script doesn't freeze.
The documentation says ddp is preferred over dp because it's faster:
is there something I did wrong? I really don't know why the code gets stuck when I use ddp!

Thanks in advance!

  • PyTorch Version: 1.8.1
  • OS: Ubuntu 18.04
  • How you installed PyTorch: 'conda'
  • Python version: 3.8
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: 4 x TITAN Xp 12GB
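For reference, the Trainer setup described above would look roughly like the following sketch (argument names follow the Lightning 1.x API in use at the time; this is a reconstruction from the report and its log output, not the reporter's exact code):

```python
import pytorch_lightning as pl

# Settings as described in the report:
# all visible GPUs, DDP accelerator, native 16-bit precision
# (matching "Using native 16bit precision." in the log).
trainer = pl.Trainer(
    gpus=-1,            # use every visible GPU
    accelerator="ddp",  # the setting that hangs in this report
    precision=16,
)
```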
@notprime notprime added bug Something isn't working help wanted Open to be worked on labels May 3, 2021
@notprime notprime changed the title Code stuck before validation sanity check when using DDP Code freezes before validation sanity check when using DDP May 3, 2021
@awaelchli
Contributor

awaelchli commented May 3, 2021

Hi,

You can't use the ddp accelerator in notebooks. Use ddp_spawn or dp for multi-GPU training.
We recently added a check so the user gets informed that they are using an unsupported accelerator in the notebook: #5970

Hope this helps.
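Context for readers: the script-launch ddp mode re-executes the running Python script to create one process per GPU, which cannot work inside a notebook kernel (there is no script file to relaunch); ddp_spawn instead spawns workers from the running interpreter. A minimal, hypothetical sketch of a guard in this spirit (not Lightning's actual code from #5970) could look like:

```python
def in_interactive_session():
    """Heuristic: detect an IPython/Jupyter session, where re-launching
    the training script (as script-launch DDP does) cannot work."""
    try:
        get_ipython  # noqa: F821 -- this name exists only inside IPython
        return True
    except NameError:
        return False


def pick_multi_gpu_accelerator(requested):
    # Hypothetical helper: refuse script-launch "ddp" in notebooks and
    # point the user at "ddp_spawn" or "dp" instead.
    if requested == "ddp" and in_interactive_session():
        raise RuntimeError(
            "'ddp' is not supported in interactive environments; "
            "use 'ddp_spawn' or 'dp' instead."
        )
    return requested
```

In a plain script (no IPython), the guard passes 'ddp' through unchanged; inside a notebook it raises instead of silently hanging.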

@awaelchli awaelchli added the working as intended Working as intended label May 3, 2021
@notprime
Author

notprime commented May 8, 2021

@awaelchli I tried to run the code again on PyCharm, using accelerator = 'ddp' and gpus = 2,
this time the code freezes here:


COMET INFO: Experiment is live on comet.ml link

CometLogger will be initialized in offline mode
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
COMET INFO: Experiment is live on comet.ml link

CometLogger will be initialized in offline mode
Using native 16bit precision.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : link
COMET INFO:   Uploads:
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     installed packages  : 1
COMET INFO:     os packages         : 1
COMET INFO:     source_code         : 1
COMET INFO: ---------------------------
COMET WARNING: Empty mapping given to log_params({}); ignoring

  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 1.3 M 
1 | decoder | Sequential | 1.3 M 
---------------------------------------
2.6 M     Trainable params
0         Non-trainable params
2.6 M     Total params
10.524    Total estimated model params size (MB)

The same thing happens if I use gpus=-1 to use all 4 GPUs.
Also, if I run nvidia-smi in a terminal, only the first GPU shows any activity, while the other 3 sit idle:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:05:00.0  On |                  N/A |
| 31%   50C    P2    83W / 250W |   1114MiB / 12194MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:06:00.0 Off |                  N/A |
| 24%   47C    P2    60W / 250W |     37MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:09:00.0 Off |                  N/A |
| 24%   46C    P2    64W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   40C    P2    63W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

EDIT: if I use 'dp' as the accelerator, everything works. Maybe 'ddp' isn't supported by PyCharm either?
What should I use that supports 'ddp'?

EDIT 2: I also tried to launch the script from the terminal; the code still freezes.

@awaelchli
Contributor

awaelchli commented May 15, 2021

Is it maybe because of Comet? Have you tried turning off the logger? Not sure what's going on.
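For anyone trying this suggestion: disabling the logger is a one-line change on the Trainer. A config sketch using the standard logger=False switch (not the reporter's exact code):

```python
import pytorch_lightning as pl

# Rule out the CometLogger as the cause of the hang by disabling logging:
trainer = pl.Trainer(gpus=2, accelerator="ddp", logger=False)
```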

2 participants