
Code freezes when using DDP in terminal or PyCharm #7454

Closed
notprime opened this issue May 9, 2021 · 6 comments
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 1 (Medium priority task), waiting on author (Waiting on user action, correction, or update)

Comments

notprime commented May 9, 2021

🐛 Bug

Hello again!
A few days ago I opened this issue:

Code freezes before validation sanity check when using DDP

Basically, DDP wasn't working, and this turned out to be because Jupyter Notebook cannot use ddp as an accelerator.
So, a few days later, I tried to re-run my script in PyCharm first, then in the terminal (I only made some changes, like switching to MADGRAD as the optimizer, nothing more).
Even there I can't use DDP. I tried both gpus=2 and gpus=-1. This time the code freezes here:

COMET INFO: Experiment is live on comet.ml link

CometLogger will be initialized in offline mode
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
COMET INFO: Experiment is live on comet.ml link

CometLogger will be initialized in offline mode
Using native 16bit precision.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : link
COMET INFO:   Uploads:
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     installed packages  : 1
COMET INFO:     os packages         : 1
COMET INFO:     source_code         : 1
COMET INFO: ---------------------------
COMET WARNING: Empty mapping given to log_params({}); ignoring

  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 1.3 M 
1 | decoder | Sequential | 1.3 M 
---------------------------------------
2.6 M     Trainable params
0         Non-trainable params
2.6 M     Total params
10.524    Total estimated model params size (MB)

Even though it says initializing ddp, only the first GPU is ON; the others are OFF:


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:05:00.0  On |                  N/A |
| 31%   50C    P2    83W / 250W |   1114MiB / 12194MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:06:00.0 Off |                  N/A |
| 24%   47C    P2    60W / 250W |     37MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:09:00.0 Off |                  N/A |
| 24%   46C    P2    64W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   40C    P2    63W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Do you know what the problem could be?
The network isn't that big, so I could get by with a single GPU, or I could use dp or ddp_spawn, but those are not recommended.

  • PyTorch Version: 1.8.1
  • OS: Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): conda
  • Python version: 3.8
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: 4 x TITAN Xp 12GB
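
For reference, a minimal NCCL sanity check (a debugging sketch, not part of the original report; the 2-process world size, localhost rendezvous address, and port are assumptions) exercises the same process-group setup Lightning performs for ddp. If this also hangs at all_reduce, the problem sits below Lightning, in NCCL or the GPU topology:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank: int, world_size: int) -> None:
    # nccl is the backend Lightning's ddp accelerator uses on GPUs
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=rank)
    dist.all_reduce(t)  # hangs here if inter-GPU communication is broken
    print(f"rank {rank}: all_reduce -> {t.item()}")  # expect world_size
    dist.destroy_process_group()

if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # assumption: single machine
    os.environ["MASTER_PORT"] = "29500"      # assumption: free port
    mp.spawn(run, args=(2,), nprocs=2)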
notprime added the bug and help wanted labels May 9, 2021
edenlightning added the distributed and priority: 0 labels May 9, 2021
Borda (Member) commented May 9, 2021

@notprime can you replicate it outside PyCharm, in a regular terminal?
I suspect the PyCharm terminal uses IPython by default...
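
A quick check for that hypothesis (a sketch; it assumes being dropped at the top of the training script) shows which interpreter is actually executing:

import sys
print(sys.executable)            # which Python binary PyCharm launches
try:
    get_ipython()                # defined only inside IPython/Jupyter
    print("running under IPython")
except NameError:
    print("plain Python interpreter")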

Borda added the information needed and priority: 1 labels and removed the priority: 0 label May 9, 2021
notprime closed this as completed May 9, 2021
notprime reopened this May 9, 2021
notprime (Author) commented May 9, 2021

@Borda I get the same error even if I run the script in a regular terminal; 3 GPUs are still off.

Lightning-AI deleted a comment from notprime May 9, 2021
Borda (Member) commented May 9, 2021

Can you please tell us what is different here compared to #7336, which was evaluated as working as intended?

notprime (Author) commented May 9, 2021

@Borda there are no real differences; I just used a different optimizer.
The problem in #7336 was related to Jupyter Notebook, which doesn't support ddp (the ddp accelerator re-launches the running script as one subprocess per GPU, which isn't possible from a notebook) but only dp and ddp_spawn.
So I tried to run the script from PyCharm and from the terminal with accelerator = 'ddp', but apparently I still can't use ddp.
These are the parameters of my pl.Trainer:

trainer = pl.Trainer(
    max_epochs=EPOCHS,
    gpus=-1,
    accelerator='ddp',
    logger=[comet_logger, tblogger],
    log_every_n_steps=steps,
    precision=16,
)

If I use dp, all the GPUs are used; ddp just isn't working for me.
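
For reference, a fallback sketch with the same settings but the accelerator swapped to ddp_spawn (whether it avoids this particular hang is untested here). ddp_spawn forks worker processes via torch.multiprocessing instead of re-launching the script, at the cost of the usual pickling restrictions:

trainer = pl.Trainer(
    max_epochs=EPOCHS,
    gpus=-1,
    accelerator='ddp_spawn',  # forks workers; works where script re-launch doesn't
    logger=[comet_logger, tblogger],
    log_every_n_steps=steps,
    precision=16,
)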

Borda (Member) commented May 25, 2021

So you are saying that running the same script (python my-train.py) in the system terminal works fine, but doing the same in the PyCharm terminal hangs?

edenlightning added this to the v1.3.x milestone Jul 1, 2021
Borda added the waiting on author label and removed the information needed label Jul 6, 2021
Borda modified the milestones: v1.3.x, v1.4 Jul 6, 2021
edenlightning modified the milestones: v1.4, v1.3.x, v1.4.x Jul 6, 2021
Borda (Member) commented Jul 19, 2021

OK, feel free to re-open if you still have this kind of issue 🐰

Borda closed this as completed Jul 19, 2021