
Code freezes when using DDP in terminal or PyCharm #7454

Closed
notprime opened this issue May 9, 2021 · 6 comments
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 1 (Medium priority task), waiting on author (Waiting on user action, correction, or update)

Comments

notprime commented May 9, 2021

🐛 Bug

Hello again!
A few days ago I opened this issue:

Code freezes before validation sanity check when using DDP

Basically, DDP wasn't working, and this turned out to be because Jupyter Notebook cannot use ddp as an accelerator.
So, a few days later, I tried to re-run my script in PyCharm first, then in the terminal (I only made some changes, like switching to MADGRAD as the optimizer, nothing more).
Even there I can't use DDP. I tried both gpus=2 and gpus=-1. This time the code freezes here:

COMET INFO: Experiment is live on comet.ml link

CometLogger will be initialized in offline mode
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
COMET INFO: Experiment is live on comet.ml link

CometLogger will be initialized in offline mode
Using native 16bit precision.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : link
COMET INFO:   Uploads:
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     installed packages  : 1
COMET INFO:     os packages         : 1
COMET INFO:     source_code         : 1
COMET INFO: ---------------------------
COMET WARNING: Empty mapping given to log_params({}); ignoring

  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 1.3 M 
1 | decoder | Sequential | 1.3 M 
---------------------------------------
2.6 M     Trainable params
0         Non-trainable params
2.6 M     Total params
10.524    Total estimated model params size (MB)

Even though it says initializing ddp, only the first GPU is ON; the others are OFF:


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:05:00.0  On |                  N/A |
| 31%   50C    P2    83W / 250W |   1114MiB / 12194MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:06:00.0 Off |                  N/A |
| 24%   47C    P2    60W / 250W |     37MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:09:00.0 Off |                  N/A |
| 24%   46C    P2    64W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   40C    P2    63W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Do you know what the problem could be?
The network isn't that big, so I could get by with a single GPU, or I could use dp or ddp_spawn, but those are not recommended.

  • PyTorch Version: 1.8.1
  • OS: Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): conda
  • Python version: 3.8
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: 4 x TITAN Xp 12GB
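
For reference, a minimal NCCL sanity check (a debugging sketch, not part of the original report; the 2-process world size, localhost rendezvous address, and port are assumptions) exercises the same process-group setup Lightning performs for ddp. If this also hangs at all_reduce, the problem sits below Lightning, in NCCL or the GPU topology:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank: int, world_size: int) -> None:
    # nccl is the backend Lightning's ddp accelerator uses on GPUs
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=rank)
    dist.all_reduce(t)  # hangs here if inter-GPU communication is broken
    print(f"rank {rank}: all_reduce -> {t.item()}")  # expect world_size
    dist.destroy_process_group()

if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # assumption: single machine
    os.environ["MASTER_PORT"] = "29500"      # assumption: free port
    mp.spawn(run, args=(2,), nprocs=2)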
notprime added the bug and help wanted labels May 9, 2021
edenlightning added the distributed and priority: 0 labels May 9, 2021
Borda (Member) commented May 9, 2021

@notprime can you replicate it outside PyCharm, in a regular terminal?
I suspect the PyCharm terminal uses IPython by default...
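
A quick check for that hypothesis (a sketch; it assumes being dropped at the top of the training script) shows which interpreter is actually executing:

import sys
print(sys.executable)            # which Python binary PyCharm launches
try:
    get_ipython()                # defined only inside IPython/Jupyter
    print("running under IPython")
except NameError:
    print("plain Python interpreter")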

Borda added the information needed and priority: 1 labels and removed the priority: 0 label May 9, 2021
notprime closed this as completed May 9, 2021
notprime reopened this May 9, 2021
notprime (Author) commented May 9, 2021

@Borda I get the same error even if I run the script in a regular terminal; 3 GPUs are still off.

Lightning-AI deleted a comment from notprime May 9, 2021
Borda (Member) commented May 9, 2021

Can you please tell us what is different here compared to #7336, which was evaluated as working as intended?

notprime (Author) commented May 9, 2021

@Borda there are no real differences; I just used a different optimizer.
The problem in #7336 was related to Jupyter Notebook, which doesn't support ddp (the ddp accelerator re-launches the running script as one subprocess per GPU, which isn't possible from a notebook) but only dp and ddp_spawn.
So I tried to run the script from PyCharm and from the terminal with accelerator = 'ddp', but apparently I still can't use ddp.
These are the parameters of my pl.Trainer:

trainer = pl.Trainer(
    max_epochs=EPOCHS,
    gpus=-1,
    accelerator='ddp',
    logger=[comet_logger, tblogger],
    log_every_n_steps=steps,
    precision=16,
)

If I use dp, all the GPUs are used; ddp just isn't working for me.
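
For reference, a fallback sketch with the same settings but the accelerator swapped to ddp_spawn (whether it avoids this particular hang is untested here). ddp_spawn forks worker processes via torch.multiprocessing instead of re-launching the script, at the cost of the usual pickling restrictions:

trainer = pl.Trainer(
    max_epochs=EPOCHS,
    gpus=-1,
    accelerator='ddp_spawn',  # forks workers; works where script re-launch doesn't
    logger=[comet_logger, tblogger],
    log_every_n_steps=steps,
    precision=16,
)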

Borda (Member) commented May 25, 2021

So you are saying that running the same script (python my-train.py) in the system terminal works fine, but doing the same in the PyCharm terminal hangs?

edenlightning added this to the v1.3.x milestone Jul 1, 2021
Borda added the waiting on author label and removed the information needed label Jul 6, 2021
Borda modified the milestones: v1.3.x, v1.4 Jul 6, 2021
edenlightning modified the milestones: v1.4, v1.3.x, v1.4.x Jul 6, 2021
Borda (Member) commented Jul 19, 2021

OK, feel free to re-open if you still have this kind of issue 🐰

Borda closed this as completed Jul 19, 2021