
Memory leaks: processes still remain in the background even after the code has finished. #2590

Closed
ShomyLiu opened this issue Jul 12, 2020 · 9 comments · Fixed by #3819
ShomyLiu (Contributor) commented Jul 12, 2020

Hi!
I'm new to Lightning and have been using it for one day. However, I have found some critical issues, especially memory leaks in multi-GPU training:

(1) Even after the code finishes and exits, the processes are still running in the background.
(2) After I kill those processes manually one by one, some processes still seem to be occupying GPU memory, for example:
(screenshot: processes still holding GPU memory after the run)

BTW, there are some other issues in multi-GPU settings.

awaelchli added the bug (Something isn't working) and priority: 0 (High priority task) labels on Jul 12, 2020
ShomyLiu (Contributor, Author) commented Jul 12, 2020

The following steps reliably (100%) reproduce the memory leak with my code, for your reference:

  • clone the code from: https://github.com/ShomyLiu/pytorch_bert_elmo_example
  • a few third-party packages are needed:
    • fire, transformers, and so on
    • you may also want to set export TOKENIZERS_PARALLELISM=False to suppress the tokenizer warnings
  • go to the data dir, download the dataset from the Google Drive link in data/README.md, and unzip it:
cd data
unzip bert_elmo_glove.weight.zip
  • check out the pl branch ('pl' means PyTorch Lightning):
git checkout pl
  • running the code in a multi-GPU setting leads to the memory leak, for example:
python3 main.py train --gpu_id=[0,1] --epochs=5

Each run leaves behind a process that does not exit:
(screenshot: leftover process after the run has finished)
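In case it helps with triage, below is a stripped-down script (not this repository's code, just a minimal sketch) that exercises the same multi-GPU ddp path; after it exits, nvidia-smi / ps can be checked for leftover processes:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

# Minimal stand-in model; it only exists to drive the ddp code path.
class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        x = torch.randn(256, 32)
        y = torch.randint(0, 2, (256,))
        return DataLoader(TensorDataset(x, y), batch_size=32)

if __name__ == "__main__":
    # Mirrors `--gpu_id=[0,1] --epochs=5` from the command above.
    trainer = pl.Trainer(gpus=[0, 1], distributed_backend="ddp", max_epochs=5)
    trainer.fit(LitModel())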

My environment:

  • python3.6.8
  • NVIDIA-SMI: 418.39
  • CUDA: 10.0
  • pytorch: 1.5.1+cu101
  • pytorch-lightning: 0.8.5

I'm not sure whether the cause lies in PyTorch or in Lightning.
@awaelchli

awaelchli (Contributor) commented:

Does it also happen with distributed_backend="ddp_spawn"?
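Relative to a minimal setup like the sketch earlier in this thread, the switch is a single Trainer argument:

# Same model as the minimal sketch above; only the backend changes.
trainer = pl.Trainer(gpus=[0, 1], distributed_backend="ddp_spawn", max_epochs=5)
trainer.fit(LitModel())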

ShomyLiu (Contributor, Author) commented Jul 13, 2020

Hi, I have tried different backend settings:

  • distributed_backend="ddp_spawn" with num_workers=0: in this setting there is no memory leak. However, there is a warning:
You are using `distributed_backend=ddp_spawn` with num_workers=0. For much faster performance, switch to `distributed_backend=ddp` and set `num_workers>0
  • distributed_backend="dp": this directly raises an error:
RuntimeError: arguments are located on different GPUs

In addition, with distributed_backend="ddp", if the code runs to completion the memory leak happens. But if I interrupt the program manually with Ctrl-C while it is running, the memory leak does not happen. Hope this helps.
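For reference, the workers mentioned in that warning are configured on the DataLoader side; in the minimal sketch above the change would be, roughly:

def train_dataloader(self):
    x = torch.randn(256, 32)
    y = torch.randint(0, 2, (256,))
    # num_workers > 0 spawns dataloader worker processes per GPU process,
    # which is what the ddp_spawn performance warning is pointing at.
    return DataLoader(TensorDataset(x, y), batch_size=32, num_workers=4)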

awaelchli (Contributor) commented Jul 13, 2020

I found this in the code:

def forward(self, x, device):
        self.device = device

This does not look right. LightningModule also has a self.device attribute; assignments like these could leave data on the wrong device and maybe cause the memory leak?

https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#init-tensors-using-type-as-and-register-buffer
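A minimal sketch of the pattern those docs describe, so a submodule never needs an explicit device argument:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # register_buffer: the tensor moves together with the module across devices.
        self.register_buffer("scale", torch.ones(1))

    def forward(self, x):
        # type_as: the new tensor is created on whatever device x lives on.
        noise = torch.randn(x.size(0), 1).type_as(x)
        return x * self.scale + noise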

ShomyLiu (Contributor, Author) commented Jul 13, 2020

I just closed another issue about how to put new tensors on the right device: #2585
Since new tensors are created in a submodule of the model (i.e. the Net), I pass the device down to the Net:

def forward(self, x, device):
        self.device = device

Here self is a submodule rather than the main pl.LightningModule, so I think this is just a variable holding the device information, regardless of what the variable is called.

Borda (Member) commented Sep 15, 2020

  • running the code in a multi-GPU setting leads to the memory leak, for example:
python3 main.py train --gpu_id=[0,1] --epochs=5

There is no --gpu_id argument; you should use --gpus.

ShomyLiu (Contributor, Author) commented Sep 16, 2020

@Borda Hi, --gpu_id is an argument parsed in my own code, and I pass config.gpu_id into the gpus argument of the Trainer.
That said, maybe this issue has been resolved in the current version; I will check again ASAP.
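For reference, a hypothetical sketch of that mapping (the names here are assumptions rather than the repository's actual code), reusing the LitModel from the minimal script earlier in the thread:

import fire
import pytorch_lightning as pl

def train(gpu_id=(0, 1), epochs=5):
    # The custom --gpu_id option is simply forwarded to the Trainer's `gpus` argument.
    trainer = pl.Trainer(
        gpus=list(gpu_id),
        max_epochs=epochs,
        distributed_backend="ddp",
    )
    trainer.fit(LitModel())  # LitModel from the earlier minimal sketch

if __name__ == "__main__":
    # e.g. python3 main.py train --gpu_id=[0,1] --epochs=5
    fire.Fire({"train": train})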

awaelchli (Contributor) commented Sep 16, 2020

I can confirm this is still an issue. I also run into it very often when I kill ddp training. The problem is that the kill signal (a keyboard interrupt, for example) is not sent to the child processes in ddp, and they keep running.
I promise I will get back to #2165 soon to fix it.
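As a rough illustration of the idea (not the actual fix planned in #2165): the launcher keeps handles to the per-GPU worker processes and forwards termination signals to them before exiting:

import os
import signal

# subprocess.Popen handles for the spawned per-GPU worker processes would be kept here.
children = []

def _forward_signal(signum, frame):
    # Terminate any children that are still alive so they release their GPUs.
    for proc in children:
        if proc.poll() is None:
            proc.terminate()
    # Then let the default handler act on the parent itself.
    signal.signal(signum, signal.SIG_DFL)
    os.kill(os.getpid(), signum)

signal.signal(signal.SIGINT, _forward_signal)
signal.signal(signal.SIGTERM, _forward_signal)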

ShomyLiu (Contributor, Author) commented:

@awaelchli Thanks for your great effort, and it's indeed a critical issue.
