
Memory leaks: processes still remain in the background even after the code has finished. #2590

Closed
ShomyLiu opened this issue Jul 12, 2020 · 9 comments · Fixed by #3819
ShomyLiu (Contributor) commented Jul 12, 2020

Hi!
I'm new to Lightning and have been using it for one day. However, I have found some critical issues, especially memory leaks in multi-GPU training:

(1) Even after the code finishes and exits, the processes are still running in the background.
(2) After I kill those processes manually one by one, some processes still seem to be occupying GPU memory, for example:
(screenshot: processes still holding GPU memory after the run)

BTW, there are some other issues in multi-GPU settings.

awaelchli added the bug (Something isn't working) and priority: 0 (High priority task) labels on Jul 12, 2020
ShomyLiu (Contributor, Author) commented Jul 12, 2020

The following steps reliably (100%) reproduce the memory leak with my code, for your reference:

  • clone the code from: https://github.com/ShomyLiu/pytorch_bert_elmo_example
  • a few third-party packages are needed:
    • fire, transformers, and so on
    • you may also want to set export TOKENIZERS_PARALLELISM=False to suppress the tokenizer warnings
  • go to the data dir, download the dataset from the Google Drive link in data/README.md, and unzip it:
cd data
unzip bert_elmo_glove.weight.zip
  • check out the pl branch ('pl' means PyTorch Lightning):
git checkout pl
  • running the code in a multi-GPU setting leads to the memory leak, for example:
python3 main.py train --gpu_id=[0,1] --epochs=5

Each run leaves behind a process that does not exit:
(screenshot: leftover process after the run has finished)
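In case it helps with triage, below is a stripped-down script (not this repository's code, just a minimal sketch) that exercises the same multi-GPU ddp path; after it exits, nvidia-smi / ps can be checked for leftover processes:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

# Minimal stand-in model; it only exists to drive the ddp code path.
class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        x = torch.randn(256, 32)
        y = torch.randint(0, 2, (256,))
        return DataLoader(TensorDataset(x, y), batch_size=32)

if __name__ == "__main__":
    # Mirrors `--gpu_id=[0,1] --epochs=5` from the command above.
    trainer = pl.Trainer(gpus=[0, 1], distributed_backend="ddp", max_epochs=5)
    trainer.fit(LitModel())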

My environment:

  • python3.6.8
  • NVIDIA-SMI: 418.39
  • CUDA: 10.0
  • pytorch: 1.5.1+cu101
  • pytorch-lightning: 0.8.5

I'm not sure whether the cause lies in PyTorch or in Lightning.
@awaelchli

awaelchli (Contributor) commented:

Does it also happen with distributed_backend="ddp_spawn"?
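Relative to a minimal setup like the sketch earlier in this thread, the switch is a single Trainer argument:

# Same model as the minimal sketch above; only the backend changes.
trainer = pl.Trainer(gpus=[0, 1], distributed_backend="ddp_spawn", max_epochs=5)
trainer.fit(LitModel())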

ShomyLiu (Contributor, Author) commented Jul 13, 2020

Hi, I have tried different backend settings:

  • distributed_backend="ddp_spawn" with num_workers=0: in this setting there is no memory leak. However, there is a warning:
You are using `distributed_backend=ddp_spawn` with num_workers=0. For much faster performance, switch to `distributed_backend=ddp` and set `num_workers>0
  • distributed_backend="dp": this directly raises an error:
RuntimeError: arguments are located on different GPUs

In addition, with distributed_backend="ddp", if the code runs to completion the memory leak happens. But if I interrupt the program manually with Ctrl-C while it is running, the memory leak does not happen. Hope this helps.
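For reference, the workers mentioned in that warning are configured on the DataLoader side; in the minimal sketch above the change would be, roughly:

def train_dataloader(self):
    x = torch.randn(256, 32)
    y = torch.randint(0, 2, (256,))
    # num_workers > 0 spawns dataloader worker processes per GPU process,
    # which is what the ddp_spawn performance warning is pointing at.
    return DataLoader(TensorDataset(x, y), batch_size=32, num_workers=4)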

awaelchli (Contributor) commented Jul 13, 2020

I found this in the code:

def forward(self, x, device):
        self.device = device

This does not look right. LightningModule also has a self.device attribute; assignments like these could leave data on the wrong device and maybe cause the memory leak?

https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#init-tensors-using-type-as-and-register-buffer
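A minimal sketch of the pattern those docs describe, so a submodule never needs an explicit device argument:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # register_buffer: the tensor moves together with the module across devices.
        self.register_buffer("scale", torch.ones(1))

    def forward(self, x):
        # type_as: the new tensor is created on whatever device x lives on.
        noise = torch.randn(x.size(0), 1).type_as(x)
        return x * self.scale + noise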

ShomyLiu (Contributor, Author) commented Jul 13, 2020

I just closed another issue about how to put new tensors on the right device: #2585
Since new tensors are created in a submodule of the model (i.e. the Net), I pass the device down to the Net:

def forward(self, x, device):
        self.device = device

Here self is a submodule rather than the main pl.LightningModule, so I think this is just a variable holding the device information, regardless of what the variable is called.

Borda (Member) commented Sep 15, 2020

  • running the code in a multi-GPU setting leads to the memory leak, for example:
python3 main.py train --gpu_id=[0,1] --epochs=5

There is no --gpu_id argument; you should use --gpus.

ShomyLiu (Contributor, Author) commented Sep 16, 2020

@Borda Hi, --gpu_id is an argument parsed in my own code, and I pass config.gpu_id into the gpus argument of the Trainer.
That said, maybe this issue has been resolved in the current version; I will check again ASAP.
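For reference, a hypothetical sketch of that mapping (the names here are assumptions rather than the repository's actual code), reusing the LitModel from the minimal script earlier in the thread:

import fire
import pytorch_lightning as pl

def train(gpu_id=(0, 1), epochs=5):
    # The custom --gpu_id option is simply forwarded to the Trainer's `gpus` argument.
    trainer = pl.Trainer(
        gpus=list(gpu_id),
        max_epochs=epochs,
        distributed_backend="ddp",
    )
    trainer.fit(LitModel())  # LitModel from the earlier minimal sketch

if __name__ == "__main__":
    # e.g. python3 main.py train --gpu_id=[0,1] --epochs=5
    fire.Fire({"train": train})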

awaelchli (Contributor) commented Sep 16, 2020

I can confirm this is still an issue. I also run into it very often when I kill ddp training. The problem is that the kill signal (a keyboard interrupt, for example) is not sent to the child processes in ddp, and they keep running.
I promise I will get back to #2165 soon to fix it.
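As a rough illustration of the idea (not the actual fix planned in #2165): the launcher keeps handles to the per-GPU worker processes and forwards termination signals to them before exiting:

import os
import signal

# subprocess.Popen handles for the spawned per-GPU worker processes would be kept here.
children = []

def _forward_signal(signum, frame):
    # Terminate any children that are still alive so they release their GPUs.
    for proc in children:
        if proc.poll() is None:
            proc.terminate()
    # Then let the default handler act on the parent itself.
    signal.signal(signum, signal.SIG_DFL)
    os.kill(os.getpid(), signum)

signal.signal(signal.SIGINT, _forward_signal)
signal.signal(signal.SIGTERM, _forward_signal)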

ShomyLiu (Contributor, Author) commented:

@awaelchli Thanks for your great effort, and it's indeed a critical issue.
