This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Running out of memory when training ResNet50 #13

Open

lilhuang opened this issue Jun 5, 2019 · 0 comments

lilhuang commented Jun 5, 2019

Hi, I am trying to train a ResNet50 for the initial representation-learning stage on Python 3.7 and PyTorch 0.4.1.post2, and every time I run it I get an out-of-memory error, even when running on multiple GPUs with 24 GB of memory each. It most likely has to do with the backward pass, since the script finishes if I comment out loss.backward(). I've also tried running with no_grad(), but that produced errors. The command I used was:

```
python ./main.py --model ResNet50 \
    --traincfg base_classes_train_template.yaml \
    --valcfg base_classes_val_template.yaml \
    --print_freq 10 --save_freq 10 \
    --aux_loss_wt 0.02 --aux_loss_type sgm \
    --checkpoint_dir checkpoints/ResNet50_sgm
```
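(As an aside on the no_grad() errors mentioned above: this is general PyTorch behavior, not specific to this repo. Wrapping the training forward pass in torch.no_grad() disables graph construction entirely, so loss.backward() has nothing to backpropagate through; it only saves memory for validation/inference. A minimal sketch with a toy model:)

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # toy stand-in for the real network
x = torch.randn(3, 4)

# Under no_grad() no autograd graph is built, so backward() cannot work here:
with torch.no_grad():
    loss = nn.functional.mse_loss(model(x), torch.zeros(3, 2))
# loss.requires_grad is False -- calling loss.backward() would raise an error.

# no_grad() is appropriate for evaluation, where it genuinely saves memory:
model.eval()
with torch.no_grad():
    preds = model(x)
```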

The code worked perfectly with ResNet10, so I was wondering whether there is a solution to this issue. Do I have to run it on a different version of PyTorch, or can it be fixed with my current setup? Thank you for your help!
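(One general workaround worth trying, independent of this repo: gradient accumulation. Splitting each batch into micro-batches and calling backward() on each one caps the peak activation memory held for the backward pass, which is usually what blows up when moving from ResNet10 to ResNet50. The names below are illustrative, not from the repo's code; a hedged sketch with a toy model:)

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for ResNet50
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

full_x = torch.randn(32, 8)
full_y = torch.randint(0, 2, (32,))

accum_steps = 4  # split one batch of 32 into 4 micro-batches of 8
opt.zero_grad()
for x, y in zip(full_x.chunk(accum_steps), full_y.chunk(accum_steps)):
    # Scale so the accumulated gradient matches the full-batch mean loss.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()  # frees each micro-batch's graph immediately
opt.step()
```

The per-micro-batch backward() releases that micro-batch's activations before the next forward pass, so peak memory scales with the micro-batch size rather than the full batch size, at the cost of a few extra kernel launches.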
