This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Running out of memory when training ResNet50 #13

Open

lilhuang opened this issue Jun 5, 2019 · 0 comments

lilhuang commented Jun 5, 2019

Hi, I am trying to train a ResNet50 for the initial representation-learning stage on Python 3.7 and PyTorch 0.4.1.post2, and every time I run it I get an out-of-memory error, even when running on multiple GPUs with 24 GB of memory each. It most likely has to do with the backward pass, since the script finishes if I comment out loss.backward(). I've also tried running with no_grad(), but that produced errors. The command I used was:

```
python ./main.py --model ResNet50 \
    --traincfg base_classes_train_template.yaml \
    --valcfg base_classes_val_template.yaml \
    --print_freq 10 --save_freq 10 \
    --aux_loss_wt 0.02 --aux_loss_type sgm \
    --checkpoint_dir checkpoints/ResNet50_sgm
```
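(As an aside on the no_grad() errors mentioned above: this is general PyTorch behavior, not specific to this repo. Wrapping the training forward pass in torch.no_grad() disables graph construction entirely, so loss.backward() has nothing to backpropagate through; it only saves memory for validation/inference. A minimal sketch with a toy model:)

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # toy stand-in for the real network
x = torch.randn(3, 4)

# Under no_grad() no autograd graph is built, so backward() cannot work here:
with torch.no_grad():
    loss = nn.functional.mse_loss(model(x), torch.zeros(3, 2))
# loss.requires_grad is False -- calling loss.backward() would raise an error.

# no_grad() is appropriate for evaluation, where it genuinely saves memory:
model.eval()
with torch.no_grad():
    preds = model(x)
```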

The code worked perfectly with ResNet10, so I was wondering whether there is a solution to this issue. Do I have to run it on a different version of PyTorch, or can it be fixed with my current setup? Thank you for your help!
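(One general workaround worth trying, independent of this repo: gradient accumulation. Splitting each batch into micro-batches and calling backward() on each one caps the peak activation memory held for the backward pass, which is usually what blows up when moving from ResNet10 to ResNet50. The names below are illustrative, not from the repo's code; a hedged sketch with a toy model:)

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for ResNet50
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

full_x = torch.randn(32, 8)
full_y = torch.randint(0, 2, (32,))

accum_steps = 4  # split one batch of 32 into 4 micro-batches of 8
opt.zero_grad()
for x, y in zip(full_x.chunk(accum_steps), full_y.chunk(accum_steps)):
    # Scale so the accumulated gradient matches the full-batch mean loss.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()  # frees each micro-batch's graph immediately
opt.step()
```

The per-micro-batch backward() releases that micro-batch's activations before the next forward pass, so peak memory scales with the micro-batch size rather than the full batch size, at the cost of a few extra kernel launches.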
