T5-base goes out of memory on 4 GPUs with as small batch size as 4 #9311

Closed · ghost opened this issue Dec 26, 2020 · 2 comments

ghost commented Dec 26, 2020

Environment info

  • transformers version: 3.5.1
  • Platform: LINUX
  • Python version: 3.7
  • PyTorch version (GPU?): 1.7
  • Tensorflow version (GPU?): -
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help

Trainer: @sgugger
T5: @patrickvonplaten
examples/seq2seq: @patil-suraj

Information

Model I am using: T5-base, with a batch size of 8 on 4 GPUs. I always run out of memory, even with small batch sizes. This looks like a bug, as this model is not really big. I am under time pressure; is there anyone who could help me with this? Thanks.

The task I am working on is:

  • GLUE benchmark

Error Stack

  0%|          | 0/148395 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "finetune_trainer.py", line 303, in <module>
    main()
  File "finetune_trainer.py", line 239, in main
    training_args.optimize_from_scratch) else None
  File "/julia/codes/trainers/trainer.py", line 804, in train
    self.optimizer.step()
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/torch-1.7.1-py3.7-linux-x86_64.egg/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/transformers-3.5.1-py3.7.egg/transformers/optimization.py", line 285, in step
    state["exp_avg_sq"] = torch.zeros_like(p.data)
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 2; 15.78 GiB total capacity; 14.10 GiB already allocated; 20.25 MiB free; 14.42 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "finetune_trainer.py", line 303, in <module>
    main()
  File "finetune_trainer.py", line 239, in main
    training_args.optimize_from_scratch) else None
  File "/julia/codes/trainers/trainer.py", line 804, in train
    self.optimizer.step()
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/torch-1.7.1-py3.7-linux-x86_64.egg/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/transformers-3.5.1-py3.7.egg/transformers/optimization.py", line 296, in step
    denom = exp_avg_sq.sqrt().add_(group["eps"])
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 15.78 GiB total capacity; 14.06 GiB already allocated; 4.25 MiB free; 14.44 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "finetune_trainer.py", line 303, in <module>
    main()
  File "finetune_trainer.py", line 239, in main
    training_args.optimize_from_scratch) else None
  File "/julia/codes/trainers/trainer.py", line 804, in train
    self.optimizer.step()
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/torch-1.7.1-py3.7-linux-x86_64.egg/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/transformers-3.5.1-py3.7.egg/transformers/optimization.py", line 296, in step
    denom = exp_avg_sq.sqrt().add_(group["eps"])
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 1; 15.78 GiB total capacity; 14.13 GiB already allocated; 10.25 MiB free; 14.46 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "finetune_trainer.py", line 303, in <module>
    main()
  File "finetune_trainer.py", line 239, in main
    training_args.optimize_from_scratch) else None
  File "/julia/codes/trainers/trainer.py", line 804, in train
    self.optimizer.step()
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/torch-1.7.1-py3.7-linux-x86_64.egg/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/transformers-3.5.1-py3.7.egg/transformers/optimization.py", line 285, in step
    state["exp_avg_sq"] = torch.zeros_like(p.data)
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 3; 15.78 GiB total capacity; 14.10 GiB already allocated; 26.25 MiB free; 14.44 GiB reserved in total by PyTorch)
  0%|          | 0/148395 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/envs/t5/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/t5/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/torch-1.7.1-py3.7-linux-x86_64.egg/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/torch-1.7.1-py3.7-linux-x86_64.egg/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/envs/t5/bin/python', '-u', 'finetune_trainer.py', '--local_rank=3', 'configs/glue.json']' returned non-zero exit status 1.
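
For a rough sense of where the memory goes: the failing lines are AdamW allocating its exp_avg / exp_avg_sq moment buffers, which alone add two fp32 copies of every parameter on top of the weights and gradients. A back-of-the-envelope sketch, assuming t5-base's roughly 220M parameters (approximate numbers, for illustration only):

```python
# Rough static training footprint of t5-base with AdamW (activations not included).
n_params = 220_000_000   # t5-base has roughly 220M parameters
fp32 = 4                 # bytes per float32 value

weights = n_params * fp32      # model parameters
grads = n_params * fp32        # gradients
adam = 2 * n_params * fp32     # exp_avg + exp_avg_sq, the buffers in the traceback

for name, nbytes in [("weights", weights), ("gradients", grads), ("adam state", adam)]:
    print(f"{name}: {nbytes / 2**30:.2f} GiB")
# weights: 0.82 GiB, gradients: 0.82 GiB, adam state: 1.64 GiB
```

The remaining bulk of the ~14 GiB "already allocated" in the errors is activations for the batch and sequence length plus PyTorch's allocator cache, which is why the run hits OOM even though the requested 36 MiB looks tiny.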
ghost mentioned this issue Dec 26, 2020
ghost changed the title from "cuda out of memory with T5-base on 4 GPUs with very small batch size" to "T5-base goes out of memory on 4 GPUs with as small batch size as 4" on Dec 26, 2020
stas00 (Contributor) commented Dec 26, 2020

Here are some things you may try (they are unrelated to each other, so try them in whichever order resonates):

  1. Turn off --fp16, or keep it but switch to pytorch-nightly: a large memory leak related to autocast (fp16) was fixed there a few weeks ago. If your problem is not related to autocast/fp16, this won't help, but --fp16 was what triggered that leak. Switching to apex amp is another option to try if you're hitting this memory leak in pytorch.

  2. If you are using the huggingface trainer (I assume finetune_trainer.py is from examples/seq2seq, so you should be fine) and you can run transformers master, I'd suggest the just-added --sharded_ddp option. In my few experiments I was able to fit 2-3 times bigger batches with it. It's documented in [docs] outline sharded ddp doc #9208 (we are just waiting for a new fairscale release to merge it), but you can use it without digging into the details if you're short on time. To try it, install both transformers and fairscale from master and the new option becomes available; a sketch of how both suggestions could be applied follows at the end of this comment.

And please edit your issue to show the command line you use, so we can see which CLI args and/or hyperparameters you're running with.
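
A minimal sketch of how the two suggestions above could be applied, assuming configs/glue.json (the file passed to finetune_trainer.py in the log) is a standard HfArgumentParser-style JSON of TrainingArguments fields. The fp16 field is standard; sharded_ddp only exists once transformers and fairscale are installed from master, so treat the exact field names as assumptions to verify against your setup:

```python
# Hypothetical sketch: flip the two suggested knobs in the training config.
# Assumes configs/glue.json holds TrainingArguments-style fields read by
# finetune_trainer.py; verify the field names against your transformers version.
import json

with open("configs/glue.json") as f:
    cfg = json.load(f)

cfg["fp16"] = False        # suggestion 1: rule out the autocast/fp16 memory leak
cfg["sharded_ddp"] = True  # suggestion 2: needs transformers + fairscale from master

with open("configs/glue.json", "w") as f:
    json.dump(cfg, f, indent=2)
```

The multi-GPU launch itself would stay as in the log above (python -m torch.distributed.launch ... finetune_trainer.py configs/glue.json).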

github-actions (bot) commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
