T5-base goes out of memory on 4 GPUs with as small batch size as 4 #9311

Closed · ghost opened this issue Dec 26, 2020 · 2 comments

ghost commented Dec 26, 2020

Environment info

  • transformers version: 3.5.1
  • Platform: LINUX
  • Python version: 3.7
  • PyTorch version (GPU?): 1.7
  • Tensorflow version (GPU?): -
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help

Trainer: @sgugger
T5: @patrickvonplaten
examples/seq2seq: @patil-suraj

Information

Model I am using: T5-base, with a batch size of 8 on 4 GPUs. I always run out of memory, even with small batch sizes. This looks like a bug, as this model is not really big. I am under time pressure; is there anyone who could help me with this? Thanks.

The task I am working on is:

  • GLUE benchmark

Error Stack

  0%|          | 0/148395 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "finetune_trainer.py", line 303, in <module>
    main()
  File "finetune_trainer.py", line 239, in main
    training_args.optimize_from_scratch) else None
  File "/julia/codes/trainers/trainer.py", line 804, in train
    self.optimizer.step()
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/torch-1.7.1-py3.7-linux-x86_64.egg/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/transformers-3.5.1-py3.7.egg/transformers/optimization.py", line 285, in step
    state["exp_avg_sq"] = torch.zeros_like(p.data)
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 2; 15.78 GiB total capacity; 14.10 GiB already allocated; 20.25 MiB free; 14.42 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "finetune_trainer.py", line 303, in <module>
    main()
  File "finetune_trainer.py", line 239, in main
    training_args.optimize_from_scratch) else None
  File "/julia/codes/trainers/trainer.py", line 804, in train
    self.optimizer.step()
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/torch-1.7.1-py3.7-linux-x86_64.egg/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/transformers-3.5.1-py3.7.egg/transformers/optimization.py", line 296, in step
    denom = exp_avg_sq.sqrt().add_(group["eps"])
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 15.78 GiB total capacity; 14.06 GiB already allocated; 4.25 MiB free; 14.44 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "finetune_trainer.py", line 303, in <module>
    main()
  File "finetune_trainer.py", line 239, in main
    training_args.optimize_from_scratch) else None
  File "/julia/codes/trainers/trainer.py", line 804, in train
    self.optimizer.step()
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/torch-1.7.1-py3.7-linux-x86_64.egg/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/transformers-3.5.1-py3.7.egg/transformers/optimization.py", line 296, in step
    denom = exp_avg_sq.sqrt().add_(group["eps"])
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 1; 15.78 GiB total capacity; 14.13 GiB already allocated; 10.25 MiB free; 14.46 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "finetune_trainer.py", line 303, in <module>
    main()
  File "finetune_trainer.py", line 239, in main
    training_args.optimize_from_scratch) else None
  File "/julia/codes/trainers/trainer.py", line 804, in train
    self.optimizer.step()
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/torch-1.7.1-py3.7-linux-x86_64.egg/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/transformers-3.5.1-py3.7.egg/transformers/optimization.py", line 285, in step
    state["exp_avg_sq"] = torch.zeros_like(p.data)
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 3; 15.78 GiB total capacity; 14.10 GiB already allocated; 26.25 MiB free; 14.44 GiB reserved in total by PyTorch)
  0%|          | 0/148395 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/envs/t5/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/t5/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/torch-1.7.1-py3.7-linux-x86_64.egg/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/opt/conda/envs/t5/lib/python3.7/site-packages/torch-1.7.1-py3.7-linux-x86_64.egg/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/envs/t5/bin/python', '-u', 'finetune_trainer.py', '--local_rank=3', 'configs/glue.json']' returned non-zero exit status 1.
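
For a rough sense of where the memory goes: the failing lines are AdamW allocating its exp_avg / exp_avg_sq moment buffers, which alone add two fp32 copies of every parameter on top of the weights and gradients. A back-of-the-envelope sketch, assuming t5-base's roughly 220M parameters (approximate numbers, for illustration only):

```python
# Rough static training footprint of t5-base with AdamW (activations not included).
n_params = 220_000_000   # t5-base has roughly 220M parameters
fp32 = 4                 # bytes per float32 value

weights = n_params * fp32      # model parameters
grads = n_params * fp32        # gradients
adam = 2 * n_params * fp32     # exp_avg + exp_avg_sq, the buffers in the traceback

for name, nbytes in [("weights", weights), ("gradients", grads), ("adam state", adam)]:
    print(f"{name}: {nbytes / 2**30:.2f} GiB")
# weights: 0.82 GiB, gradients: 0.82 GiB, adam state: 1.64 GiB
```

The remaining bulk of the ~14 GiB "already allocated" in the errors is activations for the batch and sequence length plus PyTorch's allocator cache, which is why the run hits OOM even though the requested 36 MiB looks tiny.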
ghost mentioned this issue Dec 26, 2020
ghost changed the title from "cuda out of memory with T5-base on 4 GPUs with very small batch size" to "T5-base goes out of memory on 4 GPUs with as small batch size as 4" on Dec 26, 2020
stas00 (Contributor) commented Dec 26, 2020

Here are some things you may try (they are unrelated to each other, so try them in whichever order resonates):

  1. Turn off --fp16, or keep it but switch to pytorch-nightly: a large memory leak related to autocast (fp16) was fixed there a few weeks ago. If your problem is not related to autocast/fp16, this won't help, but --fp16 was what triggered that leak. Switching to apex amp is another option to try if you're hitting this memory leak in pytorch.

  2. If you are using the huggingface trainer (I assume finetune_trainer.py is from examples/seq2seq, so you should be fine) and you can run transformers master, I'd suggest the just-added --sharded_ddp option. In my few experiments I was able to fit 2-3 times bigger batches with it. It's documented in [docs] outline sharded ddp doc #9208 (we are just waiting for a new fairscale release to merge it), but you can use it without digging into the details if you're short on time. To try it, install both transformers and fairscale from master and the new option becomes available; a sketch of how both suggestions could be applied follows at the end of this comment.

And please edit your issue to show the command line you use, so we can see which CLI args and/or hyperparameters you're running with.
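
A minimal sketch of how the two suggestions above could be applied, assuming configs/glue.json (the file passed to finetune_trainer.py in the log) is a standard HfArgumentParser-style JSON of TrainingArguments fields. The fp16 field is standard; sharded_ddp only exists once transformers and fairscale are installed from master, so treat the exact field names as assumptions to verify against your setup:

```python
# Hypothetical sketch: flip the two suggested knobs in the training config.
# Assumes configs/glue.json holds TrainingArguments-style fields read by
# finetune_trainer.py; verify the field names against your transformers version.
import json

with open("configs/glue.json") as f:
    cfg = json.load(f)

cfg["fp16"] = False        # suggestion 1: rule out the autocast/fp16 memory leak
cfg["sharded_ddp"] = True  # suggestion 2: needs transformers + fairscale from master

with open("configs/glue.json", "w") as f:
    json.dump(cfg, f, indent=2)
```

The multi-GPU launch itself would stay as in the log above (python -m torch.distributed.launch ... finetune_trainer.py configs/glue.json).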

github-actions (bot) commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
