
"RuntimeError: Input, output and indices must be on the current device" when trying to finetune MBart #9336

Closed
mespla opened this issue Dec 29, 2020 · 3 comments · Fixed by #9347

Comments

@mespla

mespla commented Dec 29, 2020

Environment info

  • Platform: Linux-4.15.0-123-generic-x86_64-with-glibc2.10
  • Transformers versions tried: 4.1.1 (installed with pip) and 4.2.2 (installed from the master branch of the repository)
  • Python version: 3.7
  • PyTorch version: 1.7
  • Tensorflow version: 2.4
  • Number of available GPUs: 2 (GeForce RTX 2080 Ti, with ~11GB of memory each)

Information

Model I am using (Bert, XLNet ...): MBart -> facebook/mbart-large-cc25

The problem arises when using: the official example scripts (details below)

The task I am working on is: my own task or dataset (details below)

I am fine-tuning MBart on my own dataset using the examples/seq2seq/finetune.sh script. When I run it on a single GPU, I get an out-of-memory error, as a single GPU does not have enough memory to hold the MBart model. When I try to distribute the model across two GPUs, I get a RuntimeError:
RuntimeError: Input, output and indices must be on the current device

To reproduce

I am running the script in the following way:
CUDA_VISIBLE_DEVICES=0,1 transformers/examples/seq2seq/finetune.sh --model_name_or_path "facebook/mbart-large-cc25" --output_dir output --data_dir data --overwrite_output_dir --model_parallel --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --freeze_encoder --freeze_embeds --tgt_lang "en"

I have also tried:
CUDA_VISIBLE_DEVICES=0,1 transformers/examples/seq2seq/finetune.sh --model_name_or_path "facebook/mbart-large-cc25" --output_dir output --data_dir data --overwrite_output_dir --model_parallel --tgt_lang "en"

I also tried limiting the length of source and target sentences with several values for --max_target_length and --max_source_length. In addition, I tried using more GPUs (up to 4).

If I run wc -l on my data directory, I get:

3004 data/test.source
3004 data/test.target
686623 data/train.source
686623 data/train.target
2999 data/val.source
2999 data/val.target
@patrickvonplaten
Contributor

Hey @mespla,

Thanks for your issue! I'm afraid that, at the moment, we're really unsure whether we want to keep supporting all the bash scripts in examples/seq2seq. In a couple of weeks we plan to have a single, concise training script for seq2seq models.

cc @sgugger

Also tagging @stas00, @patil-suraj in case you know a quick fix to this problem or have encountered this before as well.

@stas00
Contributor

stas00 commented Dec 29, 2020

When I run it on a single GPU, I get a memory error, as one GPU has not enough memory to load the MBart model. When I try to distribute the model on two GPUs, I get a RuntimeError:
RuntimeError: Input, output and indices must be on the current device

Are you implying you've changed modeling_bart.py to support Model Parallelism? That would certainly explain the error: you probably moved the layers to different devices but not the inputs/indices.
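(For anyone hitting this, here is a toy PyTorch sketch, not the actual finetune.sh code, of the kind of mismatch that raises this exact error, and the usual fix of moving inputs and activations to the device of the layer that consumes them:)

```python
import torch
import torch.nn as nn

# Toy sketch: a hand-rolled "model parallel" split where the embedding sits
# on cuda:0 and a later layer on cuda:1.
embed = nn.Embedding(1000, 16).to("cuda:0")
proj = nn.Linear(16, 2).to("cuda:1")

input_ids = torch.tensor([[1, 2, 3]])  # still on CPU

# embed(input_ids) raises
# "RuntimeError: Input, output and indices must be on the current device",
# because the indices were never moved to the embedding's device.

# Fix: move the inputs to the device of the first layer, and move the
# intermediate activations whenever they cross a device boundary.
hidden = embed(input_ids.to("cuda:0"))
logits = proj(hidden.to("cuda:1"))
```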

I'm currently studying the T5 MP support we already have and am about to do the same for Bart, i.e. add MP to Bart and its subclasses (so MBart is included).

If you mean something else by "I try to distribute the model on two GPUs", please clarify.

If you're just trying to use 2 GPUs because you can't fit even one batch onto a single GPU, then simply making 2 GPUs visible won't help. In fact, your command line takes even more memory, since it activates DataParallel, which is less memory-efficient than DistributedDataParallel. See the README.md in that folder for how to run DDP.
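For context, a minimal PyTorch sketch of the difference between the two (nothing in it is specific to finetune.sh); note that neither approach shrinks the per-GPU footprint of the model itself:

```python
import torch
import torch.nn as nn
import torch.distributed as dist

model = nn.Linear(1024, 1024)

# DataParallel: a single process drives all visible GPUs; the model is
# replicated at every forward pass and outputs/gradients are gathered on
# GPU 0, which costs extra memory there. This is what the command line above
# activates when two GPUs are visible.
dp_model = nn.DataParallel(model.cuda())

# DistributedDataParallel: one process per GPU (launched e.g. with
# `python -m torch.distributed.launch --nproc_per_node=2 your_script.py`);
# each process keeps a single replica and only gradients are all-reduced.
def wrap_ddp(local_rank: int) -> nn.Module:
    dist.init_process_group(backend="nccl")   # once per process
    torch.cuda.set_device(local_rank)
    return nn.parallel.DistributedDataParallel(
        model.cuda(local_rank), device_ids=[local_rank]
    )

# Either way, every GPU still has to hold a full copy of the model.
```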

But fear not, have a look at these two possible solutions for not being able to fit the model onto a single GPU:
#9311 (comment)
and another one will join soon once DeepSpeed has been integrated.

@stas00
Contributor

stas00 commented Dec 29, 2020

Oh, wait a sec, I only now noticed that you used --model_parallel. This flag currently works only for t5 and gpt2, the only two models that have been ported to support MP.

So the Trainer should assert if this flag is used and the architecture doesn't support MP.

PR #9347 adds this assert.

And hopefully Bart will support MP soon as well. Until then try my suggestions in the comment above.
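(As I understand it, the MP support in those two architectures is exposed through the experimental parallelize() API; a rough sketch with an illustrative device map, using T5 as the example:)

```python
from transformers import T5ForConditionalGeneration

# Rough sketch of the experimental MP API for the already-ported models.
# t5-small has 6 blocks; the split below is illustrative, not a recommendation.
model = T5ForConditionalGeneration.from_pretrained("t5-small")
device_map = {0: [0, 1, 2], 1: [3, 4, 5]}
model.parallelize(device_map)   # spread the blocks over cuda:0 and cuda:1

# ... train / generate with inputs placed on the first device (cuda:0) ...

model.deparallelize()           # move everything back to the CPU when done
```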
