T5 Model Parallelism in 4.3.0 #9718

Closed · PeterAJansen opened this issue Jan 21, 2021 · 1 comment · Fixed by #9726

@PeterAJansen
Environment info

  • transformers version: 4.3.0.dev0
  • Platform: Linux-5.4.0-62-generic-x86_64-with-debian-bullseye-sid
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.8.0.dev20210120 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes -- 4x A100-SXM4-40GB
  • Using distributed or parallel set-up in script?: Yes

Who can help

@stas00 @alexorona @sgugger

Related

This is related to the discussion in #8771 ( #8771 (comment) ), which suggests that in 4.3.0 MP can be done just by calling model.parallelize() after loading the model. I opened a separate issue rather than hijack that one, since it is about MP improvements in general.

Information

Model I am using (Bert, XLNet ...): T5

The problem arises when using:

  • my own modified scripts: (give details below)

Added one line to finetune_trainer.py after the model is loaded ( model.parallelize(), see below):

+++ b/examples/seq2seq/finetune_trainer.py
@@ -215,6 +215,9 @@ def main():    
     # use task specific params
     use_task_specific_params(model, data_args.task)
 
+    # PJ: Parallelize model
+    model.parallelize()
+
     # set num_beams for evaluation
     if data_args.eval_beams is None:
         data_args.eval_beams = model.config.num_beams
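
For context, outside of finetune_trainer.py that one-line change corresponds to roughly the following (a minimal sketch, not part of the original report; it assumes the 4x A100 environment above):

import torch
from transformers import T5ForConditionalGeneration

# Load as usual, then shard the T5 blocks across all visible GPUs.
model = T5ForConditionalGeneration.from_pretrained("t5-large")
model.parallelize()          # no device_map: an even split across GPUs is chosen automatically
print(model.device_map)      # the split that parallelize() settled on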

The task I am working on is:

  • an official GLUE/SQUaD task: Running the example on an official task/dataset (seq2seq)

To reproduce

Steps to reproduce the behavior:

On 4.3.0-dev (tonight):

  1. Fresh pull of transformers. Add change above ( model.parallelize() ).
  2. Run the runscript (below). The error appears to reproduce for any model size (I'm actually using t5-11b, but it also happens with t5-large).
python finetune_trainer.py \
    --learning_rate=3e-5 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_val 1000 \
    --data_dir xsum \
    --output_dir=xsum_results \
    --num_train_epochs 1 \
    --model_name_or_path t5-large \
    --fp16 \
    "$@"

  3. The error:
...
[INFO|modeling_utils.py:1152] 2021-01-21 00:52:03,923 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at t5-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
01/21/2021 00:52:03 - INFO - utils -   setting model.config to task specific params for summarization:
 {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}
01/21/2021 00:52:03 - INFO - utils -   note: command line args may override some of these
[INFO|trainer.py:362] 2021-01-21 00:52:14,376 >> Using amp fp16 backend
01/21/2021 00:52:14 - INFO - __main__ -   *** Train ***
[INFO|trainer.py:813] 2021-01-21 00:52:14,383 >> ***** Running training *****
[INFO|trainer.py:814] 2021-01-21 00:52:14,383 >>   Num examples = 204016
[INFO|trainer.py:815] 2021-01-21 00:52:14,383 >>   Num Epochs = 1
[INFO|trainer.py:816] 2021-01-21 00:52:14,383 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:817] 2021-01-21 00:52:14,383 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:818] 2021-01-21 00:52:14,383 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:819] 2021-01-21 00:52:14,383 >>   Total optimization steps = 25502
  0%|                                                                                                                                                                                                           | 0/25502 [00:00<?, ?it/s]Traceback (most recent call last):
  File "finetune_trainer.py", line 370, in <module>
    main()
  File "finetune_trainer.py", line 301, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/home/pajansen/anaconda3/envs/transformers-4.1.1a/lib/python3.7/site-packages/transformers/trainer.py", line 910, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/pajansen/anaconda3/envs/transformers-4.1.1a/lib/python3.7/site-packages/transformers/trainer.py", line 1272, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/pajansen/anaconda3/envs/transformers-4.1.1a/lib/python3.7/site-packages/transformers/trainer.py", line 1300, in compute_loss
    outputs = model(**inputs)
  File "/home/pajansen/anaconda3/envs/transformers-4.1.1a/lib/python3.7/site-packages/torch/nn/modules/module.py", line 873, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pajansen/anaconda3/envs/transformers-4.1.1a/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 1500, in forward
    return_dict=return_dict,
  File "/home/pajansen/anaconda3/envs/transformers-4.1.1a/lib/python3.7/site-packages/torch/nn/modules/module.py", line 873, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pajansen/anaconda3/envs/transformers-4.1.1a/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 938, in forward
    head_mask = head_mask.to(hidden_states.device)
AttributeError: 'list' object has no attribute 'to'
  0%|                                                                                                                                                                                                           | 0/25502 [00:00<?, ?it/s]
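
The failing line moves head_mask onto the device of hidden_states, but when no head mask is passed, get_head_mask() hands the model a plain Python list of per-layer None entries, which has no .to(). A self-contained sketch of the failure and of one possible guard (an illustration only, not necessarily the fix that landed in #9726):

import torch

hidden_states = torch.zeros(1, 3, 8)   # stand-in for the hidden states in T5Stack.forward
head_mask = [None] * 24                # get_head_mask() returns a per-layer list when no mask is given

# head_mask.to(hidden_states.device)   # -> AttributeError: 'list' object has no attribute 'to'

# Only move real tensors; leave the per-layer list of Nones alone.
if torch.is_tensor(head_mask):
    head_mask = head_mask.to(hidden_states.device)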

  4. It's worth noting that the behavior on 4.1.1 is different, and it works (essentially the same change, but with the device map specified as per https://huggingface.co/transformers/model_doc/t5.html?highlight=parallel#transformers.T5EncoderModel.parallelize , and the runscript also has the --model_parallel flag).
  • transformers version: 4.1.1
  • Platform: Linux-5.4.0-62-generic-x86_64-with-debian-bullseye-sid
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.8.0.dev20210120 (True)

Change:

+++ b/examples/seq2seq/finetune_trainer.py
@@ -231,6 +231,13 @@ def main():
     # use task specific params
     use_task_specific_params(model, data_args.task)
 
+    # PJ: Parallelize
+    device_map = {0: [0, 1, 2],
+        1: [3, 4, 5, 6, 7, 8, 9],
+        2: [10, 11, 12, 13, 14, 15, 16],
+        3: [17, 18, 19, 20, 21, 22, 23]}
+    model.parallelize(device_map)
+
     # set num_beams for evaluation
     if data_args.eval_beams is None:
         data_args.eval_beams = model.config.num_beams
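
The hand-written map above deliberately gives GPU 0 fewer blocks (3 vs. 7 for each of the others), presumably to leave headroom on the first device for the embeddings and activations. If an even split is good enough, the map can also be derived from the config rather than hard-coded (a sketch under the same 4-GPU assumption, not part of the original patch):

import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-large")

n_gpus = torch.cuda.device_count()    # 4 in this setup
n_blocks = model.config.num_layers    # 24 blocks per stack for t5-large
per_gpu = -(-n_blocks // n_gpus)      # ceiling division
device_map = {gpu: list(range(gpu * per_gpu, min((gpu + 1) * per_gpu, n_blocks)))
              for gpu in range(n_gpus)}

model.parallelize(device_map)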

Runscript:

python finetune_trainer.py \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train \
    --data_dir xsum \
    --output_dir=xsum_results \
    --num_train_epochs 1 \
    --model_name_or_path t5-large \
    --max_source_length 96 \
    --max_target_length 96 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --model_parallel \
    "$@"

Output (works fine):

01/21/2021 01:25:01 - INFO - __main__ -   *** Train ***
[INFO|trainer.py:703] 2021-01-21 01:25:01,016 >> ***** Running training *****
[INFO|trainer.py:704] 2021-01-21 01:25:01,016 >>   Num examples = 999
[INFO|trainer.py:705] 2021-01-21 01:25:01,016 >>   Num Epochs = 1
[INFO|trainer.py:706] 2021-01-21 01:25:01,016 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:707] 2021-01-21 01:25:01,016 >>   Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:708] 2021-01-21 01:25:01,016 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:709] 2021-01-21 01:25:01,017 >>   Total optimization steps = 999
  0%|                                                                                                                                                                                                             | 0/999 [00:00<?, ?it/s]/home/pajansen/anaconda3/envs/transformers-4.1.1/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:134: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
{'loss': nan, 'learning_rate': 1.4984984984984986e-05, 'epoch': 0.5005005005005005}                                                                                                                                                       
 50%|█████████████████████████████████████████████████████████████████████████████████████████████████▌                                                                                                 | 500/999 [02:25<02:20,  3.54it/s][INFO|trainer.py:1226] 2021-01-21 01:27:26,134 >> Saving model checkpoint to xsum-mini_results/checkpoint-500
[INFO|configuration_utils.py:289] 2021-01-21 01:27:26,138 >> Configuration saved in xsum-mini_results/checkpoint-500/config.json
[INFO|modeling_utils.py:814] 2021-01-21 01:27:29,444 >> Model weights saved in xsum-mini_results/checkpoint-500/pytorch_model.bin
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 999/999 [04:54<00:00,  3.30it/s][INFO|trainer.py:862] 2021-01-21 01:29:55,140 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


{'epoch': 1.0}                                                                                                                                                                                                                            
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 999/999 [04:54<00:00,  3.40it/s]
[INFO|trainer.py:1226] 2021-01-21 01:29:55,141 >> Saving model checkpoint to xsum-mini_results
[INFO|configuration_utils.py:289] 2021-01-21 01:29:55,146 >> Configuration saved in xsum-mini_results/config.json
[INFO|modeling_utils.py:814] 2021-01-21 01:29:58,207 >> Model weights saved in xsum-mini_results/pytorch_model.bin
01/21/2021 01:29:58 - INFO - __main__ -   ***** train metrics *****
01/21/2021 01:29:58 - INFO - __main__ -     train_samples_per_second = -0.003
01/21/2021 01:29:58 - INFO - __main__ -     train_runtime = 294.1311
01/21/2021 01:29:58 - INFO - __main__ -     train_n_ojbs = -1

(Note: I substituted the xsum dataset above with a shorter version made with head, keeping just the first 1000 lines of each file, so that the run would finish to completion instead of taking 15 hours on the full example dataset. It looks okay.) It's worth noting that if the validation arguments are added:

    --evaluation_strategy steps \
    --predict_with_generate \
    --n_val 1000 \

then 4.1.1 dies at the checkpoints (500 iterations) with "RuntimeError: Input, output and indices must be on the current device". I don't fully understand that one -- I assume it means train/eval has to be done separately with MP, which is entirely manageable. #9336 showed a similar error, but that person was using BART (which doesn't have MP in 4.1.1) rather than T5, so I don't think it's the same thing.
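
My reading of that error (unconfirmed) is that the evaluation batch ends up on a different device than the shard that consumes it. A hypothetical standalone workaround along those lines is to move the inputs to whichever device holds the input embeddings before calling generate() (a sketch only, using just the public API):

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-large")
model.parallelize()
tokenizer = T5Tokenizer.from_pretrained("t5-large")

batch = tokenizer("summarize: some held-out article text", return_tensors="pt")
# Move the batch to wherever the input embeddings landed after parallelize().
first_device = model.get_input_embeddings().weight.device
batch = {k: v.to(first_device) for k, v in batch.items()}

with torch.no_grad():
    generated = model.generate(**batch, num_beams=4, max_length=60)
print(tokenizer.decode(generated[0], skip_special_tokens=True))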

Expected behavior

Model parallelism -- spreading large models across multiple GPUs.

@stas00
Contributor

stas00 commented Jan 21, 2021

If you run into other issues, please try this PR: #9323, which has lots of improvements. It just hasn't been merged yet since, I think, we are waiting for me to sort out the whole MP/PP situation before moving forward.
