
fp16 flag silently fails #14822

Closed
rumeshmadhusanka opened this issue Dec 18, 2021 · 2 comments
rumeshmadhusanka commented Dec 18, 2021

Environment info

  • transformers version: 4.15.0.dev0
  • Platform: Linux-5.10.68+-x86_64-with-debian-bullseye-sid
  • Python version: 3.7.12
  • PyTorch version (GPU?): 1.9.1 (True)
  • Tensorflow version (GPU?): 2.6.2 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.3.6 (gpu)
  • Jax version: 0.2.25
  • JaxLib version: 0.1.70
  • Using GPU in script?: Y
  • Using distributed or parallel set-up in script?: N

Models:

  1. mT5-small
  2. mT5-base

Both models show the same behavior.

Information

The problem arises when using the --fp16 flag during fine-tuning.

The tasks I am working on are:

  • text simplification
  • translation

To reproduce

Steps to reproduce the behavior:

  1. Run a translation/simplification task with the fp16 flag turned on.
     My params (for simplification):

     model="google/mt5-base"
     !python mt5-simplification/finetune.py \
         --model_name_or_path $model \
         --do_train \
         --fp16 \
         --do_eval \
         --adafactor \
         --source_lang com \
         --target_lang sim \
         --source_prefix "com-sim: " \
         --train_file train.json \
         --validation_file valid.json \
         --output_dir mt5-simplification \
         --per_device_train_batch_size=4 \
         --per_device_eval_batch_size=4 \
         --save_total_limit=1 \
         --adam_epsilon=1e-6 \
         --learning_rate=3e-5 \
         --save_strategy=epoch \
         --report_to="wandb" \
         --max_steps=1200 \
         --warmup_steps=250 \
         --overwrite_output_dir \
         --log_level debug \
         --output_dir saved \
         --predict_with_generate

Some of the output logs:

0%| | 0/1200 [00:00<?, ?it/s]/kaggle/working/transformers/src/transformers/trainer.py:1366: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
args.max_grad_norm,
14%|█████▌ | 166/1200 [00:39<03:58, 4.33it/s]/opt/conda/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:134: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
{'loss': 0.0, 'learning_rate': 2.7347368421052632e-05, 'epoch': 0.02}
83%|████████████████████████████████▌ | 1000/1200 [04:10<10:45, 3.23s/it]{'loss': 0.0, 'learning_rate': 1.1557894736842106e-05, 'epoch': 0.03}
100%|███████████████████████████████████████| 1200/1200 [04:56<00:00, 4.39it/s][INFO|trainer.py:2033] 2021-12-18 02:11:18,316 >> Saving model checkpoint to saved/checkpoint-1200

***** train metrics *****
epoch = 0.04
train_loss = 0.0
train_runtime = 0:05:11.02
train_samples = 120000
train_samples_per_second = 15.433
train_steps_per_second = 3.858
[INFO|trainer.py:2281] 2021-12-18 02:11:46,330 >> ***** Running Evaluation *****
[INFO|trainer.py:2283] 2021-12-18 02:11:46,330 >> Num examples = 2000
[INFO|trainer.py:2286] 2021-12-18 02:11:46,330 >> Batch size = 4
100%|█████████████████████████████████████████| 500/500 [01:56<00:00, 4.31it/s]
***** eval metrics *****
epoch = 0.04
eval_bleu = 0.0126
eval_gen_len = 8.16
eval_loss = nan
eval_runtime = 0:01:56.24
eval_samples = 2000
eval_samples_per_second = 17.205
eval_steps_per_second = 4.301

When I run a translation task on Kaggle's GPU (Tesla P100-PCIE) or an AWS T4 GPU, the training loss is always zero. I have tried this multiple times with different training params.
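
To make the failure visible rather than silent, something along the lines of the callback below could abort the run as soon as the logged training loss becomes zero or non-finite. This is only a sketch against the Trainer callback API; the class name and the exact stop conditions are my own, not anything in transformers or in this run.

```python
import math

from transformers import TrainerCallback


class LossSanityCallback(TrainerCallback):
    """Hypothetical helper: stop training when the logged loss is zero or non-finite."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and (loss == 0.0 or not math.isfinite(loss)):
            print(f"Suspicious training loss {loss} at step {state.global_step}; stopping.")
            control.should_training_stop = True
        return control
```

It would be registered with trainer.add_callback(LossSanityCallback()) before calling trainer.train().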

Expected behavior

  • Loss not to be zero while training
  • Throw an error message if the GPU doesn't support fp16
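
As a stop-gap, a one-batch fp16 forward pass could serve as a pre-flight check before launching a full run. A minimal sketch, assuming a CUDA GPU and the mT5 checkpoint family from this report; the input text is an illustrative placeholder:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"  # same family as the models in this report
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).cuda().eval()

# One dummy batch; the prefix/text is a placeholder, not real training data.
inputs = tokenizer("com-sim: a test sentence", return_tensors="pt").to("cuda")
labels = tokenizer("a test sentence", return_tensors="pt").input_ids.to("cuda")

# On CUDA, autocast defaults to float16, matching what --fp16 uses in training.
with torch.no_grad(), torch.cuda.amp.autocast():
    loss = model(**inputs, labels=labels).loss

if not torch.isfinite(loss):
    raise RuntimeError(
        "fp16 forward pass produced a non-finite loss; "
        "this checkpoint likely overflows in float16."
    )
```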

@stas00 (Contributor) commented Dec 18, 2021

It's not an issue of the GPU not supporting fp16. It's an issue of many models being trained in bf16 and then used with fp16, overflowing due to the incompatible numerical range: bf16-pretrained models use much bigger weight values than fp16 can accommodate, so they overflow.
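
For illustration, the range mismatch is easy to see directly in torch (the 1e5 value below is just an example magnitude, not taken from an actual mT5 checkpoint):

```python
import torch

print(torch.finfo(torch.float16).max)    # 65504.0 -- the largest finite fp16 value
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38 -- bf16 shares float32's exponent range

w = torch.tensor(1.0e5, dtype=torch.bfloat16)  # fine in bf16
print(w.to(torch.float16))                     # inf -> NaNs then propagate through the loss
```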

Please see: #10956 for various workarounds.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
