0%| | 0/1200 [00:00<?, ?it/s]/kaggle/working/transformers/src/transformers/trainer.py:1366: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
args.max_grad_norm,
14%|█████▌ | 166/1200 [00:39<03:58, 4.33it/s]/opt/conda/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:134: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
{'loss': 0.0, 'learning_rate': 2.7347368421052632e-05, 'epoch': 0.02}
83%|████████████████████████████████▌ | 1000/1200 [04:10<10:45, 3.23s/it]{'loss': 0.0, 'learning_rate': 1.1557894736842106e-05, 'epoch': 0.03}
100%|███████████████████████████████████████| 1200/1200 [04:56<00:00, 4.39it/s][INFO|trainer.py:2033] 2021-12-18 02:11:18,316 >> Saving model checkpoint to saved/checkpoint-1200
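The "Non-finite norm encountered" warning in the log above comes from gradient clipping seeing an inf/nan gradient norm. A minimal sketch of how that state arises (assuming a PyTorch build where error_if_nonfinite defaults to False, as in the log; the values are purely illustrative):

```python
import torch
from torch import nn

# A parameter whose gradient has overflowed to inf (e.g. from an fp16 overflow).
p = nn.Parameter(torch.ones(3))
p.grad = torch.tensor([1.0, float("inf"), 2.0])

# The total gradient norm is non-finite; with error_if_nonfinite=False
# clipping continues anyway and only emits a warning.
total_norm = nn.utils.clip_grad_norm_([p], max_norm=1.0)
print(total_norm)  # tensor(inf)
```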
When I run a translation task on Kaggle's GPU (Tesla P100-PCIE) or AWS's T4 GPU, the training loss is always zero. I have tried this multiple times with different training params.
Expected behavior
Loss should not be zero while training
An error message should be thrown if the GPU doesn't support fp16 (see the sketch below)
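For reference, a quick way to inspect what the local GPU actually supports (a sketch assuming PyTorch 1.10+, where torch.cuda.is_bf16_supported() is available):

```python
import torch

def describe_mixed_precision_support() -> None:
    """Print the device's compute capability and bf16 support."""
    if not torch.cuda.is_available():
        print("No CUDA device available")
        return
    major, minor = torch.cuda.get_device_capability()
    print(f"{torch.cuda.get_device_name()} (compute capability {major}.{minor})")
    # fp16 math works on essentially all recent GPUs, including P100 and T4;
    # native bf16 needs Ampere or newer (compute capability >= 8.0) -
    # the P100 is 6.0 and the T4 is 7.5.
    print("bf16 supported:", torch.cuda.is_bf16_supported())

describe_mixed_precision_support()
```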
It's not an issue of the GPU not supporting fp16. It's an issue of models that were pretrained in bf16 being run in fp16 and overflowing due to the incompatible numerical range: bf16-pretrained models use much bigger weight values than fp16 can accommodate, so they overflow.
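To make the range mismatch concrete, a small sketch (values chosen purely for illustration) showing that magnitudes bf16 represents comfortably overflow to inf in fp16:

```python
import torch

# fp16 maxes out around 6.5e4, while bf16 keeps fp32's exponent range (~3.4e38).
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e+38

x = torch.tensor([7.0e4, 1.0e9], dtype=torch.bfloat16)
print(x)                     # representable in bf16
print(x.to(torch.float16))   # both values overflow to inf in fp16
```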
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Environment info
transformers version: 4.15.0.dev0
Models: both have the same behavior
Library:
Information
The problem arises when using:
The tasks I am working on are:
To reproduce
Steps to reproduce the behavior:
My params (for simplification):
model="google/mt5-base"
!python mt5-simplification/finetune.py \
    --model_name_or_path $model \
    --do_train \
    --fp16 \
    --do_eval \
    --adafactor \
    --source_lang com \
    --target_lang sim \
    --source_prefix "com-sim: " \
    --train_file train.json \
    --validation_file valid.json \
    --output_dir mt5-simplification \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --save_total_limit=1 \
    --adam_epsilon=1e-6 \
    --learning_rate=3e-5 \
    --save_strategy=epoch \
    --report_to="wandb" \
    --max_steps=1200 \
    --warmup_steps=250 \
    --overwrite_output_dir \
    --log_level debug \
    --output_dir saved \
    --predict_with_generate
Some of the output logs are included at the top of this issue.