
[Trainer] finetuning: larger batch-size leading to a worse train loss #15020

Closed

stas00 opened this issue Jan 3, 2022 · 1 comment
stas00 commented Jan 3, 2022

I was running a benchmark to compare the speed-up from enabling various --gradient_accumulation_steps levels, and I noticed that the LM loss gets progressively and significantly worse as gradient_accumulation_steps grows:

Variation                            Train samples/sec   Diff %   Train loss
--gradient_accumulation_steps 1      135.85              100      2.21
--gradient_accumulation_steps 2      156.95              116      2.29
--gradient_accumulation_steps 4      167.65              123      2.42
--gradient_accumulation_steps 8      175.02              129      2.62
--gradient_accumulation_steps 16     179.15              132      2.86

(this is with --per_device_train_batch_size 16)

So something is strange here.
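
For reference, on a single GPU the effective batch size per optimizer step is per_device_train_batch_size * gradient_accumulation_steps, so this sweep covers effective batch sizes of 16 through 256. A minimal sketch of that arithmetic, in plain Python and independent of the Trainer:

per_device_train_batch_size = 16  # fixed for the whole sweep
for gradient_accumulation_steps in (1, 2, 4, 8, 16):
    # one optimizer update consumes this many samples
    effective_batch = per_device_train_batch_size * gradient_accumulation_steps
    print(f"gradient_accumulation_steps={gradient_accumulation_steps:2d} "
          f"-> effective batch size {effective_batch}")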

But re-testing with just the batch-size differences exhibits very similar behavior:

Variation                            Train samples/sec   Diff %   Train loss
--per_device_train_batch_size 1      10.04               100      1.90
--per_device_train_batch_size 2      19.39               193      2.01
--per_device_train_batch_size 4      38.66               385      2.09
--per_device_train_batch_size 8      77.52               772      2.17
--per_device_train_batch_size 16     144.12              1435     2.26

So --gradient_accumulation_steps doesn't appear to be the culprit; somehow the model is just very sensitive to the batch size.

Any suggestions as to why this is so?

The original cmd was:

CUDA_VISIBLE_DEVICES=0 examples/pytorch/translation/run_translation.py --model_name_or_path t5-base \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 16 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 10000 --dataloader_num_workers 2 --gradient_accumulation_steps 1

and then just changing --gradient_accumulation_steps to higher numbers.

Software:
transformers: 4.16.0.dev0
torch       : 1.10.1
cuda        : 11.3
python      : 3.8.11

Hardware:
1 GPU       : NVIDIA GeForce RTX 3090, 23.70GB
stas00 changed the title from "[Trainer] possible bug in gradient_accumulation_steps" to "[Trainer] finetuning: larger batch-size leading to a worse train loss" Jan 4, 2022

stas00 commented Jan 4, 2022

OK, the issue was that my benchmark is very short, and with fewer steps to take when the batches are larger, the model simply doesn't have a chance to step down far enough.

So such changes would require raising the LR, or, more realistically, increasing the dataset size, since one can't make the LR proportionally bigger with the batch-size increase without getting an overflow.
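
To make the arithmetic concrete: with --max_train_samples 10000 and a single epoch, the number of optimizer updates shrinks roughly 16x across the sweep, and scaling the LR linearly with the effective batch size quickly reaches impractically large values. A rough sketch, assuming the Trainer's default --learning_rate of 5e-5 since the command above doesn't override it:

max_train_samples = 10_000
per_device_train_batch_size = 16
base_lr = 5e-5  # assumed default; not taken from the actual runs

for gas in (1, 2, 4, 8, 16):
    effective_batch = per_device_train_batch_size * gas
    updates_per_epoch = max_train_samples // effective_batch  # 625 down to 39
    scaled_lr = base_lr * gas  # linear LR scaling rule, for illustration only
    print(f"gradient_accumulation_steps={gas:2d}: "
          f"{updates_per_epoch:4d} optimizer updates/epoch, "
          f"linearly scaled LR would be {scaled_lr:.1e}")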

stas00 closed this as completed Jan 4, 2022