
[Trainer] finetuning: larger batch-size leading to a worse train loss #15020

Closed

stas00 opened this issue Jan 3, 2022 · 1 comment
stas00 commented Jan 3, 2022

I was running a benchmark to compare the speed-up from enabling various --gradient_accumulation_steps levels, and I noticed that the LM loss gets progressively and significantly worse as gradient_accumulation_steps grows:

Variation                            Train samples/sec   Diff %   Train loss
--gradient_accumulation_steps 1      135.85              100      2.21
--gradient_accumulation_steps 2      156.95              116      2.29
--gradient_accumulation_steps 4      167.65              123      2.42
--gradient_accumulation_steps 8      175.02              129      2.62
--gradient_accumulation_steps 16     179.15              132      2.86

(this is with --per_device_train_batch_size 16)

So something is strange here.
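
For reference, on a single GPU the effective batch size per optimizer step is per_device_train_batch_size * gradient_accumulation_steps, so this sweep covers effective batch sizes of 16 through 256. A minimal sketch of that arithmetic, in plain Python and independent of the Trainer:

per_device_train_batch_size = 16  # fixed for the whole sweep
for gradient_accumulation_steps in (1, 2, 4, 8, 16):
    # one optimizer update consumes this many samples
    effective_batch = per_device_train_batch_size * gradient_accumulation_steps
    print(f"gradient_accumulation_steps={gradient_accumulation_steps:2d} "
          f"-> effective batch size {effective_batch}")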

But re-testing with just the batch-size differences exhibits very similar behavior:

Variation                            Train samples/sec   Diff %   Train loss
--per_device_train_batch_size 1      10.04               100      1.90
--per_device_train_batch_size 2      19.39               193      2.01
--per_device_train_batch_size 4      38.66               385      2.09
--per_device_train_batch_size 8      77.52               772      2.17
--per_device_train_batch_size 16     144.12              1435     2.26

So --gradient_accumulation_steps doesn't appear to be the culprit; somehow the model is just very sensitive to the batch size.

Any suggestions as to why this is so?

The original cmd was:

CUDA_VISIBLE_DEVICES=0 examples/pytorch/translation/run_translation.py --model_name_or_path t5-base \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 16 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 10000 --dataloader_num_workers 2 --gradient_accumulation_steps 1

and then just changing --gradient_accumulation_steps to higher numbers.

Software:
transformers: 4.16.0.dev0
torch       : 1.10.1
cuda        : 11.3
python      : 3.8.11

Hardware:
1 GPU       : NVIDIA GeForce RTX 3090, 23.70GB
stas00 changed the title from "[Trainer] possible bug in gradient_accumulation_steps" to "[Trainer] finetuning: larger batch-size leading to a worse train loss" Jan 4, 2022

stas00 commented Jan 4, 2022

OK, the issue was that my benchmark is very short, and with fewer steps to take when the batches are larger, the model simply doesn't have a chance to step down far enough.

So such changes would require raising the LR, or, more realistically, increasing the dataset size, since one can't make the LR proportionally bigger with the batch-size increase without getting an overflow.
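
To make the arithmetic concrete: with --max_train_samples 10000 and a single epoch, the number of optimizer updates shrinks roughly 16x across the sweep, and scaling the LR linearly with the effective batch size quickly reaches impractically large values. A rough sketch, assuming the Trainer's default --learning_rate of 5e-5 since the command above doesn't override it:

max_train_samples = 10_000
per_device_train_batch_size = 16
base_lr = 5e-5  # assumed default; not taken from the actual runs

for gas in (1, 2, 4, 8, 16):
    effective_batch = per_device_train_batch_size * gas
    updates_per_epoch = max_train_samples // effective_batch  # 625 down to 39
    scaled_lr = base_lr * gas  # linear LR scaling rule, for illustration only
    print(f"gradient_accumulation_steps={gas:2d}: "
          f"{updates_per_epoch:4d} optimizer updates/epoch, "
          f"linearly scaled LR would be {scaled_lr:.1e}")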

stas00 closed this as completed Jan 4, 2022