stas00 changed the title from "[Trainer] possible bug in gradient_accumulation_steps" to "[Trainer] finetuning: larger batch-size leading to a worse train loss" on Jan 4, 2022.
OK, the issue was that my benchmark is very short, and with fewer steps to take when the batches are larger, the model simply doesn't have a chance to step down far enough.

So such changes require raising the LR, or more realistically increasing the dataset size, since one can't scale the LR up proportionally to the batch-size increase without getting an overflow.
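A minimal sketch of the arithmetic behind this observation, assuming a single GPU and hypothetical numbers (dataset size and base LR below are made up for illustration, not taken from the benchmark):

```python
# For a fixed dataset, a larger effective batch means fewer optimizer updates
# per epoch, so a short benchmark simply takes fewer steps downhill.
# All numbers here are hypothetical, chosen only to illustrate the scaling.

dataset_size = 20_000          # hypothetical number of training samples
per_device_batch_size = 16     # --per_device_train_batch_size
base_lr = 3e-5                 # hypothetical base learning rate
num_gpus = 1                   # assuming a single GPU for simplicity

for grad_accum in (1, 2, 4, 8, 16):
    effective_batch = per_device_batch_size * grad_accum * num_gpus
    updates_per_epoch = dataset_size // effective_batch
    # Linear LR scaling rule: grow the LR with the effective batch size.
    # In mixed precision this can overflow/destabilize, which is the
    # constraint mentioned above.
    scaled_lr = base_lr * grad_accum
    print(f"grad_accum={grad_accum:2d} effective_batch={effective_batch:4d} "
          f"updates/epoch={updates_per_epoch:4d} linearly_scaled_lr={scaled_lr:.1e}")
```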
I was just running a benchmark to compare the speed-up from enabling various `--gradient_accumulation_steps` levels, and I noticed that the lm loss gets progressively worse, and by a lot, as `gradient_accumulation_steps` is increased:

(table: samples per second, %, and loss for each `--gradient_accumulation_steps` setting)

(this is with `--per_device_train_batch_size 16`)

So something is strange here.
But re-testing with just the batch-size differences, it appears to exhibit very similar behavior:

(table: samples per second, %, and loss for each batch-size setting)

So `--gradient_accumulation_steps` doesn't appear to be the culprit; somehow the model is super-sensitive to the batch size. Any suggestions as to why this is so?
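For context, here is a minimal PyTorch-style sketch (not the Trainer's actual implementation) of why gradient accumulation behaves like a larger batch: each micro-batch loss is scaled down by the number of accumulation steps, so the accumulated gradient matches that of one big batch and the optimizer simply steps less often. The toy model and sizes are made up for illustration.

```python
import torch

model = torch.nn.Linear(10, 1)                      # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4                                     # like --gradient_accumulation_steps 4

micro_batches = [torch.randn(16, 10) for _ in range(accum_steps)]
targets = [torch.randn(16, 1) for _ in range(accum_steps)]

optimizer.zero_grad()
for x, y in zip(micro_batches, targets):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()                 # scale so gradients average
optimizer.step()                                    # one update for 4 * 16 = 64 samples
```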
The original cmd was:

and then just changing `--gradient_accumulation_steps` to higher numbers.