@Borda this seems like an easy fix (although I do not know the codebase well enough to be quick on this one). I suggest adding a test so that this does not break again, since it was working in previous versions.
we print a warning when Trainer.check_val_every_n_epoch != ModelCheckpoint.every_n_epochs to warn the user that this might not be the desired behaviour
we set ModelCheckpoint.every_n_epochs = Trainer.check_val_every_n_epoch when it is not specified by the user
This is kind of important as one might check validation every n steps but save the model every m steps where n != m.
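The proposal above can be sketched in plain Python. This is only an illustration of the suggested logic, not Lightning code: `resolve_every_n_epochs` is a hypothetical helper, and it assumes the intended default for ModelCheckpoint.every_n_epochs is the value of Trainer.check_val_every_n_epoch.

```python
import warnings


def resolve_every_n_epochs(check_val_every_n_epoch, every_n_epochs=None):
    """Hypothetical helper: pick the checkpoint interval in epochs.

    check_val_every_n_epoch stands in for the Trainer setting and
    every_n_epochs for the ModelCheckpoint setting (None = not set).
    """
    if every_n_epochs is None:
        # Not specified by the user: follow the validation interval.
        return check_val_every_n_epoch
    if every_n_epochs != check_val_every_n_epoch:
        # Specified but different: keep the user's value and warn.
        warnings.warn(
            "ModelCheckpoint.every_n_epochs differs from "
            "Trainer.check_val_every_n_epoch; checkpoints may be saved on "
            "epochs without a fresh validation metric.",
            UserWarning,
        )
    return every_n_epochs
```

Keeping the user's explicit value and only warning leaves room for the case mentioned above, where validation and checkpointing are intentionally run at different intervals.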
🐛 Bug
When using a Trainer with check_val_every_n_epoch = n and n > 1, the trainer runs validation every n epochs, and this works. But when it is combined with a ModelCheckpoint with save_top_k = m and m > 1, the model is also saved at every epoch. It should instead be saved only every n epochs. This worked in previous versions (if I remember correctly, it worked in 1.2), but it is now broken.

To Reproduce
This piece of code with the BoringModel reproduces the issue. It saves the model every epoch instead of every n epochs (see the shell output below).

>>> ls -l *.ckpt
-rw-r--r--. 1 ndecao Domain Users 2378 Aug 27 09:39 model-epoch=01-valid_loss=-6.00.ckpt
-rw-r--r--. 1 ndecao Domain Users 2579 Aug 27 09:39 model-epoch=02-valid_loss=-6.00.ckpt
-rw-r--r--. 1 ndecao Domain Users 2378 Aug 27 09:39 model-epoch=03-valid_loss=-11.57.ckpt
-rw-r--r--. 1 ndecao Domain Users 2643 Aug 27 09:39 model-epoch=04-valid_loss=-11.57.ckpt
-rw-r--r--. 1 ndecao Domain Users 2378 Aug 27 09:39 model-epoch=05-valid_loss=-17.14.ckpt
-rw-r--r--. 1 ndecao Domain Users 2643 Aug 27 09:39 model-epoch=06-valid_loss=-17.14.ckpt
-rw-r--r--. 1 ndecao Domain Users 2378 Aug 27 09:39 model-epoch=07-valid_loss=-22.70.ckpt
-rw-r--r--. 1 ndecao Domain Users 2643 Aug 27 09:39 model-epoch=08-valid_loss=-22.70.ckpt
-rw-r--r--. 1 ndecao Domain Users 2378 Aug 27 09:39 model-epoch=09-valid_loss=-28.27.ckpt
Expected behavior
The model should check the validation loss and save a checkpoint every check_val_every_n_epoch epochs. These are the checkpoints that should be saved:

>>> ls -l *.ckpt
-rw-r--r--. 1 ndecao Domain Users 2378 Aug 27 09:39 model-epoch=01-valid_loss=-6.00.ckpt
-rw-r--r--. 1 ndecao Domain Users 2378 Aug 27 09:39 model-epoch=03-valid_loss=-11.57.ckpt
-rw-r--r--. 1 ndecao Domain Users 2378 Aug 27 09:39 model-epoch=05-valid_loss=-17.14.ckpt
-rw-r--r--. 1 ndecao Domain Users 2378 Aug 27 09:39 model-epoch=07-valid_loss=-22.70.ckpt
-rw-r--r--. 1 ndecao Domain Users 2378 Aug 27 09:39 model-epoch=09-valid_loss=-28.27.ckpt
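
The gap between the observed and the expected checkpoints can be sketched in plain Python, without Lightning. This is a hypothetical simulation, not Lightning's implementation. It assumes zero-indexed epochs, validation on epochs where (epoch + 1) is divisible by check_val_every_n_epoch, and n = 2 over 10 epochs (values picked to match the file names in this report; the reproduction script itself is not shown here). It also assumes the extra checkpoints come from the callback reusing the last logged valid_loss on epochs without validation, which the repeated loss values in the file names suggest but do not prove.

```python
def validation_epochs(max_epochs, check_val_every_n_epoch):
    """Zero-indexed epochs on which validation is assumed to run."""
    return [
        epoch
        for epoch in range(max_epochs)
        if (epoch + 1) % check_val_every_n_epoch == 0
    ]


def observed_checkpoint_epochs(max_epochs, check_val_every_n_epoch):
    """Assumed buggy behaviour: save every epoch once valid_loss exists."""
    val_epochs = set(validation_epochs(max_epochs, check_val_every_n_epoch))
    saved = []
    metric_logged = False
    for epoch in range(max_epochs):
        if epoch in val_epochs:
            metric_logged = True  # a fresh valid_loss is logged this epoch
        if metric_logged:
            saved.append(epoch)  # written whether the metric is fresh or stale
    return saved


def expected_checkpoint_epochs(max_epochs, check_val_every_n_epoch):
    """Expected behaviour: save only on epochs where validation ran."""
    return validation_epochs(max_epochs, check_val_every_n_epoch)


if __name__ == "__main__":
    print(observed_checkpoint_epochs(10, 2))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
    print(expected_checkpoint_epochs(10, 2))  # [1, 3, 5, 7, 9]
```

Under these assumptions the first list matches the epochs in the observed listing and the second matches the expected one.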
Environment