Validation metrics assumed to be logged within the first training epoch #6791
Comments
I also have this issue. I set the
I agree. There are 2 major failure cases with the existing error handling logic:
Rather than have a
Related RFC: #6504
What do you mean by this exactly? Do you not want to try to save on keyboard interrupt? I agree with everything else.
@carmocca I think what @ananthsub is saying is that in e.g. a DDP setting, if a subset of the ranks fail (for whatever reason), then you get a hang when you call it. Personally, I feel like that should be handled in a separate issue, but y'all would know better than me.
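To illustrate the hang (my own toy sketch, not Lightning code): if one rank dies before reaching a collective op, the surviving ranks block on it forever.

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 1:
        # Simulate one rank failing before reaching the collective op.
        raise RuntimeError("rank 1 died")
    # Rank 0 blocks here indefinitely: the barrier needs every rank to join,
    # and rank 1 never will.
    dist.barrier()


if __name__ == "__main__":
    procs = [mp.Process(target=worker, args=(r, 2)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # never returns for the rank-0 process
```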
@tmcclintock I agree. There are 2 issues here:
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
I believe this was fixed with the addition of #8389.
🐛 Bug
In `TrainLoop.on_train_end` a call to `check_checkpoint_callback` is made. Within that method a call to `on_validation_end` is performed. As per the docs (and the fact that the `ModelCheckpoint` fires on `on_validation_end`), the expectation is to monitor validation metrics. However, if in the `Trainer` we set `num_sanity_val_steps` to 0, then validation metrics are never logged, resulting in a misconfiguration exception in `_validate_monitor_key`.

Note that this is only an issue on the first epoch -- after this the val keys appear in the callback metrics and this issue is moot.
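Roughly, the flow described above looks like this (an illustrative paraphrase, not the actual Lightning source; the trainer attribute names are simplified):

```python
# Paraphrase of the reported flow (illustrative only, not Lightning source code).
def on_train_end(trainer):
    # TrainLoop.on_train_end calls check_checkpoint_callback ...
    check_checkpoint_callback(trainer)


def check_checkpoint_callback(trainer):
    for ckpt_callback in trainer.checkpoint_callbacks:
        # ... which invokes the validation-end hook of ModelCheckpoint.
        # Inside, _validate_monitor_key raises a MisconfigurationException
        # when the monitored validation metric is missing from the callback
        # metrics (e.g. num_sanity_val_steps=0 and no validation pass has
        # run yet in the first epoch).
        ckpt_callback.on_validation_end(trainer, trainer.lightning_module)
```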
Please reproduce using the BoringModel
To Reproduce
Use the following BoringModel and post here
I cannot reproduce this with the `BoringModel` since it uses deprecated `x_step` methods (e.g. `validation_step` returns the loss rather than logging it). It should be updated to 1.2.6 in a different issue.
Expected behavior

If the model checkpoint only implements `on_validation_end`, then it should only fire on that callback, not secretly in `on_train_end`. If it should fire in `on_train_end`, it should either have a second `monitor` specific to the `callback_metrics` logged during training, or its logic should be moved out from under `on_validation_end` to a more general (less misleading) hook.

Note that the callbacks have access to the `Trainer.state`, so it is possible to move the `ModelCheckpoint.on_validation_end` logic into a higher-level hook and leverage this state info. An elegant (imo) attribute to add to `ModelCheckpoint` could be `monitor_state`, so that for instance a user can say "monitor metric 'loss' but only while the trainer is in state 'train'".
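A rough sketch of how that could look (entirely hypothetical: `monitor_state` and this subclass are not part of Lightning, and the state comparison is simplified for illustration):

```python
from pytorch_lightning.callbacks import ModelCheckpoint


class StateAwareModelCheckpoint(ModelCheckpoint):
    """Hypothetical sketch of the `monitor_state` idea proposed above."""

    def __init__(self, monitor_state: str = "train", **kwargs):
        super().__init__(**kwargs)
        self.monitor_state = monitor_state

    def on_validation_end(self, trainer, pl_module):
        # Only run the monitoring/saving logic while the trainer is in the
        # requested state; the string comparison here is illustrative only.
        if self.monitor_state in str(trainer.state).lower():
            super().on_validation_end(trainer, pl_module)


# Hypothetical usage: monitor "loss", but only while the trainer is training.
# checkpoint = StateAwareModelCheckpoint(monitor="loss", monitor_state="train")
```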
Environment

- On PL master (1.2.6)
- Installed (`conda`, `pip`, source): conda