Replies: 2 comments 4 replies
-
Is it possible for your training script to check for the previous checkpoint file and set …
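The truncated suggestion above amounts to a small client-side check before building the `Trainer`. A minimal sketch, assuming the checkpoint directory is `checkpoints/` and `ModelCheckpoint(save_last=True)` is used (so the most recent checkpoint is written as `last.ckpt`):

```python
import os

import pytorch_lightning as pl

# "last.ckpt" is what ModelCheckpoint(save_last=True) writes; the
# "checkpoints/" directory is an assumed dirpath for this example.
last_ckpt = os.path.join("checkpoints", "last.ckpt")
resume = last_ckpt if os.path.exists(last_ckpt) else None

# Fresh run if no previous checkpoint exists, resumed run otherwise.
trainer = pl.Trainer(resume_from_checkpoint=resume)
# trainer.fit(model)
```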
-
Dear @dasayan05,
If you are willing to make a contribution, you could actually add support for HTCondor, similar to https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/connectors/slurm_connector.py#L29, with an auto-reload job as follows. Lightning already handles the reloading of the checkpoint internally. Ideally, this code should be moved to the …
Best,
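For context, here is a rough sketch of what such an HTCondor hook could look like, modelled loosely on the SLURM connector's signal handling. The callback class, the checkpoint path, and the choice of SIGTERM are assumptions for illustration, not an existing Lightning API:

```python
import signal
from pathlib import Path

import pytorch_lightning as pl

# Assumed path, shared between the original and the respawned job.
HPC_CKPT = Path("checkpoints/hpc_last.ckpt")


class CondorAutoResume(pl.Callback):
    """Sketch of an HTCondor auto-resume hook: save a checkpoint when the
    scheduler signals eviction, so the respawned job can pick it up."""

    def on_fit_start(self, trainer, pl_module):
        def _save_and_exit(signum, frame):
            # HTCondor typically sends SIGTERM before killing a job.
            trainer.save_checkpoint(str(HPC_CKPT))
            raise SystemExit(1)  # exit; the scheduler respawns the job

        signal.signal(signal.SIGTERM, _save_and_exit)


trainer = pl.Trainer(
    callbacks=[CondorAutoResume()],
    resume_from_checkpoint=str(HPC_CKPT) if HPC_CKPT.exists() else None,
)
```

The SLURM connector also re-queues the job explicitly; since HTCondor already respawns the job with the same parameters (as described below), the sketch only needs to leave a recoverable checkpoint behind before exiting.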
-
Problem
My university provides a job scheduling system (HTCondor) where I can submit jobs. Sometimes they get interrupted (for external reasons) in the middle, and the system respawns them immediately with the exact same parameters.
When a job is interrupted in the middle of training, I cannot rely on the respawned job because it would start a fresh run. I have to pass a proper `--resume_from_checkpoint=..` every time it gets interrupted and restart the job manually.

An idea
Can we have an automatic mechanism in the `Trainer` to figure out whether there has been a previous run of the same experiment (from the `logger` and `ModelCheckpoint(..)` paths)? Of course, it should be conditioned on whether trainer states are saved at all (i.e., `save_weights_only=False`). If successful, it would then do the same thing users do manually with `resume_from_checkpoint`: load the last checkpoint (if `save_last=True`, or else the last one written by metric monitoring). Furthermore, all of this could be wrapped under a single boolean switch, `--auto_resume_checkpoint`.

Alternatives
Users can probably do all of this from client code, but it would look really ugly. If that turns out to be the only way, then the PL docs should at least show some examples of automating the `resume_from_checkpoint` flag here.
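As a rough illustration of the client-side workaround, and of what an `--auto_resume_checkpoint` switch could look like, here is a minimal sketch; the `checkpoints/` directory and the argparse wiring are assumptions for this example, not existing Lightning behaviour:

```python
import argparse
import os

import pytorch_lightning as pl

parser = argparse.ArgumentParser()
parser.add_argument("--auto_resume_checkpoint", action="store_true",
                    help="resume from the last checkpoint of a previous run, if any")
args = parser.parse_args()

# ModelCheckpoint(save_last=True) writes "last.ckpt" into its dirpath;
# "checkpoints/" is an assumed dirpath for this example.
ckpt_cb = pl.callbacks.ModelCheckpoint(dirpath="checkpoints", save_last=True)

resume = None
if args.auto_resume_checkpoint:
    candidate = os.path.join("checkpoints", "last.ckpt")
    if os.path.exists(candidate):
        resume = candidate

trainer = pl.Trainer(callbacks=[ckpt_cb], resume_from_checkpoint=resume)
# trainer.fit(model)
```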