Replies: 2 comments 4 replies
-
Is it possible for your training script to check for the previous checkpoint file and set …
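The truncated suggestion above amounts to a small client-side check before building the `Trainer`. A minimal sketch, assuming the checkpoint directory is `checkpoints/` and `ModelCheckpoint(save_last=True)` is used (so the most recent checkpoint is written as `last.ckpt`):

```python
import os

import pytorch_lightning as pl

# "last.ckpt" is what ModelCheckpoint(save_last=True) writes; the
# "checkpoints/" directory is an assumed dirpath for this example.
last_ckpt = os.path.join("checkpoints", "last.ckpt")
resume = last_ckpt if os.path.exists(last_ckpt) else None

# Fresh run if no previous checkpoint exists, resumed run otherwise.
trainer = pl.Trainer(resume_from_checkpoint=resume)
# trainer.fit(model)
```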
-
Dear @dasayan05,
If you are willing to make a contribution, you could actually add support for HTCondor, similar to https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/connectors/slurm_connector.py#L29, with an auto-reload job as follows. Lightning already handles the reloading of the checkpoint internally. Ideally, this code should be moved to the …
Best,
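For context, here is a rough sketch of what such an HTCondor hook could look like, modelled loosely on the SLURM connector's signal handling. The callback class, the checkpoint path, and the choice of SIGTERM are assumptions for illustration, not an existing Lightning API:

```python
import signal
from pathlib import Path

import pytorch_lightning as pl

# Assumed path, shared between the original and the respawned job.
HPC_CKPT = Path("checkpoints/hpc_last.ckpt")


class CondorAutoResume(pl.Callback):
    """Sketch of an HTCondor auto-resume hook: save a checkpoint when the
    scheduler signals eviction, so the respawned job can pick it up."""

    def on_fit_start(self, trainer, pl_module):
        def _save_and_exit(signum, frame):
            # HTCondor typically sends SIGTERM before killing a job.
            trainer.save_checkpoint(str(HPC_CKPT))
            raise SystemExit(1)  # exit; the scheduler respawns the job

        signal.signal(signal.SIGTERM, _save_and_exit)


trainer = pl.Trainer(
    callbacks=[CondorAutoResume()],
    resume_from_checkpoint=str(HPC_CKPT) if HPC_CKPT.exists() else None,
)
```

The SLURM connector also re-queues the job explicitly; since HTCondor already respawns the job with the same parameters (as described below), the sketch only needs to leave a recoverable checkpoint behind before exiting.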
-
Problem
My university provides a job scheduling system (HTCondor) where I can submit jobs. Sometimes they get interrupted (for external reasons) in the middle, and the system respawns them immediately with the exact same parameters.
When a job is interrupted in the middle of training, I cannot rely on the respawned job because it would start a fresh run. I have to pass a proper `--resume_from_checkpoint=..` every time it gets interrupted and restart the job manually.

An idea
Can we have an automatic mechanism in the `Trainer` to figure out whether there has been a previous run of the same experiment (from the `logger` and `ModelCheckpoint(..)` paths)? Of course, it should be conditioned on whether trainer states are saved at all (i.e., `save_weights_only=False`). If successful, it would then do the same thing users do manually with `resume_from_checkpoint`: load the last checkpoint (if `save_last=True`, or else the last one written by metric monitoring). Furthermore, all of this could be wrapped under a single boolean switch, `--auto_resume_checkpoint`.

Alternatives
Users can probably do all of this from client code, but it would look really ugly. If that turns out to be the only way, then the PL docs should at least show some examples of automating the `resume_from_checkpoint` flag here.
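As a rough illustration of the client-side workaround, and of what an `--auto_resume_checkpoint` switch could look like, here is a minimal sketch; the `checkpoints/` directory and the argparse wiring are assumptions for this example, not existing Lightning behaviour:

```python
import argparse
import os

import pytorch_lightning as pl

parser = argparse.ArgumentParser()
parser.add_argument("--auto_resume_checkpoint", action="store_true",
                    help="resume from the last checkpoint of a previous run, if any")
args = parser.parse_args()

# ModelCheckpoint(save_last=True) writes "last.ckpt" into its dirpath;
# "checkpoints/" is an assumed dirpath for this example.
ckpt_cb = pl.callbacks.ModelCheckpoint(dirpath="checkpoints", save_last=True)

resume = None
if args.auto_resume_checkpoint:
    candidate = os.path.join("checkpoints", "last.ckpt")
    if os.path.exists(candidate):
        resume = candidate

trainer = pl.Trainer(callbacks=[ckpt_cb], resume_from_checkpoint=resume)
# trainer.fit(model)
```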