Auto-resume training from checkpoint #9776
Conversation
Great, LGTM!
Co-authored-by: Lysandre Debut <[email protected]>
Great, thanks for taking care of the template too!
I was having a problem getting this to actually resume training on my system, and I had to make three small changes to the new code in trainer_utils.py.
I'm just a novice, so I'm not sure if those tweaks would break anything on other systems. But with these changes I'm able to resume an aborted training from within the same folder.
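For context, here is a minimal sketch of the kind of checkpoint-discovery helper this PR adds to trainer_utils.py. The commenter's three specific tweaks are not shown above; the naming and regex below follow the transformers `checkpoint-xxx` folder convention but should be read as illustrative rather than the exact merged code:

```python
import os
import re

# Checkpoint folders are named checkpoint-<step>, e.g. checkpoint-500.
PREFIX_CHECKPOINT_DIR = "checkpoint"
_re_checkpoint = re.compile(r"^" + PREFIX_CHECKPOINT_DIR + r"-(\d+)$")


def get_last_checkpoint(folder):
    """Return the checkpoint subfolder with the highest step count, or None."""
    checkpoints = [
        path
        for path in os.listdir(folder)
        if _re_checkpoint.search(path) is not None and os.path.isdir(os.path.join(folder, path))
    ]
    if len(checkpoints) == 0:
        return None
    # Compare step numbers as integers so checkpoint-1000 sorts after checkpoint-999.
    return os.path.join(
        folder, max(checkpoints, key=lambda x: int(_re_checkpoint.search(x).groups()[0]))
    )
```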
Oh, I went a bit too fast and all the issues you list are completely valid! You should make a PR with all your changes since you were the one to find the problems and fix them :-)
Got sucked back into work for my day job, but I'll try getting to that soonish. Unrelated, do you guys have an office in DUMBO? I think we're in the same building (at least until we all started working from home).
DUMBO offices are at 20 Jay Street, you should come and say hi for a coffee when the pandemic is over if you're around :-)
Yeah, I run a small animation studio up on the 10th floor. Turns out the fancy GPUs I have for animation are also good for messing around with machine learning :) And so my pandemic side project became teaching my computer to write poetry. Someday when it's safe again I'll definitely stop by.
Haha, small world @jncasey! I'm missing 20 Jay right now. What is your company's name?
We're called Mixtape Club. We moved into the building from Manhattan last January, and were only there for about 10 weeks before everyone started working from home. Now my business partner is the only one who goes in, since he can't bring our whole sound studio back to his apartment.
What does this PR do?
Feature suggested by @madlag.
In the example scripts, change the current behavior so that if checkpoints are present in the `output_dir` passed to the script command, training resumes from there. If the `output_dir` exists and is nonempty but has no checkpoint, the same behavior as before is applied (error). If it exists with checkpoints inside, the last checkpoint is grabbed to resume training from there. If `--overwrite_output_dir` is passed, the folder is destroyed as before.

This avoids the user having to pass `output_dir/checkpoint-xxx` as their model name or path to resume training from a checkpoint, which is a nice improvement. The bad surprise can be if you set that `output_dir` to a trained model you like by mistake, but at the same time, the training resumed from the last checkpoint shouldn't take too long (and will converge to the same model), and interrupting it before the end will not erase the model inside the folder, so I think the pros outweigh the cons.

This tweaks the `run_glue` example script for now; I will expand it to all scripts if accepted.
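As a rough illustration of how an example script like `run_glue.py` can wire this up (a sketch, not the PR's exact diff: `training_args` and `trainer` are assumed to come from the usual example-script setup, and `resume_from_checkpoint` is the argument `Trainer.train` accepts in recent transformers versions):

```python
import os

from transformers.trainer_utils import get_last_checkpoint

# training_args is assumed to be the usual TrainingArguments parsed by the script.
last_checkpoint = None
if (
    os.path.isdir(training_args.output_dir)
    and training_args.do_train
    and not training_args.overwrite_output_dir
):
    last_checkpoint = get_last_checkpoint(training_args.output_dir)
    if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
        # Nonempty output_dir without any checkpoint: keep the old behavior and error out.
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists and is not empty. "
            "Use --overwrite_output_dir to train from scratch."
        )

# trainer is assumed to be a Trainer built earlier in the script; passing the
# checkpoint path (or None) resumes training from the last saved state.
train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
```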