Auto-resume training from checkpoint #9776
Conversation
Great, LGTM!
Co-authored-by: Lysandre Debut <[email protected]>
Great, thanks for taking care of the template too!
I was having a problem getting this to actually resume training on my system, and I had to make three small changes to the new code in trainer_utils.py.
I'm just a novice, so I'm not sure if those tweaks would break anything on other systems. But with these changes I'm able to resume an aborted training from within the same folder.
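For context, here is a minimal sketch of the kind of checkpoint-discovery helper this PR adds to trainer_utils.py. The commenter's three specific tweaks are not shown above; the naming and regex below follow the transformers `checkpoint-xxx` folder convention but should be read as illustrative rather than the exact merged code:

```python
import os
import re

# Checkpoint folders are named checkpoint-<step>, e.g. checkpoint-500.
PREFIX_CHECKPOINT_DIR = "checkpoint"
_re_checkpoint = re.compile(r"^" + PREFIX_CHECKPOINT_DIR + r"-(\d+)$")


def get_last_checkpoint(folder):
    """Return the checkpoint subfolder with the highest step count, or None."""
    checkpoints = [
        path
        for path in os.listdir(folder)
        if _re_checkpoint.search(path) is not None and os.path.isdir(os.path.join(folder, path))
    ]
    if len(checkpoints) == 0:
        return None
    # Compare step numbers as integers so checkpoint-1000 sorts after checkpoint-999.
    return os.path.join(
        folder, max(checkpoints, key=lambda x: int(_re_checkpoint.search(x).groups()[0]))
    )
```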
Oh, I went a bit too fast and all the issues you list are completely valid! You should make a PR with all your changes since you were the one to find the problems and fix them :-)
Got sucked back into work for my day job, but I'll try getting to that soonish. Unrelated, do you guys have an office in DUMBO? I think we're in the same building (at least until we all started working from home).
DUMBO offices are at 20 Jay Street, you should come and say hi for a coffee when the pandemic is over if you're around :-)
Yeah, I run a small animation studio up on the 10th floor. Turns out the fancy GPUs I have for animation are also good for messing around with machine learning :) And so my pandemic side project became teaching my computer to write poetry. Someday when it's safe again I'll definitely stop by.
Haha, small world @jncasey! I'm missing 20 Jay right now. What is your company's name?
We're called Mixtape Club. We moved into the building from Manhattan last January, and were only there for about 10 weeks before everyone started working from home. Now my business partner is the only one who goes in, since he can't bring our whole sound studio back to his apartment.
What does this PR do?
Feature suggested by @madlag.
In the example scripts, change the current behavior so that if checkpoints are present in the `output_dir` passed to the script command, training resumes from there. If the `output_dir` exists and is nonempty but has no checkpoint, the same behavior as before is applied (error). If it exists with checkpoints inside, the last checkpoint is grabbed to resume training from there. If `--overwrite_output_dir` is passed, the folder is destroyed as before.

This avoids the user having to pass `output_dir/checkpoint-xxx` as their model name or path to resume training from a checkpoint, which is a nice improvement. The bad surprise can be if you set that `output_dir` to a trained model you like by mistake, but at the same time, the training resumed from the last checkpoint shouldn't take too long (and will converge to the same model), and interrupting it before the end will not erase the model inside the folder, so I think the pros outweigh the cons.

This tweaks the `run_glue` example script for now; I will expand it to all scripts if accepted.
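As a rough illustration of how an example script like `run_glue.py` can wire this up (a sketch, not the PR's exact diff: `training_args` and `trainer` are assumed to come from the usual example-script setup, and `resume_from_checkpoint` is the argument `Trainer.train` accepts in recent transformers versions):

```python
import os

from transformers.trainer_utils import get_last_checkpoint

# training_args is assumed to be the usual TrainingArguments parsed by the script.
last_checkpoint = None
if (
    os.path.isdir(training_args.output_dir)
    and training_args.do_train
    and not training_args.overwrite_output_dir
):
    last_checkpoint = get_last_checkpoint(training_args.output_dir)
    if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
        # Nonempty output_dir without any checkpoint: keep the old behavior and error out.
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists and is not empty. "
            "Use --overwrite_output_dir to train from scratch."
        )

# trainer is assumed to be a Trainer built earlier in the script; passing the
# checkpoint path (or None) resumes training from the last saved state.
train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
```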