[Fix] Add delay property for checkpointing, refactor loading checkpoint (DeepSpeed Checkpointing Fix 1/n) #8627
Conversation
Add property to delay checkpointing, move loading checkpoint file into the run function to allow deepspeed engine to be loaded
Codecov Report
```diff
@@           Coverage Diff            @@
##           master    #8627    +/-   ##
========================================
- Coverage      93%      88%      -4%
========================================
  Files         168      168
  Lines       13951    13967     +16
========================================
- Hits        12909    12331    -578
- Misses       1042     1636    +594
```
Hello @SeanNaren! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-07-30 09:29:55 UTC
LGTM! I don't see any other obvious solution either.
[Fix] Add delay property for checkpointing, refactor loading checkpoint (DeepSpeed Checkpointing Fix 1/n) (Lightning-AI#8627)

* Add property to delay checkpointing, move loading checkpoint file into the run function to allow deepspeed engine to be loaded
* Add a small test
* Apply suggestions from code review
* Update pytorch_lightning/accelerators/accelerator.py
* Address review
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
What does this PR do?
Related to #8397.
This PR introduces a property on the training type plugin that lets the plugin delay loading the checkpoint until after pre-dispatch, and moves loading the checkpoint file (`resume_start`) into the same block of logic.

After discussions with @awaelchli, I unfortunately couldn't come up with any way to load via the DeepSpeed engine other than introducing this branch into the `run` function. This branch will allow us to call `deepspeed_engine.load_checkpoint` when loading the checkpoint file.

This PR relies on #8347, which will not be included until 1.5, so this should be skipped from 1.4.x. I'll also do a combined CHANGELOG update in #8397 instead of here, as on its own it's confusing!
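A minimal sketch of the idea follows. The property name `restore_checkpoint_after_pre_dispatch` and the surrounding structure are illustrative assumptions, not necessarily the exact code in this diff:

```python
# Illustrative sketch only: restore_checkpoint_after_pre_dispatch, resume_start,
# pre_dispatch, and dispatch are assumed names, not necessarily the PR's exact API.
from typing import Callable


class TrainingTypePlugin:
    @property
    def restore_checkpoint_after_pre_dispatch(self) -> bool:
        # Default behaviour: the checkpoint file is loaded before pre-dispatch.
        return False


class DeepSpeedPlugin(TrainingTypePlugin):
    @property
    def restore_checkpoint_after_pre_dispatch(self) -> bool:
        # The DeepSpeed engine only exists after pre-dispatch, so loading must
        # be delayed until deepspeed_engine.load_checkpoint can be called.
        return True


def run(
    plugin: TrainingTypePlugin,
    resume_start: Callable[[], None],
    pre_dispatch: Callable[[], None],
    dispatch: Callable[[], None],
) -> None:
    """Sketch of the refactored run function: loading the checkpoint file
    (resume_start) now sits in the same block of logic as the delay check."""
    if not plugin.restore_checkpoint_after_pre_dispatch:
        resume_start()  # usual path: restore the checkpoint up front
    pre_dispatch()  # e.g. DeepSpeed initializes its engine here
    if plugin.restore_checkpoint_after_pre_dispatch:
        resume_start()  # delayed path: the engine can now load the checkpoint
    dispatch()


if __name__ == "__main__":
    # Toy usage: with the DeepSpeed plugin, the checkpoint loads after pre-dispatch.
    run(
        DeepSpeedPlugin(),
        resume_start=lambda: print("loading checkpoint file"),
        pre_dispatch=lambda: print("pre-dispatch"),
        dispatch=lambda: print("dispatch"),
    )
```

The key design point is that the plugin, not the trainer, decides when restoration happens, so DeepSpeed can route loading through its own engine without the trainer special-casing it.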
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃