-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add atomic save to checkpoint routine #20011
Conversation
… should make the save more robust to interrupts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Just a few suggestions
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #20011 +/- ##
=========================================
- Coverage 84% 59% -25%
=========================================
Files 427 422 -5
Lines 35504 35407 -97
=========================================
- Hits 29781 20773 -9008
- Misses 5723 14634 +8911 |
Thanks @corwinjoy |
Thanks for the help @awaelchli !
…On Tue, Jun 25, 2024, 1:53 AM awaelchli ***@***.***> wrote:
Thanks @corwinjoy <https://github.com/corwinjoy>
I also made a follow-up issue for the distributed checkpoint saving logic.
—
Reply to this email directly, view it on GitHub
<#20011 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADBF42XZMLPG43N4Z5Z4RR3ZJEVZFAVCNFSM6AAAAABJ2XXZXKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBYGM2DINZVGM>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
What does this PR do?
Upgrade the _atomic_checkpoint routine to use a transaction. This should make saves atomic so that if an interrupt happens during a file save, any existing file will not be corrupted.
The following links the related issue to the PR: #19970
Fixes #19970
I have skipped updating the changelog since it looks like this gets updated as part of the release.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--20011.org.readthedocs.build/en/20011/