-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add error handling for all trainer entry points #8819
Conversation
Hello @daniellepintz! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-08-12 00:25:10 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for working on this @daniellepintz!
on reading through the entry points again, all of fit
/validate
/test
/predict
can raise exceptions before _run
is called. For the error handling to be complete, I think applying the try/catch around each of them directly (with something like _fit_impl
as opposed to _run_impl
) will avoid the risk of missing other exceptions. What do you think?
the accelerator is also calling on_train_end
in the error handling - i think this should be calling accelerator.teardown()
instead
Codecov Report
@@ Coverage Diff @@
## master #8819 +/- ##
=======================================
- Coverage 93% 89% -4%
=======================================
Files 176 176
Lines 14402 14410 +8
=======================================
- Hits 13343 12771 -572
- Misses 1059 1639 +580 |
2248b1b
to
9bd87f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @yifuwang 's comment, the context manager to consolidate the error handling would be really nice. that'll make it less likely to miss handling across the various entry points
d8942aa
to
82ef157
Compare
90311f7
to
dfaf7a1
Compare
thanks for the review @ananthsub and @yifuwang!! I have updated according to comments. Unfortunately I am still unable to test locally due to an error when I run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a bit code duplication with this addition. If I see correctly, the only difference is the call to the trainer.fit/test etc.
Would it make sense to have an intermediate function for interrupt handling like so:
def call_and_handle_interrupt(self, trainer_fn, *args, **kwargs):
try:
...
trainer_fn(*args, **kwargs)
except:
....
and then in fit/test/etc. we do
self._call_and_handle_interrupt(self.fit_impl)
?
@daniellepintz from the CI:
so you can set |
Thanks @ananthsub, I saw that, am just waiting for a resolution on the accelerator issue before pushing another update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daniellepintz thanks for working on this
I noticed that the PR is labelled refactor but I think we should not classify it as such and also add a changelog entry that the entry points are now fully guarded by the exception handling and that the on_keyboard_interrupt() callback hook will be called in all trainer stages.
great work! |
Co-authored-by: ananthsub <[email protected]>
So I am in a bit of a conundrum where the PR says "1 conversation must be resolved before merging" but when I click on the conversation to be resolved it says "We went looking everywhere, but couldn’t find those commits." - this is probably because I force-pushed a commit earlier.. 😅😅😅 Does anyone know how to get around this? I tried Googling it but no luck. |
@daniellepintz I resolved the conversation from the Conversation tab |
What does this PR do?
Before this PR lightning only has error handling for trainer.fit(). This PR moves the error handling to a higher level of abstraction so that it also applies to trainer.validate(), trainer.test(), and trainer.predict().
Fixes #8723
Does your PR introduce any breaking changes? If yes, please list them.
No
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Yes!