
Add error handling for all trainer entry points #8819

Merged · 36 commits · Aug 18, 2021

Conversation

daniellepintz
Contributor

@daniellepintz daniellepintz commented Aug 9, 2021

What does this PR do?

Before this PR, Lightning only had error handling for trainer.fit(). This PR moves the error handling to a higher level of abstraction so that it also applies to trainer.validate(), trainer.test(), and trainer.predict().
Fixes #8723
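The pattern this PR introduces can be sketched in plain Python (a simplified illustration only, not the actual Lightning source; the placeholder bodies and the model argument here are made up for the example):

```python
# Simplified sketch of the pattern this PR introduces (illustration only,
# not the actual Lightning code): every public entry point delegates to a
# private implementation through one shared error-handling wrapper, so
# fit/validate/test/predict all get the same interrupt handling.
class Trainer:
    def _call_and_handle_interrupt(self, trainer_fn, *args, **kwargs):
        try:
            return trainer_fn(*args, **kwargs)
        except KeyboardInterrupt:
            # Graceful shutdown instead of a raw traceback.
            self.interrupted = True

    def fit(self, *args, **kwargs):
        return self._call_and_handle_interrupt(self._fit_impl, *args, **kwargs)

    def validate(self, *args, **kwargs):
        return self._call_and_handle_interrupt(self._validate_impl, *args, **kwargs)

    def _fit_impl(self, model=None):
        return f"fit({model})"

    def _validate_impl(self, model=None):
        return f"validate({model})"
```

Because the wrapper takes the implementation function plus its arguments, each entry point stays a thin one-liner and no entry point can bypass the handler.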

Does your PR introduce any breaking changes? If yes, please list them.

No

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Yes!

@pep8speaks

pep8speaks commented Aug 9, 2021

Hello @daniellepintz! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-08-12 00:25:10 UTC

Contributor

@ananthsub ananthsub left a comment


Thanks for working on this @daniellepintz!

On reading through the entry points again, all of fit/validate/test/predict can raise exceptions before _run is called. For the error handling to be complete, I think applying the try/except around each of them directly (with something like _fit_impl, as opposed to _run_impl) will avoid the risk of missing other exceptions. What do you think?

The accelerator is also calling on_train_end in the error handling; I think this should be calling accelerator.teardown() instead.

cc @carmocca @awaelchli @yifuwang @tchaton
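The teardown point above can be illustrated with a small sketch (hypothetical names, not the real Accelerator API): on an unexpected exception, a stage-agnostic teardown is appropriate for every stage, whereas on_train_end only makes sense after training.

```python
# Sketch of the teardown suggestion (hypothetical names, not the real
# Accelerator API): on an unexpected exception, call a stage-agnostic
# teardown rather than the train-specific on_train_end hook, so that
# validate/test/predict are cleaned up as well.
class FakeAccelerator:
    def __init__(self):
        self.calls = []

    def on_train_end(self):
        self.calls.append("on_train_end")  # only meaningful after fit()

    def teardown(self):
        self.calls.append("teardown")      # valid for every trainer stage

def run_stage(accelerator, stage_fn):
    try:
        return stage_fn()
    except BaseException:
        accelerator.teardown()  # stage-agnostic cleanup, then re-raise
        raise

def failing_predict():
    raise RuntimeError("boom")
```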

@codecov

codecov bot commented Aug 10, 2021

Codecov Report

Merging #8819 (37bb48b) into master (522df2b) will decrease coverage by 4%.
The diff coverage is 97%.

@@           Coverage Diff           @@
##           master   #8819    +/-   ##
=======================================
- Coverage      93%     89%    -4%     
=======================================
  Files         176     176            
  Lines       14402   14410     +8     
=======================================
- Hits        13343   12771   -572     
- Misses       1059    1639   +580     

Contributor

@ananthsub ananthsub left a comment


+1 to @yifuwang's comment; the context manager to consolidate the error handling would be really nice. That will make it less likely to miss handling across the various entry points.
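The context-manager idea referenced here could look roughly like this (a sketch under assumed names, not the code that was merged): every entry point enters the same guard, so no path can miss the error handling.

```python
from contextlib import contextmanager

# Rough sketch of the context-manager consolidation (assumed names, not
# the merged implementation). KeyboardInterrupt is handled first because
# it is a BaseException subclass and would otherwise be swallowed by the
# generic clause below.
@contextmanager
def _handle_errors(on_exception):
    try:
        yield
    except KeyboardInterrupt:
        on_exception("interrupted")
    except BaseException:
        on_exception("failed")
        raise

events = []

def fit():
    with _handle_errors(events.append):
        return "fit ok"

def predict():
    with _handle_errors(events.append):
        raise RuntimeError("bad dataloader")
```

A successful fit() passes through untouched, while a failing predict() records the failure and re-raises, keeping one handler for all stages.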

@daniellepintz
Contributor Author

Thanks for the review @ananthsub and @yifuwang!! I have updated according to the comments. Unfortunately I am still unable to test locally due to an error when I run python -m pytest -v tests/trainer/test_trainer_error_handling.py::test_error_handling_all_stages -> P436961746

Contributor

@awaelchli awaelchli left a comment


There seems to be a bit of code duplication with this addition. If I see it correctly, the only difference is the call to trainer.fit/test, etc.

Would it make sense to have an intermediate function for interrupt handling, like so:

def _call_and_handle_interrupt(self, trainer_fn, *args, **kwargs):
    try:
        ...
        trainer_fn(*args, **kwargs)
    except:
        ...

and then in fit/test/etc. we do

self._call_and_handle_interrupt(self._fit_impl)

?

@carmocca carmocca added the "feature" (Is an improvement or enhancement) and "refactor" labels Aug 10, 2021
@carmocca carmocca added this to the v1.5 milestone Aug 10, 2021
@ananthsub
Contributor

@daniellepintz from the CI:

____________________ test_spawn_predict_return_predictions ____________________

self = <pytorch_lightning.trainer.trainer.Trainer object at 0x000001B977E75460>
trainer_fn = <bound method Trainer._predict_impl of <pytorch_lightning.trainer.trainer.Trainer object at 0x000001B977E75460>>
args = (BoringModel(
  (layer): Linear(in_features=32, out_features=2, bias=True)
), <torch.utils.data.dataloader.DataLoader object at 0x000001B977F9D100>, None, True, None)
kwargs = {}

    def _call_and_handle_interrupt(self, trainer_fn: Callable, *args: Any, **kwargs: Any):
        r"""
        Error handling, intended to be used only for main trainer function entry points (fit, validate, test, predict)
        as all errors should funnel through them
    
        Args:
            trainer_fn: one of (fit, validate, test, predict)
    
            *args/**kwargs: args to be passed to trainer_fn
        """
        try:
>           return trainer_fn(*args, **kwargs)

D:\a\pytorch-lightning\pytorch-lightning\pytorch_lightning\trainer\trainer.py:500: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pytorch_lightning.trainer.trainer.Trainer object at 0x000001B977E75460>
model = BoringModel(
  (layer): Linear(in_features=32, out_features=2, bias=True)
)
dataloaders = <torch.utils.data.dataloader.DataLoader object at 0x000001B977F9D100>
datamodule = None, return_predictions = True, ckpt_path = None

    def _predict_impl(
        self,
        model: Optional["pl.LightningModule"] = None,
        dataloaders: Optional[Union[EVAL_DATALOADERS, LightningDataModule]] = None,
        datamodule: Optional[LightningDataModule] = None,
        return_predictions: Optional[bool] = None,
        ckpt_path: Optional[str] = None,
    ) -> Optional[_PREDICT_OUTPUT]:
        # --------------------
        # SETUP HOOK
        # --------------------
        Trainer._log_api_event("predict")
    
        self.state.fn = TrainerFn.PREDICTING
        self.state.status = TrainerStatus.RUNNING
        self.predicting = True
    
>       self.predict_loop.return_predictions = return_predictions

D:\a\pytorch-lightning\pytorch-lightning\pytorch_lightning\trainer\trainer.py:813: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pytorch_lightning.loops.dataloader.prediction_loop.PredictionLoop object at 0x000001B977F9D850>
return_predictions = True

    @return_predictions.setter
    def return_predictions(self, return_predictions: Optional[bool] = None) -> None:
        # `DDPSpawnPlugin` plugins and derivatives don't support return predictions.
        is_ddp_spawn = isinstance(self.trainer.training_type_plugin, DDPSpawnPlugin)
        if return_predictions and is_ddp_spawn:
>           raise MisconfigurationException(
                "`return_predictions` should be set to `False` when using the `DDPSpawnPlugin` or children class. "
                f"Found {return_predictions} with training_type_plugin {type(self.trainer.training_type_plugin)}."
            )
E           pytorch_lightning.utilities.exceptions.MisconfigurationException: `return_predictions` should be set to `False` when using the `DDPSpawnPlugin` or children class. Found True with training_type_plugin <class 'pytorch_lightning.plugins.training_type.ddp_spawn.DDPSpawnPlugin'>.

D:\a\pytorch-lightning\pytorch-lightning\pytorch_lightning\loops\dataloader\prediction_loop.py:35: MisconfigurationException

https://github.com/PyTorchLightning/pytorch-lightning/blob/938a191406fff5f51fba03fcf824f22d8d23c2e0/pytorch_lightning/trainer/trainer.py#L700-L723

So you can set trainer.predict(..., return_predictions=False) in this test.

@daniellepintz
Contributor Author

Thanks @ananthsub, I saw that; I am just waiting for a resolution on the accelerator issue before pushing another update.

Contributor

@awaelchli awaelchli left a comment


@daniellepintz thanks for working on this

I noticed that the PR is labelled refactor, but I think we should not classify it as such. We should also add a changelog entry noting that the entry points are now fully guarded by the exception handling, and that the on_keyboard_interrupt() callback hook will be called in all trainer stages.

@mergify mergify bot added the "ready" label (PRs ready to be merged) and removed the "has conflicts" label Aug 17, 2021
@daniellepintz daniellepintz changed the title Ensure error handling works across different trainer entry points Add error handling for all trainer entry points Aug 17, 2021
@ananthsub
Contributor

great work!

@daniellepintz
Contributor Author

daniellepintz commented Aug 18, 2021

So I am in a bit of a conundrum: the PR says "1 conversation must be resolved before merging", but when I click on the conversation to be resolved it says "We went looking everywhere, but couldn't find those commits." This is probably because I force-pushed a commit earlier... 😅😅😅

Does anyone know how to get around this? I tried Googling it but no luck.

@ananthsub
Contributor

> So I am in a bit of a conundrum where the PR says "1 conversation must be resolved before merging" [...] Does anyone know how to get around this?

@daniellepintz I resolved the conversation from the Conversation tab

@ananthsub ananthsub enabled auto-merge (squash) August 18, 2021 01:41
@ananthsub ananthsub merged commit bd13d39 into Lightning-AI:master Aug 18, 2021
@daniellepintz daniellepintz deleted the error_handling branch August 18, 2021 02:11
Successfully merging this pull request may close these issues.

[RFC] Ensure error handling is supported across all Trainer entry points