Runtime- and Assertionerror handling in trainer.run_train #6807
Comments
@tchaton @awaelchli @carmocca @justusschock judging from multiple threads, this error handling is very fragile.
What are your thoughts on removing the error handling for everything that's not a keyboard interrupt?
@ananthsub, agreed. I'm actually preparing a PR to remove the part that runs on_train_end in the finally block.
So we want to call This could be greatly improved.
@ananthsub I feel we definitely should change this. Something like:

    on_train_end_run = False
    try:
        training(...)
        on_train_end_run = True
        on_train_end(...)
    except KeyboardInterrupt:
        # current error handling
    except Exception as e:
        new_e = None
        try:
            if not on_train_end_run:
                on_train_end(...)
        except Exception as new_e:
            raise new_e
        finally:
            raise e

This is obviously not perfect, but it should give an idea of what I'm talking about. The only thing to check is that we currently also have SLURM signal handlers looking at program exit to allow resubmission. I'm not sure off the top of my head how they'd be affected.
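A runnable variant of the pattern in the comment above might look like the following. It is a sketch using only plain Python: `run_training`, `training`, and `on_train_end` are hypothetical stand-ins rather than Lightning APIs, and the `KeyboardInterrupt` branch is only a placeholder for the existing handling. One deliberate difference from the pseudocode: if the cleanup hook itself fails, the hook error is raised with the training error chained as its cause, so both appear in the traceback.

```python
def run_training(training, on_train_end):
    """Run `training`, call `on_train_end` once, and never swallow errors."""
    on_train_end_run = False
    try:
        training()
        on_train_end_run = True
        on_train_end()
    except KeyboardInterrupt:
        # Placeholder for the existing graceful-shutdown handling.
        if not on_train_end_run:
            on_train_end()
    except Exception as exc:
        if not on_train_end_run:
            try:
                on_train_end()
            except Exception as hook_exc:
                # Surface the hook failure, keeping the training error as its cause.
                raise hook_exc from exc
        # Re-raise the original error so the calling script stops.
        raise
```

With this shape, a `RuntimeError` raised inside `training` still reaches the caller after `on_train_end` has run. How SLURM signal handlers interact with it is not addressed here.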
Hey @justusschock. Under
Also add a temporay workaround to Lightning-AI#6807
🐛 Bug
RuntimeErrors (e.g. out-of-memory errors) and AssertionErrors are ignored when running trainer.fit(...). The error is not raised correctly and the script continues, which can waste a lot of time in some cases.
While the error is printed correctly, it is not re-raised, so any code after the fit call still runs.
https://github.com/PyTorchLightning/pytorch-lightning/blob/bb9ace43334ad50e3758d9cff08ad34216c7d4da/pytorch_lightning/trainer/trainer.py#L621-L634
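The failure mode can be illustrated without Lightning. The sketch below is not the trainer's actual code: `fit_swallowing` and `failing_step` are hypothetical names, and the handler is assumed to print the traceback for `RuntimeError`/`AssertionError` and then return normally, which matches the behavior described above.

```python
import traceback


def fit_swallowing(train_step):
    """Hypothetical fit loop that prints errors but does not re-raise them."""
    try:
        train_step()
    except (RuntimeError, AssertionError):
        # The traceback is printed, but the exception stops here.
        traceback.print_exc()
    finally:
        # Placeholder for the cleanup hook.
        pass


def failing_step():
    raise RuntimeError("CUDA out of memory (simulated)")


fit_swallowing(failing_step)
# Execution reaches this line even though training failed.
print("script continues after the failed fit")
```

Anything placed after the fit call, such as a long evaluation or a hyperparameter sweep, would start from a model that never finished training.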
Please reproduce using the BoringModel
To Reproduce
Use following BoringModel and post here
Expected behavior
The script should stop, once the trainer has cleaned up, when an AssertionError or RuntimeError occurs during training.
Environment
Note: Bugs with code are solved faster! Colab Notebook should be made public!
IDE: Please, use our python bug_report_model.py template.
Colab Notebook: Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
Additional context
Simply saving the exception and raising it after the trainer called the final hook should be sufficient.
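As a rough sketch of that suggestion, assuming a hypothetical `fit_and_reraise` function with `train_step` and `on_train_end` callables (none of these are Lightning names), the error is stored, the final hook runs, and the stored error is raised afterwards:

```python
def fit_and_reraise(train_step, on_train_end):
    """Hypothetical sketch: remember the training error, clean up, then re-raise."""
    saved_exc = None
    try:
        train_step()
    except (RuntimeError, AssertionError) as exc:
        # Keep the error instead of only printing it.
        saved_exc = exc
    # The final hook runs whether or not training failed.
    on_train_end()
    if saved_exc is not None:
        raise saved_exc
```

One caveat with this simple form: if `on_train_end` itself raises, the saved training error is never raised, so a real implementation would probably need to report both.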