RuntimeError and AssertionError handling in trainer.run_train #6807

Closed
mibaumgartner opened this issue Apr 3, 2021 · 5 comments · Fixed by #6864
Labels
bug Something isn't working
help wanted Open to be worked on

Comments

@mibaumgartner
Contributor

🐛 Bug

RuntimeError (e.g. out-of-memory errors) and AssertionError exceptions are ignored when running trainer.fit(...): the error is not re-raised and the script continues.
This can waste a lot of time in some cases:

# prepare code
...

# train (the model raises an error, e.g. out of memory, which the trainer does not re-raise)
trainer.fit(...)

# execution continues here
# time-intensive computation / evaluation / prediction
...

While the error is printed correctly, it is not re-raised, so the script continues.
https://github.com/PyTorchLightning/pytorch-lightning/blob/bb9ace43334ad50e3758d9cff08ad34216c7d4da/pytorch_lightning/trainer/trainer.py#L621-L634
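
For illustration only, a minimal sketch of the pattern the linked lines describe (training() and on_train_end() are hypothetical placeholders, not the actual Trainer internals): the exception is printed so it stays visible, but it is never re-raised, so fit() returns normally.

import traceback

def run_train(training, on_train_end):
    try:
        training()
    except (RuntimeError, AssertionError):
        # the traceback is printed for visibility ...
        traceback.print_exc()
        # ... but the exception is swallowed here, so the caller never sees it
    finally:
        on_train_end()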

Please reproduce using the BoringModel

To Reproduce

Use the following BoringModel and post it here

Expected behavior

The script should stop, after the trainer has cleaned up, when an AssertionError or RuntimeError occurs during training.

Environment

Note: Bugs with code are solved faster! Colab Notebooks should be made public!

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

Simply saving the exception and re-raising it after the trainer has called the final hook should be sufficient.
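
A minimal sketch of that idea, again with hypothetical training()/on_train_end() placeholders: the exception is stored, cleanup runs, and the stored exception is re-raised afterwards so the calling script stops.

def run_train(training, on_train_end):
    caught = None
    try:
        training()
    except (RuntimeError, AssertionError) as e:
        caught = e  # remember the error instead of swallowing it
    finally:
        on_train_end()  # let the trainer clean up first
    if caught is not None:
        raise caught  # then surface the original error to the caller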

mibaumgartner added the bug and help wanted labels on Apr 3, 2021
@ananthsub
Contributor

@tchaton @awaelchli @carmocca @justusschock from multiple threads, this error-handling is very fragile.

  • calling on_train_end when the train loop ends currently invokes the checkpointing callback, which is not guaranteed to work (e.g. the monitored value is only logged during validation, but the exception occurs during training, so there is no monitor metric available for the callback)
  • this handling doesn't account for the case where only a subset of ranks fails in distributed training

What are your thoughts on removing the error handling for everything that's not a KeyboardInterrupt?
cc @shuyingsunshine21
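
A minimal sketch of what that narrowing could look like (training() and on_train_end() are hypothetical placeholders, not the actual Trainer internals): only KeyboardInterrupt gets the graceful-shutdown treatment, everything else propagates to the caller untouched.

try:
    training()
except KeyboardInterrupt:
    on_train_end()  # graceful shutdown on Ctrl-C, roughly as today
else:
    on_train_end()  # normal completion
# any other exception propagates to the caller unchanged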

@shuyingsunshine21
Contributor

@ananthsub, agreed; I'm actually preparing a PR to remove the part that runs on_train_end in the finally block.

@carmocca
Contributor

carmocca commented Apr 6, 2021

So we want to call on_train_end no matter what, but still raise the exception. What if we re-raise it after on_train_end? It's tricky because on_train_end can also raise an exception, which would hide the original one. This is why the except block with print_exc is there: to give visibility to the original exception.

This could be greatly improved.
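
For reference, a minimal sketch (hypothetical training()/on_train_end() placeholders) of how explicit exception chaining keeps both tracebacks visible instead of letting a failure in on_train_end hide the original one:

try:
    training()
except Exception as original:
    try:
        on_train_end()
    except Exception as cleanup_error:
        # chain the two: the traceback shows the cleanup error
        # together with the original training error that triggered it
        raise cleanup_error from original
    raise  # cleanup succeeded: re-raise the original error unchanged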

@justusschock
Member

justusschock commented Apr 6, 2021

@ananthsub I feel we definitely should change this.
@carmocca regarding the exception from on_train_end: can't we raise that one and then raise the other one in a finally block?

Something like:

on_train_end_run = False
try:
    training(...)
    on_train_end_run = True
    on_train_end(...)
except KeyboardInterrupt:
    ...  # current KeyboardInterrupt handling
except Exception as e:
    try:
        if not on_train_end_run:
            on_train_end(...)
    except Exception as new_e:
        raise new_e
    finally:
        raise e

This is obviously not perfect, but it should give an idea of what I'm talking about.

The only thing to check: we currently also have SLURM signal handlers watching for program exit to allow resubmission. Off the top of my head, I'm not sure how they'd be affected.

@tchaton
Contributor

tchaton commented Apr 9, 2021

(quoting @justusschock's suggestion above)

Hey @justusschock.

Under except KeyboardInterrupt:, we should also use this logic:

    try:
        if not on_train_end_run:
            on_train_end(...)
    except Exception as new_e:
        raise new_e
    finally:
        raise e
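
A consolidated sketch of the combined idea (hypothetical training()/on_train_end() placeholders): both the KeyboardInterrupt branch and the generic Exception branch attempt the cleanup hook and then re-raise the original error; a failure inside on_train_end shows up as the chained __context__ of that error.

on_train_end_run = False
try:
    training(...)
    on_train_end_run = True
    on_train_end(...)
except (KeyboardInterrupt, Exception) as e:
    try:
        if not on_train_end_run:
            on_train_end(...)
    finally:
        raise e  # re-raise the original; a cleanup failure is attached as __context__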

ceshine added commits to veritable-tech/pytorch-lightning that referenced this issue on Apr 9, 2021