RuntimeError and AssertionError handling in trainer.run_train #6807

Closed
mibaumgartner opened this issue Apr 3, 2021 · 5 comments · Fixed by #6864
Labels
bug Something isn't working
help wanted Open to be worked on

Comments

@mibaumgartner
Contributor

🐛 Bug

RuntimeError (e.g. out-of-memory errors) and AssertionError exceptions are ignored when running trainer.fit(...): the error is not re-raised and the script continues.
This can waste a lot of time in some cases:

# prepare code
...

# train (the model raises an error, e.g. out of memory, which the trainer does not re-raise)
trainer.fit(...)

# execution continues here
# time-intensive computation / evaluation / prediction
...

While the error is printed correctly, it is not re-raised, so the script continues.
https://github.com/PyTorchLightning/pytorch-lightning/blob/bb9ace43334ad50e3758d9cff08ad34216c7d4da/pytorch_lightning/trainer/trainer.py#L621-L634
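
For illustration only, a minimal sketch of the pattern the linked lines describe (training() and on_train_end() are hypothetical placeholders, not the actual Trainer internals): the exception is printed so it stays visible, but it is never re-raised, so fit() returns normally.

import traceback

def run_train(training, on_train_end):
    try:
        training()
    except (RuntimeError, AssertionError):
        # the traceback is printed for visibility ...
        traceback.print_exc()
        # ... but the exception is swallowed here, so the caller never sees it
    finally:
        on_train_end()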

Please reproduce using the BoringModel

To Reproduce

Use the following BoringModel and post it here

Expected behavior

The script should stop, after the trainer has cleaned up, when an AssertionError or RuntimeError occurs during training.

Environment

Note: Bugs with code are solved faster! Colab Notebooks should be made public!

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

Simply saving the exception and re-raising it after the trainer has called the final hook should be sufficient.
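
A minimal sketch of that idea, again with hypothetical training()/on_train_end() placeholders: the exception is stored, cleanup runs, and the stored exception is re-raised afterwards so the calling script stops.

def run_train(training, on_train_end):
    caught = None
    try:
        training()
    except (RuntimeError, AssertionError) as e:
        caught = e  # remember the error instead of swallowing it
    finally:
        on_train_end()  # let the trainer clean up first
    if caught is not None:
        raise caught  # then surface the original error to the caller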

mibaumgartner added the bug and help wanted labels on Apr 3, 2021
@ananthsub
Contributor

@tchaton @awaelchli @carmocca @justusschock from multiple threads, this error-handling is very fragile.

  • calling on_train_end when the train loop ends currently invokes the checkpointing callback, which is not guaranteed to work (e.g. the monitored value is only logged during validation, but the exception occurs during training, so there is no monitor metric available for the callback)
  • this handling doesn't account for the case where only a subset of ranks fails in distributed training

What are your thoughts on removing the error handling for everything that's not a KeyboardInterrupt?
cc @shuyingsunshine21
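
A minimal sketch of what that narrowing could look like (training() and on_train_end() are hypothetical placeholders, not the actual Trainer internals): only KeyboardInterrupt gets the graceful-shutdown treatment, everything else propagates to the caller untouched.

try:
    training()
except KeyboardInterrupt:
    on_train_end()  # graceful shutdown on Ctrl-C, roughly as today
else:
    on_train_end()  # normal completion
# any other exception propagates to the caller unchanged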

@shuyingsunshine21
Contributor

@ananthsub, agreed; I'm actually preparing a PR to remove the part that runs on_train_end in the finally block.

@carmocca
Contributor

carmocca commented Apr 6, 2021

So we want to call on_train_end no matter what, but still raise the exception. What if we re-raise it after on_train_end? It's tricky because on_train_end can also raise an exception, which would hide the original one. This is why the except block with print_exc is there: to give visibility to the original exception.

This could be greatly improved.
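
For reference, a minimal sketch (hypothetical training()/on_train_end() placeholders) of how explicit exception chaining keeps both tracebacks visible instead of letting a failure in on_train_end hide the original one:

try:
    training()
except Exception as original:
    try:
        on_train_end()
    except Exception as cleanup_error:
        # chain the two: the traceback shows the cleanup error
        # together with the original training error that triggered it
        raise cleanup_error from original
    raise  # cleanup succeeded: re-raise the original error unchanged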

@justusschock
Member

justusschock commented Apr 6, 2021

@ananthsub I feel we definitely should change this.
@carmocca regarding the exception from on_train_end: can't we raise that one and then raise the other one in a finally block?

Something like:

on_train_end_run = False
try:
    training(...)
    on_train_end_run = True
    on_train_end(...)
except KeyboardInterrupt:
    ...  # current KeyboardInterrupt handling
except Exception as e:
    try:
        if not on_train_end_run:
            on_train_end(...)
    except Exception as new_e:
        raise new_e
    finally:
        raise e

This is obviously not perfect, but it should give an idea of what I'm talking about.

The only thing to check: we currently also have SLURM signal handlers watching for program exit to allow resubmission. Off the top of my head, I'm not sure how they'd be affected.

@tchaton
Contributor

tchaton commented Apr 9, 2021

(quoting @justusschock's suggestion above)

Hey @justusschock.

Under except KeyboardInterrupt:, we should also use this logic:

    try:
        if not on_train_end_run:
            on_train_end(...)
    except Exception as new_e:
        raise new_e
    finally:
        raise e
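
A consolidated sketch of the combined idea (hypothetical training()/on_train_end() placeholders): both the KeyboardInterrupt branch and the generic Exception branch attempt the cleanup hook and then re-raise the original error; a failure inside on_train_end shows up as the chained __context__ of that error.

on_train_end_run = False
try:
    training(...)
    on_train_end_run = True
    on_train_end(...)
except (KeyboardInterrupt, Exception) as e:
    try:
        if not on_train_end_run:
            on_train_end(...)
    finally:
        raise e  # re-raise the original; a cleanup failure is attached as __context__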

ceshine added commits to veritable-tech/pytorch-lightning that referenced this issue on Apr 9, 2021