EarlyStopping is based upon callback.on_train_epoch_end, not callback.on_validation_epoch_end #9151

turian · 2021-08-26T22:47:44Z

🐛 Bug

EarlyStopping patience is supposed to be based on callback.on_validation_epoch_end. The documentation states: "It must be noted that the patience parameter counts the number of validation epochs with no improvement, and not the number of training epochs. Therefore, with parameters check_val_every_n_epoch=10 and patience=3, the trainer will perform at least 40 training epochs before being stopped."

However, if you set check_val_every_n_epoch=10 and patience=3, you will get a crash after the first training epoch because of callback.on_train_epoch_end:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/workspace/project/heareval/predictions/runner.py", line 75, in <module>
    runner()
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/workspace/project/heareval/predictions/runner.py", line 70, in runner
    task_path, scene_embedding_size, timestamp_embedding_size, gpus
  File "/workspace/project/heareval/predictions/task_predictions.py", line 764, in task_predictions
    gpus=gpus,
  File "/workspace/project/heareval/predictions/task_predictions.py", line 646, in task_predictions_train
    trainer.fit(predictor, train_dataloader, valid_dataloader)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
    self._dispatch()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_training(self)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
    return self._run_train()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 118, in run
    output = self.on_run_end()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 235, in on_run_end
    self._on_train_epoch_end_hook(processed_outputs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 276, in _on_train_epoch_end_hook
    trainer_hook(processed_epoch_output)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/callback_hook.py", line 109, in on_train_epoch_end
    callback.on_train_epoch_end(self, self.lightning_module)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/early_stopping.py", line 170, in on_train_epoch_end
    self._run_early_stopping_check(trainer)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/early_stopping.py", line 185, in _run_early_stopping_check
    logs
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/early_stopping.py", line 134, in _validate_condition_metric
    raise RuntimeError(error_msg)
RuntimeError: Early stopping conditioned on metric `val_event_onset_200ms_fms` which is not available. Pass in or modify your `EarlyStopping` callback to use any of the following: `train_loss`
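
The failure mode in this traceback can be illustrated without Lightning. The following is a minimal sketch, not Lightning's implementation: `ToyEarlyStopping`, `run_toy_fit`, and the metric name `val_metric` are hypothetical names made up for illustration. It assumes only what the traceback shows, namely that the check runs at the end of a training epoch and raises if the monitored metric has not been logged yet.

```python
class ToyEarlyStopping:
    """Simplified stand-in for an early-stopping callback (not Lightning's code)."""

    def __init__(self, monitor):
        self.monitor = monitor

    def check(self, logged_metrics):
        # Mirrors the failing step: the monitored key must already be logged.
        if self.monitor not in logged_metrics:
            available = "`, `".join(logged_metrics)
            raise RuntimeError(
                f"Early stopping conditioned on metric `{self.monitor}` "
                f"which is not available. Available: `{available}`"
            )


def run_toy_fit(max_epochs, check_val_every_n_epoch, callback, check_on_train_epoch_end):
    """Return the epochs at which the callback check ran (hypothetical loop)."""
    logged = {}
    checked_at = []
    for epoch in range(1, max_epochs + 1):
        logged["train_loss"] = 1.0 / epoch  # logged after every training epoch
        ran_validation = epoch % check_val_every_n_epoch == 0
        if ran_validation:
            logged["val_metric"] = 0.5  # only exists once validation has run
        if check_on_train_epoch_end or ran_validation:
            callback.check(logged)
            checked_at.append(epoch)
    return checked_at


callback = ToyEarlyStopping(monitor="val_metric")
try:
    # Checking after every training epoch fails at epoch 1: nothing validated yet.
    run_toy_fit(20, 10, callback, check_on_train_epoch_end=True)
except RuntimeError as err:
    print(err)

# Checking only after validation epochs runs the check at epochs 10 and 20.
print(run_toy_fit(20, 10, callback, check_on_train_epoch_end=False))
```

In this toy loop the only metric available after the first training epoch is `train_loss`, which matches the list of available metrics in the error message above.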

To Reproduce

BoringModel replication:

https://colab.research.google.com/drive/1MsMGM7Wsi6wJ50cIhn1jvxOaVg8z_Ypl#scrollTo=Flyi--SpvsJN

Expected behavior

The early stopping check should run only at the end of validation epochs, not at the end of training epochs.
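
The "at least 40 training epochs" figure from the documentation quote can be checked with a small simulation. This is a sketch of the documented counting rule under simplifying assumptions (lower metric is better, no min_delta), not the library's code; `first_stop_epoch` is a hypothetical helper.

```python
def first_stop_epoch(val_metrics, check_val_every_n_epoch, patience):
    """Return the training epoch at which stopping triggers, or None.

    val_metrics holds one monitored value per validation run (lower is better).
    """
    best = float("inf")
    wait_count = 0
    for run_index, value in enumerate(val_metrics, start=1):
        if value < best:
            best = value
            wait_count = 0
        else:
            wait_count += 1  # one more validation run without improvement
        if wait_count >= patience:
            # Validation run k happens at training epoch k * check_val_every_n_epoch.
            return run_index * check_val_every_n_epoch
    return None


# Worst case: the metric never improves after the first validation run.
print(first_stop_epoch([1.0] * 10, check_val_every_n_epoch=10, patience=3))  # 40
```

The first validation run (epoch 10) sets the baseline, and the three runs without improvement fall at epochs 20, 30, and 40, so stopping cannot happen before epoch 40.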

Environment

CUDA:
    GPU:
    available: False
    version: None
Packages:
    numpy: 1.19.5
    pyTorch_debug: False
    pyTorch_version: 1.9.0
    pytorch-lightning: 1.4.1
    tqdm: 4.62.0
System:
    OS: Darwin
    architecture:
        64bit
    processor: i386
    python: 3.9.6
    version: Darwin Kernel Version 19.6.0: Tue Jun 22 19:49:55 PDT 2021; root:xnu-6153.141.35~1/RELEASE_X86_64

carmocca · 2021-08-27T01:07:25Z

Hi @turian! Thanks for reporting this.

You should be able to avoid this by setting EarlyStopping(check_on_train_epoch_end=False)

turian · 2021-08-27T08:57:19Z

@carmocca Oh, interesting. Okay, the docs could definitely also explain this better :)

tchaton · 2021-08-27T18:18:24Z

Dear @turian,

Would you mind making a PR to improve the documentation's clarity on this matter?

Best,
T.C

turian · 2021-08-29T02:22:48Z

@tchaton When @carmocca's change is merged, I am happy to document the new behavior and the gotcha.

tchaton · 2021-08-29T17:32:37Z

Hey @turian,

Please feel free to make a PR with a documentation update once it is merged.

Best.
T.C

turian added the bug and help wanted labels Aug 26, 2021

carmocca self-assigned this Aug 27, 2021

carmocca added the callback label and removed the help wanted label Aug 27, 2021

carmocca added this to the v1.4.x milestone Aug 27, 2021

carmocca mentioned this issue Aug 27, 2021 in "Disable {save,check}_on_train_epoch_end with check_val_every_n_epoch>1" #9156 (merged)

carmocca closed this as completed in #9156 Sep 3, 2021
