Clean up #2467
Merged
Changes from all commits
Commits (49)
All 49 commits are by williamFalcon:

Fixes #2455 (27 commits): b32f6d6, 59dff54, 4d2c127, 2ab5928, 0264783, 80988a3, f30358c, c2afd05, b3e5cfb, e399545, 77c5daa, 9874b5e, beeee3a, f8736b5, 0f70120, 26936bb, 4610f68, e0ddc90, f113088, cc8d1cd, fc1254b, 4492804, c59df13, bea5171, 6d2e0c5, ffa65ad, 7d5af1c

added early stop tpu test (22 commits): c907e36, ce37587, 6c77aef, 6cd4fdc, 7fdc7ec, 7879fe2, 3d77c36, 2ff19ba, 57b601b, 9dcc73e, 82df22e, fafe7af, 5af7f69, fef08e2, b4bbe1c, 7f711a4, 51d4740, 58b66bc, 50b5874, c75e71b, 71ab1f6, 43fa463
@@ -9,13 +9,22 @@
 import numpy as np
 import torch
+import torch.distributed as dist

 from pytorch_lightning import _logger as log
 from pytorch_lightning.callbacks.base import Callback
 from pytorch_lightning.utilities import rank_zero_warn

 torch_inf = torch.tensor(np.Inf)

+try:
+    import torch_xla
+    import torch_xla.core.xla_model as xm
+except ImportError:
+    XLA_AVAILABLE = False
+else:
+    XLA_AVAILABLE = True
+

 class EarlyStopping(Callback):
     r"""
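The try/except block added above is the usual optional-dependency probe: attempt the torch_xla import once at module load and record the result in XLA_AVAILABLE instead of failing on machines without TPU support. Below is a minimal sketch of how such a flag is typically consumed; fetch_stop_signal and its reduction choice are illustrative assumptions, not code from this PR.

# Sketch: consuming an optional-import flag like XLA_AVAILABLE.
# fetch_stop_signal is a hypothetical helper, not part of the PR.
try:
    import torch_xla.core.xla_model as xm
except ImportError:
    XLA_AVAILABLE = False
else:
    XLA_AVAILABLE = True


def fetch_stop_signal(should_stop: bool) -> bool:
    """Return the stop decision, reduced across TPU cores when XLA is present."""
    if not XLA_AVAILABLE:
        # No torch_xla installed: nothing to synchronize.
        return should_stop
    # mesh_reduce gathers a Python value from every TPU process and applies
    # the given reduction; here every core must vote for stopping.
    votes = xm.mesh_reduce("early_stop_votes", int(should_stop), sum)
    return votes == xm.xrt_world_size()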
@@ -138,17 +147,38 @@ def _run_early_stopping_check(self, trainer, pl_module):

         current = logs.get(self.monitor)
         if not isinstance(current, torch.Tensor):
-            current = torch.tensor(current)
+            current = torch.tensor(current, device=pl_module.device)

-        if self.monitor_op(current - self.min_delta, self.best_score):
+        if self.monitor_op(current - self.min_delta, self.best_score.to(pl_module.device)):
             self.best_score = current
             self.wait_count = 0
         else:
             self.wait_count += 1
-            if self.wait_count >= self.patience:
+            should_stop = self.wait_count >= self.patience
+
+            if bool(should_stop):
                 self.stopped_epoch = trainer.current_epoch
                 trainer.should_stop = True
+
+        # stop every ddp process if any world process decides to stop
+        self._stop_distributed_training(trainer, pl_module)
+
+    def _stop_distributed_training(self, trainer, pl_module):
+
+        # in ddp make sure all processes stop when one is flagged
+        if trainer.use_ddp or trainer.use_ddp2:
+            stop = torch.tensor(int(trainer.should_stop), device=pl_module.device)
+            dist.all_reduce(stop, op=dist.reduce_op.SUM)
+            dist.barrier()

Review comment on the dist.barrier() line:
@williamFalcon Is a barrier needed after an all reduce?

+            trainer.should_stop = stop == trainer.world_size
+
+        # if trainer.use_tpu:
+        #     stop = torch.tensor(int(trainer.should_stop), device=pl_module.device)
+        #     xm.all_reduce('sum', [stop])
+        #     print(type(stop))
+        #     torch_xla.core.xla_model.rendezvous("pl.EarlyStoppingCallback.stop_distributed_training_check")
+        #     trainer.should_stop = stop.item() == trainer.world_size

     def on_train_end(self, trainer, pl_module):
         if self.stopped_epoch > 0 and self.verbose > 0:
             rank_zero_warn('Displayed epoch numbers by `EarlyStopping` start from "1" until v0.6.x,'
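The heart of the change is the DDP branch above: every rank contributes its local should_stop flag to a SUM all_reduce, and the flag is only kept if the reduced value equals world_size, i.e. every rank agrees (comparing with > 0 instead would stop as soon as any single rank raises the flag, which is what the code comment describes). Because all_reduce with the default async_op=False blocks each rank until the collective completes, the extra dist.barrier() the inline review comment asks about appears redundant here. A self-contained sketch of the same pattern, runnable on CPU with the gloo backend; the process-group bootstrap and the _sync_stop helper name are illustrative assumptions, not code from this PR:

# Minimal sketch of the DDP stop-flag sync performed by
# _stop_distributed_training, runnable on CPU with the gloo backend.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _sync_stop(local_should_stop: bool, world_size: int) -> bool:
    # Every rank must call all_reduce, including ranks that do not want to
    # stop, otherwise the collective hangs waiting for the missing ranks.
    stop = torch.tensor(int(local_should_stop))
    dist.all_reduce(stop, op=dist.ReduceOp.SUM)
    # The PR requires unanimity (sum == world_size). Using `stop.item() > 0`
    # here would instead stop as soon as any single rank raises the flag.
    return stop.item() == world_size


def _worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Pretend only rank 0 hit its early-stopping patience this epoch.
    local_should_stop = rank == 0
    should_stop = _sync_stop(local_should_stop, world_size)
    print(f"rank {rank}: local={local_should_stop} synced={should_stop}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(_worker, args=(world_size,), nprocs=world_size)

Running this prints synced=False on both ranks: with only rank 0 flagging, the reduced sum is 1 rather than 2, so the unanimous rule keeps training going, whereas the > 0 variant would stop both ranks.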
Review comment on the new _stop_distributed_training helper:
The function name is misleading. This does not stop training, it just updates the trainer.should_stop state.
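One way to address that comment, sketched below, is to keep the logic but rename the helper after what it actually does. The _sync_should_stop name is hypothetical, and the body assumes the torch and torch.distributed imports added at the top of the file.

# Hypothetical rename following the review feedback: the helper only
# reconciles trainer.should_stop across ranks, so name it accordingly.
# Relies on the `import torch` / `import torch.distributed as dist`
# statements added earlier in this file.
def _sync_should_stop(self, trainer, pl_module):
    """Reduce per-rank stop flags and update trainer.should_stop in place."""
    if trainer.use_ddp or trainer.use_ddp2:
        stop = torch.tensor(int(trainer.should_stop), device=pl_module.device)
        dist.all_reduce(stop, op=dist.ReduceOp.SUM)
        # Barrier omitted here; see the inline review question above.
        trainer.should_stop = stop.item() == trainer.world_size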