Clean up #2467
Conversation
Codecov Report
@@           Coverage Diff            @@
##           master    #2467    +/-   ##
=========================================
- Coverage      89%       89%     -0%
=========================================
  Files          69        69
  Lines        5518      5533     +15
=========================================
+ Hits         4889      4898      +9
- Misses        629       635      +6
@zcain117 Any ideas here? (https://github.com/PyTorchLightning/pytorch-lightning/pull/2467/checks?check_run_id=832554097#step:10:236) I'm just getting familiar with the xla details, but is the problem that we are comparing an xla tensor against a non-xla tensor?
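A minimal sketch of the suspected mismatch, assuming torch_xla is installed; the variable names and values below are illustrative, not taken from the failing test:

```python
import torch
import torch_xla.core.xla_model as xm  # requires the torch_xla package

# Hypothetical illustration: one operand lives on the XLA device, the other on CPU.
device = xm.xla_device()
stop_flag = torch.tensor(1, device=device)  # XLA tensor
expected = torch.tensor(1)                  # CPU tensor

# Comparing tensors on different devices can fail; moving both operands
# to the same device before comparing sidesteps the mismatch.
same = stop_flag == expected.to(device)
```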
if trainer.use_ddp or trainer.use_ddp2:
    stop = torch.tensor(int(trainer.should_stop), device=pl_module.device)
    dist.all_reduce(stop, op=dist.reduce_op.SUM)
    dist.barrier()
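A self-contained sketch of the same stop-flag pattern, assuming torch.distributed has already been initialized; the helper name sync_should_stop is hypothetical, and it uses the newer dist.ReduceOp spelling rather than the deprecated dist.reduce_op:

```python
import torch
import torch.distributed as dist

def sync_should_stop(should_stop: bool, device: torch.device) -> bool:
    """Share a local stop decision across all DDP ranks.

    If any rank requests a stop, every rank returns True. all_reduce is
    itself a collective, so each process blocks here until all ranks
    have contributed their flag.
    """
    stop = torch.tensor(int(should_stop), device=device)
    dist.all_reduce(stop, op=dist.ReduceOp.SUM)
    return bool(stop.item() > 0)
```

Since all_reduce already synchronizes the participating ranks, an explicit dist.barrier() afterwards is usually redundant, though harmless.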
@williamFalcon Is a barrier needed after an all reduce?
self.stopped_epoch = trainer.current_epoch
trainer.should_stop = True

# stop every ddp process if any world process decides to stop
self._stop_distributed_training(trainer, pl_module)
The function name is misleading. This does not stop training; it just updates the trainer.should_stop state.
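One possible rename along the lines of this comment; the method name _sync_should_stop and its exact body are hypothetical, not part of the PR:

```python
import torch
import torch.distributed as dist

def _sync_should_stop(self, trainer, pl_module):
    # Hypothetical rename: this only reconciles trainer.should_stop across
    # ranks; it does not itself terminate the training loop.
    if trainer.use_ddp or trainer.use_ddp2:
        stop = torch.tensor(int(trainer.should_stop), device=pl_module.device)
        dist.all_reduce(stop, op=dist.ReduceOp.SUM)
        trainer.should_stop = bool(stop.item() > 0)
```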
Fixes #1838
We have TPU tests now!