
Clean up #2467

Merged
merged 49 commits into master from ess on Jul 3, 2020

Conversation

williamFalcon (Contributor) commented Jul 2, 2020

Fixes #1838

We have TPU tests now!

@williamFalcon changed the title from "Clean up ES" to "Clean up" on Jul 2, 2020
@mergify bot requested a review from a team on July 2, 2020 14:25
@Borda added the feature (Is an improvement or enhancement) and ci (Continuous Integration) labels on Jul 2, 2020
@Borda added this to the 0.8.x milestone on Jul 2, 2020
codecov bot commented Jul 2, 2020

Codecov Report

Merging #2467 into master will decrease coverage by 0%.
The diff coverage is 61%.

@@          Coverage Diff           @@
##           master   #2467   +/-   ##
======================================
- Coverage      89%     89%   -0%     
======================================
  Files          69      69           
  Lines        5518    5533   +15     
======================================
+ Hits         4889    4898    +9     
- Misses        629     635    +6     

williamFalcon (Contributor, Author)

@zcain117 any ideas here? (https://github.com/PyTorchLightning/pytorch-lightning/pull/2467/checks?check_run_id=832554097#step:10:236)

I'm just getting familiar with the XLA details, but is the problem that we are comparing an XLA tensor with a non-XLA tensor?
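
For context on the question above, here is a minimal sketch (assuming a TPU runtime with torch_xla available) of the CPU-tensor vs XLA-tensor distinction; the variable names are illustrative and not taken from the PR.

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                    # the TPU core's XLA device, e.g. xla:1
cpu_stop = torch.tensor(1)                  # plain CPU tensor (torch.LongTensor)
xla_stop = torch.tensor(1, device=device)   # lives on the XLA device

# XLA-side reductions expect tensors like `xla_stop`; passing a plain CPU tensor
# is the kind of mismatch the "Input tensor is not an XLA tensor" error in #1838
# points at. Creating the flag with device=pl_module.device, as in the diff,
# yields an XLA tensor when running on TPU.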

@williamFalcon merged commit 020c332 into master on Jul 3, 2020
@Borda deleted the ess branch on July 3, 2020 07:55
if trainer.use_ddp or trainer.use_ddp2:
    stop = torch.tensor(int(trainer.should_stop), device=pl_module.device)
    dist.all_reduce(stop, op=dist.reduce_op.SUM)
    dist.barrier()

@williamFalcon Is a barrier needed after an all reduce?
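
A hedged sketch of the point behind this question: with the default async_op=False, torch.distributed.all_reduce already blocks each rank until the collective completes, so an extra dist.barrier() afterwards is likely redundant. The helper name _sync_should_stop below is hypothetical, not part of the PR.

import torch
import torch.distributed as dist

def _sync_should_stop(should_stop: bool, device) -> bool:
    """Return True on every rank if any rank wants to stop (hypothetical helper)."""
    stop = torch.tensor(int(should_stop), device=device)
    # all_reduce with the default async_op=False blocks until every rank has
    # contributed, so no explicit dist.barrier() should be required afterwards.
    dist.all_reduce(stop, op=dist.ReduceOp.SUM)
    return bool(stop.item())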

self.stopped_epoch = trainer.current_epoch
trainer.should_stop = True

# stop every ddp process if any world process decides to stop
self._stop_distributed_training(trainer, pl_module)

The function name is misleading: it does not stop training, it just updates the trainer.should_stop state.
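
A possible rename following this comment; the name below is hypothetical and only meant to show a label that matches what the helper actually does (synchronize the stop flag), not a proposal from the PR.

self.stopped_epoch = trainer.current_epoch
trainer.should_stop = True

# make every ddp process agree on trainer.should_stop if any rank wants to stop
self._sync_should_stop_across_processes(trainer, pl_module)  # hypothetical name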

Labels
ci (Continuous Integration), feature (Is an improvement or enhancement)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issue with EarlyStopping Callback on TPU runtime: Input tensor is not an XLA tensor: torch.FloatTensor
3 participants