Set smarter default for DDP sharded for performance optimization #6937
Conversation
…oint_consolidate Update test_all_gather_grad.py
This reverts commit 9d4a2b8.
This reverts commit 0d23d75.
This reverts commit 70fe5da.
This reverts commit a9aae99.
This reverts commit ea74906.
This reverts commit bf70e43.
This reverts commit f172101.
This reverts commit 536c132.
This reverts commit 3a9fde9.
This reverts commit 7a369f4.
This reverts commit 8222dc9.
This reverts commit 6c095b2.
This reverts commit 250d0aa.
This reverts commit 8651d54.
This reverts commit dcdcd29.
very nice!
… using fp16 broadcast
@@ -42,6 +47,12 @@ def _reinit_optimizers_with_oss(self):
        if not isinstance(optimizer, OSS):
            optim_class = type(optimizer)
            zero_optimizer = OSS(params=optimizer.param_groups, optim=optim_class, **optimizer.defaults)
            if _FAIRSCALE_OSS_FP16_BROADCAST_AVAILABLE:
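For readers without a fairscale checkout, the re-wrapping pattern in this diff can be sketched with stand-ins. Both `SGD` and `OSS` below are simplified, hypothetical mirrors of the real torch/fairscale types, reduced to the attributes the diff actually uses (`param_groups` and `defaults`):

```python
# Hypothetical stand-ins: a torch-style optimizer and fairscale's OSS,
# reduced to the constructor shape used in the diff above.
class SGD:
    def __init__(self, params, lr=0.1):
        self.param_groups = [{"params": list(params), "lr": lr}]
        self.defaults = {"lr": lr}

class OSS:
    # The real fairscale.optim.OSS shards optimizer state across ranks;
    # here we only keep the (params, optim, **defaults) signature.
    def __init__(self, params, optim, **defaults):
        self.optim = optim(params, **defaults)

optimizer = SGD(params=[0.0, 1.0], lr=0.05)

# Mirror of the diff: if the optimizer is not already an OSS instance,
# rebuild it as one from its param groups and defaults.
if not isinstance(optimizer, OSS):
    optim_class = type(optimizer)
    zero_optimizer = OSS(params=optimizer.param_groups, optim=optim_class, **optimizer.defaults)
```

The key point is that the original optimizer's hyperparameters survive the re-wrap, since they travel through `optimizer.defaults`.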
this is added in facebookresearch/fairscale#540
I think the tests are failing since …
I'm going to make a PR to update to the latest fairscale version, and start the deprecation!
This should be good to go now that #7017 is merged?
Hello @shuyingsunshine21! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-04-26 20:25:46 UTC
CHANGELOG.md
Outdated
@@ -144,7 +144,10 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Changed warnings and recommendations for dataloaders in `ddp_spawn` ([#6762](https://github.com/PyTorchLightning/pytorch-lightning/pull/6762/))

- `pl.seed_eveing` will now also set the seed on the `DistributedSampler` ([#7024](https://github.com/PyTorchLightning/pytorch-lightning/pull/7024))

Suggested change:
- `pl.seed_everything` will now also set the seed on the `DistributedSampler` ([#7024](https://github.com/PyTorchLightning/pytorch-lightning/pull/7024))
# For multi-node training, compressing the model shards in fp16 before broadcasting
# improves performance. When using PyTorch AMP, it will not degrade
# the model performance.
zero_optimizer.broadcast_fp16 = is_fp16 and self.num_nodes > 1
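The condition itself is simple enough to factor into a pure helper. This is an illustrative sketch, not code from the PR (`should_broadcast_fp16` is a hypothetical name):

```python
def should_broadcast_fp16(precision, num_nodes):
    """Return True only when fp16 compression of shards pays off.

    Broadcasting shards in fp16 halves inter-node traffic, which only
    matters when there is more than one node; per the PR discussion,
    with PyTorch AMP the compression does not degrade model quality.
    """
    is_fp16 = precision == 16
    return is_fp16 and num_nodes > 1

# Single node: keep the full-precision broadcast.
print(should_broadcast_fp16(16, 1))  # False
# Multi-node fp16: enable the compressed broadcast.
print(should_broadcast_fp16(16, 2))  # True
# Multi-node fp32: compression would lose precision, so stay off.
print(should_broadcast_fp16(32, 2))  # False
```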
Could we add a test for this, checking that it is True for 16-bit precision and multi-node?
> Could we add a test for this? It being True, for 16bit precision and multi node.
I was wondering, do we have a multi-node testing example?
It seems to be disabled for now. https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/accelerators/test_multi_nodes_gpu.py
cc: @Borda @SeanNaren @tchaton
Let me know if I should still add one, given that multi-node testing is currently disabled. Maybe I could add it in the same file for now (which might make it easier to re-enable).
Yup, that should work. Thanks!
Yes, we are in the process of adding multi-node back...
> yes, we are in process of adding multi-node back...
Might be overkill for now, but mocking multi-node for one GPU could be a potential stop gap here; I haven't delved too deep into how this would look!
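A sketch of that stop-gap idea: mock `num_nodes` instead of launching a real multi-node job. The `configure_zero_optimizer` helper below is hypothetical and only mirrors the one-line logic from the diff; the plugin and optimizer are plain `MagicMock` objects:

```python
from unittest.mock import MagicMock

def configure_zero_optimizer(plugin, zero_optimizer, is_fp16):
    # Mirrors the PR's logic: compress shards in fp16 only for
    # multi-node fp16 training.
    zero_optimizer.broadcast_fp16 = is_fp16 and plugin.num_nodes > 1

def test_broadcast_fp16_enabled_for_multi_node_fp16():
    plugin = MagicMock(num_nodes=2)   # pretend we run on two nodes
    zero_optimizer = MagicMock()
    configure_zero_optimizer(plugin, zero_optimizer, is_fp16=True)
    assert zero_optimizer.broadcast_fp16 is True

def test_broadcast_fp16_disabled_for_single_node():
    plugin = MagicMock(num_nodes=1)   # single node: no compression
    zero_optimizer = MagicMock()
    configure_zero_optimizer(plugin, zero_optimizer, is_fp16=True)
    assert zero_optimizer.broadcast_fp16 is False

test_broadcast_fp16_enabled_for_multi_node_fp16()
test_broadcast_fp16_disabled_for_single_node()
```

Such a test exercises only the default-setting logic, of course, not the actual inter-node broadcast, so it complements rather than replaces the real multi-node suite once that is re-enabled.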
What does this PR do?
Fixes #6992
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃