
Set smarter default for DDP sharded for performance optimization #6937

Merged

Conversation

Contributor

@shuyingsunshine21 shuyingsunshine21 commented Apr 9, 2021

What does this PR do?

Fixes #6992

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

Shuying Sun and others added 30 commits March 23, 2021 12:06
…oint_consolidate

Update test_all_gather_grad.py
…1-checkpoint_consolidate"

This reverts commit c5053da, reversing
changes made to 0d23d75.
This reverts commit 70fe5da.
This reverts commit a9aae99.
Contributor

@ananthsub ananthsub left a comment

very nice!

@@ -42,6 +47,12 @@ def _reinit_optimizers_with_oss(self):
if not isinstance(optimizer, OSS):
optim_class = type(optimizer)
zero_optimizer = OSS(params=optimizer.param_groups, optim=optim_class, **optimizer.defaults)
if _FAIRSCALE_OSS_FP16_BROADCAST_AVAILABLE:
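For readers following along, here is a minimal, self-contained sketch of the wrapping logic shown in the diff above. The `OSS` class below is only a stand-in for fairscale's `fairscale.optim.OSS` (the real one shards optimizer state across data-parallel workers), and the free-standing function signature is hypothetical; only the re-wrapping pattern and the fp16-broadcast default mirror the diff.

```python
# Stand-in for fairscale.optim.OSS, just to make this sketch runnable.
# The real class shards optimizer state across workers and exposes a
# broadcast_fp16 flag that compresses shard broadcasts to fp16.
class OSS:
    def __init__(self, params, optim, **defaults):
        self.param_groups = params
        self.optim_class = optim
        self.defaults = defaults
        self.broadcast_fp16 = False


def reinit_optimizers_with_oss(optimizers, is_fp16, num_nodes):
    """Re-wrap plain optimizers in OSS, mirroring the diff above."""
    wrapped = []
    for optimizer in optimizers:
        if not isinstance(optimizer, OSS):
            optim_class = type(optimizer)
            optimizer = OSS(params=optimizer.param_groups,
                            optim=optim_class, **optimizer.defaults)
        # The smarter default: compress shards to fp16 before broadcast
        # only when it helps, i.e. fp16 training across multiple nodes.
        optimizer.broadcast_fp16 = is_fp16 and num_nodes > 1
        wrapped.append(optimizer)
    return wrapped
```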
@carmocca carmocca changed the title [DRAFT]Set smarter default for DDP sharded for performance optimization [WIP] Set smarter default for DDP sharded for performance optimization Apr 13, 2021
@shuyingsunshine21 shuyingsunshine21 marked this pull request as ready for review April 13, 2021 21:07
@ananthsub
Contributor

I think the tests are failing since `reduce_buffer_size` isn't available in the fairscale version Lightning has pinned for CI. @SeanNaren is this PR blocked until we deprecate the DDP sequential plugin and use the latest fairscale PyPI version for testing?

@SeanNaren
Contributor

> I think the tests are failing since `reduce_buffer_size` isn't available in the fairscale version Lightning has pinned for CI. @SeanNaren is this PR blocked until we deprecate the DDP sequential plugin and use the latest fairscale PyPI version for testing?

I'm going to make a PR to update to the latest fairscale version, and start the deprecation!

@SeanNaren SeanNaren mentioned this pull request Apr 14, 2021
11 tasks
@ananthsub ananthsub added this to the 1.3 milestone Apr 17, 2021
@ananthsub
Contributor

This should be good to go now that #7017 is merged?

@mergify mergify bot removed the has conflicts label Apr 26, 2021
@pep8speaks

pep8speaks commented Apr 26, 2021

Hello @shuyingsunshine21! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-26 20:25:46 UTC

Shuying Sun added 2 commits April 26, 2021 12:55
CHANGELOG.md Outdated
@@ -144,7 +144,10 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Changed warnings and recommendations for dataloaders in `ddp_spawn` ([#6762](https://github.com/PyTorchLightning/pytorch-lightning/pull/6762/))


- `pl.seed_everyting` will now also set the seed on the `DistributedSampler` ([#7024](https://github.com/PyTorchLightning/pytorch-lightning/pull/7024))
- `pl.seed_eveing` will now also set the seed on the `DistributedSampler` ([#7024](https://github.com/PyTorchLightning/pytorch-lightning/pull/7024))
Contributor

Suggested change
- `pl.seed_eveing` will now also set the seed on the `DistributedSampler` ([#7024](https://github.com/PyTorchLightning/pytorch-lightning/pull/7024))
- `pl.seed_everything` will now also set the seed on the `DistributedSampler` ([#7024](https://github.com/PyTorchLightning/pytorch-lightning/pull/7024))

# For multi-node training, compressing the model shards in fp16 before broadcasting
# improves performance. When using PyTorch AMP, it will not degrade
# the model performance.
zero_optimizer.broadcast_fp16 = is_fp16 and self.num_nodes > 1
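As a quick sanity check of the comment above, the new default can be enumerated over the four precision/topology combinations. This is a standalone sketch; `is_fp16` and `num_nodes` stand in for the plugin's precision flag and node count:

```python
# Enumerate the broadcast_fp16 default over precision x topology.
# It is only enabled for fp16 training spanning more than one node.
for is_fp16 in (False, True):
    for num_nodes in (1, 2):
        broadcast = is_fp16 and num_nodes > 1
        print(f"fp16={is_fp16}, nodes={num_nodes} -> broadcast_fp16={broadcast}")
```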
Contributor

Could we add a test for this? It being True, for 16bit precision and multi node.

Contributor Author

> Could we add a test for this? It being True, for 16bit precision and multi node.

Was wondering, do we have a multi-node testing example?

Contributor Author

Let me know if I should add one, since multi-node testing is currently disabled. Maybe I could add it in the same file for now (which might make it easier to re-enable).

Contributor

Yup, that should work. Thanks!

Member

Yes, we are in the process of adding multi-node testing back...


@SeanNaren
Contributor

Might be overkill for now, but a way to mock multi-node on a single GPU could be a potential stopgap here. I haven't delved too deep into how this would look!
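A sketch of what such a mocked single-GPU test could look like, using `unittest.mock.MagicMock` to fake a multi-node setup. The plugin attributes and the helper function are illustrative assumptions, not the actual Lightning API:

```python
from unittest.mock import MagicMock


def set_broadcast_fp16(plugin, zero_optimizer):
    # The default under discussion: compress shards to fp16 before
    # broadcast only for fp16 training that spans more than one node.
    zero_optimizer.broadcast_fp16 = plugin.is_fp16 and plugin.num_nodes > 1


def test_broadcast_fp16_enabled_for_multi_node_fp16():
    # Fake a two-node fp16 run on a single-GPU machine.
    plugin = MagicMock(is_fp16=True, num_nodes=2)
    optimizer = MagicMock()
    set_broadcast_fp16(plugin, optimizer)
    assert optimizer.broadcast_fp16 is True


def test_broadcast_fp16_disabled_for_single_node():
    plugin = MagicMock(is_fp16=True, num_nodes=1)
    optimizer = MagicMock()
    set_broadcast_fp16(plugin, optimizer)
    assert optimizer.broadcast_fp16 is False
```

The point of the mock is that the node count is just an attribute read off the plugin, so no second node is needed to exercise the default under pytest.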

@kaushikb11 kaushikb11 changed the title [WIP] Set smarter default for DDP sharded for performance optimization Set smarter default for DDP sharded for performance optimization Apr 26, 2021
@kaushikb11 kaushikb11 merged commit 52a5cee into Lightning-AI:master Apr 26, 2021
Development

Successfully merging this pull request may close these issues.

Performance Optimization for DDP sharded
9 participants