[hotfix] ddp + manual_optimisation #4976
Conversation
…ng manual optimization
This reverts commit ccca6b6
…htning/pytorch-lightning into bug/fixfix_ddp_manual
Hello @tchaton! Thanks for updating this PR.
Comment last updated at 2020-12-07 18:24:01 UTC
Codecov Report
@@           Coverage Diff            @@
##           master   #4976    +/-   ##
========================================
  Coverage      93%     93%
========================================
  Files         130     130
  Lines        9527    9547    +20
========================================
+ Hits         8843    8871    +28
+ Misses        684     676     -8
…htning/pytorch-lightning into bug/fixfix_ddp_manual
can we add return types to the newly added methods?
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Thanks for getting this over the line @tchaton
The tests pass but the loss is NaN; please check the gradient flow in the model again.
make_manual_backward(loss_ones_gen, opt_dis)

# this will accumulate gradients for 2 batches and then call opt_gen.step()
opt_gen.step(closure=gen_closure, make_optimizer_step=batch_idx % 2 == 0, optim='sgd')
opt_gen.step(closure=gen_closure, make_optimizer_step=batch_idx % 2 == 0, optim='sgd')
Suggested change:
opt_gen.step(closure=gen_closure, make_optimizer_step=(batch_idx % 2 == 0), optim='sgd')
What is this extra argument optim="sgd"? This doesn't look right.
It is an arbitrary argument, just to make sure it is properly passed through to optimizer.step(*args, **kwargs).
Ok, understood. Could we name it something else, like **extra_kwargs or similar, so it isn't mistaken for a required arg?
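For readers following along, here is a rough, hypothetical sketch of the pattern being reviewed (helper names such as generator_loss are placeholders; it assumes the LightningOptimizer API exercised in this PR, i.e. step(closure=..., make_optimizer_step=...) together with manual_backward(loss, optimizer)):

```python
from pytorch_lightning import LightningModule


class ManualGANSketch(LightningModule):
    """Hypothetical sketch, not the PR's actual test code."""

    @property
    def automatic_optimization(self) -> bool:
        # manual optimization is assumed to be enabled for this module
        # (the exact mechanism varies across Lightning versions)
        return False

    def training_step(self, batch, batch_idx):
        opt_gen, opt_dis = self.optimizers()  # discriminator step omitted in this sketch

        def gen_closure():
            loss_gen = self.generator_loss(batch)  # hypothetical loss helper
            self.manual_backward(loss_gen, opt_gen)

        # accumulate gradients for 2 batches, then actually call the underlying step();
        # optim='sgd' is an arbitrary extra kwarg forwarded to optimizer.step(*args, **kwargs),
        # so the wrapped optimizer's step() must accept it (e.g. a custom optimizer in the tests)
        opt_gen.step(closure=gen_closure, make_optimizer_step=(batch_idx % 2 == 0), optim='sgd')
```

As I read the snippet, when make_optimizer_step is False the closure still runs (forward + manual_backward, accumulating gradients) and the real optimizer.step() is skipped, which is what makes the two-batch accumulation work.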
Thanks for the review @awaelchli, I think we can reduce the priority and focus on getting this right! Seems like we need to test a lot more.
Besides the already mentioned issues, I really like it. So no other requests from my side.
Just to put my thoughts for discussion:
We may not even need to worry about modifying the code much beyond defining the additional reduce hook function; for a first pass it would be good to check if calling …
In manual optimization we don't return the loss, so we get loss=nan in the progress bar. Need to take care of this. When calling self.log("train_loss", loss) it is fine :)
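A minimal sketch of that workaround, assuming the self.log API with prog_bar=True and the manual-optimization pattern of this era (compute_loss is a placeholder):

```python
from pytorch_lightning import LightningModule


class ManualLoggingSketch(LightningModule):
    """Hypothetical sketch: log the loss explicitly so the progress bar
    shows a value instead of nan when using manual optimization."""

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        loss = self.compute_loss(batch)  # hypothetical loss helper
        self.manual_backward(loss, opt)
        opt.step()
        opt.zero_grad()
        # the return value is not used for the progress bar in manual optimization,
        # so surface the loss explicitly
        self.log("train_loss", loss, prog_bar=True)
```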
…htning/pytorch-lightning into bug/fixfix_ddp_manual
@awaelchli After offline discussion I think we should merge this PR as it stands and create a follow-up issue to refactor the fix to be less intrusive whilst trying to sync upstream with the native DDP implementation. I think that is better in the interim, since DDP is currently broken with manual optim. Thoughts?
ok, I will remove the request for changes
Hi, when can we set find_unused_parameters to False for DDP when using manual optimization?
If your model/optimization needs it in PyTorch, it will need it in Lightning, and vice versa.
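For reference, a sketch of how find_unused_parameters is usually forwarded to DDP in Lightning around this release (the DDPPlugin import path and kwarg pass-through are assumptions and may differ between versions):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin  # import path may vary by version

# Forward find_unused_parameters=False straight to torch's DistributedDataParallel.
# Only safe if every parameter receives a gradient on every training step.
trainer = Trainer(
    gpus=2,
    accelerator="ddp",
    plugins=[DDPPlugin(find_unused_parameters=False)],
)
# trainer.fit(model)  # `model` is your LightningModule
```

If some parameters genuinely receive no gradient on some steps (common with manual optimization of GANs), DDP will error out with this setting, exactly as it would in plain PyTorch.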
What does this PR do?
Fixes #4953
SeanNaren EDIT:
There are two parts to this fix:
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃