This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Fix DDP support #1182

Merged
merged 18 commits into from
Feb 22, 2022

Conversation

ethanwharris
Collaborator

@ethanwharris ethanwharris commented Feb 21, 2022

What does this PR do?

Fixes #1153

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests? [not needed for typos/docs]
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in a GitHub issue, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@ethanwharris ethanwharris added the bug / fix Something isn't working label Feb 21, 2022
@ethanwharris ethanwharris added this to the 0.7.x milestone Feb 21, 2022
@codecov

codecov bot commented Feb 21, 2022

Codecov Report

Merging #1182 (d598948) into master (a396e26) will increase coverage by 0.01%.
The diff coverage is 83.78%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1182      +/-   ##
==========================================
+ Coverage   90.92%   90.94%   +0.01%     
==========================================
  Files         283      284       +1     
  Lines       12701    12687      -14     
==========================================
- Hits        11549    11538      -11     
+ Misses       1152     1149       -3     
Flag Coverage Δ
unittests 90.94% <83.78%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
flash/core/data/io/input.py 93.95% <ø> (+1.01%) ⬆️
flash/image/classification/model.py 81.03% <0.00%> (ø)
flash/image/segmentation/model.py 92.94% <0.00%> (ø)
flash/text/classification/model.py 93.33% <0.00%> (ø)
flash/text/question_answering/input.py 95.12% <0.00%> (ø)
flash/text/seq2seq/core/input.py 97.29% <0.00%> (ø)
flash/core/trainer.py 91.20% <85.71%> (-1.03%) ⬇️
flash/core/data/io/transform_predictions.py 100.00% <100.00%> (ø)
flash/text/question_answering/model.py 91.89% <100.00%> (-0.17%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a396e26...d598948. Read the comment docs.

@Borda Borda added the Priority label Feb 21, 2022
@ethanwharris ethanwharris marked this pull request as ready for review February 21, 2022 16:58
Member

@Borda Borda left a comment

Have we already removed the hardcoded restriction to run on only one GPU?

flash/core/trainer.py Outdated Show resolved Hide resolved
flash/core/trainer.py Outdated Show resolved Hide resolved
Contributor

@krshrimali krshrimali left a comment

Thanks, @ethanwharris, for working on this! LGTM. I have a couple of questions, but they shouldn't block merging this PR.

Asking for my own knowledge: what actually fixed the issue?

tests/examples/test_scripts.py Outdated Show resolved Hide resolved
Comment on lines +44 to +48
predictions = predict_step(*args, **kwargs)
if predictions is not None:
    predictions = self.output_transform(predictions)
    predictions = [self.output(prediction) for prediction in predictions]
return predictions
Contributor

Just curious, when do you think predictions would be None? Should that be counted as a failure? Or should a warning be raised that the OutputTransform and Output instances passed were not used?

Collaborator Author

I think there are some cases where it can be None, but I'm not sure; it may only be within our tests that it can be None. But yeah, it could be better to raise an error there.

Contributor

I'll also see if there is a possibility that predictions can be None, but for now I guess we can merge this PR and create a small follow-up PR if required (for the error).
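The follow-up discussed above could look something like the sketch below. The function name, signature, and warning text are hypothetical illustrations, not the Flash API; the actual code in this PR simply skips the transforms when predictions is None.

```python
import warnings


def transform_predictions(predict_step, output_transform, output, *args, **kwargs):
    """Wrap a predict step, applying an output transform and an output.

    Hypothetical sketch of the warning discussed above: warn the user
    that their OutputTransform / Output were not used when the predict
    step returns None, instead of skipping them silently.
    """
    predictions = predict_step(*args, **kwargs)
    if predictions is None:
        warnings.warn(
            "predict_step returned None; the OutputTransform and "
            "Output instances passed were not used."
        )
        return None
    predictions = output_transform(predictions)
    return [output(prediction) for prediction in predictions]
```

For example, `transform_predictions(lambda: [1, 2], lambda p: p, str)` returns `["1", "2"]`, while a predict step returning None produces a warning and None.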

Comment on lines -303 to -338
def __getstate__(self):
    """Temporarily override pickle behaviour.

    TODO: New DataPipeline should avoid this being pickled.
    """
    state = self.__dict__.copy()
    state.pop("data")
    if "data_iter" in state:
        state.pop("data_iter")
    return state

def __setstate__(self, newstate):
    """Temporarily override pickle behaviour.

    TODO: New DataPipeline should avoid this being pickled.
    """
    newstate["data"] = None
    self.__dict__.update(newstate)

def __copy__(self):
    """The default copy implementation seems to use ``__getstate__`` and ``__setstate__`` so we override it
    here with a custom implementation to ensure that it includes the data list."""
    cls = self.__class__
    result = cls.__new__(cls)
    result.__dict__.update(self.__dict__)
    return result

def __deepcopy__(self, memo):
    """The default deepcopy implementation seems to use ``__getstate__`` and ``__setstate__`` so we override it
    here with a custom implementation to ensure that it includes the data list."""
    cls = self.__class__
    result = cls.__new__(cls)
    memo[id(self)] = result
    for k, v in self.__dict__.items():
        setattr(result, k, deepcopy(v, memo))
    return result
Collaborator Author

@krshrimali This is the main fix. We used to have a bug where the data was accidentally included in the checkpoint. We patched that by adding these overrides. But DDP spawn needs to pickle the data to send it to each process, so the overrides caused problems. We have since refactored away the bit that got the data included in the checkpoint, so the overrides can now be safely removed 😃
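The problem with the old workaround can be illustrated with a minimal, self-contained sketch (the class here is a hypothetical stand-in, not the actual Flash code): an object whose `__getstate__` drops its `data` attribute keeps data out of checkpoints, but any pickle round trip, including the one DDP spawn performs to ship objects to each worker process, silently loses the data too.

```python
import pickle


class Input:
    """Hypothetical stand-in for a Flash input object."""

    def __init__(self, data):
        self.data = data

    def __getstate__(self):
        # The old workaround: drop the data so it never lands
        # in a checkpoint...
        state = self.__dict__.copy()
        state.pop("data")
        return state

    def __setstate__(self, newstate):
        # ...and restore with data set to None on unpickling.
        newstate["data"] = None
        self.__dict__.update(newstate)


inp = Input(data=[1, 2, 3])
# DDP spawn pickles objects to send them to each worker process;
# that round trip loses the data entirely:
clone = pickle.loads(pickle.dumps(inp))
print(clone.data)  # None
```

With the checkpoint issue fixed elsewhere, removing the overrides restores default pickling, so spawned workers receive the data intact.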

Contributor

Awesome, thank you so much for the explanation, @ethanwharris!

@ethanwharris ethanwharris merged commit 5cf1321 into master Feb 22, 2022
@ethanwharris ethanwharris deleted the bugfix/ddp branch February 22, 2022 19:28
ethanwharris added a commit that referenced this pull request Mar 1, 2022
Co-authored-by: Jirka Borovec <[email protected]>
ethanwharris added a commit that referenced this pull request Mar 1, 2022
Co-authored-by: Jirka Borovec <[email protected]>
Labels
bug / fix Something isn't working Priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Summarization example raises "Total length of Dataloader across ranks is zero" when on >1 GPUs
3 participants