Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Accelerator refactor #5385

Closed
wants to merge 122 commits into from
Closed

WIP: Accelerator refactor #5385

wants to merge 122 commits into from

Conversation

justusschock
Copy link
Member

@justusschock justusschock commented Jan 6, 2021

What does this PR do?

Fixes #4510

This PR separates the Accelerator (Hardware Part) from the actual different training routines.

Remaining TODOs:

  • TPUAccelerator (Thomas)
  • DDP2 Plugin (Adrian)
  • Shared Training Plugin (Sean)
  • RPC Plugin (Sean)
  • Add Layers for parallel: SingleProcessMultipleDevice (SPMD), SingleProcessSingleDevice (SPSD)
  • Rebase on release branch
  • Testing DDP2 and DDP Slurm
  • Remove old plugins (pl/plugins/old)
  • Make Tuner work (requires setting some attrs through trainer) (Adrian)

So far this PR was co-authored with @awaelchli !

cc @awaelchli @tchaton @SeanNaren who will likely work on this!

Slides for motivation and high-level overview

List of PRs to look out for (when rebasing, code we need to manually copy over to new files):
#5221, #5195, #5300, #5388

Comment on lines +1 to +4
from pytorch_lightning.accelerators.accelerator import Accelerator
from pytorch_lightning.accelerators.cpu import CPUAccelerator
from pytorch_lightning.accelerators.gpu import GPUAccelerator
from pytorch_lightning.accelerators.tpu import TPUAccelerator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 ... similar to trainer new attributes _distrib_type and _device_type
on the other hand, it will be breaking change... :/

Comment on lines +68 to +73
self.use_dp = False
self.use_ddp = False
self.use_ddp2 = False
self.use_horovod = False
self.use_single_gpu = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is refactored to _distrib_type but we can rename it if you want...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I know, we need to update the branch with these changes. It's a bit challenging now with keeping uptodate. But we will keep it in mind thanks

Copy link
Member

@Borda Borda Jan 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably no need for an update as you will cut this work to multiple PRs, just for you to be aware of it... :]

@Borda
Copy link
Member

Borda commented Jan 8, 2021

could we have some high-level summary, I see that accelerators are for HW (as it is in the description) but some summary of how the plugins (I would move them on the same level as accelerators as there re rather partners not sub-category) are linked, and can you have "any" combination?

@awaelchli
Copy link
Contributor

awaelchli commented Jan 8, 2021

We made some slides (see link in description above) for a high level overview. For some plugins and accelerators multiple combinations are possible, but some configurations are obviously not. So for our built-in modes, we take care of valid configuration. But the user can define their own accelerator and compatible plugins if they want to, or re-use our built-in ones.

@pep8speaks
Copy link

pep8speaks commented Jan 21, 2021

Hello @justusschock! Thanks for updating this PR.

Line 212:26: W292 no newline at end of file

Line 63:18: W291 trailing whitespace
Line 186:33: E128 continuation line under-indented for visual indent
Line 187:33: E128 continuation line under-indented for visual indent

Line 14:45: W292 no newline at end of file

Line 25:37: W292 no newline at end of file

Line 4:1: E302 expected 2 blank lines, found 1
Line 31:14: W292 no newline at end of file

Line 12:1: E302 expected 2 blank lines, found 1
Line 115:53: W292 no newline at end of file

Line 4:1: E302 expected 2 blank lines, found 1
Line 7:24: W292 no newline at end of file

Line 48:40: W292 no newline at end of file

Line 4:1: E302 expected 2 blank lines, found 1
Line 5:9: W292 no newline at end of file

Line 8:1: E302 expected 2 blank lines, found 1
Line 44:27: W292 no newline at end of file

Line 17:1: E302 expected 2 blank lines, found 1
Line 91:23: W292 no newline at end of file

Line 39:5: E303 too many blank lines (2)

Line 20:1: W293 blank line contains whitespace
Line 40:19: W292 no newline at end of file

Line 11:1: E302 expected 2 blank lines, found 1

Line 1:53: W292 no newline at end of file

Line 38:9: E265 block comment should start with '# '

Line 224:1: E305 expected 2 blank lines after class or function definition, found 0
Line 224:3: E227 missing whitespace around bitwise or shift operator
Line 224:7: E225 missing whitespace around operator
Line 227:3: E225 missing whitespace around operator
Line 227:5: E225 missing whitespace around operator
Line 227:7: E225 missing whitespace around operator
Line 228:9: E113 unexpected indentation
Line 230:7: E225 missing whitespace around operator

Line 427:121: E501 line too long (121 > 120 characters)

Line 112:1: E122 continuation line missing indentation or outdented
Line 112:3: E225 missing whitespace around operator
Line 112:5: E227 missing whitespace around bitwise or shift operator
Line 115:1: E122 continuation line missing indentation or outdented
Line 115:3: E225 missing whitespace around operator
Line 115:5: E225 missing whitespace around operator
Line 115:7: E225 missing whitespace around operator
Line 115:8: E251 unexpected spaces around keyword / parameter equals
Line 116:1: E122 continuation line missing indentation or outdented
Line 116:1: W503 line break before binary operator
Line 116:7: E225 missing whitespace around operator

Line 152:121: E501 line too long (121 > 120 characters)

Comment last updated at 2021-01-22 09:54:14 UTC

Base automatically changed from release/1.2-dev to master February 11, 2021 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Connectors Refactoring Discussion
4 participants