[Fix] Move init dist connection into the setup function #6506
Merged
Changes from 1 commit (33 commits in total)

Commits:
6bf721e  Move connection setup into the setup function. Call setup hook after …
1576176  Added CHANGELOG.md
7148ee6  fix setup order in callback test
4fd0c02  (awaelchli) fix input arguments in test
cbfa681  (awaelchli) Mock distributed function, remove protection to turn into training ty…
2a1dfbf  Remove import
e9c3f83  Add missing mock, ensure custom plugin does not create children process
2141a1f  Merge branch 'master' into fix/setup_ddp_hook
96ca54f  Merge branch 'master' into fix/setup_ddp_hook
ffe1c3f  (SeanNaren) Skip test on windows
1709cdb  Update deepspeed to init connection in setup
708f97f  Do not initialize distributed module
ec33b96  Move DeepSpeed tests to special tests since dist communication is bei…
d782554  Merge branch 'master' into fix/setup_ddp_hook
0c03487  Special the test to see if this fixes CI
edde60b  Delete accelerator connector test to see if its causing build to fail
9d31742  Delete deepspeed test
9db893a  Revert "Delete accelerator connector test to see if its causing build…
56ef252  Revert "Delete deepspeed test"
cad0671  Reverse hook
6b7d835  Reverse setup hooks to debug again
4651e57  Add todo so i know where i left off
d7ec33e  For single device move in pre_dispatch after setup function
72097ba  Merge branch 'master' into fix/setup_ddp_hook
bd2a53a  Add additional model to device hook if any additional parameters have…
b5450de  See if we can enable deepspeed tests
136ddc5  Revert "See if we can enable deepspeed tests"
0210f17  See if this hook approach works
1bae940  Introduce new granular hooks
69d6c32  Remove import, fix tpu spawn by moving the function to setup
91fff3a  Added missing special test
88e2e09  Merge branch 'master' into fix/setup_ddp_hook
3eced98  Clean up the setup comment, since its run on train and test
Commit message: Move connection setup into the setup function. Call setup hook after we set up the accelerator
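For context (not part of the diff), a minimal sketch of what the reordering in the commit message would enable: with the distributed connection initialised during plugin setup and the setup hook called afterwards, a `LightningModule.setup` hook running under DDP can already query the process group. `MyModule` is a hypothetical example, not code from this PR.

```python
import torch.distributed as dist
from pytorch_lightning import LightningModule


class MyModule(LightningModule):
    def setup(self, stage=None):
        # Under DDP, the plugin has already initialised the process group by
        # the time this hook runs, so rank/world-size queries are safe here.
        if dist.is_available() and dist.is_initialized():
            rank = dist.get_rank()
            world_size = dist.get_world_size()
            print(f"setup() on rank {rank}/{world_size}, stage={stage}")
```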
```diff
@@ -90,6 +90,8 @@ def setup(self, model):
         # set the task idx
         self.task_idx = self.cluster_environment.local_rank()
 
+        self._setup_distributed()
+
     def _call_children_scripts(self):
 
         # bookkeeping of spawned processes
```
```diff
@@ -161,6 +163,34 @@ def _call_children_scripts(self):
             delay = np.random.uniform(1, 5, 1)[0]
             sleep(delay)
 
+    def _setup_distributed(self):
+        # TODO: check if needed
+        seed = os.environ.get("PL_GLOBAL_SEED")
+        if seed is not None:
+            seed_everything(int(seed))
+
+        # determine which process we are and world size
+        self.set_world_ranks()
+
+        # set warning rank
+        rank_zero_only.rank = self.global_rank
+
+        # set up server using proc 0's ip address
+        # try to init for 20 times at max in case ports are taken
+        # where to store ip_table
+        self.init_ddp_connection(self.global_rank, self.world_size)
+
+        # on world_size=0 let everyone know training is starting
+        if self.is_global_zero and not torch.distributed.is_initialized():
+            log.info("-" * 100)
+            log.info(f"distributed_backend={self.distributed_backend}")
+            log.info(f"All DDP processes registered. Starting ddp with {self.world_size} processes")
+            log.info("-" * 100)
+
+        # set the ranks and devices
+        self.dist.rank = self.global_rank
+        self.dist.device = self.root_device
+
     def _check_can_spawn_children(self):
         if self._has_spawned_children:
             raise RuntimeError(
```
```diff
@@ -179,9 +209,7 @@ def pre_configure_ddp(self):
         # Many models require setting this parameter to True, as there are corner cases
         # when not all parameter backward hooks are fired by the autograd engine even if require_grad is set to True.
         # This flag does come with a performance hit, so it is suggested to disable in cases where it is possible.
-        self._ddp_kwargs["find_unused_parameters"] = self._ddp_kwargs.get(
-            "find_unused_parameters", True
-        )
+        self._ddp_kwargs["find_unused_parameters"] = self._ddp_kwargs.get("find_unused_parameters", True)
         # todo: PyTorch 1.7.0 DDP introduces ``self.reducer._rebuild_buckets()`` breaking manual_optimization
         if _TORCH_GREATER_EQUAL_1_7 and not self.lightning_module.automatic_optimization and not self._ddp_kwargs.get(
             "find_unused_parameters", False
```
```diff
@@ -215,37 +243,6 @@ def init_ddp_connection(self, global_rank: int, world_size: int) -> None:
         torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
 
     def pre_dispatch(self):
-        # TODO: check if needed
-        seed = os.environ.get("PL_GLOBAL_SEED")
-        if seed is not None:
-            seed_everything(int(seed))
-
-        # determine which process we are and world size
-        self.set_world_ranks()
-
-        # set warning rank
-        rank_zero_only.rank = self.global_rank
-
-        # set up server using proc 0's ip address
-        # try to init for 20 times at max in case ports are taken
-        # where to store ip_table
-        self.init_ddp_connection(self.global_rank, self.world_size)
-
-        # TODO: we moved it to the trainer.fit after calling pre_dispatch
-        # ... need to double check that it is the correct place
-        # self.trainer.call_setup_hook(self.model)
-
-        # on world_size=0 let everyone know training is starting
-        if self.is_global_zero and not torch.distributed.is_initialized():
-            log.info("-" * 100)
-            log.info(f"distributed_backend={self.distributed_backend}")
-            log.info(f"All DDP processes registered. Starting ddp with {self.world_size} processes")
-            log.info("-" * 100)
-
-        # set the ranks and devices
-        self.dist.rank = self.global_rank
-        self.dist.device = self.root_device
-
         if self.sync_batchnorm:
             self.model = self.configure_sync_batchnorm(self.model)
```

Review comment on lines -231 to -235: yeah my silly todo.... Thanks for double checking @SeanNaren 😄
Review comment: btw, shall we have this as a single message instead of 4 separate ones?
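Assuming the suggestion refers to the four consecutive `log.info` calls in `_setup_distributed`, a minimal sketch of the single-message variant (`log_ddp_banner` is a hypothetical helper, not code from this PR):

```python
import logging

log = logging.getLogger(__name__)


def log_ddp_banner(distributed_backend: str, world_size: int) -> None:
    # One multi-line log record instead of four separate log.info calls.
    banner = "-" * 100
    log.info(
        f"{banner}\n"
        f"distributed_backend={distributed_backend}\n"
        f"All DDP processes registered. Starting ddp with {world_size} processes\n"
        f"{banner}"
    )


log_ddp_banner("ddp", 2)
```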