[WIP] ref: decoupled ddp, ddp spawn #3733

Closed
wants to merge 119 commits
Changes from 111 commits

Commits (119)
767b8ab
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
f746018
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
c3529ee
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
3497f0d
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
c4a9dc0
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
09bf2a6
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
81d7a0d
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
54a7402
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
2960aa2
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
417242c
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
b751f3a
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
e40a7c2
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
78bf07b
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
07efc8e
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
1276a51
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
d4b9f37
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
3041561
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
61ab801
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
7eeaa64
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
416a96d
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
b4454ee
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
f151c21
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
4278731
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
2e9c537
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
6f6f4fa
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
dab971d
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
95aaca6
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
b46874c
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
424a6db
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
35d01e4
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
f6e0bbe
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
a0542ae
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
64a486c
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
d124a94
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
3fa5ad2
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
2e49563
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
8acddd7
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
50a9c8b
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
5fc4912
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
2070075
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
f0c06bd
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
08b0cad
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
8a8a0bf
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
ed675ef
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
336bb47
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
c3f299a
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
e4cb76d
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
94ef3b9
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
357d640
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
e49c8a1
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
91736e2
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
15e5be0
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
b37d948
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
51370ce
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
23032ea
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
9f8705a
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
0f13e61
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
7ccabd8
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
9171464
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
1d4aeaa
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
b96d7c1
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
85050a3
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
506b037
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
63f5d50
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
01dd4c5
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
a0f52d7
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
650903a
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
cbd89f7
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
8ebd4ed
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
1f19c2f
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
ea448bb
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
fbeec9e
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
7663c6b
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
9421dbb
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
cf08480
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
f0c3cc5
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
459a0fa
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
64484a1
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
10bae5b
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
667c434
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
5b412e0
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
d9fc538
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
b2e941c
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
5ac3e59
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
3650f86
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
da582ab
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
471b576
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
545bf01
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
7b72cd6
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
1fbc1ca
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
c5c9faf
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
701f233
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
4a7368a
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
7169107
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
27e5870
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
455a488
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
6c3732c
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
73f0ef3
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
e36e20f
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
2f93660
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
1fb466c
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
202e82e
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
c8bd6ee
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
d4d8551
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
5acef3e
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
288fd23
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
0dcdd81
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
581e929
ref: decoupled ddp spawn
williamFalcon Sep 30, 2020
fe53c9a
h
williamFalcon Oct 1, 2020
c644f66
h
williamFalcon Oct 1, 2020
2a10f59
h
williamFalcon Oct 1, 2020
beacd6a
rebased
williamFalcon Oct 1, 2020
7e98763
merged
williamFalcon Oct 1, 2020
661cfb0
merged
williamFalcon Oct 1, 2020
69235e9
merged
williamFalcon Oct 1, 2020
c958ec7
ref: part 4 of #3733
williamFalcon Oct 1, 2020
6088c48
ref: part 4 of #3733
williamFalcon Oct 1, 2020
f86ab63
ref: part 4 of #3733
williamFalcon Oct 1, 2020
2c2755c
ref: clean up ddp before final fix
williamFalcon Oct 3, 2020
benchmarks/test_parity.py (1 addition, 1 deletion)
@@ -11,7 +11,7 @@

@pytest.mark.parametrize('cls_model,max_diff', [
(ParityModuleRNN, 0.05),
(ParityModuleMNIST, 0.5)
(ParityModuleMNIST, 0.55)
])
@pytest.mark.skipif(not torch.cuda.is_available(), reason="test requires GPU machine")
def test_pytorch_parity(tmpdir, cls_model, max_diff):
pl_examples/basic_examples/autoencoder.py (2 additions, 1 deletion)
@@ -97,7 +97,8 @@ def cli_main():
# ------------
# testing
# ------------
trainer.test(test_dataloaders=test_loader)
result = trainer.test(test_dataloaders=test_loader)
print(result)


if __name__ == '__main__':
pytorch_lightning/accelerators/accelerator_connector.py (6 additions, 3 deletions)
@@ -81,12 +81,11 @@ def on_trainer_init(

# override with environment flag
gpus = os.environ.get('PL_TRAINER_GPUS', gpus)
self.trainer.gpus = gpus

# for gpus allow int, string and gpu list
if auto_select_gpus and isinstance(gpus, int):
self.trainer.gpus = self.trainer.tuner.pick_multiple_gpus(gpus)
else:
self.trainer.gpus = gpus

self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)
self.trainer.root_gpu = device_parser.determine_root_gpu_device(self.trainer.data_parallel_device_ids)
@@ -126,6 +125,9 @@ def on_trainer_init(
self.trainer.replace_sampler_ddp = replace_sampler_ddp

def select_accelerator(self):
if self.trainer.accelerator_backend is not None:
return self.trainer.accelerator_backend

# SLURM ddp
use_slurm_ddp = self.trainer.use_ddp and self.trainer.is_slurm_managing_tasks

@@ -294,7 +296,8 @@ def set_nvidia_flags(self, is_slurm_managing_tasks, data_parallel_device_ids):
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_str

# don't make this debug... this is good UX
rank_zero_info(f'CUDA_VISIBLE_DEVICES: [{os.environ["CUDA_VISIBLE_DEVICES"]}]')
devices = os.environ["CUDA_VISIBLE_DEVICES"]
log.info(f'LOCAL_RANK: {self.trainer.local_rank} - CUDA_VISIBLE_DEVICES: [{devices}]')

def determine_local_rank(self):
if self.trainer.is_slurm_managing_tasks:
pytorch_lightning/accelerators/base_backend.py (3 additions, 0 deletions)
@@ -170,6 +170,9 @@ def _clip_gradients(self, optimizer):
def on_train_epoch_end(self):
pass

def on_train_end(self):
pass

def early_stopping_should_stop(self, pl_module):
return self.trainer.should_stop

pytorch_lightning/accelerators/ddp_backend.py (133 additions, 20 deletions)
@@ -13,17 +13,22 @@
# limitations under the License

import os
import torch
import torch.distributed as torch_distrib
import subprocess
import sys
from os.path import abspath
from time import sleep
from typing import Optional

import numpy as np
import torch

from pytorch_lightning import _logger as log
from pytorch_lightning.utilities.distributed import find_free_network_port
from pytorch_lightning.accelerators.ddp_base_backend import DDPBase
from pytorch_lightning.accelerators.base_backend import Accelerator
from pytorch_lightning.utilities.distributed import rank_zero_only
from pytorch_lightning.utilities import AMPType


try:
from hydra.utils import to_absolute_path, get_original_cwd
@@ -34,13 +39,14 @@
HYDRA_AVAILABLE = True


class DDPBackend(DDPBase):
class DDPBackend(Accelerator):

def __init__(self, trainer, mode: str = 'ddp'):
super().__init__(trainer)
self.task_idx = None
self._has_spawned_children = False
self.mode = mode
self.interactive_ddp_procs = []

def setup(self, model):
if self.mode == 'ddp':
@@ -59,6 +65,10 @@ def __torchelastic_setup(self):
self.task_idx = int(os.environ['LOCAL_RANK'])

def __ddp_script_mode_setup(self):
# do nothing when already in a ddp subprocess
if os.environ.get('PL_IN_DDP_SUBPROCESS', '0') == '1':
return

assert self.trainer.global_rank == 0
self._check_can_spawn_children()
self._has_spawned_children = True
@@ -91,21 +101,27 @@ def __ddp_script_mode_setup(self):
# when the trainer script was called the device has already been scoped by the time
# code reaches this point. so, to call the scripts, we need to leave cuda visible devices alone
# but forward the GPUs selected via environment variables
# set the flag for ddp scripts

os.environ['PL_TRAINER_GPUS'] = ','.join([str(i) for i in self.trainer.data_parallel_device_ids])
os.environ['PL_IN_DDP_SUBPROCESS'] = '1'

if self.trainer.logger is not None:
os.environ['PL_EXP_VERSION'] = str(self.trainer.logger.version)

gpu_ids = os.environ.get('CUDA_VISIBLE_DEVICES', '')
if len(gpu_ids) == 1:
gpu_ids = f'{gpu_ids},'

num_gpus = max(1, len(gpu_ids.split(',')))

# set the flag for ddp scripts
os.environ['PL_TRAINER_GPUS'] = gpu_ids

os.environ['WORLD_SIZE'] = f'{num_gpus * self.trainer.num_nodes}'

self.trainer.interactive_ddp_procs = []
self.interactive_ddp_procs = []
for local_rank in range(1, self.trainer.num_processes):
env_copy = os.environ.copy()
env_copy['LOCAL_RANK'] = f'{local_rank}'
env_copy['PL_DDP_PID'] = str(self.trainer.data_parallel_device_ids[local_rank])

# start process
# if hydra is available and initialized, make sure to set the cwd correctly
@@ -114,7 +130,7 @@ def __ddp_script_mode_setup(self):
if HydraConfig.initialized():
cwd = get_original_cwd()
proc = subprocess.Popen(command, env=env_copy, cwd=cwd)
self.trainer.interactive_ddp_procs.append(proc)
self.interactive_ddp_procs.append(proc)

# starting all processes at once can cause issues
# with dataloaders delay between 1-10 seconds
@@ -123,14 +139,116 @@

self.task_idx = 0

# wait for all the procs to start
sleep(2)
Reviewer (Member): Do the processes communicate on startup? I feel like a hardcoded sleep is not the optimal solution here.
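For reference, a minimal sketch of one possible alternative to a fixed sleep, not part of this PR: since torch.distributed.init_process_group blocks until every rank has joined, an explicit barrier placed after init_ddp_connection would synchronize startup across ranks. The helper name below is illustrative only.

    import torch.distributed as torch_distrib

    def wait_for_ddp_peers():
        # init_process_group already blocks until all ranks have joined, so a
        # barrier here guarantees every spawned process finished startup
        # before training begins, without guessing a sleep duration
        if torch_distrib.is_available() and torch_distrib.is_initialized():
            torch_distrib.barrier()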


def train(self):
model = self.trainer.model
if self.mode == 'ddp':
results = self.ddp_train_tmp(process_idx=self.task_idx, mp_queue=None, model=model, is_master=True)
del os.environ['WORLD_SIZE']
results = self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model, is_master=True)
if 'WORLD_SIZE' in os.environ:
del os.environ['WORLD_SIZE']
return results
else:
self.ddp_train_tmp(process_idx=self.task_idx, mp_queue=None, model=model)
return self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)

def ddp_train(self, process_idx, mp_queue, model, is_master=False, proc_offset=0):
"""
Entry point for ddp
Args:
process_idx:
mp_queue: multiprocessing queue
model:
is_master:
proc_offset:
Returns:
"""
# offset the process id if requested
process_idx = process_idx + proc_offset

# show progressbar only on progress_rank 0
if (self.trainer.node_rank != 0 or process_idx != 0) and self.trainer.progress_bar_callback is not None:
self.trainer.progress_bar_callback.disable()

# determine which process we are and world size
self.set_world_ranks(process_idx)

# set warning rank
rank_zero_only.rank = self.trainer.global_rank

# set up server using proc 0's ip address
# try to init for 20 times at max in case ports are taken
# where to store ip_table
model.trainer = self.trainer
model.init_ddp_connection(
self.trainer.global_rank,
self.trainer.world_size,
self.trainer.is_slurm_managing_tasks
)

# call setup after the ddp process has connected
self.trainer.call_setup_hook(model)

# on world_size=0 let everyone know training is starting
if self.trainer.is_global_zero and not torch.distributed.is_initialized():
Reviewer (Member): just thinking, but is_global_zero should be part of DDPBackend IMO, since this is only needed for this.
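A hypothetical sketch of that idea, not part of this diff; the property simply mirrors the existing trainer attribute:

    class DDPBackend(Accelerator):
        @property
        def is_global_zero(self) -> bool:
            # rank 0 of the whole world, mirroring trainer.is_global_zero
            return self.trainer.global_rank == 0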

log.info('-' * 100)
log.info(f'distributed_backend={self.trainer.distributed_backend}')
log.info(f'All DDP processes registered. Starting ddp with {self.trainer.world_size} processes')
log.info('-' * 100)

# call sync_bn before .cuda(), configure_apex and configure_ddp
if self.trainer.sync_batchnorm:
model = model.configure_sync_batchnorm(model)

# MODEL
# copy model to each gpu
self.model_to_device(model, process_idx, is_master)

# CHOOSE OPTIMIZER
# allow for lr schedulers as well
self.setup_optimizers(model)

# set model properties before going into wrapper
self.trainer.model_connector.copy_trainer_model_properties(model)

# AMP - run through amp wrapper before going to distributed DP
# DDP uses all GPUs on the machine
device_ids = self.get_device_ids()

# allow user to configure ddp
model = model.configure_ddp(model, device_ids)

# set up training routine
self.barrier('ddp_setup')
self.trainer.train_loop.setup_training(model)

# train or test
results = self.train_or_test()

# clean up memory
torch.cuda.empty_cache()

return results

def training_step(self, args):
Reviewer (Member), suggested change:
-    def training_step(self, args):
+    def training_step(self, *args):
Changing this to positional args allows for inspection. Since this is also passed as positional args, this should be okay. Just have to check the corresponding calls as well.

if self.trainer.amp_backend == AMPType.NATIVE:
with torch.cuda.amp.autocast():
output = self.trainer.model(*args)
else:
output = self.trainer.model(*args)
return output

def validation_step(self, args):
Reviewer (Member): this looks like the pre-routine we had before. Can we rename training_step to _step and add a training_step method, like validation_step and test_step?
Reviewer (Member), suggested change:
-    def validation_step(self, args):
+    def validation_step(self, *args):
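A hedged sketch of the rename proposed above, assuming the forward/AMP logic moves into a private _step and the three public hooks become thin wrappers (names are illustrative):

    def _step(self, *args):
        # shared forward pass with optional native-AMP autocasting
        if self.trainer.amp_backend == AMPType.NATIVE:
            with torch.cuda.amp.autocast():
                return self.trainer.model(*args)
        return self.trainer.model(*args)

    def training_step(self, *args):
        return self._step(*args)

    def validation_step(self, *args):
        return self._step(*args)

    def test_step(self, *args):
        return self._step(*args)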

output = self.training_step(args)
Reviewer (Member), suggested change:
-    output = self.training_step(args)
+    output = self.training_step(*args)

return output

def test_step(self, args):
Reviewer (Member), suggested change:
-    def test_step(self, args):
+    def test_step(self, *args):

output = self.training_step(args)
Reviewer (Member), suggested change:
-    output = self.training_step(args)
+    output = self.training_step(*args)

return output

def barrier(self, name: str = None):
Reviewer (Member): why is the name argument not used here?

if torch_distrib.is_initialized():
torch_distrib.barrier()
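In reply to the comment above, a minimal sketch that actually uses the name argument, e.g. for debug logging; purely illustrative and not part of this PR:

    def barrier(self, name: str = None):
        if torch_distrib.is_initialized():
            # name the sync point so hangs are easier to attribute in logs
            log.debug(f'barrier reached: {name or "unnamed"} (global_rank={self.trainer.global_rank})')
            torch_distrib.barrier()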

def _check_can_spawn_children(self):
if self._has_spawned_children:
@@ -145,15 +263,7 @@ def set_world_ranks(self, process_idx):
self.trainer.world_size = self.trainer.num_nodes * self.trainer.num_processes

def model_to_device(self, model, process_idx, is_master):
gpu_idx = process_idx

# when using ddp, the master process (proc 0) continues running as the main one
# this means that the local rank will always be 0
# (even if cuda visible devices has other visible gpus)
# this means that the master process needs to pull the 0th visible index as the device number
if is_master:
available_gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
gpu_idx = int(available_gpus[self.trainer.local_rank])
gpu_idx = int(os.environ.get('PL_DDP_PID', process_idx))

self.trainer.root_gpu = gpu_idx
torch.cuda.set_device(self.trainer.root_gpu)
@@ -162,3 +272,6 @@ def model_to_device(self, model, process_idx, is_master):
def get_device_ids(self):
device_ids = [self.trainer.root_gpu]
return device_ids

def on_train_end(self):
pass
pytorch_lightning/accelerators/ddp_base_backend.py (3 additions, 2 deletions)
@@ -58,7 +58,8 @@ def test_step(self, args):
return output

def barrier(self, name: str = None):
Reviewer (Member): again unused name argument.
Contributor (Author): this class is getting dropped in a few PRs.

torch_distrib.barrier()
if torch_distrib.is_initialized():
torch_distrib.barrier()

def early_stopping_should_stop(self, pl_module):
stop = torch.tensor(int(self.trainer.should_stop), device=pl_module.device)
@@ -132,7 +133,7 @@ def ddp_train_tmp(self, process_idx, mp_queue, model, is_master=False, proc_offs
self.trainer.call_setup_hook(model)

# on world_size=0 let everyone know training is starting
if self.trainer.is_global_zero:
if self.trainer.is_global_zero and not torch.distributed.is_initialized():
log.info('-' * 100)
log.info(f'distributed_backend={self.trainer.distributed_backend}')
log.info(f'All DDP processes registered. Starting ddp with {self.trainer.world_size} processes')
pytorch_lightning/accelerators/ddp_cpu_spawn_backend.py (21 additions, 6 deletions)
@@ -95,9 +95,7 @@ def ddp_train(self, process_idx, mp_queue, model):
self.trainer.progress_bar_callback.disable()

# determine which process we are and world size
self.trainer.local_rank = process_idx
self.trainer.global_rank = self.trainer.node_rank * self.trainer.num_processes + process_idx
self.trainer.world_size = self.trainer.num_nodes * self.trainer.num_processes
self.set_world_ranks(process_idx)

# set warning rank
rank_zero_only.rank = self.trainer.global_rank
@@ -116,7 +114,7 @@ def ddp_train(self, process_idx, mp_queue, model):
self.trainer.call_setup_hook(model)

# on world_size=0 let everyone know training is starting
if self.trainer.is_global_zero:
if self.trainer.is_global_zero and not torch.distributed.is_initialized():
log.info('-' * 100)
log.info(f'distributed_backend={self.trainer.distributed_backend}')
log.info(f'All DDP processes registered. Starting ddp with {self.trainer.world_size} processes')
@@ -126,6 +124,9 @@ def ddp_train(self, process_idx, mp_queue, model):
if self.trainer.sync_batchnorm:
model = model.configure_sync_batchnorm(model)

# move the model to the correct device
self.model_to_device(model, process_idx)

# CHOOSE OPTIMIZER
# allow for lr schedulers as well
self.setup_optimizers(model)
@@ -137,7 +138,7 @@ def ddp_train(self, process_idx, mp_queue, model):
model = self.trainer.precision_connector.connect(model)

# DDP spawn already spawned off each process... no need to do anything
device_ids = None
device_ids = self.get_device_ids()

# allow user to configure ddp
model = model.configure_ddp(model, device_ids)
@@ -174,7 +175,8 @@ def test_step(self, args):
return output

def barrier(self, name: str = None):
torch_distrib.barrier()
if torch_distrib.is_initialized():
torch_distrib.barrier()

def broadcast(self, obj, src=0):
return self.dist.broadcast(obj)
@@ -186,6 +188,19 @@ def early_stopping_should_stop(self, pl_module):
should_stop = stop == self.trainer.world_size
return should_stop

def set_world_ranks(self, process_idx):
self.trainer.local_rank = process_idx
self.trainer.global_rank = self.trainer.node_rank * self.trainer.num_processes + process_idx
self.trainer.world_size = self.trainer.num_nodes * self.trainer.num_processes

def model_to_device(self, model, process_idx):
# in ddp cpu we don't actually move models to a device
Reviewer (Member): we should explicitly move them to cpu here, since we don't know on which device it was initially.
Contributor (Author): good call

pass
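A minimal sketch of the explicit move the reviewer asks for, assuming the CPU backend should never rely on where the model was built:

    def model_to_device(self, model, process_idx):
        # make sure the model lives on CPU regardless of where it was created
        model.cpu()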

def get_device_ids(self):
device_ids = None
return device_ids
Reviewer (Member), suggested change:
-    device_ids = None
-    return device_ids
+    return None


def transfer_distrib_spawn_state_on_fit_end(self, model, mp_queue, results):
# track the best model path
best_model_path = None