resume from checkpoint fails in current master with deepspeed stage 2 #8344

Closed
gurvindersingh opened this issue Jul 9, 2021 · 9 comments
Labels: bug (Something isn't working) · help wanted (Open to be worked on)

@gurvindersingh

🐛 Bug

When trying to resume a model from a stored checkpoint with DeepSpeed stage 2, it fails with the following exception:

Restoring states from the checkpoint file at tests/last.ckpt
Traceback (most recent call last):
  File "test.py", line 82, in <module>
    run()
  File "test.py", line 77, in run
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 520, in fit
    self.checkpoint_connector.resume_start()
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 71, in resume_start
    self._loaded_checkpoint = self.trainer.training_type_plugin.load_checkpoint_file(checkpoint_path)
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 673, in load_checkpoint_file
    checkpoint_path = self.broadcast(checkpoint_path)
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 361, in broadcast
    return self.dist.broadcast(obj)
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/pytorch_lightning/distributed/dist.py", line 33, in broadcast
    broadcast_object_list(obj, 0, group=group or _group.WORLD)
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/pytorch_lightning/overrides/torch_distributed.py", line 48, in _broadcast_object_list
    my_rank = get_rank()
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 616, in get_rank
    _check_default_pg()
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
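
For context (a minimal illustration, not part of the reproduce script): the traceback shows that resume_start() triggers the DeepSpeed plugin's broadcast before the default process group has been set up, and torch.distributed.get_rank() asserts on exactly that.

import torch.distributed as dist

# get_rank() requires an initialized default process group; calling it
# before init_process_group() raises the assertion seen above.
dist.is_initialized()  # False until init_process_group() has been called
# dist.get_rank()      # -> AssertionError: Default process group is not initialized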

Please reproduce using the BoringModel

Run the following code snippet:

import os
import torch
from torch.utils.data import Dataset, DataLoader
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.plugins import DeepSpeedPlugin
from deepspeed.ops.adam import FusedAdam


class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)
        return loss

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    checkpoint_callback = ModelCheckpoint(
        dirpath='tests/',
        filename='{epoch:02d}',
        save_last=True,
        every_n_train_steps=5,
    )
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        gpus=-1,
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        precision=16,
        accelerator='ddp',
        max_epochs=10,
        plugins=[DeepSpeedPlugin(cpu_offload=False, stage=2)],
        weights_summary=None,
        callbacks=[checkpoint_callback],
        #resume_from_checkpoint='tests/last.ckpt',
    )
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
    trainer.test(model, test_dataloaders=test_data)


if __name__ == '__main__':
    run()

Run it fully once, then uncomment the resume_from_checkpoint parameter of the Trainer and run again; you will see the exception.

To Reproduce

Run the given code snippet to reproduce.

Expected behavior

Model training resumes from the stored checkpoint.

Environment

  • PyTorch Lightning Version (e.g., 1.3.0): 1.4.0-dev (current master)
  • PyTorch Version (e.g., 1.8): 1.7.1
  • Python version: 3.7.3
  • OS (e.g., Linux): Ubuntu 18.04
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: V100 (4 GPUs)
  • How you installed PyTorch (conda, pip, source): conda
@gurvindersingh gurvindersingh added bug Something isn't working help wanted Open to be worked on labels Jul 9, 2021
@gurvindersingh (Author)

Pinging @tchaton @SeanNaren as discussed on slack

@SeanNaren SeanNaren self-assigned this Jul 9, 2021
@awaelchli awaelchli self-assigned this Jul 9, 2021
@xxchauncey

Same issue here, need help.

@SeanNaren (Contributor)

We've merged a lot of fixes for DeepSpeed in #8397 that should allow a checkpoint to be restored fully! This required changing the default saving method to rely fully on DeepSpeed (which saves a directory); you can generate a single file for inference by following these instructions: https://pytorch-lightning.readthedocs.io/en/latest/advanced/advanced_gpu.html#deepspeed-zero-stage-3-single-file. Let us know if you run into any issues!
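
For reference, a minimal conversion sketch following the linked docs (the checkpoint path is taken from the reproduce script above, and convert_zero_checkpoint_to_fp32_state_dict is assumed to be available from pytorch_lightning.utilities.deepspeed in recent versions):

from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# With the new default, 'tests/last.ckpt' is a directory of DeepSpeed shards;
# this collates them into a single checkpoint file usable for inference.
convert_zero_checkpoint_to_fp32_state_dict("tests/last.ckpt", "tests/last_single.ckpt")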

@eelxpeng

eelxpeng commented Sep 15, 2021

I tested the above reproduce script with version 1.4.6, but the issue is still there: RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. Do you know what is going on? @SeanNaren @gurvindersingh

@gurvindersingh (Author)

@eelxpeng try with current master.

@eelxpeng

@gurvindersingh Yes, the master branch works. Thanks a lot. I think the release notes for version 1.4.6 are misleading; the issue apparently still exists in 1.4.6.

@HMJiangGatech

HMJiangGatech commented Oct 21, 2021

1.4.9 still fails with the same error: RuntimeError: Default process group has not been initialized, please make sure to call init_process_group

@tchaton (Contributor)

tchaton commented Oct 22, 2021

Dear @HMJiangGatech ,

Would you mind trying out the Lightning 1.5 rc?

Best,
T.C

@HMJiangGatech

Dear @HMJiangGatech ,

Would you mind trying out the Lightning 1.5 rc?

Best, T.C

The 1.5 rc itself looks fine, but it fails to load my 1.4.9 checkpoint when using DeepSpeed. 🤣
