resume from checkpoint fails in current master with deepspeed stage 2 #8344

Closed
gurvindersingh opened this issue Jul 9, 2021 · 9 comments
Labels: bug (Something isn't working) · help wanted (Open to be worked on)

@gurvindersingh

🐛 Bug

When trying to resume a model from a stored checkpoint with DeepSpeed stage 2, it fails with the following exception:

Restoring states from the checkpoint file at tests/last.ckpt
Traceback (most recent call last):
  File "test.py", line 82, in <module>
    run()
  File "test.py", line 77, in run
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 520, in fit
    self.checkpoint_connector.resume_start()
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 71, in resume_start
    self._loaded_checkpoint = self.trainer.training_type_plugin.load_checkpoint_file(checkpoint_path)
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 673, in load_checkpoint_file
    checkpoint_path = self.broadcast(checkpoint_path)
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 361, in broadcast
    return self.dist.broadcast(obj)
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/pytorch_lightning/distributed/dist.py", line 33, in broadcast
    broadcast_object_list(obj, 0, group=group or _group.WORLD)
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/pytorch_lightning/overrides/torch_distributed.py", line 48, in _broadcast_object_list
    my_rank = get_rank()
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 616, in get_rank
    _check_default_pg()
  File "/home/ca5b7a03-2d901b-2d45e5-2d969e-2df8ccc075972b/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
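
For context (a minimal illustration, not part of the reproduce script): the traceback shows that resume_start() triggers the DeepSpeed plugin's broadcast before the default process group has been set up, and torch.distributed.get_rank() asserts on exactly that.

import torch.distributed as dist

# get_rank() requires an initialized default process group; calling it
# before init_process_group() raises the assertion seen above.
dist.is_initialized()  # False until init_process_group() has been called
# dist.get_rank()      # -> AssertionError: Default process group is not initialized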

Please reproduce using the BoringModel

Run the following code snippet:

import os
import torch
from torch.utils.data import Dataset, DataLoader
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.plugins import DeepSpeedPlugin
from deepspeed.ops.adam import FusedAdam


class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)
        return loss

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    checkpoint_callback = ModelCheckpoint(
        dirpath='tests/',
        filename='{epoch:02d}',
        save_last=True,
        every_n_train_steps=5,
    )
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        gpus=-1,
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        precision=16,
        accelerator='ddp',
        max_epochs=10,
        plugins=[DeepSpeedPlugin(cpu_offload=False, stage=2)],
        weights_summary=None,
        callbacks=[checkpoint_callback],
        #resume_from_checkpoint='tests/last.ckpt',
    )
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
    trainer.test(model, test_dataloaders=test_data)


if __name__ == '__main__':
    run()

Run it fully once, then uncomment the resume_from_checkpoint parameter of the Trainer and run again; you will see the exception.

To Reproduce

Run the given code snippet to reproduce.

Expected behavior

Model training resumes from the stored checkpoint.

Environment

  • PyTorch Lightning Version (e.g., 1.3.0): 1.4.0-dev (current master)
  • PyTorch Version (e.g., 1.8): 1.7.1
  • Python version: 3.7.3
  • OS (e.g., Linux): Ubuntu 18.04
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: V100 (4 GPUs)
  • How you installed PyTorch (conda, pip, source): conda
@gurvindersingh gurvindersingh added bug Something isn't working help wanted Open to be worked on labels Jul 9, 2021
@gurvindersingh (Author)

Pinging @tchaton @SeanNaren as discussed on slack

@SeanNaren SeanNaren self-assigned this Jul 9, 2021
@awaelchli awaelchli self-assigned this Jul 9, 2021
@xxchauncey

Same issue here, need help.

@SeanNaren (Contributor)

We've merged a lot of fixes for DeepSpeed in #8397 that should allow a checkpoint to be restored fully! This required changing the default saving method to rely fully on DeepSpeed (which saves a directory); you can generate a single file for inference by following these instructions: https://pytorch-lightning.readthedocs.io/en/latest/advanced/advanced_gpu.html#deepspeed-zero-stage-3-single-file. Let us know if you run into any issues!
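
For reference, a minimal conversion sketch following the linked docs (the checkpoint path is taken from the reproduce script above, and convert_zero_checkpoint_to_fp32_state_dict is assumed to be available from pytorch_lightning.utilities.deepspeed in recent versions):

from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# With the new default, 'tests/last.ckpt' is a directory of DeepSpeed shards;
# this collates them into a single checkpoint file usable for inference.
convert_zero_checkpoint_to_fp32_state_dict("tests/last.ckpt", "tests/last_single.ckpt")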

@eelxpeng

eelxpeng commented Sep 15, 2021

I tested the above reproduce script with version 1.4.6, but the issue is still there: RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. Do you know what is going on? @SeanNaren @gurvindersingh

@gurvindersingh (Author)

@eelxpeng try with current master.

@eelxpeng

@gurvindersingh Yes, the master branch works. Thanks a lot. I think the release notes for version 1.4.6 are misleading; the issue apparently still exists in 1.4.6.

@HMJiangGatech

HMJiangGatech commented Oct 21, 2021

1.4.9 still fails with the same error: RuntimeError: Default process group has not been initialized, please make sure to call init_process_group

@tchaton (Contributor)

tchaton commented Oct 22, 2021

Dear @HMJiangGatech ,

Would you mind trying out the Lightning 1.5 rc?

Best,
T.C

@HMJiangGatech

Dear @HMJiangGatech ,

Would you mind trying out the Lightning 1.5 rc?

Best, T.C

The 1.5 rc itself looks fine, but it fails to load my 1.4.9 checkpoint when using DeepSpeed. 🤣
