How to save and load checkpointing using DeepSpeed plugin stage 3? #9321

yidong72 · 2021-09-04T01:49:48Z

I have been struggling to figure out how to save and load my model with the DeepSpeed plugin, and I cannot find any examples of how to do it.

Here is how I set up the plugin:

    dp = DeepSpeedPlugin(
        stage=3,
        cpu_offload=True,
        cpu_checkpointing=True,
        # save_full_weights=False,
    )

I use the ModelCheckpoint callback to save the checkpoints. Depending on whether save_full_weights is True or False, it generates either a single checkpoint file or a directory of .pt files.

However, I don't know how to load the checkpoint files. I tried both Model.load_from_checkpoint and Trainer(resume_from_checkpoint=...), and neither works for me. I get these errors: AttributeError: 'NoneType' object has no attribute 'trainer', and Default process group has not been initialized, please make sure to call init_process_group.

Could you show me a working example? Thanks.
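The two checkpoint layouts described above (a single file versus a directory of .pt files) can be told apart before attempting to load anything. The following is a minimal sketch using only the standard library; `describe_checkpoint` is a hypothetical helper name, not part of PyTorch Lightning or DeepSpeed, and it only inspects the filesystem.

```python
from pathlib import Path


def describe_checkpoint(path):
    """Classify a checkpoint path as a single file or a directory of .pt files.

    Hypothetical helper for illustration; it does not load any weights.
    """
    p = Path(path)
    if p.is_dir():
        # Directory layout: list every .pt file found anywhere below it.
        shards = sorted(str(f.relative_to(p)) for f in p.rglob("*.pt"))
        return {"layout": "directory", "files": shards}
    if p.is_file():
        # Single-file layout: the path itself is the checkpoint.
        return {"layout": "single_file", "files": [p.name]}
    return {"layout": "missing", "files": []}


if __name__ == "__main__":
    print(describe_checkpoint("checkpoints/epoch=00.ckpt"))
```

Knowing which layout is on disk helps when reporting the problem, since the two layouts may be restored through different code paths.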

awaelchli · 2021-09-06T22:55:43Z

Here is a small example. Run it twice: once without modifications, and a second time after increasing max_epochs and uncommenting the resume_from_checkpoint line:

import torch
from torch.utils.data import Dataset, DataLoader
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.plugins import DeepSpeedPlugin


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    checkpoint_callback = ModelCheckpoint(
        dirpath="checkpoints/",
        filename="{epoch:02d}",
    )
    trainer = Trainer(
        # With filename="{epoch:02d}" and max_epochs=1, the first run saves epoch=00.ckpt.
        # resume_from_checkpoint="checkpoints/epoch=00.ckpt",
        max_epochs=1,  # increase when resuming
        gpus=2,
        accelerator="ddp",
        plugins=[DeepSpeedPlugin(stage=3)],
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        precision=16,
        weights_summary=None,
        callbacks=[checkpoint_callback],
    )
    trainer.fit(model, train_dataloader=train_data)


if __name__ == "__main__":
    run()
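The "run it twice" workflow above requires editing the script between runs. As a sketch of one way to avoid that, the helper below looks in the checkpoint directory and returns the most recently modified `*.ckpt` entry, or `None` if there is none yet. `find_resume_checkpoint` is a hypothetical name introduced here, not a Lightning API, and it assumes every process sees the same `checkpoints/` directory. It matches both files and directories, because a DeepSpeed checkpoint may be either.

```python
from pathlib import Path


def find_resume_checkpoint(dirpath="checkpoints/"):
    """Return the newest *.ckpt entry in dirpath as a string, or None.

    Hypothetical helper for illustration; "newest" means latest
    modification time, which is an assumption about how runs are ordered.
    """
    root = Path(dirpath)
    if not root.is_dir():
        return None  # first run: nothing has been saved yet
    candidates = list(root.glob("*.ckpt"))  # files or directories
    if not candidates:
        return None
    return str(max(candidates, key=lambda p: p.stat().st_mtime))


if __name__ == "__main__":
    print(find_resume_checkpoint())
```

The result could then be passed as `resume_from_checkpoint=find_resume_checkpoint()` in the `Trainer` call above, since `None` means "start fresh". This only selects the path; it does not change how the checkpoint is restored.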

yidong72 Sep 13, 2021
Author

Thanks for your example. However, I still get the same error when I run it:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/vscode/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:530: LightningDeprecationWarning: `trainer.fit(train_dataloader)` is deprecated in v1.4 and will be removed in v1.6. Use `trainer.fit(train_dataloaders)` instead. HINT: added 's'
  rank_zero_deprecation(
Restoring states from the checkpoint file at checkpoints/epoch=00.ckpt
Traceback (most recent call last):
  File "/workspace/example.py", line 63, in <module>
    run()
  File "/workspace/example.py", line 59, in run
    trainer.fit(model, train_dataloader=train_data)
  File "/home/vscode/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 550, in fit
    self.checkpoint_connector.resume_start()
  File "/home/vscode/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 71, in resume_start
    self._loaded_checkpoint = self.trainer.training_type_plugin.load_checkpoint_file(checkpoint_path)
  File "/home/vscode/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 699, in load_checkpoint_file
    checkpoint_path = self.broadcast(checkpoint_path)
  File "/home/vscode/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 356, in broadcast
    return self.dist.broadcast(obj)
  File "/home/vscode/miniconda3/lib/python3.9/site-packages/pytorch_lightning/distributed/dist.py", line 32, in broadcast
    broadcast_object_list(obj, 0, group=group or _group.WORLD)
  File "/home/vscode/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1700, in broadcast_object_list
    my_rank = get_rank()
  File "/home/vscode/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 725, in get_rank
    default_pg = _get_default_group()
  File "/home/vscode/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

I am using the latest stable PyTorch Lightning release:

(base) vscode@de96ed35bf2d:/workspace$ python -c 'import pytorch_lightning; print(pytorch_lightning.__version__)'
1.4.6

Not sure why it doesn't work for me.
