
DeepSpeed + training checkpointing doesn't work #8092

Closed

gahdritz opened this issue Jun 23, 2021 · 5 comments

Labels: bug (Something isn't working), help wanted (Open to be worked on), priority: 1 (Medium priority task)

@gahdritz
Contributor

gahdritz commented Jun 23, 2021

🐛 Bug

It looks like the default checkpoint connector doesn't handle DeepSpeed optimizer checkpointing properly. Among other issues, restore_training_state() (in pytorch_lightning==1.3.7.post0) passes a dictionary to DeepSpeed's load_state_dict(), which appears to expect a list.

Reproduction

To reproduce, train any model with DeepSpeed using one of DeepSpeed's own optimizers (I used FusedAdam) and create a checkpoint. Then attempt to resume from that checkpoint with the Trainer's resume_from_checkpoint argument. That should cause a crash.
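
For concreteness, here's a minimal sketch of the kind of script I mean (it assumes the 1.3.x API and the "deepspeed_stage_2" plugin string; the toy model and checkpoint path are placeholders, not the exact code I ran):

import torch
import pytorch_lightning as pl
from deepspeed.ops.adam import FusedAdam

class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        # one of DeepSpeed's own optimizers triggers the problem
        return FusedAdam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        return torch.utils.data.DataLoader(torch.randn(64, 32), batch_size=8)

# 1) train briefly so a checkpoint gets written
pl.Trainer(gpus=1, precision=16, plugins="deepspeed_stage_2", max_epochs=1).fit(ToyModel())

# 2) resume from it -- this is where load_state_dict() crashes
pl.Trainer(
    gpus=1,
    precision=16,
    plugins="deepspeed_stage_2",
    resume_from_checkpoint="lightning_logs/version_0/checkpoints/epoch=0-step=7.ckpt",  # placeholder path
).fit(ToyModel())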

Here's the trace I get:

Traceback (most recent call last):
  File "dilated_resnet_pl.py", line 578, in <module>
    trainer.fit(model_module, data_module)
  File "/home/ga122/code/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/ga122/code/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/ga122/code/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/ga122/code/venv/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/ga122/code/venv/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/ga122/code/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/ga122/code/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 837, in run_train
    self._pre_training_routine()
  File "/home/ga122/code/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in _pre_training_routine
    self.checkpoint_connector.restore_weights()
  File "/home/ga122/code/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 73, in restore_weights
    self.restore(self.trainer.resume_from_checkpoint, on_gpu=self.trainer._device_type == DeviceType.GPU)
  File "/home/ga122/code/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 102, in restore
    self.restore_training_state(checkpoint, load_optimizer_states)
  File "/home/ga122/code/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 183, in restore_training_state
    optimizer.load_state_dict(opt_state)
  File "/home/ga122/code/venv/lib/python3.6/site-packages/deepspeed/runtime/zero/stage2.py", line 1951, in load_state_dict
    self.loss_scaler = state_dict_list[0]['loss_scaler']
KeyError: 0

At ZeRO stage 1, the issue can be fixed by simply wrapping opt_state in a list, as follows:

optimizer.load_state_dict([opt_state])

However, at higher ZeRO stages, when the optimizer state is partitioned, that doesn't cut it. In that case the optimizer state seems to be stored differently from how DeepSpeed expects it: in deepspeed/runtime/zero/stage2.py, load_state_dict() iterates over the list it is passed, expecting one item per partition, whereas the checkpoint appears to contain a single item holding the state for all partitions (though the lengths don't exactly add up, and I can't really figure out what's going wrong).
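
To make the mismatch concrete, here is a tiny self-contained illustration (no DeepSpeed required; the key names and values are stand-ins based on the traceback above, not the real checkpoint contents):

# what pytorch-lightning stores: one state dict per optimizer
checkpoint = {"optimizer_states": [{"loss_scaler": "stand-in value"}]}
opt_state = checkpoint["optimizer_states"][0]  # a single dict

# what DeepSpeed's stage2 load_state_dict() does with its argument,
# per the traceback: self.loss_scaler = state_dict_list[0]['loss_scaler']
def fake_load_state_dict(state_dict_list):
    return state_dict_list[0]["loss_scaler"]  # expects a list, one entry per partition

try:
    fake_load_state_dict(opt_state)    # what 1.3.x effectively does
except KeyError as err:
    print("KeyError:", err)            # KeyError: 0, as in the traceback

fake_load_state_dict([opt_state])      # the stage-1 workaround above
# At stage 2+, DeepSpeed expects one list entry per partition, so a single
# wrapped dict no longer lines up with the partitioned state.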

I'm running pytorch-lightning==1.3.3 and deepspeed==0.3.17+c1550b8 (compiled from source), though the issue is present in the current pip version of deepspeed and pytorch-lightning==1.3.7.post0.

#7282 is similar, but doesn't report this particular crash, or the fact that the ZeRO stage matters.

@gahdritz added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jun 23, 2021
@Borda added the priority: 1 (Medium priority task) label on Jun 24, 2021
@gahdritz reopened this on Jun 26, 2021
@gahdritz
Contributor Author

I'm no longer able to reproduce the issue using the latest builds of both packages. I'll close this again (hopefully for good this time).

@gahdritz
Contributor Author

gahdritz commented Jun 26, 2021

Sorry to flip-flop, but I've decided that this should remain open after all. The issue was superseded by a different issue, but the pip version (1.3.7.post0) still has it.

@gahdritz reopened this on Jun 26, 2021
@xxchauncey

Hello,

Any updates? I'm stuck on the same error; my pytorch-lightning version is 1.3.8 and deepspeed is 0.4.0.

@SeanNaren added this to the v1.4.x milestone on Jul 26, 2021
@SeanNaren
Contributor

Resuming from a checkpoint isn't supported yet, but support is being worked on in #8397. We're waiting for 1.4 to come out before continuing the changes here, as we'll be introducing a few breaking changes.

@SeanNaren
Contributor

We've merged a lot of fixes for DeepSpeed in #8397 that should allow a checkpoint to be restored fully! This required changing the default saving behaviour to rely fully on DeepSpeed (which saves a directory); you can generate a single file for inference by following these instructions: https://pytorch-lightning.readthedocs.io/en/latest/advanced/advanced_gpu.html#deepspeed-zero-stage-3-single-file. Let us know if you run into any issues!
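
(For anyone finding this later, a hedged sketch of that conversion step, assuming the convert_zero_checkpoint_to_fp32_state_dict utility described on the linked page; the checkpoint path is a placeholder and the exact utility location may differ between versions.)

from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# DeepSpeed now saves the checkpoint as a directory; collate it into a single
# fp32 state-dict file that can be loaded for inference without DeepSpeed.
convert_zero_checkpoint_to_fp32_state_dict(
    "lightning_logs/version_0/checkpoints/epoch=1-step=100.ckpt",  # checkpoint directory (placeholder)
    "lightning_model.pt",                                          # single output file
)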
