
universal resume checkpoint from deepspeed #19585

Closed
pengzhangzhi opened this issue Mar 6, 2024 · 1 comment
Labels
feature · strategy: deepspeed

Comments


pengzhangzhi commented Mar 6, 2024

Description & Motivation

One pain point in training with DeepSpeed is that when resuming from a checkpoint, you have to use the same number of GPUs that the checkpoint was trained on. Otherwise, you will see the following error:

deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 32 but the current world size is 128. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.

See this issue: deepspeedai/DeepSpeed#3810

Also, once the model is trained and we want to run inference, loading a DeepSpeed checkpoint is a problem because it requires the same number of GPUs as training.

DeepSpeed now provides universal checkpointing, which converts a DeepSpeed checkpoint into a universal checkpoint that can be loaded on any number of GPUs. Please refer to this link:
https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing#zero-stage-2-training
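
For reference, a rough sketch of the conversion step from that example, assuming DeepSpeed's ds_to_universal.py script and its --input_folder/--output_folder flags; the script path and checkpoint folder names below are placeholders:

```python
# Sketch only: run DeepSpeed's universal-checkpoint conversion on a saved ZeRO checkpoint.
# The script location and folder names are assumptions based on the linked example.
import subprocess

subprocess.run(
    [
        "python",
        "DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",        # path inside a DeepSpeed checkout (assumption)
        "--input_folder", "checkpoints/global_step1000",            # ZeRO checkpoint written during training
        "--output_folder", "checkpoints/global_step1000_universal", # universal checkpoint output
    ],
    check=True,
)
```

The resulting universal checkpoint can then be loaded with a different DP world size.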

Would Lightning integrate this feature?

Also, a related use case: is there any way to load only the model weights from a checkpoint while ignoring other state, such as the optimizer?
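
One possible workaround for weights-only loading (not an official Lightning API, just a sketch assuming DeepSpeed's zero_to_fp32 utility and a hypothetical MyLightningModule):

```python
# Sketch: consolidate only the fp32 model weights from a ZeRO checkpoint directory
# and load them into the module, discarding optimizer/scheduler state.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Placeholder path: with the DeepSpeed strategy this ".ckpt" is a directory of shards.
ckpt_dir = "lightning_logs/version_0/checkpoints/epoch=9-step=1000.ckpt"
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)

model = MyLightningModule()                      # hypothetical LightningModule
model.load_state_dict(state_dict, strict=False)  # weights only; optimizer state is ignored
```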

Pitch

No response

Alternatives

No response

Additional context

No response

cc @Borda @awaelchli

@pengzhangzhi added the feature and needs triage labels on Mar 6, 2024
@awaelchli added the strategy: deepspeed label and removed the needs triage label on Mar 8, 2024
@awaelchli
Contributor

@pengzhangzhi Do you mean this?

https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/deepspeed.html#collating-single-file-checkpoint-for-deepspeed-zero-stage-3

After converting a checkpoint as shown there, you should be able to load it on a different number of GPUs.
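
Concretely, following that docs page (paths and the module class below are placeholders):

```python
# From the linked Lightning docs: collate the sharded DeepSpeed checkpoint directory
# into a single consolidated file that no longer depends on the training world size.
from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# "epoch=9-step=1000.ckpt" is the checkpoint *directory* written by the DeepSpeed strategy
convert_zero_checkpoint_to_fp32_state_dict(
    "lightning_logs/version_0/checkpoints/epoch=9-step=1000.ckpt",
    "consolidated.ckpt",
)

model = MyLightningModule.load_from_checkpoint("consolidated.ckpt")  # hypothetical LightningModule class
```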
