
universal resume checkpoint from deepspeed #19585

Closed
pengzhangzhi opened this issue Mar 6, 2024 · 1 comment
Labels
feature · strategy: deepspeed

Comments


pengzhangzhi commented Mar 6, 2024

Description & Motivation

One pain point in training with DeepSpeed is that when resuming from a checkpoint, you have to use the same number of GPUs that the checkpoint was trained on. Otherwise, you will see the following error:

deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 32 but the current world size is 128. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.

See this issue: deepspeedai/DeepSpeed#3810

Also, once the model is trained and we want to run inference, loading a DeepSpeed checkpoint is a problem because it requires the same number of GPUs as training.

DeepSpeed now provides universal checkpointing, which converts a DeepSpeed checkpoint into a universal checkpoint that can be loaded on any number of GPUs. Please refer to this link:
https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing#zero-stage-2-training
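
For reference, a rough sketch of the conversion step from that example, assuming DeepSpeed's ds_to_universal.py script and its --input_folder/--output_folder flags; the script path and checkpoint folder names below are placeholders:

```python
# Sketch only: run DeepSpeed's universal-checkpoint conversion on a saved ZeRO checkpoint.
# The script location and folder names are assumptions based on the linked example.
import subprocess

subprocess.run(
    [
        "python",
        "DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",        # path inside a DeepSpeed checkout (assumption)
        "--input_folder", "checkpoints/global_step1000",            # ZeRO checkpoint written during training
        "--output_folder", "checkpoints/global_step1000_universal", # universal checkpoint output
    ],
    check=True,
)
```

The resulting universal checkpoint can then be loaded with a different DP world size.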

Would Lightning integrate this feature?

Also, a related use case: is there any way to load only the model weights from a checkpoint while ignoring other state, such as the optimizer?
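
One possible workaround for weights-only loading (not an official Lightning API, just a sketch assuming DeepSpeed's zero_to_fp32 utility and a hypothetical MyLightningModule):

```python
# Sketch: consolidate only the fp32 model weights from a ZeRO checkpoint directory
# and load them into the module, discarding optimizer/scheduler state.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Placeholder path: with the DeepSpeed strategy this ".ckpt" is a directory of shards.
ckpt_dir = "lightning_logs/version_0/checkpoints/epoch=9-step=1000.ckpt"
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)

model = MyLightningModule()                      # hypothetical LightningModule
model.load_state_dict(state_dict, strict=False)  # weights only; optimizer state is ignored
```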

Pitch

No response

Alternatives

No response

Additional context

No response

cc @Borda @awaelchli

@pengzhangzhi added the feature and needs triage labels on Mar 6, 2024
@awaelchli added the strategy: deepspeed label and removed the needs triage label on Mar 8, 2024
@awaelchli
Contributor

@pengzhangzhi Do you mean this?

https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/deepspeed.html#collating-single-file-checkpoint-for-deepspeed-zero-stage-3

After converting a checkpoint as shown there, you should be able to load it on a different number of GPUs.
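
Concretely, following that docs page (paths and the module class below are placeholders):

```python
# From the linked Lightning docs: collate the sharded DeepSpeed checkpoint directory
# into a single consolidated file that no longer depends on the training world size.
from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# "epoch=9-step=1000.ckpt" is the checkpoint *directory* written by the DeepSpeed strategy
convert_zero_checkpoint_to_fp32_state_dict(
    "lightning_logs/version_0/checkpoints/epoch=9-step=1000.ckpt",
    "consolidated.ckpt",
)

model = MyLightningModule.load_from_checkpoint("consolidated.ckpt")  # hypothetical LightningModule class
```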
