resume from checkpoint fails in current master with deepspeed stage 2 #8344
Comments
Pinging @tchaton @SeanNaren as discussed on Slack.
Same issue here, need help.
We've merged a lot of fixes for DeepSpeed in #8397 that should allow a checkpoint to be restored fully! This required changing the default saving method to rely fully on DeepSpeed (which saves a directory). You can generate a single file for inference by following these instructions: https://pytorch-lightning.readthedocs.io/en/latest/advanced/advanced_gpu.html#deepspeed-zero-stage-3-single-file Let us know if you run into any issues!
I tested the reproduction script above with version 1.4.6, but the issue is still there.
@eelxpeng try with current master.
@gurvindersingh Yes, the master branch works. Thanks a lot. I think the release doc for version 1.4.6 is misleading; the issue apparently still exists in version 1.4.6.
1.4.9 still fails with the same error.
Dear @HMJiangGatech, would you mind trying out Lightning 1.5 rc?
1.5 rc itself looks fine, but it failed to load my 1.4.9 checkpoint when using DeepSpeed. 🤣
🐛 Bug
Resuming a model from a stored checkpoint with DeepSpeed stage 2 fails with an exception.
Please reproduce using the BoringModel
Run the following code snippet once to completion, then uncomment the resume_from_checkpoint parameter to Trainer and run it again; you will see the exception.
To Reproduce
Run the given code snippet to reproduce.
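A minimal sketch of what such a reproduction might look like, assuming the Lightning 1.4-era Trainer API (the `plugins="deepspeed_stage_2"` string and the `resume_from_checkpoint` argument) and one CUDA GPU with DeepSpeed installed. This is not the reporter's original script: the checkpoint path, the model, the dataset and the helper names `trainer_kwargs` and `run` are all hypothetical.

```python
# Hypothetical reproduction sketch, not the script from the original report.
CKPT_PATH = "checkpoints/last.ckpt"  # assumed path; DeepSpeed may write a directory here


def trainer_kwargs(resume_from=None):
    """Build Trainer arguments for a DeepSpeed stage 2 run (assumed 1.4-style API)."""
    kwargs = dict(gpus=1, precision=16, plugins="deepspeed_stage_2", max_epochs=1)
    if resume_from is not None:
        # This is the argument the report says to uncomment on the second run.
        kwargs["resume_from_checkpoint"] = resume_from
    return kwargs


def run(resume_from=None):
    # Imported lazily so trainer_kwargs can be inspected without a GPU stack.
    import torch
    import pytorch_lightning as pl
    from torch.utils.data import DataLoader, Dataset

    class RandomDataset(Dataset):
        def __init__(self, size, length):
            self.data = torch.randn(length, size)

        def __len__(self):
            return len(self.data)

        def __getitem__(self, index):
            return self.data[index]

    class BoringModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            # Dummy loss: the sum of the outputs is enough to drive an optimizer step.
            return self(batch).sum()

        def configure_optimizers(self):
            return torch.optim.SGD(self.layer.parameters(), lr=0.1)

    trainer = pl.Trainer(**trainer_kwargs(resume_from))
    trainer.fit(BoringModel(), DataLoader(RandomDataset(32, 64), batch_size=2))
    trainer.save_checkpoint(CKPT_PATH)


# First run:  run()
# Second run: run(resume_from=CKPT_PATH)
```

Under these assumptions, the first call trains and saves a checkpoint, and the second call is the one that would exercise the resume path described in this report.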
Expected behavior
Model training resumes from the stored checkpoint.
Environment
Installation method (conda, pip, source): conda