DeepSpeed + training checkpointing doesn't work #8092
Comments
I'm no longer able to reproduce the issue using the latest builds of both packages. I'll close this again (hopefully for good this time).
Sorry to flip-flop, but I've decided that this should remain open after all. The issue was superseded by a different issue, but the pip version (1.3.7.post0) still has it.
Hello, any updates? I ran into the same error; my pytorch-lightning version is 1.3.8 and deepspeed is 0.4.0.
Resume from checkpoint isn't supported yet, but support is being worked on in #8397. We're waiting for 1.4 to come out before continuing the changes here, as we'll be introducing a few breaking changes.
We've merged a lot of fixes for DeepSpeed in #8397 that should allow a checkpoint to be restored fully! This required changing the default saving method to rely fully on DeepSpeed (which saves a directory); you can generate a single file for inference by following these instructions: https://pytorch-lightning.readthedocs.io/en/latest/advanced/advanced_gpu.html#deepspeed-zero-stage-3-single-file. Let us know if you run into any issues!
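For reference, the single-file conversion from the linked docs boils down to something like this (a minimal sketch with hypothetical paths; the helper lives in pytorch_lightning.utilities.deepspeed in the newer releases the comment refers to, so check the import against your installed version):

```python
from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# The .ckpt path is the directory DeepSpeed saved during training (hypothetical
# path here); the second argument is the consolidated single-file output.
convert_zero_checkpoint_to_fp32_state_dict(
    "lightning_logs/version_0/checkpoints/epoch=0-step=100.ckpt",
    "single_model.pt",
)
```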
🐛 Bug
It looks like the default checkpoint connector doesn't handle DeepSpeed optimizer checkpointing properly. Among other issues, `restore_training_state()` (in pytorch_lightning==1.3.7.post0) passes DeepSpeed's `load_state_dict()` a dictionary, when it seems to expect a list.
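You can confirm this by inspecting a checkpoint produced as described under Reproduction below (hypothetical path; assumes the single-file checkpoint format pytorch-lightning 1.3.x writes):

```python
import torch

# Hypothetical checkpoint path from a DeepSpeed training run.
ckpt = torch.load("epoch=0-step=7.ckpt", map_location="cpu")

# Lightning stores one entry per optimizer; each entry is a plain dict, and that
# dict is what restore_training_state() forwards to DeepSpeed's load_state_dict().
print(type(ckpt["optimizer_states"][0]))  # <class 'dict'>, not a list
```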
Reproduction
To reproduce, train any model with DeepSpeed, using one of DeepSpeed's optimizers (I used FusedAdam), and create a checkpoint. Attempt to load that checkpoint with the Trainer's `resume_from_checkpoint` option. That should cause a crash.
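A minimal sketch of such a run (boiled down, not my actual training script; the model, data, and paths are made up, and it assumes the DeepSpeedPlugin/resume_from_checkpoint API of pytorch-lightning 1.3.x):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.plugins import DeepSpeedPlugin
from deepspeed.ops.adam import FusedAdam


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return FusedAdam(self.parameters(), lr=1e-3)


def loader():
    return DataLoader(
        TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))),
        batch_size=8,
    )


# First run: train briefly so a checkpoint gets written.
trainer = pl.Trainer(
    gpus=1, precision=16, max_epochs=1,
    plugins=DeepSpeedPlugin(stage=2),   # crashes at stage 1 as well
    default_root_dir="ds_repro",        # hypothetical output dir
)
trainer.fit(ToyModel(), loader())

# Second run: resuming from that checkpoint dies in restore_training_state().
trainer = pl.Trainer(
    gpus=1, precision=16, max_epochs=2,
    plugins=DeepSpeedPlugin(stage=2),
    resume_from_checkpoint="ds_repro/lightning_logs/version_0/checkpoints/epoch=0-step=7.ckpt",  # hypothetical
)
trainer.fit(ToyModel(), loader())
```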
Here's the trace I get:
At ZeRO stage 1, the issue can be fixed by simply wrapping `opt_state` in a list, as follows: `optimizer.load_state_dict([opt_state])`.
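A stopgap that avoids touching Lightning internals (a speculative sketch, not anything from either codebase): since the connector passes through whatever sits in the checkpoint's optimizer_states entry, pre-wrapping each entry in a list before resuming amounts to the same stage-1 fix. Paths are hypothetical:

```python
import torch

# Hypothetical paths; assumes the single-file checkpoint pytorch-lightning 1.3.x
# writes at ZeRO stage 1.
src = "ds_repro/lightning_logs/version_0/checkpoints/epoch=0-step=7.ckpt"
dst = "ds_repro/patched.ckpt"

ckpt = torch.load(src, map_location="cpu")

# Each entry in 'optimizer_states' is a plain dict; DeepSpeed's ZeRO optimizer
# expects a list of per-partition state dicts, so wrap each entry.
ckpt["optimizer_states"] = [[state] for state in ckpt["optimizer_states"]]

torch.save(ckpt, dst)
# Then resume with resume_from_checkpoint=dst instead of the original file.
```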
However, at higher levels of ZeRO optimization, when the optimizer state is partitioned, that doesn't cut it. In that case, it seems like the optimizer state is being stored differently from how DeepSpeed expects it: in `deepspeed/runtime/zero/stage2.py`, they iterate over the `opt_state` list passed to `load_state_dict`, expecting there to be one item per partition. The checkpoint seems to actually contain one item with the state for all partitions (though the lengths don't exactly add up; I can't really figure out what's going wrong).

I'm running pytorch-lightning==1.3.3 and deepspeed==0.3.17+c1550b8 (compiled from source), though the issue is also present in the current pip version of deepspeed and pytorch-lightning==1.3.7.post0.
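To see how the partitioned state was actually serialized, dumping the top-level structure of each optimizer entry is enough (hypothetical path; the exact keys differ between DeepSpeed versions, so this just prints whatever is there along with lengths):

```python
import torch

# Hypothetical path to a checkpoint from a ZeRO stage 2 run.
ckpt = torch.load(
    "ds_repro/lightning_logs/version_0/checkpoints/epoch=0-step=7.ckpt",
    map_location="cpu",
)

for i, state in enumerate(ckpt["optimizer_states"]):
    print(f"optimizer {i}: {type(state).__name__}")
    if isinstance(state, dict):
        for key, value in state.items():
            length = len(value) if hasattr(value, "__len__") else "-"
            print(f"  {key}: {type(value).__name__}, len={length}")
```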
#7282 is similar, but doesn't report this particular crash, or the fact that the ZeRO stage matters.