DeepSpeed Stage 2 Tensors on Different Devices #9521
Comments
Hey @kelvins64, thanks for sharing a script. I can confirm I can reproduce this bug on master. Best,
Looking into the DeepSpeed engine, I noticed that there is an assumption regarding the local rank: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L596-L604 It seems the assumption is that the GPU rank is the same as the local rank of the machine (i.e. on a 4-GPU machine, each process's local rank of 0 to 3 matches the GPU rank). This wouldn't be the case if you specified certain GPU IDs, as in this script. A solution is to introduce a …
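To make the mismatch concrete, here is a minimal sketch; the function names are mine for illustration, not the actual Lightning or DeepSpeed internals. Lightning maps each process's local rank onto the user-selected GPU IDs, while code that assumes "GPU id == local rank" always lands on cuda:local_rank.

```python
# Illustrative sketch only, assuming hypothetical helper names; not the real
# Lightning/DeepSpeed code paths referenced in the linked engine.py lines.
import os
import torch


def selected_device(selected_gpu_ids, local_rank):
    # Lightning-style mapping: gpus=[1] means local_rank 0 should use cuda:1.
    return torch.device("cuda", selected_gpu_ids[local_rank])


def assumed_device(local_rank):
    # The assumption described above: GPU id is taken to equal the local rank.
    return torch.device("cuda", local_rank)


if __name__ == "__main__":
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    print("expected:", selected_device([1], local_rank))  # cuda:1
    print("assumed: ", assumed_device(local_rank))        # cuda:0 -> tensors on different devices
```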
The associated DeepSpeed PR has been merged; once a release has been made we can include this fix in Lightning!
Good
@SeanNaren, any update on this?
Still waiting on DeepSpeed to make a release; I'll ping them to see if we can get this done sooner! cc @jeffra
@SeanNaren, v0.5.4 is now released to PyPI: https://pypi.org/project/deepspeed/0.5.4/ This should include the PR in question :)
Thanks everyone! This should now be fixed on Lightning master with the latest DeepSpeed version (0.5.4).
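For anyone retrying after the fix, a quick sketch to confirm the environment picks up the patched versions (0.5.4 is the DeepSpeed release mentioned above; the Lightning fix is on master at the time of this thread):

```python
# Sketch: print installed versions to confirm the fix is available.
import deepspeed
import pytorch_lightning as pl

print("deepspeed:", deepspeed.__version__)           # expect >= 0.5.4
print("pytorch_lightning:", pl.__version__)          # expect a build containing the master fix
```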
🐛 Bug
Attempting to run Trainer.fit with GPUs other than cuda:0 with the DeepSpeed ZeRO Stage 2 plugin results in RuntimeError: Expected all tensors to be on the same device, but found at least two devices.
To Reproduce
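The original reproduction script is not shown here; below is a minimal sketch of the kind of script that triggers the error, assuming a Lightning version from around this report where DeepSpeed ZeRO stage 2 is selected via plugins="deepspeed_stage_2" and a GPU other than cuda:0 is requested.

```python
# Hedged reproduction sketch (model and data are placeholders, not the
# reporter's original script).
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(32)


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = Trainer(
        gpus=[1],                     # any GPU other than cuda:0 triggers the error
        plugins="deepspeed_stage_2",  # DeepSpeed ZeRO stage 2
        precision=16,
        max_epochs=1,
    )
    trainer.fit(BoringModel(), DataLoader(RandomDataset(), batch_size=8))
```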
The error:
Environment
- How you installed PyTorch (conda, pip, source): pip
- Output of torch.__config__.show():
Additional context