Replies: 1 comment
-
I think this might be a bug. A persistent param should not take part in partitioning once its forward and backward passes are complete; instead, an all-reduce should be applied to the param itself after the whole step has finished.
-
Hi there,
I noticed there is an all_gather step in the `_post_step` function of stage3. Is all-gather used there, instead of all-reduce, because the gradients of the persistent parameters are already synchronized via reduce-scatter in the backward pass?
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1707
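
To make the question concrete, here is a minimal sketch of the general identity I have in mind: reduce-scatter of the gradients, a local update of each rank's shard, and then an all-gather of the updated shards should give the same parameters as an all-reduce followed by a full update on every rank. This is a plain-Python simulation with ranks modeled as lists, not DeepSpeed code. The helper names, the summed (not averaged) gradients, and the plain SGD update are all assumptions made for illustration.

```python
# Simulated collectives: per_rank[r] is the buffer held by rank r.
# All helpers are hypothetical stand-ins, not DeepSpeed or torch.distributed APIs.

def all_reduce(per_rank):
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]

def reduce_scatter(per_rank):
    """Rank r ends up with only the r-th shard of the element-wise sum."""
    world = len(per_rank)
    shard = len(per_rank[0]) // world  # assumes even divisibility
    total = [sum(vals) for vals in zip(*per_rank)]
    return [total[r * shard:(r + 1) * shard] for r in range(world)]

def all_gather(shards):
    """Every rank ends up with the concatenation of all shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]

def sgd(params, grads, lr):
    """Plain SGD update, chosen only to keep the sketch simple."""
    return [p - lr * g for p, g in zip(params, grads)]

lr = 0.5
params = [1.0, 2.0, 3.0, 4.0]            # replicated on both ranks
grads = [[2.0, 4.0, 6.0, 8.0],           # local gradients on rank 0
         [2.0, 0.0, 2.0, 0.0]]           # local gradients on rank 1
world = len(grads)
shard = len(params) // world

# Path A: all-reduce the gradients, then each rank updates the full parameter.
path_a = [sgd(params, g, lr) for g in all_reduce(grads)]

# Path B: reduce-scatter the gradients, update only the local shard,
# then all-gather the updated shards.
grad_shards = reduce_scatter(grads)
param_shards = [
    sgd(params[r * shard:(r + 1) * shard], grad_shards[r], lr)
    for r in range(world)
]
path_b = all_gather(param_shards)

print(path_a == path_b)
```

If that reading is right, the all-gather in `_post_step` would be redistributing updated parameter shards rather than synchronizing gradients, but I have not confirmed this against the stage3 source.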