Support cpu tensors without direct device invocation #3842
Conversation
@abhilash1910, thanks for this PR. I think this PR needs some work that leverages PR #3633 for the following reasons.
Please share your thoughts. Thanks!
@tjruwase yes, I think that would be a proper fix; instead of having separate dtypes, we can directly leverage the abstract accelerator interface. Let me go through the changes for this. Thanks. (Making this a draft for now.)
@abhilash1910, thanks for your alignment. I will push to get #3633 merged asap. I left some comments there.
Yes sure, I will work with my colleague Yejing to make this work.
@tjruwase I have a question. For the following check, is the check focusing on the data type only, or is data type + device type needed? If only the data type is important, then the proper way is to strip the device type from t.type() and compare against the data type list.
Yes, the focus is on checking the data type supported by the device. My feedback is based on avoiding any assumptions about string-format combinations of device and data type. For example, as shown in this list, the same dtype is formatted differently for cpu and cuda tensors. Please let me know your thoughts.
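For illustration, a minimal sketch of the string-format asymmetry being discussed, and of a dtype-based check that sidesteps it (the supported_dtypes list here is just an example, not the engine's actual list):

```python
import torch

t_cpu = torch.zeros(4, dtype=torch.float64)
print(t_cpu.type())   # 'torch.DoubleTensor' -- no device prefix for cpu tensors
print(t_cpu.dtype)    # torch.float64

if torch.cuda.is_available():
    t_gpu = t_cpu.to("cuda")
    print(t_gpu.type())  # 'torch.cuda.DoubleTensor' -- device prefix appears
    print(t_gpu.dtype)   # torch.float64 -- identical to the cpu tensor

# Comparing t.dtype against a list of torch dtypes avoids string parsing entirely:
supported_dtypes = [torch.float16, torch.bfloat16, torch.float32, torch.float64]
assert t_cpu.dtype in supported_dtypes
```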
Yes, dtype is better. Some additional changes in _reduce_non_expert_gradients and _reduce_expert_gradients will be needed accordingly.
@delock, thanks for the pointer. @abhilash1910, could you please help handle those changes in your PR?
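As a rough illustration of what a dtype-based grouping in those reduction paths could look like (a sketch only; the helper below is hypothetical, not the actual _reduce_non_expert_gradients/_reduce_expert_gradients code):

```python
from collections import defaultdict

import torch


def bucket_grads_by_dtype(params, supported_dtypes):
    """Group gradients by dtype rather than by t.type() strings, so cpu and
    cuda tensors of the same dtype land in the same reduction bucket."""
    buckets = defaultdict(list)
    for p in params:
        if p.grad is None:
            continue
        if p.grad.dtype not in supported_dtypes:
            raise ValueError(f"Unsupported gradient dtype: {p.grad.dtype}")
        buckets[p.grad.dtype].append(p.grad)
    return buckets
```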
@abhilash1910, you raise an important issue with my proposal that I had overlooked. One idea is to populate the list with the supported dtypes themselves. Another potential issue with the existing code is that we don't check whether the dtype of the underlying tensor of a SparseTensor is supported by the accelerator. I need to double-check this concern with my teammates. Therefore, I am now wondering whether it would be better to take a new approach that makes this difference explicit and uses the dtype directly. These are just some thoughts. Please let me know what you think. Thanks!
Yes, my thoughts exactly. I was thinking of adding a dtype getter inside the SparseTensor to make it consistent.
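Something along these lines, perhaps (a simplified sketch, assuming the wrapper stores its payload in a values tensor; DeepSpeed's real SparseTensor class has more fields than this):

```python
import torch


class SparseTensorSketch:
    """Hypothetical stand-in for DeepSpeed's SparseTensor wrapper."""

    def __init__(self, indices: torch.Tensor, values: torch.Tensor, dense_size):
        self.indices = indices
        self.values = values
        self.dense_size = dense_size

    @property
    def dtype(self) -> torch.dtype:
        # Delegate to the underlying values tensor so the same dtype check
        # works uniformly for dense tensors and sparse wrappers.
        return self.values.dtype
```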
@tjruwase could you retrigger CI (the issue seems to be fixed now)? Thanks.
@tjruwase could you help re-trigger the CI and re-review? Much appreciated.
Hi @abhilash1910, can you clarify whether the current failures in CI are related to your PR or just a test issue? Thanks!
@delock I think that it might be a test issue, as I am able to run the CI for the sparse test locally. I changed the code path and I still see the same allclose issue. @tjruwase could you suggest any modifications for this? This is strange, as I tested in an isolated env and did not get the issue.
Hi @abhilash1910, some suggestions:
fix the repetition loop list append
Thanks @inkcherry for highlighting the boundary issue; it seems it will pass the CI now.
I could reproduce the CI issue in my local env, and it passes now.
Hi @tjruwase, the previous error in the CI workflow ...
Hi @abhilash1910, are the following two errors related to your change?
@delock I think this failure is related to this PR, but it seems to be arising after the previous fix. I will take a look at it.
Motivation:
Fix for reproducible issue #3837 on cpu. On cpu, direct invocation of torch.cpu.tensor leads to a dtype mismatch.
Another way would be to have something like:
["torch.DoubleTensor" if device_type == 'cpu' else "torch.{}.DoubleTensor".format(device_type)] for all elements in the supported list, but that would eliminate "torch.cpu.DoubleTensor", etc. from the scope.
@jeffra requesting review.
CLA is signed
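For reference, the alternative mentioned above could look roughly like this (a sketch; the exact supported list used by the engine may differ):

```python
from deepspeed.accelerator import get_accelerator

device_type = get_accelerator().device_name()  # e.g. 'cuda', 'cpu', 'xpu'

base_types = ["HalfTensor", "BFloat16Tensor", "FloatTensor", "DoubleTensor"]
supported_types = [
    f"torch.{name}" if device_type == "cpu" else f"torch.{device_type}.{name}"
    for name in base_types
]
# cpu tensors report e.g. 'torch.DoubleTensor' with no device prefix, which is
# exactly the asymmetry this PR works around.
```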