[TPU] Correct the check for TPU device in a pod environment #6755
Conversation
```
@@ -555,7 +555,8 @@ def set_distributed_mode(self, distributed_backend: Optional[str] = None):
        rank_zero_info(f'GPU available: {torch.cuda.is_available()}, used: {self._device_type == DeviceType.GPU}')
        num_cores = self.tpu_cores if self.tpu_cores is not None else 0
        rank_zero_info(f'TPU available: {_TPU_AVAILABLE}, using: {num_cores} TPU cores')
        tpu_available = XLADeviceUtils.tpu_device_exists()
```
We can use _TPU_AVAILABLE there.
_TPU_AVAILABLE is replaced by XLADeviceUtils.tpu_device_exists().
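For clarity, a minimal, self-contained sketch of the two variants discussed in this hunk. The import paths and the standalone tpu_cores variable are assumptions for illustration; in the actual code this lives inside set_distributed_mode and uses self.tpu_cores.

```python
# Hedged sketch -- import paths are assumed, not verified against this exact release.
from pytorch_lightning.utilities import _TPU_AVAILABLE, rank_zero_info
from pytorch_lightning.utilities.xla_device import XLADeviceUtils

tpu_cores = 8  # illustrative stand-in for self.tpu_cores

num_cores = tpu_cores if tpu_cores is not None else 0

# Variant in this PR: query the device lazily at call time.
tpu_available = XLADeviceUtils.tpu_device_exists()
rank_zero_info(f'TPU available: {tpu_available}, using: {num_cores} TPU cores')

# Variant suggested in review: reuse the module-level _TPU_AVAILABLE flag.
rank_zero_info(f'TPU available: {_TPU_AVAILABLE}, using: {num_cores} TPU cores')
```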
```
        xmp.spawn(_fn, args=(queue, ), nprocs=1)
        return queue.get()
    # Missing XLA Configuration
    except RuntimeError as e:
```
Comments can be removed if not used
okay
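For readers following the diff, a minimal sketch of what a spawn-based probe of this shape looks like; _probe and tpu_device_exists are illustrative names and this is not the exact Lightning implementation.

```python
# Hedged sketch of a spawn-based TPU probe (illustrative names, not Lightning's code).
from multiprocessing import Manager

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _probe(index, queue):
    # Runs in the spawned worker; succeeds only if an XLA device can be acquired.
    xm.xla_device()
    queue.put(True)


def tpu_device_exists() -> bool:
    queue = Manager().Queue()
    try:
        xmp.spawn(_probe, args=(queue,), nprocs=1)
        return queue.get()
    except RuntimeError:
        # Typically surfaces as "Missing XLA configuration" on hosts without a TPU.
        return False
```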
```
@@ -54,12 +54,12 @@
    _APEX_AVAILABLE,
    _HOROVOD_AVAILABLE,
    _NATIVE_AMP_AVAILABLE,
    _TPU_AVAILABLE,
```
Why did you remove _TPU_AVAILABLE?
I replaced the _TPU_AVAILABLE flag with a call to XLADeviceUtils.tpu_device_exists(), because the check now spawns a process to decide the value, and I see Python complaining (something about not being frozen yet) when it spawns at module-import time.
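To illustrate the failure mode being described: evaluating a spawn-based probe while a module is still being imported can trigger Python's multiprocessing bootstrapping complaint, whereas deferring the call avoids it. This is an assumed reading of the "not frozen yet" message, and the module and function names below are illustrative.

```python
# Hedged sketch -- illustrative module, not Lightning's actual constants file.
from functools import lru_cache

from pytorch_lightning.utilities.xla_device import XLADeviceUtils  # assumed import path

# Problematic pattern: the probe spawns a process while this module is still
# being imported, which is when multiprocessing can complain along the lines of
# "An attempt has been made to start a new process before the current process
# has finished its bootstrapping phase ..." (plausibly the "not frozen yet" complaint).
# _TPU_AVAILABLE = XLADeviceUtils.tpu_device_exists()


@lru_cache(maxsize=1)
def tpu_available() -> bool:
    # Deferred pattern: the spawn happens on first call, after import finishes.
    return XLADeviceUtils.tpu_device_exists()
```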
Please add _TPU_AVAILABLE back. We use it as part of the common imports API in Lightning, and removing it won't be backward compatible.
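The backward-compatibility point is that the flag is something downstream code imports directly. A tiny illustrative example; the import path reflects the layout around this release and is an assumption.

```python
# Hedged example of downstream usage that would break if the flag were removed.
from pytorch_lightning.utilities import _TPU_AVAILABLE  # assumed import path

if _TPU_AVAILABLE:
    print("running the TPU-specific path")
else:
    print("running the CPU/GPU path")
```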
Hi @jiasenwu, could you please add
Hi, I certainly don't want to break backward compatibility. Then I would suggest not testing the actual devices at all, because testing for the TPU device really requires a spawn. We can set
For the moment, I will add the flag back soon, keeping only my update to
Okay, I have reverted all changes related to the global flag _TPU_AVAILABLE. Let me know if there is anything more I can do. I appreciate very much all your work on supporting TPUs (both single device and pod). In particular, if the issue can be addressed early in Q2, we may rely on the feature a lot as a customer of pytorch-lightning and GCP/TPU.
@jiasenwu Thank you! We appreciate it, and we will patch the fix ASAP. We will review it shortly and get back to you.
Dear @jiasenwu, would you mind joining the PyTorch Lightning Slack channel? We would really like your help in making sure TPUs work reliably in PyTorch Lightning. Would you also mind trying out this branch on a v2-32: https://github.com/PyTorchLightning/pytorch-lightning/pull/6781/files ? Best,
Awesome, happy to join. I will be on holiday for a few days, and then I should be able to find a slot to test it next Tuesday! Thank you very much :D
What does this PR do?
#6719 provides a partial fix to #6692; however, I find it still causes a "No TPU devices were found" error in a v2-32 environment. I looked into some internals and saw that xmp.spawn(.., nprocs=1) hangs for an extremely long time and eventually times out. I tried removing the nprocs setting (which is equivalent to nprocs = 8) and it works fine. This PR completes the solution in this direction. Besides
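For illustration, a minimal sketch of the direction described above; the probe function and its name are assumptions, not the exact Lightning code.

```python
# Hedged sketch of the nprocs change (illustrative names, not Lightning's code).
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _probe(index):
    # Each spawned worker simply tries to acquire an XLA device.
    xm.xla_device()


# Before: pinning the probe to a single process hangs for a very long time and
# eventually times out on a v2-32 pod slice.
# xmp.spawn(_probe, nprocs=1)

# After: omit nprocs so xmp.spawn uses all local cores (equivalent to nprocs=8
# on a v2/v3 host), which returns promptly in the pod environment.
xmp.spawn(_probe)
```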
Fixes #6692
Related to #6719
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the review guidelines.
Did you have fun?
Make sure you had fun coding 🙃