
[TPU] Correct the check for TPU device in a pod environment #6755

Closed
wants to merge 12 commits

Conversation

jiasenwu

What does this PR do?

#6719 provides a partial fix for #6692; however, I find it still causes a "No TPU devices were found" error in a v2-32 environment. Looking into some internals, I see that xmp.spawn(.., nprocs=1) hangs for an extremely long time and eventually times out. Removing the nprocs setting (which is equivalent to nprocs=8) works fine, so this PR completes the solution in that direction.
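The timeout-guarded probe pattern described above can be sketched with the standard library's multiprocessing module. This is a minimal stand-in: the names `_check_device` and `device_exists`, and the probe's return value, are hypothetical; the real utility spawns via torch_xla's `xmp.spawn` and creates an XLA device in the child process.

```python
import multiprocessing as mp
import queue

def _check_device(result_queue):
    # Hypothetical stand-in for the real probe, which would create an
    # XLA device via torch_xla inside the spawned process.
    result_queue.put(True)

def device_exists(timeout: float = 25.0) -> bool:
    # Run the probe in a separate process bounded by a timeout, so a
    # hanging probe (as reported on TPU pods) cannot block the caller forever.
    result_queue = mp.Queue()
    proc = mp.Process(target=_check_device, args=(result_queue,))
    proc.start()
    try:
        result = result_queue.get(timeout=timeout)
    except queue.Empty:
        result = False
    proc.join(timeout=1.0)
    if proc.is_alive():
        proc.terminate()
    return result

if __name__ == "__main__":
    print(device_exists())
```

The timeout here is the tunable discussed below: on a pod, a child process that hangs in device initialization will exhaust any fixed budget, which is why the PR's direction is to avoid the hanging spawn configuration rather than simply raise the timeout.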

Besides

  • The wait timeout of 25 is not sufficient in the pod case; it is highly likely to be exceeded.
  • Because processes are spawned, it is not possible to define the _TPU_AVAILABLE flag at module scope. I removed it completely and replaced it with _XLA_AVAILABLE (if global) or XLADeviceUtils.tpu_device_exists() (if local).

Fixes #6692
Related to #6719

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@@ -555,7 +555,8 @@ def set_distributed_mode(self, distributed_backend: Optional[str] = None):

 rank_zero_info(f'GPU available: {torch.cuda.is_available()}, used: {self._device_type == DeviceType.GPU}')
 num_cores = self.tpu_cores if self.tpu_cores is not None else 0
-rank_zero_info(f'TPU available: {_TPU_AVAILABLE}, using: {num_cores} TPU cores')
+tpu_available = XLADeviceUtils.tpu_device_exists()
Contributor

We can use _TPU_AVAILABLE there.

Author

_TPU_AVAILABLE is replaced by XLADeviceUtils.tpu_device_exists()

xmp.spawn(_fn, args=(queue, ), nprocs=1)
return queue.get()
# Missing XLA Configuration
except RuntimeError as e:
Contributor

@tchaton tchaton Mar 31, 2021

Comments can be removed if not used

Author

okay

@@ -54,12 +54,12 @@
 _APEX_AVAILABLE,
 _HOROVOD_AVAILABLE,
 _NATIVE_AMP_AVAILABLE,
-_TPU_AVAILABLE,
Contributor

Why did you remove _TPU_AVAILABLE ?

Author

I replaced the _TPU_AVAILABLE flag with a call to XLADeviceUtils.tpu_device_exists(), because determining the value now requires spawning processes, and Python complains (about the process bootstrapping phase not being finished yet) when spawning at module-creation time.
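A common way to sidestep the module-import spawn problem described here is to defer the probe to the first call and cache the result. A minimal sketch, assuming a hypothetical `_probe` stand-in for the spawn-based check:

```python
from functools import lru_cache

def _probe() -> bool:
    # Hypothetical stand-in for the expensive, process-spawning XLA check.
    return True

@lru_cache(maxsize=1)
def tpu_device_exists() -> bool:
    # The probe runs only on the first call, never at module-import time,
    # and the result is cached for every later caller.
    return _probe()

print(tpu_device_exists())
```

Because nothing runs at import, the module can still be imported safely before multiprocessing has finished bootstrapping; only the first caller pays for the probe.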

Contributor

Please, add back _TPU_AVAILABLE. We use this as a common imports API in Lightning and it won't be backward compatible.

@kaushikb11
Contributor

Hi @jiasenwu, could you please add _TPU_AVAILABLE back as Thomas has mentioned? We could take it from there.

@jiasenwu
Author

jiasenwu commented Mar 31, 2021

Hi @jiasenwu, could you please add _TPU_AVAILABLE back as Thomas has mentioned? We could take it from there.

Hi, I certainly don't want to break backward compatibility. In that case I would suggest not testing the actual devices at all, because testing for a TPU device really requires spawning a process. We could instead define _TPU_AVAILABLE as the existence of an XRT* environment variable.
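The environment-variable heuristic suggested above can be sketched as follows. The helper name is hypothetical; `XRT_TPU_CONFIG` is one of the XRT* variables torch_xla reads, and the value shown is only illustrative.

```python
import os

def tpu_env_flag_set() -> bool:
    # Treat a TPU as available when any XRT* environment variable is set,
    # instead of spawning a process to probe the actual device.
    return any(name.startswith("XRT") for name in os.environ)

# Illustrative value only; real TPU VMs export their own XRT configuration.
os.environ["XRT_TPU_CONFIG"] = "tpu_worker;0;localhost:51011"
print(tpu_env_flag_set())  # True, since an XRT* variable is now set
```

This check is cheap and safe at import time, at the cost of trusting the environment rather than verifying that a device actually responds.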

@jiasenwu
Author

For the moment, I will add the flag back soon, keeping only my update to pl_multi_process.

@jiasenwu
Author

jiasenwu commented Mar 31, 2021

Okay, I have reverted all changes related to the global flag _TPU_AVAILABLE. Let me know if there is anything more I can do. I very much appreciate all your work on supporting TPUs (both single device and pod). It would be great if the issue could be addressed early in Q2; as a customer of pytorch-lightning and GCP/TPU, we will likely rely on this feature a lot.

@kaushikb11
Contributor

@jiasenwu Thank you! We appreciate it and will patch the fix ASAP. We will review it shortly and get back to you.

@kaushikb11
Contributor

Hi @jiasenwu, we decided to take a different approach to checking TPU availability in #6767.

@kaushikb11 kaushikb11 closed this Apr 1, 2021
@jiasenwu
Author

jiasenwu commented Apr 1, 2021

Hi @jiasenwu, we decided to take a different approach to checking TPU availability in #6767.

Cool, it is a really good solution with minimal code change!

@tchaton
Contributor

tchaton commented Apr 1, 2021

Dear @jiasenwu,

Would you mind joining the PyTorch Lightning Slack channel? We would really appreciate it if you could help us make sure TPUs work reliably in PyTorch Lightning.

Would you mind trying out this branch: https://github.com/PyTorchLightning/pytorch-lightning/pull/6781/files on a v2-32?

Best,
T.C

@pierric

pierric commented Apr 1, 2021

Dear @jiasenwu,

Would you mind joining the PyTorch Lightning Slack channel? We would really appreciate it if you could help us make sure TPUs work reliably in PyTorch Lightning.

Would you mind trying out this branch: https://github.com/PyTorchLightning/pytorch-lightning/pull/6781/files on a v2-32?

Best,
T.C

Awesome, happy to join. I will be on holiday for a few days, but I should be able to find a slot to test it next Tuesday! Thank you very much :D

Development

Successfully merging this pull request may close these issues.

No TPU devices were found in a TPU pod env.
4 participants