
[TPU] Correct the check for TPU device in a pod environment #6755

Closed
wants to merge 12 commits

Conversation

jiasenwu

What does this PR do?

#6719 provides a partial fix for #6692; however, I find it still causes a "No TPU devices were found" error in a v2-32 environment. Looking into some internals, I see that xmp.spawn(.., nprocs=1) hangs for an extremely long time and eventually times out. Removing the nprocs setting (which is equivalent to nprocs=8) works fine, so this PR completes the solution in that direction.
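The timeout-guarded probe pattern described above can be sketched with the standard library's multiprocessing module. This is a minimal stand-in: the names `_check_device` and `device_exists`, and the probe's return value, are hypothetical; the real utility spawns via torch_xla's `xmp.spawn` and creates an XLA device in the child process.

```python
import multiprocessing as mp
import queue

def _check_device(result_queue):
    # Hypothetical stand-in for the real probe, which would create an
    # XLA device via torch_xla inside the spawned process.
    result_queue.put(True)

def device_exists(timeout: float = 25.0) -> bool:
    # Run the probe in a separate process bounded by a timeout, so a
    # hanging probe (as reported on TPU pods) cannot block the caller forever.
    result_queue = mp.Queue()
    proc = mp.Process(target=_check_device, args=(result_queue,))
    proc.start()
    try:
        result = result_queue.get(timeout=timeout)
    except queue.Empty:
        result = False
    proc.join(timeout=1.0)
    if proc.is_alive():
        proc.terminate()
    return result

if __name__ == "__main__":
    print(device_exists())
```

The timeout here is the tunable discussed below: on a pod, a child process that hangs in device initialization will exhaust any fixed budget, which is why the PR's direction is to avoid the hanging spawn configuration rather than simply raise the timeout.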

Besides

  • The wait timeout of 25 is not sufficient in the pod case; it is highly likely to be exceeded.
  • Because processes are spawned, it is not possible to define the _TPU_AVAILABLE flag at module scope. I removed it completely and replaced it with _XLA_AVAILABLE (if global) or XLADeviceUtils.tpu_device_exists() (if local).

Fixes #6692
Related to #6719

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@@ -555,7 +555,8 @@ def set_distributed_mode(self, distributed_backend: Optional[str] = None):

 rank_zero_info(f'GPU available: {torch.cuda.is_available()}, used: {self._device_type == DeviceType.GPU}')
 num_cores = self.tpu_cores if self.tpu_cores is not None else 0
-rank_zero_info(f'TPU available: {_TPU_AVAILABLE}, using: {num_cores} TPU cores')
+tpu_available = XLADeviceUtils.tpu_device_exists()
Contributor

We can use _TPU_AVAILABLE there.

Author

_TPU_AVAILABLE is replaced by XLADeviceUtils.tpu_device_exists()

xmp.spawn(_fn, args=(queue, ), nprocs=1)
return queue.get()
# Missing XLA Configuration
except RuntimeError as e:
Contributor

@tchaton tchaton Mar 31, 2021

Comments can be removed if not used

Author

okay

@@ -54,12 +54,12 @@
 _APEX_AVAILABLE,
 _HOROVOD_AVAILABLE,
 _NATIVE_AMP_AVAILABLE,
-_TPU_AVAILABLE,
Contributor

Why did you remove _TPU_AVAILABLE ?

Author

I replaced the _TPU_AVAILABLE flag with a call to XLADeviceUtils.tpu_device_exists(), because determining the value now requires spawning processes, and Python complains (about the process bootstrapping phase not being finished yet) when spawning at module-creation time.
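A common way to sidestep the module-import spawn problem described here is to defer the probe to the first call and cache the result. A minimal sketch, assuming a hypothetical `_probe` stand-in for the spawn-based check:

```python
from functools import lru_cache

def _probe() -> bool:
    # Hypothetical stand-in for the expensive, process-spawning XLA check.
    return True

@lru_cache(maxsize=1)
def tpu_device_exists() -> bool:
    # The probe runs only on the first call, never at module-import time,
    # and the result is cached for every later caller.
    return _probe()

print(tpu_device_exists())
```

Because nothing runs at import, the module can still be imported safely before multiprocessing has finished bootstrapping; only the first caller pays for the probe.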

Contributor

Please, add back _TPU_AVAILABLE. We use this as a common imports API in Lightning and it won't be backward compatible.

@kaushikb11
Contributor

Hi @jiasenwu, could you please add _TPU_AVAILABLE back as Thomas has mentioned? We could take it from there.

@jiasenwu
Author

jiasenwu commented Mar 31, 2021

Hi @jiasenwu, could you please add _TPU_AVAILABLE back as Thomas has mentioned? We could take it from there.

Hi, I certainly don't want to break backward compatibility. In that case I would suggest not testing the actual devices at all, because testing for a TPU device really requires spawning a process. We could instead define _TPU_AVAILABLE as the existence of an XRT* environment variable.
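The environment-variable heuristic suggested above can be sketched as follows. The helper name is hypothetical; `XRT_TPU_CONFIG` is one of the XRT* variables torch_xla reads, and the value shown is only illustrative.

```python
import os

def tpu_env_flag_set() -> bool:
    # Treat a TPU as available when any XRT* environment variable is set,
    # instead of spawning a process to probe the actual device.
    return any(name.startswith("XRT") for name in os.environ)

# Illustrative value only; real TPU VMs export their own XRT configuration.
os.environ["XRT_TPU_CONFIG"] = "tpu_worker;0;localhost:51011"
print(tpu_env_flag_set())  # True, since an XRT* variable is now set
```

This check is cheap and safe at import time, at the cost of trusting the environment rather than verifying that a device actually responds.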

@jiasenwu
Author

For the moment, I will add the flag back soon, keeping only my update to pl_multi_process.

@jiasenwu
Author

jiasenwu commented Mar 31, 2021

Okay, I have reverted all changes related to the global flag _TPU_AVAILABLE. Let me know if there is anything more I can do. I very much appreciate all your work on supporting TPUs (both single device and pod). It would be great if the issue could be addressed early in Q2; as a customer of pytorch-lightning and GCP/TPU, we will likely rely on this feature a lot.

@kaushikb11
Contributor

@jiasenwu Thank you! We appreciate it and will patch the fix ASAP. We will review it shortly and get back to you.

@kaushikb11
Contributor

Hi @jiasenwu, we decided to take a different approach to checking TPU availability in #6767.

@kaushikb11 kaushikb11 closed this Apr 1, 2021
@jiasenwu
Author

jiasenwu commented Apr 1, 2021

Hi @jiasenwu, we decided to take a different approach to checking TPU availability in #6767.

Cool, it is a really good solution with minimal code change!

@tchaton
Contributor

tchaton commented Apr 1, 2021

Dear @jiasenwu,

Would you mind joining the PyTorch Lightning Slack channel? We would really appreciate it if you could help us make sure TPUs work reliably in PyTorch Lightning.

Would you mind trying out this branch: https://github.com/PyTorchLightning/pytorch-lightning/pull/6781/files on a v2-32?

Best,
T.C

@pierric

pierric commented Apr 1, 2021

Dear @jiasenwu,

Would you mind joining the PyTorch Lightning Slack channel? We would really appreciate it if you could help us make sure TPUs work reliably in PyTorch Lightning.

Would you mind trying out this branch: https://github.com/PyTorchLightning/pytorch-lightning/pull/6781/files on a v2-32?

Best,
T.C

Awesome, happy to join. I will be on holiday for a few days, but I should be able to find a slot to test it next Tuesday! Thank you very much :D

Development

Successfully merging this pull request may close these issues.

No TPU devices were found in a TPU pod env.
4 participants