No TPU Devices found for TPU Pod Training, even after recognizing it #7197

Closed
kaushikb11 opened this issue Apr 23, 2021 · 4 comments · Fixed by #7243
@kaushikb11
Contributor

🐛 Bug

No TPU Devices found for TPU Pod Training, even after recognizing it.

2021-04-23 18:14:43 10.164.0.29 [3]   File "/home/kaushikbokka/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 316, in __init__
2021-04-23 18:14:43 10.164.0.29 [3]     replace_sampler_ddp, deterministic, precision, amp_backend, amp_level, plugins
2021-04-23 18:14:43 10.164.0.29 [3]   File "/home/kaushikbokka/pytorch-lightning/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 97, in __init__
2021-04-23 18:14:43 10.164.0.29 [3]     self.tpu_cores = device_parser.parse_tpu_cores(tpu_cores)
2021-04-23 18:14:43 10.164.0.29 [3]   File "/home/kaushikbokka/pytorch-lightning/pytorch_lightning/utilities/device_parser.py", line 104, in parse_tpu_cores
2021-04-23 18:14:43 10.164.0.29 [3]     raise MisconfigurationException('No TPU devices were found.')
2021-04-23 18:14:43 10.164.0.29 [3] pytorch_lightning.utilities.exceptions.MisconfigurationException: No TPU devices were found.
2021-04-23 18:14:46 10.164.0.27 [0] GPU available: False, used: False
2021-04-23 18:14:46 10.164.0.27 [0] TPU available: True, using: 8 TPU cores

TPU Pod v2-32, running bug_report_model.py on 4 VMs with 8 TPU cores each. As the logs above show, VM [0] reports TPU available: True, using: 8 TPU cores, while VM [3] raises the MisconfigurationException.
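For reference, a minimal sketch of the kind of script being run (a stripped-down BoringModel-style repro; the actual contents of bug_report_model.py may differ):

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Random tensors, just enough to drive the training loop."""

    def __init__(self, size: int = 32, length: int = 64):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    train_loader = DataLoader(RandomDataset(), batch_size=8)
    # tpu_cores=8 is what parse_tpu_cores() rejects on the non-master VMs
    # with "No TPU devices were found." (see the traceback above).
    trainer = pl.Trainer(tpu_cores=8, max_epochs=1)
    trainer.fit(model, train_loader)

The same script is launched on each of the 4 VMs of the pod; only the master VM gets past the tpu_cores check.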

@kaushikb11 kaushikb11 added bug Something isn't working help wanted Open to be worked on labels Apr 23, 2021
@kaushikb11 kaushikb11 self-assigned this Apr 23, 2021
@Borda Borda added priority: 0 High priority task accelerator: tpu Tensor Processing Unit labels Apr 26, 2021
@Borda
Member

Borda commented Apr 26, 2021

@kaushikb11 is this on master or on a past v1.2.x release?

@kaushikb11
Contributor Author

@Borda It happens on both. Single Cloud TPU (v2-8 or v3-8) + VM works well and is supported. I don't think Cloud TPU Pods have been supported well enough. Ref: pytorch-lightning.readthedocs.io/en/stable/advanced/tpu.html#tpu-pod, cloud.google.com/tpu/docs/tutorials/pytorch-pod.

@edenlightning edenlightning added this to the v1.3 milestone Apr 27, 2021
@kaushikb11
Contributor Author

On non-master VMs

Traceback (most recent call last):
  File "/home/kaushikbokka/pytorch-lightning/pytorch_lightning/utilities/xla_device.py", line 31, in inner_f
    queue.put(func(*args, **kwargs))
  File "/home/kaushikbokka/pytorch-lightning/pytorch_lightning/utilities/xla_device.py", line 70, in _is_device_tpu
    return len(xm.get_xla_supported_devices("TPU")) > 0
  File "/anaconda3/envs/torch-xla-1.8/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 136, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/anaconda3/envs/torch-xla-1.8/lib/python3.6/site-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/anaconda3/envs/torch-xla-1.8/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 18, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:258 : Check failed: default_device_target != options_.global_device_map.end()
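For context, the call that fails is Lightning's TPU availability probe, roughly the following (a simplified sketch based on the traceback; the real xla_device.py runs this in a separate process via inner_f and a queue):

import torch_xla.core.xla_model as xm

def _is_device_tpu() -> bool:
    # Resolving the XLA device list goes through XRT's global device map;
    # on the pod's non-master VMs that map is not populated at this point,
    # which is what trips the "Check failed: default_device_target" error above.
    return len(xm.get_xla_supported_devices("TPU")) > 0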

@kaushikb11
Contributor Author

Some start...

[Screenshot: Screen Shot 2021-04-28 at 4:51:26 AM]
