
'No TPU devices were found' continues to exist for v2-32. #6849

Closed
jiasenwu opened this issue Apr 6, 2021 · 6 comments
Labels: 3rd party (Related to a 3rd-party), accelerator: tpu (Tensor Processing Unit), bug (Something isn't working), help wanted (Open to be worked on), priority: 1 (Medium priority task)

Comments


jiasenwu commented Apr 6, 2021

🐛 Bug

The error is similar to the one previously described in #6778. I am running the check code with the pytorch-lightning master branch.

All three worker (non-master) VMs show the same exception.

Traceback (most recent call last):
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device.py", line 31, in inner_f
    queue.put(func(*args, **kwargs))
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device.py", line 70, in _is_device_tpu
    return len(xm.get_xla_supported_devices("TPU")) > 0
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 136, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 18, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:258 : Check failed: default_device_target != options_.global_device_map.end() 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace()
	xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >, xla::XrtLocalService*)
	xla::ComputationClient::Create()
	
	
	xla::ComputationClient::Get()
	
	
	_PyCFunction_FastCallDict
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	PyEval_EvalCodeEx
	
	PyObject_Call
	
	_PyObject_GenericGetAttrWithDict
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	PyEval_EvalCodeEx
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	PyEval_EvalCodeEx
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallDict
	_PyObject_FastCallDict
	_PyObject_Call_Prepend
	PyObject_Call
	
	
	_PyObject_FastCallDict
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	PyEval_EvalCodeEx
	PyEval_EvalCode
	
	PyCFunction_Call
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallDict
	_PyObject_FastCallDict
	_PyObject_CallMethodIdObjArgs
	PyImport_ImportModuleLevelObject
	_PyEval_EvalFrameDefault
	PyEval_EvalCodeEx
	PyEval_EvalCode
	
	PyCFunction_Call
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallDict
	_PyObject_FastCallDict
	_PyObject_CallMethodIdObjArgs
	PyImport_ImportModuleLevelObject
	_PyEval_EvalFrameDefault
	PyEval_EvalCodeEx
	PyEval_EvalCode
	
	PyCFunction_Call
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
*** End stack trace ***

/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: ModelCheckpoint(save_last=True, save_top_k=None, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
  warnings.warn(*args, **kwargs)
Traceback (most recent call last):
  File "play.py", line 119, in <module>
    main()
  File "play.py", line 102, in main
    checkpoint_callback=checkpointer,
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 40, in insert_env_defaults
    return fn(self, **kwargs)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 307, in __init__
    replace_sampler_ddp, deterministic, precision, amp_backend, amp_level, plugins
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 97, in __init__
    self.tpu_cores = device_parser.parse_tpu_cores(tpu_cores)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 113, in parse_tpu_cores
    raise MisconfigurationException('No TPU devices were found.')
pytorch_lightning.utilities.exceptions.MisconfigurationException: No TPU devices were found.

The master node looks ok:

GPU available: False, used: False
TPU available: True, using: 8 TPU cores

To Reproduce

Same as #6778.

@jiasenwu added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Apr 6, 2021
@kaushikb11
Contributor

@jiasenwu Did you try it on Colab or a GCP VM?
Also, did you export the environment variables required to connect to the TPU (TPU_IP_ADDRESS and XRT_TPU_CONFIG)?
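
(For a single, non-pod TPU node, the setup is roughly the two exports below before launching the script; the IP is a placeholder for the TPU node's internal address. For a pod, xla_dist is expected to set this up on each worker instead.)

export TPU_IP_ADDRESS=10.0.0.2   # placeholder: your TPU node's internal IP
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"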

@jiasenwu
Author

jiasenwu commented Apr 7, 2021

> @jiasenwu Did you try it on Colab or a GCP VM?
> Also, did you export the environment variables required to connect to the TPU (TPU_IP_ADDRESS and XRT_TPU_CONFIG)?

On a GCP VM. I ran the following command on one instance:

python -m torch_xla.distributed.xla_dist --tpu=node-1 --docker-image=gcr.io/tpu-pytorch/xla:r1.8 \
    --docker-run-flag=--rm=true \
    --docker-run-flag=--shm-size=16GB \
    --docker-run-flag=-v \
    --docker-run-flag=/home/jiasen/cvml_projects/tpu/torch/nfnet:/app \
    --docker-run-flag=-w \
    --docker-run-flag=/app \
    --env=XLA_USE_BF16=1 \
    -- bash -c "pip install git+https://github.com/PyTorchLightning/pytorch-lightning timm && python play.py"

xla_dist should have handled all the environment-variable setup.
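
To double-check that (assuming xla_dist exports the XRT_* configuration into each container), one option would be to run the same launcher but only print the XRT-related environment variables on every worker:

python -m torch_xla.distributed.xla_dist --tpu=node-1 --docker-image=gcr.io/tpu-pytorch/xla:r1.8 \
    --docker-run-flag=--rm=true \
    -- bash -c "env | grep ^XRT"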

@jiasenwu
Author

jiasenwu commented Apr 7, 2021

> @jiasenwu Did you try it on Colab or a GCP VM?
> Also, did you export the environment variables required to connect to the TPU (TPU_IP_ADDRESS and XRT_TPU_CONFIG)?

Does it work well on Colab?

@kaushikb11
Contributor

It works well on both Colab and GCP VMs. I haven't experimented with python -m torch_xla.distributed.xla_dist yet. Could you spin up a VM and try it inside of it? Ref: https://cloud.google.com/tpu/docs/tutorials/pytorch-pod

Also, could you test it with the PyTorch XLA test script, to determine whether this is a configuration issue or a Lightning issue? Ref: https://github.com/pytorch/xla#start-distributed-training
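
For reference, a sketch of that check with the same docker image as above (the path to the test script inside the image is an assumption; adjust it to wherever the torch_xla test scripts actually live):

python -m torch_xla.distributed.xla_dist --tpu=node-1 --docker-image=gcr.io/tpu-pytorch/xla:r1.8 \
    --docker-run-flag=--rm=true \
    --docker-run-flag=--shm-size=16GB \
    -- python /pytorch/xla/test/test_train_mp_imagenet.py --fake_data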

@jiasenwu
Author

jiasenwu commented Apr 7, 2021

> It works well on both Colab and GCP VMs. I haven't experimented with python -m torch_xla.distributed.xla_dist yet. Could you spin up a VM and try it inside of it? Ref: https://cloud.google.com/tpu/docs/tutorials/pytorch-pod
>
> Also, could you test it with the PyTorch XLA test script, to determine whether this is a configuration issue or a Lightning issue? Ref: https://github.com/pytorch/xla#start-distributed-training

AFAIK, xla_dist is still the recommended way to do pod training. I don't know how it could be done within Colab.

Besides, I have run quite a few successful jobs with xla_dist, including the official XLA test scripts and our production model in a v2-32/v3-32 environment (but I needed to force the _TPU_AVAILABLE flag to be true, as I explained in #6778).

I still believe the TPU-detection logic is insufficient, though torch_xla may share some of the blame, since the preconditions for calling xm.get_xla_supported_devices are completely unclear.
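
Per the traceback above, the check that fails on the worker VMs boils down to this single call, executed in a helper process before any per-process ordinal is set up; running it inside one of the xla_dist containers should reproduce the same Check failed error:

python -c "import torch_xla.core.xla_model as xm; print(xm.get_xla_supported_devices('TPU'))"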

@Borda added the 3rd party (Related to a 3rd-party), priority: 1 (Medium priority task), and accelerator: tpu (Tensor Processing Unit) labels on Apr 12, 2021
@kaushikb11
Contributor

This has been resolved on master by #7243. Thank you. Feel free to reopen if you face the issue again.
