
'No TPU devices were found' continues to exist for v2-32. #6849

Closed
jiasenwu opened this issue Apr 6, 2021 · 6 comments
Labels: 3rd party (Related to a 3rd-party), accelerator: tpu (Tensor Processing Unit), bug (Something isn't working), help wanted (Open to be worked on), priority: 1 (Medium priority task)

Comments


jiasenwu commented Apr 6, 2021

🐛 Bug

The error is similar to the one previously described in #6778. I am running the check code with the pytorch-lightning master branch.

All three worker (non-master) VMs show the same exception.

Traceback (most recent call last):
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device.py", line 31, in inner_f
    queue.put(func(*args, **kwargs))
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device.py", line 70, in _is_device_tpu
    return len(xm.get_xla_supported_devices("TPU")) > 0
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 136, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 18, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:258 : Check failed: default_device_target != options_.global_device_map.end() 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace()
	xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >, xla::XrtLocalService*)
	xla::ComputationClient::Create()
	
	
	xla::ComputationClient::Get()
	
	
	_PyCFunction_FastCallDict
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	PyEval_EvalCodeEx
	
	PyObject_Call
	
	_PyObject_GenericGetAttrWithDict
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	PyEval_EvalCodeEx
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	PyEval_EvalCodeEx
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallDict
	_PyObject_FastCallDict
	_PyObject_Call_Prepend
	PyObject_Call
	
	
	_PyObject_FastCallDict
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	PyEval_EvalCodeEx
	PyEval_EvalCode
	
	PyCFunction_Call
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallDict
	_PyObject_FastCallDict
	_PyObject_CallMethodIdObjArgs
	PyImport_ImportModuleLevelObject
	_PyEval_EvalFrameDefault
	PyEval_EvalCodeEx
	PyEval_EvalCode
	
	PyCFunction_Call
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallDict
	_PyObject_FastCallDict
	_PyObject_CallMethodIdObjArgs
	PyImport_ImportModuleLevelObject
	_PyEval_EvalFrameDefault
	PyEval_EvalCodeEx
	PyEval_EvalCode
	
	PyCFunction_Call
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
*** End stack trace ***

/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: ModelCheckpoint(save_last=True, save_top_k=None, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
  warnings.warn(*args, **kwargs)
Traceback (most recent call last):
  File "play.py", line 119, in <module>
    main()
  File "play.py", line 102, in main
    checkpoint_callback=checkpointer,
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 40, in insert_env_defaults
    return fn(self, **kwargs)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 307, in __init__
    replace_sampler_ddp, deterministic, precision, amp_backend, amp_level, plugins
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 97, in __init__
    self.tpu_cores = device_parser.parse_tpu_cores(tpu_cores)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 113, in parse_tpu_cores
    raise MisconfigurationException('No TPU devices were found.')
pytorch_lightning.utilities.exceptions.MisconfigurationException: No TPU devices were found.

The master node looks ok:

GPU available: False, used: False
TPU available: True, using: 8 TPU cores

To Reproduce

Same as #6778.

@jiasenwu added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Apr 6, 2021
@kaushikb11
Contributor

@jiasenwu Did you try it on Colab or a GCP VM?
Also, did you export the environment variables required to connect to the TPU (TPU_IP_ADDRESS and XRT_TPU_CONFIG)?
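
(For a single, non-pod TPU node, the setup is roughly the two exports below before launching the script; the IP is a placeholder for the TPU node's internal address. For a pod, xla_dist is expected to set this up on each worker instead.)

export TPU_IP_ADDRESS=10.0.0.2   # placeholder: your TPU node's internal IP
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"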

@jiasenwu
Author

jiasenwu commented Apr 7, 2021

> @jiasenwu Did you try it on Colab or a GCP VM?
> Also, did you export the environment variables required to connect to the TPU (TPU_IP_ADDRESS and XRT_TPU_CONFIG)?

On a GCP VM. I ran the following command on one instance:

python -m torch_xla.distributed.xla_dist --tpu=node-1 --docker-image=gcr.io/tpu-pytorch/xla:r1.8 \
    --docker-run-flag=--rm=true \
    --docker-run-flag=--shm-size=16GB \
    --docker-run-flag=-v \
    --docker-run-flag=/home/jiasen/cvml_projects/tpu/torch/nfnet:/app \
    --docker-run-flag=-w \
    --docker-run-flag=/app \
    --env=XLA_USE_BF16=1 \
    -- bash -c "pip install git+https://github.com/PyTorchLightning/pytorch-lightning timm && python play.py"

xla_dist should have handled all the environment-variable setup.
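
To double-check that (assuming xla_dist exports the XRT_* configuration into each container), one option would be to run the same launcher but only print the XRT-related environment variables on every worker:

python -m torch_xla.distributed.xla_dist --tpu=node-1 --docker-image=gcr.io/tpu-pytorch/xla:r1.8 \
    --docker-run-flag=--rm=true \
    -- bash -c "env | grep ^XRT"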

@jiasenwu
Author

jiasenwu commented Apr 7, 2021

> @jiasenwu Did you try it on Colab or a GCP VM?
> Also, did you export the environment variables required to connect to the TPU (TPU_IP_ADDRESS and XRT_TPU_CONFIG)?

Does it work well on Colab?

@kaushikb11
Contributor

It works well on both Colab and GCP VMs. I haven't experimented with python -m torch_xla.distributed.xla_dist yet. Could you spin up a VM and try it inside of it? Ref: https://cloud.google.com/tpu/docs/tutorials/pytorch-pod

Also, could you test it with the PyTorch XLA test script, to determine whether this is a configuration issue or a Lightning issue? Ref: https://github.com/pytorch/xla#start-distributed-training
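
For reference, a sketch of that check with the same docker image as above (the path to the test script inside the image is an assumption; adjust it to wherever the torch_xla test scripts actually live):

python -m torch_xla.distributed.xla_dist --tpu=node-1 --docker-image=gcr.io/tpu-pytorch/xla:r1.8 \
    --docker-run-flag=--rm=true \
    --docker-run-flag=--shm-size=16GB \
    -- python /pytorch/xla/test/test_train_mp_imagenet.py --fake_data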

@jiasenwu
Author

jiasenwu commented Apr 7, 2021

> It works well on both Colab and GCP VMs. I haven't experimented with python -m torch_xla.distributed.xla_dist yet. Could you spin up a VM and try it inside of it? Ref: https://cloud.google.com/tpu/docs/tutorials/pytorch-pod
>
> Also, could you test it with the PyTorch XLA test script, to determine whether this is a configuration issue or a Lightning issue? Ref: https://github.com/pytorch/xla#start-distributed-training

AFAIK, xla_dist is still the recommended way to do pod training. I don't know how it could be done within Colab.

Besides, I have run quite a few successful jobs with xla_dist, including the official XLA test scripts and our production model in a v2-32/v3-32 environment (but I needed to force the _TPU_AVAILABLE flag to be true, as I explained in #6778).

I still believe the TPU-detection logic is insufficient, though torch_xla may share some of the blame, since the preconditions for calling xm.get_xla_supported_devices are completely unclear.
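
Per the traceback above, the check that fails on the worker VMs boils down to this single call, executed in a helper process before any per-process ordinal is set up; running it inside one of the xla_dist containers should reproduce the same Check failed error:

python -c "import torch_xla.core.xla_model as xm; print(xm.get_xla_supported_devices('TPU'))"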

@Borda added the 3rd party (Related to a 3rd-party), priority: 1 (Medium priority task), and accelerator: tpu (Tensor Processing Unit) labels on Apr 12, 2021
@kaushikb11
Contributor

This has been resolved on master by #7243. Thank you. Feel free to reopen if you face the issue again.
