'No TPU devices were found' continues to exist for v2-32. #6849
🐛 Bug
The error is still similar to the one previously described in #6778; I am running the check code against the pytorch-lightning master branch. All three slave nodes show the same exception, while the master node's output looks OK.

To Reproduce
The same as #6778.

Comments
@jiasenwu Did you try it on Colab or on a GCP VM?
On a GCP VM; I ran a command like the following on one instance:
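(The exact command was not preserved above. A typical xla_dist invocation for a TPU pod, per the pytorch/xla README, looks roughly like the sketch below; the pod name, conda env, and training script are placeholders, not values from this report.)

    # Launch a training script on every worker of a TPU pod via xla_dist.
    # $TPU_POD_NAME, the conda env, and train.py are assumed placeholders.
    python -m torch_xla.distributed.xla_dist \
        --tpu=$TPU_POD_NAME \
        --conda-env=torch-xla-nightly \
        -- python train.py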
The xla_dist launcher should have handled all of the env-var setup.
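(For context, the env-var setup in question includes the XRT TPU config. For a single, non-pod TPU the pytorch/xla README sets it manually, roughly as below; the IP is a placeholder, and on pods xla_dist computes the equivalent per worker.)

    # Manual XRT config for a single TPU, per the pytorch/xla README.
    # 10.0.0.2 is a placeholder for the TPU's internal IP address.
    export TPU_IP_ADDRESS=10.0.0.2
    export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"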
Does it work well in Colab?
It does work well on both Colab and GCP VM. I haven't experimented with […]. Also, could you test it with the PyTorch XLA test script, to check whether it's a config issue or a Lightning issue? Ref: https://github.com/pytorch/xla#start-distributed-training
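(The distributed test referred to there is also launched through xla_dist; per the pytorch/xla README it looks roughly like the following, where the pod name and install path are placeholders that vary by release.)

    # Run the reference ImageNet test across a TPU pod with fake data.
    python -m torch_xla.distributed.xla_dist \
        --tpu=$TPU_POD_NAME \
        --conda-env=torch-xla-nightly \
        -- python /usr/share/torch-xla-nightly/pytorch/xla/test/test_train_mp_imagenet.py --fake_data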
AFAIK, […]. Besides, I have done quite a few successful jobs with […]. I still believe the logic for detecting TPU devices is somewhat insufficient, but maybe xla should be blamed as well, since the precondition of using […].
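(To make the detection point concrete, here is a minimal illustrative sketch of the kind of availability check being discussed. It is not Lightning's actual implementation; it assumes torch_xla is installed and that xm.xla_device() raises when no XLA device is configured.)

    # Illustrative sketch only, not PyTorch Lightning's real detection code.
    import torch_xla.core.xla_model as xm

    def tpu_device_exists() -> bool:
        """Best-effort check for an acquirable XLA (TPU) device."""
        try:
            # xm.xla_device() raises when no XLA device is configured; on a
            # pod worker this can fail if the XRT env vars that xla_dist
            # provides are not yet set in this process -- the scenario
            # behind 'No TPU devices were found'.
            return xm.xla_device().type == "xla"
        except RuntimeError:
            return False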
This has been resolved in master by #7243. Thank you. Feel free to reopen it if you face the issue again.