NCCL error when using ddp with 2 gpus #3865
Comments
Hi! Thanks for your contribution! Great first issue!
Mind upgrading to 1.0.2? And try to reproduce using this model: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report_model.py
@kekeblom How do you launch your script? How do I reproduce it with the bug report template?
That should be fine; ddp launches one process per GPU.
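To illustrate the "one process per GPU" point, here is a minimal sketch of the per-process environment that a single-node launch typically sets up. It is not Lightning's actual launcher code: the helper name `ddp_process_envs` and the default address and port are hypothetical, and it assumes the `env://` style initialization of `torch.distributed`, which reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK`.

```python
def ddp_process_envs(num_gpus, master_addr="127.0.0.1", master_port=29500):
    """Build one environment dict per GPU for a single-node DDP launch.

    Hypothetical helper for illustration only. On a single node the global
    rank and the local rank coincide, so both are set to the GPU index.
    """
    envs = []
    for local_rank in range(num_gpus):
        envs.append({
            "MASTER_ADDR": master_addr,        # rendezvous host shared by all processes
            "MASTER_PORT": str(master_port),   # rendezvous port shared by all processes
            "WORLD_SIZE": str(num_gpus),       # total number of processes
            "RANK": str(local_rank),           # global rank (equals local rank on one node)
            "LOCAL_RANK": str(local_rank),     # index of the GPU this process should use
        })
    return envs


if __name__ == "__main__":
    for env in ddp_process_envs(2):
        print(env)
```

With 2 GPUs this yields two environments that differ only in `RANK` and `LOCAL_RANK`, so a scheduler that locks each GPU to one process should not conflict with it.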
Actually, I haven't seen this happen again, so I can't reproduce it. It might have been the version, or it is somehow related to the GPU that gets scheduled. Maybe it's worth closing the issue. I'll let you know if I re-encounter the problem.
Speak of the devil. @awaelchli Tried it on 1.0.3 with 2 gpus. Modified the It seems it only happens when running on a machine with GTX 1080 gpus. Machines with GTX 1080 Ti or RTX2080 do not appear to suffer from this issue. Here is the output of
Here is the stack trace:
🐛 Bug
I am trying to run PyTorch Lightning using ddp with 2 GPUs. Running with one GPU works fine. The same error occurs with and without fp16. See the stack trace at the end of the post for the error. I also tried ddp2 and dp, but both of those fail with a different error.
To Reproduce
Not sure. Let me know what I can do to diagnose.
I'm running my code on a cluster where each gpu is locked to one process. I'm using NCCL version 2.4.8.
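One way to gather more diagnostics is to rerun the job with NCCL's own logging turned on. The sketch below is a suggestion and was not part of the original report: `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are NCCL environment variables, while the script name `train.py` and the helper names are hypothetical placeholders. Disabling peer-to-peer transfers is only an experiment to narrow things down, on the unverified assumption that the P2P path could be involved.

```python
import os
import subprocess
import sys


def nccl_debug_env(base_env=None, disable_p2p=False):
    """Return a copy of the environment with NCCL logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # ask NCCL to log its version, topology and errors
    if disable_p2p:
        # Assumption: GPU peer-to-peer transfers might be the failing path.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


def run_with_nccl_debug(script="train.py", disable_p2p=False):
    """Launch a (hypothetical) training script with the debug environment."""
    cmd = [sys.executable, script]
    return subprocess.run(cmd, env=nccl_debug_env(disable_p2p=disable_p2p))
```

Calling `run_with_nccl_debug()` once as is and once with `disable_p2p=True` gives two logs to compare, which may help tell a transport problem apart from a version or configuration problem.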
I tried pytorch-lightning versions 0.9.0, 0.9.1rc4, and 0.10.0rc1. All of them result in the same error. I'm running pytorch version 1.6.
Expected behavior
I expected training to start running smoothly using both gpus.
Environment
- GPU:
- GeForce GTX 1080
- GeForce GTX 1080
- available: True
- version: 10.1
- numpy: 1.18.1
- pyTorch_debug: False
- pyTorch_version: 1.6.0
- pytorch-lightning: 0.10.0rc1
- tqdm: 4.46.1
- OS: Linux
- architecture:
- 64bit
-
- processor:
- python: 3.7.7
- version: #1 SMP Tue May 12 16:57:42 UTC 2020
Additional context
Stacktrace and error.