
Distributed training problem #2

Open
julycetc opened this issue Jan 3, 2020 · 1 comment
Labels
question Further information is requested

Comments


julycetc commented Jan 3, 2020

You use NCCL for distributed training. My question is: do you use the NCCL that ships with PyTorch, or do you install NCCL separately? And how do you set your environment variables? I am quite confused about it. Thanks very much! I encounter the following problem when I use two machines to run the code.

1. INFO NET/Plugin : No plugin found (libnccl-net.so)
2. NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:400, unhandled cuda error
3. NCCL INFO NET/IB : No device found
meijieru (Owner) commented Jan 3, 2020

1. Actually, we use Docker within a cloud environment. The Docker image itself is a self-compiled PyTorch environment with NCCL installed, so I am not sure how to install it manually. Maybe you could refer to the official documentation from NVIDIA. Sorry for the inconvenience.
2. I have listed the environment variables used in the code in the README.md.
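For anyone hitting the same errors: NCCL settings are ordinarily passed as environment variables before the process group is initialized. A minimal sketch follows; the interface name, addresses, and values are illustrative assumptions for a two-node run, not this repository's actual settings, so adjust them to your machines.

```python
import os

# NCCL diagnostic/tuning variables (the names are standard NCCL settings;
# the values are assumptions for a hypothetical two-node setup).
os.environ["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging to diagnose failures
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # NIC to use; replace with your interface
os.environ["NCCL_IB_DISABLE"] = "1"        # skip InfiniBand if "NET/IB : No device found"

# Rendezvous settings read by torch.distributed (placeholder address/port).
os.environ["MASTER_ADDR"] = "192.168.1.10"
os.environ["MASTER_PORT"] = "29500"

# With these set, each process would then initialize the group, e.g.:
# torch.distributed.init_process_group(backend="nccl", rank=rank, world_size=2)
```

Setting `NCCL_DEBUG=INFO` is usually the fastest way to see which transport NCCL is trying to use and why it fails on a given interface.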

@meijieru meijieru added the question Further information is requested label Apr 30, 2020