
Distributed training problem #2

Open
julycetc opened this issue Jan 3, 2020 · 1 comment
Labels
question Further information is requested

Comments


julycetc commented Jan 3, 2020

You use NCCL for distributed training. My question is: do you use the NCCL that ships with PyTorch, or do you install NCCL separately? And how do you set your environment variables? I am quite confused about it. Thanks very much! I encounter the following problem when I use two machines to run the code.

1. INFO NET/Plugin : No plugin found (libnccl-net.so)
2. NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:400, unhandled cuda error
3. NCCL INFO NET/IB : No device found
meijieru (Owner) commented Jan 3, 2020

1. Actually, we use Docker within a cloud environment. The Docker image itself is a self-compiled PyTorch environment with NCCL installed, so I am not sure how to install it manually. Maybe you could refer to the official documentation from NVIDIA. Sorry for the inconvenience.
2. I have listed the environment variables used in the code in the README.md.
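For anyone hitting the same errors: NCCL settings are ordinarily passed as environment variables before the process group is initialized. A minimal sketch follows; the interface name, addresses, and values are illustrative assumptions for a two-node run, not this repository's actual settings, so adjust them to your machines.

```python
import os

# NCCL diagnostic/tuning variables (the names are standard NCCL settings;
# the values are assumptions for a hypothetical two-node setup).
os.environ["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging to diagnose failures
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # NIC to use; replace with your interface
os.environ["NCCL_IB_DISABLE"] = "1"        # skip InfiniBand if "NET/IB : No device found"

# Rendezvous settings read by torch.distributed (placeholder address/port).
os.environ["MASTER_ADDR"] = "192.168.1.10"
os.environ["MASTER_PORT"] = "29500"

# With these set, each process would then initialize the group, e.g.:
# torch.distributed.init_process_group(backend="nccl", rank=rank, world_size=2)
```

Setting `NCCL_DEBUG=INFO` is usually the fastest way to see which transport NCCL is trying to use and why it fails on a given interface.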

@meijieru meijieru added the question Further information is requested label Apr 30, 2020