Add distributed multi-node cpu only support (MULTI_CPU) #63
Conversation
Thanks a lot for digging into this! It looks great and with a little bit of polishing (mostly naming), it should be ready to be merged soon!
You should add a line in the main README, in the list at the end, with `- multinode CPU`! Also it looks like your suggested integration supports different types of launchers (from the number of env variables looked at). Could you document this somewhere?
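(For context, launcher detection along these lines usually means reading the rank and world size from whichever environment variables the launcher sets. A minimal sketch; the helper name and the exact variable list are assumptions, not necessarily what this PR uses:)

```python
import os

def get_int_from_env(names, default):
    # Return the first of `names` found in the environment, parsed as an int.
    for name in names:
        if name in os.environ:
            return int(os.environ[name])
    return default

# Different launchers expose the same information under different variables:
# torch.distributed.launch sets RANK/WORLD_SIZE, Intel MPI (PMI) sets PMI_RANK/PMI_SIZE,
# and OpenMPI's mpirun sets OMPI_COMM_WORLD_RANK/OMPI_COMM_WORLD_SIZE.
rank = get_int_from_env(["RANK", "PMI_RANK", "OMPI_COMM_WORLD_RANK"], 0)
world_size = get_int_from_env(["WORLD_SIZE", "PMI_SIZE", "OMPI_COMM_WORLD_SIZE"], 1)
```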
src/accelerate/state.py
print("Warning: Looks like distributed multinode run but MASTER_ADDR env not set, using '127.0.0.1' as default") | ||
print("If this run hangs, try exporting rank 0's hostname as MASTER_ADDR") |
I think we should raise a ValueError here telling the user to pass a MASTER_ADDR, instead of relying on a default.
Some MPI-like custom backends can initialize without MASTER_ADDR so I just put a warning. I can change to ValueError if you think that's a better choice.
We are in a test where backend != "mpi", so I would adapt the error message (to say the backend needs a MASTER_ADDR) and raise an error. For MPI, it will still work without the MASTER_ADDR.
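(A minimal sketch of what such a check could look like; the `backend` variable and the exact wording are assumptions, not necessarily the merged code:)

```python
import os

backend = "gloo"  # hypothetical: whichever backend was selected for the multi-CPU run
if backend != "mpi" and "MASTER_ADDR" not in os.environ:
    # Non-MPI backends need a rendezvous address, so fail loudly instead of
    # silently defaulting to 127.0.0.1 and hanging on multi-node runs.
    raise ValueError(
        f"Distributed multi-node runs with the '{backend}' backend require MASTER_ADDR "
        "to be set; please export rank 0's hostname as MASTER_ADDR and relaunch."
    )
```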
Made the change.
bf91de5 to 276bbb8
I have made the suggested changes. Please take a look.
Thanks for all the adjustments, two more little things and we should be good to merge!
On your cluster just run:

```bash
mpirun -np 2 python examples/nlp_example.py
```
Does this require an install of some library? Let's add how to!
Ok, added a line on how to get OpenMPI...
Last step, could you run
Sorry, my system doesn't have black; make gave an error. Can you please go ahead and push it to the PR?
Just installed black and fixed the formatting... It is still giving an error about the torch_ccl import, but I don't know how to fix it...
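(A guarded import with a `# noqa` marker is one common way to keep an import that is only needed for its side effects while satisfying flake8; a sketch, assuming that is what the linter flagged, and not necessarily the fix that was pushed:)

```python
# torch_ccl is imported only for its side effect of registering the CCL backend,
# so flake8 flags it as unused; "# noqa: F401" marks that as intentional.
try:
    import torch_ccl  # noqa: F401
except ImportError:
    torch_ccl = None
```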
Fixed issue with flake8 and squashed to single commit... Should be good to merge now.
Thanks again for all your work on this!
Possible fix for #32