
Add distributed multi-node cpu only support (MULTI_CPU) #63

Merged: 1 commit into huggingface:main from ddkalamk:multi_cpu, May 4, 2021

Conversation

@ddkalamk (Contributor) commented on May 3, 2021:

Possible fix for #32

@sgugger (Collaborator) left a comment:


Thanks a lot for digging into this! It looks great and with a little bit of polishing (mostly naming), it should be ready to be merged soon!

You should add a line for multi-node CPU to the list at the end of the main README. Also, it looks like your integration supports several types of launchers (judging from the number of env variables it checks). Could you document this somewhere?
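
For reference, launcher detection along these lines usually means probing the environment variables each launcher sets. A minimal Python sketch of the idea (the helper name `get_int_from_env` and the exact variable list are assumptions here, not necessarily what the PR merged):

```python
import os

def get_int_from_env(env_keys, default):
    """Return the first non-negative integer found among the given env vars."""
    for key in env_keys:
        value = int(os.environ.get(key, -1))
        if value >= 0:
            return value
    return default

# Different launchers publish rank and world size under different names:
# torch.distributed uses RANK/WORLD_SIZE, Intel MPI and MPICH use
# PMI_RANK/PMI_SIZE, and Open MPI uses OMPI_COMM_WORLD_RANK/OMPI_COMM_WORLD_SIZE.
rank = get_int_from_env(["RANK", "PMI_RANK", "OMPI_COMM_WORLD_RANK"], 0)
world_size = get_int_from_env(["WORLD_SIZE", "PMI_SIZE", "OMPI_COMM_WORLD_SIZE"], 1)
```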

src/accelerate/accelerator.py (outdated; resolved)
src/accelerate/commands/config/cluster.py (resolved)
src/accelerate/state.py (outdated; resolved)
src/accelerate/state.py (outdated; resolved)
Comment on lines 174 to 179 of src/accelerate/state.py:

```python
print("Warning: Looks like distributed multinode run but MASTER_ADDR env not set, using '127.0.0.1' as default")
print("If this run hangs, try exporting rank 0's hostname as MASTER_ADDR")
```
@sgugger (Collaborator):

I think we should raise a ValueError here telling the user to pass MASTER_ADDR instead of relying on a default.

@ddkalamk (Contributor, Author):

Some MPI-like custom backends can initialize without MASTER_ADDR, so I just put a warning. I can change it to a ValueError if you think that's a better choice.

@sgugger (Collaborator):

We are inside a test where backend != "mpi", so I would adapt the error message (to say this backend needs MASTER_ADDR) and raise an error. For MPI, it will still work without MASTER_ADDR.

@ddkalamk (Contributor, Author):

Made the change.
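
The agreed-upon behavior, then: MPI may initialize without a master address, while every other backend should fail fast. A hypothetical sketch of such a check (names and message are illustrative, not the exact merged code):

```python
import os

def resolve_master_addr(backend: str) -> str:
    """Require MASTER_ADDR for non-MPI backends; MPI can rendezvous without it."""
    master_addr = os.environ.get("MASTER_ADDR")
    if master_addr is None:
        if backend != "mpi":
            raise ValueError(
                f"Backend '{backend}' requires MASTER_ADDR to be set. "
                "Please export rank 0's hostname as MASTER_ADDR before launching."
            )
        # MPI does its own rendezvous, so a placeholder address is harmless here.
        master_addr = "127.0.0.1"
    return master_addr
```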

src/accelerate/state.py (outdated; resolved)
src/accelerate/state.py (outdated; resolved)
@ddkalamk force-pushed the multi_cpu branch 2 times, most recently from bf91de5 to 276bbb8 (May 3, 2021, 17:00)
@ddkalamk (Contributor, Author) commented on May 3, 2021:

I have made the suggested changes. Please take a look.
Thanks.

@sgugger (Collaborator) left a comment:

Thanks for all the adjustments! Two more little things and we should be good to merge.

On your cluster just run:

```bash
mpirun -np 2 python examples/nlp_example.py
```
@sgugger (Collaborator):

Does this require installing some library? Let's add a how-to!

@ddkalamk (Contributor, Author):

There are multiple options, e.g. Intel MPI, OpenMPI, or MVAPICH. I assume typical HPC clusters would have some version available, or users can easily install one in userspace. Example instructions for Open-MPI are here.

@ddkalamk (Contributor, Author):

Ok, added a line on how to get OpenMPI...
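
As a quick sanity check that an MPI installation works before launching a full training run, something like the following can help (a sketch assuming mpi4py is installed; it is not required by the PR itself):

```python
# check_mpi.py -- run with: mpirun -np 2 python check_mpi.py
# Requires mpi4py (pip install mpi4py) on top of an MPI implementation.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"Hello from rank {comm.Get_rank()} of {comm.Get_size()}")
```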

src/accelerate/state.py (outdated; resolved)
@sgugger (Collaborator) commented on May 3, 2021:

Last step: could you run `make style` on your branch? This should take care of the failing test.
Let me know if you run into any trouble and I can push it to the PR instead.

@ddkalamk (Contributor, Author) commented on May 3, 2021:

Sorry, my system doesn't have black, so make gave an error:

```text
black tests src examples
make: black: Command not found
make: *** [style] Error 127
```

Can you please go ahead and push it to the PR?
Thanks.

@ddkalamk (Contributor, Author) commented on May 3, 2021:

Just installed black and fixed the formatting... It is still giving an error about the torch_ccl import, but I don't know how to fix it...

src/accelerate/state.py (outdated; resolved)
@ddkalamk (Contributor, Author) commented on May 4, 2021:

Fixed the flake8 issue and squashed to a single commit... Should be good to merge now.
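
For reference, a common way to keep flake8 happy with an import that exists only for its side effect is to guard it and mark it with `# noqa`; a sketch along those lines (the flag name `ccl_available` is an assumption):

```python
# torch_ccl registers the "ccl" backend with torch.distributed when imported,
# so the import matters even though the name looks unused to flake8.
try:
    import torch_ccl  # noqa: F401
    ccl_available = True
except ImportError:
    ccl_available = False
```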

@sgugger merged commit df260fa into huggingface:main on May 4, 2021.
@sgugger (Collaborator) commented on May 4, 2021:

Thanks again for all your work on this!

@ddkalamk deleted the multi_cpu branch on May 4, 2021, 14:38.
@sgugger mentioned this pull request on May 5, 2021.