Add distributed multi-node cpu only support (MULTI_CPU) #63
Conversation
Thanks a lot for digging into this! It looks great and with a little bit of polishing (mostly naming), it should be ready to be merged soon!
You should add a line in the main README, in the list at the end, with `- multinode CPU`! Also it looks like your suggested integration supports different types of launchers (from the number of env variables looked at). Could you document this somewhere?
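(For context, launcher detection along these lines usually means reading the rank and world size from whichever environment variables the launcher sets. A minimal sketch; the helper name and the exact variable list are assumptions, not necessarily what this PR uses:)

```python
import os

def get_int_from_env(names, default):
    # Return the first of `names` found in the environment, parsed as an int.
    for name in names:
        if name in os.environ:
            return int(os.environ[name])
    return default

# Different launchers expose the same information under different variables:
# torch.distributed.launch sets RANK/WORLD_SIZE, Intel MPI (PMI) sets PMI_RANK/PMI_SIZE,
# and OpenMPI's mpirun sets OMPI_COMM_WORLD_RANK/OMPI_COMM_WORLD_SIZE.
rank = get_int_from_env(["RANK", "PMI_RANK", "OMPI_COMM_WORLD_RANK"], 0)
world_size = get_int_from_env(["WORLD_SIZE", "PMI_SIZE", "OMPI_COMM_WORLD_SIZE"], 1)
```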
src/accelerate/state.py
print("Warning: Looks like distributed multinode run but MASTER_ADDR env not set, using '127.0.0.1' as default") | ||
print("If this run hangs, try exporting rank 0's hostname as MASTER_ADDR") |
I think we should raise a ValueError here telling the user to pass a MASTER_ADDR, instead of relying on a default.
Some MPI-like custom backends can initialize without MASTER_ADDR so I just put a warning. I can change to ValueError if you think that's a better choice.
We are in a test where backend != "mpi", so I would adapt the error message (to say the backend needs a MASTER_ADDR) and raise an error. For MPI, it will still work without the MASTER_ADDR.
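(A minimal sketch of what such a check could look like; the `backend` variable and the exact wording are assumptions, not necessarily the merged code:)

```python
import os

backend = "gloo"  # hypothetical: whichever backend was selected for the multi-CPU run
if backend != "mpi" and "MASTER_ADDR" not in os.environ:
    # Non-MPI backends need a rendezvous address, so fail loudly instead of
    # silently defaulting to 127.0.0.1 and hanging on multi-node runs.
    raise ValueError(
        f"Distributed multi-node runs with the '{backend}' backend require MASTER_ADDR "
        "to be set; please export rank 0's hostname as MASTER_ADDR and relaunch."
    )
```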
Made the change.
bf91de5 to 276bbb8
I have made the suggested changes. Please take a look.
Thanks for all the adjustments, two more little things and we should be good to merge!
On your cluster just run:

```bash
mpirun -np 2 python examples/nlp_example.py
```
Does this require an install of some library? Let's add how to!
Ok, added a line on how to get OpenMPI...
Last step, could you run
Sorry, my system doesn't have black; make gave an error. Can you please go ahead and push it to the PR?
Just installed black and fixed the formatting... It is still giving an error about the torch_ccl import, but I don't know how to fix it...
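(A guarded import with a `# noqa` marker is one common way to keep an import that is only needed for its side effects while satisfying flake8; a sketch, assuming that is what the linter flagged, and not necessarily the fix that was pushed:)

```python
# torch_ccl is imported only for its side effect of registering the CCL backend,
# so flake8 flags it as unused; "# noqa: F401" marks that as intentional.
try:
    import torch_ccl  # noqa: F401
except ImportError:
    torch_ccl = None
```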
Fixed issue with flake8 and squashed to single commit... Should be good to merge now.
Thanks again for all your work on this!
Possible fix for #32