Add init_system API for multi-host GPU. #8364

Merged · 1 commit into jax-ml:main on Oct 28, 2021

Conversation

zhangqiaorjc (Collaborator) commented on Oct 25, 2021:

Currently, JAX doesn't expose our multi-host GPU backend to our open-source users.

This PR exposes an experimental API, `jax.distributed.initialize`, to initialize the multi-host GPU backend.
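For reference, a minimal sketch of how a user script would call the new API (the address and sizes below are placeholders; run one copy of the script per host):

```python
# Placeholder two-host usage of the new API; run one copy per host,
# changing process_id (0 on the first host, 1 on the second).
import jax

jax.distributed.initialize(
    coordinator_address="10.128.0.47:1456",  # process 0 hosts the coordinator here
    num_processes=2,                          # total number of participating hosts
    process_id=0,                             # this host's index
)

# After initialization, jax.devices() spans the GPUs on all hosts, while
# jax.local_devices() lists only this host's GPUs.
print(jax.process_index(), jax.local_device_count(), jax.device_count())
```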

I tested it on 2 GPU VMs on GCP:

== VM0

```
$ TF_CPP_MIN_LOG_LEVEL=0 python -c "import jax; jax.distributed.initialize('10.128.0.47:1456', 2, 0)"
2021-10-26 01:19:17.133471: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/service.cc:369] Jax service listening on 10.128.0.47:1456
2021-10-26 01:19:28.339404: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:129] Connected to distributed JAX controller
2021-10-26 01:19:28.375758: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:163] Waiting for all distributed JAX tasks to shut down.
2021-10-26 01:19:28.376179: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:180] Distributed task shutdown result: OK
2021-10-26 01:19:28.397722: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/service.cc:381] Jax service shutting down
```

== VM1

```
$ TF_CPP_MIN_LOG_LEVEL=0 python -c "import jax; jax.distributed.initialize('10.128.0.47:1456', 2, 1)"
2021-10-26 01:19:28.339474: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:129] Connected to distributed JAX controller
2021-10-26 01:19:28.371461: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:163] Waiting for all distributed JAX tasks to shut down.
2021-10-26 01:19:28.376473: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:180] Distributed task shutdown result: OK
```

For an example of this API in action, see the following gist, which does model parallelism across 2 GPU VMs:

https://gist.github.com/zhangqiaorjc/0ae6e7114fb0b3e9243e6420e4d6f3e4

See this screenshot for the results:

https://photos.google.com/share/AF1QipMfIpFOpmckl86lU4WS4nb2IzMDkrOqLyafa4C3Vx7zMqoyy6NOM8PiS8gH7zaLIw?key=bjVRWHZoRmFUTkhhLVBOdzFlYWg4bG5nZ3NJYVpB
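The gist is the authoritative example; as a rough illustration of the same idea (not the gist's code), a multi-host reduction with `pmap` might look like:

```python
# Illustrative only, not the gist's code: run one copy per VM, with
# process_id=0 on VM0 and process_id=1 on VM1.
import jax
import jax.numpy as jnp

jax.distributed.initialize("10.128.0.47:1456", 2, 0)  # process_id differs per VM

# Each process feeds only its local devices; the psum over the mapped axis
# reduces across every device on every host.
local_inputs = jnp.ones((jax.local_device_count(),))
global_sum = jax.pmap(lambda x: jax.lax.psum(x, axis_name="i"), axis_name="i")
print(global_sum(local_inputs))  # each entry equals jax.device_count()
```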

Resolved review threads on jax/_src/dist_system.py and jax/dist_system.py (outdated). One thread was attached to this backend-registration code:

```python
client.connect()

factory = functools.partial(xla_client.make_gpu_client, client, node_id)
xla_bridge.register_backend_factory('gpu', factory, priority=300)
```
Collaborator:

You should verify the GPU backend has not already been initialized.

zhangqiaorjc (Collaborator, Author):

xla_bridge.backends() seems to initialize backends only once already.

Collaborator:

Yes, but you still need to ensure that you issue an error if the backend has already been initialized.

zhangqiaorjc (Collaborator, Author):

Added a check; it will now raise a RuntimeError if the GPU backend is already initialized.

zhangqiaorjc (Collaborator, Author):

Also updated xla_bridge to error out if a backend is already registered when we try to register a new one.
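For context, a minimal sketch of the kind of guard being discussed (the names and the module-level flag are illustrative; the real check lives in jax's xla_bridge and distributed modules):

```python
# Hypothetical sketch of the "already initialized" guard; illustrative only.
_gpu_backend_registered = False  # stand-in for xla_bridge's internal backend state

def register_distributed_gpu_backend(factory):
    global _gpu_backend_registered
    if _gpu_backend_registered:
        # Error out instead of silently re-registering, per the review comments.
        raise RuntimeError(
            "GPU backend is already initialized; call jax.distributed.initialize "
            "before running any JAX computations.")
    _gpu_backend_registered = True
    # The actual registration call from the diff above:
    # xla_bridge.register_backend_factory('gpu', factory, priority=300)
```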

@zhangqiaorjc force-pushed the dsys branch 3 times, most recently from c30ec02 to a774a19 on October 26, 2021 at 20:05.
Resolved review threads on jax/_src/distributed.py and CHANGELOG.md (outdated). Another thread was attached to the new API's docstring:

```python
def init_system(coordinator_address: str, num_processes: int, process_id: int):
  """Initialize distributed system for topology discovery.

  init_system is required to set up the runtime for multi-host GPU usage.
```
Collaborator:

Give an example of how to use it and when/why.

zhangqiaorjc (Collaborator, Author):

Added example.

zhangqiaorjc (Collaborator, Author):

Skipping doctest.
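(Since the example requires multiple cooperating hosts, it cannot run under doctest; a hypothetical rendering of the skipped snippet might look like the following.)

```python
>>> import jax                                          # doctest: +SKIP
>>> # Needs a second process running the same call with process_id=1.
>>> jax.distributed.initialize('10.0.0.1:1234', 2, 0)   # doctest: +SKIP
```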

@zhangqiaorjc force-pushed the dsys branch 2 times, most recently from 2cb595b to a0a1559 on October 26, 2021 at 20:57.
@zhangqiaorjc self-assigned this on Oct 26, 2021.
@zhangqiaorjc force-pushed the dsys branch 3 times, most recently from f2dacf1 to c320d65 on October 26, 2021 at 21:19.
@copybara-service bot merged commit 934bfc0 into jax-ml:main on Oct 28, 2021.
sudhakarsingh27 added a commit to sudhakarsingh27/t5x that referenced this pull request on Jun 23, 2022:
To run T5x on multiple nodes and multiple GPUs, `jax.distributed.initialize`
needs to be called with the appropriate setup, as described in
jax-ml/jax#8364.
Added a command-line flag, `multiprocess`, to enable multiprocess T5x runs
on GPUs. Also added command-line flags for the arguments to
`jax.distributed.initialize`, namely `coordinator_address`,
`num_processes`, and `process_id`.
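
A hypothetical sketch of the flag wiring (not the actual t5x patch; the flag names follow the ones listed above):

```python
# Hypothetical flag wiring for multiprocess runs; not the actual t5x code.
from absl import flags
import jax

flags.DEFINE_bool("multiprocess", False, "Enable multi-process JAX on GPUs.")
flags.DEFINE_string("coordinator_address", None, "IP:port of process 0.")
flags.DEFINE_integer("num_processes", 1, "Total number of processes.")
flags.DEFINE_integer("process_id", 0, "Index of this process.")
FLAGS = flags.FLAGS

def maybe_initialize_distributed():
    # Only touch the distributed runtime when --multiprocess is set.
    if FLAGS.multiprocess:
        jax.distributed.initialize(FLAGS.coordinator_address,
                                   FLAGS.num_processes,
                                   FLAGS.process_id)
```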

Example usage 1 (2 processes, running on 2 separate nodes, 8 GPUs each):
On the first node:
```
python3 ${T5X_DIR}/t5x/train.py \
  --gin_file="t5x/examples/t5/t5_1_1/examples/base_wmt_from_scratch.gin" \
  --gin.MODEL_DIR=\"${MODEL_DIR}\" \
  --tfds_data_dir=${TFDS_DATA_DIR} \
  --multiprocess \
  --coordinator_address=i.p.ad.dr:port \
  --num_processes=2 \
  --process_id=0
```

On the second node:
```
python3 ${T5X_DIR}/t5x/train.py \
  --gin_file="t5x/examples/t5/t5_1_1/examples/base_wmt_from_scratch.gin" \
  --gin.MODEL_DIR=\"${MODEL_DIR}\" \
  --tfds_data_dir=${TFDS_DATA_DIR} \
  --multiprocess \
  --coordinator_address=i.p.ad.dr:port \
  --num_processes=2 \
  --process_id=1
```
Notice that `process_id` differs between the two processes. Also,
substitute the appropriate `coordinator_address` for `i.p.ad.dr:port`.

Example usage 2 (1 node, 2 processes, 4 GPUs each):
```
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 ${T5X_DIR}/t5x/train.py \
  --gin_file="t5x/examples/t5/t5_1_1/examples/base_wmt_from_scratch.gin" \
  --gin.MODEL_DIR=\"${MODEL_DIR}\" \
  --tfds_data_dir=${TFDS_DATA_DIR} \
  --multiprocess \
  --coordinator_address=127.0.0.1:12345 \
  --num_processes=2 \
  --process_id=0 &
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 ${T5X_DIR}/t5x/train.py \
  --gin_file="t5x/examples/t5/t5_1_1/examples/base_wmt_from_scratch.gin" \
  --gin.MODEL_DIR=\"${MODEL_DIR}\" \
  --tfds_data_dir=${TFDS_DATA_DIR} \
  --multiprocess \
  --coordinator_address=127.0.0.1:12345 \
  --num_processes=2 \
  --process_id=1
```

More information about multiprocess JAX runs:
jax-ml/jax#2731

Note: the T5x partitioning fix google-research#608 complements this change.

Fixes google-research#410 and google-research#89.
Labels: cla: yes, pull ready

3 participants