[Core][Distributed] use cpu/gloo to initialize pynccl #4248
Conversation
LGTM! Left some small comments.
```python
current_device = torch.cuda.current_device()
try:
    torch.cuda.set_device(device)
    NCCL_CHECK(
        _c_ncclCommInitRank(ctypes.byref(self.comm), self.world_size,
                            self.unique_id, self.rank))
    self.stream = torch.cuda.Stream()
finally:
    torch.cuda.set_device(current_device)
```
Why do we need a `try...finally` block here? Can the program continue to run when there is an exception in the `try` block?
It can, but I think it would be better for the function to be pure, i.e. not implicitly modify global state (here, the caller's current CUDA device).
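A minimal sketch of the same save-and-restore idea, written as a reusable context manager; the name `cuda_device_guard` is hypothetical and not part of this PR:

```python
import contextlib

import torch


@contextlib.contextmanager
def cuda_device_guard(device: int):
    """Temporarily switch the current CUDA device, restoring the previous
    device on exit even if the body raises."""
    prev = torch.cuda.current_device()
    try:
        torch.cuda.set_device(device)
        yield
    finally:
        torch.cuda.set_device(prev)


# Usage (illustrative): the communicator is created on `device`, but the
# caller's current device is left untouched afterwards.
# with cuda_device_guard(device):
#     ...initialize the NCCL communicator...
```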
NCCL initialization requires broadcasting a unique id, which lives in CPU memory. Previously we only had one NCCL backend process group, so we had to move the unique id to the GPU, broadcast it, and then move it back to the CPU. After #3904, we always have a cpu/gloo backend, so we no longer need to move the unique id around; we can broadcast it directly in CPU memory.
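A rough sketch of that idea, assuming a gloo-backed process group is already available as `cpu_group`; the helper name and the 128-byte size constant are illustrative, not the PR's actual code:

```python
import torch
import torch.distributed as dist

# NCCL's unique id is a small fixed-size opaque byte buffer that lives in
# host memory (128 bytes in current NCCL versions).
NCCL_UNIQUE_ID_BYTES = 128


def broadcast_unique_id(unique_id: bytearray, rank: int, cpu_group) -> bytes:
    # Put the bytes into a CPU tensor; gloo broadcasts CPU tensors
    # directly, so no staging copy to the GPU is needed.
    buf = torch.empty(NCCL_UNIQUE_ID_BYTES, dtype=torch.uint8)
    if rank == 0:
        buf.copy_(torch.frombuffer(unique_id, dtype=torch.uint8))
    dist.broadcast(buf, src=0, group=cpu_group)
    return bytes(buf.tolist())
```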