Add CUDA graph-based all reduce launcher #26

WoosukKwon · 2023-04-05T02:45:37Z

Related to #22

This PR uses a CUDA graph to reduce the CPU overhead of the NCCL all-reduce operation.
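For readers unfamiliar with the idea, here is a minimal sketch of capture-once, replay-many using PyTorch's `torch.cuda.CUDAGraph`. It is not the code from this PR: `make_replayable` is a hypothetical helper, and the doubling op is a stand-in for the collective. In a tensor-parallel setting the op would presumably be something like `torch.distributed.all_reduce(t, group=group)` on an initialized NCCL process group, which is an assumption here. On a machine without CUDA the helper falls back to an eager call so the sketch still runs.

```python
import torch


def make_replayable(op, static_buf):
    """Return a zero-argument callable that applies `op` to `static_buf` in place.

    Hypothetical helper for illustration. On a CUDA tensor the op is captured
    once into a CUDA graph and the callable replays it; on CPU it runs eagerly.
    """
    if not static_buf.is_cuda:
        def run_eager():
            op(static_buf)
        return run_eager

    # Warm up on a side stream before capture, then restore the buffer,
    # because the in-place warm-up call has modified it.
    saved = static_buf.clone()
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        op(static_buf)
    torch.cuda.current_stream().wait_stream(side)
    static_buf.copy_(saved)

    # Capture: the op is recorded into the graph, not executed.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        op(static_buf)

    def run_graph():
        graph.replay()  # one launch call instead of per-op CPU dispatch
    return run_graph


# Stand-in for the collective: double the buffer in place.
buf = torch.ones(4, 8)
run = make_replayable(lambda t: t.mul_(2), buf)
run()
run()
```

The buffer must keep the same address between capture and replay, which is why the caller writes new inputs into `static_buf` rather than passing a fresh tensor on every call.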

zhuohan123

LGTM!

zhuohan123 · 2023-04-05T17:22:53Z

cacheflow/parallel_utils/parallel_state.py

+        self.group = get_tensor_model_parallel_group()
+        self.buffer = torch.empty(
+            size=(max_num_tokens, hidden_size),
+            dtype=torch.half, # FIXME: hardcoded dtype


Add a dtype argument for this class?
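One way the suggestion could look, as a sketch only: the class name `AllReduceBuffer`, the `device` parameter, and the `stage` method are hypothetical and are not taken from the PR. The point is just that `dtype` becomes a constructor argument with `torch.half` as the default, so existing callers keep their behavior.

```python
import torch


class AllReduceBuffer:
    """Hypothetical stand-in for the class under review."""

    def __init__(self, max_num_tokens, hidden_size,
                 dtype=torch.half, device="cpu"):
        # dtype is a parameter instead of a hardcoded torch.half.
        self.buffer = torch.empty(
            size=(max_num_tokens, hidden_size),
            dtype=dtype,
            device=device,
        )

    def stage(self, x):
        """Copy `x` into the preallocated buffer and return the view used."""
        if x.dtype != self.buffer.dtype:
            raise TypeError(
                f"expected {self.buffer.dtype}, got {x.dtype}")
        if x.shape[0] > self.buffer.shape[0]:
            raise ValueError("input has more tokens than the buffer holds")
        view = self.buffer[: x.shape[0]]
        view.copy_(x)
        return view


# Usage: a float32 buffer instead of the default half precision.
staging = AllReduceBuffer(max_num_tokens=16, hidden_size=8,
                          dtype=torch.float32)
staged = staging.stage(torch.ones(4, 8))
```

Raising on a dtype mismatch, rather than casting silently, keeps a model running in a different precision from being reduced in the wrong one without notice.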

Install and configure use of the NCCL version recommended by vLLM via the [vllm-nccl](https://github.com/vllm-project/vllm-nccl) package. The install is a little wonky... but this set of changes should work. Signed-off-by: Travis Johnson <[email protected]>

WoosukKwon added 4 commits April 5, 2023 01:07

Add -tp and -pp

0f86522

Add graph-based all reduce launcher

8f4c648

max_batch_size -> max_num_batched_tokens

8077445

max_batch_size -> max_num_batched_tokens

1cfdb00

WoosukKwon requested a review from zhuohan123 April 5, 2023 09:31

zhuohan123 approved these changes Apr 5, 2023

Address comments & Code cleaning

d406199

WoosukKwon merged commit 12659a0 into main Apr 5, 2023

WoosukKwon deleted the graph branch April 5, 2023 18:17

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Add CUDA graph-based all reduce launcher (vllm-project#26)

6376304

slyalin pushed a commit to slyalin/vllm that referenced this pull request Apr 4, 2024

Merge pull request vllm-project#26 from ilya-lavrenov/disable-npu

818e384

Disable NPU merged to OV master recently

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request May 21, 2024

Merge pull request vllm-project#26 from dtrifiro/bump-deps

255735f

deps: bump fastapi to >= 0.109.1

fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request May 31, 2024

Merge pull request vllm-project#26 from ROCm/cl/updates-pag-shomy

fa75cba

Update max_context_len for custom paged attention.

tianyil1 pushed a commit to tianyil1/vllm that referenced this pull request Jun 5, 2024

Merge pull request vllm-project#26 from HabanaAI/habana_main_rebase_c…

ae3d612

…c466a3 Rebase habana_main up to cc466a3

bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request Jun 25, 2024

Merge pull request vllm-project#26 from intel-sandbox/jianan/enable_l…

dddd40f

…inear_fusion_and_prepack Enable linear fusion/prepack and MOE AWQ fusion

alixiaodi mentioned this pull request Aug 2, 2024

[Bug]: #7072

Closed
