
Optimize tensor parallel execution speed #17

Merged: 3 commits into main, Mar 31, 2023

Conversation

zhuohan123 (Member) commented:

Speed before this PR:

ubuntu@ray-zhuohan-cf-head-d95da8d2-compute:~/nfs/cacheflow/cacheflow$ python benchmark/benchmark_latency.py --model facebook/opt-13b
Namespace(batch_size=8, block_size=8, dtype='half', input_len=32, max_batch_size=2560, model='facebook/opt-13b', model_path='~/.cacheflow/model_weights', output_len=128, pipeline_parallel_size=1, seed=0, swap_space=20, tensor_parallel_size=1)
2023-03-31 14:17:41,580 INFO worker.py:1535 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
# GPU blocks: 1975, # CPU blocks: 3276
Warm up step
Profile step: 100%|██████████████████████████████████████████████████████████████| 3/3 [00:15<00:00,  5.18s/it]
Avg latency: 5.184098243713379 seconds

Speed after this PR:

ubuntu@ray-zhuohan-cf-head-d95da8d2-compute:~/nfs/cacheflow/cacheflow$ python benchmark/benchmark_latency.py --model facebook/opt-13b
Namespace(batch_size=8, block_size=8, dtype='half', input_len=32, max_batch_size=2560, model='facebook/opt-13b', model_path='~/.cacheflow/model_weights', output_len=128, pipeline_parallel_size=1, seed=0, swap_space=20, tensor_parallel_size=1)
2023-03-31 15:20:04,885 INFO worker.py:1535 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
# GPU blocks: 1975, # CPU blocks: 3276
Warm up step
Profile step: 100%|██████████████████████████████████████████████████████████████| 3/3 [00:10<00:00,  3.49s/it]
Avg latency: 3.492198626200358 seconds
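
For reference, the speedup implied by the two reported averages can be computed directly. A minimal Python sketch (the latency values are copied from the logs above; the variable names are only for illustration):

before = 5.184098243713379  # avg latency before this PR, seconds
after = 3.492198626200358   # avg latency after this PR, seconds
speedup = before / after                    # ~1.48x faster
reduction_pct = (1 - after / before) * 100  # ~32.6% lower latency
print(f"speedup: {speedup:.2f}x, latency reduction: {reduction_pct:.1f}%")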

@zhuohan123 requested a review from @WoosukKwon on March 31, 2023, 15:32
@WoosukKwon (Collaborator) left a comment:
Awesome! Thanks for the effort.

@zhuohan123 merged commit c45f3c3 into main on Mar 31, 2023
@zhuohan123 deleted the optimize-tp-speed branch on June 18, 2023, 07:22
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request on Feb 13, 2024
AdrianAbeyta referenced this pull request in ROCm/vllm on Mar 8, 2024
Rebase fp8_kv branch with upstream (3-07-2024)
z103cb referenced this pull request in z103cb/opendatahub_vllm on Apr 22, 2024
These Dockerfile changes:
- Update the release stage to work with the recently refactored
  `requirements-common.txt` / `requirements-cuda.txt` split
- Fix up the kernel compilation in the `build` stage to correctly pick up CUDA
- Install the kernels from this Docker build rather than pulling a precompiled
  wheel. We can swap that back once a new wheel is available with the correct
  PyTorch version and updated interfaces.

---------

Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request May 31, 2024
[ROCm] adding a missing triton autotune config
@alixiaodi mentioned this pull request on Aug 2, 2024