
Optimize tensor parallel execution speed #17

Merged: 3 commits into main, Mar 31, 2023

Conversation

zhuohan123 (Member) commented:

Speed before this PR:

ubuntu@ray-zhuohan-cf-head-d95da8d2-compute:~/nfs/cacheflow/cacheflow$ python benchmark/benchmark_latency.py --model facebook/opt-13b
Namespace(batch_size=8, block_size=8, dtype='half', input_len=32, max_batch_size=2560, model='facebook/opt-13b', model_path='~/.cacheflow/model_weights', output_len=128, pipeline_parallel_size=1, seed=0, swap_space=20, tensor_parallel_size=1)
2023-03-31 14:17:41,580 INFO worker.py:1535 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
# GPU blocks: 1975, # CPU blocks: 3276
Warm up step
Profile step: 100%|██████████████████████████████████████████████████████████████| 3/3 [00:15<00:00,  5.18s/it]
Avg latency: 5.184098243713379 seconds

Speed after this PR:

ubuntu@ray-zhuohan-cf-head-d95da8d2-compute:~/nfs/cacheflow/cacheflow$ python benchmark/benchmark_latency.py --model facebook/opt-13b
Namespace(batch_size=8, block_size=8, dtype='half', input_len=32, max_batch_size=2560, model='facebook/opt-13b', model_path='~/.cacheflow/model_weights', output_len=128, pipeline_parallel_size=1, seed=0, swap_space=20, tensor_parallel_size=1)
2023-03-31 15:20:04,885 INFO worker.py:1535 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
# GPU blocks: 1975, # CPU blocks: 3276
Warm up step
Profile step: 100%|██████████████████████████████████████████████████████████████| 3/3 [00:10<00:00,  3.49s/it]
Avg latency: 3.492198626200358 seconds
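
For reference, the speedup implied by the two reported averages can be computed directly. A minimal Python sketch (the latency values are copied from the logs above; the variable names are only for illustration):

before = 5.184098243713379  # avg latency before this PR, seconds
after = 3.492198626200358   # avg latency after this PR, seconds
speedup = before / after                    # ~1.48x faster
reduction_pct = (1 - after / before) * 100  # ~32.6% lower latency
print(f"speedup: {speedup:.2f}x, latency reduction: {reduction_pct:.1f}%")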

@zhuohan123 requested a review from @WoosukKwon on March 31, 2023, 15:32
@WoosukKwon (Collaborator) left a comment:
Awesome! Thanks for the effort.

@zhuohan123 merged commit c45f3c3 into main on Mar 31, 2023
@zhuohan123 deleted the optimize-tp-speed branch on June 18, 2023, 07:22
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request on Feb 13, 2024
AdrianAbeyta referenced this pull request in ROCm/vllm on Mar 8, 2024
Rebase fp8_kv branch with upstream (3-07-2024)
z103cb referenced this pull request in z103cb/opendatahub_vllm on Apr 22, 2024
These Dockerfile changes:
- Update the release stage to work with the recently refactored
  `requirements-common.txt` / `requirements-cuda.txt` split
- Fix up the kernel compilation in the `build` stage to correctly pick up CUDA
- Install the kernels from this Docker build rather than pulling a precompiled
  wheel. We can swap that back once a new wheel is available with the correct
  PyTorch version and updated interfaces.

---------

Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request May 31, 2024
[ROCm] adding a missing triton autotune config
@alixiaodi mentioned this pull request on Aug 2, 2024