Automatically configure KV cache size #6

WoosukKwon · 2023-03-03T10:05:40Z

This PR adds an OPT memory analyzer to the system and uses it to automatically determine the KV cache size.

Tested models:

OPT-125M
OPT-350M
OPT-1.3B
OPT-2.7B
OPT-6.7B
OPT-13B

Tested GPUs:

A100
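As a rough illustration of the idea (not the code in this PR), the sketch below shows one way a memory analyzer could turn a GPU memory budget into a number of KV cache blocks. All names here (`ModelShape`, `kv_block_bytes`, `max_num_gpu_blocks`, `utilization`) are hypothetical, and the caller is assumed to supply the parameter and peak-activation sizes.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Hypothetical description of a decoder-only model such as OPT."""
    num_layers: int
    hidden_size: int
    dtype_bytes: int = 2  # assumes fp16 weights and cache


def kv_block_bytes(shape: ModelShape, block_size: int) -> int:
    """Bytes needed by one cache block holding `block_size` tokens.

    Each token stores one key and one value vector of `hidden_size`
    elements in every layer, hence the leading factor of 2.
    """
    return 2 * shape.num_layers * block_size * shape.hidden_size * shape.dtype_bytes


def max_num_gpu_blocks(
    shape: ModelShape,
    block_size: int,
    gpu_memory_bytes: int,
    param_bytes: int,
    peak_activation_bytes: int,
    utilization: float = 0.95,  # assumed safety margin, not taken from the PR
) -> int:
    """Number of KV cache blocks that fit after weights and activations."""
    usable = int(gpu_memory_bytes * utilization)
    free = usable - param_bytes - peak_activation_bytes
    if free <= 0:
        return 0
    return free // kv_block_bytes(shape, block_size)


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements from this PR.
    shape = ModelShape(num_layers=40, hidden_size=5120)
    blocks = max_num_gpu_blocks(
        shape,
        block_size=16,
        gpu_memory_bytes=40 * 1024**3,
        param_bytes=26 * 1024**3,
        peak_activation_bytes=2 * 1024**3,
    )
    print(f"GPU blocks: {blocks}")
```

The same arithmetic could be applied to a CPU memory budget to size swap space; how the PR's analyzer actually estimates parameter and activation memory is not shown in this description.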

WoosukKwon added 17 commits March 3, 2023 04:16

Fix a bug in 1D shape

e5a1fa8

Minor

342275f

Minor

b91a2fa

[WIP] Add memory analyzer

d78e2fb

Automatically config GPU/CPU blocks

2649eb5

Remove TODO

1ae7420

Merge branch 'main' into autoconfig

6654b34

Merge branch 'main' into autoconfig

fcbf027

Add max_num_batched_tokens argument

350ed27

Minor

6f5b41b

Minor

2d03918

Refactor model utils

8ec00fe

Re-implement memory analyzer

84203fc

Fix __init__

96b216c

Use memory analyzer in server.py

c89d440

Add psutil to README

f5d1e2c

Fix comment

cc63c24

WoosukKwon merged commit e9d3f2f into main Mar 12, 2023

WoosukKwon deleted the autoconfig branch March 12, 2023 07:23

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

shanshanpt mentioned this pull request Nov 17, 2023

Run long context error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Add memory analyzer & automatically configure KV cache size (vllm-project#6)

de10960

slyalin pushed a commit to slyalin/vllm that referenced this pull request Mar 21, 2024

Merge pull request vllm-project#6 from mzegla/extended_requirements

2922b06

Add missing Python requirements

mzusman added a commit to mzusman/vllm that referenced this pull request Apr 16, 2024

dtype (vllm-project#6)

00bce1f

Co-authored-by: Mor Zusman <[email protected]>

dtrifiro referenced this pull request in dtrifiro/vllm Apr 26, 2024

Merge pull request #6 from z103cb/ibm_main_docker_ubi_updates

91e4a51

[CI/Build] Dockerfile.ubi : Remove test stage

dlopes78 mentioned this pull request May 8, 2024

[Bug]: VLLM + tritonserver #4695

Closed

Starmys pushed a commit to Starmys/vllm that referenced this pull request May 20, 2024

Merge pull request vllm-project#6 from wenxcs/wenxh/fp8-on-a100

4e56e27

FP8 on A100 for PHIMOE

oliver-li mentioned this pull request Jul 5, 2024

[Bug]: NCCL hangs and causes timeout #5484

Closed

This was referenced Jul 5, 2024

Support W4A8 quantization for vllm #5218

Merged

[Bug]: call for stack trace for "Watchdog caught collective operation timeout" #6042

Closed

ehuaa mentioned this pull request Jul 19, 2024

[Bug]: The vllm is disconnected after running for some time #5084

Closed

xinzaifeixiang1992 mentioned this pull request Jul 24, 2024

[Bug]: Deploying the Qwen2-72b-instruct-awq model with vllm-0.5.3.post1: the service works normally at first, but errors occur under high concurrency #6734

Closed

alixiaodi mentioned this pull request Aug 2, 2024

[Bug]: #7072

Closed

Minami-su mentioned this pull request Aug 11, 2024

[Bug]: vllm is crashed on v0.5.3.post1 #7161

Closed

wangwensuo mentioned this pull request Aug 22, 2024

[Bug]: llama3-405b-fp8 NCCL communication #7775

Closed

zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024

Merge pull request vllm-project#6 from KuntaiDu/kuntai-disagg-refactor

9eefec2

Kuntai disagg refactor

liulisi16323 mentioned this pull request Sep 24, 2024

[Bug]: v0.5.5 crash: "AssertionError: expected running sequences" #8016

Closed

1 task

heheda12345 added a commit to heheda12345/vllm that referenced this pull request Sep 25, 2024

Merge pull request vllm-project#6 from vllm-project/chang/num_prompt_tokens

9b931bf

[Bugfix] Include encoder_prompt_tokens in num_prompt_tokens in UsageInfo

Clint-chan mentioned this pull request Sep 29, 2024

[Bug]: Vllm0.6.2 UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown #8933

Open

1 task

SpaceHunterInf mentioned this pull request Sep 30, 2024

[Bug]: Bus error (core dumped) #8974

Closed

1 task

This was referenced Oct 12, 2024

[Bug]: RuntimeError: CUDA error: an illegal memory access was encountered #6976

Closed

[Bug]: Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered #9306

Open

xxzhang0927 mentioned this pull request Oct 30, 2024

[Bug]: Engine iteration timed out. This should never happen! #9839

Open

1 task

hteeyeoh mentioned this pull request Dec 6, 2024

[Bug]: Not able to install/compile vllm using alpine linux base image #10924

Open

1 task

HelenaSak mentioned this pull request Feb 19, 2025

[Bug]: watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered #8177

Open

1 task
