Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script #27
Merged
Conversation
suquark approved these changes on Apr 8, 2023:
LGTM, thanks!
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request on Feb 13, 2024:
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script (vllm-project#27)
* Add query stride to multi_query_cached_kv_attention
* Add kernel benchmark script
slyalin pushed a commit to slyalin/vllm that referenced this pull request on Apr 15, 2024:
Add bitsandbytes to requirements and use fixed vllm version in the client
z103cb referenced this pull request in z103cb/opendatahub_vllm on May 16, 2024:
This fixes a miss: I had seen usages of `.labels` that `**`-unpack a dictionary into kwargs, but I accidentally passed a raw dictionary as a value instead of using keyword arguments 🤦. This caused metrics to show e.g. method="{'method': 'prefill'}" instead of method=prefill.
Signed-off-by: Joe Runde <[email protected]>
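The failure mode described in that commit is easy to reproduce with prometheus_client, whose `labels()` method accepts label values either positionally or as keyword arguments and stringifies whatever it receives. A minimal sketch (the metric name and label here are hypothetical, not vLLM's actual metric definitions):

```python
from prometheus_client import Counter

# Hypothetical metric for illustration only.
request_counter = Counter("request_phases_total", "Requests by phase", ["method"])

params = {"method": "prefill"}

# Bug: the dict itself is taken as the positional value for the "method"
# label, so the exported series reads method="{'method': 'prefill'}".
request_counter.labels(params).inc()

# Fix: unpack the dict into keyword arguments, giving method="prefill".
request_counter.labels(**params).inc()
```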
tianyil1 pushed commits to tianyil1/vllm that referenced this pull request on Jun 5, 2024:
* Bucketing/Warmup WIP
* Cleanup
* Revert "Fix model_output_idx on HPU (vllm-project#27)" (reverts commit 90dfa92)
* Rework selected_token_indices fix to also work with block_size padding
* Simple prompt attention POC
* Remove cumsum
* MQA/GQA support for simple prompt_attention
* Cleanup
* Fix typo
* Restore profiling runs
yukavio pushed a commit to yukavio/vllm that referenced this pull request on Jul 3, 2024:
…t#27)
SUMMARY:
* initial set of "actions with a little a" that are the building blocks for an eventual CI system
* "build test" workflow
* "remote push" workflow on `a10g`
* update some requirement files to have packages listed in alphabetical order
NOTE: this PR is still somewhat nebulous, as I'm still working through building and testing "neuralmagic-vllm" in our automation environment.
TEST: currently I'm working through various workflow components, i.e. "actions with a little a". The bits making up the actions in this PR have been constructed from my notes along the way. We can do a "complete" run that includes linting, building, installing, and running tests.
GHA link: https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7975058564
testmo: https://neuralmagic.testmo.net/automation/runs/view/8097
Latest GHA link: https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7992489982
Co-authored-by: andy-neuma <[email protected]>
dllehr-amd pushed a commit to dllehr-amd/vllm that referenced this pull request on Jul 22, 2024:
* [Kernel] Enable custom AR on ROCm
* Install amdsmi in Docker in preparation for custom all reduce (cherry picked from commit f6cfb9bf31e9feeefbdedecf2165f80dd0564b75)
* Fix for yapf
* Linting and small fixes to vLLM syntax (cherry picked from commit 2cf8103bfb0afce59b28a06c5bbe905983c42728)
Co-authored-by: Matthew Wong <[email protected]>
This PR adds a query stride to the multi_query_cached_kv_attention kernel so that it can support non-contiguous query tensors in our OPT and LLaMA models (after #20). This PR also adds a benchmark script comparing our multi_query_cached_kv_attention with the optimized FlashAttention implementation.
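For context on why the stride argument matters: in PyTorch, slicing the query out of a larger activation buffer yields a non-contiguous view, so a kernel that assumes rows are packed back-to-back would read the wrong memory unless the caller first pays for a `.contiguous()` copy. A minimal sketch of the situation (the shapes and the fused-QKV layout below are illustrative assumptions, not vLLM's exact internals):

```python
import torch

num_tokens, num_heads, head_size = 8, 12, 64

# Suppose a fused QKV projection writes Q, K, and V into one buffer;
# the query is then carved out as a view of that buffer.
qkv = torch.randn(num_tokens, 3, num_heads, head_size)
query = qkv[:, 0]  # shape (num_tokens, num_heads, head_size), no copy

print(query.is_contiguous())         # False
print(query.stride(0))               # 2304 = 3 * num_heads * head_size
print(query.contiguous().stride(0))  # 768 = num_heads * head_size, after a copy

# With an explicit stride argument, a kernel can locate token i's query
# vectors at query_ptr + i * query_stride and skip the extra copy entirely.
query_stride = query.stride(0)
```

Avoiding the copy also matters for a fair kernel benchmark: timing a `.contiguous()` pass alongside the kernel would inflate its cost relative to the FlashAttention baseline.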