
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script #27

Merged 2 commits into main on Apr 8, 2023

Conversation

WoosukKwon
Collaborator

This PR adds query stride to the multi_query_cached_kv_attention kernel so that it can support non-contiguous query tensors in our OPT and LLaMA models (after #20).
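The stride issue can be sketched outside CUDA: when the query is a slice of a fused QKV projection output, consecutive tokens' queries are separated by the full QKV row rather than by `num_heads * head_size` elements, so a kernel that assumes a contiguous query tensor would read the wrong memory. A minimal NumPy illustration (the shapes are hypothetical, not taken from the PR):

```python
import numpy as np

# Hypothetical shapes for illustration only.
num_tokens, num_heads, head_size = 4, 8, 16

# Queries packed together with keys and values in one projection output:
# slicing out the query part yields a non-contiguous view.
qkv = np.zeros((num_tokens, 3, num_heads, head_size), dtype=np.float32)
query = qkv[:, 0]  # shape (num_tokens, num_heads, head_size)

assert not query.flags["C_CONTIGUOUS"]

# Distance (in elements) between consecutive tokens' queries is the
# full qkv row (3 * num_heads * head_size), not num_heads * head_size.
token_stride = query.strides[0] // query.itemsize
print(token_stride)  # 384 here, vs. 128 for a contiguous query
```

Passing this per-token stride to the kernel lets it index `query + token_idx * token_stride` correctly instead of assuming the packed layout.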

This PR also adds a benchmark script comparing our multi_query_cached_kv_attention and the optimized Flash attention implementation.
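The benchmark script itself is not reproduced here, but such comparisons follow a common pattern: warm up, then average wall-clock time over repeated calls. A generic sketch (the callables below are placeholders, and real GPU kernels would additionally need device synchronization before reading the clock):

```python
import time

def benchmark(fn, *args, warmup=3, iters=20):
    """Time fn(*args): run warmup iterations first, then
    return the mean wall-clock time over iters runs."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

# Example: two implementations of the same computation.
slow = lambda n: sum(i * i for i in range(n))          # O(n) loop
fast = lambda n: (n - 1) * n * (2 * n - 1) // 6        # closed form

t_slow = benchmark(slow, 10_000)
t_fast = benchmark(fast, 10_000)
assert slow(10_000) == fast(10_000)  # same result, different cost
```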

@WoosukKwon WoosukKwon requested a review from suquark April 5, 2023 06:00
Contributor

@suquark suquark left a comment


LGTM, thanks!

@suquark suquark merged commit c267b1a into main Apr 8, 2023
@suquark suquark deleted the kernel-fix branch April 8, 2023 20:36
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
…hmark script (vllm-project#27)

* Add query stride to multi_query_cached_kv_attention

* Add kernel benchmark script
slyalin pushed a commit to slyalin/vllm that referenced this pull request Apr 15, 2024
Add bitsandbytes to requirements and use fixed vllm version in the client
z103cb referenced this pull request in z103cb/opendatahub_vllm May 16, 2024
This fixes a mistake: I had seen usages of `.labels` that `**`-unpack a dictionary
into kwargs, but I accidentally passed a raw dictionary as a value
instead of using keyword arguments 🤦. This caused metrics to show e.g.
`method="{'method': 'prefill'}"` instead of `method=prefill`

Signed-off-by: Joe Runde <[email protected]>
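The mistake described above is easy to reproduce in plain Python: `**`-unpacking a dict produces keyword arguments, while passing the dict itself makes the whole dict a single label value that gets stringified. A minimal sketch (the `labels` function here is a stand-in for the metrics API, not the actual implementation):

```python
def labels(**kwargs):
    # Stand-in for a metrics .labels(**kwargs) call:
    # each keyword becomes a label, values are stringified.
    return {k: str(v) for k, v in kwargs.items()}

params = {"method": "prefill"}

# Correct: unpack the dict into keyword arguments.
good = labels(**params)      # {'method': 'prefill'}

# The bug: passing the raw dict as one value stringifies the whole dict.
bad = labels(method=params)  # {'method': "{'method': 'prefill'}"}
```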
tianyil1 pushed a commit to tianyil1/vllm that referenced this pull request Jun 5, 2024
* Bucketing/Warmup WIP

* Cleanup

* Revert "Fix model_output_idx on HPU (vllm-project#27)"

This reverts commit 90dfa92.

* Rework selected_token_indices fix to also work with block_size padding

* Simple prompt attention POC

* Remove cumsum

* MQA/GQA support for simple prompt_attention

* Cleanup

* Fix typo

* Restore profiling runs
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
…t#27)

SUMMARY:
* initial set of "actions with a little a" that are the building blocks
for eventual CI system
* "build test" workflow
* "remote push" workflow on `a10g`
* update some requirement files to have packages listed in alphabetical
order

NOTE: this PR is still somewhat nebulous as I'm still working through
building and testing "neuralmagic-vllm" in our automation environment.

TEST:
Currently, I'm working through various workflow components, i.e.
"actions with a little a". The bits making up the actions in this PR
have been constructed from my notes along the way.

We can do a "complete" run that includes: linting, building, installing,
and running tests.

GHA link ...
https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7975058564
`testmo` ... https://neuralmagic.testmo.net/automation/runs/view/8097

Latest GHA link ...
https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7992489982

---------

Co-authored-by: andy-neuma <[email protected]>
dllehr-amd pushed a commit to dllehr-amd/vllm that referenced this pull request Jul 22, 2024
* [Kernel] Enable custom AR on ROCm

* Install amdsmi in Docker in preparation for custom all reduce

(cherry picked from commit f6cfb9bf31e9feeefbdedecf2165f80dd0564b75)

* Fix for yapf

* Linting and small fixes to vLLM syntax

(cherry picked from commit 2cf8103bfb0afce59b28a06c5bbe905983c42728)

---------

Co-authored-by: Matthew Wong <[email protected]>
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024