Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script #27
Merged
Conversation
suquark approved these changes on Apr 8, 2023:
LGTM, thanks!
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request on Feb 13, 2024:
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script (vllm-project#27)
* Add query stride to multi_query_cached_kv_attention
* Add kernel benchmark script
slyalin pushed a commit to slyalin/vllm that referenced this pull request on Apr 15, 2024:
Add bitsandbytes to requirements and use fixed vllm version in the client
z103cb referenced this pull request in z103cb/opendatahub_vllm on May 16, 2024:
This fixes a miss: I had seen usages of `.labels` that `**`-unpack a dictionary into kwargs, but I accidentally passed a raw dictionary as a value instead of using keyword arguments 🤦. This caused metrics to show e.g. method="{'method': 'prefill'}" instead of method=prefill.
Signed-off-by: Joe Runde <[email protected]>
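The failure mode described in that commit is easy to reproduce with prometheus_client, whose `labels()` method accepts label values either positionally or as keyword arguments and stringifies whatever it receives. A minimal sketch (the metric name and label here are hypothetical, not vLLM's actual metric definitions):

```python
from prometheus_client import Counter

# Hypothetical metric for illustration only.
request_counter = Counter("request_phases_total", "Requests by phase", ["method"])

params = {"method": "prefill"}

# Bug: the dict itself is taken as the positional value for the "method"
# label, so the exported series reads method="{'method': 'prefill'}".
request_counter.labels(params).inc()

# Fix: unpack the dict into keyword arguments, giving method="prefill".
request_counter.labels(**params).inc()
```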
tianyil1 pushed commits to tianyil1/vllm that referenced this pull request on Jun 5, 2024:
* Bucketing/Warmup WIP
* Cleanup
* Revert "Fix model_output_idx on HPU (vllm-project#27)" (reverts commit 90dfa92)
* Rework selected_token_indices fix to also work with block_size padding
* Simple prompt attention POC
* Remove cumsum
* MQA/GQA support for simple prompt_attention
* Cleanup
* Fix typo
* Restore profiling runs
yukavio pushed a commit to yukavio/vllm that referenced this pull request on Jul 3, 2024:
…t#27)
SUMMARY:
* initial set of "actions with a little a" that are the building blocks for an eventual CI system
* "build test" workflow
* "remote push" workflow on `a10g`
* update some requirement files to have packages listed in alphabetical order
NOTE: this PR is still somewhat nebulous, as I'm still working through building and testing "neuralmagic-vllm" in our automation environment.
TEST: currently I'm working through various workflow components, i.e. "actions with a little a". The bits making up the actions in this PR have been constructed from my notes along the way. We can do a "complete" run that includes linting, building, installing, and running tests.
GHA link: https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7975058564
testmo: https://neuralmagic.testmo.net/automation/runs/view/8097
Latest GHA link: https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7992489982
Co-authored-by: andy-neuma <[email protected]>
dllehr-amd pushed a commit to dllehr-amd/vllm that referenced this pull request on Jul 22, 2024:
* [Kernel] Enable custom AR on ROCm
* Install amdsmi in Docker in preparation for custom all reduce (cherry picked from commit f6cfb9bf31e9feeefbdedecf2165f80dd0564b75)
* Fix for yapf
* Linting and small fixes to vLLM syntax (cherry picked from commit 2cf8103bfb0afce59b28a06c5bbe905983c42728)
Co-authored-by: Matthew Wong <[email protected]>
This PR adds a query stride to the multi_query_cached_kv_attention kernel so that it can support non-contiguous query tensors in our OPT and LLaMA models (after #20). This PR also adds a benchmark script comparing our multi_query_cached_kv_attention with the optimized FlashAttention implementation.
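For context on why the stride argument matters: in PyTorch, slicing the query out of a larger activation buffer yields a non-contiguous view, so a kernel that assumes rows are packed back-to-back would read the wrong memory unless the caller first pays for a `.contiguous()` copy. A minimal sketch of the situation (the shapes and the fused-QKV layout below are illustrative assumptions, not vLLM's exact internals):

```python
import torch

num_tokens, num_heads, head_size = 8, 12, 64

# Suppose a fused QKV projection writes Q, K, and V into one buffer;
# the query is then carved out as a view of that buffer.
qkv = torch.randn(num_tokens, 3, num_heads, head_size)
query = qkv[:, 0]  # shape (num_tokens, num_heads, head_size), no copy

print(query.is_contiguous())         # False
print(query.stride(0))               # 2304 = 3 * num_heads * head_size
print(query.contiguous().stride(0))  # 768 = num_heads * head_size, after a copy

# With an explicit stride argument, a kernel can locate token i's query
# vectors at query_ptr + i * query_stride and skip the extra copy entirely.
query_stride = query.stride(0)
```

Avoiding the copy also matters for a fair kernel benchmark: timing a `.contiguous()` pass alongside the kernel would inflate its cost relative to the FlashAttention baseline.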