Refactor attention kernels #53
Merged (17 commits, May 3, 2023)

Conversation

@WoosukKwon (Collaborator) commented May 2, 2023

This PR refactors the attention kernels, making the helper functions more modular and pruning unused code. It will also make it easier to add support for new data types such as bfloat16.
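
As a rough sketch of why dtype-templated helpers make adding a type like bfloat16 easier (the trait and helper names below are hypothetical, not the code in this PR):

#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Hypothetical trait mapping a scalar type to its packed vector type.
// Supporting a new data type then means adding one specialization instead of
// duplicating the kernel; the names are illustrative, not vLLM's.
template <typename T> struct VecType;
template <> struct VecType<__half>        { using Type = __half2; };
template <> struct VecType<__nv_bfloat16> { using Type = __nv_bfloat162; };

// A small dtype-generic helper that every instantiation of the kernel can share.
template <typename scalar_t>
__device__ __forceinline__ float to_float(scalar_t x) {
  return static_cast<float>(x);
}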

In addition, this PR reduces the computation overhead of the attention kernel by using reduced precision (i.e., fp16) for the logits * V computation instead of full precision. This is consistent with FasterTransformer's implementation.
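
A minimal sketch of the reduced-precision accumulation, assuming a simple per-thread loop over the context (the function name and memory layout are hypothetical; the actual kernel vectorizes and tiles this work):

#include <cuda_fp16.h>

// Hypothetical per-thread accumulation of logits * V. Keeping the running sum
// in the cache data type (e.g. __half) instead of promoting every product to
// float is the reduced-precision change described above.
template <typename scalar_t>
__device__ float accumulate_logits_v(const scalar_t* logits,  // softmax weights, length context_len
                                     const scalar_t* v,       // one V lane, strided per token
                                     int context_len, int stride) {
  scalar_t acc = static_cast<scalar_t>(0.0f);
  for (int i = 0; i < context_len; ++i) {
    acc += logits[i] * v[i * stride];  // multiply-add in reduced precision
  }
  return static_cast<float>(acc);      // the result can still be widened for the final write
}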

@WoosukKwon requested a review from zhuohan123 on May 2, 2023 07:38
@@ -0,0 +1,5 @@
#pragma once
Member commented on this line:

Let's use the define guard instead of #pragma once per Google's C++ style guide :)

Collaborator Author (WoosukKwon) replied:

Either option has pros and cons. I think it's safe to use #pragma once, because it is commonly used in DL projects such as PyTorch and FasterTransformer.
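
For reference, the two options under discussion look like this (the guard macro name is illustrative):

// Option 1: #define guard, as recommended by Google's C++ style guide.
#ifndef CSRC_ATTENTION_UTILS_H_
#define CSRC_ATTENTION_UTILS_H_
// ... declarations ...
#endif  // CSRC_ATTENTION_UTILS_H_

// Option 2: #pragma once, non-standard but widely supported, and used in
// headers of projects such as PyTorch and FasterTransformer.
#pragma once
// ... declarations ...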

@WoosukKwon requested a review from zhuohan123 on May 3, 2023 06:27
@WoosukKwon (Collaborator, Author) commented May 3, 2023

Performance (batch_size=8, context_len=512, num_heads=40, head_size=128):

Before: 83.4 us
After: 82.5 us

There is a slight improvement in kernel performance due to the use of fp16 in the logits * V computation.
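
The benchmark script is not shown here, but kernel latencies like the ones above are commonly measured with CUDA events along these lines (attention_kernel_stub is a placeholder, not the actual kernel):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the attention kernel; the launch shape mirrors
// the configuration quoted above (batch_size=8, num_heads=40).
__global__ void attention_kernel_stub(int context_len, int head_size) {}

int main() {
  const int batch_size = 8, context_len = 512, num_heads = 40, head_size = 128;
  const int iters = 100;

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Warm up once, then time a batch of launches and report the mean latency.
  attention_kernel_stub<<<dim3(num_heads, batch_size), 128>>>(context_len, head_size);
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) {
    attention_kernel_stub<<<dim3(num_heads, batch_size), 128>>>(context_len, head_size);
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("mean latency: %.1f us\n", ms * 1000.0f / iters);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}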

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request Jun 18, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
dllehr-amd pushed a commit to dllehr-amd/vllm that referenced this pull request Jul 22, 2024
Revert "Revert "Tune fused_moe_kernel for TP 1,2,4,8 and bf16 and fp16, updated moe kern…"" (…kar-amd-patch-1)
@alixiaodi mentioned this pull request Aug 2, 2024