Refactor attention kernels #53
Conversation
@@ -0,0 +1,5 @@
#pragma once
Let's use the define guard instead of #pragma once, per Google's C++ style guide :)
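For illustration, a define guard for this header might look something like the sketch below; the macro name is only a placeholder following the Google style guide's <PROJECT>_<PATH>_<FILE>_H_ convention, not the actual path in this PR.

```cuda
// Placeholder guard name; the real name would be derived from the
// project name and the header's path, per the Google C++ style guide.
#ifndef PROJECT_ATTENTION_UTILS_H_
#define PROJECT_ATTENTION_UTILS_H_

// ... declarations ...

#endif  // PROJECT_ATTENTION_UTILS_H_
```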
Both options have pros and cons. I think it's safe to use #pragma once, because it is commonly used in DL projects such as PyTorch and FasterTransformer.
Performance (batch_size=8, context_len=512, num_heads=40, head_size=128):
There's a slight improvement in the kernel performance due to the use of fp16 in logits * V.
This PR refactors the attention kernels, making the helper functions more modular and pruning unused code. This will make it easier to add support for new data types such as bfloat16.

In addition, this PR reduces the computation overhead of the attention kernel by using reduced precision (i.e., fp16) for logits * V instead of full precision. This is consistent with FasterTransformer's implementation.
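To illustrate the precision change at a high level, here is a minimal sketch, not the PR's actual kernel code: the function names, shapes, and memory layout are made up, and the real kernel would vectorize the loop and handle strides. It only contrasts where the fp32 and fp16 arithmetic would differ when accumulating logits * V.

```cuda
#include <cuda_fp16.h>

// Illustrative only: one output element of the attention output, i.e. the
// dot product of the softmax logits with one column of V.

// Full-precision variant: V is up-converted and the multiply-accumulate
// runs entirely in fp32.
__device__ float logits_dot_v_fp32(const float* logits, const __half* v,
                                   int context_len) {
  float acc = 0.0f;
  for (int i = 0; i < context_len; ++i) {
    acc += logits[i] * __half2float(v[i]);
  }
  return acc;
}

// Reduced-precision variant: the logits are down-converted so the fused
// multiply-add runs in fp16, trading a little accuracy for lower
// conversion and arithmetic cost.
__device__ float logits_dot_v_fp16(const float* logits, const __half* v,
                                   int context_len) {
  __half acc = __float2half(0.0f);
  for (int i = 0; i < context_len; ++i) {
    acc = __hfma(__float2half(logits[i]), v[i], acc);
  }
  return __half2float(acc);
}
```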