[Misc] Support attention logits soft-capping with flash-attn #7022
This PR adds support for attention logits soft-capping in the FlashAttention backend.
This is done by:
1. Moving `logits_soft_cap` from `AttentionMetadata` to `AttentionImpl`.
2. Using `vllm-flash-attn == 2.6.1`, which added support for soft-capping.

Note that vllm-flash-attn v2.6.1 uses PyTorch 2.4.0, so this PR must be merged after #6951.
This will hopefully resolve many of the issues with running Gemma2 models.
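
For context, soft-capping bounds the attention logits before the softmax by squashing them into `(-cap, cap)` with a `tanh`. Below is a minimal PyTorch sketch of the math only, not vLLM's implementation (the function name and the example cap value are illustrative; Gemma2 configs use a cap of 50.0 for attention logits):

```python
import torch

def soft_cap_attention_scores(
    scores: torch.Tensor,    # raw attention logits, e.g. (q @ k.transpose(-2, -1)) * scale
    logits_soft_cap: float,  # e.g. 50.0 for Gemma2 attention
) -> torch.Tensor:
    # tanh maps to (-1, 1), so the result is bounded in (-logits_soft_cap, logits_soft_cap)
    return logits_soft_cap * torch.tanh(scores / logits_soft_cap)
```

With flash-attn >= 2.6 this capping is fused into the attention kernel itself, so the full logits matrix never has to be materialized; the snippet above only illustrates the operation the kernel performs.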