Implements the attention kernel with the vertical and slash sparse pattern described in Appendix C.4.2 of https://arxiv.org/abs/2407.02490 (as sparse_attn_func) #33
Conversation
vllm-flash-attn has unfortunately diverged (in conflicting ways) from upstream, but we are trying to simplify the diffs against upstream. In that spirit, I think it would be really helpful if we could push most of the additions in
csrc/flash_attn/src/flash_fwd_kernel.h
and
csrc/flash_attn/src/flash_fwd_launch_template.h
into their own files, i.e. move them to files like:
csrc/flash_attn/src/vllm_extensions/flash_fwd_sparse_kernel.h
and csrc/flash_attn/src/vllm_extensions/flash_fwd_sparse_launch_template.h
(I'm looking into ways to reduce the diffs in csrc/flash_attn/flash_api.cpp as well,
but that is trickier, so I think what's in this PR currently is fine; we can address it in a future PR.)
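A minimal sketch of what such a header split could look like; the exact file layout and the kernel name compute_sparse_attn_1rowblock are placeholders for illustration, not necessarily what the PR ends up with:

```cpp
// csrc/flash_attn/src/vllm_extensions/flash_fwd_sparse_kernel.h (illustrative sketch only)
#pragma once

// Pull in the unmodified upstream kernel utilities so the sparse additions
// live entirely under vllm_extensions/ and the upstream files stay diff-free.
#include "../flash_fwd_kernel.h"

namespace flash {

// Placeholder name: the vertical-and-slash sparse forward kernel, declared here
// instead of being added to the upstream flash_fwd_kernel.h.
template <typename Kernel_traits, bool Is_causal, bool Is_even_K, typename Params>
__device__ void compute_sparse_attn_1rowblock(const Params &params,
                                              const int bidb,
                                              const int bidh,
                                              const int m_block);

}  // namespace flash
```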
Thanks for your suggestion! I've moved the code to new files.
I've moved the code to new files.
Thank you!
Thanks for addressing my previous comments and thanks for the contribution! I did another pass and left some more comments.
Overall the kernel seems quite cool, but there is a lot of commented-out code (that does not appear to be of the "un-comment for useful debug prints" kind) that could use cleaning up. I think I caught most of it, but another clean-up pass could be useful here.
Thanks for all the changes! Overall this looks pretty good to me; I left a couple more (optional) nits.
My final concern is binary size (we are a bit sensitive to this in vLLM). Do you know which head dims are actually being used (since there is only a limited set of models using this currently)? Ideally we'd only build and ship those for now.
Can you get the DCO check to pass by signing off on the commits (https://github.com/apps/dco)? After that I think everything is good from my side. @WoosukKwon, not sure if you want to take a look?
// flash::copy</*Is_even_MN=*/true, Is_even_K>(gmem_tiled_copy_QKV, tVgVBlock, tVsV, tKVcKV, tKVpKV);
#pragma unroll
for (int m = 0; m < size<1>(tVgVToken); ++m) {
    if (true) { // Is_even_MN
nit: bump
Keeping headdim 128 only is enough for us for now. The DCO check now passes. Thank you again for your thorough review and valuable suggestions!
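For reference, a gate along these lines is one way to keep only the head-dim-128 instantiations in the binary. FP16_SWITCH is the usual flash-attention dispatch macro; run_mha_fwd_sparse_ and the surrounding names are placeholders rather than the exact entry points in this PR:

```cpp
// Illustrative dispatch only: compile and ship the sparse path for head dim 128
// alone, so only one set of kernel instantiations (one hdim128 .cu per dtype)
// ends up in the binary.
void run_mha_fwd_sparse(Flash_fwd_params &params, cudaStream_t stream) {
    TORCH_CHECK(params.d == 128,
                "sparse_attn_func is only compiled for head_dim == 128 in this build");
    FP16_SWITCH(!params.is_bf16, [&] {
        // run_mha_fwd_sparse_ is a placeholder for the per-head-dim launcher.
        run_mha_fwd_sparse_<elem_type, /*kHeadDim=*/128>(params, stream);
    });
}
```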
LGTM now, thanks!
Thank you! Could you please help get the PR merged? The vLLM PR vllm-project/vllm#11844 depends on this.
@minminsun could you please expand the PR description to something like: "Implements the kernel described in Appendix C.4.2 of https://arxiv.org/abs/2407.02490 (as sparse_attn_func)", just so there will be a more useful commit message?
Sure!
Thanks for adding the varlen API! It appears that minf.py
is not used? Am I missing something, or can it be removed?
No, it's not used. Removed.
Hi @minminsun @LucasWilkinson @WoosukKwon, I just saw this PR. You might be interested in flashinfer's sparse attention implementation (which supports fine-grained block sizes and both FA2/FA3 templates). You can try this feature using the sparse attention API in flashinfer: https://docs.flashinfer.ai/api/sparse.html. It was used in projects such as Quest.
Cool, thanks for letting us know, I'll check it out!
Implements the attention kernel with the vertical and slash sparse pattern described in Appendix C.4.2 of https://arxiv.org/abs/2407.02490 (as sparse_attn_func).
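As a rough picture of the pattern itself, the toy sketch below enumerates, for one query block, the KV positions such a kernel visits: a few contiguous KV blocks covering the selected "slash" diagonals, plus a list of individual "vertical" columns. The metadata names (block_offset, column_index) follow the MInference-style layout and are assumptions for illustration, not a verbatim copy of the sparse_attn_func signature.

```cpp
#include <cstdio>
#include <vector>

// Toy host-side illustration of the vertical + slash sparse pattern from
// Appendix C.4.2 of arXiv:2407.02490. Names and values are assumed for illustration.
int main() {
    const int kBlockN = 64;  // width of one KV tile processed per inner step

    // "Slash" part: starting offsets of the contiguous KV blocks that cover the
    // selected diagonal stripes for this query block.
    std::vector<int> block_offset = {0, 448, 960};
    // "Vertical" part: individual KV columns (token positions) every query row
    // in the block attends to, e.g. attention sinks and other high-score columns.
    std::vector<int> column_index = {0, 1, 100, 731};

    for (int start : block_offset) {
        std::printf("dense tile: KV range [%d, %d)\n", start, start + kBlockN);
    }
    for (int col : column_index) {
        // Non-contiguous columns are gathered one token at a time (cf. the
        // per-token copy loop quoted earlier in the review).
        std::printf("gathered column: KV position %d\n", col);
    }
    return 0;
}
```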