
[Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth #13245

Merged
12 commits merged into vllm-project:main from fast_vectorized_dma on Feb 21, 2025

Conversation

@lingfanyu (Contributor) commented on Feb 13, 2025

The previous version of the NKI flash attention kernel did not vectorize KV cache loading and therefore could not fully utilize HBM bandwidth. As a result, the kernel was bottlenecked on fetching the paged KV cache from HBM.

This PR applies vectorization to the KV cache load to fully saturate DMA bandwidth.
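
For intuition, here is a minimal, hedged sketch in plain PyTorch (not the NKI kernel itself) of the difference between fetching the paged KV cache one block at a time and gathering every block of a large tile in a single indexed load. All tensor names, shapes, and the tile size below are illustrative assumptions.

import torch

num_blocks, block_size, head_dim = 1024, 32, 128
kv_cache = torch.randn(num_blocks, block_size, head_dim)   # paged KV cache resident in HBM
block_table = torch.randint(0, num_blocks, (16,))           # block ids for one large tile

# Per-block loads: one small transfer per block, which underutilizes DMA bandwidth.
tile_loop = torch.cat([kv_cache[b] for b in block_table], dim=0)

# Vectorized load: a single gather over the whole tile, i.e. one large DMA-friendly transfer.
tile_vec = kv_cache[block_table].reshape(-1, head_dim)

assert torch.equal(tile_loop, tile_vec)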

@liangfu


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@JF-D (Contributor) left a comment:

LGTM!

)
return

if nisa.get_nc_version() == nisa.nc_version.gen3:


Are these kernels targeting trn2? DMA transpose could bring better performance on trn2 onward.

@lingfanyu (Contributor, Author):

Yes, it targets trn2. But here we simplify the code by removing the option to transpose v inside the kernel, so should_transpose_v is always False. We expect the kernel input layout of value to be (batch, num_kv_head, seqlen_q, D).
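
As a minimal illustration of that layout expectation, a hedged PyTorch sketch (the shapes and the caller-side transpose below are assumptions, not the PR's actual call sites):

import torch

batch, num_kv_head, seqlen_q, D = 2, 8, 128, 128
# Suppose the caller currently holds value as (batch, num_kv_head, D, seqlen_q).
value_bhdq = torch.randn(batch, num_kv_head, D, seqlen_q)
# Transpose once outside the kernel so should_transpose_v can stay False.
value = value_bhdq.transpose(2, 3).contiguous()
assert value.shape == (batch, num_kv_head, seqlen_q, D)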

@lingfanyu (Contributor, Author) commented:

Hi @simon-mo, could you please add me to the Buildkite org so that I can unblock Neuron tests? Thanks!

@liangfu (Contributor) left a comment:

Thanks for contributing!

To my understanding, this PR contains three main changes: 1/ mask reordering, 2/ vectorized KV cache loading, 3/ enabling loading of large block_tables. I think it's better to test these new capabilities individually instead of bundling them into integrated tests at a higher level. But feel free to chime in.

).bool()
attn_mask = torch.concat([prior_mask_padded, active_mask_padded], dim=1)

# reorder_mask_outside = True
Contributor:

clean up?

],
)
@torch.inference_mode()
def test_flash_paged_attention_numerical(
Contributor:

Are we intentionally trying to rename the test function?

(Asking because I was expecting it to be somewhat consistent with the GPU test cases, like kernels/test_prefix_prefill.py, test_batch_prefill_kernels.py, or test_page.py.)

@lingfanyu (Contributor, Author):

Reverted

block_size: int,
large_tile_size,
mixed_precision: bool,
reorder_mask_outside: bool,
Contributor:

it would be better if we could have a separate test function for this new capability (e.g. reorder_mask_outside).

@lingfanyu (Contributor, Author):

Will separate it out in next PR #13455

"constant",
0,
)
assert LARGE_TILE_SZ >= B_P_SIZE
Contributor:

assert with message?
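
For reference, a minimal sketch of the suggested assertion with a message; the message text and the values below are illustrative, not the PR's final wording:

B_P_SIZE = 128
LARGE_TILE_SZ = 2048
assert LARGE_TILE_SZ >= B_P_SIZE, (
    f"LARGE_TILE_SZ ({LARGE_TILE_SZ}) must be at least B_P_SIZE ({B_P_SIZE})")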

Comment on lines +53 to +58
def transform_block_tables_for_indirect_load(
    block_tables,
    block_size_tiling_factor,
    num_head,
    head_id,
):
Contributor:

it would be better if we could have a unit test for this.

@lingfanyu (Contributor, Author):

Unit test has been added in tests/neuron/test_block_table.py
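
For context, a self-contained sketch of the kind of check such a unit test can perform, using a simplified pure-PyTorch stand-in for the block-table transformation. The helper below and its semantics are illustrative assumptions, not the contents of tests/neuron/test_block_table.py.

import torch

def expand_block_table(block_tables: torch.Tensor, tiling_factor: int) -> torch.Tensor:
    # Split every KV block id into `tiling_factor` sub-block ids so that a single
    # indirect load can gather rows at a finer granularity (illustrative semantics only).
    sub = torch.arange(tiling_factor)
    expanded = block_tables[:, :, None] * tiling_factor + sub
    return expanded.reshape(block_tables.shape[0], -1)

def test_expand_block_table():
    bt = torch.tensor([[3, 7], [0, 5]])
    out = expand_block_table(bt, tiling_factor=2)
    expected = torch.tensor([[6, 7, 14, 15], [0, 1, 10, 11]])
    assert torch.equal(out, expected)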

B_P_SIZE=B_P_SIZE,
B_F_SIZE=B_F_SIZE,
B_D_SIZE=B_D_SIZE,
qk_res_buffer=None,
Contributor:

qk_res_buffer is already None by default. no?

@lingfanyu (Contributor, Author):

removed

Comment on lines 680 to 681
cur_k_tile[:, :] = nl.load(key[batch_id, head_id, :, :],
                           dtype=cur_k_tile.dtype)
Contributor:

loading while casting can be tricky. make it separate?

@lingfanyu (Contributor, Author):

Done
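
To make the suggestion concrete, a hedged sketch in plain PyTorch (not NKI) of separating the bulk transfer from the dtype cast instead of fusing the cast into the load; shapes and dtypes are illustrative.

import torch

key_hbm = torch.randn(128, 128, dtype=torch.bfloat16)  # stand-in for the key tensor in HBM

# Fused: the cast happens as part of the load itself.
fused = key_hbm.to(torch.float32)

# Separated: transfer in the source dtype first, then cast as an explicit second step.
staged = key_hbm.clone()
cur_k_tile = staged.to(torch.float32)

assert torch.equal(fused, cur_k_tile)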

context_kv_len = total_seq_len - total_query_len

B_P_SIZE = 128
# assuming LARGE_TILE_SIZE >= B_P_SIZE
Contributor:

add assertion?

Comment on lines 831 to 832
mask_reordered=True,
return_debug_tensors=False,
Contributor:

to be more consistent with

flash_attn_varlen_func(
    q=query[:num_actual_tokens],
    k=key_cache,
    v=value_cache,
    out=output[:num_actual_tokens],
    cu_seqlens_q=attn_metadata.query_start_loc,
    max_seqlen_q=attn_metadata.max_query_len,
    seqused_k=attn_metadata.seq_lens,
    max_seqlen_k=attn_metadata.max_seq_len,
    softmax_scale=self.scale,
    causal=True,
    alibi_slopes=self.alibi_slopes,
    window_size=self.sliding_window,
    block_table=attn_metadata.block_table,
    softcap=self.logits_soft_cap,
    fa_version=self.vllm_flash_attn_version,
)

remove both and set these internal arguments inside the function call?

@lingfanyu (Contributor, Author):

Removed
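
One way to read this suggestion, as a hedged sketch with illustrative names (not the PR's actual functions): pin the internal knobs inside a wrapper so the public signature stays closer to flash_attn_varlen_func.

def _paged_attention_kernel(query, key, value, block_table, *,
                            mask_reordered, return_debug_tensors):
    # Stand-in for the NKI kernel invocation; returns the knobs only for demonstration.
    return {"mask_reordered": mask_reordered,
            "return_debug_tensors": return_debug_tensors}

def flash_paged_attention(query, key, value, block_table):
    # Internal arguments are fixed here instead of being passed at every call site.
    return _paged_attention_kernel(query, key, value, block_table,
                                   mask_reordered=True,
                                   return_debug_tensors=False)

print(flash_paged_attention(None, None, None, None))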

Comment on lines 632 to 638
 cur_mask = nl.load(
     mask[
         nl.ds(i * B_P_SIZE, B_P_SIZE),
-        nl.ds(j * LARGE_TILE_SZ + m_i * B_F_SIZE, B_F_SIZE),
-    ])
+        nl.ds(large_k_tile_idx * LARGE_TILE_SZ, LARGE_TILE_SZ),
+    ],
+    dtype=mask.dtype,
+)
Contributor:

loading while casting can be tricky. consider making it separate?

@lingfanyu (Contributor, Author):

Done

@lingfanyu (Contributor, Author) commented:

@liangfu Thanks for the review. I updated following your suggestions.

> To my understanding, this PR contains three main changes: 1/ mask reordering, 2/ vectorized KV cache loading, 3/ enabling loading of large block_tables. I think it's better to test these new capabilities individually instead of bundling them into integrated tests at a higher level. But feel free to chime in.

Will do in PR #13455

@liangfu (Contributor) left a comment:

Thanks for the update.

@simon-mo merged commit 3317008 into vllm-project:main on Feb 21, 2025
19 checks passed
kerthcet pushed a commit to kerthcet/vllm that referenced this pull request Feb 21, 2025
JenZhao pushed a commit to JenZhao/vllm that referenced this pull request Feb 21, 2025
michaelrglass pushed a commit to michaelrglass/vllm that referenced this pull request Feb 21, 2025
…ximize DMA bandwidth (vllm-project#13245)

Signed-off-by: Lingfan Yu <[email protected]>
Signed-off-by: Michael Glass <[email protected]>
@lingfanyu deleted the fast_vectorized_dma branch on February 21, 2025 at 21:13
5 participants