perf: FlashAttention-3 style MLA PageAttention #887

Merged (14 commits) on Feb 23, 2025

Conversation

yzh119 (Collaborator) commented on Feb 23, 2025

This PR is a follow-up to #804, where we implemented a warp-specialization pattern (splitting on the head dimension) for faster MLA attention on Hopper GPUs. Compared to the previous version (which used the FA2-style template), this PR makes the following changes:

  1. Use one warpgroup as the producer and two warpgroups as consumers.
  2. Use asynchronous wgmma instead of mma.
  3. Use the software-pipelining algorithm from FlashAttention-3 to overlap CUDA Core and Tensor Core operations.
  4. Unlike standard attention, MLA uses the same set of K and V (the ckv matrix). If we kept CTA_TILE_KV=64 and PIPE_STAGES=2, the software pipeline would block the memory copy for the next KV tile because both pipeline slots would already be occupied; standard attention does not hit this issue because it has separate pipeline_k and pipeline_v, which doubles the number of stages. This PR changes to CTA_TILE_KV=32 and PIPE_STAGES=4 so that the current KV tile can be computed while the next one is being loaded. A simplified sketch of this multi-stage producer/consumer pipeline is given after this list.
  5. Unlike standard attention, we cannot reuse V's shared memory for O. This PR adds a circular buffer for o_smem that reuses the KV slots; since a single KV slot is not large enough for o_smem, we use two KV shared-memory slots per o_smem, and a barrier guarantees the required memory ordering.
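
For readers less familiar with the warp-specialization pattern, here is a minimal, self-contained sketch of the general producer/consumer multi-stage pipeline idea using `cuda::pipeline`. It is not flashinfer's actual MLA kernel: the tile size and stage count are stand-ins for CTA_TILE_KV=32 and PIPE_STAGES=4, a single warp plays the producer role instead of a full warpgroup, and the per-tile "compute" is a plain sum rather than wgmma.

```cuda
// Illustrative sketch only (not flashinfer code): one producer warp prefetches
// tiles into a 4-stage shared-memory ring with cuda::memcpy_async while the
// remaining warps consume them, so loads of tiles t+1..t+3 overlap compute on t.
#include <cuda/pipeline>
#include <cooperative_groups.h>
#include <cstdio>

constexpr int TILE = 32;       // stand-in for CTA_TILE_KV
constexpr int NUM_STAGES = 4;  // stand-in for PIPE_STAGES

__global__ void pipelined_sum(const float* __restrict__ in, float* __restrict__ out,
                              int num_tiles) {
  __shared__ float smem[NUM_STAGES][TILE];
  __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, NUM_STAGES> state;

  auto block = cooperative_groups::this_thread_block();
  const bool is_producer = (threadIdx.x / 32) == 0;  // warp 0 produces, warps 1-3 consume
  auto role = is_producer ? cuda::pipeline_role::producer : cuda::pipeline_role::consumer;
  auto pipe = cuda::make_pipeline(block, &state, role);

  float acc = 0.f;
  for (int t = 0; t < num_tiles; ++t) {
    const int slot = t % NUM_STAGES;  // ring-buffer slot for this tile
    if (is_producer) {
      pipe.producer_acquire();  // wait until consumers have released this slot
      for (int i = threadIdx.x; i < TILE; i += 32) {
        cuda::memcpy_async(&smem[slot][i], &in[t * TILE + i], sizeof(float), pipe);
      }
      pipe.producer_commit();
    } else {
      pipe.consumer_wait();  // wait until the tile has landed in shared memory
      for (int i = 0; i < TILE; ++i) acc += smem[slot][i];  // "compute" on the tile
      pipe.consumer_release();  // let the producer reuse this slot
    }
  }
  if (threadIdx.x == 32) out[blockIdx.x] = acc;  // one consumer thread writes the result
}

int main() {
  const int num_tiles = 8;
  float *in, *out;
  cudaMallocManaged(&in, num_tiles * TILE * sizeof(float));
  cudaMallocManaged(&out, sizeof(float));
  for (int i = 0; i < num_tiles * TILE; ++i) in[i] = 1.f;
  pipelined_sum<<<1, 128>>>(in, out, num_tiles);
  cudaDeviceSynchronize();
  printf("sum = %.0f (expected %d)\n", out[0], num_tiles * TILE);
  return 0;
}
```

With four stages and a 32-row tile, the producer can run up to three tiles ahead of the consumers, which is the property item 4 relies on; with only two stages and a single shared ckv pipeline, both slots would already be occupied and the copy of the next KV tile would stall.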

Pipeline

This figure illustrates our pipeline design:

[Figure: pipeline-design-mla]

Results

Benchmark results on H100 SXM5 (80 GB, 3352 GB/s):

This PR (FA3 template), page_size=1:

| batch_size | seq_len | num_heads | Memory bandwidth (GB/s) |
|---|---|---|---|
| 64 | 1024 | 64 | 1305.40 |
| 128 | 1024 | 64 | 2228.56 |
| 768 | 1024 | 64 | 2759.33 |
| 64 | 2048 | 64 | 1766.33 |
| 128 | 2048 | 64 | 2498.08 |
| 768 | 2048 | 64 | 2768.37 |

#804 + #863 (FA2 template), page_size=1:

| batch_size | seq_len | num_heads | Memory bandwidth (GB/s) |
|---|---|---|---|
| 64 | 1024 | 64 | 1067.74 |
| 128 | 1024 | 64 | 1761.25 |
| 768 | 1024 | 64 | 2065.78 |
| 64 | 2048 | 64 | 1384.35 |
| 128 | 2048 | 64 | 1892.64 |
| 768 | 2048 | 64 | 2075.97 |

The template is designed around Ampere-style LDGSTS instructions, which we prioritize for page_size=1 (though it also works for larger page sizes). Using TMA and multicast could further improve performance when page_size is greater than 1; we leave that for future work.
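
For context, here is a minimal standalone sketch of the Ampere-style LDGSTS (cp.async) copy referred to above; again, this is not flashinfer's actual code. Each thread issues its own asynchronous 16-byte global-to-shared copy from an arbitrary address, which is what makes this path a natural fit for the scattered loads of page_size=1, whereas TMA favors larger contiguous boxes. Requires sm_80 or newer.

```cuda
// Illustrative sketch of per-thread cp.async (LDGSTS): gather scattered
// 16-byte rows from global memory into shared memory asynchronously.
// Compile with -arch=sm_80 or newer.
#include <cstdio>
#include <cstdint>

__device__ __forceinline__ void ldgsts_16B(void* smem_dst, const void* gmem_src) {
  // Convert the generic shared-memory pointer to a 32-bit shared-space address.
  uint32_t smem_addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n" ::"r"(smem_addr),
               "l"(gmem_src));
}

__device__ __forceinline__ void ldgsts_commit_and_wait_all() {
  asm volatile("cp.async.commit_group;\n");
  asm volatile("cp.async.wait_group 0;\n");
}

__global__ void gather_rows(const float4* src, const int* row_idx, float4* dst) {
  __shared__ float4 smem[128];
  // Each thread gathers one 16-byte chunk from a scattered (page_size=1 style) location.
  ldgsts_16B(&smem[threadIdx.x], &src[row_idx[threadIdx.x]]);
  ldgsts_commit_and_wait_all();
  __syncthreads();
  dst[threadIdx.x] = smem[threadIdx.x];
}

int main() {
  float4 *src, *dst;
  int* idx;
  cudaMallocManaged(&src, 128 * sizeof(float4));
  cudaMallocManaged(&dst, 128 * sizeof(float4));
  cudaMallocManaged(&idx, 128 * sizeof(int));
  for (int i = 0; i < 128; ++i) {
    src[i] = make_float4(i, i, i, i);
    idx[i] = 127 - i;  // a scattered gather pattern
  }
  gather_rows<<<1, 128>>>(src, idx, dst);
  cudaDeviceSynchronize();
  printf("dst[0].x = %.0f (expected 127)\n", dst[0].x);
  return 0;
}
```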

yzh119 merged commit 2b24293 into main on Feb 23, 2025.

MasterJH5574 added a commit that referenced this pull request on Feb 23, 2025: "This PR fixes the header include, following changes in #887."
yzh119 pushed a commit that referenced this pull request on Feb 23, 2025: "This PR fixes the header include, following changes in #887."