# FlashAttention in mistral.rs

Mistral.rs supports FlashAttention V2 and V3 on CUDA devices (V3 is only supported on devices with compute capability (CC) >= 9.0).

Note: If mistral.rs is compiled with FlashAttention and PagedAttention is enabled, FlashAttention is used in tandem with PagedAttention to accelerate the prefill phase.

## Using FlashAttention V2/V3

To use FlashAttention V2/V3, compile with one of the following feature flags.

| FlashAttention | Feature flag |
|----------------|--------------|
| V2 (CC < 9.0)  | `--features flash-attn` |
| V3 (CC >= 9.0) | `--features flash-attn-v3` |
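
For example, a build with FlashAttention might look like the following sketch. The additional `cuda` feature shown here is an assumption about the usual CUDA build of mistral.rs; adjust the feature list to your setup.

```bash
# FlashAttention V2 (CUDA compute capability < 9.0)
cargo build --release --features "cuda flash-attn"

# FlashAttention V3 (CUDA compute capability >= 9.0)
cargo build --release --features "cuda flash-attn-v3"
```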

Note: The `flash-attn` and `flash-attn-v3` features are mutually exclusive.

Note: To use FlashAttention in the Python API, compile the package from source (see the sketch below).
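
A rough sketch of a from-source Python build with FlashAttention enabled is shown below. The package directory name (`mistralrs-pyo3`) and the use of `maturin` are assumptions; check the repository's Python installation instructions for the exact steps.

```bash
# Build and install the Python bindings from source with FlashAttention enabled.
# Directory name and tooling are assumptions, not guaranteed by this document.
pip install maturin
cd mistralrs-pyo3
maturin develop --release --features "cuda flash-attn"
```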