Mistral.rs supports FlashAttention V2 and V3 on CUDA devices (V3 requires compute capability (CC) >= 9.0).
Note: If compiled with FlashAttention and PagedAttention is enabled, the two are used in tandem, with FlashAttention accelerating the prefill phase.
To use FlashAttention V2/V3, compile with one of the following feature flags (see the example build commands after the table).
| FlashAttention | Feature flag |
|---|---|
| V2 (CC < 9.0) | `--features flash-attn` |
| V3 (CC >= 9.0) | `--features flash-attn-v3` |
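
As a rough sketch, a release build might look like the following; the accompanying `cuda` feature is an assumption based on the usual CUDA build instructions rather than something prescribed by this table.

```bash
# FlashAttention V2 build (GPUs with CC < 9.0); the "cuda" feature is assumed
cargo build --release --features "cuda flash-attn"

# FlashAttention V3 build (Hopper-class GPUs, CC >= 9.0)
cargo build --release --features "cuda flash-attn-v3"
```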
Note: The FlashAttention V2 and V3 feature flags are mutually exclusive.

Note: To use FlashAttention in the Python API, compile from source (see the sketch below).
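
For the Python API, a from-source build might look like the sketch below, assuming the Python bindings live in the `mistralrs-pyo3` crate and are built with maturin; adjust the path and feature list to match your setup.

```bash
# Hypothetical from-source build of the Python bindings with FlashAttention V2
# (the mistralrs-pyo3 path and "cuda" feature are assumptions)
cd mistralrs-pyo3
maturin develop --release --features "cuda flash-attn"
```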