
llama : add comments about experimental flags #7544

Merged 1 commit into master from gg/fattn-warn on May 27, 2024

Conversation

@ggerganov (Owner) commented on May 26, 2024

Certain combinations of [EXPERIMENTAL] llama_context_params are not always supported:

    struct llama_context_params {
        ...

        enum ggml_type type_k; // data type for K cache [EXPERIMENTAL]
        enum ggml_type type_v; // data type for V cache [EXPERIMENTAL]

        bool flash_attn;  // whether to use flash attention [EXPERIMENTAL]

        ...
    };
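
For context, here is a minimal sketch of how these fields are set through the public C API when creating a context (the model path and the particular K/V cache types chosen below are illustrative, not part of this PR):

    #include "llama.h"

    int main(void) {
        llama_backend_init();

        struct llama_model_params mparams = llama_model_default_params();
        struct llama_model * model = llama_load_model_from_file("model.gguf", mparams);

        struct llama_context_params cparams = llama_context_default_params();
        cparams.flash_attn = true;            // [EXPERIMENTAL]
        cparams.type_k     = GGML_TYPE_Q8_0;  // [EXPERIMENTAL] quantized K cache
        cparams.type_v     = GGML_TYPE_Q8_0;  // [EXPERIMENTAL] quantized V cache
                                              // (generally needs flash_attn enabled)

        struct llama_context * ctx = llama_new_context_with_model(model, cparams);
        if (ctx == NULL) {
            // unsupported combination of experimental flags, or other failure
            llama_free_model(model);
            llama_backend_free();
            return 1;
        }

        // ... use ctx ...

        llama_free(ctx);
        llama_free_model(model);
        llama_backend_free();
        return 0;
    }

Depending on the backend and build, an unsupported combination may be rejected at context creation or fall back to a supported setting with a warning, which is why these flags are marked experimental.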

Here is a list of known incompatibilities (we can try to update it in the future):

ggerganov merged commit eaf6e03 into master on May 27, 2024
67 checks passed
ggerganov deleted the gg/fattn-warn branch on May 27, 2024 at 06:24