Support for 256 head dim #67
Comments
Thanks for the kind words! I agree that code generation is a great use case! We haven't been able to get a speedup at headdim=256. To reduce memory reads/writes, we load blocks of Q, K, V from GPU memory into SRAM, and SRAM size is the main constraint (e.g. 163 KB per streaming multiprocessor on A100). As the head dimension gets large, we can't fit the block into SRAM without making the block size very small, which makes the whole computation slower. For now we support head dimensions up to 128.
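A rough back-of-the-envelope sketch of that constraint (illustrative only; the real kernels also keep the output tile, softmax statistics, and padding in shared memory, so the actual budget is tighter):

```python
# Rough SRAM estimate for one fp16 tile of Q, K, V (2 bytes per element).
# This only illustrates the scaling argument above, not the actual kernel layout.

SRAM_BYTES = 163 * 1024  # approx. usable shared memory per SM on A100
BYTES_PER_ELEM = 2       # fp16

def tile_bytes(block_q: int, block_kv: int, head_dim: int) -> int:
    q = block_q * head_dim          # Q block
    kv = 2 * block_kv * head_dim    # K and V blocks
    return (q + kv) * BYTES_PER_ELEM

for head_dim in (64, 128, 256):
    need = tile_bytes(128, 128, head_dim)
    print(f"head_dim={head_dim}: 128x128 tile needs {need / 1024:.0f} KB, "
          f"fits={need <= SRAM_BYTES}")
```

Under these simplified assumptions a 128x128 tile already exceeds the 163 KB budget at head dim 256, so the block sizes (and with them efficiency) have to shrink.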
Ahh, yeah, I see. Looks like I'll have to find another pretrained code model to use for long sequences 😔 Again, this code has been fantastic. My mind was blown when I saw how high I could set batch sizes without running out of memory.
The total SRAM per multiprocessor is 192 KB, but only 163 KB is usable by the programmer (the remainder is L1 cache, I believe).
Thanks for your reply~
In practice the block sizes are set to optimize for speed and ease of implementation. For example, the Triton version sets the block sizes to (128, 128) because that's what the Triton compiler supports (other shapes lead to wrong results or compiler errors). As another example, block sizes are always powers of 2; otherwise the implementation becomes much harder.
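To make the power-of-2 constraint concrete, here is a hypothetical helper (not part of the repo) that picks the largest power-of-2 K/V block whose fp16 Q/K/V tile fits in a given shared-memory budget:

```python
# Hypothetical sketch: shrink the K/V block size by powers of 2 until the
# fp16 Q/K/V tile fits in the shared-memory budget. Real kernels have more
# buffers and constraints; this only shows why larger head dims force smaller blocks.

SRAM_BYTES = 163 * 1024  # approx. usable shared memory per SM on A100

def pick_block_kv(block_q: int, head_dim: int, budget: int = SRAM_BYTES) -> int:
    block_kv = 256
    while block_kv > 16:
        tile = 2 * (block_q * head_dim + 2 * block_kv * head_dim)  # fp16 bytes
        if tile <= budget:
            break
        block_kv //= 2  # stay on powers of 2: simpler indexing and vectorization
    return block_kv

print(pick_block_kv(128, head_dim=128))  # -> 256 under this simplified model
print(pick_block_kv(128, head_dim=256))  # -> 64: larger head dim forces smaller blocks
```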
Out of curiosity, what does the performance trade-off curve look like for head dim > 128? I understand that for 256 the trade-off is not worth it, but would a slightly bigger head dim be worth supporting?
I have not implemented anything for headdim > 128 (it takes effort to implement the low-level shared-memory loads/stores for each head dimension to get the best efficiency).
Thanks! I don't have any info on this either. I assumed it wasn't supported because you had tried it and the results were mixed. We will explore it when we get a chance.
As of v2 we support all head dimensions up to 256. The backward pass for head dim > 192 requires A100/A800 or H100/H800.
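For reference, a minimal usage sketch against the v2 Python API (assuming flash-attn >= 2.0 is installed, fp16/bf16 CUDA tensors, and a supported GPU; per the comment above, the backward pass at head dim 256 needs A100/A800 or H100/H800):

```python
# Minimal sketch of calling FlashAttention v2 with head dim 256.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 2048, 16, 256
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)  # (batch, seqlen, nheads, headdim)
print(out.shape)
```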
Really love this repo, I've been using it to finetune CodeGen models with >2k context windows.
It's way faster than Hugging Face (3x) and slightly faster than Megatron for the 350M and 2.7B parameter CodeGen models, but it doesn't work for the 6.1B and 16B parameter models, as they have a head dimension of 256.
I would imagine CodeGen finetuning will be a solid use-case for flash attention since coding models can really benefit from long context windows. And CodeGen is basically SOTA for coding (competitive with Codex).
Is this something that is even possible with flash attention?
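For anyone checking which checkpoints are affected, the per-head dimension can be read off the Hugging Face configs (a sketch; the model IDs are assumed, and head dim is taken as hidden size divided by number of heads for these GPT-style models):

```python
# Sketch: compute per-head dimension for CodeGen checkpoints from their HF configs.
from transformers import AutoConfig

for name in ("Salesforce/codegen-350M-mono", "Salesforce/codegen-2B-mono",
             "Salesforce/codegen-6B-mono", "Salesforce/codegen-16B-mono"):
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.n_embd // cfg.n_head)  # the 6B and 16B variants come out at 256
```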