Enable offloading multi-query attention by Flash Attention #990
Conversation
Overall, LGTM. Thanks for the PR!
Two high-level comments:
- Would be nice to have at least one test case.
- Is this about `mlc-ai/relax` not having `tvm/unity` yet?

  > In particular, since the TVM submodule pulled by mlc is a custom one that doesn't support flash attention at all, it cannot use this new feature.
mod["prefill"] = rewrite_attention(mod["prefill"], use_flash_mqa=True) | ||
mod["decode"] = rewrite_attention(mod["decode"], use_flash_mqa=True) | ||
|
||
mod["prefill"] = rewrite_attention(mod["prefill"], use_flash_mqa=False) |
For `args.use_flash_attn_mqa == True`, do we need to run `rewrite_attention` twice?
Yes, this is for a case where there are both MQA and regular attention in the same model. I don't think it would come up in practice, but I added it for completeness.
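For reference, a minimal sketch of how the flag could gate the two passes, based on the diff lines quoted above. The wrapper function and the `args` object are assumptions for illustration, not code from this PR; `rewrite_attention` is the helper the PR modifies.

```python
def apply_attention_rewrites(mod, args):
    """Hypothetical wrapper sketching how the MQA offload could be gated."""
    if args.use_flash_attn_mqa:
        # Opt-in pass: pattern-match MQA-shaped attention and offload it to the
        # flash attention kernel (needs a TVM build with apache/tvm#15831).
        mod["prefill"] = rewrite_attention(mod["prefill"], use_flash_mqa=True)
        mod["decode"] = rewrite_attention(mod["decode"], use_flash_mqa=True)
    # The regular rewrite still runs, so a model that mixes MQA with ordinary
    # multi-head attention gets both attention patterns offloaded.
    mod["prefill"] = rewrite_attention(mod["prefill"], use_flash_mqa=False)
    mod["decode"] = rewrite_attention(mod["decode"], use_flash_mqa=False)
    return mod
```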
No, when I checked the commit history of that fork at that time, they had explicitly reverted all flash attention related PRs.
If this is the case, would it make sense to enable this MQA offload by default?
I don't want to require Flash Attention for mlc, since it is only needed for MQA, and flash attention can be problematic for packaging purposes etc. due to its insane compilation time. Moreover, flash attention doesn't seem to be faster than the cutlass fMHA for the LLM decoding workload, so unless the context length is very large, this optimization doesn't give a good speedup over the default explicit repeat + cutlass fMHA path. So for now this feature is experimental.
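To make the comparison above concrete, here is a small NumPy illustration of the shapes involved; the dimensions are made up for the example and none of this code comes from the PR.

```python
import numpy as np

# Hypothetical MQA shapes: many query heads sharing a few KV heads.
batch, seqlen, num_q_heads, num_kv_heads, head_dim = 1, 16, 32, 4, 64

q = np.random.randn(batch, seqlen, num_q_heads, head_dim).astype("float32")
k = np.random.randn(batch, seqlen, num_kv_heads, head_dim).astype("float32")
v = np.random.randn(batch, seqlen, num_kv_heads, head_dim).astype("float32")

# Default path described above: explicitly repeat the KV heads so that a regular
# fMHA kernel (the cutlass offload) sees matching head counts on q, k, and v.
factor = num_q_heads // num_kv_heads
k_repeated = np.repeat(k, factor, axis=2)  # shape (1, 16, 32, 64)
v_repeated = np.repeat(v, factor, axis=2)  # shape (1, 16, 32, 64)

# With use_flash_mqa=True, the rewrite instead targets a flash attention kernel
# that consumes k and v with the smaller head count directly, skipping the repeat.
```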
Following apache/tvm#15831, this is the mlc-side change to enable MQA offload.
To use the new option, `use_flash_attn_mqa`, `${TVM_HOME}` must point to a TVM build that includes that commit. In particular, since the TVM submodule pulled by mlc is a custom one that doesn't support flash attention at all, it cannot use this new feature. The option is set to `False` by default to avoid potential trouble.
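A hedged sketch of how the option might be exposed at build time; only the attribute name `use_flash_attn_mqa` and its `False` default come from this PR, while the command-line spelling and argparse wiring below are assumptions for illustration.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--use-flash-attn-mqa",
    action="store_true",  # stays False unless explicitly requested, matching the PR default
    help="Offload multi-query attention to flash attention. Requires ${TVM_HOME} "
    "to point at a TVM build that includes apache/tvm#15831.",
)

args = parser.parse_args([])           # no flag passed -> option stays disabled
assert args.use_flash_attn_mqa is False
```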