
Remove graph breaks for torch.compile() in flash_attention_forward when Llama model is padding-free tuned #33932

Merged (38 commits) on Oct 24, 2024

Conversation

@Abhishek-TAMU (Contributor) commented Oct 3, 2024

What does this PR do?

This PR removes the call to prepare_fa2_from_position_ids in flash_attention_forward, because it causes a graph break when the torch_compile flag is enabled in TrainingArguments used with SFTTrainer for padding-free tuning of a Llama model. The code in prepare_fa2_from_position_ids incurs an unavoidable CPU-GPU sync.
With this PR, cu_seq_lens_q, cu_seq_lens_k, max_length_k, and max_length_q are instead taken from the batch produced by DataCollatorForCompletionOnlyLM, which avoids the call to prepare_fa2_from_position_ids in flash_attention_forward.
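For illustration, a minimal sketch of the idea (not the actual collator code; key names follow the arguments discussed in this PR, everything else is a simplified assumption):

import torch

# Hypothetical example: two sequences of lengths 3 and 5 packed into a single
# padding-free row, the way a padding-free collator would produce it.
seq_lens = torch.tensor([3, 5], dtype=torch.int32)

# position_ids restart at 0 for every packed sequence.
position_ids = torch.cat([torch.arange(int(n)) for n in seq_lens]).unsqueeze(0)

# Cumulative sequence lengths and the max length are computed on the host by
# the collator, so flash_attention_forward no longer has to derive them from
# position_ids (which forces a CPU-GPU sync and breaks the compiled graph).
cu_seq_lens = torch.nn.functional.pad(seq_lens.cumsum(0, dtype=torch.int32), (1, 0))
max_length = int(seq_lens.max())

fa_kwargs = {
    "cu_seq_lens_q": cu_seq_lens,  # tensor([0, 3, 8], dtype=torch.int32)
    "cu_seq_lens_k": cu_seq_lens,
    "max_length_q": max_length,    # plain Python int, already on the host
    "max_length_k": max_length,
}
print(position_ids, fa_kwargs)

Because these values arrive as part of the batch, the FlashAttention path can consume them directly with no device-to-host transfer inside the compiled region.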

CC: @ani300 @ArthurZucker

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker (Collaborator) left a comment

A very nice PR and very much welcome!
Let's add general kwargs, #31446 has some commits with that

Comment on lines 982 to 985
cu_seq_lens_q=cu_seq_lens_q,
cu_seq_lens_k=cu_seq_lens_k,
max_length_q=max_length_q,
max_length_k=max_length_k,

Actually something we had planned 😅 cc @gante on generate unpadding the input!


@Cyrilvallez as well, if you want to have fun; IMO this can be quite impactful!

Comment on lines 1181 to 1184
cu_seq_lens_q: Optional[torch.LongTensor] = None,
cu_seq_lens_k: Optional[torch.LongTensor] = None,
max_length_q: int = 0,
max_length_k: int = 0,

These are FlashAttention-specific. IMO it would make sense to just add them as fa2_kwargs, for example. We can use something like this:

class TextKwargs(TypedDict, total=False):

@ArthurZucker (Collaborator)

This way we can potentially add more kwargs without changing the forward!
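For context, a hedged sketch of what such a kwargs TypedDict could look like (field names mirror the arguments above; the class and function names here are illustrative, not the exact code merged in this PR):

from typing import Optional, TypedDict

import torch
from typing_extensions import Unpack  # typing.Unpack only exists on Python >= 3.11


class FlashAttentionKwargs(TypedDict, total=False):
    """Keys a FlashAttention-2 varlen call understands; all optional."""

    cu_seq_lens_q: Optional[torch.LongTensor]
    cu_seq_lens_k: Optional[torch.LongTensor]
    max_length_q: int
    max_length_k: int


def attention_forward(hidden_states: torch.Tensor, **kwargs: Unpack[FlashAttentionKwargs]):
    # The TypedDict documents (and type-checks) which keys may be passed,
    # while the runtime signature stays a plain **kwargs, so new keys can be
    # added later without touching every forward signature.
    cu_seq_lens_q = kwargs.get("cu_seq_lens_q")
    return hidden_states, cu_seq_lens_q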

@Abhishek-TAMU (Contributor, Author)

Thanks for the review and for welcoming this PR.
The suggested changes have been made, @ArthurZucker.

@ArthurZucker (Collaborator) left a comment

Nice!

cu_seq_lens_k: Optional[torch.LongTensor] = None,
max_length_q: int = 0,
max_length_k: int = 0,
**kwargs,

Suggested change
**kwargs,
**kwargs: Unpack[Fa2Kwargs],

cu_seq_lens_k: Optional[torch.LongTensor] = None,
max_length_q: int = 0,
max_length_k: int = 0,
**fa2_kwargs: Fa2Kwargs,

Suggested change
**fa2_kwargs: Fa2Kwargs,
**fa2_kwargs: Unpack[Fa2Kwargs],

@ArthurZucker (Collaborator) left a comment

Thinking that we can call them flash_attn_kwargs instead, so the name does not depend on versioning!

@Abhishek-TAMU (Contributor, Author)

@ArthurZucker Made the necessary changes.
Feel free to suggest further changes if any are required. Thanks!

@Abhishek-TAMU (Contributor, Author)

@ArthurZucker Would you mind helping move the related TRL PR, which builds on this one, forward: huggingface/trl#2158?

@ArthurZucker (Collaborator)

Okay! Overall looks good.

  1. We need to protect the import of Unpack (see the sketch after this list).
  2. Let's add a small Python snippet to the documentation showing how to use this!
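For illustration, one common way to guard the import (a sketch assuming typing_extensions is available as a fallback; not necessarily the exact guard used in the merged code):

import sys

# typing.Unpack only exists from Python 3.11 on; older interpreters need the
# backport from typing_extensions.
if sys.version_info >= (3, 11):
    from typing import Unpack
else:
    from typing_extensions import Unpack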

@@ -18,7 +18,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from typing import List, Optional, Tuple, Union
from typing import List, Optional, Tuple, Union, Unpack

Suggested change
from typing import List, Optional, Tuple, Union, Unpack
from typing import List, Optional, Tuple, Union

Comment on lines 51 to 53
from ...processing_utils import (
FlashAttentionKwargs,
)

Suggested change
from ...processing_utils import (
FlashAttentionKwargs,
)
from ...processing_utils import (
FlashAttentionKwargs, Unpack
)

@ArthurZucker (Collaborator)

The CIs should mostly go green with this!
Then you will have to run make fix-copies, which will propagate the changes!

@ArthurZucker (Collaborator)

Once we merge I'll ping TRL team to make sure they don't miss it!

@ArthurZucker (Collaborator) left a comment

Okay! This LGTM!
You just need to run make fix-copies to make sure the CIs go green 🚀

@ArthurZucker (Collaborator)

If you have issues with make fix-copies, I can take it over if you want!

@Abhishek-TAMU (Contributor, Author)

Abhishek-TAMU commented Oct 22, 2024

Sure, that would be helpful, thank you! There seems to be some mismatch in src/transformers/models/glm/modeling_glm.py.

@ArthurZucker (Collaborator)

On it!

@ArthurZucker (Collaborator)

I am just waiting on #34283 to be merged!

@ArthurZucker (Collaborator)

The new helper will be this (with loss kwargs!): [screenshot of the new helper]
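For what it's worth, a hypothetical reconstruction of what a helper like this might look like (the class names and the num_items_in_batch field are assumptions, echoing the TypedDict sketch earlier in this thread):

from typing import TypedDict

from typing_extensions import Unpack  # backport; typing.Unpack needs Python >= 3.11


class FlashAttentionKwargs(TypedDict, total=False):
    # Abbreviated here; see the earlier sketch for the full field list.
    max_length_q: int
    max_length_k: int


class LossKwargs(TypedDict, total=False):
    # Assumed field: lets the loss be normalized over the true token count.
    num_items_in_batch: int


# One bundle so a causal-LM forward can accept every extra kwarg at once.
class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs): ...


def causal_lm_forward(input_ids, **kwargs: Unpack[KwargsForCausalLM]):
    # The pieces are routed onward: FlashAttention keys to the attention
    # layers, loss keys to the loss function.
    ...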

@Abhishek-TAMU (Contributor, Author)

Thanks @ArthurZucker for the code change to accommodate LossKwargs.

@ArthurZucker (Collaborator)

Thanks @Abhishek-TAMU for your contribution! 🚀

@ArthurZucker merged commit 65753d6 into huggingface:main on Oct 24, 2024
21 of 25 checks passed
@Abhishek-TAMU (Contributor, Author)

Abhishek-TAMU commented Oct 31, 2024

Once we merge I'll ping TRL team to make sure they don't miss it

Hi @ArthurZucker, do you mind facilitating this? PR: huggingface/trl#2158

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
…en Lllama Model is padding free tuned (huggingface#33932)

* fix: fixes for graph breaks

Signed-off-by: Abhishek <[email protected]>

* fix: formatting

Signed-off-by: Abhishek <[email protected]>

* fix: import error

Signed-off-by: Abhishek <[email protected]>

* fix: Add Fa2Kwargs

Signed-off-by: Abhishek <[email protected]>

* fix: PR Changes

Signed-off-by: Abhishek <[email protected]>

* PR changes

Signed-off-by: Abhishek <[email protected]>

* PR changes

Signed-off-by: Abhishek <[email protected]>

* PR changes

Signed-off-by: Abhishek <[email protected]>

* PR changes

Signed-off-by: Abhishek <[email protected]>

* Revert "PR changes"

This reverts commit 39d2868.

* PR changes

Signed-off-by: Abhishek <[email protected]>

* fix: FlashAttentionKwarg

Signed-off-by: Abhishek <[email protected]>

* fix: FlashAttentionKwarg

Signed-off-by: Abhishek <[email protected]>

* PR Changes

Signed-off-by: Abhishek <[email protected]>

* PR Changes

Signed-off-by: Abhishek <[email protected]>

* PR Changes

Signed-off-by: Abhishek <[email protected]>

* PR Changes

Signed-off-by: Abhishek <[email protected]>

* PR Changes

Signed-off-by: Abhishek <[email protected]>

* addition of documentation

Signed-off-by: Abhishek <[email protected]>

* change in _flash_attention_forward

Signed-off-by: Abhishek <[email protected]>

* make fix-copies

Signed-off-by: Abhishek <[email protected]>

* revert make fix-copies

Signed-off-by: Abhishek <[email protected]>

* fix copies

* style

* loss kwargs typing

* style and pull latest changes

---------

Signed-off-by: Abhishek <[email protected]>
Co-authored-by: Arthur Zucker <[email protected]>
@ma787639046 (Contributor)

Hi @Abhishek-TAMU @ArthurZucker, very nice PR for adding FlashAttentionKwargs.
In modeling_llama.py#L959, I noticed that **flash_attn_kwargs is added to the inputs of decoder_layer only when gradient checkpointing is not used. Could you please also pass flash_attn_kwargs when gradient checkpointing is used, in the if branch above at line 938? If the checkpointing function does not accept kwargs, could we make all FlashAttentionKwargs fields optional inputs of LlamaDecoderLayer.forward?

@ArthurZucker (Collaborator)

kwargs are optional by default!
Checkpoint indeed does not support kwargs, at least the way we have formulated it. This will be fixed by #34987.
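To make the point concrete, a minimal hedged sketch with a dummy layer (hypothetical names, not the actual modeling_llama.py code): in recent PyTorch, the non-reentrant torch.utils.checkpoint variant forwards keyword arguments, while the reentrant variant does not accept extra keyword arguments, which is roughly why the checkpointed branch cannot simply splat **flash_attn_kwargs as currently formulated.

import torch
from torch.utils.checkpoint import checkpoint


def dummy_decoder_layer(hidden_states, **flash_attn_kwargs):
    # Stand-in for a decoder layer forward; flash_attn_kwargs would carry
    # cu_seq_lens_q / max_length_q and friends in the real model.
    scale = float(flash_attn_kwargs.get("max_length_q", 1))
    return hidden_states * scale


hidden_states = torch.randn(1, 8, 16, requires_grad=True)
flash_attn_kwargs = {"max_length_q": 8, "max_length_k": 8}

# Plain call (the non-checkpointed branch): kwargs are forwarded directly.
out = dummy_decoder_layer(hidden_states, **flash_attn_kwargs)

# Checkpointed call: the non-reentrant implementation accepts and forwards
# keyword arguments; the reentrant one (use_reentrant=True) rejects them.
out_ckpt = checkpoint(
    dummy_decoder_layer, hidden_states, use_reentrant=False, **flash_attn_kwargs
)
out_ckpt.sum().backward()
print(torch.allclose(out, out_ckpt), hidden_states.grad.shape)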
