
[PyTorch] Miscellaneous fixes for FP8 DPA module #804

Merged 23 commits into NVIDIA:main on May 2, 2024

Conversation

@cyanguwa (Collaborator) commented Apr 24, 2024

This PR

  • fixes cuDNN version extraction in the unit tests, accounting for the change in version encoding between cuDNN pre-9.0 and 9.0+ (see the sketch after this list), and
  • improves compatibility with older checkpoints (pre-TE 1.6). Since TE 1.6, FusedAttention has subclassed TEBaseModule, which adds an _extra_state entry to the module's state_dict. _extra_state holds FP8 metadata, but because of the subclassing it is added to the state_dict regardless of whether the module is trained in FP8 or in FP16. This PR lets users load older checkpoints (which have no _extra_state for FusedAttention) while still saving and loading new checkpoints as usual (which do contain _extra_state for FusedAttention).
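
As an illustration of the first bullet, here is a minimal sketch (not the PR's actual helper) of decoding the integer returned by `torch.backends.cudnn.version()` under the two encodings; the 90000 threshold is an assumption based on the five-digit encoding introduced in cuDNN 9.0.

```python
# Hedged sketch: decode the integer version reported by PyTorch's cuDNN binding.
# cuDNN changed its version encoding in 9.0:
#   pre-9.0: major * 1000  + minor * 100 + patch   (e.g. 8902  -> 8.9.2)
#   9.0+   : major * 10000 + minor * 100 + patch   (e.g. 90100 -> 9.1.0)
import torch


def cudnn_version_tuple():
    """Return the cuDNN version as a (major, minor, patch) tuple."""
    v = torch.backends.cudnn.version()
    if v is None:  # cuDNN not available
        return (0, 0, 0)
    if v >= 90000:  # assumption: five-digit encoding implies cuDNN >= 9.0
        major, rest = divmod(v, 10000)
    else:
        major, rest = divmod(v, 1000)
    minor, patch = divmod(rest, 100)
    return (major, minor, patch)


print(cudnn_version_tuple())  # e.g. (8, 9, 2) or (9, 1, 0)
```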

@ksivaman (Member) commented:

I looked at this further after our sync; it looks like tp_size/tp_group aren't used at all at the DPA/FusedAttention level. Can we simply remove or deprecate them? @cyanguwa

@ksivaman (Member) left a comment

Added comment

@cyanguwa (Collaborator, Author) commented Apr 24, 2024:

> I looked at this further after our sync; it looks like tp_size/tp_group aren't used at all at the DPA/FusedAttention level. Can we simply remove or deprecate them? @cyanguwa

We do use tp_group_initialized in prepare_forward. Also, if we don't keep track of tp_size/tp_group, how do we manage amax reduction for TP groups, or does fp8.py already handle that? @ksivaman

cyanguwa requested a review from ksivaman on April 24, 2024 at 20:13
@cyanguwa (Collaborator, Author) commented:

/te-ci pytorch

@ksivaman (Member) commented:

With #575, the amax reduction is handled in the reduce_and_update_fp8_tensors function using the fp8_group passed into the autocast. So we don't store the tensor parallel group to handle it separately.
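
A hedged usage sketch of what this means on the application side: the amax reduction group is supplied to the autocast rather than stored in the attention module. The model, input, and group construction below are placeholders; only `fp8_autocast` with its `fp8_recipe`/`fp8_group` arguments and `DelayedScaling` are assumed TE APIs.

```python
# Hedged sketch: pass the amax-reduction group to fp8_autocast instead of
# storing tp_size/tp_group inside DPA/FusedAttention.
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

# Example: reduce amaxes over all ranks (could also be a DP x TP subgroup).
amax_group = dist.new_group(ranks=list(range(dist.get_world_size())))

with te.fp8_autocast(enabled=True, fp8_recipe=DelayedScaling(), fp8_group=amax_group):
    out = model(inp)  # amax/scale reduction uses amax_group, not a module-level tp_group
```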

@mikolajblaz commented:

@cyanguwa Regarding checkpoint compatibility: not requiring _extra_state in the state dict is good, although I believe it doesn't solve the problem on the application side when switching from one attention implementation to another. Would such interoperability of attention layers be possible? It would require _extra_state to live at the same level as the default attention implementation.

@cyanguwa (Collaborator, Author) commented:

@mikolajblaz I've moved core_attention.fused_attention._extra_state to core_attention._extra_state in b94a1ee. Let me know if you have any thoughts/comments. Thanks.

cyanguwa requested a review from ptrendx on April 30, 2024 at 21:45
@cyanguwa (Collaborator, Author) commented:

@ksivaman could you please help take another look?

@cyanguwa (Collaborator, Author) commented:

/te-ci pytorch

@cyanguwa (Collaborator, Author) commented:

I had some discussion with mikolajblaz offline and we decided not to pursue the move from core_attention.fused_attention._extra_state to core_attention._extra_state. The possible solutions all look unclean and may not even guarantee that checkpoints load correctly. PyTorch relies heavily on the module structure, and FusedAttention is currently a submodule of DotProductAttention, so it is very hard to manipulate the state_dict()/load_state_dict() calls to get around this hierarchy. We will consider this another time.

https://github.com/pytorch/pytorch/blob/74b7c56517f97c5d813620da9a479417a564e8b4/torch/nn/modules/module.py#L2164
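
For readers following the checkpoint-compatibility discussion, here is a minimal, hedged sketch of the general mechanism such a fix can rely on (not the PR's exact code): a load_state_dict post-hook that drops missing `_extra_state` keys so pre-TE-1.6 checkpoints still load under strict loading. `register_load_state_dict_post_hook` is a standard PyTorch API; the module path in the usage comment is hypothetical.

```python
# Hedged sketch: ignore missing "_extra_state" entries when loading old checkpoints.
def _ignore_missing_extra_state(module, incompatible_keys):
    # Mutating these lists in place changes what load_state_dict() reports/raises,
    # so strict loading of checkpoints without FP8 metadata no longer fails.
    for key in list(incompatible_keys.missing_keys):
        if key.endswith("_extra_state"):
            incompatible_keys.missing_keys.remove(key)


# Hypothetical usage: register on the submodule that owns the FP8 metadata, since
# PyTorch resolves state_dict keys through the module hierarchy (see the link above).
# core_attention.fused_attention.register_load_state_dict_post_hook(_ignore_missing_extra_state)
```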

@cyanguwa (Collaborator, Author) commented May 1, 2024:

/te-ci pytorch

@ptrendx (Member) left a comment

LGTM

@cyanguwa (Collaborator, Author) commented May 1, 2024:

/te-ci pytorch

@ksivaman (Member) left a comment

LGTM

ksivaman merged commit 6459fd8 into NVIDIA:main on May 2, 2024
18 of 20 checks passed
ksivaman pushed a commit that referenced this pull request May 2, 2024

* initialize tp_group for FP8 DPA
* fix cuDNN version in unit tests for cuDNN v9
* add hook to ignore missing fused_attn._extra_states if training from old checkpoints
* remove test and redundant implementation from last commit
* remove warning message and replace with docstring
* remove tp_size/tp_group in FusedAttention; amax reduction is handled with fp8_group
* move core_attention.fused_attention._extra_state to core_attention._extra_state
* simplify post_state_dict_hooks between FU and DPA
* add temporary test
* remove previous attempts to move core_attention.fused_attention to core_attention; keep the test
* remove the test
* disable the pylint self-argument check for the hook, which requires self

---------

Signed-off-by: Charlene Yang <[email protected]>
Signed-off-by: cyanguwa <[email protected]>
ksivaman added the 1.6.0 label May 2, 2024
pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request May 16, 2024 (same commit message as above, with an additional sign-off from Pawel Gadzinski <[email protected]>)
pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request May 23, 2024 (same commit message as above, with an additional sign-off from Pawel Gadzinski <[email protected]>)