Add head_mask and decoder_head_mask to FSMT #9819
Conversation
Thank you for syncing fsmt with others, @stancld - including the docs and better markdown!!!
Is there a way to ensure in the future that if things are added to Bart they are added to fsmt too? On modeling level - they are very similar.
I know that one can add, for example, a line like this before the code that should stay in sync.
Great, thanks for taking care of it @stancld!
tests/test_modeling_fsmt.py (outdated)
- test_head_masking = False
+ test_head_masking = True
I think this can be removed (as from all the other places where it is set to True), since that is the default on the superclass.
Imo it makes it more readable to only list items which are not tested (and which ultimately should be); this way, with a quick look at the test file, we can understand what is left to do to get full common test coverage. Let me know what you think.
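For readers unfamiliar with the flag being discussed, here is a minimal sketch of the pattern: a boolean attribute on the shared test mixin gates a common test, so model test classes only need to set it when they opt out. The class and method names below follow the naming used in this thread but are otherwise illustrative assumptions, not the exact upstream code.

```python
import unittest


class ModelTesterMixin:
    # Default: the shared head-masking test runs for every model test class.
    test_head_masking = True

    def test_headmasking(self):
        if not self.test_head_masking:
            self.skipTest("head masking is not supported by this model")
        # ... the shared head-masking checks would run here ...


class FSMTModelTest(ModelTesterMixin, unittest.TestCase):
    # No need to repeat `test_head_masking = True`; it is inherited from the
    # mixin, which is why the explicit assignment in the diff above is redundant.
    pass
```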
Thanks, that's a good point. I'll create a PR removing all redundant test_head_masking = True from other source code files later today.
(Also, I did rebase as jaxlib was updated in #9831. Now it should hopefully be fine without any errors :))
@@ -170,6 +170,7 @@
 GPT2TokenizerFast = None
 HerbertTokenizerFast = None
 LayoutLMTokenizerFast = None
+LEDTokenizerFast = None
This was actually corrected in another commit this morning, so there might be merge conflicts, but they should be easy to resolve.
Remove the test_head_masking flag from test_modeling_fsmt.py, since test_head_masking is True by default (so setting it explicitly is redundant).
* Rebase necessary due to an update of jaxlib
* Remove test_head_masking=True in tests/test_modeling_fsmt.py as it is redundant
…stancld/transformers into fsmt_encoder_decoder_head_masks
@LysandreJik, @patrickvonplaten - how can we make sure FSMT gets tracked and synced with all the Bart-family changes? While the tokenizer is different, the model is ~95% identical.
As @stancld said, we can do that with some statements of the following kind: the difference between the BART and FSMT implementations of the targeted object must only be the "BART" occurrences that change to "FSMT". @sgugger can tell you more about it.
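For context, the kind of statement being discussed is a marker comment placed above a copied class or function, which a repository consistency check can then enforce. A minimal sketch is below; the source path and class are assumptions for illustration, not taken from this PR.

```python
from torch import nn


# Copied from transformers.models.bart.modeling_bart.BartEncoderLayer with Bart->FSMT
class FSMTEncoderLayer(nn.Module):
    # A consistency script can verify that this class stays identical to its
    # BART counterpart, up to the Bart->FSMT renaming, so changes to BART are
    # not silently missed here.
    ...
```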
Thank you, @LysandreJik. I think this is really a question for @patrickvonplaten, who I remember was planning to refactor FSMT to match what he did for Bart. So if this is still planned, Patrick, perhaps you could add this item to the agenda: keeping FSMT in sync with the Bart family (modeling only - the tokenizer is similar to XLM). The currently proposed solution can't be used as-is, since Bart has diverged since FSMT forked from it. It might help to treat FSMT as Bart with the main differences being a dual vocab, no tied weights, and a few different layers - but otherwise identical (again, for the model only).
Yes, the FSMT / ProphetNet refactor is still on my ToDo list (I think next week is reasonable). After the refactor I'll try to add as many # Copied from statements as possible to keep the models in sync. Nevertheless, this PR can be merged as it is now! Great work @stancld
This PR implements head_mask and decoder_head_mask for FSMT and is a follow-up to the open issue #9814.

Motivation: This PR is part of an endeavour to enable the usage of head_mask and decoder_head_mask for all encoder-decoder transformers, following the recent work on BART-like models (#9569).

Fixes: #9814
Reviewer: @stas00
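As a usage illustration of what this PR enables, here is a hedged sketch of passing head_mask and decoder_head_mask to an FSMT model. The checkpoint name, the (num_layers, num_heads) mask shapes, and the config attribute names follow the BART-family convention and are assumptions for illustration rather than details taken from this PR.

```python
import torch
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

model_name = "facebook/wmt19-en-de"  # assumed checkpoint, for illustration only
tokenizer = FSMTTokenizer.from_pretrained(model_name)
model = FSMTForConditionalGeneration.from_pretrained(model_name)

inputs = tokenizer("Machine learning is great!", return_tensors="pt")

# 1.0 keeps a head, 0.0 masks it; one row per layer, one column per head.
head_mask = torch.ones(model.config.encoder_layers, model.config.encoder_attention_heads)
head_mask[0, 0] = 0.0  # e.g. mask the first head of the first encoder layer
decoder_head_mask = torch.ones(model.config.decoder_layers, model.config.decoder_attention_heads)

outputs = model(
    **inputs,
    decoder_input_ids=inputs["input_ids"],  # dummy decoder input for this sketch
    head_mask=head_mask,
    decoder_head_mask=decoder_head_mask,
)
```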