
Add head_mask and decoder_head_mask to FSMT #9819

Merged

Conversation

@stancld (Contributor) commented Jan 26, 2021

This PR implements head_mask and decoder_head_mask for FSMT; it is a follow-up to the open issue #9814.

Motivation: This PR is part of an effort to enable the use of head_mask and decoder_head_mask for all encoder-decoder transformers, following the recent work on BART-like models (#9569).
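(For context, a minimal sketch of how the new arguments are meant to be used once this is merged. It follows the head-mask convention of the BART-like models referenced above; the checkpoint name, the dummy decoder inputs, and the exact shapes are illustrative assumptions, not part of this PR.)

```python
import torch
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

# Illustrative checkpoint; any FSMT checkpoint should behave the same way.
model = FSMTForConditionalGeneration.from_pretrained("facebook/wmt19-en-de")
tokenizer = FSMTTokenizer.from_pretrained("facebook/wmt19-en-de")

inputs = tokenizer("Machine learning is great, isn't it?", return_tensors="pt")

# One row per layer, one column per attention head; 1.0 keeps a head, 0.0 masks it
# (shape convention assumed to match the BART implementation this PR follows).
head_mask = torch.ones(model.config.encoder_layers, model.config.encoder_attention_heads)
decoder_head_mask = torch.ones(model.config.decoder_layers, model.config.decoder_attention_heads)
head_mask[0, 0] = 0.0  # e.g. disable the first head of the first encoder layer

# Dummy target-side tokens, just enough to run a forward pass in this sketch.
decoder_input_ids = torch.full((1, 4), model.config.pad_token_id, dtype=torch.long)

outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    decoder_input_ids=decoder_input_ids,
    head_mask=head_mask,
    decoder_head_mask=decoder_head_mask,
)
```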


Fixes: #9814

Reviewer: @stas00

@stas00 (Contributor) left a comment:

Thank you for syncing fsmt with others, @stancld - including the docs and better markdown!!!

Is there a way to ensure in the future that if things are added to Bart they are also added to fsmt? On the modeling level, they are very similar.

@stancld (Contributor, Author) commented Jan 27, 2021

I know that one can add, for example, a line like this:

# Copied from transformers.models.bart.modeling_bart.BartAttention with Bart->FSMT

before the Attention module in FSMT. However, this does not copy only the additions but the whole module from BART, which is undesirable in this case, I think, as these modules differ a bit. But maybe there is another way I am not aware of.
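(For illustration, a rough sketch of how such a marker sits above a definition; the class name FSMTAttention and the empty body are placeholders here, precisely because the real FSMT attention module currently differs from BART's and so cannot carry this marker as-is.)

```python
from torch import nn


# Copied from transformers.models.bart.modeling_bart.BartAttention with Bart->FSMT
class FSMTAttention(nn.Module):
    # Everything below the marker must stay identical to BartAttention, apart from
    # the Bart->FSMT renames; the repo's utils/check_copies.py script enforces this.
    ...
```

With such a marker in place, the consistency check fails whenever the copied body drifts from its BART source, which is what would keep the two files in sync automatically.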

@LysandreJik (Member) left a comment:

Great, thanks for taking care of it @stancld!

Comment on lines 129 to 137:
- test_head_masking = False
+ test_head_masking = True

A Member commented:

I think this can be removed (as from all the other places where it is set to True), since this is the default in the superclass.

IMO it is more readable to list only the items that are not tested (and which ultimately should be); that way, a quick glance at the test file shows what is left to do to reach full common-test coverage. Let me know what you think.

@stancld (Contributor, Author) replied:

Thanks, that's a good point. I'll create a PR removing all redundant test_head_masking = True from other source code files later today.

(Also, I rebased since jaxlib was updated in #9831; hopefully everything passes without errors now :))
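(For illustration, a rough sketch of the resulting convention in the test file; the class name and the surrounding attributes are abbreviated and partly assumed here, and only the treatment of test_head_masking reflects the discussion above.)

```python
import unittest

from transformers import is_torch_available
from .test_modeling_common import ModelTesterMixin

if is_torch_available():
    from transformers import FSMTForConditionalGeneration, FSMTModel


class FSMTModelTest(ModelTesterMixin, unittest.TestCase):
    all_model_classes = (FSMTModel, FSMTForConditionalGeneration) if is_torch_available() else ()

    # test_head_masking is inherited as True from ModelTesterMixin, so it is not repeated
    # here; only common tests that the model does not (yet) support are flagged explicitly,
    # which makes the remaining gaps easy to spot at a glance.
    test_pruning = False  # illustrative example of an explicitly disabled common test
```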

@@ -170,6 +170,7 @@
GPT2TokenizerFast = None
HerbertTokenizerFast = None
LayoutLMTokenizerFast = None
LEDTokenizerFast = None
A Contributor commented:

This was actually corrected recently in another commit this morning, so there might be merge conflicts, but they should be easy to resolve.

stancld and others added 4 commits on January 27, 2021:

* Remove test_head_masking flag from test_modeling_fsmt.py, since test_head_masking is set to True by default (thus redundant to store)
* Rebase necessary due to an update of jaxlib
* Remove test_head_masking=True in tests/test_modeling_fsmt.py, as it is redundant
@stas00 (Contributor) commented Jan 27, 2021

@LysandreJik, @patrickvonplaten - how can we make sure fsmt gets tracked and synced with all the Bart-family changes? While the tokenizer is different, the model is ~95% identical.

@LysandreJik (Member) replied:

As @stancld said, we can do that with statements of the following kind:

# Copied from transformers.models.bart.modeling_bart.BartAttention with Bart->FSMT

The only difference between the BART and FSMT implementations of the targeted object must be the "Bart" occurrences that are renamed to "FSMT". @sgugger can tell you more about it.

@stas00 (Contributor) commented Jan 29, 2021

Thank you, @LysandreJik

I think this is really a question for @patrickvonplaten - who I remember was planning to refactor FSMT to match what he did for Bart. So if this is still planned, Patrick, perhaps you could add this item to the agenda: keeping FSMT in sync with the Bart family (modeling only - the tokenizer is similar to XLM's).

So the currently proposed solution can't be used, since Bart has diverged since FSMT forked from it.

It might help to treat FSMT as Bart whose main differences are a dual vocab, no tied weights, and a few layers that differ, but which is otherwise identical (again, for the model only).

@patrickvonplaten (Contributor) replied:

> I think this is really a question for @patrickvonplaten - who I remember was planning to refactor FSMT to match what he did for Bart. So if this is still planned, Patrick, perhaps you could add this item to the agenda: keeping FSMT in sync with the Bart family (modeling only - the tokenizer is similar to XLM's).

Yes, the FSMT / ProphetNet refactor is still on my to-do list (I think next week is reasonable). After the refactor I'll try to add as many # Copied from statements as possible to keep the models in sync. Nevertheless, this PR can be merged as it is now!

Great work @stancld

@patrickvonplaten patrickvonplaten merged commit 0c6c0af into huggingface:master Feb 1, 2021
@stancld stancld deleted the fsmt_encoder_decoder_head_masks branch February 2, 2021 20:39
Successfully merging this pull request closes: Missing head_mask and decoder_head_mask arguments in encoder-decoder models (#9814)