
Add PreLN to fsmt module #15747

Closed · wants to merge 2 commits

Conversation

@jinmang2 commented Feb 21, 2022

What does this PR do?

Add a pre-layer-normalization (PreLN) option to the FSMT module to match fairseq transformers.
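
For context, the difference comes down to where LayerNorm sits relative to the residual connection in each sub-layer. The sketch below is purely illustrative (it is not the actual FSMT code; it omits the feed-forward sub-layer, dropout, and the final encoder LayerNorm that fairseq adds in PreLN mode):

```python
import torch
import torch.nn as nn


class EncoderLayerSketch(nn.Module):
    """Illustrative only: PostLN (normalize_before=False, current FSMT behaviour)
    vs. PreLN (normalize_before=True, fairseq's encoder_normalize_before)."""

    def __init__(self, d_model: int, nhead: int, normalize_before: bool = False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.attn_layer_norm = nn.LayerNorm(d_model)
        self.normalize_before = normalize_before

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        if self.normalize_before:      # PreLN: normalize the sub-layer input
            x = self.attn_layer_norm(x)
        x, _ = self.self_attn(x, x, x)
        x = residual + x
        if not self.normalize_before:  # PostLN: normalize after the residual add
            x = self.attn_layer_norm(x)
        return x
```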

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@stas00

@HuggingFaceDocBuilder commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@stas00 (Contributor) left a comment

Thank you for working on this, @jinmang2

Overall looks good. I left a few suggestions.

Would it be better to name the new option encoder_pre_layernorm, to better match the known concept? normalize_before is sort of asking for a noun after it. Same for the decoder.
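
Purely to illustrate the naming question (both spellings are hypothetical here; neither is part of the released FSMTConfig):

```python
from transformers import FSMTConfig

# What this PR currently mirrors from fairseq:
config_pr = FSMTConfig(encoder_normalize_before=True, decoder_normalize_before=True)

# The suggested, more explicit spelling:
config_alt = FSMTConfig(encoder_pre_layernorm=True, decoder_pre_layernorm=True)
```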

The next step is to add a functional test, by building upon one of the existing fsmt tests.

Also, if there are pre-trained models that use PreLN which we could port, it'd be great for doing a quality test, which would be tricky with dummy random data.

@stas00 (Contributor) commented Feb 22, 2022

cc: @patil-suraj, in case you'd like to have another pair of eyes on this PR, and also FYI, as this will now diverge a bit from your unmerged fsmt PR from a long time ago. Not sure what you would want to do about it; it has kind of fallen between the cracks due to the performance regression.

@sgugger (Collaborator) commented Feb 23, 2022

This is typically the type of change we usually don't accept inside one model architecture, as it doesn't pass the test for new models: can an existing checkpoint for FSMT be used with this new config argument and give sensible results? -> No

I think this should warrant a new model.

@patil-suraj (Contributor) commented

This is how BART was initially designed, and it was refactored to align with the general philosophy of the library to have self-contained model files, cf. #9343. So this should be a new model.

Also, there are other fairseq models in the library that use PreLN, e.g. mBART and M2M100. Could you maybe check whether you can load your checkpoints with these models?

@stas00 (Contributor) commented Feb 23, 2022

This is typically the type of change we usually don't accept inside one model architecture, as it doesn't pass the test for new models: can an existing checkpoint for FSMT be used with this new config argument and give sensible results? -> No

Not arguing whether it should be a new model or not, but unless I'm missing something, why is the answer to:

can an existing checkpoint for FSMT be used with this new config argument and give sensible results?

No? I don't see any reason why it wouldn't work and give identical results with this PR. It won't have any of the new model config entries and will work as intended.

@jinmang2 (Author) commented Feb 23, 2022

First of all, thank you for leaving comments on my PR!

I wanted to port PORORO's translation and WSD (word sense disambiguation) models to transformers. When I tested it myself with this PR applied, I got the same results as fairseq's inference output.

I will check the items mentioned above and report back:

  • mBART
  • Test results

@sgugger (Collaborator) commented Feb 23, 2022

Not arguing whether it should be a new model or not, but unless I'm missing something, why is the answer to "can an existing checkpoint for FSMT be used with this new config argument and give sensible results?" No? I don't see any reason why it wouldn't work and give identical results with this PR. It won't have any of the new model config entries and will work as intended.

When using an existing pretrained checkpoint with the config option encoder_normalize_before=True or decoder_normalize_before=True, the pretrained model will load properly (as all weights match) but a user won't get decent results with the model.

We could write a perfectly modular toolbox that works with the current model implementation without breaking backward compatibility and gives identical results, just by adding many new options to the configuration. That does not mean it conforms to the philosophy or passes the test of "this new option can be used with existing checkpoints".

@stas00 (Contributor) commented Feb 23, 2022

When using an existing pretrained checkpoint with the config option encoder_normalize_before=True or decoder_normalize_before=True, the pretrained model will load properly (as all weights match) but a user won't get decent results with the model.

If you're talking about existing checkpoints: they won't have these config flags set in their config, and as long as the default value for the new config options is False, they will get identical results, since essentially they will be running the same modeling code as before this PR.

If the defaults for the new config options are set to True, then indeed it'd break.
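
For what it's worth, the backward-compatibility claim is easy to check directly; a minimal sketch, assuming the flags keep this PR's names and default to False:

```python
from transformers import FSMTConfig

# facebook/wmt19-en-de is an existing FSMT checkpoint whose config.json
# predates this PR, so it contains no *_normalize_before entries.
config = FSMTConfig.from_pretrained("facebook/wmt19-en-de")

# With a default of False, such checkpoints keep running the PostLN code path
# and produce the same outputs as before this PR.
print(getattr(config, "encoder_normalize_before", False))  # False
print(getattr(config, "decoder_normalize_before", False))  # False
```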

@stas00 (Contributor) commented Feb 24, 2022

So I talked some more with Sylvain, and it turns out we weren't using the same test, hence the different pictures. He was assuming defaults of True for the new config options, I was assuming defaults of False, hence the difference in outcome.

And for me, the question of whether a new arch is needed comes down to:

  1. The config is for minor settings, like beam width, activation checkpointing, etc., which may sometimes lead to a slightly less than optimal outcome, but one as close as possible to the best result.
  2. Any new config option that changes the control flow (enables/disables code, changes order) and would lead to a different numerical outcome at the modeling level requires a new arch.

In other words, no new architectural features can be added once a given modeling code has been released into the world and at least one checkpoint has started using it; instead, a new architecture needs to be created.

@jinmang2 (Author) commented

I force-pushed my local branch 30 minutes ago because of a commit mistake. If this is a problem, please let me know! I will close this PR and open a new one.

@jinmang2 (Author) commented

This is how BART was initially designed, and it was refactored to align with the general philosophy of the library to have self-contained model files, cf. #9343. So this should be a new model.

Also, there are other fairseq models in the library that use PreLN, e.g. mBART and M2M100. Could you maybe check whether you can load your checkpoints with these models?

@patil-suraj Because there is no SinusoidalPositionalEmbedding class in the mBART modeling script, a missing-key problem occurs that is separate from the convert script itself. Therefore, porting with that script seems impossible.
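
For illustration, this is roughly how the mismatch can be inspected; the checkpoint path is a placeholder and the exact key names may differ:

```python
import torch
from transformers import MBartConfig, MBartForConditionalGeneration

# Placeholder: a state dict produced by adapting the fairseq->transformers
# conversion to mBART's parameter naming.
state_dict = torch.load("path/to/converted_checkpoint/pytorch_model.bin")

model = MBartForConditionalGeneration(MBartConfig())
missing, unexpected = model.load_state_dict(state_dict, strict=False)

# fairseq's SinusoidalPositionalEmbedding is computed on the fly and stores no
# weights, while mBART expects learned positional embeddings, so keys such as
# model.encoder.embed_positions.weight end up in `missing`.
print(missing)
print(unexpected)
```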

@jinmang2 (Author) commented Feb 28, 2022

Overall looks good. I left a few suggestions.

@stas00 I checked all your suggestions and addressed them all in commit 2982403!

The next step is to add a functional test, by building upon one of the existing fsmt tests.

Does the functional test require modifying test_modeling_fsmt.py? Or can I post the test results in the PR to show that the pre_layernorm option works correctly?

Also, if there are pre-trained models that use PreLN which we could port, it'd be great for doing a quality test, which would be tricky with dummy random data.

Among the models in the PORORO library developed by OSLO developer Hyunwoong Ko, some models such as translation and word sense disambiguation use the normalize_before option. Is it okay to test the results and reflect them in the convert script?

@jinmang2 requested a review from stas00 on February 28, 2022
@patil-suraj (Contributor) commented

This is how BART was initially designed, and it was refactored to align with the general philosophy of the library to have self-contained model files, cf. #9343. So this should be a new model.

Also, there are other fairseq models in the library that use PreLN, e.g. mBART and M2M100. Could you maybe check whether you can load your checkpoints with these models?

@patil-suraj Because there is no SinusoidalPositionalEmbedding class in the mBART modeling script, a missing-key problem occurs that is separate from the convert script itself. Therefore, porting with that script seems impossible.

I see. Could you try with M2M100 then? M2M100 also uses PreLN and SinusoidalPositionalEmbedding.

@stas00 (Contributor) commented Mar 2, 2022

Overall looks good. I left a few suggestions.

@stas00 I checked all your suggestions and addressed them all in commit 2982403!

Thank you.

The next step is to add a functional test, by building upon one of the existing fsmt tests.

Does the functional test require modifying test_modeling_fsmt.py? Or can I post the test results in the PR to show that the pre_layernorm option works correctly?

I'm not sure I understand your question. What I meant is that this new code can't be merged without tests exercising it.

And, of course, I'm not trying to contradict the maintainers who requested a new architecture. It'd need a new test file, something new rather than test_modeling_fsmt.py, though you can of course model it after that one. Probably test_modeling_fsmt_preln.py or something of the sort, based on the new arch name that you choose.
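
As a sketch only, the new test file might start out something like the following; the file name follows the suggestion above, the flag names follow this PR (they are not in the released FSMTConfig), and the tiny config values are arbitrary:

```python
# tests/test_modeling_fsmt_preln.py (hypothetical)
import torch
from transformers import FSMTConfig, FSMTForConditionalGeneration


def test_preln_forward_pass():
    config = FSMTConfig(
        src_vocab_size=99,
        tgt_vocab_size=99,
        d_model=16,
        encoder_layers=2,
        decoder_layers=2,
        encoder_attention_heads=2,
        decoder_attention_heads=2,
        encoder_ffn_dim=32,
        decoder_ffn_dim=32,
        max_position_embeddings=64,
        # the flags under discussion in this PR
        encoder_normalize_before=True,
        decoder_normalize_before=True,
    )
    model = FSMTForConditionalGeneration(config).eval()
    input_ids = torch.tensor([[4, 5, 6, 2]])
    with torch.no_grad():
        out = model(input_ids=input_ids, decoder_input_ids=input_ids)
    # shape check only; a quality check would need a ported PreLN checkpoint
    assert out.logits.shape == (1, input_ids.shape[1], config.tgt_vocab_size)
```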

Also, if there are pre-trained models that use PreLN which we could port, it'd be great for doing a quality test, which would be tricky with dummy random data.

Among the models in the PORORO library developed by OSLO developer Hyunwoong Ko, some models such as translation and word sense disambiguation use the normalize_before option. Is it okay to test the results and reflect them in the convert script?

If these models fit the architecture, by all means, that would be perfect. Except we would want the model to be on the https://huggingface.co/models hub first, so that the test suite can run reliably.

This pull request was closed.