
Add PreLN to fsmt module #15747

Closed · wants to merge 2 commits

Conversation

@jinmang2 commented Feb 21, 2022

What does this PR do?

Add a pre-layer-normalization (PreLN) option to the FSMT module to match fairseq transformers.
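
For context, the difference comes down to where LayerNorm sits relative to the residual connection in each sub-layer. The sketch below is purely illustrative (it is not the actual FSMT code; it omits the feed-forward sub-layer, dropout, and the final encoder LayerNorm that fairseq adds in PreLN mode):

```python
import torch
import torch.nn as nn


class EncoderLayerSketch(nn.Module):
    """Illustrative only: PostLN (normalize_before=False, current FSMT behaviour)
    vs. PreLN (normalize_before=True, fairseq's encoder_normalize_before)."""

    def __init__(self, d_model: int, nhead: int, normalize_before: bool = False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.attn_layer_norm = nn.LayerNorm(d_model)
        self.normalize_before = normalize_before

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        if self.normalize_before:      # PreLN: normalize the sub-layer input
            x = self.attn_layer_norm(x)
        x, _ = self.self_attn(x, x, x)
        x = residual + x
        if not self.normalize_before:  # PostLN: normalize after the residual add
            x = self.attn_layer_norm(x)
        return x
```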

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@stas00

@HuggingFaceDocBuilder commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@stas00 (Contributor) left a comment

Thank you for working on this, @jinmang2

Overall looks good. I left a few suggestions.

Would it be better to name the new option encoder_pre_layernorm, to better match the known concept? normalize_before is sort of asking for a noun after it. Same for the decoder.
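
Purely to illustrate the naming question (both spellings are hypothetical here; neither is part of the released FSMTConfig):

```python
from transformers import FSMTConfig

# What this PR currently mirrors from fairseq:
config_pr = FSMTConfig(encoder_normalize_before=True, decoder_normalize_before=True)

# The suggested, more explicit spelling:
config_alt = FSMTConfig(encoder_pre_layernorm=True, decoder_pre_layernorm=True)
```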

The next step is to add a functional test, by building upon one of the existing fsmt tests.

Also, if there are pre-trained models that use PreLN which we could port, it'd be great for doing a quality test, which would be tricky with dummy random data.

@stas00 (Contributor) commented Feb 22, 2022

cc: @patil-suraj, in case you'd like to have another pair of eyes on this PR, and also FYI, as this will now diverge a bit from your unmerged fsmt PR from a long time ago. Not sure what you would want to do about it; it has kind of fallen between the cracks due to the performance regression.

@sgugger (Collaborator) commented Feb 23, 2022

This is typically the type of change we usually don't accept inside one model architecture, as it doesn't pass the test for new models: can an existing checkpoint for FSMT be used with this new config argument and give sensible results? -> No

I think this should warrant a new model.

@patil-suraj (Contributor) commented

This is how BART was initially designed, and it was refactored to align with the general philosophy of the library to have self-contained model files, cf. #9343. So this should be a new model.

Also, there are other fairseq models in the library that use PreLN, e.g. mBART and M2M100. Could you maybe check whether you can load your checkpoints with these models?

@stas00 (Contributor) commented Feb 23, 2022

This is typically the type of change we usually don't accept inside one model architecture, as it doesn't pass the test for new models: can an existing checkpoint for FSMT be used with this new config argument and give sensible results? -> No

Not arguing whether it should be a new model or not, but unless I'm missing something, why is the answer to:

can an existing checkpoint for FSMT be used with this new config argument and give sensible results?

No? I don't see any reason why it wouldn't work and give identical results with this PR. It won't have any of the new model config entries and will work as intended.

@jinmang2 (Author) commented Feb 23, 2022

First of all, thank you for leaving comments on my PR!

I wanted to port PORORO's translation and WSD (word sense disambiguation) models to transformers. When I tested it myself with this PR applied, I got the same results as fairseq's inference output.

I will check the items mentioned above and report back:

  • mBART
  • Test results

@sgugger (Collaborator) commented Feb 23, 2022

Not arguing whether it should be a new model or not, but unless I'm missing something, why is the answer to "can an existing checkpoint for FSMT be used with this new config argument and give sensible results?" No? I don't see any reason why it wouldn't work and give identical results with this PR. It won't have any of the new model config entries and will work as intended.

When using an existing pretrained checkpoint with the config option encoder_normalize_before=True or decoder_normalize_before=True, the pretrained model will load properly (as all weights match) but a user won't get decent results with the model.

We could write a perfectly modular toolbox that works with the current model implementation without breaking backward compatibility and gives identical results, just by adding many new options to the configuration. That does not mean it conforms to the philosophy or passes the test of "this new option can be used with existing checkpoints".

@stas00 (Contributor) commented Feb 23, 2022

When using an existing pretrained checkpoint with the config option encoder_normalize_before=True or decoder_normalize_before=True, the pretrained model will load properly (as all weights match) but a user won't get decent results with the model.

If you're talking about existing checkpoints: they won't have these config flags set in their config, and as long as the default value for the new config options is False, they will get identical results, since essentially they will be running the same modeling code as before this PR.

If the defaults for the new config options are set to True, then indeed it'd break.
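
For what it's worth, the backward-compatibility claim is easy to check directly; a minimal sketch, assuming the flags keep this PR's names and default to False:

```python
from transformers import FSMTConfig

# facebook/wmt19-en-de is an existing FSMT checkpoint whose config.json
# predates this PR, so it contains no *_normalize_before entries.
config = FSMTConfig.from_pretrained("facebook/wmt19-en-de")

# With a default of False, such checkpoints keep running the PostLN code path
# and produce the same outputs as before this PR.
print(getattr(config, "encoder_normalize_before", False))  # False
print(getattr(config, "decoder_normalize_before", False))  # False
```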

@stas00 (Contributor) commented Feb 24, 2022

So I talked some more with Sylvain, and it turns out we weren't using the same test, hence the different pictures. He was assuming defaults of True for the new config options, I was assuming defaults of False, hence the difference in outcome.

And for me, the question of whether a new arch is needed comes down to:

  1. The config is for minor settings, like beam width, activation checkpointing, etc., which may sometimes lead to a slightly less than optimal outcome, but one as close as possible to the best result.
  2. Any new config option that changes the control flow (enables/disables code, changes order) and would lead to a different numerical outcome at the modeling level requires a new arch.

In other words, no new architectural features can be added once a given modeling code has been released into the world and at least one checkpoint has started using it; instead, a new architecture needs to be created.

@jinmang2 (Author) commented

I force-pushed my local branch 30 minutes ago because of a commit mistake. If this is a problem, please let me know! I will close this PR and open a new one.

@jinmang2 (Author) commented

This is how BART was initially designed, and it was refactored to align with the general philosophy of the library to have self-contained model files, cf. #9343. So this should be a new model.

Also, there are other fairseq models in the library that use PreLN, e.g. mBART and M2M100. Could you maybe check whether you can load your checkpoints with these models?

@patil-suraj Because there is no SinusoidalPositionalEmbedding class in the mBART modeling script, a missing-key problem occurs that is separate from the convert script itself. Therefore, porting with that script seems impossible.
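
For illustration, this is roughly how the mismatch can be inspected; the checkpoint path is a placeholder and the exact key names may differ:

```python
import torch
from transformers import MBartConfig, MBartForConditionalGeneration

# Placeholder: a state dict produced by adapting the fairseq->transformers
# conversion to mBART's parameter naming.
state_dict = torch.load("path/to/converted_checkpoint/pytorch_model.bin")

model = MBartForConditionalGeneration(MBartConfig())
missing, unexpected = model.load_state_dict(state_dict, strict=False)

# fairseq's SinusoidalPositionalEmbedding is computed on the fly and stores no
# weights, while mBART expects learned positional embeddings, so keys such as
# model.encoder.embed_positions.weight end up in `missing`.
print(missing)
print(unexpected)
```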

@jinmang2 (Author) commented Feb 28, 2022

Overall looks good. I left a few suggestions.

@stas00 I checked all your suggestions and addressed them all in commit 2982403!

The next step is to add a functional test, by building upon one of the existing fsmt tests.

Does the functional test require modifying test_modeling_fsmt.py? Or can I post the test results in the PR to show that the pre_layernorm option works correctly?

Also, if there are pre-trained models that use PreLN which we could port, it'd be great for doing a quality test, which would be tricky with dummy random data.

Among the models in the PORORO library developed by OSLO developer Hyunwoong Ko, some models such as translation and word sense disambiguation use the normalize_before option. Is it okay to test the results and reflect them in the convert script?

@jinmang2 requested a review from stas00 on February 28, 2022
@patil-suraj (Contributor) commented

This is how BART was initially designed, and it was refactored to align with the general philosophy of the library to have self-contained model files, cf. #9343. So this should be a new model.

Also, there are other fairseq models in the library that use PreLN, e.g. mBART and M2M100. Could you maybe check whether you can load your checkpoints with these models?

@patil-suraj Because there is no SinusoidalPositionalEmbedding class in the mBART modeling script, a missing-key problem occurs that is separate from the convert script itself. Therefore, porting with that script seems impossible.

I see. Could you try with M2M100 then? M2M100 also uses PreLN and SinusoidalPositionalEmbedding.

@stas00 (Contributor) commented Mar 2, 2022

Overall looks good. I left a few suggestions.

@stas00 I checked all your suggestions and addressed them all in commit 2982403!

Thank you.

The next step is to add a functional test, by building upon one of the existing fsmt tests.

Does the functional test require modifying test_modeling_fsmt.py? Or can I post the test results in the PR to show that the pre_layernorm option works correctly?

I'm not sure I understand your question. What I meant is that this new code can't be merged without tests exercising it.

And, of course, I'm not trying to contradict the maintainers who requested a new architecture. It'd need a new test file, something new rather than test_modeling_fsmt.py, though you can of course model it after that one. Probably test_modeling_fsmt_preln.py or something of the sort, based on the new arch name that you choose.
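
As a sketch only, the new test file might start out something like the following; the file name follows the suggestion above, the flag names follow this PR (they are not in the released FSMTConfig), and the tiny config values are arbitrary:

```python
# tests/test_modeling_fsmt_preln.py (hypothetical)
import torch
from transformers import FSMTConfig, FSMTForConditionalGeneration


def test_preln_forward_pass():
    config = FSMTConfig(
        src_vocab_size=99,
        tgt_vocab_size=99,
        d_model=16,
        encoder_layers=2,
        decoder_layers=2,
        encoder_attention_heads=2,
        decoder_attention_heads=2,
        encoder_ffn_dim=32,
        decoder_ffn_dim=32,
        max_position_embeddings=64,
        # the flags under discussion in this PR
        encoder_normalize_before=True,
        decoder_normalize_before=True,
    )
    model = FSMTForConditionalGeneration(config).eval()
    input_ids = torch.tensor([[4, 5, 6, 2]])
    with torch.no_grad():
        out = model(input_ids=input_ids, decoder_input_ids=input_ids)
    # shape check only; a quality check would need a ported PreLN checkpoint
    assert out.logits.shape == (1, input_ids.shape[1], config.tgt_vocab_size)
```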

Also, if there are pre-trained models that use PreLN which we could port, it'd be great for doing a quality test, which would be tricky with dummy random data.

Among the models in the PORORO library developed by OSLO developer Hyunwoong Ko, some models such as translation and word sense disambiguation use the normalize_before option. Is it okay to test the results and reflect them in the convert script?

If these models fit the architecture, by all means, that would be perfect. Except we would want the model to be on the https://huggingface.co/models hub first, so that the test suite can run reliably.

This pull request was closed.