Add PreLN to fsmt module #15747
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thank you for working on this, @jinmang2
Overall looks good. I left a few suggestions.
Would it be better to name the new options `encoder_pre_layernorm` to better match the known concept? `normalize_before` is sort of asking for a noun after it. Same for the decoder option.
The next step is to add a functional test, by building upon one of the existing fsmt tests.
Also, if there are pre-trained models we could port that use preLN, it'd be great for then doing a quality test, which would be tricky to do with dummy random data.
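For reference, the distinction under discussion is only where the LayerNorm sits relative to each residual block. Below is a minimal sketch of the two orderings, showing just the self-attention sublayer; `pre_layernorm` mirrors the naming suggested above and is not an existing FSMT config field.

```python
# Minimal sketch of pre-LN vs. post-LN ordering for one encoder sublayer.
# The real FSMT layer also has a feed-forward sublayer with its own LayerNorm.
import torch.nn as nn


class EncoderSelfAttnSketch(nn.Module):
    def __init__(self, d_model, nhead, pre_layernorm=False):
        super().__init__()
        self.pre_layernorm = pre_layernorm
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        residual = x
        if self.pre_layernorm:      # pre-LN: normalize before the sublayer
            x = self.layer_norm(x)
        x, _ = self.self_attn(x, x, x)
        x = residual + x
        if not self.pre_layernorm:  # post-LN (current FSMT): normalize after the residual add
            x = self.layer_norm(x)
        return x
```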
cc: @patil-suraj - if you'd like to have another pair of eyes on this PR; also FYI, this will now create a bit of a divergence from your unmerged FSMT PR from a long time ago. Not sure what you would want to do about it. It has kind of fallen through the cracks due to the performance regression.
This is typically the type of change we usually don't accept inside one model architecture, as it doesn't pass the test for new models: can an existing checkpoint for FSMT be used with this new config argument and give sensible results? -> No. I think this should warrant a new model.
This is how BART was initially designed, and it was refactored to align with the general philosophy of the library to have self-contained model files, cf. #9343. So this should be a new model. Also, there are other fairseq models in the library that use pre-LN.
Not arguing whether it should be a new model or not, but unless I'm missing something, why is the answer to "can an existing checkpoint for FSMT be used with this new config argument and give sensible results?" a "No"?
First of all, thank you for leaving a comment on my PR! I wanted to port PORORO's translation and WSD models to transformers. When I tested it myself, I got the same results as fairseq's inference results once I applied the changes in this PR. I will check the items mentioned above and follow up.
When using an existing pretrained checkpoint with the new config option enabled, you do not get sensible results. We can write a perfect modular toolbox that works with a current model implementation without breaking backward compatibility and gives identical results, just by adding many new options in the configuration. That does not mean it conforms to the philosophy or passes the test of "this new option can be used with existing checkpoints".
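For concreteness, the test being applied here is roughly the following sketch. `encoder_pre_layernorm` / `decoder_pre_layernorm` are the option names proposed in this PR and would only have an effect on this branch; the checkpoint name is just an example of an existing FSMT model.

```python
# Sketch of the "can an existing checkpoint use the new option?" test.
# Running post-LN weights through a pre-LN graph is expected to produce garbage.
from transformers import FSMTConfig, FSMTForConditionalGeneration, FSMTTokenizer

name = "facebook/wmt19-en-de"  # an existing FSMT checkpoint
tok = FSMTTokenizer.from_pretrained(name)
inputs = tok("Machine learning is great, isn't it?", return_tensors="pt")

baseline = FSMTForConditionalGeneration.from_pretrained(name)

config = FSMTConfig.from_pretrained(name)
config.encoder_pre_layernorm = True   # proposed option, only meaningful on this PR's branch
config.decoder_pre_layernorm = True
modified = FSMTForConditionalGeneration.from_pretrained(name, config=config)

print(tok.decode(baseline.generate(**inputs)[0], skip_special_tokens=True))
print(tok.decode(modified.generate(**inputs)[0], skip_special_tokens=True))
# The second translation is not expected to be sensible, which is the point above.
```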
If you're talking about the existing checkpoints - they won't have these config flags set in their config, and as long as the default value for the new config options is `False`, nothing changes for them. If the defaults for the new config options are set to `True`, then indeed it'd break.
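In other words, the backward-compatibility argument is roughly the following sketch (again with the proposed, not-yet-existing option name):

```python
# Existing checkpoints have no pre-LN keys in their config.json, so with a
# False default the architecture is built exactly as before.
from transformers import FSMTConfig

config = FSMTConfig.from_pretrained("facebook/wmt19-en-de")
print(getattr(config, "encoder_pre_layernorm", False))  # -> False, old code path
```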
So I talked some more with Sylvain and we aren't using the same test, hence the different pictures. He was testing with the new options enabled on an existing checkpoint, whereas I was assuming the new options default to `False`. And for me the idea of when to use a new arch or not comes down to:
In other words, no new architectural features can be added once a given modeling code is unleashed into the world and at least one checkpoint has started using it; instead, a new architecture needs to be created.
I force-pushed my local branch 30 minutes ago because of a commit mistake. If this is a problem, please let me know! I will close this PR and open a new one.
@patil-suraj Because there is no
@stas00 I checked all your suggestions and fixed them all in commit 2982403!
Does the functional test need to modify one of the existing FSMT tests?
Among the models in the PORORO library developed by the OSLO developer Hyunwoong Ko, some models such as translation and word sense disambiguation use the pre-LN fairseq transformer.
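One way to confirm that such fairseq checkpoints really were trained with pre-LN is to look at the flags stored in the checkpoint itself. A rough sketch (the checkpoint path is hypothetical, and exactly where the flags live depends on the fairseq version):

```python
# Inspect a fairseq checkpoint for the pre-LN training flags.
import torch

ckpt = torch.load("checkpoint_best.pt", map_location="cpu")  # hypothetical path
# Older fairseq checkpoints store a Namespace under "args"; newer ones a config under "cfg".
args = ckpt.get("args") or ckpt.get("cfg", {}).get("model")
print(getattr(args, "encoder_normalize_before", False))
print(getattr(args, "decoder_normalize_before", False))
```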
I see. Could you try with
Thank you.
I'm not sure I understand your question. What I meant is that this new code can't be merged w/o tests exercising it. And, of course, I'm not trying to contradict the maintainers who requested a new architecture. But it'd need a new test, and it'd be something new and not a modification of the existing FSMT tests.
If these models fit the architecture, by all means, that would be perfect. Except we would want the models to be on the https://huggingface.co/models hub first, so that the test suite could run reliably.
What does this PR do?
Add a pre-layer normalization (PreLN) option to the FSMT model to match fairseq transformers.
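A rough usage sketch of what the PR adds, using the option names proposed in the review above (they are not part of a released transformers version):

```python
from transformers import FSMTConfig, FSMTForConditionalGeneration

# Build a randomly initialized FSMT model with the proposed pre-LN options enabled.
config = FSMTConfig(encoder_pre_layernorm=True, decoder_pre_layernorm=True)
model = FSMTForConditionalGeneration(config)
```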
Who can review?
@stas00