
Add BORT #9112

Closed

wants to merge 23 commits into from

Conversation

@stefan-it (Collaborator) commented Dec 15, 2020

Hi,

this PR adds the recently introduced BORT model from @adewynter and Daniel J. Perry of the Alexa team to Transformers.


BORT was introduced in the paper Optimal Subarchitecture Extraction for BERT.

Details about BORT:

We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al. (2018) by applying recent breakthroughs in algorithms for neural architecture search. This optimal subset, which we refer to as "Bort", is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of 5.5% the original BERT-large architecture, and 16% of the net size. Bort is also able to be pretrained in 288 GPU hours, which is 1.2% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large (Liu et al., 2019), and about 33% of that of the world-record, in GPU hours, required to train BERT-large on the same hardware. It is also 7.9x faster on a CPU, as well as being better performing than other compressed variants of the architecture, and some of the non-compressed variants: it obtains performance improvements of between 0.3% and 31%, absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks.

This should fix #8135 🤗


ToDo tasks:

  • Upload models (both PyTorch and TensorFlow) to the model hub
  • Add a conversion script from GluonNLP to Transformers
  • Enable unit tests (they are working and just waiting for the model upload)

@stefan-it stefan-it marked this pull request as draft December 15, 2020 00:29
@patrickvonplaten patrickvonplaten self-requested a review December 15, 2020 09:12
input_ids = tf.convert_to_tensor(
    [[0, 18077, 4082, 7804, 8606, 6195, 2457, 3321, 11, 10489, 16, 269, 2579, 328, 2]],
    dtype=tf.int32,
)  # Schloß Nymphenburg in Munich is really nice!
Contributor

Das stimmt! (That's right!)

@@ -0,0 +1,143 @@
# coding=utf-8
Contributor

(nit) we could add some examples here as well similar to how it's done for MT5:

I think for BortModel we can just show how to get the last_hidden_state and for all other models we could show how to get the loss for fine-tuning.
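
For instance, the docstring examples could look roughly like this (a sketch only; the Bort* class names come from this PR, while the "amazon/bort" checkpoint name and the sequence-classification head are assumptions):

# Hypothetical docstring examples mirroring MT5; the "amazon/bort" checkpoint
# name and the sequence-classification head are assumptions, not PR code.
import torch
from transformers import BortTokenizer, BortModel, BortForSequenceClassification

tokenizer = BortTokenizer.from_pretrained("amazon/bort")
inputs = tokenizer("Schloß Nymphenburg in Munich is really nice!", return_tensors="pt")

# BortModel: retrieve the last hidden state
model = BortModel.from_pretrained("amazon/bort")
last_hidden_state = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Models with a head: pass labels to obtain a loss for fine-tuning
clf = BortForSequenceClassification.from_pretrained("amazon/bort")
loss = clf(**inputs, labels=torch.tensor([0])).loss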

@patrickvonplaten (Contributor) left a comment

PR looks great! Think we only have to wait now for the name and then we're good to go :-)

@julien-c (Member)

🔥 Looking forward to taking a look at the conversion script from GluonNLP/mxnet!

@stefan-it (Collaborator, Author)

@patrickvonplaten I added some examples for both modeling_bort.py and modeling_tf_bort.py 🤗

@julien-c The conversion script is also added - you just need to install gluonnlp==0.8.3 and mxnet==1.5.0.

These versions are defined in the BORT requirements file. The conversion script also performs a version check.
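
For reference, the kind of check meant here might look like this (a minimal sketch; the exact messages and structure in the conversion script may differ):

# Minimal sketch of a version check for the pinned GluonNLP/MXNet versions;
# the actual wording in the conversion script may differ.
import gluonnlp as nlp
import mxnet as mx

if nlp.__version__ != "0.8.3":
    raise RuntimeError(f"Expected gluonnlp==0.8.3, found {nlp.__version__}")
if mx.__version__ != "1.5.0":
    raise RuntimeError(f"Expected mxnet==1.5.0, found {mx.__version__}")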

>>> hidden_states = outputs.last_hidden_state
"""

config_class = BortConfig
Contributor

we should add model_type = 'bort' for each class here -> see MT5 for comparison:
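
Something along these lines, for example (a sketch mirroring how MT5 subclasses the T5 classes; the relative imports assume this lives in modeling_bort.py):

# Sketch of adding model_type to an aliased class, mirroring MT5Model;
# an illustration of the pattern, not the exact PR code.
from ..bert.modeling_bert import BertModel
from .configuration_bort import BortConfig


class BortModel(BertModel):
    model_type = "bort"
    config_class = BortConfig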

"""

config_class = BortConfig

Contributor

also add model type here for all models:

@patrickvonplaten (Contributor) left a comment

Really cool! I have one (nit), and I think we should add the model_type to each aliased class

@patrickvonplaten patrickvonplaten marked this pull request as ready for review December 29, 2020 23:17
@patrickvonplaten patrickvonplaten changed the title from "Add support for BORT" to "Add BORT" Dec 29, 2020
@patrickvonplaten (Contributor) left a comment

Looks great!

@patrickvonplaten (Contributor)

We'll have to think a bit about how to advertise this. Let me draft up a "Contribution Proposal" for the fine-tuning algorithm.

@sgugger (Collaborator) left a comment

Thanks for adding this model! There are a few things to adapt to have the same API as the current master, and I would very much like to be consistent with the paper and use Bort (not BORT) everywhere in the docs.

an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

BORT
Collaborator

Suggested change
BORT
Bort

The authors don't use all caps.

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The BORT model was proposed in `Optimal Subarchitecture Extraction for BERT <https://arxiv.org/abs/2010.10499>`__ by
Collaborator

Suggested change
The BORT model was proposed in `Optimal Subarchitecture Extraction for BERT <https://arxiv.org/abs/2010.10499>`__ by
The Bort model was proposed in `Optimal Subarchitecture Extraction for BERT <https://arxiv.org/abs/2010.10499>`__ by

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BortTokenizerFast
:members:
Collaborator

Suggested change
:members:
:members: forward

Member

I think @sgugger made a typo here; you can leave it as :members: given that it's a fast tokenizer.

Collaborator

Sorry! I meant it for the PyTorch models :-)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BortModel
:members:
Collaborator

Suggested change
:members:
:members: forward

Here and for the rest of the PyTorch models.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFBortModel
:members:
Collaborator

Suggested change
:members:
:members: call

Here and for the rest of the TF models.

Comment on lines +805 to +808
We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al. (2018) by
applying recent breakthroughs in algorithms for neural architecture search. This optimal subset, which we refer to as
"Bort", is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of 5.5% the
original BERT-large architecture, and 16% of the net size.
Collaborator

The model summary doesn't use first-person pronouns. This should be changed to fit the style of the rest of the document: "Same as BERT but with xxx..."

Also, it doesn't seem to be placed in the right section. If it's like BERT, it should be in the autoencoding models part.

@@ -233,6 +239,7 @@
(MPNetConfig, (MPNetTokenizer, MPNetTokenizerFast)),
(TapasConfig, (TapasTokenizer, None)),
(LEDConfig, (LEDTokenizer, LEDTokenizerFast)),
(BortConfig, (BortTokenizer, BortTokenizerFast)),
Collaborator

This line should be removed I believe.

# See the License for the specific language governing permissions and
# limitations under the License.

from ...file_utils import is_sentencepiece_available, is_tf_available, is_tokenizers_available, is_torch_available
Collaborator

This init should be adapted to the new style (see any model init in current master) to avoid importing TF/PyTorch when not required.
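
For illustration, a condensed sketch of the framework-guarded init pattern on current master (the TYPE_CHECKING/lazy-module boilerplate is omitted, and the Bort module names are taken from this PR):

# Condensed sketch of the guarded-import style used in model __init__.py files
# on master; only modules whose backend is available get registered.
from ...file_utils import is_tf_available, is_tokenizers_available, is_torch_available

_import_structure = {
    "configuration_bort": ["BortConfig"],
    "tokenization_bort": ["BortTokenizer"],
}

if is_tokenizers_available():
    _import_structure["tokenization_bort_fast"] = ["BortTokenizerFast"]

if is_torch_available():
    _import_structure["modeling_bort"] = ["BortModel"]

if is_tf_available():
    _import_structure["modeling_tf_bort"] = ["TFBortModel"]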

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BORT model configuration """
Collaborator

Suggested change
""" BORT model configuration """
""" Bort model configuration """

class BortConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.BortModel` or a
:class:`~transformers.TFBortModel`. It is used to instantiate a BORT model according to the specified arguments,
Collaborator

Everywhere, BORT -> Bort (we should use the same name as the authors, written in the same way).

@LysandreJik (Member) left a comment

Hi @stefan-it, thanks a lot for your contribution!

If Bort can be loaded seamlessly in the BERT architecture, is there really a need to redefine all models in PyTorch and TensorFlow? We would need to do this for all models on the hub if that were the case. If there were a change in one of the models I would understand, but given that it's an exact copy of BERT, I don't think that's necessary at all.

I understand the conversion script, however. I would just replace the model name "bort" with "bert" in that script so that the models are loadable directly in the BERT architecture.

I see that Bort requires RoBERTa tokenizers, which isn't a problem either; tokenizers can be decoupled from their models by specifying a tokenizer_class in the model config, similarly to what BERTweet does.
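
For example, something along these lines could tag the checkpoint's config (a sketch; the "amazon/bort" checkpoint name and local save path are assumptions):

# Sketch of decoupling the tokenizer from the model via the config; the
# "amazon/bort" checkpoint name is an assumption for illustration.
from transformers import BertConfig

config = BertConfig.from_pretrained("amazon/bort")
config.tokenizer_class = "RobertaTokenizer"
config.save_pretrained("./bort")  # AutoTokenizer will now pick RobertaTokenizer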

@patrickvonplaten (Contributor)

Hey @stefan-it,

I've discussed a bit with @LysandreJik and @sgugger offline, and after having thought about it again I do agree with @LysandreJik. I think it's better if we actually don't add any new code besides the conversion script (which should go into src/transformers/models/bert/) and the docs page. I'm very sorry to have asked you to go down this road! I think, however, that it makes more sense not to add any "tokenizer" or "model" code, as those are exact copies of RobertaTokenizer and BertModel. It's probably most efficient to open a new PR and only add the required files. Super sorry again!

@gaceladri

Are we planning to implement the architectural optimization (FPTAS) or just the pre-trained models?

@stefan-it stefan-it mentioned this pull request Jan 26, 2021
@patrickvonplaten (Contributor)

Are we planning to implement the architectural optimization (FPTAS) or just the pre-trained models?

Great question! For now, we'll just add the model weights - see: #9813. A community contribution showing how to do FPTAS in a notebook would be extremely valuable though.

@patrickvonplaten (Contributor)

Closing in favor of #9813


Successfully merging this pull request may close these issues.

Bort (Amazon's reduced BERT)
6 participants