New TF embeddings (cleaner and faster) #9418
Conversation
        super().build(input_shape=input_shape)

    def get_config(self):
What is this function needed for?
This is a required function for a layer that takes parameters in its __init__ to become serializable; see more detail in the docs: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer#get_config and https://www.tensorflow.org/guide/keras/custom_layers_and_models#you_can_optionally_enable_serialization_on_your_layers
Basically, this is what the @keras_serializable decorator does.
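For readers less familiar with the pattern, here is a minimal sketch of a serializable custom layer; the class name and arguments are made up and are not the PR's classes:

import tensorflow as tf

class ToyEmbedding(tf.keras.layers.Layer):
    def __init__(self, vocab_size, hidden_size, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size

    def build(self, input_shape):
        self.weight = self.add_weight(
            name="weight",
            shape=[self.vocab_size, self.hidden_size],
            initializer="random_normal",
        )
        super().build(input_shape)

    def call(self, input_ids):
        return tf.gather(params=self.weight, indices=input_ids)

    def get_config(self):
        # Without this, tf.keras cannot rebuild the layer from its config
        # (e.g. when saving and reloading a model that contains it),
        # because __init__ takes extra arguments.
        config = {"vocab_size": self.vocab_size, "hidden_size": self.hidden_size}
        base_config = super().get_config()
        return dict(list(base_config.items()) + list(config.items()))

layer = ToyEmbedding(vocab_size=10, hidden_size=4)
restored = ToyEmbedding.from_config(layer.get_config())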
        super().build(input_shape)

    def get_config(self):
Same question: why do we need this function?
        super().build(input_shape=input_shape)

    def get_config(self):
Why is the function required?
name="token_type_embeddings", | ||
) | ||
self.embeddings = tf.keras.layers.Add() |
Do we need this? This layer is just an "add" operation, no? Why is it called embeddings?
This is the optimized version of doing tensor + tensor + ... The other advantage of using this layer (besides computational performance) is that it runs some checks over the given tensors, such as shape compatibility. I named it embeddings because it represents the addition of all the embeddings.
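A tiny illustration of the equivalence, with toy tensors standing in for the word/position/token-type embeddings (not the PR's code):

import tensorflow as tf

word = tf.random.normal((2, 5, 8))
position = tf.random.normal((2, 5, 8))
token_type = tf.random.normal((2, 5, 8))

add = tf.keras.layers.Add()
summed = add([word, position, token_type])

# Same values as plain "+", but the layer also checks shape compatibility first.
tf.debugging.assert_near(summed, word + position + token_type)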
The name could be clearer I think: embeddings_sum is more explicit.
@@ -501,10 +448,10 @@ def __init__(self, config, add_pooling_layer=True, **kwargs):
        )

    def get_input_embeddings(self):
        return self.embeddings
does this still return the same type?
Yes! Still a tf.keras.layers.Layer object.
I don't understand how it can be used above (line 420) in a tf.matmul if it's a layer and not a weight.
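As an aside, the layer-versus-weight distinction can be shown with the stock tf.keras.layers.Embedding (an illustration only, not the PR's classes): a matmul needs the weight tensor the layer holds, not the layer object itself.

import tensorflow as tf

layer = tf.keras.layers.Embedding(input_dim=10, output_dim=4)
_ = layer(tf.constant([[1, 2]]))  # run once so the layer builds its weight

hidden_states = tf.random.normal((2, 4))
# The output projection multiplies by the weight matrix, not by the layer:
logits = tf.matmul(hidden_states, layer.embeddings, transpose_b=True)
print(logits.shape)  # (2, 10)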
I like this PR in general! Just wondering about two things:
From a quick look, you're factoring the embeddings computation into three classes that will live in modeling_tf_utils.py. Usually we try to be as explicit as possible and display every operation in a single file, while here we're applying different embedding operations in another file. I think this goes against our "everything in one file" principle.
Is there a good reason for the embeddings to be the exception to this rule? Personally, I would like to see directly in the model file that the embeddings are computed differently according to the matrix sizes; putting these layers in modeling_tf_utils.py makes that abstracted/hidden.
Good point @LysandreJik! Basically, most of the models share a similar embedding computation that stays inside their respective file. What has been exported is just the specific computation, following the same logic that is currently applied to TFSharedEmbeddings.
Just reviewed the general approach on one model for now, and I have some questions before going further. If I understand correctly, the computation of the three different types of embeddings is split in three different ways to maximize the speedup, but I wonder whether this is documented by TF or just based on tests in one particular setup. Before adding the extra complexity, I would like to be sure it brings a speedup in almost all possible environments (CPU, GPU, multi-GPU, TPU) without any loss in memory footprint (one-hot encoding the token type ids seems harmless, but we never know).
As for putting those in modeling utils versus the model file, I agree with Lysandre that this breaks our philosophy of putting everything in each model file. I expressed the same reservations about TFSharedEmbeddings when it was introduced.
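For context, a rough sketch of the two lookup strategies being weighed here, with toy shapes (this is not the PR's code):

import tensorflow as tf

hidden_size = 8
type_vocab_size = 2  # token type vocabularies are tiny
token_type_ids = tf.constant([[0, 0, 1, 1]])  # (batch, seq)
table = tf.random.normal((type_vocab_size, hidden_size))

# Classic lookup with a gather
gathered = tf.gather(params=table, indices=token_type_ids)

# One-hot + matmul lookup, i.e. what "one-hot encoding the token type ids" amounts to
one_hot = tf.one_hot(tf.reshape(token_type_ids, [-1]), depth=type_vocab_size)
by_matmul = tf.reshape(tf.matmul(one_hot, table), shape=(1, 4, hidden_size))

tf.debugging.assert_near(gathered, by_matmul)  # same values, different ops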
name="token_type_embeddings", | ||
) | ||
self.embeddings = tf.keras.layers.Add() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name could be clearer I think: embeddings_sum
is more explicit.
@@ -501,10 +448,10 @@ def __init__(self, config, add_pooling_layer=True, **kwargs): | |||
) | |||
|
|||
def get_input_embeddings(self): | |||
return self.embeddings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how it can be used above (line 420) in a tf.matmul
if it's a layer and not a weight.
I basically took inspiration from the official implementation of the Transformer encoder available in the Google repo https://github.com/tensorflow/models/tree/master/official/nlp/keras_nlp . After running several experiments (only on CPU and GPU though), I ended up extracting from it an optimal version for each embedding.
I don't mind copy/pasting the same layers in all the concerned files if that is the recommended way. @sgugger @LysandreJik Will you be more confident if I create a version for each model and add the comment?
So this part confuses me. Why name the weight word_embeddings rather than just weight? Also, how does the new organization not screw up pretrained weights? From what I understand, the old weight names would not obviously match the new ones.
I agree it is confusing; if you prefer, it can be called "weight" everywhere. In any case, the weight is retrieved with a helper that handles the different attribute names:

def _get_word_embedding_weight(self, embedding_layer):
    if hasattr(embedding_layer, "word_embeddings"):
        return embedding_layer.word_embeddings
    elif hasattr(embedding_layer, "weight"):
        return embedding_layer.weight
    elif hasattr(embedding_layer, "decoder"):
        return embedding_layer.decoder
    else:
        # The word embedding weights are not built yet: run the dummy inputs
        # through the model to build them, then retry to get the attribute.
        self(self.dummy_inputs)

        if hasattr(embedding_layer, "word_embeddings"):
            return embedding_layer.word_embeddings
        elif hasattr(embedding_layer, "weight"):
            return embedding_layer.weight
        elif hasattr(embedding_layer, "decoder"):
            return embedding_layer.decoder
        else:
            return None
This is because before we were using a name scope, and we are not anymore in this PR. Defining a name scope or creating a layer amounts to the same thing here: in both cases the weight ends up with the same name, so loading pretrained checkpoints is not affected.
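One way to check this claim on either branch (a quick sketch; the exact name shown in the comment is an assumption):

from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-cased")

# Checkpoint matching is done on the fully-qualified variable names; the word
# embedding weight is expected to keep a path ending in ".../word_embeddings/weight:0".
for weight in model.weights[:5]:
    print(weight.name)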
Yes, having only "weight" makes more sense to me, and it would make the code easier to read. Thanks for explaining why the name of the weight doesn't change for loading!
I found another advantage of this new embedding computation: it allows our models to be compiled with XLA_GPU and XLA_TPU, which was not the case before. A small proof test on a machine with a GPU:

from transformers import TFBertModel
import tensorflow as tf

model = TFBertModel.from_pretrained("bert-base-cased")

@tf.function(experimental_compile=True)
def run():
    return model(model.dummy_inputs)

outputs = run()

On master this fails with an XLA compilation error; on this PR it works as expected, because the new embedding computation uses only ops that XLA can compile.
Now, each model has its own embedding layers directly in its modeling file.
Thanks for the modifications! This looks way better now, I think.
    if embeds is not None:
        return embeds

    model(model.dummy_inputs)
Add a comment here to say we retry after building the model, just in case it was not already?
@@ -118,7 +234,7 @@ def call(self, hidden_states, attention_mask=None, head_mask=None, output_attent
         attention_scores = tf.einsum("aecd,abcd->acbe", key_layer, query_layer)

         if attention_mask is not None:
-            # Apply the attention mask is (precomputed for all layers in TFElectraModel call() function)
+            # Apply the attention mask is (precomputed for all layers in TFBertModel call() function)
This should not be replaced, see comment above.
@@ -536,96 +652,41 @@ def create_position_ids_from_inputs_embeds(self, inputs_embeds):

         Returns: tf.Tensor
         """
-        seq_length = shape_list(inputs_embeds)[1]
-        position_ids = tf.range(self.padding_idx + 1, seq_length + self.padding_idx + 1, dtype=tf.int32)[tf.newaxis, :]
+        bsz, seq_length = shape_list(tensor=inputs_embeds)[:2]
bsz is a bit too short IMO, batch_size should be used (here and two lines below).
    def call(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
I understand MPNet's implementation has some token_type_ids it doesn't use, but I'd leave them here for now until there is a general fix (one that also deals with the PyTorch implementation). The tokenizer still returns those token_type_ids, so this would cause problems if a user feeds the output of a tokenizer to one of those models.
token_type_ids is not in the PyTorch implementation, so I think the tokenizer should be fixed at the same time as the TF model.
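To illustrate the concern (using bert-base-cased as a stand-in, since the exact MPNet behaviour is the open question here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded = tokenizer("hello world", return_tensors="tf")
print(list(encoded.keys()))  # includes 'token_type_ids'
# model(**encoded) would raise a TypeError if call() no longer accepted token_type_ids.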
@@ -132,96 +249,41 @@ def create_position_ids_from_inputs_embeds(self, inputs_embeds):

         Returns: tf.Tensor
         """
-        seq_length = shape_list(inputs_embeds)[1]
-        position_ids = tf.range(self.padding_idx + 1, seq_length + self.padding_idx + 1, dtype=tf.int32)[tf.newaxis, :]
+        bsz, seq_length = shape_list(tensor=inputs_embeds)[:2]
Same as before: bsz -> batch_size
tests/test_modeling_tf_common.py
embeds = getattr(embedding_layer, "weight", None)

if embeds is not None:
    return embeds

embeds = getattr(embedding_layer, "decoder", None)

if embeds is not None:
    return embeds

model(model.dummy_inputs)

embeds = getattr(embedding_layer, "weight", None)

if embeds is not None:
    return embeds

embeds = getattr(embedding_layer, "decoder", None)

if embeds is not None:
    return embeds

return None
Same comments as for modeling_utils (plus what are we testing if we just use the same code?)
LGTM in general. One thing I'm not 100% sure about is whether we really need to add Keras layers like tf.keras.layers.Add(). If we start doing this for the embeddings now, I'm wondering if we should do the same for all residual connections in the self-attention blocks.
In the absolute, yes we should. In an ideal world, every time TF provides a function/layer for doing something, we should use it, as it is part of the optimization process. I know and understand that it might seem confusing and start to diverge from what the PT code looks like.
Yes, this LGTM. I also agree with your explanations regarding the Add layers.
""" | ||
if mode == "embedding": | ||
return self._embedding(input_ids, position_ids, token_type_ids, inputs_embeds, training=training) |
This was a single matrix multiplication before, no?
Yes
        return dict(list(base_config.items()) + list(config.items()))

    def call(self, input_ids):
        flat_input_ids = tf.reshape(tensor=input_ids, shape=[-1])
Those are multiple operations that replaced a single matrix operation, no?
Yes.
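A toy check that the flatten, gather, and reshape pattern shown in the diff gives the same result as a single gather on the 2D ids (not the PR's code, just the idea):

import tensorflow as tf

vocab_size, hidden_size = 16, 4
table = tf.random.normal((vocab_size, hidden_size))
input_ids = tf.constant([[1, 2, 3], [4, 5, 6]])  # (batch, seq)

# Single gather on the 2D ids
direct = tf.gather(params=table, indices=input_ids)

# Flatten -> gather -> reshape, as in the diff excerpt above
flat_input_ids = tf.reshape(tensor=input_ids, shape=[-1])
flat_embeds = tf.gather(params=table, indices=flat_input_ids)
reshaped = tf.reshape(tensor=flat_embeds, shape=(2, 3, hidden_size))

tf.debugging.assert_near(direct, reshaped)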
What does this PR do?
This PR proposes a better implementation of the embedding layers for the BERT-like TF models. Another benefit of this cleanup is better computational performance.
This new implementation should be compatible with the upcoming rework of the resizing proposed in #9193. Similar work will be applied to TFSharedEmbeddings in a future PR. All slow/quick tests pass.
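For anyone who wants to eyeball the performance claim locally, a rough timing sketch (not the benchmark used for this PR):

import time

from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-cased")
inputs = model.dummy_inputs

model(inputs)  # warm-up, also builds the model

start = time.perf_counter()
for _ in range(100):
    model(inputs)
print(f"{(time.perf_counter() - start) / 100:.4f}s per forward pass")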
EDIT: I don't know why GitHub has some issues pinning the reviewers, so pinging @LysandreJik, @sgugger, and @patrickvonplaten.