New TF embeddings (cleaner and faster) #9418
Conversation
        super().build(input_shape=input_shape)

    def get_config(self):
What is this function needed for?
This is a required function for a layer that takes parameters in its __init__ to become serializable; see more detail in the docs: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer#get_config and https://www.tensorflow.org/guide/keras/custom_layers_and_models#you_can_optionally_enable_serialization_on_your_layers
Basically, this is what the @keras_serializable decorator does.
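For readers less familiar with the pattern, here is a minimal sketch of a serializable custom layer; the class name and arguments are made up and are not the PR's classes:

import tensorflow as tf

class ToyEmbedding(tf.keras.layers.Layer):
    def __init__(self, vocab_size, hidden_size, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size

    def build(self, input_shape):
        self.weight = self.add_weight(
            name="weight",
            shape=[self.vocab_size, self.hidden_size],
            initializer="random_normal",
        )
        super().build(input_shape)

    def call(self, input_ids):
        return tf.gather(params=self.weight, indices=input_ids)

    def get_config(self):
        # Without this, tf.keras cannot rebuild the layer from its config
        # (e.g. when saving and reloading a model that contains it),
        # because __init__ takes extra arguments.
        config = {"vocab_size": self.vocab_size, "hidden_size": self.hidden_size}
        base_config = super().get_config()
        return dict(list(base_config.items()) + list(config.items()))

layer = ToyEmbedding(vocab_size=10, hidden_size=4)
restored = ToyEmbedding.from_config(layer.get_config())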
        super().build(input_shape)

    def get_config(self):
Same question: why do we need this function?
        super().build(input_shape=input_shape)

    def get_config(self):
Why is the function required?
name="token_type_embeddings", | ||
) | ||
self.embeddings = tf.keras.layers.Add() |
Do we need this? This layer is just an "add" operation, no? Why is it called embeddings?
This is the optimized version of doing tensor + tensor + ... The other advantage of using this layer (besides computational performance) is that it runs some checks over the given tensors, such as shape compatibility. I named it embeddings because it represents the addition of all the embeddings.
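A tiny illustration of the equivalence, with toy tensors standing in for the word/position/token-type embeddings (not the PR's code):

import tensorflow as tf

word = tf.random.normal((2, 5, 8))
position = tf.random.normal((2, 5, 8))
token_type = tf.random.normal((2, 5, 8))

add = tf.keras.layers.Add()
summed = add([word, position, token_type])

# Same values as plain "+", but the layer also checks shape compatibility first.
tf.debugging.assert_near(summed, word + position + token_type)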
The name could be clearer I think: embeddings_sum is more explicit.
@@ -501,10 +448,10 @@ def __init__(self, config, add_pooling_layer=True, **kwargs):
        )

    def get_input_embeddings(self):
        return self.embeddings
does this still return the same type?
Yes! Still a tf.keras.layers.Layer object.
I don't understand how it can be used above (line 420) in a tf.matmul if it's a layer and not a weight.
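As an aside, the layer-versus-weight distinction can be shown with the stock tf.keras.layers.Embedding (an illustration only, not the PR's classes): a matmul needs the weight tensor the layer holds, not the layer object itself.

import tensorflow as tf

layer = tf.keras.layers.Embedding(input_dim=10, output_dim=4)
_ = layer(tf.constant([[1, 2]]))  # run once so the layer builds its weight

hidden_states = tf.random.normal((2, 4))
# The output projection multiplies by the weight matrix, not by the layer:
logits = tf.matmul(hidden_states, layer.embeddings, transpose_b=True)
print(logits.shape)  # (2, 10)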
I like this PR in general! Just wondering about two things:
From a quick look, you're factoring the embeddings computation into three classes that will live in modeling_tf_utils.py. Usually we try to be as explicit as possible and display every operation in a single file, while here we're applying different embedding operations in another file. I think this goes against our "everything in one file" principle.
Is there a good reason for the embeddings to be the exception to this rule? Personally, I would like to see directly in the model file that the embeddings are computed differently according to the matrix sizes; putting these layers in modeling_tf_utils.py makes that abstracted/hidden.
Good point @LysandreJik! Basically, most of the models share a similar embedding computation that stays inside their respective file. What has been exported is just the specific computation, following the same logic that is currently applied to TFSharedEmbeddings.
Just reviewed the general approach on one model for now, and I have some questions before going further. If I understand correctly, the computation of the three different types of embeddings is split in three different ways to maximize the speedup, but I wonder whether this is documented by TF or just based on tests in one particular setup. Before adding the extra complexity, I would like to be sure it brings a speedup in almost all possible environments (CPU, GPU, multi-GPU, TPU) without any loss in memory footprint (one-hot encoding the token type ids seems harmless, but we never know).
As for putting those in modeling utils versus the model file, I agree with Lysandre that this breaks our philosophy of putting everything in each model file. I expressed the same reservations about TFSharedEmbeddings when it was introduced.
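For context, a rough sketch of the two lookup strategies being weighed here, with toy shapes (this is not the PR's code):

import tensorflow as tf

hidden_size = 8
type_vocab_size = 2  # token type vocabularies are tiny
token_type_ids = tf.constant([[0, 0, 1, 1]])  # (batch, seq)
table = tf.random.normal((type_vocab_size, hidden_size))

# Classic lookup with a gather
gathered = tf.gather(params=table, indices=token_type_ids)

# One-hot + matmul lookup, i.e. what "one-hot encoding the token type ids" amounts to
one_hot = tf.one_hot(tf.reshape(token_type_ids, [-1]), depth=type_vocab_size)
by_matmul = tf.reshape(tf.matmul(one_hot, table), shape=(1, 4, hidden_size))

tf.debugging.assert_near(gathered, by_matmul)  # same values, different ops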
name="token_type_embeddings", | ||
) | ||
self.embeddings = tf.keras.layers.Add() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name could be clearer I think: embeddings_sum
is more explicit.
@@ -501,10 +448,10 @@ def __init__(self, config, add_pooling_layer=True, **kwargs): | |||
) | |||
|
|||
def get_input_embeddings(self): | |||
return self.embeddings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how it can be used above (line 420) in a tf.matmul
if it's a layer and not a weight.
I basically took inspiration from the official implementation of the Transformer encoder available in the Google repo https://github.com/tensorflow/models/tree/master/official/nlp/keras_nlp . After running several experiments (only on CPU and GPU though), I ended up extracting from it an optimal version for each embedding.
I don't mind copy/pasting the same layers in all the concerned files if that is the recommended way. @sgugger @LysandreJik Will you be more confident if I create a version for each model and add the comment?
So this part confuses me. Why name the weight word_embeddings rather than just weight? Also, how does the new organization not screw up pretrained weights? From what I understand, the old weight names would not obviously match the new ones.
I agree it is confusing; if you prefer, it can be called "weight" everywhere. In any case, the weight is retrieved with a helper that handles the different attribute names:

def _get_word_embedding_weight(self, embedding_layer):
    if hasattr(embedding_layer, "word_embeddings"):
        return embedding_layer.word_embeddings
    elif hasattr(embedding_layer, "weight"):
        return embedding_layer.weight
    elif hasattr(embedding_layer, "decoder"):
        return embedding_layer.decoder
    else:
        # The word embedding weights are not built yet: run the dummy inputs
        # through the model to build them, then retry to get the attribute.
        self(self.dummy_inputs)

        if hasattr(embedding_layer, "word_embeddings"):
            return embedding_layer.word_embeddings
        elif hasattr(embedding_layer, "weight"):
            return embedding_layer.weight
        elif hasattr(embedding_layer, "decoder"):
            return embedding_layer.decoder
        else:
            return None
This is because before we were using a name scope, and we are not anymore in this PR. Defining a name scope or creating a layer amounts to the same thing here: in both cases the weight ends up with the same name, so loading pretrained checkpoints is not affected.
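One way to check this claim on either branch (a quick sketch; the exact name shown in the comment is an assumption):

from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-cased")

# Checkpoint matching is done on the fully-qualified variable names; the word
# embedding weight is expected to keep a path ending in ".../word_embeddings/weight:0".
for weight in model.weights[:5]:
    print(weight.name)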
Yes, having only "weight" makes more sense to me, and it would make the code easier to read. Thanks for explaining why the name of the weight doesn't change for loading!
I found another advantage of this new embedding computation: it allows our models to be compiled with XLA_GPU and XLA_TPU, which was not the case before. A small proof test on a machine with a GPU:

from transformers import TFBertModel
import tensorflow as tf

model = TFBertModel.from_pretrained("bert-base-cased")

@tf.function(experimental_compile=True)
def run():
    return model(model.dummy_inputs)

outputs = run()

On master this fails with an XLA compilation error; on this PR it works as expected, because the new embedding computation uses only ops that XLA can compile.
Now, each model has its own embedding layers directly in its modeling file.
Thanks for the modifications! This looks way better now, I think.
    if embeds is not None:
        return embeds

    model(model.dummy_inputs)
Add a comment here to say we retry after building the model, just in case it was not already?
@@ -118,7 +234,7 @@ def call(self, hidden_states, attention_mask=None, head_mask=None, output_attent
         attention_scores = tf.einsum("aecd,abcd->acbe", key_layer, query_layer)

         if attention_mask is not None:
-            # Apply the attention mask is (precomputed for all layers in TFElectraModel call() function)
+            # Apply the attention mask is (precomputed for all layers in TFBertModel call() function)
This should not be replaced, see comment above.
@@ -536,96 +652,41 @@ def create_position_ids_from_inputs_embeds(self, inputs_embeds):

         Returns: tf.Tensor
         """
-        seq_length = shape_list(inputs_embeds)[1]
-        position_ids = tf.range(self.padding_idx + 1, seq_length + self.padding_idx + 1, dtype=tf.int32)[tf.newaxis, :]
+        bsz, seq_length = shape_list(tensor=inputs_embeds)[:2]
bsz is a bit too short IMO, batch_size should be used (here and two lines below).
    def call(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
I understand MPNet's implementation has some token_type_ids it doesn't use, but I'd leave them here for now until there is a general fix (one that also deals with the PyTorch implementation). The tokenizer still returns those token_type_ids, so this would cause problems if a user feeds the output of a tokenizer to one of those models.
token_type_ids is not in the PyTorch implementation, so I think the tokenizer should be fixed at the same time as the TF model.
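To illustrate the concern (using bert-base-cased as a stand-in, since the exact MPNet behaviour is the open question here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded = tokenizer("hello world", return_tensors="tf")
print(list(encoded.keys()))  # includes 'token_type_ids'
# model(**encoded) would raise a TypeError if call() no longer accepted token_type_ids.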
@@ -132,96 +249,41 @@ def create_position_ids_from_inputs_embeds(self, inputs_embeds):

         Returns: tf.Tensor
         """
-        seq_length = shape_list(inputs_embeds)[1]
-        position_ids = tf.range(self.padding_idx + 1, seq_length + self.padding_idx + 1, dtype=tf.int32)[tf.newaxis, :]
+        bsz, seq_length = shape_list(tensor=inputs_embeds)[:2]
Same as before: bsz -> batch_size
tests/test_modeling_tf_common.py
embeds = getattr(embedding_layer, "weight", None)

if embeds is not None:
    return embeds

embeds = getattr(embedding_layer, "decoder", None)

if embeds is not None:
    return embeds

model(model.dummy_inputs)

embeds = getattr(embedding_layer, "weight", None)

if embeds is not None:
    return embeds

embeds = getattr(embedding_layer, "decoder", None)

if embeds is not None:
    return embeds

return None
Same comments as for modeling_utils (plus what are we testing if we just use the same code?)
LGTM in general. One thing I'm not 100% sure about is whether we really need to add Keras layers like tf.keras.layers.Add(). If we start doing this for the embeddings now, I'm wondering if we should do the same for all residual connections in the self-attention blocks.
In the absolute, yes we should. In an ideal world, every time TF provides a function/layer for doing something, we should use it, as it is part of the optimization process. I know and understand that it might seem confusing and start to diverge from what the PT code looks like.
Yes, this LGTM. I also agree with your explanations regarding the Add layers.
""" | ||
if mode == "embedding": | ||
return self._embedding(input_ids, position_ids, token_type_ids, inputs_embeds, training=training) |
This was a single matrix multiplication before, no?
Yes
        return dict(list(base_config.items()) + list(config.items()))

    def call(self, input_ids):
        flat_input_ids = tf.reshape(tensor=input_ids, shape=[-1])
Those are multiple operations that replaced a single matrix operation, no?
Yes.
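A toy check that the flatten, gather, and reshape pattern shown in the diff gives the same result as a single gather on the 2D ids (not the PR's code, just the idea):

import tensorflow as tf

vocab_size, hidden_size = 16, 4
table = tf.random.normal((vocab_size, hidden_size))
input_ids = tf.constant([[1, 2, 3], [4, 5, 6]])  # (batch, seq)

# Single gather on the 2D ids
direct = tf.gather(params=table, indices=input_ids)

# Flatten -> gather -> reshape, as in the diff excerpt above
flat_input_ids = tf.reshape(tensor=input_ids, shape=[-1])
flat_embeds = tf.gather(params=table, indices=flat_input_ids)
reshaped = tf.reshape(tensor=flat_embeds, shape=(2, 3, hidden_size))

tf.debugging.assert_near(direct, reshaped)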
What does this PR do?
This PR proposes a better implementation of the embedding layers for the BERT-like TF models. Another benefit of this cleanup is better computational performance.
This new implementation should be compatible with the upcoming rework of the resizing proposed in #9193. Similar work will be applied to TFSharedEmbeddings in a future PR. All slow/quick tests pass.
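For anyone who wants to eyeball the performance claim locally, a rough timing sketch (not the benchmark used for this PR):

import time

from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-cased")
inputs = model.dummy_inputs

model(inputs)  # warm-up, also builds the model

start = time.perf_counter()
for _ in range(100):
    model(inputs)
print(f"{(time.perf_counter() - start) / 100:.4f}s per forward pass")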
EDIT: I don't know why GitHub has some issues pinning the reviewers, so pinging @LysandreJik, @sgugger, and @patrickvonplaten.