Add head_mask/decoder_head_mask for TF BART models #9639
Conversation
* Add `head_mask` and `decoder_head_mask` input arguments for TF BART-based models as a TF counterpart to the PR huggingface#9569
* Add `test_headmasking` functionality to `tests/test_modeling_tf_common.py`
* TODO: Add a test to verify that we can get a gradient back for importance score computation
@stancld, thanks so much for tackling this! I think it would be a great addition if we could add a …
Remove redundant #TODO note from tests/test_modeling_tf_common.py
Hey @patrickvonplaten, I hope this PR is ready for review. There's newly implemented `test_headmasking` functionality in `tests/test_modeling_tf_common.py`. It seems all checks have passed after rebasing this PR.
tests/test_modeling_tf_common.py (outdated)

```python
@@ -1068,10 +1141,13 @@ def _check_match_tokens(self, generated_ids, bad_words_ids):
        return False


global_rng = random.Random()
```
not sure if we want to have a global random seed, was this an intended change or just used for debugging?
Here, I followed the convention from the corresponding PyTorch testing file `test_modeling_common.py`, where `global_rng` is also defined for a similar purpose. However, I think it's okay to omit this change and leave it as it was :)
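For context, the PyTorch convention being referenced looks roughly like this (a simplified sketch of `tests/test_modeling_common.py`; the exact helper there differs slightly):

```python
import random

import torch

global_rng = random.Random()


def ids_tensor(shape, vocab_size, rng=None):
    # Draw from the module-level RNG unless a seeded one is supplied.
    rng = rng or global_rng
    total_dims = 1
    for dim in shape:
        total_dims *= dim
    values = [rng.randint(0, vocab_size - 1) for _ in range(total_dims)]
    return torch.tensor(values, dtype=torch.long).view(shape).contiguous()
```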
I would be more in favor of keeping it as it was.
@jplu I removed `global_rng` and left it as it was before the changes. Hopefully, this PR is now ready for a final review.
Very clean! Thanks a lot for the contribution.

I'm only wondering about the global random seed; what was the reason behind this change?

Also, @jplu, it would be great if you could take a quick look at whether this is all serving-compatible (I don't see a reason why it wouldn't be, though).
Thanks a lot for this really nice PR!! Before approving on my side, can you comment out these lines:

```python
@slow
def test_saved_model_with_hidden_states_output(self):
    pass

@slow
def test_saved_model_with_attentions_output(self):
    pass

def test_saved_model_creation(self):
    pass

def test_saved_model_creation_extended(self):
    pass
```

and run them with the environment variable `RUN_SLOW=1` in order to be sure that everything is ok. You can do it at least just for BART.

If all these tests pass, I will approve; otherwise, we will have to fix them.
```python
@@ -618,6 +622,75 @@ def check_encoder_attentions_output(outputs):
        self.assertEqual(model.config.output_hidden_states, True)
        check_encoder_attentions_output(outputs)

    def test_headmasking(self):
```
As this test is only for a few specific models, I would rather see it inside their respective test files and remove all the occurrences of `test_head_masking = False`.
I'd rather have that in a common file with `test_head_masking = False` in the other files; that's what we do for tests that are shared, and it gives the incentive of switching these to `True`.
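As an illustration of the pattern @sgugger describes, a minimal sketch (the test class name is hypothetical; `TFModelTesterMixin` is the shared tester in `tests/test_modeling_tf_common.py`):

```python
import unittest

from .test_modeling_tf_common import TFModelTesterMixin


class TFSomeModelModelTest(TFModelTesterMixin, unittest.TestCase):
    # Opt out of the shared test_headmasking; the common test checks this
    # flag and returns early, and flipping it to True later re-enables it.
    test_head_masking = False
```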
LGTM! Thanks for working on this @stancld
```python
arg_names = [*signature.parameters.keys()]
if "decoder_head_mask" in arg_names:  # necessary differentiation because of T5 model
    inputs["decoder_head_mask"] = head_mask
```
Nice!
Thanks for adding this. There might be a few assert/shapes to rework.
```python
@@ -230,6 +231,15 @@ def call(

        attn_weights = tf.nn.softmax(attn_weights, axis=-1)

        if layer_head_mask is not None:
            assert layer_head_mask.shape == (
```
I think we should use `shape_list` here. The assert might also need to be wrapped in some TF operation. @jplu will know better than me.
Oh, good catch!! The assert must also be replaced by a `tf.debugging.assert` equivalent.
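For reference, a hedged sketch of what the graph-compatible version could look like, assuming `shape_list` from `transformers.modeling_tf_utils` and the surrounding attention variables (`self.num_heads`, `bsz`, `tgt_len`, `src_len`) as in TF BART:

```python
import tensorflow as tf
from transformers.modeling_tf_utils import shape_list

# Inside the attention layer's call(), after the softmax:
if layer_head_mask is not None:
    # Graph-friendly replacement for the Python assert.
    tf.debugging.assert_equal(
        shape_list(layer_head_mask),
        [self.num_heads],
        message=f"Head mask for a single layer should be of size {self.num_heads}",
    )
    # Broadcast the per-head mask over the batch and sequence dimensions.
    attn_weights = tf.reshape(layer_head_mask, (1, -1, 1, 1)) * tf.reshape(
        attn_weights, (bsz, self.num_heads, tgt_len, src_len)
    )
    attn_weights = tf.reshape(attn_weights, (bsz * self.num_heads, tgt_len, src_len))
```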
```python
assert inputs["head_mask"].shape[0] == (
    len(self.layers)
), f"The head_mask should be specified for {len(self.layers)} layers, but it is for {inputs['head_mask'].shape[0]}."
```
Same here (I won't comment on the other modeling files, but obviously it applies to all of them).
Indeed, same here.
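Again for reference, a hedged sketch of a graph-compatible version of this module-level check (same `shape_list` assumption as above):

```python
if inputs["head_mask"] is not None:
    tf.debugging.assert_equal(
        shape_list(inputs["head_mask"])[0],
        len(self.layers),
        message=(
            f"The head_mask should be specified for {len(self.layers)} layers, "
            f"but it is for {shape_list(inputs['head_mask'])[0]}."
        ),
    )
```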
```python
assert inputs["head_mask"].shape[0] == (
    len(self.layers)
), f"The head_mask should be specified for {len(self.layers)} layers, but it is for {inputs['head_mask'].shape[0]}."
```
And same here.
And here as well!
I've just done further tests on your PR; the changes are not graph-compliant, and the following slow tests are failing:

One of the reasons is what @sgugger raised.
* Add back `head_mask` and `decoder_head_mask` to BART-based ...Model after the last commit
* Remove `head_mask` and `decoder_head_mask` from `input_dict` in TF `test_train_pipeline_custom_model`, as these two have different shapes than other input args (necessary for passing this test)
Hi @jplu, could you please review the changes I've made to the code and say whether the assertions are done more appropriately now? :)
I confirm that the assertions are done more appropriately now! Those four tests are among the most important ones for the TF code base (they are run in slow mode because, unfortunately, they take some time to execute). If you need some help to make them pass, I will be happy to help.
```python
# head_mask and decoder_head_mask have different shapes than other input args
if "head_mask" in inputs_dict:
    del inputs_dict["head_mask"]
if "decoder_head_mask" in inputs_dict:
    del inputs_dict["decoder_head_mask"]
```
Why remove them? Aren't we supposed to be able to train a Seq2Seq model with these arguments?
@jplu - Currently, I'm not fully sure how to implement this test with `head_mask` and `decoder_head_mask`, as they have different shapes than `input_ids`, `attention_mask`, etc. While the latter ones have shapes of `(batch_size, seq_len)`, the `head_mask` has `(num_layers, num_attention_heads)`. This results in the following error:
```python
>       X = tf.data.Dataset.from_tensor_slices(
            (inputs_dict, np.ones((self.model_tester.batch_size, self.model_tester.seq_length, num_labels, 1)))
        ).batch(1)
----------------------------------
self = Dimension(13), other = Dimension(5)

    def assert_is_compatible_with(self, other):
        """Raises an exception if `other` is not compatible with this Dimension.

        Args:
          other: Another Dimension.

        Raises:
          ValueError: If `self` and `other` are not compatible (see
            is_compatible_with).
        """
        if not self.is_compatible_with(other):
>           raise ValueError("Dimensions %s and %s are not compatible" %
                             (self, other))
E           ValueError: Dimensions 13 and 5 are not compatible

../../../../../../miniconda3/envs/bart/lib/python3.8/site-packages/tensorflow/python/framework/tensor_shape.py:281: ValueError
```
Do you have any idea of how to overcome this problem?
Furthermore, at this moment, this test does not consider `head_mask` for any BERT-like models, because `head_mask` is not tested for these models at all. I'm definitely willing to implement such testing for models other than BART-like ones as well, once we're sure about the proper implementation.
Ok, indeed in a tf dataset all the inputs must share the same shape. To overcome this, you should create your own `test_train_pipeline_custom_model` for each Seq2Seq model and replace the tf dataset with the dictionary.

Anyway, you should not modify a common test just to make the test pass for a single model.
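A minimal demonstration of the constraint being described (shapes illustrative):

```python
import numpy as np
import tensorflow as tf

# from_tensor_slices slices every input along its first dimension,
# so all inputs must agree there.
inputs = {
    "input_ids": np.ones((13, 5), dtype=np.int32),       # (batch_size, seq_len)
    "attention_mask": np.ones((13, 5), dtype=np.int32),  # (batch_size, seq_len)
}
tf.data.Dataset.from_tensor_slices(inputs)  # fine

# head_mask has shape (num_layers, num_attention_heads) instead:
inputs["head_mask"] = np.ones((2, 4), dtype=np.float32)
# tf.data.Dataset.from_tensor_slices(inputs)
# -> ValueError: Dimensions 13 and 2 are not compatible
```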
To be honest, I don't really agree here, @jplu. IMO it's fine to remove `head_mask` from the `inputs_dict`. We remove the `head_mask` for other tests as well, such as `transformers/tests/test_modeling_common.py` line 521 in fac7cfb (`if "head_mask" in inputs_dict:`) and line 1076 in fac7cfb (`blacklist_non_batched_params = ["head_mask", "decoder_head_mask"]`).

Also, the `test_train_pipeline_custom_model` is not meant to test head masking, and removing it just means that the test covers the exact same functionalities as before. All "normal" head-masking tests were enabled in this PR.

I'm fine with leaving it as is for now.
Hmm, ok, so that means you are not supposed to train BART with such arguments, right?
Contemporary research does not consider training models with `head_mask`. The head masking is applied during inference, once the model has been trained with all attention heads.
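A sketch of the inference-time usage this PR enables (checkpoint name illustrative; 1.0 keeps a head, 0.0 masks it):

```python
import numpy as np
import tensorflow as tf
from transformers import BartTokenizer, TFBartForConditionalGeneration

model = TFBartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
inputs = tokenizer("Head masking is applied at inference time.", return_tensors="tf")

# head_mask has shape (num_layers, num_attention_heads).
mask = np.ones(
    (model.config.encoder_layers, model.config.encoder_attention_heads),
    dtype=np.float32,
)
mask[:, 0] = 0.0  # disable the first head in every encoder layer
outputs = model(inputs["input_ids"], head_mask=tf.constant(mask))
```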
Ok, perfect then! I'm fine with leaving it like this :)

Do these tests pass now? If yes, I will approve the PR :)
@jplu I ran these four aforementioned tests for BART, and they all passed.
Ok then, it's perfect!! Great work @stancld!
Merging, thanks a lot for your efforts @stancld!!
This PR adds `head_mask` and `decoder_head_mask` input arguments for TF BART-based models. The full list of models is as follows:

This PR can be deemed a TF counterpart to the PR #9569.

Further information:

* I've added `test_headmasking` functionality to `tests/test_modeling_tf_common.py`.
* TODO: Add a test (as a part of `test_headmasking`) to verify that we can get a gradient back for importance score computation. I am not so familiar with TensorFlow; therefore, I am not fully sure about a TF equivalent to

Reviewer: @patrickvonplaten
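Regarding the gradient TODO, a hedged sketch of one possible check in TF using `tf.GradientTape` (checkpoint name illustrative; not necessarily the approach the PR had in mind):

```python
import tensorflow as tf
from transformers import BartTokenizer, TFBartModel

model = TFBartModel.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
inputs = tokenizer("gradient check for head importance", return_tensors="tf")

head_mask = tf.ones((model.config.encoder_layers, model.config.encoder_attention_heads))
with tf.GradientTape() as tape:
    tape.watch(head_mask)  # head_mask is a constant tensor, so watch it explicitly
    outputs = model(inputs["input_ids"], head_mask=head_mask)
    loss = tf.reduce_sum(outputs.last_hidden_state)

# A non-None gradient confirms head importance scores can be computed.
grads = tape.gradient(loss, head_mask)
assert grads is not None
```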