Recurrent Attention: standalone machine translation example #11421

Closed

Conversation

@andhus (Contributor) commented on Oct 17, 2018:

Summary

Standalone example of recurrent attention as per @farizrahman4u's suggestion. There is thorough documentation in the script itself.

The script contains a base class for recurrent attention mechanisms. Its purpose is to make it simple to write custom attention mechanisms. The following is the main logic needed to implement a specific mechanism (by extending the base class):

def attention_call(self,
                   inputs,
                   cell_states,
                   attended,
                   attention_states,
                   attended_mask,
                   training=None):
    # only one attended sequence (verified in build)
    assert len(attended) == 1
    attended = attended[0]
    attended_mask = attended_mask[0]
    h_cell_tm1 = cell_states[0]

    # compute attention weights
    w = K.repeat(K.dot(h_cell_tm1, self.W_a), K.shape(attended)[1])
    u = K.dot(attended, self.U_a)
    e = K.exp(K.dot(K.tanh(w + u), self.v_a))

    if attended_mask is not None:
        e = e * K.cast(K.expand_dims(attended_mask, -1), K.dtype(e))

    # weighted average of attended
    a = e / K.sum(e, axis=1, keepdims=True)
    c = K.sum(a * attended, axis=1, keepdims=False)

    return c, [c]

The lines below summarize how the attention mechanism is used: an RNNCell is wrapped by the attention mechanism and the attended constants are provided to the RNN:

decoder = RNN(
    cell=DenseAnnotationAttention(
        cell=GRUCell(RECURRENT_UNITS),
        units=DENSE_ATTENTION_UNITS),
    return_sequences=True)
h1 = decoder(y_emb, constants=x_enc)

Related Issues

#11172 (+multiple previous issues and PRs linked from there)

PR Overview

The PR contains a single example script. It is under discussion which parts might make it into the core API.

  • [y] This PR requires new unit tests [y/n] (make sure tests are included)
    TODO: definitely needed if RNNAttentionCell is added to the core API. Tests should also be added to validate the implementation in this example.
  • [?] This PR requires to update the documentation [y/n] (make sure the docs are up-to-date)
  • [y] This PR is backwards compatible [y/n]
  • [n] This PR changes the current API [y/n] (all API changes need to be approved by fchollet)

batch_size=BATCH_SIZE,
epochs=EPOCHS,
validation_data=(
[target_seqs_train[:, :-1], input_seqs_train],
@andhus (Contributor, Author):

use validation data!

@farizrahman4u (Contributor):

@fchollet This is a very neat and thorough PR. Please review and discuss what parts need to be moved into the Keras API and what should stay in the example.

return K.max(K.stack([x_1, x_2], axis=-1), axis=-1, keepdims=False)

h2 = TimeDistributed(Lambda(dense_maxout))(concatenate([h1, y_emb]))
y_pred = TimeDistributed(Dense(target_tokenizer.num_words))(h2)
@andhus (Contributor, Author):

Softmax missing!

@andhus (Contributor, Author) commented on Oct 18, 2018:

Regarding TODO(4) in the docs: this diff clarifies the changes needed to improve the efficiency of the attention mechanism. It is a little less intuitive, which is why I left it out of this PR. It boosts training speed by about 50% (on a MacBook Pro CPU, TensorFlow backend).
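The linked diff is not reproduced here; a common way to obtain this kind of speedup (also hinted at by later review comments) is to precompute the dense projection of the attended sequence once per batch rather than inside every decoder step. A rough, hypothetical sketch using the names from the usage snippet above (the attention cell would have to be adapted to accept the precomputed projection as a second constant):

from keras.layers import Dense, TimeDistributed

# Hypothetical: project the encoder annotations once, outside the recurrent loop,
# instead of computing K.dot(attended, self.U_a) at every decoder step.
u_precomputed = TimeDistributed(
    Dense(DENSE_ATTENTION_UNITS, use_bias=False))(x_enc)
h1 = decoder(y_emb, constants=[x_enc, u_precomputed])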

@fchollet (Collaborator) left a comment:

Thanks for the PR. I think this is a useful example and we can include it. However, it seems quite long. Is there anything you could afford to leave out?

initializers,
regularizers,
constraints)
from keras.engine import (
@fchollet (Collaborator):

Style: don't import from engine (it's an internal factoring module). Instead do:

import keras
from keras import layers

Then use e.g. layers.Dense

[target_seqs_val[:, :-1], input_seqs_val],
target_seqs_val[:, 1:, None]))

# TODO add logic for greedy/beam search generation
@fchollet (Collaborator):

I think stopping the example at model.fit is too restrictive, this should be an end-to-end example showing how to do inference as well (like we do in the other translation example).

@andhus (Contributor, Author):

Totally agree. Beam search will make the example significantly longer and more complex (but it is the most relevant approach). I will add both greedy and beam-search inference to the example and then we can decide.

@farizrahman4u (Contributor):

@fchollet What about moving the RNNAttentionCell class to Keras?

@andhus (Contributor, Author) commented on Dec 27, 2018:

Hi @farizrahman4u, @gabrieldemarmiesse, @lvapeab, @fchollet! I found some time to properly validate this implementation (there were some subtle bugs) and fix all the remaining TODOs. It achieves "decent" performance on the given dataset in an hour on a K80 (the original paper used a 1000x larger dataset and trained for several days).

As discussed, the example is long - but it is also a complete replication of the (quite old but prominent) paper on recurrent attention, including beam-search readout. I think it serves as a good reference. As pointed out before, we can continue the discussion regarding whether some parts should be added to the core API (if/when there is time) and simplify this example accordingly.

@gabrieldemarmiesse (Contributor):

Thanks a lot @andhus for your work. This must have taken a lot of time. I'll take a look at it tomorrow for a first review, and I think @fchollet will also read it when he has more time.

@gabrieldemarmiesse (Contributor) left a comment:

Thanks for the PR. I'm not an expert in RNNs, but I hope I can help make this PR better.

return self.score < other.score

def __gt__(self, other):
return other.score > other.score
@gabrieldemarmiesse (Contributor):

I think there is a mistake here. Maybe return self.score > other.score ?

@andhus (Contributor, Author):

Good catch!

@andhus (Contributor, Author):

...the heap push/pop only uses __lt__, so it won't have affected the results.

elif len(beams_updated) < search_width:
# not full search width
heapq.heappush(beams_updated, new_beam)
elif new_beam.score > beams_updated[0].score:
@gabrieldemarmiesse (Contributor):

Maybe elif new_beam > beams_updated[0] ? Otherwise __gt__ and __lt__ are never used in your example.

@andhus (Contributor, Author):

Good point, I can do this - but they are used anyway, by heapq - that was the main reason for implementing the comparison methods on Beam.

@gabrieldemarmiesse (Contributor):

I see. Thanks for the explanation!
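A minimal standalone sketch of the pruning pattern discussed in this thread (the Beam class here is a simplified stand-in for the one in the example script; prune_beams is a hypothetical helper, not PR code):

import heapq


class Beam(object):
    """Simplified stand-in for the Beam class in the example script."""

    def __init__(self, score, tokens):
        self.score = score
        self.tokens = tokens

    def __lt__(self, other):
        return self.score < other.score

    def __gt__(self, other):
        return self.score > other.score


def prune_beams(candidate_beams, search_width):
    # `beams` is a min-heap ordered by Beam.__lt__, so beams[0] is always the
    # worst kept beam and is the one replaced when a better candidate arrives.
    beams = []
    for new_beam in candidate_beams:
        if len(beams) < search_width:
            heapq.heappush(beams, new_beam)
        elif new_beam > beams[0]:
            heapq.heapreplace(beams, new_beam)
    return sorted(beams, key=lambda b: b.score, reverse=True)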



if __name__ == '__main__':
DATA_DIR = 'data/wmt16_mmt'
@gabrieldemarmiesse (Contributor):

Can I suggest the following:

    from keras.utils.data_utils import get_file
    base_name = 'wmt16_mmt_'
    origin = 'http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/'
    get_file(base_name + 'train', origin=origin + 'training.tar.gz', untar=True)
    get_file(base_name + 'val', origin=origin + 'validation.tar.gz', untar=True)
    tar_file = get_file(base_name + 'test',
                        origin=origin + 'mmt16_task1_test.tar.gz',
                        untar=True)

    DATA_DIR = os.path.dirname(tar_file)

Taking "standalone" a step further.

@andhus (Contributor, Author):

Sure :D

@andhus (Contributor, Author) commented on Dec 28, 2018:

Since the extracted files won't have the base_name, I create a new cache_subdir instead - otherwise files with very generic names (train.en) end up in the datasets dir.

@gabrieldemarmiesse (Contributor):

Good idea, I didn't find a way to make a subdir (I didn't look much into it).
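For illustration, a hypothetical sketch of the cache_subdir approach described above (the subdirectory name is just an example, not necessarily what the PR uses):

import os
from keras.utils.data_utils import get_file

origin = 'http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/'
# Download and extract into a dedicated cache subdirectory so that generically
# named files (train.en, val.de, ...) don't land directly in ~/.keras/datasets.
train_path = get_file('train', origin=origin + 'training.tar.gz',
                      untar=True, cache_subdir='datasets/wmt16_mmt')
DATA_DIR = os.path.dirname(train_path)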

- NOTE that a different dataset (wmt14) is used in [1], which is _orders of
magnitude_ larger than the dataset used here (348M vs 0.35M words). The model
in [1] was trained for 252 hours (!) on a Tesla Quadro K6000, whereas for the
data in this example the model starts to overfit after < 1 hour (15 epochs)
@gabrieldemarmiesse (Contributor):

is < a typo?

@andhus (Contributor, Author):

No, I meant "less than", but it's better to type it out.

@andhus (Contributor, Author) commented on Dec 28, 2018:

@gabrieldemarmiesse This diff: https://github.com/andhus/keras/pull/6/files shows how we can skip 300+ lines by removing the base class AttentionCellWrapper (100 lines is just docs of the base class).

For the standalone example this probably makes sense. The drawback is that it adds more overhead to the implementation of the specific attention mechanism, DenseAnnotationAttention.

The most "pressing" need for (something like) the base class in the core API is that we need to use the private keras.engine.base_layer._collect_previous_mask method to extract the masks of the attended tensor(s). In the AttentionCellWrapper, the masks are extracted and explicitly passed to the attention_call (abstract) method. Or is there another way using only API functionality to extract the masks?
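For concreteness, the private call in question (as named above) is used roughly like this; `attended` stands for the list of attended tensors passed as constants:

# What the wrapper currently has to do to obtain the mask(s) of the attended
# tensor(s); relying on a private helper is the main argument for core-API support.
from keras.engine.base_layer import _collect_previous_mask

attended_mask = _collect_previous_mask(attended)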

@gabrieldemarmiesse (Contributor):

Good question. I don't know much about it. Maybe someone else can give some insight? @farizrahman4u do you know how we can avoid calling the private function?

@andhus (Contributor, Author) commented on Mar 27, 2019:

@gabrieldemarmiesse @farizrahman4u @fchollet I'd love to wrap up this one. It has been reviewed, trained until convergence, and sanity-checked. It adds clear value, as there are currently no attention examples.

The remaining question was whether to get rid of the base class or not. I vote for removing it, i.e. applying this diff https://github.com/andhus/keras/pull/6/files and getting rid of 300 lines. It was never intended for the example (but to standardize and remove boilerplate for attention cell wrappers in general). Given the current speed of progress :) I don't think it is reasonable to expect that this will be added to the core API in the near future (it can always be found in the history of this PR).

For reference (@gabrieldemarmiesse 4 Nov 2018):

On an organisation note (because I can't say I understand very well what is going on), I would suggest to

  1. Add this example to the examples directory, that is, merging this PR since there seems to be a consensus about the quality of this example.
  2. Discuss later, in another PR, what should move out of this example and into the codebase. This is because this step will surely include a rework of the documentation + tests.

I propose doing this in two steps because the time to process a PR is usually an exponential function of the changes.

@rbturnbull commented:

Hi @andhus - thanks so much for your work on this. I'm excited to be able to use this. I tried the standalone example in Keras 2.2.4 with TensorFlow 1.14.0. It died in the K.rnn call at line 2974 of tensorflow_backend.py, where it does:
output = tf.where(tiled_mask_t, output, states[0])
tf.where needs the x and y tensors to be the same shape, but output in the demo is cell_output concatenated with attention_h and states[0] (i.e. [?, 3000]), while states[0] is the cell state from the GRU, which is [?, 1000].

This is only a problem in K.rnn if there is masking, so when I turned off mask_zero in the target sequence embedding the code started to run.

I'm not sure where the breakdown in the logic is happening. Do you have any idea how this could be fixed?

In regards to the code generally, I have a few thoughts:

  • I really like that you have the AttentionCellWrapper class and I recommend that you keep it. It will make it easier to add in other types of attention such as the Multiplicative Attention from Luong's 2015 paper.
  • I think that it would be better to wrap the building of the u tensor into the DenseAnnotationAttention layer. Perhaps you could call the existing class DenseAnnotationAttentionCell and then make a new class called just DenseAnnotationAttention where in the call function you build the u tensor and then return RNN(cell=cell, return_sequences=True). Does that make sense?
  • I'm not sure that the name DenseAnnotationAttention will be clear for users. Maybe BahdanauAttention or AdditiveAttention would be clearer and match how the mechanism is talked about in the literature.
  • Finally, I think it's really important to be able to output the attention weights somehow, because these are often used as a kind of soft alignment (and are often visualized in papers). Do you know how this could be made an optional output?

Again, I'm very impressed by this and it would be great to see this part of Keras proper in the near future!

@JoyceCoder commented:

> [quotes @rbturnbull's comment above in full]

Hi @rbturnbull, I ran into the same problem. I pulled this repo locally and used the tensorflow_backend from this repo, which solved the problem. Maybe it can help you.

return output_texts, output_scores

# Translate first 3 samples from validation data
for input_text, target_text in zip(input_texts_val, target_texts_val)[:3]:
@todd-cook commented on Oct 30, 2019:

This is the only line that breaks when running under PY3; change line 926 to:

for input_text, target_text in list(zip(input_texts_val, target_texts_val))[:3]:

zip is eager in PY2, lazy in PY3
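A tiny illustration of the difference (not from the PR):

pairs = zip(['a', 'b', 'c'], [1, 2, 3])
# PY2: zip returns a list, so pairs[:3] works directly.
# PY3: zip returns a lazy iterator, so pairs[:3] raises TypeError.
first_three = list(pairs)[:3]  # works under both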

@todd-cook left a comment:

This is a great PR @andhus and I hope to see it merged soon so that I can evangelize how one can easily use attention in Keras.

I pulled the branch and ran it successfully using:
Keras==2.3.1
tensorflow_gpu == 2.0.0
Python 3.7

with one small modification: I changed line 926 so that it runs under PY3, where zip is lazily evaluated.

@bertsky commented on Feb 19, 2020:

Thanks @andhus for this outstanding PR! I hope this gets merged soon – it's been over 2 years (counting the earlier issues and PRs leading up to this one).

I am also in favour of keeping the base class in the example, as this would allow making follow-up PRs both for incorporating it into the base API and for adding other attention mechanisms (or features like alignment output/visualization, local attention etc) independently.

There is but one issue which I think should be addressed/fixed: With the current implementation, one cannot make use of Keras' layer sharing with DenseAnnotationAttention. (This is necessary to share the decoder weights when defining separate learning and inference models for the NMT encoder-decoder example.)

The reason is that the constructor of AttentionCellWrapper will assign the given cell directly to the instance, which causes the (inherited) attribute tracker to add it to _layers, and in turn the RNN.trainable_weights property will get to see the cell's weights as well. But AttentionCellWrapper's default implementation of that property already adds them. Hence there will be double references! As a fix, one can use the same trick as in RNN's constructor:

def __init__(self, cell, ...):
    # self.cell = cell  # <- direct assignment triggers attribute tracking (double weight refs)
    self._set_cell(cell)
...
@disable_tracking
def _set_cell(self, cell):
    # assign the wrapped cell without attribute tracking, same trick as in RNN's constructor
    self.cell = cell

@fchollet closed this on Dec 8, 2020.