From 83578ab3206344e2b9a23884a78726cdb492d4c7 Mon Sep 17 00:00:00 2001 From: "Georgios (George) Roumpos" Date: Tue, 15 Jan 2019 12:37:41 -0800 Subject: [PATCH 1/7] Design review for "Attention for Dense Networks" --- rfcs/20190115-dense-attention.md | 506 +++++++++++++++++++++++++++++++ 1 file changed, 506 insertions(+) create mode 100644 rfcs/20190115-dense-attention.md diff --git a/rfcs/20190115-dense-attention.md b/rfcs/20190115-dense-attention.md new file mode 100644 index 000000000..c81a145e7 --- /dev/null +++ b/rfcs/20190115-dense-attention.md @@ -0,0 +1,506 @@ +# Attention for Dense networks on Keras + +| Status | Proposed | +:-------------- |:-------------------------------------------------------------------------- | +| **Author(s)** | Georgios Roumpos (roumposg@google.com) | +| **Sponsors** | Karmel Allison (karmel@google.com), Francois Chollet (fchollet@google.com) | +| **Updated** | 2019-01-15 | + +## Objective and Motivation + +Recently people have had success using the Attention mechanism in dense layers, +e.g. CNN+Attention or Transformer networks. Some examples are the +["Attention is all you need"](https://arxiv.org/abs/1706.03762) paper, and +models in +[semantic text similarity](https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html). +`tf.keras.layers` is the recommended way to build models in tensorflow, but it +does not have a layer for attention that works with CNN/DNN networks. We would +like to contribute this capability. + +Keras is an API spec that can be implemented across different languages and +backends, and `tf.keras` is a particular implementation of that spec. This +document contains code examples for `tensorflow`, but the same API should work +everywhere. + +### Recurrent Neural Networks + +Although not the primary focus of this proposal, the Attention layers in this +proposal work with some configurations of RNN networks. +Namely, when users create +[tf.keras.layers.LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) +with `return_sequences=True`, the rest works the same way as CNN. + +Unfortunately, this technique does not cover sequence-to-sequence RNN models. +In these models, the value is the states of encoder, and the query is the input +of the decoder. The decoder needs to slide its input based on the timesteps, and +feed them one by one. So, the output of the attention layer at timestep T +affects the output at T+1. + +## Previous Work + +[tf.contrib.seq2seq.LuongAttention](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/LuongAttention) +and +[tf.contrib.seq2seq.BahdanauAttention](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/BahdanauAttention) +are implementations of dot-product (Luong) and additive (Bahdanau) Attention +respectively for RNN in Tensorflow. This proposal is based on this +implementation, but works with CNN/Dense networks. + +There is an implementation of Attention as a `tf.layers.Layer` subclass under +https://github.com/tensorflow/models/tree/master/official/transformer, +specifically in +https://github.com/tensorflow/models/blob/master/official/transformer/model/attention_layer.py. +That file implements dot-product attention proposed in this file, and also +supports multi-head. Our proposal is to expose such a method in +`tf.keras.layers`. In addition, our proposal creates variables inside the +`build()` method, rather than the `call()` method. + +There is ongoing work to add Attention in Keras, namely +https://github.com/keras-team/keras/pull/11421. 
That proposal addresses +Attention mechanism for RNN networks only. I cannot see a way to make it work +for CNN/Dense networks, which are the motivation for our proposal. + +https://github.com/keras-team/keras/issues/9263 contains an example of a Keras +Layer that implements a CNN+Attention network. That example merges CNN and +Attention into the same class, whereas our proposal is modular. In the +Examples section, we present an example of how to build a +CNN+Attention model. + +https://github.com/keras-team/keras/issues/7341 is a request to add an Attention +Layer. Our proposal will resolve that request. + +https://github.com/keras-team/keras/issues/7803 is a request for a Multi-Head +Attention Layer. Multi-Head Attention is not covered in this proposal, but can +be implemented as a follow-up, as discussed in the +[Multi-Head Attention](#multi-head-attention) section. + +There are a few more issues that request Attention for RNN. They are covered +either by https://github.com/keras-team/keras/pull/11421 or our proposal: + +* https://github.com/keras-team/keras/issues/5738 +* https://github.com/keras-team/keras/issues/4962 +* https://github.com/keras-team/keras/issues/2525 + +## Design Proposal + +We propose to implement the following common attention layers: + +* `Attention`: Basic dot-product attention, a.k.a. Luong-style attention. + Follows + [tf.contrib.seq2seq.LuongAttention](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/LuongAttention). + This attention has two forms. + * The first is standard dot-product attention, as described in: Minh-Thang + Luong, Hieu Pham, Christopher D. Manning. "Effective Approaches to + Attention-based Neural Machine Translation." EMNLP 2015. + https://arxiv.org/abs/1508.04025. + * The second is the scaled form inspired partly by the normalized form of + additive (Bahdanau-style) attention. To enable the second form, + construct the object with parameter `scale=True`. +* `AdditiveAttention`: Additive attention, a.k.a. Bahdanau-style attention. + Follows + [tf.contrib.seq2seq.BahdanauAttention](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/BahdanauAttention). + This attention has two forms. + * The first is additive attention, as described in: Dzmitry Bahdanau, + Kyunghyun Cho, Yoshua Bengio. "Neural Machine Translation by Jointly + Learning to Align and Translate." ICLR 2015. + https://arxiv.org/abs/1409.0473. + * The second is the normalized form. This form is inspired by the weight + normalization article: Tim Salimans, Diederik P. Kingma. "Weight + Normalization: A Simple Reparameterization to Accelerate Training of + Deep Neural Networks." https://arxiv.org/abs/1602.07868. To enable the + second form, construct the object with parameter `normalize=True`. + +## Detailed Design + +According to the general definition of attention (see +https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/lectures/lecture11.pdf), +"Given a set of vector values, and a vector query, attention is a technique to +compute a weighted sum of the values, dependent on the query." + +There are four input tensors: + +* `query` of shape `[batch_size, Tq, dim]` +* `value` of shape `[batch_size, Tv, dim]` +* `query_mask` (optional) of shape `[batch_size, Tq]`. Boolean tensor, + typically calculated from the query length tensor. Used to mask the output + tensor. This is similar to the `mask` argument of + [tf.keras.backend.rnn](https://www.tensorflow.org/api_docs/python/tf/keras/backend/rnn). +* `value_mask` (optional) of shape `[batch_size, Tv]`. 
Boolean tensor, + typically calculated from the value length tensor. It is used to mask + `value` elements beyond this length so they do not contribute to the result. + +The output is of shape `[batch_size, Tq, dim]`. + +Following the pattern of other Keras layers, we pass the list `[query, value]` +as `inputs` and we pass the list `[query_mask, value_mask]` as the `mask` +argument. Namely, the interface for `Attention` will be as follows: + +```python +class Attention(tf.keras.layers.Layer): + """Basic dot-product attention layer, a.k.a. Luong-style attention. + + The calculation follows the steps: + 1. Calculate scores with shape `[batch_size, Tq, Tv]` as a query-value + dot product: `scores = tf.matmul(query, value, transpose_b=True)`. + 2. Use scores to calculate a distribution with shape + `[batch_size, Tq, Tv]`: `distribution = tf.nn.softmax(scores)`. + 3. Use `distribution` to create a linear combination of `value` with + shape `batch_size, Tq, dim]`: + `return tf.matmul(distribution, value)`. + + Args: + scale: If `True`, will create a scalar variable to scale the attention + scores. + """ + + def __init__( + self, + scale=False, + **kwargs): + + def build(self, input_shape): + """Creates scale variable if scale==True.""" + + def call(self, inputs, mask=None): + """Applies basic dot-product attention. + + Args: + inputs: List of the following tensors: + * query: Query `Tensor` of shape `[batch_size, Tq, dim]`. + * value: Value `Tensor` of shape `[batch_size, Tv, dim]`. + mask: List of the following tensors: + * query_mask: A boolean mask `Tensor` of shape `[batch_size, Tq]`. + If given, the output will be zero at the positions where + `mask==False`. + * value_mask: A boolean mask `Tensor` of shape `[batch_size, Tv]`. + If given, will apply the mask such that values at positions where + `mask==False` do not contribute to the result. + Returns: + Attention outputs of shape `[batch_size, Tq, dim]`. + """ +``` + +Similarly, the interface for `AdditiveAttention` will be: + +```python +class AdditiveAttention(tf.keras.layers.Layer): + """Additive attention layer, a.k.a. Bahdanau-style attention. + + The calculation follows the steps: + 1. Reshape `query` and `value` into shapes `[batch_size, Tq, 1, dim]` + and `[batch_size, 1, Tv, dim]` respectively. + 2. Calculate scores with shape `[batch_size, Tq, Tv]` as a non-linear + sum: `scores = tf.reduce_sum(tf.tanh(query + value), axis=-1)` + 3. Use scores to calculate a distribution with shape + `[batch_size, Tq, Tv]`: `distribution = tf.nn.softmax(scores)`. + 4. Use `distribution` to create a linear combination of `value` with + shape `batch_size, Tq, dim]`: + `return tf.matmul(distribution, value)`. + + Args: + normalize: If True, will create scale and bias variables to normalize + scores. + """ + + def __init__( + self, + normalize=False, + **kwargs): + + def build(self, input_shape): + """Creates variables.""" + + def call(self, inputs, mask=None): + """Applies additive attention. + + Args: + inputs: List of the following tensors: + * query: Query `Tensor` of shape `[batch_size, Tq, dim]`. + * value: Value `Tensor` of shape `[batch_size, Tv, dim]`. + mask: List of the following tensors: + * query_mask: A boolean mask `Tensor` of shape `[batch_size, Tq]`. + If given, the output will be zero at the positions where + `mask==False`. + * value_mask: A boolean mask `Tensor` of shape `[batch_size, Tv]`. + If given, will apply the mask such that values at positions where + `mask==False` do not contribute to the result. 
+ Returns: + Attention outputs of shape `[batch_size, Tq, dim]`. + """ +``` + +The implementations for both Attention layers can be in the same file. They can +reuse a private method with the following signature: + +```python +def _apply_attention_scores(scores, value, value_mask=None): + """Applies attention scores to the given value tensor. + + Args: + scores: Scores tensor of shape `[batch_size, Tq, Tv]`. + value: Value tensor of shape `[batch_size, Tv, dim]`. + value_mask: A boolean mask `Tensor` of shape `[batch_size, Tv]`. + If given, will apply the mask such that values at positions where + `mask==False` do not contribute to the result. + + Returns: + Tensor of shape `[batch_size, Tq, dim]`. + """ +``` + +Implementations of other Attention mechanisms can reuse this method, as well. +So, that method can be made public. An alternative using inheritance is +discussed in the "Base Attention Class" section. The +rest of the code is specific to each Attention mechanism. + +Although not the primary focus of this proposal, the Attention layers work with +RNN networks, such as +[tf.keras.layers.LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM): +When creating the LSTM, users need to set `return_sequences=True`, and the rest +works the same way as CNN. It is unclear whether this method suffices to create +the most common RNN+Attention models. + +We will first work on the implementation for Tensorflow. + +### Self-Attention + +The Self-Attention variant can be implemented by passing the same tensor to both +`query` and `value`. + +### Multi-Head Attention + +This is an Attention variant proposed in +["Attention is all you need"](https://arxiv.org/abs/1706.03762). This variant is +not covered in our proposal. But could be implemented as an additional feature, +e.g. by adding a `num_heads` argument that defaults to 1. The implementation +will split the `query` and `value` tensors into `num_heads` tensors, calculate +attention for each pair, then stack the results. This transformation can be +implemented as a private method that is reused by all attention layers. + +### Transformer + +Transformer is a DNN+Attention network proposed in +["Attention is all you need"](https://arxiv.org/abs/1706.03762). There is an +implementation of it under +https://github.com/tensorflow/models/tree/master/official/transformer, which +uses a custom Attention implementation. Our proposal will simplify the +Transformer network constructions, because users can reuse the Attention layers, +rather than writing custom ones. + +### Position Representations + +DNN+Attention networks do not model relative or absolute position information in +their structure. Instead, position information is modeled as an additional term +in the model output. The proposed techniques can be implemented as an additional +feature in the Attention API. + +* https://arxiv.org/abs/1803.02155 describes how relative position + representation can be added to dot-product attention. +* https://arxiv.org/abs/1503.08895 and https://arxiv.org/abs/1706.03762 show + how absolute position information can be added as a deterministic function + of position. + +### 2D and 3D + +Attention is typically used in 1D sequences, such as text. It is conceivable +that people may try to use it with 2D or 3D sequences, such as with the outputs +of `Conv2D` or `Conv3D` layers. 
To make this work, users can follow the example code:

```python
query_orig_shape = tf.shape(query)
query = tf.reshape(query, [batch_size, -1, dim])
value = tf.reshape(value, [batch_size, -1, dim])
attention = tf.keras.layers.Attention()([query, value])
attention = tf.reshape(attention, query_orig_shape)
```

Alternatively, we could add the above reshapes inside the `Attention`
implementation, so that 2D and 3D sequences can be supported out of the box. But
given that this is a rare use case, we will not support it in the first version.

## Examples

Here is an example of a `tf.estimator` `model_fn`. It creates a CNN+Attention
model for query and value sequence features:

```python
def model_fn_with_attention(features, labels, mode):
  """Model function that uses Attention."""
  # Prepare the sequence embeddings for the query and value features.
  query_column = tf.contrib.feature_column.\
      sequence_categorical_column_with_vocabulary_file('query', vocabulary_file)
  value_column = tf.contrib.feature_column.\
      sequence_categorical_column_with_vocabulary_file('value', vocabulary_file)
  query_embedding_column, value_embedding_column = (
      tf.feature_column.shared_embedding_columns(
          [query_column, value_column], dimension=50))
  # Query embeddings with shape [batch_size, Tq, embedding_dim], where Tq is the
  # maximum sequence length for this batch.
  # Query length with shape [batch_size] and values in the range [0, Tq].
  query_embeddings, query_length = (
      tf.contrib.feature_column.sequence_input_layer(
          features, [query_embedding_column]))
  # Value embeddings with shape [batch_size, Tv, embedding_dim] and value length
  # with shape [batch_size].
  value_embeddings, value_length = (
      tf.contrib.feature_column.sequence_input_layer(
          features, [value_embedding_column]))

  # CNN layer.
  cnn_layer = tf.keras.layers.Conv1D(
      filters=100,
      kernel_size=4,
      # Use 'same' padding so outputs have the same shape as inputs.
      padding='same')
  # Query encoding of shape [batch_size, Tq, filters].
  query_seq_encoding = cnn_layer(query_embeddings)
  # Value encoding of shape [batch_size, Tv, filters].
  value_seq_encoding = cnn_layer(value_embeddings)

  # Query-value attention of shape [batch_size, Tq, filters]. The boolean masks
  # are built from the sequence length tensors.
  query_value_attention_seq = tf.keras.layers.Attention()(
      [query_seq_encoding, value_seq_encoding],
      mask=[_sequence_mask(query_length, query_seq_encoding),
            _sequence_mask(value_length, value_seq_encoding)])

  # Reduce over the sequence axis to produce encodings of shape
  # [batch_size, filters].
  query_encoding = tf.keras.layers.GlobalAveragePooling1D()(
      query_seq_encoding)
  query_value_attention = tf.keras.layers.GlobalAveragePooling1D()(
      query_value_attention_seq)

  # Concatenate the query and query-value attention encodings to produce a DNN
  # input layer.
  input_layer = tf.keras.layers.Concatenate()(
      [query_encoding, query_value_attention])

  # Add DNN layers, and use a head to return EstimatorSpec.
  # Follow the code in tf.estimator.DNNClassifier.
  # …

def _sequence_mask(lengths, t):
  """Creates a boolean mask of shape [batch_size, T] for the sequence tensor t."""
  return tf.sequence_mask(lengths, maxlen=tf.shape(t)[1])
```

There is ongoing work to implement `sequence_input_layer` as a Keras layer.
After this work is completed, the whole model above can be written as a
succession of Keras layers.
In particular, the input layer will be created as:

```python
query_input_layer = tf.feature_column.SequenceFeatures([query_embedding_column])
query_embeddings, query_length = query_input_layer(features)
value_input_layer = tf.feature_column.SequenceFeatures([value_embedding_column])
value_embeddings, value_length = value_input_layer(features)
```

Here is the same example using Keras. For simplicity, we skip `query_mask` and
`value_mask`, which can be created based on the sequence length.

```python
# Variable-length int sequences.
query_input = keras.Input(shape=(None,), dtype='int32')
value_input = keras.Input(shape=(None,), dtype='int32')

# Embedding lookup.
token_embedding = keras.layers.Embedding(max_tokens, dimension)
# Query embeddings of shape [batch_size, Tq, dimension].
query_embeddings = token_embedding(query_input)
# Value embeddings of shape [batch_size, Tv, dimension].
value_embeddings = token_embedding(value_input)

# CNN layer.
cnn_layer = keras.layers.Conv1D(
    filters=100,
    kernel_size=4,
    # Use 'same' padding so outputs have the same shape as inputs.
    padding='same')
# Query encoding of shape [batch_size, Tq, filters].
query_seq_encoding = cnn_layer(query_embeddings)
# Value encoding of shape [batch_size, Tv, filters].
value_seq_encoding = cnn_layer(value_embeddings)

# Query-value attention of shape [batch_size, Tq, filters].
query_value_attention_seq = keras.layers.Attention()(
    [query_seq_encoding, value_seq_encoding])

# Reduce over the sequence axis to produce encodings of shape
# [batch_size, filters].
query_encoding = keras.layers.GlobalAveragePooling1D()(
    query_seq_encoding)
query_value_attention = keras.layers.GlobalAveragePooling1D()(
    query_value_attention_seq)

# Concatenate the query and query-value attention encodings to produce a DNN
# input layer.
input_layer = keras.layers.Concatenate()(
    [query_encoding, query_value_attention])

# Add DNN layers, and create Model.
# ...
```

## Alternatives Considered

### Base Attention Class

We could have a base attention class that implements the
`apply_attention_scores()` method so that subclasses could reuse that method.
The base class could be as follows:

```python
class BaseAttention(tf.keras.layers.Layer):
  """Base Attention class.

  Implementations of attention mechanisms should inherit from this class, and
  reuse the `apply_attention_scores()` method.
  """

  def __init__(self, **kwargs):
    super(BaseAttention, self).__init__(**kwargs)

  def apply_attention_scores(self, scores, value, value_mask=None):
    """Applies attention scores to the given value tensor.

    Args:
      scores: Scores tensor of shape `[batch_size, Tq, Tv]`.
      value: Value tensor of shape `[batch_size, Tv, dim]`.
      value_mask: A boolean mask `Tensor` of shape `[batch_size, Tv]`.
        If given, will apply the mask such that values at positions where
        `mask==False` do not contribute to the result.

    Returns:
      Tensor of shape `[batch_size, Tq, dim]`.
    """
```

Pros:

* Inheritance is used extensively in Keras. This alternative follows that
  pattern.
* When external users inherit from `BaseAttention`, they can freely reuse the
  `apply_attention_scores()` method.
* When new common methods are added, such as `split_heads` and `combine_heads`
  for multi-headed attention, they can be added to this class.

Cons:

* Inheritance hierarchies in Python hinder troubleshooting.
Because there is + no compile-time linking, users need to perform regular-expression searches + across multiple files to discover which method is called. + +## Questions and Discussion Topics + +* The examples in this doc are in Tensorflow. Will the API work in other + languages and backends? +* What other implementations do we need for other languages/backends? +* What is the best interface for RNN? This proposal works for some basic + cases, but https://github.com/keras-team/keras/pull/11421 proposes a more + specialized interface. Perhaps we need both? +* What other arguments should we expose? E.g. Attention distribution + (probabilities) is calculated from attention scores using `softmax`. Maybe + we can expose a `distribution_fn`, of `probability_fn` argument that + defaults to `softmax`. +* We use terminology from + https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/lectures/lecture11.pdf. + Namely the inputs are `query` and `value`. Is this the accepted terminology? +* Are there any other common variants of Attention we should implement? From 5936b1b46f545de368e39ec79e4c6fa72586adac Mon Sep 17 00:00:00 2001 From: "Georgios (George) Roumpos" Date: Thu, 17 Jan 2019 10:30:41 -0800 Subject: [PATCH 2/7] Update 20190115-dense-attention.md --- rfcs/20190115-dense-attention.md | 44 +++++++++++++++++++++++++------- 1 file changed, 35 insertions(+), 9 deletions(-) diff --git a/rfcs/20190115-dense-attention.md b/rfcs/20190115-dense-attention.md index c81a145e7..7a90818b6 100644 --- a/rfcs/20190115-dense-attention.md +++ b/rfcs/20190115-dense-attention.md @@ -87,6 +87,15 @@ We propose to implement the following common attention layers: * `Attention`: Basic dot-product attention, a.k.a. Luong-style attention. Follows [tf.contrib.seq2seq.LuongAttention](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/LuongAttention). + The calculation follows the steps: + 1. Calculate scores with shape `[batch_size, Tq, Tv]` as a query-value + dot product: `scores = tf.matmul(query, value, transpose_b=True)`. + 2. Use scores to calculate a distribution with shape + `[batch_size, Tq, Tv]`: `distribution = tf.nn.softmax(scores)`. + 3. Use `distribution` to create a linear combination of `value` with + shape `batch_size, Tq, dim]`: + `return tf.matmul(distribution, value)`. + This attention has two forms. * The first is standard dot-product attention, as described in: Minh-Thang Luong, Hieu Pham, Christopher D. Manning. "Effective Approaches to @@ -98,6 +107,17 @@ We propose to implement the following common attention layers: * `AdditiveAttention`: Additive attention, a.k.a. Bahdanau-style attention. Follows [tf.contrib.seq2seq.BahdanauAttention](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/BahdanauAttention). + The calculation follows the steps: + 1. Reshape `query` and `value` into shapes `[batch_size, Tq, 1, dim]` + and `[batch_size, 1, Tv, dim]` respectively. + 2. Calculate scores with shape `[batch_size, Tq, Tv]` as a non-linear + sum: `scores = tf.reduce_sum(tf.tanh(query + value), axis=-1)` + 3. Use scores to calculate a distribution with shape + `[batch_size, Tq, Tv]`: `distribution = tf.nn.softmax(scores)`. + 4. Use `distribution` to create a linear combination of `value` with + shape `batch_size, Tq, dim]`: + `return tf.matmul(distribution, value)`. + This attention has two forms. * The first is additive attention, as described in: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. 
"Neural Machine Translation by Jointly @@ -269,8 +289,11 @@ The Self-Attention variant can be implemented by passing the same tensor to both ### Multi-Head Attention This is an Attention variant proposed in -["Attention is all you need"](https://arxiv.org/abs/1706.03762). This variant is -not covered in our proposal. But could be implemented as an additional feature, +["Attention is all you need"](https://arxiv.org/abs/1706.03762). This variant +can be implemented by using multiple attention layers, one for each head. + +If we later decide that we need a cleaner API, we can implement it as a +feature of attention layers, e.g. by adding a `num_heads` argument that defaults to 1. The implementation will split the `query` and `value` tensors into `num_heads` tensors, calculate attention for each pair, then stack the results. This transformation can be @@ -294,17 +317,20 @@ in the model output. The proposed techniques can be implemented as an additional feature in the Attention API. * https://arxiv.org/abs/1803.02155 describes how relative position - representation can be added to dot-product attention. + representation can be added to dot-product attention. This must be + implemented as a feature of attention layers. It cannot be done as a + separate composable layer. * https://arxiv.org/abs/1503.08895 and https://arxiv.org/abs/1706.03762 show how absolute position information can be added as a deterministic function - of position. + of position. This can be implemented as a separate keras layer that composes + with the `Embedding` and `Attention` layers. -### 2D and 3D +### 2D, 3D and n-D Attention is typically used in 1D sequences, such as text. It is conceivable -that people may try to use it with 2D or 3D sequences, such as with the outputs -of `Conv2D` or `Conv3D` layers. To make this work, users can follow the example -code: +that people may try to use it with 2D, 3D or n-D sequences, such as with the +outputs of `Conv2D` or `Conv3D` layers. To make this work, users can follow the +example code: ```python query_orig_shape = tf.shape(query) @@ -315,7 +341,7 @@ attention = tf.reshape(attention, query_orig_shape) ``` Alternatively, we could add the above reshapes inside the `Attention` -implementation, so that 2D and 3D sequences can be supported out of the box. But +implementation, so that n-D sequences can be supported out of the box. But given that this is a rare use case, we will not support it in the first version. ## Examples From 016927af893117a832d61b4be529c022bbfbe731 Mon Sep 17 00:00:00 2001 From: "Georgios (George) Roumpos" Date: Fri, 18 Jan 2019 09:11:47 -0800 Subject: [PATCH 3/7] Update 20190115-dense-attention.md --- rfcs/20190115-dense-attention.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/rfcs/20190115-dense-attention.md b/rfcs/20190115-dense-attention.md index 7a90818b6..3754027e5 100644 --- a/rfcs/20190115-dense-attention.md +++ b/rfcs/20190115-dense-attention.md @@ -42,8 +42,9 @@ affects the output at T+1. and [tf.contrib.seq2seq.BahdanauAttention](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/BahdanauAttention) are implementations of dot-product (Luong) and additive (Bahdanau) Attention -respectively for RNN in Tensorflow. This proposal is based on this -implementation, but works with CNN/Dense networks. +respectively for RNN in Tensorflow. There is ongoing work to implement those +as Keras layers. 
Our proposal will follow the same implementation details, +namely same mathematical operations, but will work with CNN/Dense networks. There is an implementation of Attention as a `tf.layers.Layer` subclass under https://github.com/tensorflow/models/tree/master/official/transformer, From 60825b7fd2392b33c93e927aab708f4a64603a76 Mon Sep 17 00:00:00 2001 From: "Georgios (George) Roumpos" Date: Fri, 18 Jan 2019 09:23:59 -0800 Subject: [PATCH 4/7] Update 20190115-dense-attention.md --- rfcs/20190115-dense-attention.md | 41 ++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/rfcs/20190115-dense-attention.md b/rfcs/20190115-dense-attention.md index 3754027e5..6622361f8 100644 --- a/rfcs/20190115-dense-attention.md +++ b/rfcs/20190115-dense-attention.md @@ -515,6 +515,47 @@ Cons: no compile-time linking, users need to perform regular-expression searches across multiple files to discover which method is called. +### Query, value and mask arguments + +An alternative to the `mask` argument would be to pass `query_mask` and +`value_mask` as separate arguments, namely: + +```python + def call(self, inputs, query_mask=None, value_mask=None): + """Applies basic dot-product attention. + + Args: + inputs: List of the following tensors: + * query: Query `Tensor` of shape `[batch_size, Tq, dim]`. + * value: Value `Tensor` of shape `[batch_size, Tv, dim]`. + query_mask: A boolean mask `Tensor` of shape `[batch_size, Tq]`. + If given, the output will be zero at the positions where + `mask==False`. + value_mask: A boolean mask `Tensor` of shape `[batch_size, Tv]`. + If given, will apply the mask such that values at positions where + `mask==False` do not contribute to the result. + Returns: + Attention outputs of shape `[batch_size, Tq, dim]`. + """ +``` + +Another variation would be to pass `query` and `value` as named arguments: + +```python + def call(self, query, value, query_mask=None, value_mask=None): +``` + +Pros: + +* Code is self-documenting. +* Could prevent some user bugs related to the ordering of arguments. + +Cons: + +* Passing arguments as lists is a pattern used in Keras layers, such as + `tf.keras.layers.Add`. E.g. see the code in + https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/python/keras/layers/merge.py#L205 + ## Questions and Discussion Topics * The examples in this doc are in Tensorflow. Will the API work in other From b46328aa23b92443a85df5b680e7d0ae1717d3e8 Mon Sep 17 00:00:00 2001 From: "Georgios (George) Roumpos" Date: Tue, 22 Jan 2019 09:32:30 -0800 Subject: [PATCH 5/7] Update 20190115-dense-attention.md --- rfcs/20190115-dense-attention.md | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/rfcs/20190115-dense-attention.md b/rfcs/20190115-dense-attention.md index 6622361f8..82f8e4f88 100644 --- a/rfcs/20190115-dense-attention.md +++ b/rfcs/20190115-dense-attention.md @@ -287,6 +287,34 @@ We will first work on the implementation for Tensorflow. The Self-Attention variant can be implemented by passing the same tensor to both `query` and `value`. +There is a common case that requires special treatment: decoder self-attention. +In this case, we need to prevent flow of information from the "future" towards +the "past". So, position `i` cannot attend to positions `j > i`. This can be +accomplished by masking the attention scores with a +[lower triangular matrix](https://en.wikipedia.org/wiki/Triangular_matrix). 
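For illustration, here is a minimal sketch of such a mask applied to the scores
tensor (the helper name and the additive `-1e9` masking trick are illustrative
only, not part of the proposed API):

```python
import tensorflow as tf

def _causal_mask_scores(scores):
  """Masks scores of shape [batch_size, T, T] so position i cannot attend to j > i."""
  t = tf.shape(scores)[-1]
  # Lower triangular matrix of ones: entry (i, j) is 1 iff j <= i.
  lower_triangular = tf.linalg.band_part(tf.ones([t, t]), -1, 0)
  # Push the disallowed positions towards -inf so they vanish after the softmax.
  return scores - 1.0e9 * (1.0 - lower_triangular)
```

The masked scores can then go through the usual `softmax` step described in the
Detailed Design section.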
+This variant is the "Masked attention" in Figure 1 of the +["Attention is all you need"](https://arxiv.org/abs/1706.03762) paper. + +This is a common case that we should cover. The mask needs to be applied to the +scores, so this cannot be implemented as a separate composable layer. It needs +to be a feature of the proposed attention layers. Because "masking" is a general +technique, we should choose a special name for this technique, such as "causal +mask". + +A causal mask can be implemented in the following ways. + +a. Add a constructor argument such as `causal_mask=False` in the proposed + attention layers. + * pro: No new classes are required. + * con: `causal_mask` makes no sense when `query` and `value` are different. + +b. Add special classes for self-attention, namely `SelfAttention` and + `AdditiveSelfAttention`, and use `causal_mask=False` as a constructor + argument. They can share most of the implementation details with the + `Attention` and `AdditiveAttention` classes. + * pro: Safer, easier to understand. + * con: Requires new classes. + ### Multi-Head Attention This is an Attention variant proposed in From c7e458426fdccd19e4bbd97c2f39e0fe02825816 Mon Sep 17 00:00:00 2001 From: "Georgios (George) Roumpos" Date: Fri, 25 Jan 2019 08:30:20 -0800 Subject: [PATCH 6/7] Update 20190115-dense-attention.md --- rfcs/20190115-dense-attention.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/rfcs/20190115-dense-attention.md b/rfcs/20190115-dense-attention.md index 82f8e4f88..d6dcb30f6 100644 --- a/rfcs/20190115-dense-attention.md +++ b/rfcs/20190115-dense-attention.md @@ -358,8 +358,9 @@ feature in the Attention API. Attention is typically used in 1D sequences, such as text. It is conceivable that people may try to use it with 2D, 3D or n-D sequences, such as with the -outputs of `Conv2D` or `Conv3D` layers. To make this work, users can follow the -example code: +outputs of `Conv2D` or `Conv3D` layers. In fact, recent research applies +self-attention to 2D images https://arxiv.org/abs/1805.08318. +To make n-D work with the proposed layers, users can follow the example code: ```python query_orig_shape = tf.shape(query) From 8ee865abbcbd280242b9f7cb68be16b2fa09337a Mon Sep 17 00:00:00 2001 From: "Georgios (George) Roumpos" Date: Mon, 11 Feb 2019 09:03:08 -0800 Subject: [PATCH 7/7] Update 20190115-dense-attention.md --- rfcs/20190115-dense-attention.md | 178 ++++++++++++++++++++++--------- 1 file changed, 125 insertions(+), 53 deletions(-) diff --git a/rfcs/20190115-dense-attention.md b/rfcs/20190115-dense-attention.md index d6dcb30f6..bdbd0b44f 100644 --- a/rfcs/20190115-dense-attention.md +++ b/rfcs/20190115-dense-attention.md @@ -1,10 +1,10 @@ # Attention for Dense networks on Keras -| Status | Proposed | +| Status | Accepted | :-------------- |:-------------------------------------------------------------------------- | | **Author(s)** | Georgios Roumpos (roumposg@google.com) | | **Sponsors** | Karmel Allison (karmel@google.com), Francois Chollet (fchollet@google.com) | -| **Updated** | 2019-01-15 | +| **Updated** | 2019-02-11 | ## Objective and Motivation @@ -25,14 +25,14 @@ everywhere. ### Recurrent Neural Networks Although not the primary focus of this proposal, the Attention layers in this -proposal work with some configurations of RNN networks. -Namely, when users create +proposal work with some configurations of RNN networks. 
Namely, when users +create [tf.keras.layers.LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) with `return_sequences=True`, the rest works the same way as CNN. -Unfortunately, this technique does not cover sequence-to-sequence RNN models. -In these models, the value is the states of encoder, and the query is the input -of the decoder. The decoder needs to slide its input based on the timesteps, and +Unfortunately, this technique does not cover sequence-to-sequence RNN models. In +these models, the value is the states of encoder, and the query is the input of +the decoder. The decoder needs to slide its input based on the timesteps, and feed them one by one. So, the output of the attention layer at timestep T affects the output at T+1. @@ -42,9 +42,9 @@ affects the output at T+1. and [tf.contrib.seq2seq.BahdanauAttention](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/BahdanauAttention) are implementations of dot-product (Luong) and additive (Bahdanau) Attention -respectively for RNN in Tensorflow. There is ongoing work to implement those -as Keras layers. Our proposal will follow the same implementation details, -namely same mathematical operations, but will work with CNN/Dense networks. +respectively for RNN in Tensorflow. There is ongoing work to implement those as +Keras layers. Our proposal will follow the same implementation details, namely +same mathematical operations, but will work with CNN/Dense networks. There is an implementation of Attention as a `tf.layers.Layer` subclass under https://github.com/tensorflow/models/tree/master/official/transformer, @@ -72,7 +72,7 @@ Layer. Our proposal will resolve that request. https://github.com/keras-team/keras/issues/7803 is a request for a Multi-Head Attention Layer. Multi-Head Attention is not covered in this proposal, but can be implemented as a follow-up, as discussed in the -[Multi-Head Attention](#multi-head-attention) section. +Multi-Head Attention section. There are a few more issues that request Attention for RNN. They are covered either by https://github.com/keras-team/keras/pull/11421 or our proposal: @@ -89,15 +89,16 @@ We propose to implement the following common attention layers: Follows [tf.contrib.seq2seq.LuongAttention](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/LuongAttention). The calculation follows the steps: - 1. Calculate scores with shape `[batch_size, Tq, Tv]` as a query-value - dot product: `scores = tf.matmul(query, value, transpose_b=True)`. - 2. Use scores to calculate a distribution with shape - `[batch_size, Tq, Tv]`: `distribution = tf.nn.softmax(scores)`. - 3. Use `distribution` to create a linear combination of `value` with - shape `batch_size, Tq, dim]`: - `return tf.matmul(distribution, value)`. + + 1. Calculate scores with shape `[batch_size, Tq, Tv]` as a query-key dot + product: `scores = tf.matmul(query, key, transpose_b=True)`. + 2. Use scores to calculate a distribution with shape `[batch_size, Tq, + Tv]`: `distribution = tf.nn.softmax(scores)`. + 3. Use `distribution` to create a linear combination of `value` with shape + `batch_size, Tq, dim]`: `return tf.matmul(distribution, value)`. This attention has two forms. + * The first is standard dot-product attention, as described in: Minh-Thang Luong, Hieu Pham, Christopher D. Manning. "Effective Approaches to Attention-based Neural Machine Translation." EMNLP 2015. 
@@ -105,21 +106,23 @@ We propose to implement the following common attention layers: * The second is the scaled form inspired partly by the normalized form of additive (Bahdanau-style) attention. To enable the second form, construct the object with parameter `scale=True`. + * `AdditiveAttention`: Additive attention, a.k.a. Bahdanau-style attention. Follows [tf.contrib.seq2seq.BahdanauAttention](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/BahdanauAttention). The calculation follows the steps: - 1. Reshape `query` and `value` into shapes `[batch_size, Tq, 1, dim]` - and `[batch_size, 1, Tv, dim]` respectively. - 2. Calculate scores with shape `[batch_size, Tq, Tv]` as a non-linear - sum: `scores = tf.reduce_sum(tf.tanh(query + value), axis=-1)` - 3. Use scores to calculate a distribution with shape - `[batch_size, Tq, Tv]`: `distribution = tf.nn.softmax(scores)`. - 4. Use `distribution` to create a linear combination of `value` with - shape `batch_size, Tq, dim]`: - `return tf.matmul(distribution, value)`. + + 1. Reshape `query` and `key` into shapes `[batch_size, Tq, 1, dim]` and + `[batch_size, 1, Tv, dim]` respectively. + 2. Calculate scores with shape `[batch_size, Tq, Tv]` as a non-linear sum: + `scores = tf.reduce_sum(tf.tanh(query + key), axis=-1)` + 3. Use scores to calculate a distribution with shape `[batch_size, Tq, + Tv]`: `distribution = tf.nn.softmax(scores)`. + 4. Use `distribution` to create a linear combination of `value` with shape + `batch_size, Tq, dim]`: `return tf.matmul(distribution, value)`. This attention has two forms. + * The first is additive attention, as described in: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR 2015. @@ -137,10 +140,12 @@ https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/lectures/lecture11. "Given a set of vector values, and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query." -There are four input tensors: +There are five input tensors: * `query` of shape `[batch_size, Tq, dim]` * `value` of shape `[batch_size, Tv, dim]` +* `key` (optional) of shape `[batch_size, Tv, dim]`. If not given, will use + `value` for both `key` and `value`, which is the most common case. * `query_mask` (optional) of shape `[batch_size, Tq]`. Boolean tensor, typically calculated from the query length tensor. Used to mask the output tensor. This is similar to the `mask` argument of @@ -151,8 +156,8 @@ There are four input tensors: The output is of shape `[batch_size, Tq, dim]`. -Following the pattern of other Keras layers, we pass the list `[query, value]` -as `inputs` and we pass the list `[query_mask, value_mask]` as the `mask` +Following the pattern of other Keras layers, we pass the list `[query, value, +key]` as `inputs` and we pass the list `[query_mask, value_mask]` as the `mask` argument. Namely, the interface for `Attention` will be as follows: ```python @@ -160,8 +165,8 @@ class Attention(tf.keras.layers.Layer): """Basic dot-product attention layer, a.k.a. Luong-style attention. The calculation follows the steps: - 1. Calculate scores with shape `[batch_size, Tq, Tv]` as a query-value - dot product: `scores = tf.matmul(query, value, transpose_b=True)`. + 1. Calculate scores with shape `[batch_size, Tq, Tv]` as a query-key dot + product: `scores = tf.matmul(query, key, transpose_b=True)`. 2. 
Use scores to calculate a distribution with shape `[batch_size, Tq, Tv]`: `distribution = tf.nn.softmax(scores)`. 3. Use `distribution` to create a linear combination of `value` with @@ -188,6 +193,9 @@ class Attention(tf.keras.layers.Layer): inputs: List of the following tensors: * query: Query `Tensor` of shape `[batch_size, Tq, dim]`. * value: Value `Tensor` of shape `[batch_size, Tv, dim]`. + * key: Optional key `Tensor` of shape `[batch_size, Tv, dim]`. If not + given, will use `value` for both `key` and `value`, which is the + most common case. mask: List of the following tensors: * query_mask: A boolean mask `Tensor` of shape `[batch_size, Tq]`. If given, the output will be zero at the positions where @@ -207,10 +215,10 @@ class AdditiveAttention(tf.keras.layers.Layer): """Additive attention layer, a.k.a. Bahdanau-style attention. The calculation follows the steps: - 1. Reshape `query` and `value` into shapes `[batch_size, Tq, 1, dim]` + 1. Reshape `query` and `key` into shapes `[batch_size, Tq, 1, dim]` and `[batch_size, 1, Tv, dim]` respectively. 2. Calculate scores with shape `[batch_size, Tq, Tv]` as a non-linear - sum: `scores = tf.reduce_sum(tf.tanh(query + value), axis=-1)` + sum: `scores = tf.reduce_sum(tf.tanh(query + key), axis=-1)` 3. Use scores to calculate a distribution with shape `[batch_size, Tq, Tv]`: `distribution = tf.nn.softmax(scores)`. 4. Use `distribution` to create a linear combination of `value` with @@ -237,6 +245,9 @@ class AdditiveAttention(tf.keras.layers.Layer): inputs: List of the following tensors: * query: Query `Tensor` of shape `[batch_size, Tq, dim]`. * value: Value `Tensor` of shape `[batch_size, Tv, dim]`. + * key: Optional key `Tensor` of shape `[batch_size, Tv, dim]`. If not + given, will use `value` for both `key` and `value`, which is the + most common case. mask: List of the following tensors: * query_mask: A boolean mask `Tensor` of shape `[batch_size, Tq]`. If given, the output will be zero at the positions where @@ -291,8 +302,8 @@ There is a common case that requires special treatment: decoder self-attention. In this case, we need to prevent flow of information from the "future" towards the "past". So, position `i` cannot attend to positions `j > i`. This can be accomplished by masking the attention scores with a -[lower triangular matrix](https://en.wikipedia.org/wiki/Triangular_matrix). -This variant is the "Masked attention" in Figure 1 of the +[lower triangular matrix](https://en.wikipedia.org/wiki/Triangular_matrix). This +variant is the "Masked attention" in Figure 1 of the ["Attention is all you need"](https://arxiv.org/abs/1706.03762) paper. This is a common case that we should cover. The mask needs to be applied to the @@ -315,18 +326,39 @@ b. Add special classes for self-attention, namely `SelfAttention` and * pro: Safer, easier to understand. * con: Requires new classes. +**Decision**: Use argument `use_causal_mask=False` in the proposed attention +layers and throw an error if sequence lengths are different + ### Multi-Head Attention This is an Attention variant proposed in ["Attention is all you need"](https://arxiv.org/abs/1706.03762). This variant can be implemented by using multiple attention layers, one for each head. -If we later decide that we need a cleaner API, we can implement it as a -feature of attention layers, -e.g. by adding a `num_heads` argument that defaults to 1. 
The implementation -will split the `query` and `value` tensors into `num_heads` tensors, calculate -attention for each pair, then stack the results. This transformation can be -implemented as a private method that is reused by all attention layers. +If we later decide that we need a cleaner API, we can implement it as a feature +of attention layers, e.g. by adding a `num_heads` argument that defaults to 1. +The implementation will reshape the `query` and `value` tensors by adding a +`num_heads` dimension, calculate attention, then reshape the results. This +transformation can be implemented as a private method that is reused by all +attention layers. The only requirement by the user is that the last dimension +`dim` of `query` and `value` tensors be divided by `num_heads`. + +Here is an example of how this can be implemented: + +```python +# Reshape to [batch_size, num_heads, T, dim/num_heads] +query_original_shape = tf.shape(query) +query = tf.reshape(query, [batch_size, tq, dim / num_heads, num_heads]) +query = tf.transpose(query, [0, 3, 1, 2]) +value = tf.reshape(value, [batch_size, tv, dim / num_heads, num_heads]) +value = tf.transpose(value, [0, 3, 1, 2]) +# Calculate Attention +… +# Reshape to original shape +attention = tf.transpose(attention, [0, 2, 3, 1]) +attention = tf.reshape(attention, query_original_shape) +return attention +``` ### Transformer @@ -335,8 +367,41 @@ Transformer is a DNN+Attention network proposed in implementation of it under https://github.com/tensorflow/models/tree/master/official/transformer, which uses a custom Attention implementation. Our proposal will simplify the -Transformer network constructions, because users can reuse the Attention layers, -rather than writing custom ones. +Transformer network constructions, because users can use the proposed Attention +layers, rather than writing custom ones. + +In particular, our proposal will replace the +[Attention](https://github.com/tensorflow/models/blob/master/official/transformer/model/attention_layer.py) +layer with the following differences: + +* `split_heads` and `combine_heads` methods will not be implemented in the first + version of the proposal. In later versions, they can be implemented as + discussed in the previous paragraph. +* The `bias` argument in + [Attention](https://github.com/tensorflow/models/blob/master/official/transformer/model/attention_layer.py) + is used to mask the `value` tensor. This is replaced by the `mask` argument in + our proposal. +* The `cache` argument in + [Attention](https://github.com/tensorflow/models/blob/master/official/transformer/model/attention_layer.py) + is only used for convenience, and is dropped in our proposal. +* [Attention](https://github.com/tensorflow/models/blob/master/official/transformer/model/attention_layer.py) + applies a Dense layer to the input tensors. This is dropped in our proposal. + Instead, the user will need to apply a Dense layer separately if they need to. +* [Attention](https://github.com/tensorflow/models/blob/master/official/transformer/model/attention_layer.py) + applies optional dropout to attention scores. This can be implemented as a + feature in a later version. + +Transformer is a complex network, but at its core it is a Dense layer plus +self-attention. 
A simplified transformer network is shown in the following +example: + +```python +def transformer(input_tensor): + dense_layer = tf.keras.layers.Dense(hidden_units) + attention_layer = tf.keras.layers.Attention() + net = dense_layer(input_tensor) + return attention_layer([net, net]) +``` ### Position Representations @@ -359,8 +424,9 @@ feature in the Attention API. Attention is typically used in 1D sequences, such as text. It is conceivable that people may try to use it with 2D, 3D or n-D sequences, such as with the outputs of `Conv2D` or `Conv3D` layers. In fact, recent research applies -self-attention to 2D images https://arxiv.org/abs/1805.08318. -To make n-D work with the proposed layers, users can follow the example code: +self-attention to 2D images, see https://arxiv.org/abs/1502.03044 and +https://arxiv.org/abs/1805.08318. To make n-D work with the proposed layers, +users can follow the example code: ```python query_orig_shape = tf.shape(query) @@ -371,8 +437,8 @@ attention = tf.reshape(attention, query_orig_shape) ``` Alternatively, we could add the above reshapes inside the `Attention` -implementation, so that n-D sequences can be supported out of the box. But -given that this is a rare use case, we will not support it in the first version. +implementation, so that n-D sequences can be supported out of the box. But given +that this is a rare use case, we will not support it in the first version. ## Examples @@ -499,6 +565,9 @@ input_layer = keras.layers.Concatenate()( ### Base Attention Class +**Decision**: Use this alternative. Come up with naming that distinguishes RNN +Attention. + We could have a base attention class that implements the `apply_attention_scores()` method so that subclasses could reuse that method. The base class could be as follows: @@ -546,6 +615,9 @@ Cons: ### Query, value and mask arguments +**Decision**: Do not use this alternative, because implicit masks would not +work, such as those produced by `tf.keras.layers.Embedding`. + An alternative to the `mask` argument would be to pass `query_mask` and `value_mask` as separate arguments, namely: @@ -576,14 +648,14 @@ Another variation would be to pass `query` and `value` as named arguments: Pros: -* Code is self-documenting. -* Could prevent some user bugs related to the ordering of arguments. +* Code is self-documenting. +* Could prevent some user bugs related to the ordering of arguments. Cons: -* Passing arguments as lists is a pattern used in Keras layers, such as - `tf.keras.layers.Add`. E.g. see the code in - https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/python/keras/layers/merge.py#L205 +* Passing arguments as lists is a pattern used in Keras layers, such as + `tf.keras.layers.Add`. E.g. see the code in + https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/python/keras/layers/merge.py#L205 ## Questions and Discussion Topics