
RFC: Attention for Dense Networks on Keras #54

Merged 7 commits into tensorflow:master on Feb 11, 2019

Conversation

@roumposg (Contributor) commented Jan 15, 2019:

The feedback phase will be open for 2 weeks until 2019-01-30

Attention for Dense Networks on Keras

| Status        | Proposed |
| :------------ | :------- |
| **Author(s)** | Georgios Roumpos ([email protected]) |
| **Sponsors**  | Karmel Allison ([email protected]), Francois Chollet ([email protected]) |
| **Updated**   | 2019-01-15 |

Summary

This RFC proposes adding a layer for attention in tf.keras.layers that works with CNN/DNN networks.
Recently, people have had success using the attention mechanism with dense layers, e.g. in CNN+Attention or Transformer networks. tf.keras.layers is the recommended way to build models in TensorFlow, but it does not have an attention layer that works with CNN/DNN networks.

Note that this review focuses on dense networks, namely CNN/DNN. It does not cover Recurrent Neural Networks (RNNs).

@goldiegadde goldiegadde added the RFC: Proposed RFC Design Document label Jan 16, 2019
@goldiegadde goldiegadde self-assigned this Jan 16, 2019
> …proposal work with some configurations of RNN networks. Namely, when users create [tf.keras.layers.LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) with `return_sequences=True`, the rest works the same way as CNN.
Contributor:

My impression is that tf.keras.layers.LSTM returns the output in shape [sequence_length, batch_size, ...] while CNNs operate in [batch_size, sequence_length, ...].

So maybe mention here that you'll need a transpose to use your attention layer on an LSTM?

Contributor Author:

Thanks for catching this. But I could not find any documentation or examples on the output shape for LSTM. Can you give me a pointer to verify the shape?

Member:

The output shape for LSTM is [batch, timestep, unit] when return_sequences=True. In the case of return_sequences=False, the output shape is [batch, unit].
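For reference, a minimal sketch (assuming TF 2.x eager execution, which postdates this thread) that verifies the shapes described above:

```python
import tensorflow as tf

# Verify the LSTM output shapes discussed above.
x = tf.random.normal([4, 10, 8])  # [batch, timestep, feature]

seq_out = tf.keras.layers.LSTM(16, return_sequences=True)(x)
last_out = tf.keras.layers.LSTM(16, return_sequences=False)(x)

print(seq_out.shape)   # (4, 10, 16) -> [batch, timestep, unit]
print(last_out.shape)  # (4, 16)     -> [batch, unit]
```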

> Unfortunately, this technique does not cover sequence-to-sequence RNN models.
Contributor:

By "this technique" do you mean your proposal?

If so, I think your proposal should support masking, as it's very common.

Contributor Author:

Can you clarify your comment, please? In the call() method, users can pass a mask. Do you think we need to do something additional?

Contributor:

By masking I mean the input-dependent masking which you use when doing self-attention in a decoder (where the attention at position i is allowed to look at all positions j < i).

I think this should compose well with a manual mask but maybe providing an example would be nice.
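To make the request concrete, here is a rough sketch of such a causal (look-back-only) mask applied to attention logits; the function and names are illustrative, not part of the proposal:

```python
import tensorflow as tf

def causal_mask(length):
    # [length, length] lower-triangular matrix: 1.0 where j <= i, else 0.0.
    return tf.linalg.band_part(tf.ones([length, length]), -1, 0)

scores = tf.random.normal([2, 5, 5])             # [batch, Tq, Tv] attention logits
mask = causal_mask(5)[tf.newaxis, :, :]          # broadcast over the batch dimension
masked_scores = scores + (1.0 - mask) * -1e9     # push future positions towards -inf
weights = tf.nn.softmax(masked_scores, axis=-1)  # future positions get ~0 weight
```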

Contributor Author:

I see. This is a valid use case. Let me spend some more time to better understand this case and get back to you, thanks.

Contributor Author:

I thought about this, and I don't think this can be implemented as a manual mask or a composable layer. It needs to be a feature of the layer.

I added a couple of options about how this can be implemented in the "Self-attention" section. Let me know what you think, thanks.


> We propose to implement the following common attention layers:
>
> * `Attention`: Basic dot-product attention, a.k.a. Luong-style attention.
Contributor:

This section should include the equations for both forms, for clarity.

Contributor Author:

Thanks. I put pseudocode in the pydoc in the following section. Let me add that here, too (once I figure out how to do this in GitHub, sorry for the delay).
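For reference, the two forms are commonly written as follows (notation assumed here, not taken from the RFC; `w`, `W_q`, `W_v` are learned parameters):

```latex
% Dot-product (Luong-style) and additive (Bahdanau-style) attention scores
% for a query vector q_i and a value/key vector v_j.
\[
\mathrm{score}_{\mathrm{dot}}(q_i, v_j) = q_i^{\top} v_j,
\qquad
\mathrm{score}_{\mathrm{add}}(q_i, v_j) = w^{\top} \tanh\left(W_q\, q_i + W_v\, v_j\right)
\]
```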


> The output is of shape `[batch_size, Tq, dim]`.
>
> Following the pattern of other Keras layers, we pass the list `[query, value]`…
Contributor:

This feels awkward, especially passing a tuple as the mask argument. Why not have query_mask and value_mask arguments separately, for better self-documenting code?

Similarly for query and value; it's nice if users can tell easily which is which.

Contributor Author:

Francois suggested a list, because the same pattern is used in other places, such as tf.keras.layers.Add. See the code in https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/python/keras/layers/merge.py#L205

Contributor:

Let's discuss in the design review.

Contributor Author:

Sounds good, I am adding it to "Alternatives Considered".
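For context, the merge-layer calling pattern cited above (a single list-valued argument, as in `tf.keras.layers.Add`) looks like this:

```python
import tensorflow as tf

# The existing Keras merge-layer convention: one list-valued input.
a = tf.keras.Input(shape=(16,))
b = tf.keras.Input(shape=(16,))
summed = tf.keras.layers.Add()([a, b])  # list in, single tensor out
```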


> We will first work on the implementation for Tensorflow.
>
> ### Self-Attention
Contributor:

Most uses of self-attention that I know of are seq2seq and so want masking.

Contributor Author:

Can you clarify what you mean? I think the mask argument this API supports is good enough for DNN/CNN networks, such as the Transformer. What do you think?

> The Self-Attention variant can be implemented by passing the same tensor to both `query` and `value`.
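A minimal sketch of the usage described in the quoted sentence, assuming the `Attention` layer name from the proposal (the shipped API may differ):

```python
import tensorflow as tf

# Self-attention: pass the same tensor as both query and value.
x = tf.keras.Input(shape=(10, 32))                    # [batch_size, T, dim]
self_attended = tf.keras.layers.Attention()([x, x])   # -> [batch_size, T, dim]
```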

> ### Multi-Head Attention
Contributor:

I think this is covered by your proposal; just use multiple Attention layers, each representing one attention head.

Contributor Author:

Good point. It is not very pretty, but it works. Let me update the text.

@guillaumekln:

Multi-head attention is usually batched, so I think it requires a special case.

Contributor Author:

All tensors are assumed to be batched. Namely, input tensors are [batch_size, dim] etc. Is this what you mean?

@guillaumekln commented Jan 22, 2019:

I meant multi-head attention is usually a single layer that takes [batch_size, num_heads, time, dim], and not num_heads layers that take [batch_size, time, dim] as @alextp proposed.

Contributor Author:

Well, then the user needs to split the tensor into num_heads tensors of shape [batch_size, time, dim]. This is one way to do it.
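For illustration, the "multiple single-head layers" approach discussed above could look roughly like this (per-head projections plus a final concatenation; names and dimensions are illustrative, and the `Attention` layer is assumed as proposed):

```python
import tensorflow as tf

num_heads, head_dim = 4, 16
query = tf.keras.Input(shape=(8, 64))   # [batch_size, Tq, dim]
value = tf.keras.Input(shape=(12, 64))  # [batch_size, Tv, dim]

heads = []
for _ in range(num_heads):
    q = tf.keras.layers.Dense(head_dim)(query)  # per-head query projection
    v = tf.keras.layers.Dense(head_dim)(value)  # per-head value projection
    heads.append(tf.keras.layers.Attention()([q, v]))

# [batch_size, Tq, num_heads * head_dim]
output = tf.keras.layers.Concatenate(axis=-1)(heads)
```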

> …in the model output. The proposed techniques can be implemented as an additional feature in the Attention API.
>
> * https://arxiv.org/abs/1803.02155 describes how relative position…
Contributor:

It'd help to clarify here whether the position-dependent attention implementations compose with this layer or would require new layers.

Contributor Author:

Good point. For relative position representation, we need to modify the attention layer. Absolute position can be implemented as a separate layer that composes with this layer. Let me update the text.
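A rough sketch of the "separate, composable layer" option for absolute positions (a learned per-position vector is used here purely to keep the sketch short; the cited work uses a deterministic function of position, and all names are illustrative):

```python
import tensorflow as tf

class AddAbsolutePosition(tf.keras.layers.Layer):
    """Adds a per-position vector to a [batch, time, dim] input."""

    def __init__(self, max_length, dim, **kwargs):
        super().__init__(**kwargs)
        self.position_embedding = tf.keras.layers.Embedding(max_length, dim)

    def call(self, x):
        positions = tf.range(tf.shape(x)[1])           # [time]
        return x + self.position_embedding(positions)  # broadcasts over the batch

# Composes with the attention layer: encode positions first, then attend.
x = tf.keras.Input(shape=(10, 32))
x_pos = AddAbsolutePosition(max_length=10, dim=32)(x)
attended = tf.keras.layers.Attention()([x_pos, x_pos])
```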

> …how absolute position information can be added as a deterministic function of position.
>
> ### 2D and 3D
Contributor:

It feels to me like we should be able to support n-d attention (not just 2d or 3d) with ~the same code, so we should do it.

Contributor Author:

That's true. Let me update the text.

@LanceNorskog commented:

The amount of "does this but does not do that" cautions in the proposal confirms a suspicion I've had for a while: the Keras API is a little too low-level.

@ewilderj ewilderj changed the title Design review for "Attention for Dense Networks" [RFC] Attention for Dense Networks on Keras Jan 17, 2019
@ewilderj ewilderj changed the title [RFC] Attention for Dense Networks on Keras RFC: Attention for Dense Networks on Keras Jan 17, 2019
@roumposg (Contributor Author) commented:

@LanceNorskog thanks for the feedback. Do you have any suggestions on how to make it more powerful, starting from this review? Thanks.

> …and [tf.contrib.seq2seq.BahdanauAttention](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/BahdanauAttention) are implementations of dot-product (Luong) and additive (Bahdanau) Attention respectively for RNN in Tensorflow. This proposal is based on this…

A reviewer commented:

By "based", do you mean it will extend classes from tf.seq2seq? It's unclear to me if this proposal has this recent change in mind: tensorflow/tensorflow@b797012.

Contributor Author:

Thank you for your comment. I mean that it is inspired by it, and that the implementation is consistent. Let me update the text.

Yes, I am aware of this change. It covers RNN networks and moves in the same direction as my proposal, namely implementing attention as a Keras layer. Let me update the text to clarify that.

@karmel commented Feb 5, 2019:

Notes from the review meeting:

* Substantial changes?
    * Some alternatives were brought up: BaseAttention, another option Alex brought up, and Self-Attention.
    * Public comments were addressed with edits.
* Layers will need to share some code. A shared method is one option; a BaseAttention class has the advantage that inheritance is a common paradigm in Keras.
    * Would the base class be sufficient for all the variations of attention, e.g. multi-head attention, which has an extra dimension?
    * Multi-head could be done with reshaping, i.e. with a Reshape layer.
    * Reshaping is (maybe) expensive on TPU.
    * But the user shouldn't reshape; would a child class of the base attention be able to implement that?
    * [TODO]: create a simple GNMT/Transformer, not necessarily working, to show what this would look like.
* What about recurrent attention; will this work for that?
    * There have been attempts to add this to Keras that were complicated.
    * RNN attention is more complicated: the generated value has to be combined with states and fed to the next step, and inputs are only seen per timestep.
    * In other words, the implementation is totally different; it just happens to share a name.
    * [TODO]: ideally, come up with a name that distinguishes this from recurrent attention, e.g. BaseDenseAttention or BaseCausalAttention. Make sure it's clear in the docstring, and point to recurrent implementations.
    * Is there anything that is common, even if just the interface? The API should be reusable in some parts, but it's tricky. Probably not worth trying to force both into the same API.
* Arg passing: (query, value) and (q_mask, v_mask) versus named kwargs.
    * Named is nice, but the Keras convention is lists/tuples.
    * There are some downsides to not following the convention in terms of how data is passed through to Keras, i.e. layers that generate masks expect to pass them through as masks=(0 … N).
    * Conclusion: use inputs=(...), masks=(...).
* Self-attention wants to prevent the flow of information from the future to the past, which needs an additional mask. How would this work with the current proposal?
    * It would have to be implemented by the layer, with a user-passed flag.
    * causal_mask? Boolean.
    * In convolutions there is precedent for padding=causal, so we could reuse that, but "padding" doesn't make sense in this context.
    * use_causal_mask seems fine; analogous to use_bias.
    * But a causal mask only makes sense in self-attention. Throw an error in the normal (non-self-attention) case? It doesn't make sense in terms of the model, but it does make mathematical sense. Let people do it, as you can imagine cases where that makes sense. Throw an error if the sequences are different lengths.
* Sync to keras-team/keras: we can push out a project proposal for the Keras community.
* In "Attention Is All You Need" there is a "key"; where is that?
    * They always use the same tensor for key and value.
    * In seq2seq, there is a different key and value.
    * You could allow inputs=(query, value, key).
    * We should be able to flesh this out in the example noted above.
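For illustration only (this thread predates the implementation), the list-valued inputs/masks convention concluded above can be exercised against the `Attention` layer as it exists in current `tf.keras`; treat it as a sketch rather than the RFC's final API:

```python
import tensorflow as tf

# List-valued inputs and masks, as concluded above (query, value, optional key).
query = tf.random.normal([2, 8, 16])   # [batch_size, Tq, dim]
value = tf.random.normal([2, 12, 16])  # [batch_size, Tv, dim]
key = tf.random.normal([2, 12, 16])    # same length as value

q_mask = tf.ones([2, 8], dtype=tf.bool)         # [batch_size, Tq]
v_mask = tf.sequence_mask([10, 12], maxlen=12)  # [batch_size, Tv]

output = tf.keras.layers.Attention()([query, value, key],
                                      mask=[q_mask, v_mask])
print(output.shape)  # (2, 8, 16) -> [batch_size, Tq, dim]
```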

@ewilderj ewilderj merged commit 9904c5c into tensorflow:master Feb 11, 2019
@ewilderj ewilderj added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Feb 11, 2019
karllessard added a commit to karllessard/tensorflow-community that referenced this pull request May 10, 2019