
Sutskever, I., Vinyals, O., Le, Q. (2014, September 10). Sequence to sequence learning with neural networks. Retrieved November 20, 2019, from the arXiv database. This paper can be found here.

Summary: Deep Neural Networks (DNNs) are extremely powerful, but need input and target vectors of fixed dimensionality. For translation we work with sequences of words whose lengths we do not know beforehand, so we need a method that can handle sequences of arbitrary length. For this, the model uses two Long Short-Term Memory (LSTM) networks: one to read the input sequence and return a large fixed-dimensional vector representation, and one to extract the target sequence from that representation. These LSTM units can be seen as a very complex activation function acting on the previous state and a new input. The model reads the input sentence in reverse, which introduces many short-term dependencies between corresponding source and target words and makes the optimization easier. The goal is to estimate the conditional probability P(y1, ..., yT'|x1, ..., xT), where (y1, ..., yT') is the output sequence, (x1, ..., xT) the input sequence, and length T' not necessarily equal to length T. The input sequence (x1, ..., xT) is converted into a fixed-dimensional representation v, and then P(y1, ..., yT'|x1, ..., xT) is computed as the product over t (from 1 to T') of P(yt|v, y1, ..., yt-1). Each P(yt|v, y1, ..., yt-1) distribution is represented using a softmax over all the words in the vocabulary. The training objective is to maximize 1/|s| * the sum of log P(T|S) over the training set s, where T is the correct translation of source sentence S and the pair (S, T) is in s. After training, the most likely translation is found as T̂ = argmax_T P(T|S). While generating the output sentence, B partial hypotheses are stored, and after each step of appending a word or the end-of-sequence symbol only the B most likely hypotheses are kept (a beam search decoder).
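As a rough illustration of this setup (not the authors' code), the PyTorch sketch below reads the reversed source sentence with one LSTM, uses its final state as the fixed-dimensional representation v, lets a second LSTM produce a softmax distribution over the target vocabulary at each step, and decodes with a small beam search; all dimensions, names and the single-sentence batching are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)  # logits for the softmax over the target vocabulary

    def forward(self, src, tgt_in):
        # Read the source sentence in reverse and keep only the final LSTM state:
        # this state is the fixed-dimensional representation v of the input.
        _, v = self.encoder(self.src_emb(src.flip(dims=[1])))
        # The decoder starts from v and models P(y_t | v, y_1, ..., y_{t-1}).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), v)
        return self.out(dec_out)

def beam_search(model, src, bos, eos, B=4, max_len=50):
    """Keep the B most likely partial hypotheses after every decoding step."""
    with torch.no_grad():
        _, state = model.encoder(model.src_emb(src.flip(dims=[1])))
        beams = [([bos], 0.0, state)]  # (tokens, log-probability, decoder state)
        for _ in range(max_len):
            candidates = []
            for tokens, score, st in beams:
                if tokens[-1] == eos:  # finished hypotheses stay as they are
                    candidates.append((tokens, score, st))
                    continue
                inp = model.tgt_emb(torch.tensor([[tokens[-1]]]))
                out, new_st = model.decoder(inp, st)
                logp = torch.log_softmax(model.out(out[0, -1]), dim=-1)
                top = torch.topk(logp, B)
                for lp, idx in zip(top.values, top.indices):
                    candidates.append((tokens + [idx.item()], score + lp.item(), new_st))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
        return beams[0][0]
```

Training such a model would maximize the summed log P(T|S) over the training pairs, e.g. with a cross-entropy loss on the shifted target sequence.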

Kalchbrenner, N., Blunsom, P. (2013, October). Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1700-1709). Association for Computational Linguistics. Retrieved November 20, 2019, from the ACL Anthology. This paper can be found here.

Nal Kalchbrenner is in Amsterdam, so even though the code is no longer on his website, we could email him or maybe even schedule a meeting.

Summary: Recurrent Continuous Translation Models (RCTMs) are probabilistic translation models that represent sequences of words in a continuous manner. The translation is generated by a target Recurrent Language Model (RLM), and the conditioning on the source sentence is modelled using a Convolutional Sentence Model (CSM). In contrast to n-gram models, the RLM makes no Markov assumptions about the dependencies between the words in the target sentence. A model just using paired discrete sequences would disregard the similarities between different pairs and lead to sparsity issues, especially for longer sequences. Continuous representations, however, are able to capture morphological, syntactic and semantic similarity and overcome the sparsity issues. The paper defines two different models: one that conditions the target sequence directly on a representation of the whole source sentence, and a second one that introduces an intermediate n-gram representation between the source and the target sentence. The probability of a translation is easy to compute with an RCTM, and translations can be generated directly from the model. The RCTM estimates the probability P(f|e) of target sentence f = f1, ..., fm being the translation of a source sentence e = e1, ..., ek, where the probability of each fi in f is conditioned on f1:i-1 and e. The underlying RLM factorizes P(f) as the product over i of P(fi|f1:i-1). Each word in f comes from a vocabulary V, and the RLM consists of three transformations: an input vocabulary transformation, a recurrent transformation, and an output vocabulary transformation. Using these transformations and the index of a word in V, the conditional distribution P(fi = v|f1:i-1) can be computed for any v in V. The CSM does not need a parser, which means it can also be used for languages for which no accurate parser is available. The first model, RCTM I, has a bias towards shorter sentences, and its representation of the source sentence constrains all target words uniformly, instead of letting different target words depend more strongly on certain parts of the source sentence. This is addressed in RCTM II, which uses a truncated CSM and then an inverted CSM to go from the source sentence, to source n-grams, to target n-grams and finally to the target sentence.
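A minimal sketch of such a target RLM (not the authors' code, and omitting the convolutional conditioning on the source sentence e) might look as follows in PyTorch; the names I, R and O mirror the three transformations mentioned above, everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

class RLM(nn.Module):
    """Recurrent language model over target words f_1, ..., f_m (source conditioning omitted)."""
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.I = nn.Embedding(vocab_size, hidden)           # input vocabulary transformation
        self.R = nn.Linear(hidden, hidden, bias=False)      # recurrent transformation
        self.O = nn.Linear(hidden, vocab_size, bias=False)  # output vocabulary transformation

    def forward(self, f):
        """f: LongTensor of word indices, shape (seq_len,); returns log P(f_i | f_{1:i-1}) per position."""
        emb = self.I(f)                      # continuous representation of each word
        h = torch.zeros(self.R.in_features)  # hidden state summarizing the full history (no Markov assumption)
        logps = []
        for i in range(len(f)):
            logps.append(torch.log_softmax(self.O(h), dim=-1)[f[i]])  # predict f_i from f_{1:i-1}
            h = torch.sigmoid(emb[i] + self.R(h))                     # absorb f_i into the history
        return torch.stack(logps)  # log P(f) is the sum of these conditional log-probabilities

# Example: the log-probability the model assigns to a target sentence.
# rlm = RLM(vocab_size=10000)
# log_p_f = rlm(torch.tensor([12, 407, 3])).sum()
```

In the full RCTM the conditioning on the source sentence would enter these computations as well, via the CSM representation in RCTM I and via the n-gram representations in RCTM II.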

Bahdanau, D., Cho, K., Bengio, Y. (2014, September 1). Neural machine translation by jointly learning to align and translate. Retrieved November 20, 2019, from the arXiv database. This paper can be found here.

Summary: This paper presents a model that extends what can be done by an encoder-decoder model (like in the first paper) by using an automatic (soft-)search for relevant parts of the source sentence while predicting each word in the target sentence. The problem with a plain encoder-decoder model is that the neural network has to compress all the necessary information of the source sentence into a fixed-length vector. The extended model instead stores the source sentence as a sequence of vectors and chooses a subset of these vectors for each output word, by paying more attention to certain parts of the input sentence. The general encoder-decoder model is explained as well, including references to the implementation in the first paper. The decoder uses RNN hidden states, annotations (vectors encoding the input words) and context vectors (weighted sums of the annotations). The decoder also uses an alignment model, which scores how well the input around a position matches the output at the current position, and which is trained jointly with the rest of the model by backpropagating the gradient of the cost function. The encoder RNN is bidirectional: the sentence is read once forwards and once backwards, and the hidden states in both directions together make up the annotations. Each annotation therefore contains a summary of both the preceding and the following words. The soft alignment, for example, lets the choice of an article depend on the word that follows it, and with the extended model longer sentences were translated much better.
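As a rough sketch of this attention step (not the authors' code), the following PyTorch module shows one plausible form of the alignment model and context vector described above; the additive scoring function, names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_hidden, enc_hidden, attn_dim=128):
        super().__init__()
        self.W = nn.Linear(dec_hidden, attn_dim, bias=False)  # transforms the previous decoder state
        self.U = nn.Linear(enc_hidden, attn_dim, bias=False)  # transforms each annotation
        self.v = nn.Linear(attn_dim, 1, bias=False)           # reduces each score to a scalar

    def forward(self, prev_state, annotations):
        """prev_state: (dec_hidden,); annotations: (src_len, enc_hidden), e.g. the
        concatenated forward/backward hidden states of a bidirectional encoder RNN."""
        # Alignment scores: how well the input around position j matches the next output word.
        scores = self.v(torch.tanh(self.W(prev_state) + self.U(annotations))).squeeze(-1)
        # Attention weights over the source positions.
        alpha = torch.softmax(scores, dim=0)
        # Context vector: weighted sum of the annotations.
        context = (alpha.unsqueeze(-1) * annotations).sum(dim=0)
        return context, alpha
```

At each decoding step the context vector returned here would be fed, together with the previous decoder state and the previously generated word, into the decoder RNN.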

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y. (2014, June 3). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Retrieved November 20, 2019, from the arXiv database. This paper can be found here.

This paper comes with a description of the suggested architecture in the appendix, and it seems to be the most comprehensible and yet interesting and challenging paper, so we could use it hand in hand with the PyTorch/TensorFlow implementations we find.

Summary: This paper seems to be the one the first paper compared itself to (and the first paper had better results). Two RNNs are used that act as an encoder-decoder pair, as part of a standard phrase-based statistical machine translation (SMT) system. This model uses the same kind of context vector and encoder and decoder as the first paper, but it does not require the very complicated long short-term memory (LSTM) unit; the output distribution over target words remains a simple softmax. A gradient-based algorithm can be used to estimate the model parameters. Once trained, the model can be used to generate a translation, or to judge a given translation pair (by its probability). The paper also proposes a new activation function, which is motivated by the LSTM unit and has some memory, but is much simpler. The new unit has a reset gate and an update gate; each gate uses the sigmoid function and two weight matrices that must be learned, and takes as input the current input and the previous hidden state. This mechanism controls how much older information is used and how much influence it has on the translation. Each hidden unit has its own reset and update gates, which are crucial for getting meaningful results. The results can be further improved by combining the model with a continuous space language model (CSLM), which suggests the two contributions are orthogonal.
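A minimal sketch of such a gated hidden unit (an illustration, not necessarily the paper's exact formulation) could look like this in PyTorch; the weight names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class GatedUnit(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # Two weight matrices per gate: one for the current input, one for the previous hidden state.
        self.Wr, self.Ur = nn.Linear(input_dim, hidden_dim), nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.Wz, self.Uz = nn.Linear(input_dim, hidden_dim), nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W,  self.U  = nn.Linear(input_dim, hidden_dim), nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x, h_prev):
        r = torch.sigmoid(self.Wr(x) + self.Ur(h_prev))  # reset gate: how much old information to drop
        z = torch.sigmoid(self.Wz(x) + self.Uz(h_prev))  # update gate: how much old information to keep
        h_tilde = torch.tanh(self.W(x) + self.U(r * h_prev))  # candidate state from the reset history
        return z * h_prev + (1 - z) * h_tilde  # mix the old state and the candidate state
```

PyTorch's built-in nn.GRUCell implements essentially this kind of gated unit, so in practice one would use that rather than a hand-rolled version.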
