Hint-based Training for Non-Autoregressive Translation #118

kweonwooj opened this issue Jan 3, 2019 · 0 comments

Abstract

  • propose to leverage hints from pre-trained AutoRegressive Translation (ART) model to train Non-AutoRegressive Translation (NART) model
    • hints from hidden state
    • hints from word alignment
  • on WMT14 EnDe, 17.8x faster inference with ~2.00 BLEU loss
    • NART : 25.20 BLEU / 44 ms
    • ART : 27.30 BLEU / 784 ms

Details

Introduction

  • NART models
    • fully non-autoregressive models suffer a significant loss of accuracy compared to ART models
  • To improve decoder accuracy,
    • Gu et al 2017 introduce fertilities from an SMT model and copy source tokens to initialize decoder states
    • Lee et al 2018 propose an iterative refinement process
    • Kaiser et al 2018 autoregressively generate a short sequence of discrete latent variables, then decode the target non-autoregressively
    • these fixes improve accuracy but add computational overhead, so there is a trade-off between translation accuracy and inference speed
  • Contribution
    • improve translation accuracy by enriching the training signal with two kinds of hints from a pre-trained ART model

Motivation

  • Empirical error analysis of NART models leads to two findings
    • outputs contain incoherent/repetitive phrases and miss meaningful tokens from the source side
  • visualized incoherent phrases via cosine similarity between hidden states (a small sketch of this diagnostic follows this list)
    • NART models w/o hints show higher cosine similarity between hidden states, which leads to repetitive outputs
  • visualized missing tokens via encoder-decoder attention weights
    • NART models w/o hints have poorly aligned attention weights, so meaningful source tokens get dropped
  • Enhancing the loss function with these two additional signals (hidden-state similarity and attention weights from the ART teacher) is the main contribution
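
A minimal sketch of the kind of diagnostic described above, assuming PyTorch; the function name and toy shapes are mine, not from the paper's code. It computes pairwise cosine similarity between decoder hidden states at different target positions, so a block of nearly identical states (the pattern the authors link to repetitive phrases) shows up as large off-diagonal values.

```python
import torch
import torch.nn.functional as F

def hidden_state_similarity(hidden: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between decoder hidden states.

    hidden: (tgt_len, d_model) states of one decoder layer for one sentence.
    Returns a (tgt_len, tgt_len) matrix; large off-diagonal entries mean two
    target positions carry nearly the same state, which tends to surface as
    repeated/incoherent phrases in NART output.
    """
    normed = F.normalize(hidden, dim=-1)   # unit-length vectors
    return normed @ normed.t()             # cosine similarity matrix

# Toy usage with random states: 6 target positions, 8-dimensional model.
states = torch.randn(6, 8)
print(hidden_state_similarity(states))
```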

Hint-based NMT


  • Hints from hidden states
    • penalize the NART model when a pair of its hidden states is highly similar while the corresponding ART hidden states are not (see the sketches after this list)
  • Hints from word alignment
    • KL divergence loss that pulls the NART encoder-decoder attention distribution toward the ART model's
  • Initial Decoder State (z) : linear combination of source embeddings
    • exponentially decaying weights, so source tokens at closer positions contribute more
  • Multihead Positional Attention : additional sub-layer in the decoder to re-configure positional information
  • Inference Tricks (see the sketch after this list)
    • Length Prediction : instead of predicting the target length with a model, use a constant bias C estimated from the training corpus (no computational overhead)
    • Length Range Prediction : instead of committing to a single length, decode candidates over a range of target lengths
    • ART re-scoring : use the ART model to re-score the multiple target candidates and select the final one (since the candidates are already complete, re-scoring runs as a single teacher-forced pass, i.e. non-autoregressively)
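
The exact hint-loss definitions are in the paper (the equations here were screenshots); as a rough sketch of the two ideas, assuming PyTorch, and with the thresholds, names, and normalization chosen by me rather than taken from the authors' code:

```python
import torch
import torch.nn.functional as F

def hidden_state_hint_loss(student_h, teacher_h, sim_hi=0.9, sim_lo=0.5):
    """Hidden-state hint (sketch): penalize the NART student when two of its
    decoder states are nearly identical (cos sim > sim_hi) while the ART
    teacher's states at the same pair of positions are clearly distinct
    (cos sim < sim_lo). Inputs are (tgt_len, d_model); the thresholds are
    illustrative, not the paper's values.
    """
    s = F.normalize(student_h, dim=-1)
    t = F.normalize(teacher_h, dim=-1)
    s_sim = s @ s.t()                              # student pairwise cos sim
    t_sim = t @ t.t()                              # teacher pairwise cos sim
    collapsed = (s_sim > sim_hi) & (t_sim < sim_lo)
    # Push the offending student similarities back below the threshold.
    return ((s_sim - sim_hi) * collapsed).sum() / collapsed.sum().clamp(min=1)

def alignment_hint_loss(student_attn, teacher_attn, eps=1e-9):
    """Word-alignment hint (sketch): KL(teacher || student) between
    encoder-decoder attention distributions of shape (tgt_len, src_len),
    where each row sums to one.
    """
    kl = teacher_attn * (torch.log(teacher_attn + eps) - torch.log(student_attn + eps))
    return kl.sum(dim=-1).mean()
```

The decoder-state initialization can be sketched the same way; the exponential form and the length rescaling below are my reading of the bullet above, not the paper's exact formula:

```python
import torch

def init_decoder_states(src_emb, tgt_len, tau=1.0):
    """Initial decoder states z (sketch): each target position i starts from a
    weighted sum of source embeddings, with weights decaying exponentially in
    the distance between i (rescaled to the source length) and source position
    j. src_emb: (src_len, d_model); tau is an assumed scale parameter.
    """
    src_len = src_emb.shape[0]
    pos_s = torch.arange(src_len, dtype=torch.float32)
    pos_t = torch.arange(tgt_len, dtype=torch.float32) * src_len / max(tgt_len, 1)
    dist = (pos_t[:, None] - pos_s[None, :]).abs()     # (tgt_len, src_len)
    weights = torch.softmax(-dist / tau, dim=-1)       # closer index => larger weight
    return weights @ src_emb
```

Finally, a sketch of how the inference tricks compose; `nart_model.decode` and `art_model.score` are hypothetical interfaces, C and B stand in for values estimated from the training corpus, and "source length plus C" is my interpretation of the constant-bias length trick:

```python
def translate(nart_model, art_model, src_tokens, C=2, B=4):
    """Length range + ART re-scoring (sketch). Candidate target lengths are
    the source length plus a constant bias C, widened by +/- B. Each candidate
    is decoded by the NART model in one parallel pass; the ART teacher then
    scores the finished candidates with teacher forcing (also a single
    parallel pass, since every target token is already known) and the
    highest-scoring candidate is returned.
    """
    base_len = len(src_tokens) + C
    candidates = [
        nart_model.decode(src_tokens, length)            # parallel NART decode
        for length in range(max(1, base_len - B), base_len + B + 1)
    ]
    return max(candidates, key=lambda hyp: art_model.score(src_tokens, hyp))
```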

Overall Performance

  • 17.8x speed-up with 1.90 BLEU loss in WMT14 EnDe

Personal Thoughts

  • I totally agree that all the semantics and syntax are already in the source sentence, hence NART models can work if we train them correctly
  • Inference Tricks seem to be a strong contribution that the authors do not explicitly point out
  • ICLR submission was rejected due to insufficient related work/story-telling and bad luck

Link : https://openreview.net/pdf?id=r1gGpjActQ
Authors : Li et al. 2018
