Hint-based Training for Non-Autoregressive Translation #118

kweonwooj opened this issue Jan 3, 2019 · 0 comments

Abstract

  • propose to leverage hints from pre-trained AutoRegressive Translation (ART) model to train Non-AutoRegressive Translation (NART) model
    • hints from hidden state
    • hints from word alignment
  • on WMT14 EnDe, 17.8x faster inference with ~2.00 BLEU loss
    • NART : 25.20 BLEU / 44 ms
    • ART : 27.30 BLEU / 784 ms

Details

Introduction

  • NART models
    • fully non-autoregressive models suffer a significant loss of accuracy compared to ART models
  • To improve decoder accuracy,
    • Gu et al 2017 introduce fertilities from an SMT model and copy source tokens to initialize decoder states
    • Lee et al 2018 propose an iterative refinement process
    • Kaiser et al 2018 autoregressively generate a short sequence of discrete latent variables, then decode the target non-autoregressively
    • these fixes improve accuracy but add computational overhead, so there is a trade-off between translation accuracy and inference speed
  • Contribution
    • improve translation accuracy by enriching the training signal with two kinds of hints from a pre-trained ART model

Motivation

  • Empirical error analysis of NART models leads to two findings
    • outputs contain incoherent/repetitive phrases and miss meaningful tokens from the source side
  • visualized incoherent phrases via cosine similarity between hidden states (a small sketch of this diagnostic follows this list)
    • NART models w/o hints show higher cosine similarity between hidden states, which leads to repetitive outputs
  • visualized missing tokens via encoder-decoder attention weights
    • NART models w/o hints have poorly aligned attention weights, so meaningful source tokens get dropped
  • Enhancing the loss function with these two additional signals (hidden-state similarity and attention weights from the ART teacher) is the main contribution
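
A minimal sketch of the kind of diagnostic described above, assuming PyTorch; the function name and toy shapes are mine, not from the paper's code. It computes pairwise cosine similarity between decoder hidden states at different target positions, so a block of nearly identical states (the pattern the authors link to repetitive phrases) shows up as large off-diagonal values.

```python
import torch
import torch.nn.functional as F

def hidden_state_similarity(hidden: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between decoder hidden states.

    hidden: (tgt_len, d_model) states of one decoder layer for one sentence.
    Returns a (tgt_len, tgt_len) matrix; large off-diagonal entries mean two
    target positions carry nearly the same state, which tends to surface as
    repeated/incoherent phrases in NART output.
    """
    normed = F.normalize(hidden, dim=-1)   # unit-length vectors
    return normed @ normed.t()             # cosine similarity matrix

# Toy usage with random states: 6 target positions, 8-dimensional model.
states = torch.randn(6, 8)
print(hidden_state_similarity(states))
```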

Hint-based NMT


  • Hints from hidden states
    • penalize the NART model when a pair of its hidden states is highly similar while the corresponding ART hidden states are not (see the sketches after this list)
  • Hints from word alignment
    • KL divergence loss that pulls the NART encoder-decoder attention distribution toward the ART model's
  • Initial Decoder State (z) : linear combination of source embeddings
    • exponentially decaying weights, so source tokens at closer positions contribute more
  • Multihead Positional Attention : additional sub-layer in the decoder to re-configure positional information
  • Inference Tricks (see the sketch after this list)
    • Length Prediction : instead of predicting the target length with a model, use a constant bias C estimated from the training corpus (no computational overhead)
    • Length Range Prediction : instead of committing to a single length, decode candidates over a range of target lengths
    • ART re-scoring : use the ART model to re-score the multiple target candidates and select the final one (since the candidates are already complete, re-scoring runs as a single teacher-forced pass, i.e. non-autoregressively)
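
The exact hint-loss definitions are in the paper (the equations here were screenshots); as a rough sketch of the two ideas, assuming PyTorch, and with the thresholds, names, and normalization chosen by me rather than taken from the authors' code:

```python
import torch
import torch.nn.functional as F

def hidden_state_hint_loss(student_h, teacher_h, sim_hi=0.9, sim_lo=0.5):
    """Hidden-state hint (sketch): penalize the NART student when two of its
    decoder states are nearly identical (cos sim > sim_hi) while the ART
    teacher's states at the same pair of positions are clearly distinct
    (cos sim < sim_lo). Inputs are (tgt_len, d_model); the thresholds are
    illustrative, not the paper's values.
    """
    s = F.normalize(student_h, dim=-1)
    t = F.normalize(teacher_h, dim=-1)
    s_sim = s @ s.t()                              # student pairwise cos sim
    t_sim = t @ t.t()                              # teacher pairwise cos sim
    collapsed = (s_sim > sim_hi) & (t_sim < sim_lo)
    # Push the offending student similarities back below the threshold.
    return ((s_sim - sim_hi) * collapsed).sum() / collapsed.sum().clamp(min=1)

def alignment_hint_loss(student_attn, teacher_attn, eps=1e-9):
    """Word-alignment hint (sketch): KL(teacher || student) between
    encoder-decoder attention distributions of shape (tgt_len, src_len),
    where each row sums to one.
    """
    kl = teacher_attn * (torch.log(teacher_attn + eps) - torch.log(student_attn + eps))
    return kl.sum(dim=-1).mean()
```

The decoder-state initialization can be sketched the same way; the exponential form and the length rescaling below are my reading of the bullet above, not the paper's exact formula:

```python
import torch

def init_decoder_states(src_emb, tgt_len, tau=1.0):
    """Initial decoder states z (sketch): each target position i starts from a
    weighted sum of source embeddings, with weights decaying exponentially in
    the distance between i (rescaled to the source length) and source position
    j. src_emb: (src_len, d_model); tau is an assumed scale parameter.
    """
    src_len = src_emb.shape[0]
    pos_s = torch.arange(src_len, dtype=torch.float32)
    pos_t = torch.arange(tgt_len, dtype=torch.float32) * src_len / max(tgt_len, 1)
    dist = (pos_t[:, None] - pos_s[None, :]).abs()     # (tgt_len, src_len)
    weights = torch.softmax(-dist / tau, dim=-1)       # closer index => larger weight
    return weights @ src_emb
```

Finally, a sketch of how the inference tricks compose; `nart_model.decode` and `art_model.score` are hypothetical interfaces, C and B stand in for values estimated from the training corpus, and "source length plus C" is my interpretation of the constant-bias length trick:

```python
def translate(nart_model, art_model, src_tokens, C=2, B=4):
    """Length range + ART re-scoring (sketch). Candidate target lengths are
    the source length plus a constant bias C, widened by +/- B. Each candidate
    is decoded by the NART model in one parallel pass; the ART teacher then
    scores the finished candidates with teacher forcing (also a single
    parallel pass, since every target token is already known) and the
    highest-scoring candidate is returned.
    """
    base_len = len(src_tokens) + C
    candidates = [
        nart_model.decode(src_tokens, length)            # parallel NART decode
        for length in range(max(1, base_len - B), base_len + B + 1)
    ]
    return max(candidates, key=lambda hyp: art_model.score(src_tokens, hyp))
```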

Overall Performance

  • 17.8x speed-up with 1.90 BLEU loss in WMT14 EnDe

Personal Thoughts

  • I totally agree that all the semantics and syntax are already in the source sentence, hence NART models can work if we train them correctly
  • Inference Tricks seem to be a strong contribution that the authors do not explicitly point out
  • ICLR submission was rejected due to insufficient related work/story-telling and bad luck

Link : https://openreview.net/pdf?id=r1gGpjActQ
Authors : Li et al. 2018
