Non-Autoregressive Neural Machine Translation

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, Richard Socher

ICLR 2018 | code | openreview

Propose a non-autoregressive neural sequence model based on iterative refinement

Problem:

Auto-regressive models (e.g. RNNs) are often non-Markovian and nonlinear, so exact maximum-probability decoding is intractable and they must resort to suboptimal approximate decoding (beam search) to find a high-probability sequence. The paper calls this the "decoding gap", as distinct from the "modeling gap", i.e. the limit on the model's expressiveness or ability to fit the data. The total performance gap is the sum of the two.

Contributions:

The authors propose a non-autoregressive model with a larger "modeling gap" but no decoding gap at all, thereby minimizing the overall gap.

The new model generates the full target sequence in parallel with a decoder built from Transformer blocks:

\[ P(Y|x) = \prod_t p(y_t|x) \]
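Below is a minimal sketch of this one-shot parallel decoding step, assuming a hypothetical `decoder` callable that maps the source encoding to per-position token logits in a single pass; the names `decoder`, `src_encoding`, and `tgt_len` are illustrative, not from the paper.

```python
import torch

def decode_parallel(decoder, src_encoding, tgt_len):
    # logits: (tgt_len, vocab_size); every target position is scored
    # independently, matching the factorization P(Y|x) = prod_t p(y_t|x)
    logits = decoder(src_encoding, tgt_len)
    # greedy choice at each position; no beam search is needed because
    # the positions are conditionally independent given x
    return logits.argmax(dim=-1)
```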

However, intermediate sequences are introduced as latent variables in a refinement process: the model keeps modifying the sequence until a stopping criterion is met.

The final sequence is obtained by marginalizing over the latent intermediate sequences: \[ P(Y|x) = \sum_{Y^0, \dots, Y^{L-1}} P(Y^0 \mid x) \prod_{l=1}^{L} P(Y^l \mid Y^{l-1}, x), \] where \(Y^L = Y\) and \(P(Y^0 \mid x)\) is the factorized parallel decoder above.

The same decoder is shared across all refinement steps, i.e. every pass from one sequence to the next, as in the sketch below.
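Here is a minimal sketch of the refinement loop, reusing the same (shared) `decoder` at every pass; `max_passes` and the "sequence unchanged" convergence test are illustrative choices, not necessarily the paper's exact stopping criterion.

```python
import torch

def refine(decoder, src_encoding, y0, max_passes=10):
    y = y0  # initial draft Y^0, e.g. the output of decode_parallel above
    for _ in range(max_passes):
        # one refinement pass with the shared decoder: scores for P(. | Y^{l-1}, x)
        logits = decoder(src_encoding, y)
        y_next = logits.argmax(dim=-1)  # deterministic choice of Y^l
        if torch.equal(y_next, y):      # illustrative stopping criterion: sequence unchanged
            break
        y = y_next
    return y
```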

Since the summation above is intractable, a deterministic lower bound is used instead: \[ \log P(Y|x) \geq \sum_l \log P(\hat Y^l \mid \hat Y^{l-1}, x), \] where each intermediate sequence is chosen greedily: \[ \hat Y^l = \arg\max_{Y'} P(Y' \mid \hat Y^{l-1}, x). \]
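The sketch below evaluates this deterministic lower bound: each intermediate sequence \(\hat Y^l\) is picked by argmax and the corresponding log-probabilities are summed. The names `decoder`, `y0`, and `num_passes` are assumptions for illustration, not the paper's interface.

```python
import torch
import torch.nn.functional as F

def deterministic_lower_bound(decoder, src_encoding, y0, num_passes=4):
    y_hat = y0   # \hat Y^0, e.g. the parallel-decoder draft
    bound = 0.0
    for _ in range(num_passes):
        # scores for P(. | \hat Y^{l-1}, x) from the shared decoder
        log_probs = F.log_softmax(decoder(src_encoding, y_hat), dim=-1)
        y_next = log_probs.argmax(dim=-1)  # \hat Y^l = argmax P(. | \hat Y^{l-1}, x)
        # accumulate log P(\hat Y^l | \hat Y^{l-1}, x), summed over positions
        bound = bound + log_probs.gather(-1, y_next.unsqueeze(-1)).sum()
        y_hat = y_next
    # the accumulated sum is the right-hand side of the bound above; it
    # lower-bounds log P(Y|x) when the final sequence matches the target Y
    return bound
```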

Experiments: