Stable and Effective Trainable Greedy Decoding for Sequence to Sequence Learning

Yun Chen , Kyunghyun Cho , Samuel R. Bowman, Victor O.K. Li

ICLR 2018 Workshop | arxiv | openreview

Problematic:

Since the output space of sequences is exponentially large, heuristic search methods such as greedy decoding or beam search must be used to find high-probability sequences. Greedy decoding is very fast, but beam search yields substantially better outputs at a higher decoding cost.

Contributions:

The authors propose a small neural network actor that observes and manipulates the hidden states of a previously trained decoder. They use beam search (B=35) to decode sentences with the plain decoder, rank the candidates by BLEU, and train the actor to push the decoder toward generating the highest-BLEU output in a single greedy decoding pass (i.e. without beam search).

The actor takes as input the current decoder state \(h\) and the source context vector \(c\), and defines the action \(a\) as:

\[ a = z \circ Wh + (1-z) \circ Uc, \] where \(z\) is a gate similar to other LSTM gates: \[ z = \sigma(W_zh + U_z c)\]

The actor decides whether to rely more on the source \(c\) or the decoder state \(h\) to generate the action.

The action is then added to the hidden state:

\[ \tilde h = f(h, c) + a \]
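The gated action and the state update above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the helper names (`matvec`, `actor_action`, `adjusted_state`) are hypothetical, vectors are plain lists, and `f` is passed in as an arbitrary callable standing in for the decoder's own function of \(h\) and \(c\).

```python
import math

def matvec(M, v):
    # Multiply a matrix (list of rows) by a vector (list of floats).
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def actor_action(h, c, W, U, W_z, U_z):
    # Gate: z = sigma(W_z h + U_z c), computed element-wise.
    z = [sigmoid(p + q) for p, q in zip(matvec(W_z, h), matvec(U_z, c))]
    Wh = matvec(W, h)
    Uc = matvec(U, c)
    # Action: a = z * (W h) + (1 - z) * (U c), element-wise products.
    return [z_i * wh_i + (1.0 - z_i) * uc_i
            for z_i, wh_i, uc_i in zip(z, Wh, Uc)]

def adjusted_state(h, c, f, W, U, W_z, U_z):
    # h_tilde = f(h, c) + a; f is an assumed stand-in for the decoder function.
    a = actor_action(h, c, W, U, W_z, U_z)
    return [s + a_i for s, a_i in zip(f(h, c), a)]
```

With zero gate weights the gate is 0.5 everywhere, so the action is simply the average of \(Wh\) and \(Uc\); as the gate saturates toward 1 or 0, the action relies almost entirely on the decoder state or on the source context, respectively.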

Training the actor:

The new target sequence, picked from the 35 beam-search candidates, trades off two goals:

Having a high model likelihood, so that the model can be coerced into generating it without too much additional training.
Having good translation quality.

Given a pair \(\langle x, y \rangle\), the original model generates \(Z=\{z^1, \ldots, z^B\}\) with beam-search decoding. We then choose the candidate with the highest BLEU score against the reference \(y\) as the new target sequence \(z\). Doing this for every training pair yields a pseudo corpus \(D_{x,z}\).
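The construction of the pseudo corpus can be sketched as follows. This is a minimal sketch under stated assumptions: `beam_search` and `score` are hypothetical callables supplied by the caller, and `unigram_precision` is only a toy stand-in for sentence-level BLEU so that the example stays self-contained.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    # Toy stand-in for sentence-level BLEU (assumption, not the paper's metric):
    # clipped unigram matches divided by candidate length.
    cand = candidate.split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.split())
    matches = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    return matches / len(cand)

def build_pseudo_corpus(pairs, beam_search, score):
    # pairs: iterable of (source x, reference y).
    # beam_search(x): returns the B candidate translations for x (hypothetical).
    # score(z, y): quality of candidate z against reference y.
    pseudo = []
    for x, y in pairs:
        candidates = beam_search(x)
        best = max(candidates, key=lambda z: score(z, y))
        pseudo.append((x, best))  # the reference y is replaced by the best candidate
    return pseudo
```

Because every candidate comes out of the model's own beam, the selected target already has reasonably high model likelihood, which is the first of the two goals listed above.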

We keep the underlying model fixed and train the actor by maximizing the likelihood of the pseudo pairs:

\[ \hat\theta_{actor} = \arg\max_{\theta_{actor}} \sum_{\langle x,z \rangle \in D_{x,z}} \log P(z \mid x; \hat\theta_{nmt}, \theta_{actor}) \]
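As a rough illustration of the quantity being maximized, the objective is a sum of sequence log-likelihoods over the pseudo corpus, each of which decomposes into per-token log-probabilities. The sketch below only evaluates that sum; `token_log_probs` is a hypothetical callable standing in for the frozen NMT model combined with the actor, and the gradient step on the actor parameters is left out.

```python
import math

def pseudo_corpus_log_likelihood(pseudo_corpus, token_log_probs):
    # pseudo_corpus: iterable of (x, z) pairs from the beam-search selection.
    # token_log_probs(x, z): per-token values log P(z_t | z_<t, x) under the
    # frozen NMT parameters plus the current actor (hypothetical interface).
    total = 0.0
    for x, z in pseudo_corpus:
        total += sum(token_log_probs(x, z))  # log P(z | x) for one pair
    return total
```

In an actual training loop this value would be maximized with respect to the actor parameters only, with the NMT parameters held fixed.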

Gu et al. 2017 train a similar actor with a critic-aware actor learning algorithm. That approach suffers from noisy gradient estimates from the critic and requires careful design.

Experiments:

Evaluated on IWSLT16 English-German in both directions. Experiments show that the actor makes it practical to replace beam search with greedy decoding in most cases.

Issues & comments: