Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba
ICLR 2016 | arxiv | code* | code
The paper raises issues with training sequence prediction models with MLE in teacher-forcing mode:
- MLE is not what the model is evaluated on at test time; evaluation uses discrete metrics such as BLEU or ROUGE.
- Integrating these metrics into the loss is not straightforward: they are mostly non-differentiable and so do not allow classic back-propagation.
- Exposure bias, i.e. the discrepancy between the distribution seen during training (teacher forcing, where the model conditions on ground-truth tokens) and the distribution at evaluation time (where it conditions on its own predictions), so errors can compound quickly.
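To make the mismatch concrete, here is a minimal PyTorch-style sketch; `cell`, `embed`, and `project` are hypothetical modules (not from the paper), and the only point is which token gets fed back at each step:

```python
import torch

def teacher_forced_step(cell, embed, project, h, gold_prev):
    # Training with MLE: the input at step t is the ground-truth token y*_{t-1}.
    h = cell(embed(gold_prev), h)
    return project(h), h            # logits over the vocabulary, new hidden state

def free_running_step(cell, embed, project, h, prev_logits):
    # Evaluation: the input at step t is the model's own previous prediction,
    # so an early mistake is fed back in and can compound.
    prev = torch.argmax(prev_logits, dim=-1)
    h = cell(embed(prev), h)
    return project(h), h
```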
The proposed solution is to use reinforcement learning, specifically the REINFORCE algorithm: a method for back-propagating through computational graphs that output a probability distribution over actions.
In this case the loss is the negative expected reward:
\[L(\theta) = -\mathbb{E}_{y \sim p_\theta}\big[\, r(y, y^*) \,\big]\]
where \(y\) is a sequence sampled from the model, \(y^*\) is the ground truth, and \(r\) can be any test-time metric such as BLEU or ROUGE.
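The reward itself is not differentiable, but the standard score-function (REINFORCE) identity gives an unbiased gradient of this expectation; the per-step baseline \(\bar r_t\) below reflects the paper's use of a learned reward predictor to reduce variance:
\[\nabla_\theta \, \mathbb{E}_{y \sim p_\theta}\big[r(y, y^*)\big] = \mathbb{E}_{y \sim p_\theta}\big[r(y, y^*)\, \nabla_\theta \log p_\theta(y)\big] \approx \sum_{t=1}^{T} \big(r(y, y^*) - \bar r_t\big)\, \nabla_\theta \log p_\theta(y_t \mid y_{<t})\]
with the approximation computed from a single sequence sampled from the model.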
This is readily applicable to RNNs, which output a softmax probability distribution at each time step. Moreover, to resolve the exposure bias, the output distribution is generated in evaluation mode (decoding from the model's own predictions), possibly with beam search to generate multiple candidates.
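A hedged sketch of what this looks like in PyTorch, assuming a decoder `model(prev, h)` that returns per-step logits and a new hidden state, and a hypothetical `reward_fn` computing a test metric such as sentence-level BLEU (neither name is from the paper):

```python
import torch
from torch.distributions import Categorical

def reinforce_loss(model, h, bos, reference, reward_fn, max_len):
    tokens, log_probs = [], []
    prev = bos                                  # [batch] start tokens
    for _ in range(max_len):
        logits, h = model(prev, h)              # decode from the model's own samples
        dist = Categorical(logits=logits)
        prev = dist.sample()                    # y_t ~ p_theta(. | y_<t)
        log_probs.append(dist.log_prob(prev))
        tokens.append(prev)
    sampled = torch.stack(tokens, dim=1)        # [batch, max_len]
    reward = reward_fn(sampled, reference)      # e.g. sentence BLEU, shape [batch]
    # Score-function estimator of -E[r]: weight the sequence log-likelihood by r.
    return -(reward * torch.stack(log_probs, dim=1).sum(dim=1)).mean()
```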
Training a model from scratch with this approach is infeasible, so the authors suggest warming up with MLE and then gradually optimizing for the reward. These steps make up what the authors call MIXER (Mixed Incremental Cross-Entropy Reinforce), which goes as follows (see the sketch after this list):
- Train for a few epochs with the standard cross-entropy (XENT) loss only.
- Then, for each target sequence of length T, apply XENT on the first T − Δ steps and REINFORCE on the remaining Δ steps.
- Increase Δ every few epochs until the whole sequence is trained with REINFORCE.
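A minimal sketch of one MIXER training step under these assumptions, reusing the hypothetical `reinforce_loss` above and assuming a teacher-forced `xent_loss` helper that also returns the hidden state and last ground-truth token at the split point (illustrative names, not the paper's code):

```python
def mixer_step(model, h0, bos, reference, reward_fn, delta):
    T = reference.size(1)
    split = T - delta
    # Cross-entropy (teacher forcing) on the first T - delta ground-truth tokens.
    loss_xent, h, prev = xent_loss(model, h0, bos, reference[:, :split])
    if delta == 0:
        return loss_xent                        # pure XENT warm-up phase
    # REINFORCE on the last delta steps, decoding from the model's own samples.
    return loss_xent + reinforce_loss(model, h, prev, reference,
                                      reward_fn, max_len=delta)

# Schedule: start with delta = 0 for several epochs, then increase delta every
# few epochs until REINFORCE covers the whole sequence.
```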
Results on three benchmarks (summarization, machine translation, and image captioning) show that this approach generates better outputs, even without beam search.
Does not outperform scheduled sampling (DAD, Data As Demonstrator).