Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba
ICLR 2016 | arxiv | code* | code
The paper raises issues with training sequence prediction models with MLE in teacher-forcing mode:
- MLE is not what the model is evaluated on at test time; evaluation uses discrete metrics such as BLEU or ROUGE.
- Integrating these metrics into the loss is not straightforward: they are mostly non-differentiable and so do not allow classic back-propagation.
- Exposure bias, i.e. the discrepancy between the distribution seen during training (teacher forcing, where the model conditions on ground-truth tokens) and the distribution at evaluation time (where it conditions on its own predictions), so errors can compound quickly.
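To make the mismatch concrete, here is a minimal PyTorch-style sketch; `cell`, `embed`, and `project` are hypothetical modules (not from the paper), and the only point is which token gets fed back at each step:

```python
import torch

def teacher_forced_step(cell, embed, project, h, gold_prev):
    # Training with MLE: the input at step t is the ground-truth token y*_{t-1}.
    h = cell(embed(gold_prev), h)
    return project(h), h            # logits over the vocabulary, new hidden state

def free_running_step(cell, embed, project, h, prev_logits):
    # Evaluation: the input at step t is the model's own previous prediction,
    # so an early mistake is fed back in and can compound.
    prev = torch.argmax(prev_logits, dim=-1)
    h = cell(embed(prev), h)
    return project(h), h
```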
The proposed solution is to use reinforcement learning, specifically the REINFORCE algorithm: a method for back-propagating through computational graphs that output a probability distribution over actions.
In this case the loss is the negative expected reward:
\[L(\theta) = -\mathbb{E}_{y \sim p_\theta}\big[\, r(y, y^*) \,\big]\]
where \(y\) is a sequence sampled from the model, \(y^*\) is the ground truth, and \(r\) can be any test-time metric such as BLEU or ROUGE.
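The reward itself is not differentiable, but the standard score-function (REINFORCE) identity gives an unbiased gradient of this expectation; the per-step baseline \(\bar r_t\) below reflects the paper's use of a learned reward predictor to reduce variance:
\[\nabla_\theta \, \mathbb{E}_{y \sim p_\theta}\big[r(y, y^*)\big] = \mathbb{E}_{y \sim p_\theta}\big[r(y, y^*)\, \nabla_\theta \log p_\theta(y)\big] \approx \sum_{t=1}^{T} \big(r(y, y^*) - \bar r_t\big)\, \nabla_\theta \log p_\theta(y_t \mid y_{<t})\]
with the approximation computed from a single sequence sampled from the model.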
This is readily applicable to RNNs, which output a softmax probability distribution at each time step. Moreover, to resolve the exposure bias, the output distribution is generated in evaluation mode (decoding from the model's own predictions), possibly with beam search to generate multiple candidates.
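A hedged sketch of what this looks like in PyTorch, assuming a decoder `model(prev, h)` that returns per-step logits and a new hidden state, and a hypothetical `reward_fn` computing a test metric such as sentence-level BLEU (neither name is from the paper):

```python
import torch
from torch.distributions import Categorical

def reinforce_loss(model, h, bos, reference, reward_fn, max_len):
    tokens, log_probs = [], []
    prev = bos                                  # [batch] start tokens
    for _ in range(max_len):
        logits, h = model(prev, h)              # decode from the model's own samples
        dist = Categorical(logits=logits)
        prev = dist.sample()                    # y_t ~ p_theta(. | y_<t)
        log_probs.append(dist.log_prob(prev))
        tokens.append(prev)
    sampled = torch.stack(tokens, dim=1)        # [batch, max_len]
    reward = reward_fn(sampled, reference)      # e.g. sentence BLEU, shape [batch]
    # Score-function estimator of -E[r]: weight the sequence log-likelihood by r.
    return -(reward * torch.stack(log_probs, dim=1).sum(dim=1)).mean()
```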
Training a model from scratch with this approach is infeasible, so the authors suggest warming up with MLE and then gradually optimizing for the reward. These steps make up what the authors call MIXER (Mixed Incremental Cross-Entropy Reinforce), which goes as follows (see the sketch after this list):
- Train for a few epochs with the standard cross-entropy (XENT) loss only.
- Then, for each target sequence of length T, apply XENT on the first T − Δ steps and REINFORCE on the remaining Δ steps.
- Increase Δ every few epochs until the whole sequence is trained with REINFORCE.
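A minimal sketch of one MIXER training step under these assumptions, reusing the hypothetical `reinforce_loss` above and assuming a teacher-forced `xent_loss` helper that also returns the hidden state and last ground-truth token at the split point (illustrative names, not the paper's code):

```python
def mixer_step(model, h0, bos, reference, reward_fn, delta):
    T = reference.size(1)
    split = T - delta
    # Cross-entropy (teacher forcing) on the first T - delta ground-truth tokens.
    loss_xent, h, prev = xent_loss(model, h0, bos, reference[:, :split])
    if delta == 0:
        return loss_xent                        # pure XENT warm-up phase
    # REINFORCE on the last delta steps, decoding from the model's own samples.
    return loss_xent + reinforce_loss(model, h, prev, reference,
                                      reward_fn, max_len=delta)

# Schedule: start with delta = 0 for several epochs, then increase delta every
# few epochs until REINFORCE covers the whole sequence.
```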
Results on three benchmarks (summarization, machine translation, and image captioning) show that this approach generates better outputs, even without beam search.
Does not outperform scheduled sampling (DAD, Data As Demonstrator).