Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer
NIPS 2015 | arXiv
The paper raises the issue of exposure bias: the gap between how the model is trained (teacher forcing, where it is conditioned on ground-truth tokens) and how it is used at test time (free running, where it is conditioned on its own previous predictions).
The authors propose to flip a coin at every decoding step during training to either keep the conditioning token as is (i.e., the ground truth) or replace it with the model's previous prediction (as at test time). They suggest starting with a high probability of keeping the ground truth (the default mode of training RNNs) and annealing that probability as training advances, following a schedule. The most effective schedule in their experiments was inverse sigmoid decay: \[ \epsilon_i = \frac{k}{k + \exp{(i/k)}}, \] where \(k\geq 1\) depends on the expected speed of convergence and \(i\) is the current training iteration.
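A minimal sketch of this idea in PyTorch, assuming a toy GRU decoder; the names `Decoder`, `train_step`, and `inverse_sigmoid_schedule` are illustrative, not from the paper, and greedy argmax stands in for sampling from the model's output distribution:

```python
import math
import random

import torch
import torch.nn as nn


def inverse_sigmoid_schedule(i: int, k: float = 100.0) -> float:
    """Probability of feeding the ground-truth token at iteration i (k >= 1)."""
    return k / (k + math.exp(i / k))


class Decoder(nn.Module):
    """Toy GRU decoder, only to illustrate scheduled sampling."""

    def __init__(self, vocab_size: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward_step(self, token, h):
        h = self.rnn(self.embed(token), h)
        return self.out(h), h


def train_step(decoder, targets, h0, step):
    """One scheduled-sampling step; targets is (batch, seq_len), h0 is (batch, hidden)."""
    loss_fn = nn.CrossEntropyLoss()
    eps = inverse_sigmoid_schedule(step)
    h = h0
    token = targets[:, 0]  # assumes index 0 is a <bos>-like start token
    loss = 0.0
    for t in range(1, targets.size(1)):
        logits, h = decoder.forward_step(token, h)
        loss = loss + loss_fn(logits, targets[:, t])
        if random.random() < eps:
            token = targets[:, t]  # coin flip: keep the ground truth...
        else:
            token = logits.argmax(dim=-1).detach()  # ...or the model's own prediction
    return loss / (targets.size(1) - 1), eps
```

Since \(\epsilon_i\) decays from roughly 1 toward 0, early training is mostly teacher-forced and later training increasingly conditions on the model's own outputs, matching the test-time regime.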
Results on the tasks of image captioning, constituency parsing, and speech recognition show that this approach improves over standard maximum-likelihood training with teacher forcing.