Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, Yoshua Bengio
ICLR 2017 | openreview
Addresses the exposure bias of training an RNN with maximum likelihood (ML) in teacher-forcing mode: the model is trained on ground-truth prefixes but must condition on its own (possibly erroneous) predictions at test time.
Uses actor-critic methods from RL: a critic network is trained to predict the value of an output token given the policy of an actor network. This allows directly optimizing a task-specific score instead of the likelihood.
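As a concrete (and simplified) sketch of this setup, the toy PyTorch snippet below runs one actor-critic update for a single target sequence: the actor samples tokens, the critic predicts a Q-value for every candidate token, the critic is regressed towards Monte-Carlo returns, and the actor minimizes \(-\sum_t \sum_a p(a \mid \hat Y_{1:t-1}, X)\,\hat Q(a)\), i.e. the critic-weighted token probabilities. Everything here (`Decoder`, `task_score`, the per-step shaped reward, the start symbol) is an illustrative assumption rather than the paper's actual architecture, and unlike the paper the critic below is not conditioned on the ground truth.

```python
# A toy actor-critic update for sequence prediction (illustrative sketch,
# not the paper's implementation). Assumptions: a vocabulary of VOCAB ids,
# start symbol 0, a per-prefix task score, and a critic that (unlike the
# paper) is conditioned only on the generated prefix, not on the ground truth.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HID, T = 20, 32, 10  # toy vocabulary size, hidden size, max output length

class Decoder(nn.Module):
    """Shared shape for actor and critic: a GRU cell stepped over generated tokens."""
    def __init__(self, out_dim):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID)
        self.rnn = nn.GRUCell(HID, HID)
        self.out = nn.Linear(HID, out_dim)

    def step(self, token, h):
        h = self.rnn(self.emb(token), h)
        return self.out(h), h

actor = Decoder(VOCAB)   # logits over the next token
critic = Decoder(VOCAB)  # Q-value estimate for every candidate next token
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

def task_score(prefix, target):
    """Toy task score: number of positions matching the target (stand-in for BLEU)."""
    return float(sum(int(p == t) for p, t in zip(prefix, target)))

def train_step(target):
    h_a, h_c = torch.zeros(1, HID), torch.zeros(1, HID)
    token = torch.zeros(1, dtype=torch.long)  # assumed start symbol
    prefix, rewards, q_steps, prob_steps = [], [], [], []
    for _ in range(T):
        logits, h_a = actor.step(token, h_a)
        q_values, h_c = critic.step(token, h_c)
        probs = F.softmax(logits, dim=-1)
        token = torch.multinomial(probs, 1).squeeze(1)   # sample the next token
        before = task_score(prefix, target)
        prefix.append(token.item())
        rewards.append(task_score(prefix, target) - before)  # shaped reward r_t
        q_steps.append(q_values)
        prob_steps.append(probs)
    # Critic target for the sampled token: the Monte-Carlo return sum_{tau >= t} r_tau.
    returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)
    critic_loss = sum((q[0, prefix[t]] - returns[t]) ** 2
                      for t, q in enumerate(q_steps))
    # Actor loss: minus the critic-weighted probability of every candidate token,
    # so its gradient is -sum_t sum_a dp(a|.)/dtheta * Q_hat(a).
    actor_loss = -sum((p[0] * q[0].detach()).sum()
                      for p, q in zip(prob_steps, q_steps))
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    return actor_loss.item(), critic_loss.item()

# Example: one update towards a toy target sequence.
train_step([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
```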
In the supervised paradigm, the critic is conditioned on the ground-truth output.
A trained predictor \(h\) is evaluated by computing the average task-specific score \(R(\hat Y, Y)\) with \(\hat Y = h(X)\). The conditioned RNN is viewed as a stochastic policy that generates actions (output tokens) and receives the task score as the return. In general the return \(R\) is only received at the end of the sequence, but we can consider the case where it is partially received at intermediate steps in the form of rewards \(r_t\) with \(R=\sum_t r_t\), which eases the learning of the critic.
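One simple way to obtain such intermediate rewards (stated here as an illustration of score-difference shaping, taking \(R\) of the empty prefix to be zero) is to reward the increase of the partial task score, so that the rewards telescope back to the full return:
\[ r_t(\hat y_t; \hat Y_{1:t-1}, Y) = R(\hat Y_{1:t}, Y) - R(\hat Y_{1:t-1}, Y), \qquad \sum_{t=1}^T r_t = R(\hat Y_{1:T}, Y). \]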
The value function for an unfinished prediction \(\hat Y_{1:t}\) is defined as:
\[ V(\hat Y_{1:t}; X, Y) = \mathbb E_{\hat Y_{t+1:T} \sim p(\cdot \mid \hat Y_{1:t}, X)} \sum_{\tau=t+1}^T r_\tau(\hat y_\tau; \hat Y_{1:\tau-1}, Y) \]
And the value of a candidate next token \(a\) for an unfinished prediction \(\hat Y_{1:t-1}\) is defined as the expected future return after generating \(a\):
\[ Q(a; \hat Y_{1:t-1}, X, Y) = \mathbb E_{\hat Y_{t+1:T} \sim p(\cdot \mid \hat Y_{1:t-1} a, X)} \left[ r_t(a; \hat Y_{1:t-1}, Y) + \sum_{\tau=t+1}^T r_\tau(\hat y_\tau; \hat Y_{1:t-1}\, a\, \hat Y_{t+1:\tau-1}, Y) \right] \]
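Both definitions can be instantiated directly with naive Monte-Carlo rollouts. The sketch below assumes hypothetical `sample_next` and `reward` callables (not from the paper) and also makes explicit that \(Q(a; \hat Y_{1:t-1}, X, Y) = r_t(a; \hat Y_{1:t-1}, Y) + V(\hat Y_{1:t-1} a; X, Y)\), since \(r_t\) does not depend on the future tokens:

```python
def mc_value(prefix, X, Y, sample_next, reward, T, n=100):
    """Monte-Carlo estimate of V(Y_hat_{1:t}; X, Y): average the return
    of n sampled continuations of the given prefix up to length T."""
    total = 0.0
    for _ in range(n):
        rollout = list(prefix)
        while len(rollout) < T:
            y_tau = sample_next(rollout, X)      # y_tau ~ p(. | rollout, X)
            total += reward(y_tau, rollout, Y)   # r_tau(y_tau; Y_hat_{1:tau-1}, Y)
            rollout.append(y_tau)
    return total / n

def mc_q(a, prefix, X, Y, sample_next, reward, T, n=100):
    """Monte-Carlo estimate of Q(a; Y_hat_{1:t-1}, X, Y): the immediate
    reward for a plus the value of the prefix extended with a."""
    return reward(a, list(prefix), Y) + mc_value(list(prefix) + [a], X, Y,
                                                 sample_next, reward, T, n)
```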