Reward augmented maximum likelihood for neural structured prediction

Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans

NIPS 2016 | arXiv

Reward augmented maximum likelihood:

Given a dataset of input-output pairs \(\mathcal D = \{ (x^{(i)}, y^{\star(i)})\}\_{i=1}^N\), we learn a conditional distribution \(p\_\theta(y|x)\) consistent with \(\mathcal D\), factorized autoregressively:

\[ p_\theta(y | x) = \prod_{t=1}^T p_\theta(y_t | x, y_{1:t-1}) \]
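
A minimal sketch of this factorization, assuming a hypothetical per-step decoder that already returns log-distributions over the vocabulary: the sequence log-probability is just the sum of per-token conditional log-probabilities.

```python
import numpy as np

def sequence_log_prob(step_log_probs: np.ndarray, y: list) -> float:
    """log p(y|x) = sum_t log p(y_t | x, y_{1:t-1}).

    step_log_probs: hypothetical (T, V) array of per-step log-distributions
        over the vocabulary, each row already conditioned on x and the prefix
        y_{1:t-1} (e.g. produced by an RNN decoder under teacher forcing).
    y: list of T target token ids.
    """
    return float(sum(step_log_probs[t, y[t]] for t in range(len(y))))
```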

Beam search finds:

\[ \hat y(x) \approx \arg\max_y p_\theta(y | x) \]
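
A minimal beam-search sketch, assuming a hypothetical `next_log_probs(x, prefix)` that returns the \((V,)\)-vector of \(\log p\_\theta(y\_t | x, y\_{1:t-1})\); EOS handling and length normalization are omitted.

```python
import numpy as np

def beam_search(next_log_probs, x, beam_size=4, max_len=20):
    """Approximate arg-max decoding over sequences of length max_len."""
    beams = [([], 0.0)]                                   # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            lp = next_log_probs(x, prefix)                # (V,) log p(y_t | x, prefix)
            for v in np.argsort(lp)[-beam_size:]:         # top-k extensions of this beam
                candidates.append((prefix + [int(v)], score + float(lp[v])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]                                    # highest-scoring hypothesis
```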

The reward signal, e.g. BLEU score or negative edit distance, measures the quality of the predictions: \(\sum\_{(x, y^\star)} r(\hat y(x), y^\star)\).
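
For concreteness, a sketch of one such reward, the negative token-level edit (Levenshtein) distance; BLEU would instead be computed with an external library.

```python
def negative_edit_distance(y_hat, y_star):
    """Reward r(y_hat, y*) = -(Levenshtein distance between two token sequences)."""
    n, m = len(y_hat), len(y_star)
    # dp[i][j] = edit distance between y_hat[:i] and y_star[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (y_hat[i - 1] != y_star[j - 1])
            dp[i][j] = min(dp[i - 1][j] + 1,     # deletion
                           dp[i][j - 1] + 1,     # insertion
                           sub)                  # substitution / match
    return -dp[n][m]
```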

Maximum Likelihood

\[ \begin{align} \mathcal L_{ML}(\theta) & = - \sum_{(x, y^\star)} \log p_\theta(y^\star | x)\\ & = \sum_{(x, y^\star)} D_{KL}\left(1_{[y=y^\star]} || p_\theta(y | x)\right) \\ \end{align} \]
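
In code, the ML objective is plain token-level cross-entropy under teacher forcing; a sketch assuming hypothetical decoder logits:

```python
import torch
import torch.nn.functional as F

def ml_loss(logits: torch.Tensor, y_star: torch.Tensor) -> torch.Tensor:
    """Maximum-likelihood loss -log p_theta(y* | x) for one sequence.

    logits: hypothetical (T, V) decoder outputs computed with teacher forcing
        on the ground-truth prefix y*_{1:t-1}.
    y_star: (T,) ground-truth token ids.
    """
    # Sums -log softmax(logits)[t, y*_t] over the T steps, i.e. the negative
    # sequence log-likelihood of the factorized model.
    return F.cross_entropy(logits, y_star, reduction="sum")
```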

RL
The entropy-regularized expected reward:

\[ \begin{align} \mathcal L_{RL}(\theta; \tau) & = - \sum_{(x, y^\star)} \left[ \tau \mathbb H(p_\theta(y|x)) + \sum_{y\in\mathcal Y} p_\theta(y | x) r(y,y^\star) \right]\\ & = \sum_{(x, y^\star)} \tau D_{KL}\left(p_\theta(y | x) || q_\tau(y|y^\star) \right) + cst \end{align} \]

where \(q\_\tau(y|y^\star) = \dfrac{1}{Z} \exp\left(\dfrac{r(y, y^\star)}{\tau}\right)\) is the exponentiated payoff distribution. The RL objective is typically optimized with REINFORCE.
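
On a small, enumerable output space, \(q\_\tau\) is just a softmax of the rewards at temperature \(\tau\); a toy sketch (the reward values would come from whatever task metric is used, e.g. negative Hamming distance):

```python
import numpy as np

def exponentiated_payoff(rewards: np.ndarray, tau: float) -> np.ndarray:
    """q_tau(y | y*) proportional to exp(r(y, y*) / tau) over an enumerated space.

    rewards: array of r(y, y*) for every candidate output y.
    """
    logits = rewards / tau
    logits -= logits.max()          # for numerical stability
    q = np.exp(logits)
    return q / q.sum()              # dividing by the sum normalises by Z

# Toy usage: three candidates with rewards 0 (the ground truth), -1, -2.
print(exponentiated_payoff(np.array([0.0, -1.0, -2.0]), tau=1.0))   # mass spread out
print(exponentiated_payoff(np.array([0.0, -1.0, -2.0]), tau=0.1))   # ~all mass on y*
```

The two usage lines also illustrate the temperature effect discussed below: as \(\tau \to 0\), the mass of \(q\_\tau\) collapses onto the highest-reward output \(y^\star\).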

RAML

\[ \begin{align} \mathcal L_{RAML}(\theta; \tau) & = - \sum_{(x, y^\star)} \sum_{y\in\mathcal Y} q_\tau(y|y^\star)\log p_\theta(y | x)\\ & = \sum_{(x, y^\star)} D_{KL}\left(q_\tau(y|y^\star) || p_\theta(y | x)\right) + cst \\ \end{align} \]

RAML is similar to ML in the direction of the KL divergence (target distribution first, model second), and similar to RL in that its target is the same optimal distribution \(q\_\tau\).

The temperature \(\tau\) controls the concentration of \(q\_\tau\) : \(q\_\tau(y|y^\star) \xrightarrow[\tau \to 0]{} 1_{[y=y^\star]}\)
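
On a toy output space small enough to enumerate, \(\mathcal L\_{RAML}\) can be computed exactly as a cross-entropy against \(q\_\tau\); a self-contained sketch with hypothetical inputs:

```python
import numpy as np

def raml_loss_exact(model_log_probs: np.ndarray, rewards: np.ndarray, tau: float) -> float:
    """Exact L_RAML for one (x, y*) pair on a fully enumerated output space.

    model_log_probs: log p_theta(y | x) for every candidate y.
    rewards: r(y, y*) for every candidate y.
    """
    logits = rewards / tau
    q = np.exp(logits - logits.max())
    q /= q.sum()                                   # q_tau(y | y*)
    return float(-(q * model_log_probs).sum())     # cross-entropy = KL(q || p_theta) + const
```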

RAML Optimization:

\[ \nabla_\theta \mathcal L_{RL}(\theta; \tau) = - \sum_{(x, y^\star)} \mathbb E_{\tilde y \sim p_\theta(y | x) }\left[ \nabla_\theta \log p_\theta(\tilde y | x) \cdot r(\tilde y, y^\star)\right] \]

(the contribution of the entropy term to the gradient is omitted here)
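
A single-sample REINFORCE surrogate for this estimator might look like the sketch below; `sampled_log_prob` is assumed to be the differentiable log-probability of a sequence \(\tilde y\) drawn from the current model, and no variance-reducing baseline is shown.

```python
import torch

def reinforce_surrogate(sampled_log_prob: torch.Tensor, reward: float) -> torch.Tensor:
    """Surrogate loss whose gradient is -r(y_tilde, y*) * grad log p_theta(y_tilde | x).

    sampled_log_prob: scalar tensor log p_theta(y_tilde | x) for y_tilde ~ p_theta(.|x),
        built from the model so it carries gradients.
    reward: scalar task reward r(y_tilde, y*), treated as a constant w.r.t. theta.
    """
    return -reward * sampled_log_prob
```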

== Sampling from the evolving model is much slower than sampling from the stationary distribution \(q\). ==

\[ \nabla_\theta \mathcal L_{RAML}(\theta; \tau) = - \sum_{(x, y^\star)} \mathbb E_{\tilde y \sim q(y | y^\star; \tau) }\left[ \nabla_\theta \log p_\theta(\tilde y | x)\right] \]

In practice, one samples a single augmented output \(\tilde y\) per input \(x\).
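
A RAML training step therefore looks exactly like a maximum-likelihood step on an augmented target. A sketch, assuming a hypothetical teacher-forced decoder `decode_logits(x, y)` and a hypothetical `sample_from_q(y_star, tau)` sampler (such as the stratified Hamming-distance sampler sketched further below):

```python
import torch
import torch.nn.functional as F

def raml_step(decode_logits, sample_from_q, optimizer, x, y_star, tau):
    """One RAML update: cross-entropy towards a sample from q(.|y*; tau).

    decode_logits(x, y) -> (T, V): hypothetical decoder run with teacher
        forcing on the target sequence y.
    sample_from_q(y_star, tau) -> list of token ids: hypothetical sampler
        from the exponentiated payoff distribution.
    """
    y_tilde = torch.tensor(sample_from_q(y_star, tau))        # one augmentation per input
    logits = decode_logits(x, y_tilde)                         # (T, V)
    loss = F.cross_entropy(logits, y_tilde, reduction="sum")   # -log p_theta(y_tilde | x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the reward never enters the loss itself; it only shapes the distribution the augmented targets are sampled from.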

To compute the RAML gradient, one needs to sample auxiliary outputs from the exponentiated payoff distribution \(q(y|y^\star; \tau)\). When the reward is based on the Hamming or edit distance, one can use stratified sampling:

If \(r(y,y^\star) = -D\_H(y, y^\star)\), one can draw exact samples from \(q\): with \(\mathcal Y = \\{1,...,v\\}^m\), we have \(0 \leq D\_H(y, y^\star) \leq m\) and \(card\left(\\{y\in \mathcal Y| D\_H(y,y^\star)=k\\}\right) = {m \choose k} (v-1)^k\).

To compute \(Z\), we simply sum \({m \choose k} (v-1)^k \exp(-k/\tau)\) over \(k\).
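
A sketch of this stratified sampler for \(r = -D\_H\) (assumed names; once the number of edits \(k\) is drawn, the edited positions and replacement symbols are chosen uniformly, which gives exact samples from \(q\)):

```python
import numpy as np
from math import lgamma, log

def sample_from_hamming_q(y_star, vocab_size, tau, rng=None):
    """Exact sample from q(y | y*; tau) proportional to exp(-D_H(y, y*) / tau).

    Stratified over the Hamming distance k: there are C(m, k) * (v-1)^k
    sequences at distance k, so  p(k) proportional to C(m, k) * (v-1)^k * exp(-k / tau).
    """
    rng = rng if rng is not None else np.random.default_rng()
    m, v = len(y_star), vocab_size
    # log-weight of each stratum, computed in log space for stability
    log_w = np.array([
        lgamma(m + 1) - lgamma(k + 1) - lgamma(m - k + 1)   # log C(m, k)
        + k * log(v - 1) - k / tau
        for k in range(m + 1)
    ])
    p_k = np.exp(log_w - log_w.max())
    p_k /= p_k.sum()            # the normaliser here is Z (up to the max-shift)

    k = int(rng.choice(m + 1, p=p_k))                        # 1) number of substitutions
    positions = rng.choice(m, size=k, replace=False)         # 2) which tokens to replace
    y = list(y_star)
    for i in positions:
        # 3) a uniformly random symbol different from the original one
        y[i] = int(rng.choice([s for s in range(v) if s != y_star[i]]))
    return y

# Toy usage: vocabulary of size 5, ground-truth sequence of length 8, tau = 0.9.
print(sample_from_hamming_q([1, 2, 3, 4, 0, 1, 2, 3], vocab_size=5, tau=0.9))
```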

###### RAML machine translation

== For each input sentence, a single output is sampled at each step according to the exponentiated negative Hamming distance (edit distance without insertions and deletions). Sampling according to exponentiated sentence-level BLEU scores is also possible, but the paper leaves it to future work. ==

Best performance is obtained with \(\tau = 0.85\) and \(\tau = 0.9\).