Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans
NIPS 2016 | arXiv
Given a dataset of input-output pairs \(\mathcal D = \{ (x^{(i)}, y^{\star(i)})\}\_{i=1}^N\), we learn a conditional distribution \(p\_\theta(y|x)\) consistent with \(\mathcal D\):
\[ p_\theta(y | x) = \prod_{t=1}^T p_\theta(y_t | x, y_{1:t-1}) \]
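As a minimal sketch of this factorization (not the paper's code; `next_token_logits` is a hypothetical stand-in for the trained decoder), the sequence log-likelihood is just a sum of per-step log-probabilities:

```python
import numpy as np

def next_token_logits(x, prefix, vocab_size=5):
    # Hypothetical stand-in for a trained decoder network.
    rng = np.random.default_rng(abs(hash((x, tuple(prefix)))) % 2**32)
    return rng.normal(size=vocab_size)

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def sequence_log_prob(x, y):
    # log p(y|x) = sum_t log p(y_t | x, y_{1:t-1})  (the factorization above)
    return sum(log_softmax(next_token_logits(x, y[:t]))[tok]
               for t, tok in enumerate(y))

print(sequence_log_prob("a source sentence", [1, 3, 0, 2]))
```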
Beam search finds:\[ \hat y(x) \approx \arg\max_y p_\theta(y | x) \]
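A toy beam-search sketch under the same assumed interface (`toy_log_probs` is a dummy model): keep the `beam_size` best prefixes, expand each by its top next tokens, and reprune:

```python
import numpy as np

def toy_log_probs(x, prefix, vocab=5):
    # Deterministic toy stand-in for log p(y_t | x, y_{1:t-1}).
    rng = np.random.default_rng(len(prefix))
    z = rng.normal(size=vocab)
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def beam_search(x, next_log_probs, beam_size=4, max_len=10, eos=0):
    beams = [([], 0.0)]  # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:        # finished hypothesis: keep as-is
                candidates.append((prefix, score))
                continue
            lp = next_log_probs(x, prefix)
            for tok in np.argsort(lp)[-beam_size:]:  # top-k one-step expansions
                candidates.append((prefix + [int(tok)], score + lp[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]  # approximate argmax of p(y|x)

print(beam_search("example input", toy_log_probs))
```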
The reward signal, e.g. BLEU score or negative edit distance, measures the quality of the predictions: \(\sum\_{(x, y^\star)} r(\hat y(x), y^\star)\).
\[ \begin{align} \mathcal L_{ML}(\theta) & = - \sum_{(x, y^\star)} \log p_\theta(y^\star | x)\\ & = \sum_{(x, y^\star)} D_{KL}\left(1_{[y=y^\star]} || p_\theta(y | x)\right) \\ \end{align} \]
\[ \begin{align} \mathcal L_{RL}(\theta; \tau) & = - \sum_{(x, y^\star)} \left[ \tau \mathbb H(p_\theta(y|x)) + \sum_{y\in\mathcal Y} p_\theta(y | x) r(y,y^\star) \right]\\ & = \sum_{(x, y^\star)} \tau D_{KL}\left(p_\theta(y | x) || q_\tau(y|y^\star) \right) + cst \end{align} \]
where \(q\_\tau(y|y^\star) = \dfrac{1}{Z} \exp\left(\dfrac{r(y, y^\star)}{\tau}\right)\) is the exponentiated payoff distribution. \(\mathcal L\_{RL}\) is optimized with REINFORCE.
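To fill in the step between the two lines of \(\mathcal L\_{RL}\) above: expanding \(\log q\_\tau = r/\tau - \log Z\) inside the KL recovers the negated entropy-regularized expected reward, so the constant is \(-\tau \log Z\) per example:

\[ \begin{align} \tau D_{KL}\left(p_\theta(y | x) || q_\tau(y|y^\star)\right) & = \tau \sum_{y\in\mathcal Y} p_\theta(y|x) \left[ \log p_\theta(y|x) - \frac{r(y,y^\star)}{\tau} + \log Z \right] \\ & = - \tau \mathbb H(p_\theta(y|x)) - \sum_{y\in\mathcal Y} p_\theta(y | x)\, r(y,y^\star) + \tau \log Z \end{align} \]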
\[ \begin{align} \mathcal L_{RAML}(\theta; \tau) & = - \sum_{(x, y^\star)} \sum_{y\in\mathcal Y} q_\tau(y|y^\star)\log p_\theta(y | x)\\ & = \sum_{(x, y^\star)} D_{KL}\left(q_\tau(y|y^\star) || p_\theta(y | x)\right) + cst \\ \end{align} \]
RAML is similar to ML in the direction of the KL divergence (target \(||\) model), and similar to RL in the optimal conditional distribution (both losses are minimized by \(p\_\theta = q\_\tau\)).
The temperature \(\tau\) controls the concentration of \(q\_\tau\): \(q\_\tau(y|y^\star) \xrightarrow[\tau \to 0]{} 1_{[y=y^\star]}\)
\[ \nabla_\theta \mathcal L_{RL}(\theta; \tau) = - \sum_{(x, y^\star)} \mathbb E_{\tilde y \sim p_\theta(y | x) }\left[ \nabla_\theta \log p_\theta(\tilde y | x) \cdot r(\tilde y, y^\star)\right] \]
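A hedged numpy sketch of this REINFORCE estimator on a toy categorical model, where the output space is small enough to enumerate and \(\nabla\_\theta \log p\_\theta\) has a closed form (the real model is a seq2seq network):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_grad(theta, reward, n_samples=1000, seed=0):
    # Monte-Carlo estimate of -grad L_RL for one (x, y*) pair (entropy term omitted).
    # Assumption: the "model" is a bare softmax over an enumerable output space,
    # so grad_theta log p_theta(y) = one_hot(y) - p in closed form.
    rng = np.random.default_rng(seed)
    p = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        y = rng.choice(len(p), p=p)                  # y ~ p_theta(.|x): the evolving model
        grad += (np.eye(len(p))[y] - p) * reward[y]  # score function weighted by r(y, y*)
    return grad / n_samples

rewards = np.array([0.0, -1.0, -2.0, -3.0])  # e.g. negative Hamming distances
print(reinforce_grad(np.zeros(4), rewards))
```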
== Sampling from the evolving model is much slower than sampling from the stationary distribution \(q\). ==

\[ \nabla_\theta \mathcal L_{RAML}(\theta; \tau) = - \sum_{(x, y^\star)} \mathbb E_{\tilde y \sim q_\tau(y | y^\star) }\left[ \nabla_\theta \log p_\theta(\tilde y | x)\right] \]
In practice, one usually samples a single augmentation \(\tilde y\) per input \(x\).
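For contrast, the same toy setup with the RAML gradient: samples are drawn from the fixed \(q\_\tau\), and each sampled target receives a plain, unweighted log-likelihood gradient:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def raml_grad(theta, reward, tau=0.9, n_samples=1, seed=0):
    # Samples come from the *stationary* q_tau (it never changes during training),
    # and the sampled targets get ordinary maximum-likelihood gradients.
    rng = np.random.default_rng(seed)
    q = softmax(np.asarray(reward) / tau)    # q_tau(y|y*) proportional to exp(r(y, y*)/tau)
    p = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):               # usually a single sample per input
        y = rng.choice(len(q), p=q)          # y ~ q_tau, not the evolving model
        grad += np.eye(len(p))[y] - p        # grad_theta log p_theta(y)
    return grad / n_samples

rewards = np.array([0.0, -1.0, -2.0, -3.0])
print(raml_grad(np.zeros(4), rewards))
```

Since \(q\_\tau\) does not depend on \(\theta\), these augmentations can even be precomputed offline.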
To compute the gradient with the RAML approach, one needs to sample auxiliary outputs from the exponentiated payoff distribution \(q\_\tau(y|y^\star)\). When the reward is based on Hamming or edit distance, stratified sampling can be used:
If \(r(y,y^\star) = -D\_H(y, y^\star)\), one can draw exact samples from \(q\). If \(\mathcal Y = \\{1,...,v\\}^m\), then the Hamming distance takes values \(k \in \\{0,...,m\\}\) and \(card\left(\\{y\in \mathcal Y \mid D\_H(y,y^\star)=k\\}\right) = {m \choose k} (v-1)^k\).
To compute \(Z\), we simply sum over \(k\): \(Z = \sum\_{k=0}^m {m \choose k} (v-1)^k e^{-k/\tau}\).
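A minimal sketch of this stratified sampler (assuming tokens encoded as \(0,\dots,v-1\) and fixed sequence length \(m\)): first draw the distance \(k\) with probability proportional to \({m \choose k}(v-1)^k e^{-k/\tau}\), then corrupt \(k\) uniformly chosen positions:

```python
import math
import numpy as np

def sample_hamming_augmentation(y_star, v, tau, seed=0):
    # Exact sampling from q_tau when r = -Hamming distance and tokens
    # live in {0, ..., v-1} (assumed encoding).
    rng = np.random.default_rng(seed)
    m = len(y_star)
    # Unnormalized mass of the stratum at distance k: C(m,k) (v-1)^k e^(-k/tau).
    # The normalizer Z is simply the sum of these terms over k.
    log_w = np.array([math.log(math.comb(m, k)) + k * math.log(v - 1) - k / tau
                      for k in range(m + 1)])
    w = np.exp(log_w - log_w.max())
    k = int(rng.choice(m + 1, p=w / w.sum()))       # draw the exact distance k
    y = list(y_star)
    for pos in rng.choice(m, size=k, replace=False):
        y[pos] = (y[pos] + rng.integers(1, v)) % v  # uniform over the v-1 other tokens
    return y

# e.g. augment a length-6 target over a vocabulary of size 10
print(sample_hamming_augmentation([1, 2, 3, 4, 5, 6], v=10, tau=0.9))
```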
###### RAML machine translation
== For each input sentence, we sample a single output at each training step w.r.t. the Hamming distance (edit distance without insertions and deletions). One could sample according to exponentiated sentence BLEU scores, but the authors leave that to future work. ==
Best performance is obtained with \(\tau = 0.85\) and \(\tau = 0.9\).