It is preferable to optimize w.r.t. an evaluation metric rather than the MLE objective. Caveat: such metrics are not differentiable.
Approach already in use in SMT (add ref to Och et al. 2003).
Not unlike MIXER (link), this approach minimizes the expected risk on the training data.
Slightly different terminology: instead of a reward, use a discrepancy/distance \(\Delta(y, y^*)\), where \(y\) is generated by the model and \(y^*\) is the gold standard.
The optimized loss is thus: \[ E_{p_\theta}[ \Delta(y, y^*)]\]
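Since \(\Delta\) itself is not differentiable w.r.t. \(\theta\), the gradient is taken through the distribution instead, via the standard log-derivative (score-function) identity:

\[ \nabla_\theta\, E_{p_\theta}[\Delta(y, y^*)] = E_{p_\theta}\big[\, \Delta(y, y^*)\, \nabla_\theta \log p_\theta(y) \,\big] \]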
The expectation above (and its gradient) is intractable due to the exponentially large search space \(\mathcal Y\), the non-decomposability of \(\Delta\), and the context sensitivity of NMT.
To alleviate the problem, consider a subset of the search space \(S \subset \mathcal Y\):
\[ \tilde R = \sum_{y \in S} Q(y, \alpha) \Delta(y, y^*) \]
\(Q\) is a distribution defined on \(S\):
\[ Q(y, \alpha) \propto p_\theta(y)^\alpha, \]
where \(\alpha\) controls the sharpness.
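A minimal sketch of the subsampled risk \(\tilde R\): the renormalization of \(Q\) over \(S\) is just a softmax of \(\alpha \cdot \log p_\theta(y)\). The function name and arguments are illustrative, not from any particular codebase; `log_probs` stands in for the model's scores of the candidates in \(S\).

```python
import numpy as np

def renormalized_risk(log_probs, deltas, alpha=5e-3):
    """Approximate expected risk over a sampled subset S.

    log_probs: log p_theta(y) for each candidate y in S.
    deltas: discrepancy Delta(y, y*) for each candidate.
    alpha: sharpness of the proposal distribution Q.
    """
    # Q(y, alpha) ∝ p_theta(y)^alpha, renormalized over S:
    # equivalent to softmax(alpha * log p_theta(y)).
    scaled = alpha * np.asarray(log_probs, dtype=float)
    scaled -= scaled.max()          # numerical stability
    q = np.exp(scaled)
    q /= q.sum()
    # \tilde R = sum_{y in S} Q(y, alpha) * Delta(y, y*)
    return float(np.dot(q, deltas))
```

Two sanity checks on the role of \(\alpha\): at \(\alpha = 0\), \(Q\) is uniform over \(S\) and \(\tilde R\) is the mean discrepancy; as \(\alpha\) grows, \(Q\) concentrates on the highest-probability candidates.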
\(S\) is a set of \(k\) distinct candidates sampled from the model in decoding mode (sampling, not top-\(k\) search; \(k\) is around 100).
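Building \(S\) can be sketched as rejection of duplicates while sampling: draw candidates from the model until \(k\) distinct ones are collected. Here `sample_fn` is a hypothetical stand-in for ancestral sampling of one sequence from \(p_\theta\); any real NMT decoder would take its place.

```python
import numpy as np

def sample_candidates(sample_fn, k=100, max_tries=10_000, seed=0):
    """Collect up to k *distinct* candidate sequences by repeated
    sampling from the model (not a top-k beam).

    sample_fn: callable(rng) -> one sampled sequence (list of token ids);
               a hypothetical stand-in for sampling from p_theta.
    """
    rng = np.random.default_rng(seed)
    seen = set()
    for _ in range(max_tries):
        y = tuple(sample_fn(rng))
        seen.add(y)                 # duplicates are silently discarded
        if len(seen) == k:
            break
    return sorted(seen)
```

Note that for a weak or very peaked model, repeated sampling may return the same few sequences; capping the number of draws (`max_tries`) keeps the loop bounded even when fewer than \(k\) distinct candidates exist.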