Vlad Zhukov, Maksim Kretov
NIPS 2017 Workshop | arXiv | code*
Loss-evaluation mismatch: optimizing a surrogate loss different from the non-differentiable evaluation metric
A method for estimating a differentiable lower bound of the expected BLEU score, without extensive sampling.
BLEU n-gram precisions are computed as follows:
\[ p_n = \frac{\sum_{C\in \text{Candidates}} \sum_{ng\in C} \min(\text{count}_C(ng), \text{count}_R(ng))}{\sum_{C\in \text{Candidates}} \sum_{ng\in C} \text{count}_C(ng)}, \]
then \[ \text{BLEU} = BP \cdot \exp\left( \sum_n w_n \log p_n\right), \] where \(BP = \min(1, e^{1-\frac{r}{c}})\), with \(r\) and \(c\) the lengths of the reference and the candidate respectively.
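For concreteness, a minimal single-reference BLEU sketch (my own illustration, not the authors' code), with uniform weights \(w_n = 1/4\) and only a small floor on the clipped counts to keep the log finite:

```python
# Minimal single-reference BLEU: clipped n-gram precisions + brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    c, r = len(candidate), len(reference)
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # clipped counts: min(count_C(ng), count_R(ng)), summed over n-grams
        clipped = sum(min(v, ref[ng]) for ng, v in cand.items())
        log_p += (1.0 / max_n) * math.log(max(clipped, 1e-9) / max(c - n + 1, 1))
    bp = min(1.0, math.exp(1.0 - r / c))  # brevity penalty
    return bp * math.exp(log_p)

print(bleu("the cat sat on the mat".split(),
           "the cat is on the mat".split()))
```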
Assuming a single reference for each candidate and a vocabulary of size \(V\), define \(X\) as the \(c\times V\) matrix whose rows are the one-hot encodings of the candidate's words, and \(Y\) as the analogous \(r\times V\) matrix for the reference. Let \(S^n\) and \(P^n\) be the matrices defined as:
\[ S_{i,j}^n = \prod_{k=0}^{n-1} x_{i+k}\cdot x_{j+k}, \qquad P_{i,j}^n = \prod_{k=0}^{n-1} y_{i+k}\cdot x_{j+k}, \] where \(x_i\), \(y_i\) denote rows of \(X\), \(Y\); each dot product is 1 iff the two words match, so \(S^n_{i,j} = 1\) iff the candidate n-grams starting at positions \(i\) and \(j\) coincide, and likewise \(P^n_{i,j}\) for a reference n-gram against a candidate one. The counts are retrieved by summing column values, i.e.
\[ v_i^{x,n} = \sum_{j=1}^{c-n+1} S^n_{j,i}, \qquad v_i^{y,n} = \sum_{j=1}^{r-n+1} P^n_{j,i}. \]
The numerator of \(p_n\) can then be written as:
\[ O_n = \sum_{i=1}^{c-n+1} \frac{\min(v_i^{x,n}, v_i^{y,n})}{v_i^{x,n}}, \] and BLEU's n-gram precision:
\[ p_n = \frac{O_n}{c - n +1} \]
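A numpy sketch of this matrix construction (my reconstruction from the formulas above, not the released code). On one-hot inputs it reproduces the Counter-based precisions; the point of the formulation is that \(X\) can later be replaced by a soft probability matrix:

```python
# Matrix form of n-gram precision: S^n/P^n mark matching n-grams,
# column sums give the counts v used in O_n and p_n.
import numpy as np

def one_hot(tokens, vocab):
    M = np.zeros((len(tokens), len(vocab)))
    for i, t in enumerate(tokens):
        M[i, vocab[t]] = 1.0
    return M

def precision_n(X, Y, n):
    c, r = X.shape[0], Y.shape[0]
    XX = X @ X.T  # XX[i, j] = 1 iff candidate words i and j are equal
    YX = Y @ X.T  # YX[i, j] = 1 iff reference word i equals candidate word j
    # S[i, j] = prod_k XX[i+k, j+k]; P[i, j] = prod_k YX[i+k, j+k]
    S = np.ones((c - n + 1, c - n + 1))
    P = np.ones((r - n + 1, c - n + 1))
    for k in range(n):
        S *= XX[k:k + c - n + 1, k:k + c - n + 1]
        P *= YX[k:k + r - n + 1, k:k + c - n + 1]
    v_x = S.sum(axis=0)  # v^{x,n}: count of each candidate n-gram in the candidate
    v_y = P.sum(axis=0)  # v^{y,n}: count of each candidate n-gram in the reference
    O = np.sum(np.minimum(v_x, v_y) / v_x)  # clipped matches O_n
    return O / (c - n + 1)  # p_n

vocab = {t: i for i, t in enumerate("the cat sat is on mat".split())}
X = one_hot("the cat sat on the mat".split(), vocab)
Y = one_hot("the cat is on the mat".split(), vocab)
print([precision_n(X, Y, n) for n in (1, 2)])  # [5/6, 3/5]
```

Note \(v_i^{x,n} \geq 1\) always (every n-gram matches itself via the diagonal of \(S^n\)), so the division is safe.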
Replacing the one-hot \(X\) with the model's token probabilities \(p_x\) (with \(p_x^{i,m}\) the probability of emitting token \(m\) at position \(i\)), the expectation of the \(i\)-th summand \(O_n^i\) of \(O_n\) is lower-bounded by:
\[ \mathbb{E}_{x\sim p_x}[O_n^i] \geq \sum_{m_0} p_x^{i, m_0} \cdots \sum_{m_{n-1}} p_x^{i+n-1,\, m_{n-1}} \min\left(1,\; \frac{\sum_j \prod_{k=0}^{n-1} y_{j+k,\, m_k}}{1 + \sum_{l\neq i} \prod_{k=0}^{n-1} p_x^{l+k,\, m_k}}\right). \]
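A naive sketch of the bound for one position \(i\) (my reading of the inequality; the explicit sum over \((m_0,\dots,m_{n-1})\) is exponential in \(n\), which a practical implementation would presumably avoid):

```python
# Lower bound on E[O_n^i]: sum over all possible n-grams m, weighting the
# clipped ratio by the probability of emitting m at positions i..i+n-1.
# p_x is the model's c x V matrix of per-position token probabilities,
# Y the r x V one-hot reference.
import numpy as np
from itertools import product

def lower_bound_O_i(p_x, Y, n, i):
    c, V = p_x.shape
    r = Y.shape[0]
    bound = 0.0
    for m in product(range(V), repeat=n):  # every possible n-gram (m_0..m_{n-1})
        prob_i = np.prod([p_x[i + k, m[k]] for k in range(n)])
        # count of n-gram m in the reference (exact, since Y is one-hot)
        ref_count = sum(np.prod([Y[j + k, m[k]] for k in range(n)])
                        for j in range(r - n + 1))
        # expected count of m at the other candidate positions l != i
        other = sum(np.prod([p_x[l + k, m[k]] for k in range(n)])
                    for l in range(c - n + 1) if l != i)
        bound += prob_i * min(1.0, ref_count / (1.0 + other))
    return bound

# Tiny example with an assumed vocabulary of size 3.
rng = np.random.default_rng(0)
p_x = rng.dirichlet(np.ones(3), size=4)  # candidate length c = 4, V = 3
Y = np.eye(3)[[0, 1, 2, 1]]              # reference token ids: 0 1 2 1
print(lower_bound_O_i(p_x, Y, n=2, i=0))
```

Every term is built from products, sums, and a min, so the bound is (sub)differentiable in \(p_x\); numpy is used here only to show the arithmetic, and for training the same expression would be written in an autodiff framework.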
There are quite a few errors in the paper's math. I haven't tested it yet.