Xing Wang, Zhaopeng Tu, Deyi Xiong, Min Zhang
NMT mostly generates word-by-word (or character-by-character); this makes it hard to translate multi-word expressions/phrases/idioms, where the meaning of the phrase is more than the sum of its words' meanings.
The idea: integrate a phrase-based SMT model into the NMT model. The SMT model, guided by the NMT, proposes a set of relevant phrases; the NMT scores the proposed phrases and selects the most probable one.
A sequence \(y\) can be decomposed into words \((w_1, \dots, w_K)\) generated by the NMT and phrases \((p_1, \dots, p_L)\) generated by the SMT.
The probability of generating the sequence is defined as: \[ p(y) = \prod_{w} \left(1-\lambda_{t(w)}\right) P_{word}(w) \times \prod_{p} \lambda_{t(p)} P_{phrase}(p) \] where \(t(\cdot)\) is the decoding step corresponding to the word (resp. the phrase), and \(\lambda \in [0,1]\) is estimated by the *balancer*, an MLP taking as input the NMT's context vector, the previous decoding state, and the previously generated word. Intuitively, it is the importance weight of the phrase mode over the word mode.
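To make the word/phrase mixture concrete, here is a minimal PyTorch sketch (my own illustration, not the paper's code; the balancer architecture, the dimensions, and the per-step boolean mask are assumptions):

```python
import torch
import torch.nn as nn

class Balancer(nn.Module):
    """MLP estimating lambda_t in [0, 1], the weight of the phrase mode
    over the word mode at decoding step t. Inputs mirror the description
    above: NMT context vector, previous decoder state, and the embedding
    of the previously generated word. Sizes are illustrative."""
    def __init__(self, ctx_dim, state_dim, emb_dim, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(ctx_dim + state_dim + emb_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),  # constrain lambda to [0, 1]
        )

    def forward(self, context, prev_state, prev_word_emb):
        return self.mlp(
            torch.cat([context, prev_state, prev_word_emb], dim=-1)
        ).squeeze(-1)

def sequence_log_prob(word_logp, phrase_logp, lam, is_phrase_step):
    """log p(y) = sum over word steps of log(1 - lambda_{t(w)}) + log P_word(w)
               + sum over phrase steps of log(lambda_{t(p)}) + log P_phrase(p),
    with one entry per decoding step and a boolean mask marking the steps
    where a phrase (rather than a word) was generated."""
    word_terms = torch.log1p(-lam) + word_logp
    phrase_terms = torch.log(lam) + phrase_logp
    return torch.where(is_phrase_step, phrase_terms, word_terms).sum()

# Toy example: 5 decoding steps, steps 2 and 3 covered by SMT phrases.
T, ctx_dim, state_dim, emb_dim = 5, 16, 32, 8
balancer = Balancer(ctx_dim, state_dim, emb_dim)
lam = balancer(torch.randn(T, ctx_dim), torch.randn(T, state_dim),
               torch.randn(T, emb_dim))
mask = torch.tensor([False, False, True, True, False])
logp = sequence_log_prob(-torch.rand(T), -torch.rand(T), lam, mask)
print(logp)  # scalar log-probability of the mixed word/phrase sequence
```

Working in log space avoids underflow when multiplying many per-step probabilities; the mask is just a compact way to route each step to the word or phrase term of the product.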
## To be continued...
## Check the following:
- Encoder-Decoder models with attached external structures (Gulcehre et al. 2016, Gu et al. 2016, Tang et al. 2016, and Wang et al. 2017)