Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
ICLR 2015 | arxiv |
Standard encoder-decoder models encode a source sentence into a fixed-length vector, which is then decoded into the target sentence. The paper identifies this fixed-length vector as a bottleneck: it makes it difficult to cope with long sentences.
The proposed alternative 'aligns and translates jointly' as follows (see the sketch after this list):
- each time the model generates a word in the translation,
- it searches for a set of positions in the source sentence where the most relevant information is concentrated => context vector,
- it then predicts the target word based on this context vector and the previously generated target words.
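A minimal Python sketch of this decoding loop. The helper functions `align`, `update_state`, and `readout` are hypothetical placeholders for the attention, recurrence, and output steps (they are not names from the paper), so this only illustrates the control flow:

```python
def decode(annotations, s0, max_len, align, update_state, readout, bos, eos):
    """Sketch of attention-based decoding.

    annotations: encoder outputs h_1..h_{T_x} as a (T_x, d) array.
    s0: initial decoder state. The three helpers are placeholders.
    """
    s, y, output = s0, bos, []
    for _ in range(max_len):
        alpha = align(s, annotations)   # where to look in the source (weights over positions)
        c = alpha @ annotations         # context vector: weighted sum of annotations
        s = update_state(s, y, c)       # recurrent decoder state update
        y = readout(y, s, c)            # predict the next target word
        output.append(y)
        if y == eos:
            break
    return output
```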
In the basic encoder-decoder, the decoder conditions every prediction on the same fixed context vector \(c\):\[p(y_t\vert\{y_1,\ldots,y_{t-1}\}, x) = g(y_{t-1}, s_t, c)\]
Here, instead, the context varies at each prediction step:\[p(y_t\vert\{y_1,\ldots,y_{t-1}\}, x) = g(y_{t-1}, s_t, c_t)\]
Here \(s_t\) is an RNN hidden state as usual, but this time it depends on the step-specific context vector:\[s_t = f(s_{t-1}, y_{t-1}, c_t)\]
The encoder maps the input sequence \(x\) to a sequence of annotations\[\mathbf h = (h_1,\ldots,h_{T_x})\]
Each \(c_t\) is a weighted sum of these annotations:\[c_t = \sum_{i=1}^{T_x} \alpha_{ti} h_i = \boldsymbol\alpha_t^T \mathbf h\]
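As a concrete toy example of this weighted sum (the numbers below are illustrative, not from the paper), with \(T_x = 3\) annotations and weights concentrated on the second source position:

```python
import numpy as np

# Three annotations h_1..h_3 (T_x = 3), each of dimension 2 (toy values).
h = np.array([[0.1, 0.9],
              [0.8, 0.2],
              [0.4, 0.4]])

# Attention weights alpha_t for one decoding step; they sum to 1.
alpha_t = np.array([0.1, 0.7, 0.2])

# Context vector c_t = alpha_t^T h: a weighted average of the annotations.
c_t = alpha_t @ h
print(c_t)  # [0.65 0.31] -- dominated by h_2, the most attended position
```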
The weight vector \(\boldsymbol\alpha_t\) is computed as:
\[\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j=1}^{T_x} \exp(e_{tj})}\]
where:\[e_{ti} = a(s_{t-1}, h_i)\]
is an alignment model which scores how well the inputs around position \(i\) and the output at position \(t\) match. This model is parameterized as a feedforward neural network and trained jointly with the rest of the system.
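A minimal NumPy sketch of this scoring step, assuming the additive form \(a(s_{t-1}, h_i) = v_a^T \tanh(W_a s_{t-1} + U_a h_i)\) from the paper's appendix; the dimensions and random parameters below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_h, d_a, T_x = 4, 2, 8, 3   # toy sizes: decoder state, annotation, alignment layer, source length

# Alignment model parameters (a small feedforward network, trained jointly with the rest).
W_a = rng.normal(size=(d_a, d_s))
U_a = rng.normal(size=(d_a, d_h))
v_a = rng.normal(size=(d_a,))

def attention(s_prev, h):
    """e_ti = v_a . tanh(W_a s_prev + U_a h_i), softmax over i, then context c_t."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_i) for h_i in h])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()              # softmax over source positions -> alpha_t
    c = alpha @ h                     # context vector: weighted sum of annotations
    return alpha, c

s_prev = rng.normal(size=(d_s,))      # previous decoder state s_{t-1}
h = rng.normal(size=(T_x, d_h))       # encoder annotations h_1..h_{T_x}
alpha_t, c_t = attention(s_prev, h)
print(alpha_t.sum())                  # 1.0
```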
Better translation performance over the basic encoder-decoder approach.
Linguistically plausible alignments between the source and target sentences (qualitative).