Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

ICLR 2015 | arXiv |

Method & key ideas:

Standard encoder-decoder models encode the source sentence into a single fixed-length vector, which the decoder then unfolds into the target sentence. The paper argues that this fixed-length vector is a bottleneck: it makes it difficult for the model to cope with long sentences.

The proposed alternative 'aligns and translates jointly' as follows:

The decoder:

Instead of the usual

\[p(y_t\vert\{y_1,\ldots,y_{t-1}\}, x) = g(y_{t-1}, s_t, c)\]

The context varies at each prediction step:

\[p(y_t\vert\{y_1,\ldots,y_{t-1}\}, x) = g(y_{t-1}, s_t, c_t)\]

where \(s_t\) is an RNN hidden state as usual, but now it also depends on the context vector:

\[s_t = f(s_{t-1}, y_{t-1}, c_t)\]
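As a rough illustration, here is a minimal numpy sketch of one decoder step; it assumes a plain tanh recurrence for \(f\) (the paper uses a gated unit) and a linear-plus-softmax readout for \(g\). The function and parameter names (`decoder_step`, `W_s`, `W_y`, `W_c`, `W_out`) and the toy dimensions are hypothetical, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(s_prev, y_prev_emb, c_t, params):
    """One step: s_t = f(s_{t-1}, y_{t-1}, c_t), then p(y_t | ...) = g(y_{t-1}, s_t, c_t)."""
    # f: simplified tanh RNN update; the new state sees the previous state,
    # the embedding of the previous target word, and the step-specific context c_t.
    s_t = np.tanh(params["W_s"] @ s_prev
                  + params["W_y"] @ y_prev_emb
                  + params["W_c"] @ c_t)
    # g: distribution over the target vocabulary for the next word y_t.
    p_y_t = softmax(params["W_out"] @ np.concatenate([s_t, y_prev_emb, c_t]))
    return s_t, p_y_t

# Toy usage with made-up dimensions.
d_s, d_e, d_h, V = 10, 6, 8, 20
rng = np.random.default_rng(0)
params = {
    "W_s": rng.standard_normal((d_s, d_s)),
    "W_y": rng.standard_normal((d_s, d_e)),
    "W_c": rng.standard_normal((d_s, d_h)),
    "W_out": rng.standard_normal((V, d_s + d_e + d_h)),
}
s_t, p_y_t = decoder_step(np.zeros(d_s), np.zeros(d_e), np.zeros(d_h), params)
```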

The context vector:

This vector depends on a sequence of annotations

\[\mathbf h = (h_1, \ldots, h_{T_x})\]

to which the encoder maps the input sequence \(x\). Each \(c_t\) is a weighted sum of these annotations:

\[c_t = \sum_{i=1}^{T_x} \alpha_{ti} h_i\]
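In code, this weighted sum is just a vector-matrix product over the annotation matrix. A minimal numpy sketch with hypothetical shapes (the weights \(\alpha_t\) come from the alignment model defined next; uniform weights are used here purely for illustration):

```python
import numpy as np

def context_vector(h, alpha_t):
    """c_t = sum_i alpha_{ti} * h_i: a convex combination of the encoder annotations."""
    # h: (T_x, d) annotation matrix, alpha_t: (T_x,) weights summing to 1.
    return alpha_t @ h  # shape (d,)

# Toy usage: 5 source positions, 8-dimensional annotations.
h = np.random.randn(5, 8)
alpha_t = np.full(5, 1 / 5)         # uniform weights, for illustration only
c_t = context_vector(h, alpha_t)    # shape (8,)
```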

The weights \(\alpha_{ti}\) are computed as:

\[\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j=1}^{T_x} \exp(e_{tj})}\]

where:

\[e_{ti} = a(s_{t-1}, h_i)\]

is an alignment model which scores how well the inputs around position \(i\) and the output at position \(t\) match. This model is parameterized as a feedforward neural network and trained jointly with the rest of the system.
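As a sketch of this step: the scores \(e_{ti}\) and weights \(\alpha_{ti}\) for one decoding step can be computed as below, assuming the additive (single-hidden-layer MLP) form \(a(s_{t-1}, h_i) = v_a^\top \tanh(W_a s_{t-1} + U_a h_i)\) described in the paper; the variable names and dimensions here are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(s_prev, h, W_a, U_a, v_a):
    """alpha_t over all source positions for one decoding step.

    e_{ti}     = v_a^T tanh(W_a s_{t-1} + U_a h_i)   (feedforward alignment model)
    alpha_{ti} = exp(e_{ti}) / sum_j exp(e_{tj})     (softmax over source positions)
    """
    # s_prev: (d_s,) previous decoder state, h: (T_x, d_h) encoder annotations.
    e_t = np.tanh(s_prev @ W_a.T + h @ U_a.T) @ v_a   # shape (T_x,)
    return softmax(e_t)

# Toy usage with made-up dimensions.
T_x, d_h, d_s, d_a = 6, 8, 10, 12
rng = np.random.default_rng(0)
h = rng.standard_normal((T_x, d_h))
s_prev = rng.standard_normal(d_s)
W_a = rng.standard_normal((d_a, d_s))
U_a = rng.standard_normal((d_a, d_h))
v_a = rng.standard_normal(d_a)
alpha_t = attention_weights(s_prev, h, W_a, U_a, v_a)  # sums to 1
c_t = alpha_t @ h                                      # context vector for this step
```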


Datasets:

Results: