Sequence to Sequence Learning with Neural Networks

Ilya Sutskever, Oriol Vinyals, Quoc V. Le

NIPS 2014 | arXiv

Encoder:

Given a sequence of vectors (one-hot encodings or any other embedding of the vocabulary words) \(x = (x\_1, x\_2, ..., x\_{T\_x})\)

The encoder embeds it into a context vector \(c\). With an RNN, \(c\) is generated as follows:

\[\forall t ,\: h_t = f(x_t, h_{t-1})\] \[c = q(\{h_1,...,h_{T_x}\})\]

Where \(h_t\) is the RNN hidden state at time \(t\).

\(f\) and \(q\) are nonlinear functions, e.g. \(f\) is an LSTM and \(q\) simply returns the last hidden state, \(q(\{h_1,...,h_{T_x}\}) = h_{T_x}\).
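
A minimal sketch of such an encoder, assuming PyTorch; the class and dimension names (`Encoder`, `vocab_size`, `emb_dim`, `hidden_dim`) are illustrative and not from the paper. Here \(f\) is an LSTM and \(q\) returns the last hidden state:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)              # word ids -> vectors x_t
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)   # f: LSTM recurrence over x_1..x_{T_x}

    def forward(self, src):
        # src: (batch, T_x) integer word ids
        x = self.embed(src)                  # (batch, T_x, emb_dim)
        outputs, (h_n, c_n) = self.rnn(x)    # outputs holds h_t for every t
        # q: take the last hidden state as the context vector c
        return h_n[-1]                       # (batch, hidden_dim)
```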

Decoder:

The decoder predicts the next word of the output sequence \(y\_t\) given the context vector \(c\) and the previously generated words \(y\_1, ..., y\_{t-1}\). The decoder defines a probability over the output sequence \(\mathbf{y} = (y\_1, ..., y\_T)\) via the chain rule:

\[p(y) = \prod_{t=1}^T p(y_t\vert\{y_1,..,y_{t-1}\}, c)\]
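For instance, for a three-word output the factorization unrolls as:

\[p(y_1, y_2, y_3 \vert c) = p(y_1 \vert c)\, p(y_2 \vert y_1, c)\, p(y_3 \vert y_1, y_2, c)\]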

With an RNN, each conditional probability is modeled as:

\[p(y_t\vert\{y_1,..,y_{t-1}\}, c) = g(y_{t-1}, s_t, c) \]

Where \(g\) is a nonlinear function (typically ending in a softmax over the vocabulary) and \(s_t\) is the hidden state of the decoder RNN at time \(t\).
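
A minimal sketch of one decoder step, again assuming PyTorch. Here \(g\) is realized as a linear layer plus softmax over the concatenation of the embedded \(y_{t-1}\), \(s_t\), and \(c\); this is one common choice, not necessarily the paper's exact parameterization, and all names are illustrative:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn_cell = nn.LSTMCell(emb_dim + hidden_dim, hidden_dim)   # updates s_t
        self.out = nn.Linear(emb_dim + 2 * hidden_dim, vocab_size)      # g(y_{t-1}, s_t, c)

    def step(self, y_prev, state, c):
        # y_prev: (batch,) previous word ids; c: (batch, hidden_dim) context vector
        e = self.embed(y_prev)                                           # embed y_{t-1}
        s_t, cell = self.rnn_cell(torch.cat([e, c], dim=1), state)       # s_t = f(y_{t-1}, s_{t-1}, c)
        logits = self.out(torch.cat([e, s_t, c], dim=1))
        probs = torch.softmax(logits, dim=1)                             # p(y_t | y_1..y_{t-1}, c)
        return probs, (s_t, cell)
```

At generation time, `step` would be called in a loop: the word sampled (or taken greedily) from `probs` becomes `y_prev` for the next step, which is how the conditioning on the previously generated words is realized.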