Ilya Sutskever, Oriol Vinyals, Quoc V. Le
NIPS 2014 | arXiv
Given a sequence of vectors (one-hot encodings or any other embedding of the vocabulary words) \(x=(x\_1, x\_2, ..., x\_{T\_x})\),
the encoder embeds it into a single vector \(c\). With an RNN, \(c\) is generated as follows:\[\forall t ,\: h_t = f(x_t, h_{t-1})\] \[c = q(\{h_1,...,h_{T_x}\})\]
where \(h_t\) is the RNN hidden state at time \(t\), and \(f\) and \(q\) are nonlinear functions, e.g. \(f\) is an LSTM cell and \(q\) simply returns the last hidden state: \(q(\{h_1,...,h_{T_x}\}) = h_{T_x}\).
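A minimal PyTorch sketch of such an encoder; the class name, dimensions, and `batch_first` layout are illustration choices of mine, not taken from the paper:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Encodes a source sequence x_1..x_Tx into a fixed-length context vector c."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        # x: (batch, T_x) integer token ids
        embedded = self.embedding(x)              # (batch, T_x, embed_dim)
        outputs, (h_n, c_n) = self.lstm(embedded) # h_t = f(x_t, h_{t-1}) for all t
        # q({h_1, ..., h_Tx}) = h_Tx : the last hidden state acts as the context c
        return h_n, c_n
```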
The decoder then defines a probability over the output sequence \(y=(y_1,...,y_{T})\) by factorizing it into ordered conditionals, each conditioned on the previously generated words and the context vector \(c\):\[p(y) = \prod_{t=1}^T p(y_t\vert\{y_1,...,y_{t-1}\}, c)\]
With an RNN, each conditional probability is modeled as:\[p(y_t\vert\{y_1,...,y_{t-1}\}, c) = g(y_{t-1}, s_t, c) \]
where \(g\) is a nonlinear function that outputs the probability of \(y_t\) (e.g. a softmax over the vocabulary) and \(s_t\) is the hidden state of the decoder RNN at time \(t\).
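A matching sketch of one decoder step, which consumes the previous target token \(y_{t-1}\) and its recurrent state to produce unnormalized scores for \(p(y_t\vert y_{<t}, c)\). Here \(c\) enters only through the initial hidden state rather than as an extra input to \(g\), a common simplification; names and shapes are again illustrative assumptions:

```python
class Decoder(nn.Module):
    """One decoding step: predicts y_t from y_{t-1} and the recurrent state s_t."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, y_prev, state):
        # y_prev: (batch, 1) previous target token id
        # state: (h, c) tuple, initialized with the encoder output, then carried forward
        embedded = self.embedding(y_prev)            # (batch, 1, embed_dim)
        output, state = self.lstm(embedded, state)   # s_t
        logits = self.out(output)                    # unnormalized p(y_t | y_{<t}, c)
        return logits, state
```

At inference time this step would be run in a loop, feeding back the argmax token (greedy decoding) or keeping several hypotheses alive (beam search, as in the paper).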