Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov
It builds on DRAW (Deep Recurrent Attentive Writer) by DeepMind, which combines an attention mechanism with a sequential VAE. The model fits the sequence-to-sequence framework: the caption is a sequence of words, and the image is a sequence of patches drawn on a canvas.
The input caption \(y=(y\_1,..,y\_N)\) is transformed with a bidirectional RNN into \(m\)-dimensional vector representations \((h\_1,..,h\_N)\).
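The caption encoding can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: all sizes, weights, and the plain tanh recurrence are placeholder assumptions, and the key point is only that concatenating the forward and backward passes gives each word an \(m = 2 \times\) hidden-size vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- not taken from the paper.
vocab_size, embed_dim, hidden_dim, N = 50, 8, 4, 6

E = rng.normal(size=(vocab_size, embed_dim))    # word embeddings
W_x = rng.normal(size=(embed_dim, hidden_dim))  # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim)) # hidden-to-hidden weights

def rnn_pass(x_seq):
    """Plain tanh RNN over a list of embeddings; returns all hidden states."""
    h = np.zeros(hidden_dim)
    states = []
    for x in x_seq:
        h = np.tanh(x @ W_x + h @ W_h)
        states.append(h)
    return states

caption = rng.integers(0, vocab_size, size=N)  # y = (y_1, ..., y_N)
x_seq = [E[t] for t in caption]

fwd = rnn_pass(x_seq)              # left-to-right pass
bwd = rnn_pass(x_seq[::-1])[::-1]  # right-to-left pass, re-aligned

# Concatenate both directions: each h_i has dimension m = 2 * hidden_dim.
h_lang = np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
print(h_lang.shape)  # (6, 8): one m-dimensional vector per word
```

In practice the recurrent cell would be an LSTM or GRU with learned weights; the shape bookkeeping is the same.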
This model extends DRAW by conditioning on the caption representation \(h\) at each step: