Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov
It builds on DRAW (Deep Recurrent Attentive Writer) by DeepMind, which combines an attention mechanism with a sequential VAE. The model fits the sequence-to-sequence framework: the caption is a sequence of words, and the image is a sequence of patches drawn on a canvas.
The input caption \(y=(y\_1,..,y\_N)\) is transformed with a bidirectional RNN into \(m\)-dimensional vector representations \((h\_1,..,h\_N)\).
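The caption encoding can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: all sizes, weights, and the plain tanh recurrence are placeholder assumptions, and the key point is only that concatenating the forward and backward passes gives each word an \(m = 2 \times\) hidden-size vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- not taken from the paper.
vocab_size, embed_dim, hidden_dim, N = 50, 8, 4, 6

E = rng.normal(size=(vocab_size, embed_dim))    # word embeddings
W_x = rng.normal(size=(embed_dim, hidden_dim))  # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim)) # hidden-to-hidden weights

def rnn_pass(x_seq):
    """Plain tanh RNN over a list of embeddings; returns all hidden states."""
    h = np.zeros(hidden_dim)
    states = []
    for x in x_seq:
        h = np.tanh(x @ W_x + h @ W_h)
        states.append(h)
    return states

caption = rng.integers(0, vocab_size, size=N)  # y = (y_1, ..., y_N)
x_seq = [E[t] for t in caption]

fwd = rnn_pass(x_seq)              # left-to-right pass
bwd = rnn_pass(x_seq[::-1])[::-1]  # right-to-left pass, re-aligned

# Concatenate both directions: each h_i has dimension m = 2 * hidden_dim.
h_lang = np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
print(h_lang.shape)  # (6, 8): one m-dimensional vector per word
```

In practice the recurrent cell would be an LSTM or GRU with learned weights; the shape bookkeeping is the same.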
This model extends DRAW by conditioning on the caption representation \(h\) at each step: