Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell

CVPR 2015 | arxiv | code* |

Donahue 2015 (Saenko & Berkeley team)

An end-to-end trainable recurrent convolutional architecture. (figure: LRCN architecture overview)

The optimized objective is the usual log-likelihood of the word sequence, written as a product of conditional probabilities. The LRCN model passes a visual input \(v_t\) through a CNN to produce a fixed-length vector \(\phi_V(v_t) = \phi_t\), parameterized by \(V\).
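Concretely, training maximizes the log-likelihood of the caption given the visual feature, using the standard sequence factorization (notation follows the lines above):

\[
\log p(w_{1:T} \mid \phi) = \sum_{t=1}^{T} \log p(w_t \mid w_{1:t-1}, \phi)
\]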

==fc6 from AlexNet performs better than fc7 as a visual feature==
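A minimal sketch of extracting an fc6-style feature with torchvision's AlexNet (the paper used Caffe; the layer indices below assume torchvision's classifier layout and are not from the paper):

```python
import torch
import torchvision.models as models

# Pretrained AlexNet; classifier = [Dropout, fc6, ReLU, Dropout, fc7, ReLU, fc8]
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()

def fc6_features(images: torch.Tensor) -> torch.Tensor:
    """Return 4096-d fc6 activations for a batch of 224x224 RGB images."""
    with torch.no_grad():
        x = alexnet.features(images)       # conv layers
        x = alexnet.avgpool(x)
        x = torch.flatten(x, 1)            # (N, 9216)
        x = alexnet.classifier[:3](x)      # Dropout -> fc6 -> ReLU
    return x                               # (N, 4096)

# Usage: feats = fc6_features(torch.randn(2, 3, 224, 224))
```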

The LSTM unit compared to the conventional RNN unit is as follows: (figure: RNN unit vs. LSTM unit)

For image description the visual input is constant (\(\phi\)). The LSTM input at timestep \(t\) is \(x_t = \mathrm{concat}(\phi, w_{t-1})\), where \(w_{1:T}\) denotes the word embeddings of the caption.
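A minimal sketch of this decoding step in PyTorch (the vocabulary, embedding, and hidden sizes are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, feat_dim, hidden_dim = 10000, 256, 4096, 1000  # assumed sizes

embed = nn.Embedding(vocab_size, embed_dim)            # word embedding for w_{t-1}
lstm = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)   # input is concat(phi, w_{t-1})
out = nn.Linear(hidden_dim, vocab_size)                # scores p(w_t | w_{1:t-1}, phi)

def decode_step(phi, prev_word, state):
    """One LSTM step: the same image feature phi is fed at every timestep."""
    w_prev = embed(prev_word)                # (N, embed_dim)
    x_t = torch.cat([phi, w_prev], dim=1)    # (N, feat_dim + embed_dim)
    h, c = lstm(x_t, state)
    logits = out(h)                          # (N, vocab_size)
    return logits, (h, c)

# Usage with a batch of 2 images:
phi = torch.randn(2, feat_dim)                       # constant visual feature
state = (torch.zeros(2, hidden_dim), torch.zeros(2, hidden_dim))
prev_word = torch.zeros(2, dtype=torch.long)         # e.g. <BOS> token index
logits, state = decode_step(phi, prev_word, state)
```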

DRAWBACK: Feeding the image to the LSTM at every time step encourages the model to overfit to the visual inputs of the training data.