Xihui Liu, Hongsheng Li, Jing Shao, Dapeng Chen, Xiaogang Wang
ECCV 2018 submission | arxiv | code* | code |
Captioning models replicate frequent phrases and follow generic templates occurring in the training corpus.
Improve discriminativeness and fidelity by adding a self-retrieval module.
Discriminativeness: how well the caption distinguishes its associated image from the rest.
The captioning model generates a caption while the self-retrieval module conducts text-to-image retrieval, i.e. retrieves the source image from the generated caption.
Generating each word of the caption involves non-differentiable operations (argmax or sampling). The paper adopts reinforcement learning with the negative retrieval loss as reward.
Unlabeled images can easily be integrated to train the self-retrieval module. This allows mining hard negative examples from unlabeled data and boosts discriminativeness.
Text-to-image matching is performed only in the mini-batch.
Given an image encoder \(E_i\), a GRU caption encoder \(E_c\), an LSTM decoder \(D_c\) and a batch of image-caption pairs \((I_i, C_i)_i\), the self-retrieval module scores each caption against every image in the mini-batch and yields a retrieval loss \(L_{ret}\).
The retrieval reward is simply the negative retrieval loss: \(r_{ret} = - L_{ret}\).
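The notes do not reproduce the exact form of \(L_{ret}\); below is a minimal PyTorch sketch assuming a VSE-style hinge ranking loss with cosine similarity over in-batch negatives (the margin value and function names are mine, not the paper's):

```python
import torch
import torch.nn.functional as F

def retrieval_loss_per_caption(cap_emb, img_emb, margin=0.2):
    """In-batch text-to-image ranking loss, one term per caption (assumed form).

    cap_emb: (n, d) caption embeddings from the GRU encoder E_c
    img_emb: (n, d) image embeddings from the image encoder E_i
    """
    cap_emb = F.normalize(cap_emb, dim=1)
    img_emb = F.normalize(img_emb, dim=1)
    sim = cap_emb @ img_emb.t()                    # (n, n), sim[i, j] = s(C_i, I_j)
    pos = sim.diag().unsqueeze(1)                  # s(C_i, I_i), the matching image
    cost = (margin + sim - pos).clamp(min=0)       # hinge over every in-batch negative
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost = cost.masked_fill(mask, 0.0)             # drop the positive-pair term
    return cost.sum(dim=1)                         # (n,) one loss per caption

# Retrieval reward for each generated caption: r_ret = -L_ret
# rewards = -retrieval_loss_per_caption(cap_emb, img_emb)
```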
\[ L_{RL}(\theta) = - \mathbb{E}_{C^s \sim p_\theta}[r(C^s)] \] With REINFORCE, the gradient is estimated from a single Monte Carlo sample from \(p_\theta\): \[ \nabla_\theta L_{RL}(\theta) \approx -(r(C^s) - b) \nabla_\theta \log p_\theta(C^s) \] where the baseline \(b\) is introduced to reduce the variance (Sutton & Barto, 1998). It is taken to be the reward of the caption generated by greedy decoding (pick the argmax at every step \(t\)).
The reward in this case is a weighted sum of CIDEr and the retrieval reward: \[ r(C_i^s) = r_{cider}(C_i^s) + \alpha \cdot r_{ret}(C_i^s, \{I_1, ..., I_n\}) \]
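A sketch of this self-critical update under the definitions above; the decoder interface (`sample`, `greedy`) and `reward_fn` are hypothetical names, not the paper's API:

```python
import torch

def rl_step(decoder, features, reward_fn):
    """One policy-gradient update with a greedy-decoding baseline (sketch).

    Assumed interface:
      decoder.sample(features) -> (sampled captions C^s, log p_theta(C^s) per caption)
      decoder.greedy(features) -> captions from argmax decoding (baseline, no gradient)
      reward_fn(captions)      -> per-caption reward r(C) = CIDEr + alpha * r_ret
    """
    sampled, log_probs = decoder.sample(features)        # C^s ~ p_theta
    with torch.no_grad():
        baseline = reward_fn(decoder.greedy(features))   # b: reward of the greedy caption
        reward = reward_fn(sampled)                      # r(C^s)
    # Monte Carlo estimate of grad L_RL = -(r(C^s) - b) * grad log p_theta(C^s)
    loss = -((reward - baseline) * log_probs).mean()
    loss.backward()
    return loss.item()
```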
Since the ground truth captions of the images are not required to compute the retrieval reward, we can easily add unlabeled images to the mix.
Unlabeled images are added to the minibatch, for which the CIDEr reward term is dropped.
\[ r(C_i^u) = r_{ret}(C_i^u, \{I_1, ..., I_n\} \cup \{I_1^u, ..., I_m^u\}) \]
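A small sketch of how the per-caption reward could be assembled for such a mixed batch (the `alpha` weight, the boolean mask, and the convention of zero CIDEr entries for unlabeled images are assumptions):

```python
import torch

def mixed_rewards(cider, ret, is_labeled, alpha=1.0):
    """Per-caption reward when the mini-batch mixes labeled and unlabeled images.

    cider:      (n,) CIDEr of each sampled caption vs. its ground truth
                (entries for unlabeled images are ignored, e.g. zeros)
    ret:        (n,) retrieval reward computed against all images in the batch,
                labeled and unlabeled alike
    is_labeled: (n,) bool mask, True where a ground-truth caption exists
    """
    labeled_r = cider + alpha * ret   # r(C_i^s) = CIDEr + alpha * r_ret
    unlabeled_r = ret                 # r(C_i^u) = r_ret only
    return torch.where(is_labeled, labeled_r, unlabeled_r)
```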
Simply ranking the unlabeled images for a given caption and taking the top results as negatives is not optimal: image-caption pairs do not follow a one-to-one mapping (captions are often not discriminative), so the top-ranked "negative" images may perfectly match the query.
Solution: use moderately hard negatives. Instead of selecting the top-ranked images, sample negatives from a rank range \([h_{min}, h_{max}]\).
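A sketch of this sampling step, assuming the negatives are ranked by the same caption-image similarity used for retrieval (`h_min`, `h_max` and `k` are illustrative values, not the paper's):

```python
import torch

def moderately_hard_negatives(sim_to_unlabeled, h_min=20, h_max=100, k=5):
    """Sample negatives from the middle of the ranking rather than the very top.

    sim_to_unlabeled: (m,) similarity of one caption to every unlabeled image.
    Returns indices of k unlabeled images drawn uniformly from rank positions
    [h_min, h_max); the top-ranked images are skipped because they may in
    fact match the caption.
    """
    ranked = torch.argsort(sim_to_unlabeled, descending=True)
    candidates = ranked[h_min:h_max]
    pick = torch.randperm(candidates.numel())[:k]
    return candidates[pick]
```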
The captioning and self-retrieval modules share the same CNN.
- 0) Image encoder pre-trained on ImageNet.
- 1a) Train the retrieval module on labeled data:
  - First train the GRU and projection layers (30 epochs).
  - Then train the whole module, CNN included (15 epochs).
- 1b) Pre-train the captioning module with MLE (the self-retrieval module is fixed, not even used) and scheduled sampling.
- 2) Train the captioning module with the retrieval module (including the CNN) fixed.
Uniqueness: percentage of unique captions among all generated captions. Novelty: percentage of generated captions never seen in the training set.
* evaluated on Karpathy's test split.
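Both metrics reduce to simple set operations; a tiny sketch (argument names assumed):

```python
def uniqueness(generated):
    """Percentage of distinct captions among all generated captions."""
    return 100.0 * len(set(generated)) / len(generated)

def novelty(generated, train_captions):
    """Percentage of generated captions that never appear in the training set."""
    train = set(train_captions)
    return 100.0 * sum(c not in train for c in generated) / len(generated)
```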
Check warm restarts (SGDR: Stochastic Gradient Descent with Warm Restarts, Loshchilov & Hutter, 2016).