Xihui Liu, Hongsheng Li, Jing Shao, Dapeng Chen, Xiaogang Wang
ECCV 2018 submission | arxiv | code* | code |
Captioning models replicate frequent phrases and follow generic templates occurring in the training corpus.
Improve discriminativeness and fidelity by adding a self-retrieval module.
Discriminativeness: how well the caption distinguishes its associated image from the rest.
The captioning model generates a caption while the self-retrieval module conducts text-to-image retrieval, i.e. retrieves the source image from the generated caption.
Generating each word of the caption involves non-differentiable operations (argmax or sampling). The paper adopts reinforcement learning with the negative retrieval loss as reward.
Unlabeled images can easily be integrated to train the self-retrieval module. This allows mining hard negative examples from unlabeled data and boosts discriminativeness.
Text-to-image matching is performed only in the mini-batch.
Given an image encoder \(E_i\), a GRU caption encoder \(E_c\), an LSTM decoder \(D_c\) and a batch of image-caption pairs \((I_i, C_i)_i\), the self-retrieval module scores each caption against every image in the mini-batch and yields a retrieval loss \(L_{ret}\).
The retrieval reward is simply the negative retrieval loss: \(r_{ret} = - L_{ret}\).
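The notes do not reproduce the exact form of \(L_{ret}\); below is a minimal PyTorch sketch assuming a VSE-style hinge ranking loss with cosine similarity over in-batch negatives (the margin value and function names are mine, not the paper's):

```python
import torch
import torch.nn.functional as F

def retrieval_loss_per_caption(cap_emb, img_emb, margin=0.2):
    """In-batch text-to-image ranking loss, one term per caption (assumed form).

    cap_emb: (n, d) caption embeddings from the GRU encoder E_c
    img_emb: (n, d) image embeddings from the image encoder E_i
    """
    cap_emb = F.normalize(cap_emb, dim=1)
    img_emb = F.normalize(img_emb, dim=1)
    sim = cap_emb @ img_emb.t()                    # (n, n), sim[i, j] = s(C_i, I_j)
    pos = sim.diag().unsqueeze(1)                  # s(C_i, I_i), the matching image
    cost = (margin + sim - pos).clamp(min=0)       # hinge over every in-batch negative
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost = cost.masked_fill(mask, 0.0)             # drop the positive-pair term
    return cost.sum(dim=1)                         # (n,) one loss per caption

# Retrieval reward for each generated caption: r_ret = -L_ret
# rewards = -retrieval_loss_per_caption(cap_emb, img_emb)
```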
\[ L_{RL}(\theta) = - \mathbb{E}_{C^s \sim p_\theta}[r(C^s)] \] With REINFORCE, the gradient is estimated from a single Monte Carlo sample from \(p_\theta\): \[ \nabla_\theta L_{RL}(\theta) \approx -(r(C^s) - b) \nabla_\theta \log p_\theta(C^s) \] where the baseline \(b\) is introduced to reduce the variance (Sutton & Barto, 1998). It is taken to be the reward of the caption generated by greedy decoding (pick the argmax at every step \(t\)).
The reward in this case is a weighted sum of CIDEr and the retrieval reward: \[ r(C_i^s) = r_{cider}(C_i^s) + \alpha \cdot r_{ret}(C_i^s, \{I_1, ..., I_n\}) \]
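A sketch of this self-critical update under the definitions above; the decoder interface (`sample`, `greedy`) and `reward_fn` are hypothetical names, not the paper's API:

```python
import torch

def rl_step(decoder, features, reward_fn):
    """One policy-gradient update with a greedy-decoding baseline (sketch).

    Assumed interface:
      decoder.sample(features) -> (sampled captions C^s, log p_theta(C^s) per caption)
      decoder.greedy(features) -> captions from argmax decoding (baseline, no gradient)
      reward_fn(captions)      -> per-caption reward r(C) = CIDEr + alpha * r_ret
    """
    sampled, log_probs = decoder.sample(features)        # C^s ~ p_theta
    with torch.no_grad():
        baseline = reward_fn(decoder.greedy(features))   # b: reward of the greedy caption
        reward = reward_fn(sampled)                      # r(C^s)
    # Monte Carlo estimate of grad L_RL = -(r(C^s) - b) * grad log p_theta(C^s)
    loss = -((reward - baseline) * log_probs).mean()
    loss.backward()
    return loss.item()
```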
Since the ground truth captions of the images are not required to compute the retrieval reward, we can easily add unlabeled images to the mix.
Unlabeled images are added to the minibatch, for which the CIDEr reward term is dropped.
\[ r(C_i^u) = r_{ret}(C_i^u, \{I_1, ..., I_n\} \cup \{I_1^u, ..., I_m^u\}) \]
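A small sketch of how the per-caption reward could be assembled for such a mixed batch (the `alpha` weight, the boolean mask, and the convention of zero CIDEr entries for unlabeled images are assumptions):

```python
import torch

def mixed_rewards(cider, ret, is_labeled, alpha=1.0):
    """Per-caption reward when the mini-batch mixes labeled and unlabeled images.

    cider:      (n,) CIDEr of each sampled caption vs. its ground truth
                (entries for unlabeled images are ignored, e.g. zeros)
    ret:        (n,) retrieval reward computed against all images in the batch,
                labeled and unlabeled alike
    is_labeled: (n,) bool mask, True where a ground-truth caption exists
    """
    labeled_r = cider + alpha * ret   # r(C_i^s) = CIDEr + alpha * r_ret
    unlabeled_r = ret                 # r(C_i^u) = r_ret only
    return torch.where(is_labeled, labeled_r, unlabeled_r)
```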
Simply ranking the unlabeled images for a given caption and taking the top results as negatives is not optimal: image-caption pairs do not follow a one-to-one mapping (captions are often not discriminative), so the top-ranked "negative" images may perfectly match the query.
Solution: use moderately hard negatives. Instead of selecting the top-ranked images, sample negatives from a rank range \([h_{min}, h_{max}]\).
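A sketch of this sampling step, assuming the negatives are ranked by the same caption-image similarity used for retrieval (`h_min`, `h_max` and `k` are illustrative values, not the paper's):

```python
import torch

def moderately_hard_negatives(sim_to_unlabeled, h_min=20, h_max=100, k=5):
    """Sample negatives from the middle of the ranking rather than the very top.

    sim_to_unlabeled: (m,) similarity of one caption to every unlabeled image.
    Returns indices of k unlabeled images drawn uniformly from rank positions
    [h_min, h_max); the top-ranked images are skipped because they may in
    fact match the caption.
    """
    ranked = torch.argsort(sim_to_unlabeled, descending=True)
    candidates = ranked[h_min:h_max]
    pick = torch.randperm(candidates.numel())[:k]
    return candidates[pick]
```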
The captioning and self-retrieval modules share the same CNN.
- 0) Image encoder pre-trained on ImageNet.
- 1a) Train the retrieval module on labeled data:
  - First train the GRU and projection layers (30 epochs).
  - Then train the whole module, CNN included (15 epochs).
- 1b) Pre-train the captioning module with MLE (the self-retrieval module is fixed, not even used) and scheduled sampling.
- 2) Train the captioning module with the retrieval module (including the CNN) fixed.
Uniqueness: percentage of unique captions among all generated captions. Novelty: percentage of generated captions never seen in the training set.
* evaluated on Karpathy's test split.
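Both metrics reduce to simple set operations; a tiny sketch (argument names assumed):

```python
def uniqueness(generated):
    """Percentage of distinct captions among all generated captions."""
    return 100.0 * len(set(generated)) / len(generated)

def novelty(generated, train_captions):
    """Percentage of generated captions that never appear in the training set."""
    train = set(train_captions)
    return 100.0 * sum(c not in train for c in generated) / len(generated)
```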
Check warm restarts (SGDR: Stochastic Gradient Descent with Warm Restarts, Loshchilov & Hutter, 2016).