Introduction

Benefiting from significant advances in large-scale labeled datasets such as ImageNet [7] and in deep learning, especially deep convolutional neural networks (CNN) [24, 33], the problems of image classification [73] and object recognition [56, 72] have been studied thoroughly. As a result, computers can even outperform humans at these tasks [60]. Recently, automatically generating a sentence description for an image has attracted much research attention in artificial intelligence. This problem, known as image captioning, plays an important role in computer vision, i.e., enabling computers to understand images, and can be exploited in a wide range of applications, such as video tracking [28,29,30,31,32], cross-view retrieval [10, 39], sentiment analysis [36, 52], childhood education [54], and visual impairment rehabilitation [11]. However, image captioning is a challenging task because it spans both computer vision and natural language processing. Apart from identifying the objects contained in an image [37], the generator should also be able to analyze their states, understand the relationships among them, and express the semantic information in natural language [59].

Early efforts on image captioning mainly adopted template-based methods, which first require recognizing various elements, such as objects together with their attributes and relationships. These elements are then organized into sentences based on either templates [15, 25, 35, 62] or pre-defined language models [13, 26, 27, 46], which normally produce rigid and limited descriptions. As a typical transfer-based method, nearest neighbor (NN) retrieval is employed to select a description from the corpus for a given image [9]. Although this method cannot generate any novel sentence, it suggests that NNs can indeed provide valuable information.

Inspired by recent advances in machine translation [6, 41, 55, 57], neural network-based methods have been widely applied to image captioning [12, 14, 23, 43, 59] and have achieved great success. These methods are primarily based on the encoder-decoder pipeline, which consists of two basic steps. First, visual features are extracted using a CNN to encode the image into a fixed-length embedding vector. Second, a recurrent neural network (RNN), especially long short-term memory (LSTM) [18], is adopted as the decoder to generate the sentence description by maximizing the likelihood of a sentence given the visual features. Thanks to the feature representation capability of CNN and the temporal modeling of RNN, neural network-based methods are more flexible and can generate new sentences coherently. On the other hand, motivated by attention mechanisms [58], which have been proven effective in visual scene analysis [2, 47], various attention mechanisms have been proposed for image captioning, such as region-based attention [21], visual attention [61], semantic attention [65], global-local attention [34], and spatial and channel-wise attention [3].

Despite achieving the state-of-the-art performance, existing CNN-RNN framework-based methods suffer from two main problems, as illustrated in Fig. 1:

  • Information inadequateness problem. These methods treat the different words of a caption equally, which makes it difficult to distinguish the important parts of the caption.

  • Misrecognition problem. The main subjects or scenes might be misrecognized by the traditional methods.

Fig. 1 Motivation of the proposed model. a Traditional methods treat all the words (on the right of each image) equally, without considering their relative importance; to the right of the words are the weights assigned based on the overall occurrences. b The main subjects or scenes are misrecognized by the traditional methods (red), which can be corrected with the help of the consensus score between the neighboring references and the target image (green)

Obviously, the words in an image description are not equally important. Take the first image of Fig. 1a as an example: the words “surfboards” and “wave” should be the most important, as they constitute the main content of the image; “men” is the subject and “riding” is the status of the subject, which are less important; “two,” “on,” “a,” and “small” are relatively uninformative. Furthermore, once the subjects or scenes are misrecognized, the generation error accumulates and cannot be easily corrected, since the subsequent words in the caption may be disturbed by the irrelevant textual context. Motivated by these observations, we propose to use the visual features and the labeled captions of the training images as references to address the above-mentioned problems. The references are incorporated into both the training phase and the generation phase of the LSTM framework, which constitutes the novel R-LSTM model. In the training phase, we endow the words in a caption with different weights in terms of their relevance to the target image, their part of speech, and their corresponding synonyms. A word with a higher relevance score is more important for describing the image, and thus receives a larger weight when calculating the loss. In this way, the model can learn more in-depth information from the caption, such as what the principal objects are, which attributes are important to them, and how they relate to each other. In the generation phase, the NNs of the input image are employed as references by jointly combining the consensus score [8] with the likelihood of the generated sentence. The information provided by the NNs helps reduce misrecognition from the beginning and better matches the habit of human cognition.

We evaluate the proposed R-LSTM model on the MS COCO and Flickr30k datasets. The comparative results demonstrate the significant superiority of R-LSTM over the state-of-the-art approaches. We also report the performance of our method on the MS COCO Image Captioning Challenge, where we obtain performances comparable to all the latest approaches on all 14 metrics.

The main contributions of this paper are threefold:

  1. We propose to use the training images as references and design a novel model, named Reference-based Long Short-Term Memory (R-LSTM), for image captioning.

  2. In the training phase, we assign unequal weights to different words according to their overall occurrences, part of speech, and corresponding synonyms. Training biased by these weights can better learn the in-depth information of the captions, which helps to address the information inadequateness problem.

  3. In the caption generation phase, we define a novel objective function that combines the consensus score with the traditional log likelihood to exploit the reference information from the neighbor images of the target image. Reference-based generation helps to address the misrecognition problem and makes the descriptions sound more natural.

A preliminary conference version of R-LSTM for image captioning was introduced in our previous work [4]. Compared with the conference version, the enhancements in this paper lie in three aspects. First, we perform a more comprehensive review of related work. Second, we extend the weighted training by combining the overall occurrences, part of speech, and corresponding synonyms. Third, we conduct more comparative experiments and enrich the analysis of the results.

The rest of this paper is structured as follows. Section “Related Work” briefly reviews related work on image captioning. Section “System Overview” gives an overview of the proposed model. Detailed algorithms, including weighted training and reference-based generation are described in “Weighted Training” and “Generation Using Reference,” respectively. Experimental results and analysis are presented in “Experiments,” followed by a conclusion and the summary of future works in “Conclusion.”

Related Work

Generally, the existing image captioning algorithms can be divided into three categories based on the way of sentence generation [20]: template-based methods, transfer-based methods, and neural network-based methods.

The template-based methods either use templates or design a language model to fill in the slots of a template, based on co-occurrence relations gained from the corpus [15], conditional random fields [25], or web-scale n-gram data [35]. More complicated models have also been used to generate relatively flexible sentences. Mitchell et al. [46] exploited syntactic trees to create a data-driven model. Visual dependency representations have been proposed to extract relationships among the objects [13]. The template-based methods are simple and intuitive, but they are heavily hand-designed and unexpressive, and thus not flexible enough to generate meaningful sentences.

The transfer-based methods are built on retrieval approaches, which directly transfer the descriptions of the retrieved images to the query image. Some approaches [16, 19] took the input image as a query and selected a description in a joint image-sentence embedding space. Kuznetsova et al. [26, 27] retrieved images similar to the input image, extracted segments from their captions, and organized these segments into a sentence. Devlin et al. [9] simply found similar images and calculated the consensus score [8] of the corresponding captions to select the one with the highest score. The sentences generated by the transfer-based methods usually have correct grammar. However, these methods may misrecognize the visual content and cannot generate novel phrases or sentences, and thus are limited in image captioning. Nevertheless, they indicate that we can take advantage of the images similar to the input image. This idea can be applied in other approaches, such as re-ranking candidate descriptions generated by other models [44] and emotion distribution prediction [70, 71]. We also adopt this idea in our generation process.

The neural network-based methods stem from recent advances in machine translation [6, 55, 57] that use RNNs. Mao et al. [43] proposed a multimodal layer to connect a deep CNN for images with a deep RNN for sentences, allowing the model to generate the next word given the input word and the image. Inspired by the encoder-decoder model [6] in machine translation, Vinyals et al. [59] used a deep CNN to encode the image instead of an RNN for sentences, and then used LSTM [18], a more powerful RNN, to decode the image vector into a sentence. Many works follow this idea and apply attention mechanisms in the encoder. Xu et al. [61] extracted features from a convolutional layer rather than the fully connected layer. With each feature representing a fixed-size region of the image, the model can learn to change the focused locations. Jin et al. [21] employed a pre-trained CNN for object detection to analyze the hierarchically segmented image, and then ran an attention-based decoder on these visual elements. Combining the whole-image feature with the words obtained from the image by attribute detectors can also drive the attention model [65]. Li et al. [34] proposed a global-local attention mechanism by integrating local representations at the object level with the global representation at the image level.

More recently, reinforcement learning has been integrated into the encoder-decoder framework. To address exposure bias and the use of a loss that does not operate at the sequence level in traditional encoder-decoder frameworks, Ranzato et al. [51] proposed a novel sequence-level training algorithm, named Mixed Incremental Cross-Entropy Reinforce (MIXER), that directly optimizes the metric used at test time. Liu et al. [40] proposed a training procedure for image captioning models based on policy gradient methods, which directly optimizes the metrics of interest rather than just maximizing the likelihood of human-generated captions. Self-critical sequence training (SCST) [53] utilizes the output of the model's own test-time inference algorithm to normalize the rewards it experiences. Notably, directly optimizing the CIDEr metric with SCST and greedy decoding at test time is highly effective.

Please note that there are other captioning tasks that are related to our research, such as dense captioning [22] and video captioning [1, 48].

Similarly, our work follows the encoder-decoder model. Different from [59], however, the words in a caption are weighted in the training phase according to their relevance to the corresponding image, which balances the model with respect to the importance of each word in the caption. In the generation phase, we take advantage of the consensus score [8] to improve the quality of the sentences. Different from [44], which simply used the consensus score to re-rank the final candidate descriptions, we use this score throughout the generation process, which means that the decoder takes the neighbors' information of the input image into account. Combined with the likelihood of a sentence, this yields a better evaluation function than maximizing the likelihood alone.

System Overview

Our goal is to generate a description sentence for an image. Suppose we have N training images I1,I2,⋯ ,IN, where In also denotes the corresponding visual features. For image In (n = 1,2,⋯ ,N), we have Mn correct description sentences \(S_{n1},S_{n2},\cdots ,S_{nM_{n}}\). Our task is to maximize the likelihood of the correct descriptions given the training images:

$$ \theta^{*}=\arg\max_{\theta}\sum\limits_{n = 1}^{N}\sum\limits_{m = 1}^{M_{n}}\mathcal{L}(S_{nm}|I_{n};\theta), $$
(1)

where 𝜃 denotes the parameters of our model and \(\mathcal {L}()\) is a pre-defined likelihood function. In the next section, we first describe the conventional likelihood function for image captioning used in previous works [42, 64] (see Eq. 4), and then introduce the proposed likelihood objective function (see Eq. 5).

After training, we can generate a sentence for a test image J by the following:

$$ R=\arg\max_{R'}\mathcal{O}(R'|J;\theta^{*}), $$
(2)

where \(\mathcal {O}()\) is a pre-defined objective function, which aims to generate the best sentence for the given image J. Previous works usually employ the log likelihood function as \(\mathcal {O}()\). However, the conventional log likelihood function simply selects the sentence with the highest probability learned by the model and may cause the misrecognition issue. In this paper, we introduce the supervision of the reference sentences and reformulate this objective function, aiming to resolve the misrecognition issue in the generation stage. The details are provided in “Generation Using Reference.”

The proposed image captioning method is illustrated in Fig. 2 and consists of two stages: weighted training and reference-based generation. In both stages, the deep ResNet-101 model [17] is employed as the encoder to extract CNN features of the target image and the training images. During the weighted training stage, the weight attached to each word in the training captions is calculated first. Then, the LSTM model is trained using the weighted words and the CNN features of the training images under the proposed weighted likelihood objective. In the reference-based generation stage, the trained LSTM serves as the decoder, which takes the CNN features of the target image as input and generates the description words one by one. Moreover, in the generation stage, we jointly consider the likelihood and the consensus score as the evaluation function in beam search. Details are given in “Weighted Training” and “Generation Using Reference.”

Fig. 2 Overview of the proposed R-LSTM model, including two parts: weighted training (in the blue rectangle) and reference-based generation (in the red rectangle). Each part is an encoder-decoder model, using a CNN to encode the image and an LSTM to decode the sentence. The functions w() and h() indicate that the reference information is used to weight the input words during training and to improve the output sentences during generation, respectively; l() is the log likelihood

Weighted Training

For simplicity, we use I and S to replace In and Snm in Eq. 1, respectively. Suppose S = {s0,s1,s2,...,sT,sT+ 1}, where {s1,s2,...,sT} are the original labeled words, s0 is a special start word, sT+ 1 is a special stop word, and T is the length of this particular sentence, which depends on I. At time t, the likelihood of word st is determined by the input image I and the previous words s0,s1,...,st− 1:

$$ p(s_{t}| I, s_{0}, s_{1}, ... , s_{t-1}). $$
(3)

The joint log likelihood of description S, namely the objective likelihood function \(\mathcal {L}()\) in Eq. 1 of the NIC model [59], is calculated by the chain rule:

$$ \mathcal{L}(S|I)=\log p(S|I) = \sum\limits_{t = 1}^{T + 1} \log p(s_{t}| I, s_{0}, s_{1}, ... , s_{t-1}), $$
(4)

where the dependency on 𝜃 is dropped for convenience.

As stated in “Introduction” and illustrated in Fig. 1, different words are not equally meaningful or important. It is reasonable that the subject and its status express more information than articles and prepositions. Unlike the NIC model [59], we take the words' importance into consideration by assigning different weights to them, which enables the model to concentrate on the main information of the captions. Accordingly, we assign higher weights to the words that correspond to important elements, such as the main subject, its status, and the environment. Suppose the weight of word st with respect to image I is w(st,I); then our model is trained to maximize the weighted log likelihood:

$$ \mathcal{L}(S,I) = \sum\limits_{t = 1}^{T + 1} w(s_{t},I)\log p(s_{t}| I, s_{0}, s_{1}, ... , s_{t-1}). $$
(5)

Note that in the training phase, the words s0,s1,...,st are given by the labeled caption, so their weights can be calculated in a preprocessing step.
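To make the weighted objective concrete, the following is a minimal PyTorch sketch of Eq. 5 for a single caption, assuming the per-word weights have already been precomputed as described above; the variable names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def weighted_caption_loss(logits, targets, word_weights):
    """Negative weighted log likelihood corresponding to Eq. (5).

    logits       : (T+1, V) unnormalized decoder scores for words s_1 .. s_{T+1}
    targets      : (T+1,)   indices of the ground-truth words
    word_weights : (T+1,)   precomputed weights w(s_t, I)
    """
    # -log p(s_t | I, s_0, ..., s_{t-1}) for every step
    nll = F.cross_entropy(logits, targets, reduction="none")
    # weight each step by the relevance of its word to the image
    return (word_weights * nll).sum()
```

Minimizing this quantity over all training captions is equivalent to maximizing the weighted log likelihood of Eq. 5.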

There are different ways to calculate the weights of different words. We propose three schemes based on the overall occurrences, part of speech, and corresponding synonyms, respectively. The overall occurrence-based scheme follows the tag ranking approach [38] by calculating the weight of word si to image I as follows:

$$ w_{1}(s_{i}, I) = \frac{\beta p(s_{i}|I)}{p(s_{i})},\ i = 1,2,\cdots,N, $$
(6)

where β is a parameter ensuring that the average of all the weights is 1, and p(si|I) denotes the likelihood of si appearing in the captions of image I. The reason for dividing p(si|I) by p(si) is that a frequent word, such as “a” or “the,” is not informative even though it may appear in most descriptions.

Based on Bayes’ rule, we have:

$$ w_{1}(s_{i}, I) = \frac{\beta P(I|s_{i})P(s_{i})}{P(I)P(s_{i})} = \frac{\beta P(I|s_{i})}{P(I)}. $$
(7)

Since P(I) is determined given image I, we can redefine (7) as follows:

$$ w_{1}(s_{i}, I) \stackrel{.}{=} \beta P(I|s_{i}). $$
(8)

Based on kernel density estimation (KDE) [50],

$$ w_{1}(s_{i}, I) = \beta P(I|s_{i}) = \frac{\beta} {|G_{s_{i}}|} \sum\limits_{I_{j}\in G_{s_{i}}} K_{\sigma}(I - I_{j}), $$
(9)

where \(G_{s_{i}}\) denotes the set of images whose captions contain word si, and the Gaussian kernel function Kσ is defined as follows:

$$ K_{\sigma}(I - I_{j}) = \exp(-\frac{(\left\| I - I_{j} \right\|)^{2}}{\sigma^{2}}), $$
(10)

where the radius parameter σ is set to the average distance between each pair of images in the training set, and the image vectors are extracted from a deep CNN. Therefore, within the set of images whose captions contain the same word, if an image is very similar to the others, it is natural to infer that the word is very relevant to that image, and the word is thus assigned a high weight in that image's captions. Otherwise, if an image does not look like the other images, which means the word is unimportant or even noise for that image, the word is given a low weight. Equation 10 is meaningful in two respects: it measures the importance of different words in the same caption (Fig. 3) and the importance of a word to different images (Fig. 4).
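The occurrence-based weight of Eqs. 9 and 10 can be sketched in NumPy as follows, assuming the CNN feature vectors of all training images have been extracted beforehand; the helper names and the index structure word_to_images are illustrative, and the per-caption normalization by β (so that the weights average to 1) is applied afterwards.

```python
import numpy as np

def gaussian_kernel(feat_i, feat_j, sigma):
    """K_sigma of Eq. (10): similarity between two CNN feature vectors."""
    return np.exp(-np.linalg.norm(feat_i - feat_j) ** 2 / sigma ** 2)

def occurrence_weight(word, image_feat, word_to_images, features, sigma):
    """Unnormalized w_1(s_i, I) of Eq. (9): the mean kernel value over the images
    whose captions contain `word` (the set G_{s_i})."""
    group = word_to_images.get(word, [])
    if not group:
        return 0.0
    return float(np.mean([gaussian_kernel(image_feat, features[j], sigma)
                          for j in group]))
```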

Fig. 3
figure 3

The Kσ values of the target image with some other images, whose captions contain “bicycle” or “a” respectively. It is obvious that the former images have higher Kσ values than the latter ones, suggesting that images labeled with “bicycle” are similar to the target image whose main subject is a bicycle, while the uninformative “a” leads to less similarity

Fig. 4
figure 4

The Kσ values of two target images with some other images whose captions contain “train.” For the first target image, “train” along with “station” denotes the scene of the image, while in the second target image, “train” is the main subject. Therefore, the set of images containing “train” are more similar to the second target image, resulting in higher Kσ values

The part-of-speech-based strategy is conducted on the basis of the overall occurrences. We observe that the nouns and verbs in the captions are relatively more important and convey more information than others. The contributions of prepositions, conjunctions, and qualifiers to the whole caption sentence are relatively small. Therefore, we enlarge the weights of all the nouns and verbs, and reduce the weights of all the prepositions, conjunctions, and qualifiers by the following:

$$ w_{2}(s_{i}, I) = \mu_{s_{i}}*w_{1}(s_{i}, I), $$
(11)

where \(\mu _{s_{i}}\) is the parameter that controls the relative importance, the value of which depends on the different part of speech as follows:

$$ \mu_{s_{i}}= \left\{\begin{array}{ll} \mu_{1},\ \ \text{if}\ \text{pos}(s_{i})=\text{noun},\\ \mu_{2},\ \ \text{if}\ \text{pos}(s_{i})=\text{verb},\\ \mu_{3},\ \ \text{if}\ \text{pos}(s_{i})=\text{qualifier},\\ \mu_{4},\ \ \text{if}\ \text{pos}(s_{i})=\text{preposition/conjunction}, \end{array}\right. $$
(12)

where pos(si) is the part of speech of word si. The values assigned to different parts of speech have a great impact on learning the primary parts of a sentence, so we weight them according to how much information they carry. Generally, nouns and verbs are much more informative about the image than qualifiers, prepositions, or conjunctions, and thus contribute more to the quality of the sentence. Motivated by this observation, we assign larger values to nouns and verbs to direct the model to learn the informative words better, while giving smaller weights to qualifiers, prepositions, and conjunctions. In the experiments, μ1, μ2, μ3, and μ4 are set to 1.1, 1.05, 0.9, and 0.8, respectively.
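A possible implementation of the part-of-speech scaling in Eqs. 11 and 12 is sketched below using NLTK's off-the-shelf tagger; the paper does not specify which tagger is used, and mapping adjectives and adverbs to “qualifiers” is an assumption made here for illustration.

```python
import nltk  # assumes the averaged perceptron tagger model has been downloaded

# mu values from the experiments: nouns 1.1, verbs 1.05, qualifiers 0.9, prep./conj. 0.8;
# treating adjectives and adverbs as qualifiers is an assumption of this sketch.
POS_SCALE = {"NOUN": 1.10, "VERB": 1.05, "ADJ": 0.90, "ADV": 0.90,
             "ADP": 0.80, "CONJ": 0.80}

def pos_scaled_weights(words, w1_weights):
    """Apply Eq. (11)-(12): scale the occurrence-based weights by part of speech."""
    tagged = nltk.pos_tag(words, tagset="universal")   # coarse universal POS tags
    return [POS_SCALE.get(tag, 1.0) * w for (_, tag), w in zip(tagged, w1_weights)]
```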

The third scheme is based on synonyms. We observe that many different words have similar meanings, such as role, character, and function, and jointly modeling them to calculate the weights may allow them to complement each other. We therefore propose a weighting strategy based on word similarity that considers the semantic information. For word si, suppose the synonym set (“synset”) of its k-th meaning, given by WordNet, is ssik; we compute the weight of the synset as follows:

$$ \widehat{w}_{3}(\text{ss}_{ik}, I) = \frac{\beta} {|G_{\text{ss}_{ik}}|} \sum\limits_{I_{j}\in G_{\text{ss}_{ik}}} K_{\sigma}(I - I_{j}), $$
(13)

where \(G_{\text{ss}_{ik}}\) denotes the set of images whose captions contain the synset ssik. Since a word may belong to several different synsets, we use the maximum value as the final weight:

$$ w_{3}(s_{i}, I) = \max_{k \in C_{s_{i}}} \widehat{w}_{3}(\text{ss}_{ik}, I), $$
(14)

where \(C_{s_{i}}\) is the set of different semantic meanings of si.
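The synonym-based weight of Eqs. 13 and 14 could be computed with NLTK's WordNet interface as sketched below, reusing the gaussian_kernel helper from the occurrence-based sketch above; the index structures synset_to_images and features are hypothetical.

```python
import numpy as np
from nltk.corpus import wordnet  # assumes the WordNet corpus has been downloaded

def synonym_weight(word, image_feat, synset_to_images, features, sigma, beta=1.0):
    """w_3(s_i, I) of Eq. (13)-(14): the maximum synset-level weight over all meanings."""
    best = 0.0
    for synset in wordnet.synsets(word):                  # the meanings C_{s_i} of the word
        group = synset_to_images.get(synset.name(), [])   # G_{ss_ik}: images containing the synset
        if not group:
            continue
        score = (beta / len(group)) * sum(
            gaussian_kernel(image_feat, features[j], sigma) for j in group)
        best = max(best, score)
    return best
```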

After obtaining the encoded CNN features of the training images and the weights attached to each word in the training captions, we can train the LSTM model, as shown in Fig. 5, by optimizing Eq. 1 with the likelihood function defined in Eq. 5. The detailed operations of an LSTM unit are as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{ll} &x_{-1}=I,\ \ x_{t}=W_{e}s_{t}\ (t = 0,1,\cdots,T-1),\\ &f_{t}=\sigma(W_{fx}x_{t}+W_{fh}h_{t-1}+b_{f}),\\ &i_{t}=\sigma(W_{ix}x_{t}+W_{ih}h_{t-1}+b_{i}),\\ &o_{t}=\sigma(W_{ox}x_{t}+W_{oh}h_{t-1}+b_{o}),\\ &c_{t}=f_{t}\otimes c_{t-1}\oplus i_{t}\otimes \tanh(W_{cx}x_{t}+W_{ch}h_{t-1}+b_{c}),\\ &h_{t}=o_{t}\otimes c_{t},\\ &p_{t + 1}=\text{Softmax}(h_{t}), \end{array} \end{array} $$
(15)

where ⊗, ⊕, tanh(⋅), and σ(⋅) denote the product with a gate value, the sum operation, the hyperbolic tangent function, and the sigmoid function, respectively. The matrices W⋅x, W⋅h, We and the bias terms b are the parameters learned by our model and shared across all steps.

Fig. 5 A typical LSTM unit, consisting of a forget gate, an input gate, and an output gate
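Equation 15 describes one step of a standard LSTM decoder. The PyTorch sketch below shows such a step with the cell size reported later (512); note that nn.LSTMCell applies tanh to the cell state when computing h_t, a common variant of Eq. 15 as printed, and the output projection before the softmax is an implementation convenience rather than part of the equation.

```python
import torch
import torch.nn as nn

class CaptionDecoderStep(nn.Module):
    """One decoding step of Eq. (15): embed the word, update the LSTM gates,
    and output a distribution over the vocabulary."""
    def __init__(self, vocab_size, embed_size=512, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)  # x_t = W_e s_t
        self.cell = nn.LSTMCell(embed_size, hidden_size)   # f_t, i_t, o_t, c_t updates
        self.out = nn.Linear(hidden_size, vocab_size)      # projection before the softmax

    def forward(self, word_ids, state):
        x_t = self.embed(word_ids)                         # (batch, embed_size)
        h_t, c_t = self.cell(x_t, state)                   # gate computations of Eq. (15)
        log_p = torch.log_softmax(self.out(h_t), dim=-1)   # log p_{t+1}
        return log_p, (h_t, c_t)
```

The image feature I would be fed as the first input x−1, typically through a separate linear projection to the embedding size (not shown here).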

Generation Using Reference

After training, the model can generate a description R = {r0,r1,r2,...,rM,rM+ 1} (r0 and rM+ 1 are the special start and stop words, respectively) for a test image J by optimizing the following objective function:

$$ \mathcal{O}(R,J) = (1-\alpha)l(R,J) + \alpha h(R,J), $$
(16)

where h(R,J) is the consensus score of sentence R, and l(R,J) is the log likelihood:

$$ l(R,J) = \log p(R|J) = \sum\limits_{t = 1}^{M + 1} \log p(r_{t}| J, r_{0}, r_{1}, ... , r_{t-1}), $$
(17)

which is obtained by Eq. 15. When α = 0, the objective function \(\mathcal {O}(R,J)\) reduces to the one used in the NIC model [59].

The idea of the consensus score comes from transfer-based methods, which indicate that the descriptions of similar images are very helpful for image captioning. Some existing transfer-based methods directly use the captions of similar images as the description of the input image. For example, Devlin et al. [9] simply utilized the k-nearest neighbor model: first, retrieve the k nearest neighbors of the input image and collect the set of their captions C = {c1,c2,...,c5k} (five captions per image); second, calculate the n-gram overlap F-score for every pair of captions in C, and define the consensus score of ci as the mean of its top m F-scores; finally, select the caption with the highest consensus score as the description of the input image.

Similar to [9], we calculate the consensus score h(R,J) between image J and the generated sentence R (including incomplete sentences being generated by the decoder) as follows:

$$ h(R,J) = \frac{1}{|C_{J}|}\sum\limits_{c \in C_{J}}{\text{sim}(R, c)}, $$
(18)

where CJ is the caption set of the k-nearest neighbor images of image J, and sim(⋅,⋅) is the function to calculate the similarity between two sentences (we use BLEU-4 [49] in experiments).
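A minimal sketch of the consensus score in Eq. 18, using BLEU-4 from NLTK as the sentence similarity; both the candidate and the neighbor captions are assumed to be tokenized, and smoothing is added here because partial sentences are scored during generation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def consensus_score(candidate_tokens, neighbor_captions):
    """h(R, J) of Eq. (18): mean BLEU-4 similarity between the (possibly partial)
    candidate R and the captions C_J of the k nearest neighbor images of J."""
    if not neighbor_captions:
        return 0.0
    smooth = SmoothingFunction().method1   # avoids zero scores for short prefixes
    scores = [sentence_bleu([ref], candidate_tokens, smoothing_function=smooth)
              for ref in neighbor_captions]
    return sum(scores) / len(neighbor_captions)
```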

Since l(R,J) is much larger than h(R,J) in absolute value, we normalize them before the linear weighting:

$$ \begin{array}{ll} l'(R,J) &= \frac{l(R,J)-\min_{c \in H}{l(c, J)}}{\max_{c \in H}{l(c, J)}-\min_{c \in H}{l(c, J)}},\\ h'(R,J) &= \frac{h(R,J)-\min_{c \in H}{h(c, J)}}{\max_{c \in H}{h(c, J)}-\min_{c \in H}{h(c, J)}}, \end{array} $$
(19)

where H is the set of generated candidate descriptions. The final evaluation function is as follows:

$$ \mathcal{O}(R,J) = (1-\alpha)l'(R,J) + \alpha h'(R,J),\ 0\leq\alpha\leq1. $$
(20)

Different from training, in the generation phase the labeled captions are no longer available, and the input word at time t is the output word rt− 1. Besides, since our dictionary is large (about 10,000 words after filtering out infrequent ones on the MS COCO dataset), the search space is too large for exhaustive enumeration. Therefore, we use beam search as an approximation. At each time step, we keep a set of the K (the “beam size”) best sentences from K2 candidates according to Eq. 20. When a sentence is completed (the next word generated by the decoder is the stop word, or the sentence reaches the maximum length), it is moved to the final pool, which also has size K and is maintained according to Eq. 20.
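The rescoring used inside the beam search can be sketched as follows: both terms are min-max normalized over the current candidate set H (Eq. 19) and then mixed with weight α (Eq. 20); the candidate expansion and final-pool bookkeeping are omitted for brevity.

```python
def combined_scores(log_likelihoods, consensus_scores, alpha):
    """Eq. (19)-(20): normalize l(R,J) and h(R,J) over the candidate set H,
    then combine them as (1 - alpha) * l' + alpha * h'."""
    def min_max(values):
        lo, hi = min(values), max(values)
        return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]
    l_norm = min_max(log_likelihoods)   # l'(R, J)
    h_norm = min_max(consensus_scores)  # h'(R, J)
    return [(1 - alpha) * l + alpha * h for l, h in zip(l_norm, h_norm)]
```

At every time step, the K2 expanded candidates are ranked by these scores and the top K are kept; completed sentences are moved to the final pool, which is maintained with the same scores.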

Experiments

To evaluate the effectiveness of the proposed method, we carry out extensive experiments on the Flickr30k dataset [66] and MS COCO dataset [5].

Experimental Settings

Datasets

Flickr30k dataset [66] contains 31,783 images, while the more challenging MS COCO dataset [5] consists of 123,287 images. Each image is labeled with at least five captions by different Amazon Mechanical Turk (AMT) workers. Since there is no standardized split for either dataset, we follow the publicly available split as in [19, 23, 61, 65] on Flickr30k and in [23, 61, 65] on MS COCO for fair comparison. That is, on Flickr30k, 1000 images are used for validation, 1000 for testing, and the rest for training; on MS COCO, 5000 images are selected for validation, 5000 for testing, and the rest for training.

Evaluation Metrics

Following the evaluation API provided by the MS COCO server, we report results on different metrics, including BLEU-1, 2, 3, and 4, METEOR, ROUGE-L, and CIDEr. BLEU is based on n-gram precision. METEOR is based on the harmonic mean of unigram precision and recall, weighing recall higher than precision; different from BLEU, METEOR seeks correlation at the corpus level. ROUGE-L measures the longest common subsequence between the target and source sentences. CIDEr is designed to evaluate image descriptions using human consensus. Higher values represent better performance for all these metrics.

Implementation Details

The proposed R-LSTM model is implemented based on the NIC model [59]. The sentences are preprocessed following publicly available code. Unless specified otherwise, the beam size K used in beam search is set to 10, similar to [21], while the parameter α is set to 0.7 for Flickr30k and 0.4 for MS COCO. The LSTM cell size is 512 and the number of layers is 1. The image feature is extracted from the last 2048-dimensional fully connected layer of the ResNet-101 CNN model [17].
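For reference, the hyperparameters stated above can be collected as follows; the dictionary layout itself is only an illustrative convention, not the authors' configuration format.

```python
# Hyperparameters reported in this section; the dict layout is illustrative.
R_LSTM_CONFIG = {
    "encoder": "ResNet-101",              # 2048-d feature vector per image
    "lstm_cell_size": 512,
    "lstm_layers": 1,
    "beam_size": 10,                      # K in the beam search
    "alpha": {"Flickr30k": 0.7, "MS COCO": 0.4},
}
```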

Results on Weighted Word Training

Some of the weighted words are shown in Fig. 6. Take the second image in the first row as an example: considering the overall occurrences, the weights of the main subjects “girls” and “pizzas” are the largest, followed by the modifiers “little” and “pepperoni.” The part-of-speech strategy enlarges the weights of the nouns “girls,” “pepperoni,” and “pizzas” to emphasize their importance. The synonym-based method adjusts the weights based on the semantic meanings of the corresponding synonyms. The last column combines the part-of-speech- and synonym-based methods. We can conclude that after weighting, the main contents of the image are emphasized.

Fig. 6 Results of weighted words. On the right of each image are the original caption words and the weights assigned by the overall occurrences, part of speech, corresponding synonyms, and the combination of the latter two, respectively

The performances of the LSTM networks trained before and after weighting the words are shown in the first and second rows of Tables 1 and 2 for the Flickr30k and MS COCO datasets, respectively. The best performances are emphasized in italics. We can see that compared with the original NIC model [59], the performance is improved by all of the weighting schemes. On the CIDEr metric, the enhanced weighting versions “part of speech,” “synonyms,” and “part of speech + synonyms” outperform the previous “occurrences” version [4]. The combined weighting method generally performs better than the others on the MS COCO dataset, while the “synonyms” method achieves the best performance on four metrics on the Flickr30k dataset. Unless otherwise specified, we report the results of the combined weighting method in the following experiments.

Table 1 Performance comparison (%) of different weighting methods and reference-based generation on the Flickr30k dataset after fine-tuning the CNN encoder
Table 2 Performance comparison (%) of different weighting methods and reference-based generation on the MS COCO dataset after fine-tuning the CNN encoder

Results on Reference-based Generation

To compare the performance contributions of weighted training and reference-based generation, we also conduct experiments on R-LSTM without weighted training. The results on the Flickr30k and MS COCO datasets are shown in the last rows of Tables 1 and 2, respectively. The best results of each column are highlighted in bold. It is clear that reference-based generation achieves the best results on all the metrics, with significant performance gains. From this comparison, we can conclude that reference-based generation contributes more than weighted training in the proposed R-LSTM model.

On Parameter α

The parameter α in Eq. 20 is crucial in our method, as it determines to what extent the generator depends on the references. The black lines in Figs. 7a and 8a show how the quality of the generated captions (on CIDEr) varies with respect to α on the Flickr30k and MS COCO datasets, respectively. For both datasets, as α increases, the performance first improves and then drops, which demonstrates that referring to neighboring images can improve the performance but relying too much on the references leads to poor performance. The best α is 0.7 for Flickr30k and 0.4 for MS COCO, so we can conclude that the best α depends on the dataset.

Fig. 7 The influence of α in the proposed generator on the Flickr30k dataset. a The black and red lines show the influence of parameter α (i.e., α2 = α1 = α) and of α1 (with α2 optimized), respectively. b The blue line is the performance of the generator with different α2 when α1 = 0.7

Fig. 8 The influence of α in the proposed generator on the MS COCO dataset. a The black and red lines show the influence of parameter α (i.e., α2 = α1 = α) and of α1 (with α2 optimized), respectively. b The blue line is the performance of the generator with different α2 when α1 = 0.4

In the generation phase, the sentence length keeps increasing. Since a sentence certainly becomes more informative as it gains more words, it may not be a good idea to keep the weight of the references fixed. We therefore try changing α at different generation stages. For simplicity, we set α = α1 in the early stage and α = α2 in the final pool stage. To adjust α2, we conduct experiments with α1 fixed to 0.7 and 0.4 for the Flickr30k and MS COCO datasets, respectively. As shown by the blue lines in Figs. 7b and 8b, we obtain better performance by varying α2 from 0.0 to 0.4, and the performance tends to decrease when α2 > 0.4. We can conclude that there exists a best α2 for a specified α1. We repeat this process for α1 = 0,0.1,0.2,...,1, and all settings perform better after adjusting α2 (the red lines in Figs. 7a and 8a). We believe that changing α at a finer granularity (i.e., in each generation step) may achieve better performance, which is left as future work. In the following experiments, we report the results with α1 = 0.7, α2 = 0.4 for Flickr30k and α1 = 0.4, α2 = 0.4 for MS COCO, unless otherwise specified.

We take the ninth image in Fig. 9 as an example to understand the beam search process under Eq. 20, i.e., the significance of α1 ≠ 0. We can see that the subject “sheep” is misrecognized as “cattle” when the consensus score is not used; the corresponding beam search process is illustrated in Fig. 10a. At the beginning, the model wavers between “sheep,” “cattle,” and “animal.” As α1 = 0, the model cannot utilize the neighbor images to correct the mistake. When t = 12, there is no “sheep” in the candidate sentences, so regardless of the value of α2, this mistake cannot be corrected. However, when α1 ≠ 0, this situation is avoided with the help of the references, as shown in Fig. 10b, where at t = 8 all the candidate sentences contain the correct subject “sheep.”

Fig. 9 Examples of generated captions by Google NIC [59] (in black) and the proposed R-LSTM model (in red)

Fig. 10 The beam search process of the given image ranked by Eq. 20 when a α1 = 0 and b α1 = 0.4, after weighted training. The red lines are the generated captions that correctly recognize the subject “sheep”

On Beam Search Size K

To analyze the effect of the beam search size K in the testing stage, we show in Fig. 11 the CIDEr performance with the best α1 and α2 as in “On Parameter α,” for beam sizes in the range {1, 2, 3, 5, 10, 20}. The performance curves have a “∧” shape on both datasets as the beam size K varies from 1 to 20. The best K is 10 for both datasets, which is the beam size adopted in the experiments when comparing with other methods. We can conclude that a larger K does not necessarily mean better performance.

Fig. 11 The influence of the beam search size K in the R-LSTM model

Comparison with the State-of-the-arts

Tables 3 and 4 compare the complete model with several state-of-the-art methods on the Flickr30k and MS COCO datasets, respectively, where “-” denotes an unknown score. From Table 3, we can see that the proposed method outperforms the state-of-the-art methods on all the metrics except BLEU-4. It is clear from Table 4 that our approach performs the best on all the metrics, achieving 5.52%, 8.45%, 9.83%, 10.26%, 5.60%, 4.50%, and 10.37% improvements, respectively, over the previous best results. These comparisons demonstrate the effectiveness of the proposed R-LSTM model for image captioning.

Table 3 Performance (%) of the proposed model compared with several state-of-the-art methods on Flickr30k dataset
Table 4 Performance (%) of the proposed model compared with several state-of-the-art methods on MS COCO dataset

We also test our approach on the online MS COCO evaluation server (a sort of competition). The results compared with the latest methods are reported in Table 5. Despite keen competition, we are still among the top ten methods in terms of overall performance. It is noted that the methods that outperform ours utilize either the more complicated REINFORCE algorithm to maximize the likelihood [40, 53] or time-consuming attribute learning [64] and adaptive attention [42]. In principle, our idea of using weighted training and references can also be applied to frameworks such as reinforcement learning and attribute learning, which will be one of our future works. We emphasize that our method performs the best among the published papers [59, 63, 65] adopting the CNN-RNN structure. In addition, compared with our previous conference version (THU-MIG* (ours)) [4], the enhanced algorithm achieves superior performance on almost all metrics.

Table 5 Evaluation results (%) of the latest captioning methods on dataset c5 and c40 on the online MS COCO server (http://mscoco.org/dataset/#captions-leaderboard) by December 8, 2016

To verify that the proposed approach significantly improves the image captioning model, we carry out t test experiments on both the MS COCO and Flickr30k datasets. We choose NIC as the control group because our approach focuses on optimizing the training objective function and the generation process, sharing the same architecture as Google NIC [59]. The results are shown in Table 6. The p values of all metrics on both datasets are smaller than 5%, which demonstrates the significant improvement of the proposed approach.

Table 6 Results of the t test (p values) compared with Google NIC [59] on the MS COCO and Flickr30k datasets

Case Study

Some examples of the generated sentences are illustrated in Fig. 9. The captions in red show how the proposed R-LSTM improves the generation quality compared with Google NIC [59]: misrecognition is fixed in image 3 (bench->floor, laptop->suitcase), image 6 (kitchen->grill), image 8 (bench->skateboard), and image 9 (cattle->sheep); more semantic details are given in image 1 (next to a forest), image 7 (rice and vegetables), image 10 (next to), and image 12 (blanket and a stuffed animal); and the habit of human cognition is better matched in image 2 (sitting on a table vs. filled with lots of), image 4 (sitting in the back seat vs. looking out of the window), image 5 (with a toothbrush in his mouth vs. brushing his teeth with a toothbrush), and image 11 (when holding a Nintendo Wii game controller, the people are actually playing a video game).

Conclusion

In this paper, we have presented a reference-based LSTM model, whose central idea is to use the training images as references to improve the quality of the generated captions. In the training phase, the words are weighted in terms of their relevance to the image, based on the overall occurrences, part of speech, and corresponding synonyms, which drives the model to focus on the key information of the captions. In the generation phase, we proposed a novel evaluation function that combines the likelihood with the consensus score, which can fix misrecognition and make the generated sentences sound more natural. Extensive experiments conducted on the MS COCO and Flickr30k datasets corroborated the superiority of the proposed R-LSTM over the state-of-the-art approaches for image captioning. In future studies, we plan to incorporate attention mechanisms [21, 34, 61, 65] into the reference model and try other weighting strategies. How to generate stylized image captions with emotion [67, 68] and sentiment [45], and how to extend the model to personalized settings [69], are also worth studying. In addition, combining the reference information with reinforcement learning [53] may further improve the image captioning performance.