
1 Introduction

With the rapid development of artificial intelligence in recent years, image captioning has become a hot topic in computer vision and natural language processing. Image captioning is a multidisciplinary task involving signal processing, pattern recognition, computer vision and cognitive science, which aims to describe the content of an input image by learning dense mappings between images and words. Image captioning can be applied to image retrieval, children’s education and life support for visually impaired persons, and thus plays a positive role in social life.

Owing to advances in deep neural networks [1], several state-of-the-art models have been proposed to address the challenges of generating image captions. For example, Mao et al. [2] propose a multimodal recurrent neural network (MRNN) for sentence generation. Xu et al. [3] explore attention mechanisms that capture salient features of the raw image to generate captions. However, these models adopt a unitary image feature rather than feeding global and local visual information simultaneously. Furthermore, they tend to generate captions that are irrelevant to the image content.

To overcome the above limitations, we propose a new global-local feature attention network with reranking strategy (GLAN-RS) for the image-captioning task. As illustrated in Fig. 1, GLAN-RS consists of a global-local feature attention network that exploits visual information and a reranking strategy that selects the consensus caption from the candidate captions.

Fig. 1. The framework of GLAN-RS, which consists of the global-local feature attention network (red dashed) and the reranking strategy network (blue dashed). Emb denotes the dense word representation with two integrated layers. SGRU denotes the stacked Gated Recurrent Unit. (Color figure online)

Our contributions are as follows:

Firstly, GLAN-RS combines the global image feature and local convolutional attention maps to capture holistic and salient visual information simultaneously.

Secondly, we explore a nearest neighbor approach to calculate image similarity and obtain the reference captions of the most similar images. We can then determine the best candidate caption as the one with the highest score with respect to these reference captions.

Moreover, we validate the effectiveness of GLAN-RS against state-of-the-art approaches [2, 4] consistently across seven evaluation metrics, showing that GLAN-RS achieves an improvement of 20% in BLEU4 score and 13 points in CIDEr score.

2 Proposed Method

2.1 Encoder-Decoder Framework for Image Caption Generation

We adopt the popular encoder-decoder framework for image caption generation, where a convolutional neural network (CNN) [5, 6] encodes the image into a fixed-dimension feature vector and a stacked Gated Recurrent Unit (SGRU) [7] decodes the visual information into semantic information. Given an input image and its corresponding caption, the encoder-decoder model directly maximizes the log-likelihood of the following objective:

$$\begin{aligned} \theta ^{*} = \mathop {\arg \max }_{\theta }\sum _{i=0}^L \log p(s_i \mid f_I;\theta ) \end{aligned}$$
(1)
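
In practice, maximizing this log-likelihood amounts to minimizing the summed per-word cross-entropy of the decoder predictions. The following is a minimal sketch of this objective with our own, hypothetical function and tensor names (not the authors' code):

```python
import torch
import torch.nn.functional as F

def caption_nll(word_logits, target_words):
    # word_logits: (L+1, vocab_size) decoder scores conditioned on the image feature f_I
    # target_words: (L+1,) indices of the ground-truth words s_0 ... s_L
    # The summed cross-entropy equals the negative log-likelihood in Eq. (1).
    return F.cross_entropy(word_logits, target_words, reduction="sum")
```
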
Fig. 2. The global image attention feature and regional convolutional attention feature extraction process.

2.2 Global Image Attention Feature

In Eq. (1), \(f_I\) denotes the global image representation of the raw image, \(s_i\) denotes the i-th word in a sentence of length L, and \(\theta\) denotes the parameters of the model. The image information should be encoded as a fixed-length vector before being fed into the SGRU. A CNN is used to extract the image feature \(f_I\) from a raw image I, giving the model an overview of the image content:

$$\begin{aligned} f_I = CNN(I) \end{aligned}$$
(2)

As shown in Fig. 2, we capture the global image attention feature from the last fully-connected layer.
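
To make this extraction step concrete, the following is a minimal sketch (our own illustration, not the authors' released code) of obtaining both feature types of Fig. 2 with torchvision's VGG16; the preprocessing values and layer slicing are assumptions based on the standard VGG16 architecture.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a pre-trained VGG16 (older torchvision API; newer versions use the weights= argument).
vgg = models.vgg16(pretrained=True).eval()

# Everything in vgg.features except the final max-pool ends at the conv5_3 ReLU,
# which yields a 512 x 14 x 14 map for a 224 x 224 input.
conv5_3 = torch.nn.Sequential(*list(vgg.features.children())[:-1])
# All fully-connected layers except the final 1000-way classifier give the
# 4096-dimensional global feature f_I.
fc_layers = vgg.classifier[:-1]

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_path):
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        local_map = conv5_3(x)                         # (1, 512, 14, 14)
        pooled = vgg.avgpool(vgg.features(x))          # (1, 512, 7, 7)
        f_I = fc_layers(torch.flatten(pooled, 1))      # (1, 4096) global feature
    # Flatten the spatial grid into k = 196 region vectors I_1 ... I_k.
    return f_I, local_map.flatten(2).transpose(1, 2)   # (1, 196, 512)
```
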

2.3 Local Convolutional Image Attention Feature

In the encoder-decoder framework with the SGRU and the local convolutional attention map, the conditional probability in the log-likelihood function is modeled as:

$$\begin{aligned} \log p(s_t \mid s_{1:t-1}) = \tanh (h_t,c_t) \end{aligned}$$
(3)

where \(c_t\) is the attention context vector obtained from the conv5_3 layer (the third convolutional layer of the fifth convolutional block). In this paper, we utilize an SGRU with two hidden layers to modulate the information flow inside the unit instead of applying separate memory cells. \(h_t\) is the activation of the hidden state of the SGRU at time t, which is a linear combination of the previous activation \(h_{t-1}\) and the candidate activation \(h_t^{'}\):

$$\begin{aligned} h_t=(1-z_t)h_{t-1}+{z_t}h_t^{'} \end{aligned}$$
(4)

where the update gate \(z_t\) determines how much the unit updates its content. The attention context vector \(c_t\) in the spatial attention mechanism is computed as:

$$\begin{aligned} c_t=f_{att}(h_t,I) \end{aligned}$$
(5)

where \(f_{att}\) is the spatial attention function. \(I\in R^{d \times k}=[I_1,\ldots ,I_k]\), where each \(I_i\in R^d\) is a local convolutional image feature, i.e. a d-dimensional representation corresponding to a region of the image in the conv5_3 layer. As shown in Fig. 2, we exploit the 14 × 14 intermediate feature map through the attention mechanism.

Fig. 3. The visualization of the raw image.

Fig. 4. The visualization of the partial activation features in the conv5_3 layer.

We show a raw image in Fig. 3 and visualize its CNN features in Fig. 4. We find that the fifth convolutional block is effective at capturing accurate semantic content due to its low spatial resolution, while its third layer is more effective at highlighting details precisely. This means that the activation features in the conv5_3 layer can detect pivotal image areas and guide our model to excavate regional visual representations.

We feed the spatial image features \(I_i\) and the hidden state \(h_t\) through a hyperbolic tangent function followed by a softmax function to generate the attention distribution over the k regions of the image:

$$\begin{aligned} \alpha _{ti}=\frac{\exp (e_{ti})}{\sum _{i=1}^k \exp (e_{ti})} \end{aligned}$$
(6)
$$\begin{aligned} e_{ti}=\tanh (w_i I_i+w_h h_t) \end{aligned}$$
(7)

where \(w_i\) and \(w_h\) are projection parameters to be learned, and \(\alpha _{ti}\) is the spatial attention weight over the features \(I_i\). The attention context vector \(c_t\) is computed based on the attention distribution:

$$\begin{aligned} c_t=\sum _{i=1}^k \alpha _{ti} I_i \end{aligned}$$
(8)

We combine \(c_t\) and \(h_t\) to predict the next word as in Eq. (3). As shown in Fig. 5, we design a bimodal layer to process the information from the local image attention feature and the activation of the current hidden layer of the SGRU. A cascaded semantic layer is explored to capture dense syntactic representations.
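
The attention computation in Eqs. (6)–(8) can be sketched as the following illustrative PyTorch module; the tensor shapes (512-dimensional conv5_3 features, a 2048-dimensional SGRU state) and module layout are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=2048):
        super().__init__()
        self.w_i = nn.Linear(feat_dim, 1, bias=False)    # scores each region feature I_i
        self.w_h = nn.Linear(hidden_dim, 1, bias=False)  # scores the SGRU hidden state h_t

    def forward(self, regions, h_t):
        # regions: (batch, k, feat_dim), e.g. k = 196 conv5_3 positions
        # h_t:     (batch, hidden_dim)
        e = torch.tanh(self.w_i(regions) + self.w_h(h_t).unsqueeze(1))  # Eq. (7): (batch, k, 1)
        alpha = torch.softmax(e, dim=1)                                  # Eq. (6): weights over k regions
        c_t = (alpha * regions).sum(dim=1)                               # Eq. (8): context vector
        return c_t, alpha.squeeze(-1)
```
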

Fig. 5. The flowchart of the GLAN model. Modules of different colors represent layers of different functions. The input layer feeds the words one by one. (Color figure online)

2.4 Reranking Strategy

For a test image, we first utilize the global-local feature attention network to generate hypothesis captions with the beam search algorithm. Then we use a nearest neighbor approach to find similar images and their corresponding reference captions. In this paper, the Euclidean distance \(D_E (x_1,x_2 )\) is adopted to calculate feature similarities, defined by:

$$\begin{aligned} D_E (x_1,x_2 )=\sqrt{\sum _{k=1}^N {(x_{1k}-x_{2k})}^2} \end{aligned}$$
(9)
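
A minimal sketch of this nearest-neighbor retrieval step is given below; the feature matrix layout and the number of retrieved neighbors are illustrative assumptions, not values from the paper.

```python
import numpy as np

def nearest_neighbors(query_feat, train_feats, k=60):
    # query_feat: (d,) CNN feature of the test image
    # train_feats: (num_images, d) CNN features of the training images
    dists = np.sqrt(((train_feats - query_feat) ** 2).sum(axis=1))  # Eq. (9)
    return np.argsort(dists)[:k]  # indices of the k most similar training images
```
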

We define the consensus caption \(c^*\) as the one that has the highest accumulated similarity with the reference captions. The consensus caption selection function is defined by [8]:

$$\begin{aligned} c^*=\mathop {\arg \max }_{c_1 \in H} \sum _{c_2 \in R} Sim(c_1,c_2) \end{aligned}$$
(10)

where H is the set of hypothesis captions and R is the set of reference captions. \(Sim(c_1,c_2)\) is the accumulated lexical similarity between two captions \(c_1\) and \(c_2\):

$$\begin{aligned} Sim(c_1,c_2)=BP \cdot e^{{\frac{1}{N}} \sum _{n=1}^N \log p_n} \end{aligned}$$
(11)

where \(p_n\) is the modified n-gram precision and BP is a brevity penalty for short sentences, which is computed by:

$$\begin{aligned} BP = \min (1,e^{1-\frac{b}{c}}) \end{aligned}$$
(12)

where b is the length of the reference sentence while c is the length of the candidate sentence.
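
Putting Eqs. (10)–(12) together, the reranking step can be sketched as follows; this is a self-contained illustration under our own function names and example data, not the authors' implementation.

```python
import math
from collections import Counter

def modified_precision(hyp, ref, n):
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return clipped / max(sum(hyp_ngrams.values()), 1)

def sim(hyp, ref, max_n=4):
    """Eq. (11): geometric mean of modified n-gram precisions with brevity penalty."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))          # Eq. (12)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def consensus_caption(hypotheses, references):
    """Eq. (10): pick the hypothesis with the highest accumulated similarity."""
    return max(hypotheses, key=lambda h: sum(sim(h, r) for r in references))

# Example with tokenized captions (illustrative data only):
hyps = [["a", "dog", "catches", "a", "frisbee"], ["a", "dog", "runs"]]
refs = [["a", "dog", "is", "catching", "a", "frisbee"],
        ["a", "dog", "catches", "a", "frisbee", "in", "the", "park"]]
print(consensus_caption(hyps, refs))
```
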

3 Experimental Results

We evaluate the performance of GLAN-RS on the MSCOCO dataset [9, 10], which contains 82,783 training images and 40,504 validation images, each with 5 reference captions provided by human annotators [11]. We randomly sample 1000 images for testing. We utilize two classical CNN models, i.e., the IV3 network [5] and VggNet [6], to encode the visual information. Seven standard evaluation metrics are adopted, including the BLEU scores (B-1, B-2, B-3 and B-4) [12] and the MSCOCO evaluation toolkit metrics (ROUGE-L, CIDEr and METEOR) [13,14,15].

3.1 Datasets

Microsoft COCO Caption is a dataset released by Microsoft that contains almost 300,000 images, each with five reference sentences. The MS COCO Caption dataset was created for image captioning and not only provides rich images and captions but also supplies evaluation servers and code for the evaluation metrics. It has become the first choice of researchers in recent years. Figure 6 illustrates some pictures from the MSCOCO dataset.

Fig. 6. The illustration of pictures in the MSCOCO dataset.

3.2 Evaluation Metrics

To validate the quality of the captions generated by the model, we choose seven automatic evaluation metrics that have a high correlation with human judgments. The higher the metric scores, the better the quality of the generated captions.

BLEU [12] was designed for the automatic evaluation of statistical machine translation and can be adopted to measure the similarity of descriptions. Given the diversity of possible image descriptions, BLEU may penalize candidates that are arguably descriptive of the image content. CIDEr [13] is an automatic evaluation metric designed specifically for image captioning, which measures the consensus of captions by calculating TF-IDF weights for each n-gram. ROUGE [14] is a set of evaluation metrics designed to evaluate text summarization algorithms; ROUGE-L uses a measure based on the longest common subsequence. METEOR [15] is the harmonic mean of unigram precision and recall and allows exact, synonym, and paraphrase matches between candidates and references. Researchers can calculate precision, recall, and F-measure with METEOR.

3.3 Model Details

We train all weights using stochastic gradient descent with a scheduled learning rate and no momentum. The learning rate is fixed during the initial training period and is then decayed by 15% per period thereafter.

Dropout and the ReLU function are adopted to avoid vanishing or exploding gradients. All weights are randomly initialized in the range (−1.0, 1.0). We use 1028 dimensions for the dense embedding, and the size of the SGRU is set to 2048. To infer a sentence given an input image, we use the beam search algorithm, which keeps the best m generated captions. For convenience, we introduce the architectural details in the following:

GLAN+VGG uses VggNet to extract 4096-dimensional holistic visual embedding features, while GLAN+IV3 utilizes the Inception V3 network to encode 2048-dimensional holistic image representations. GLAN+RS applies the Euclidean distance function to re-rank the candidate captions. We denote GLAN+RS as our proposed model and compare it with several state-of-the-art models.
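
For reference, the beam search decoding mentioned in the model details can be sketched as follows; the `decode_step` interface, token ids and default beam width are hypothetical placeholders rather than the authors' code.

```python
import heapq
import math

def beam_search(decode_step, start_id, end_id, beam_width=3, max_len=20):
    """decode_step(tokens) is assumed to return a dict {token_id: probability}
    over the next word given the partial caption `tokens`."""
    beams = [(0.0, [start_id])]                  # (accumulated negative log-prob, tokens)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == end_id:             # finished captions are carried forward
                candidates.append((score, tokens))
                continue
            for token_id, prob in decode_step(tokens).items():
                candidates.append((score - math.log(prob + 1e-12), tokens + [token_id]))
        beams = heapq.nsmallest(beam_width, candidates)   # keep the best m partial captions
        if all(tokens[-1] == end_id for _, tokens in beams):
            break
    return [tokens for _, tokens in sorted(beams)]        # the m best captions, most probable first
```
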

3.4 Performance Comparison of GLAN-RS with the State-of-the-Art Models

To verify the superiority of our proposed model, we compare its performance with several state-of-the-art methods, including Google NIC [4], GLSTM [11], Attention [3] and MRNN [2].

Table 1. Comparison results on MSCOCO dataset by using VggNet to extract image features. (-) indicates an unknown metric.
Table 2. Comparison results on MSCOCO dataset by using IV3 network to extract image features. (-) indicates an unknown metric.

Tables 1 and 2 list the comparison results, where the visual embedding features are encoded with VggNet and the IV3 network respectively. From the tables, we can draw the following conclusions:

  1. GLAN performs best for almost all the listed evaluation criteria when using VggNet and the IV3 network respectively to extract the visual embedding features. This may be due to the fact that our model employs global and local attention features to learn a deep image-to-word mapping, which exploits semantic context to generate high-quality image captions.

  2. GLAN+IV3 outperforms GLAN+VGG, which means that a robust image representation benefits the performance.

  3. The attention mechanism is able to capture the latent visual representation, achieving a higher BLEU1 score and generating high-level captions.

  4. The reranking strategy makes it possible to select the caption with the highest accumulated similarity to the reference captions.

  5. GLAN+RS shows a remarkable improvement on all evaluation criteria. It significantly outperforms state-of-the-art approaches such as MRNN and Google NIC, with an improvement of 20% in BLEU4 score and 13 points in CIDEr score.

Fig. 7. The illustration of captions generated by the GLAN-RS model.

3.5 Generation Results

To qualitatively validate the effectiveness of GLAN-RS, we select several images and generate captions for them with the GLAN-RS model, as shown in Fig. 7.

As shown in Fig. 7, our proposed model can detect activity phrases such as “catch a frisbee”, “ride a snowboard” and “pulling a carriage” in different instances, which indicates that GLAN-RS is able to capture the key action information in an image. Our model notices the “red fire hydrant” and the “donut with sprinkles” in instances (D) and (C), which validates that our model is able to exploit latent detail information. Notably, the phrase “at night” is generated for instance (F), which shows that GLAN-RS benefits from its distinctive encoder-decoder framework, which combines global and regional attention features to capture semantically related visual concept representations.

4 Conclusion

In this paper, we propose a novel global-local feature attention network with reranking strategy (GLAN-RS) for image caption generation. We utilize the global image feature and local convolutional attention maps to exploit visual representations. The consensus caption is selected by a nearest neighbor approach combined with a reranking strategy. Experimental results on the benchmark MSCOCO dataset demonstrate the superiority of GLAN-RS in generating high-quality image captions.