
1 Introduction

With the rapid development of artificial intelligence in recent years, image captioning has become a hot topic in computer vision and natural language processing. Image captioning is a multidisciplinary task involving signal processing, pattern recognition, computer vision and cognitive science, which aims to describe the content of an input image by learning dense mappings between images and words. Image captioning can be applied to image retrieval, children’s education and life support for visually impaired persons, and thus plays a positive role in social life.

Owing to advances in deep neural networks [1], several state-of-the-art models have been proposed to address the challenges of generating image captions. For example, Mao et al. [2] propose a multimodal recurrent neural network (MRNN) for sentence generation. Xu et al. [3] explore attention mechanisms that capture salient features of the raw image to generate captions. However, these models adopt a unitary image feature rather than feeding global and local visual information simultaneously. Furthermore, they tend to generate captions that are irrelevant to the image content.

To overcome the above limitations, we propose a new global-local feature attention network with reranking strategy (GLAN-RS) for the image-captioning task. As illustrated in Fig. 1, GLAN-RS consists of a global-local feature attention network that exploits visual information and a reranking strategy that selects the consensus caption from the candidate captions.

Fig. 1. The framework of GLAN-RS, which consists of the global-local feature attention network (red dashed) and the reranking strategy network (blue dashed). Emb denotes the dense word representation with two integrated layers. SGRU denotes the stacked Gated Recurrent Unit. (Color figure online)

Our contributions are as follows:

Firstly, GLAN-RS combines the global image feature and local convolutional attention maps to capture holistic and salient visual information simultaneously.

Secondly, we explore a nearest neighbor approach to calculate image similarity and obtain the reference captions of the most similar images. We can then determine the best candidate caption as the one with the highest score with respect to these reference captions.

Moreover, we validate the effectiveness of GLAN-RS against state-of-the-art approaches [2, 4] consistently across seven evaluation metrics, showing that GLAN-RS achieves an improvement of 20% in BLEU4 score and 13 points in CIDEr score.

2 Proposed Method

2.1 Encoder-Decoder Framework for Image Caption Generation

We adopt the popular encoder-decoder framework for image caption generation, where a convolutional neural network (CNN) [5, 6] encodes the image into a fixed-dimension feature vector and a stacked Gated Recurrent Unit (SGRU) [7] decodes the visual information into semantic information. Given an input image and its corresponding caption, the encoder-decoder model directly maximizes the log-likelihood of the following objective:

$$\begin{aligned} \theta ^{*} = \mathop {\arg \max }_{\theta }\sum _{i=0}^L \log p(s_i \mid f_I;\theta ) \end{aligned}$$
(1)
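
In practice, maximizing this log-likelihood amounts to minimizing the summed per-word cross-entropy of the decoder predictions. The following is a minimal sketch of this objective with our own, hypothetical function and tensor names (not the authors' code):

```python
import torch
import torch.nn.functional as F

def caption_nll(word_logits, target_words):
    # word_logits: (L+1, vocab_size) decoder scores conditioned on the image feature f_I
    # target_words: (L+1,) indices of the ground-truth words s_0 ... s_L
    # The summed cross-entropy equals the negative log-likelihood in Eq. (1).
    return F.cross_entropy(word_logits, target_words, reduction="sum")
```
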
Fig. 2. The global image attention feature and regional convolutional attention feature extraction process.

2.2 Global Image Attention Feature

In Eq. (1), \(f_I\) denotes the global image representation of the raw image, \(s_i\) denotes the i-th word in a sentence of length L, and \(\theta\) denotes the parameters of the model. The image information should be encoded as a fixed-length vector before being fed into the SGRU. A CNN is used to extract the image feature \(f_I\) from a raw image I, giving the model an overview of the image content:

$$\begin{aligned} f_I = CNN(I) \end{aligned}$$
(2)

As shown in Fig. 2, we capture the global image attention feature from the last fully-connected layer.
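
To make this extraction step concrete, the following is a minimal sketch (our own illustration, not the authors' released code) of obtaining both feature types of Fig. 2 with torchvision's VGG16; the preprocessing values and layer slicing are assumptions based on the standard VGG16 architecture.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a pre-trained VGG16 (older torchvision API; newer versions use the weights= argument).
vgg = models.vgg16(pretrained=True).eval()

# Everything in vgg.features except the final max-pool ends at the conv5_3 ReLU,
# which yields a 512 x 14 x 14 map for a 224 x 224 input.
conv5_3 = torch.nn.Sequential(*list(vgg.features.children())[:-1])
# All fully-connected layers except the final 1000-way classifier give the
# 4096-dimensional global feature f_I.
fc_layers = vgg.classifier[:-1]

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_path):
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        local_map = conv5_3(x)                         # (1, 512, 14, 14)
        pooled = vgg.avgpool(vgg.features(x))          # (1, 512, 7, 7)
        f_I = fc_layers(torch.flatten(pooled, 1))      # (1, 4096) global feature
    # Flatten the spatial grid into k = 196 region vectors I_1 ... I_k.
    return f_I, local_map.flatten(2).transpose(1, 2)   # (1, 196, 512)
```
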

2.3 Local Convolutional Image Attention Feature

In the encoder-decoder framework with the SGRU and the local convolutional attention map, the conditional probability in the log-likelihood function is modeled as:

$$\begin{aligned} \log p(s_t \mid s_{1:t-1}) = \tanh (h_t,c_t) \end{aligned}$$
(3)

where \(c_t\) is the attention context vector obtained from the conv5_3 layer (the third convolutional layer of the fifth convolutional block). In this paper, we utilize an SGRU with two hidden layers to modulate the information flow inside the unit instead of applying separate memory cells. \(h_t\) is the activation of the hidden state of the SGRU at time t, which is a linear combination of the previous activation \(h_{t-1}\) and the candidate activation \(h_t^{'}\):

$$\begin{aligned} h_t=(1-z_t)h_{t-1}+{z_t}h_t^{'} \end{aligned}$$
(4)

where the update gate \(z_t\) determines how much the unit updates its content. The attention context vector \(c_t\) in the spatial attention mechanism is computed as:

$$\begin{aligned} c_t=f_{att}(h_t,I) \end{aligned}$$
(5)

where \(f_{att}\) is the spatial attention function. \(I\in R^{d \times k}=[I_1,\ldots ,I_k]\), where each \(I_i\in R^d\) is a local convolutional image feature, i.e. a d-dimensional representation corresponding to a region of the image in the conv5_3 layer. As shown in Fig. 2, we exploit the 14 × 14 intermediate feature map through the attention mechanism.

Fig. 3. The visualization of the raw image.

Fig. 4. The visualization of the partial activation features in the conv5_3 layer.

We show a raw image in Fig. 3 and visualize its CNN features in Fig. 4. We find that the fifth convolutional block is effective at capturing accurate semantic content due to its low spatial resolution, while its third layer is more effective at highlighting details precisely. This means that the activation features in the conv5_3 layer can detect pivotal image areas and guide our model to excavate regional visual representations.

We feed the spatial image features \(I_i\) and the hidden state \(h_t\) through a hyperbolic tangent function followed by a softmax function to generate the attention distribution over the k regions of the image:

$$\begin{aligned} \alpha _{ti}=\frac{\exp (e_{ti})}{\sum _{i=1}^k \exp (e_{ti})} \end{aligned}$$
(6)
$$\begin{aligned} e_{ti}=\tanh (w_i I_i+w_h h_t) \end{aligned}$$
(7)

where \(w_i\) and \(w_h\) are projection parameters to be learned, and \(\alpha _{ti}\) is the spatial attention weight over the features \(I_i\). The attention context vector \(c_t\) is computed based on the attention distribution:

$$\begin{aligned} c_t=\sum _{i=1}^k \alpha _{ti} I_i \end{aligned}$$
(8)

We combine \(c_t\) and \(h_t\) to predict the next word as in Eq. (3). As shown in Fig. 5, we design a bimodal layer to process the information from the local image attention feature and the activation of the current hidden layer of the SGRU. A cascaded semantic layer is explored to capture dense syntactic representations.
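
The attention computation in Eqs. (6)–(8) can be sketched as the following illustrative PyTorch module; the tensor shapes (512-dimensional conv5_3 features, a 2048-dimensional SGRU state) and module layout are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=2048):
        super().__init__()
        self.w_i = nn.Linear(feat_dim, 1, bias=False)    # scores each region feature I_i
        self.w_h = nn.Linear(hidden_dim, 1, bias=False)  # scores the SGRU hidden state h_t

    def forward(self, regions, h_t):
        # regions: (batch, k, feat_dim), e.g. k = 196 conv5_3 positions
        # h_t:     (batch, hidden_dim)
        e = torch.tanh(self.w_i(regions) + self.w_h(h_t).unsqueeze(1))  # Eq. (7): (batch, k, 1)
        alpha = torch.softmax(e, dim=1)                                  # Eq. (6): weights over k regions
        c_t = (alpha * regions).sum(dim=1)                               # Eq. (8): context vector
        return c_t, alpha.squeeze(-1)
```
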

Fig. 5. The flowchart of the GLAN model. Modules of different colors represent layers of different functions. The input layer feeds the words one by one. (Color figure online)

2.4 Reranking Strategy

For a test image, we first utilize the global-local feature attention network to generate hypothesis captions with the beam search algorithm. Then we use a nearest neighbor approach to find similar images and their corresponding reference captions. In this paper, the Euclidean distance \(D_E (x_1,x_2 )\) is adopted to calculate feature similarities, defined by:

$$\begin{aligned} D_E (x_1,x_2 )=\sqrt{\sum _{k=1}^N {(x_{1k}-x_{2k})}^2} \end{aligned}$$
(9)
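
A minimal sketch of this nearest-neighbor retrieval step is given below; the feature matrix layout and the number of retrieved neighbors are illustrative assumptions, not values from the paper.

```python
import numpy as np

def nearest_neighbors(query_feat, train_feats, k=60):
    # query_feat: (d,) CNN feature of the test image
    # train_feats: (num_images, d) CNN features of the training images
    dists = np.sqrt(((train_feats - query_feat) ** 2).sum(axis=1))  # Eq. (9)
    return np.argsort(dists)[:k]  # indices of the k most similar training images
```
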

We define the consensus caption \(c^*\) as the one that has the highest accumulated similarity with the reference captions. The consensus caption selection function is defined by [8]:

$$\begin{aligned} c^*=\mathop {\arg \max }_{c_1 \in H} \sum _{c_2 \in R} Sim(c_1,c_2) \end{aligned}$$
(10)

where H is the set of hypothesis captions and R is the set of reference captions. \(Sim(c_1,c_2)\) is the accumulated lexical similarity between two captions \(c_1\) and \(c_2\):

$$\begin{aligned} Sim(c_1,c_2)=BP \cdot e^{{\frac{1}{N}} \sum _{n=1}^N \log p_n} \end{aligned}$$
(11)

where \(p_n\) is the modified n-gram precision and BP is a brevity penalty for short sentences, which is computed by:

$$\begin{aligned} BP = \min (1,e^{1-\frac{b}{c}}) \end{aligned}$$
(12)

where b is the length of the reference sentence while c is the length of the candidate sentence.
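
Putting Eqs. (10)–(12) together, the reranking step can be sketched as follows; this is a self-contained illustration under our own function names and example data, not the authors' implementation.

```python
import math
from collections import Counter

def modified_precision(hyp, ref, n):
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return clipped / max(sum(hyp_ngrams.values()), 1)

def sim(hyp, ref, max_n=4):
    """Eq. (11): geometric mean of modified n-gram precisions with brevity penalty."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))          # Eq. (12)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def consensus_caption(hypotheses, references):
    """Eq. (10): pick the hypothesis with the highest accumulated similarity."""
    return max(hypotheses, key=lambda h: sum(sim(h, r) for r in references))

# Example with tokenized captions (illustrative data only):
hyps = [["a", "dog", "catches", "a", "frisbee"], ["a", "dog", "runs"]]
refs = [["a", "dog", "is", "catching", "a", "frisbee"],
        ["a", "dog", "catches", "a", "frisbee", "in", "the", "park"]]
print(consensus_caption(hyps, refs))
```
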

3 Experimental Results

We evaluate the performance of GLAN-RS on the MSCOCO dataset [9, 10], which contains 82,783 training images and 40,504 validation images, each with 5 reference captions provided by human annotators [11]. We randomly sample 1000 images for testing. We utilize two classical CNN models, i.e., the IV3 network [5] and VggNet [6], to encode the visual information. Seven standard evaluation metrics are adopted, including the BLEU scores (B-1, B-2, B-3 and B-4) [12] and the MSCOCO evaluation toolkit metrics (ROUGE-L, CIDEr and METEOR) [13,14,15].

3.1 Datasets

Microsoft COCO Caption is a dataset released by Microsoft that contains almost 300,000 images, each with five reference sentences. The MS COCO Caption dataset was created for image captioning and not only provides rich images and captions but also supplies evaluation servers and code for the evaluation metrics. It has become the first choice of researchers in recent years. Figure 6 illustrates some pictures from the MSCOCO dataset.

Fig. 6. The illustration of pictures in the MSCOCO dataset.

3.2 Evaluation Metrics

To validate the quality of the captions generated by the model, we choose seven automatic evaluation metrics that have a high correlation with human judgments. The higher the metric scores, the better the quality of the generated captions.

BLEU [12] was designed for the automatic evaluation of statistical machine translation and can be adopted to measure the similarity of descriptions. Given the diversity of possible image descriptions, BLEU may penalize candidates that are arguably descriptive of the image content. CIDEr [13] is an automatic evaluation metric designed specifically for image captioning, which measures the consensus of captions by calculating TF-IDF weights for each n-gram. ROUGE [14] is a set of evaluation metrics designed to evaluate text summarization algorithms; ROUGE-L uses a measure based on the longest common subsequence. METEOR [15] is the harmonic mean of unigram precision and recall and allows exact, synonym, and paraphrase matches between candidates and references. Researchers can calculate precision, recall, and F-measure with METEOR.

3.3 Model Details

We train all weights using stochastic gradient descent with a scheduled learning rate and no momentum. The learning rate is fixed during the initial training period and is then decayed by 15% per period thereafter.

Dropout and the ReLU function are adopted to avoid vanishing or exploding gradients. All weights are randomly initialized in the range (−1.0, 1.0). We use 1028 dimensions for the dense embedding, and the size of the SGRU is set to 2048. To infer a sentence given an input image, we use the beam search algorithm, which keeps the best m generated captions. For convenience, we introduce the architectural details in the following:

GLAN+VGG uses VggNet to extract 4096-dimensional holistic visual embedding features, while GLAN+IV3 utilizes the Inception V3 network to encode 2048-dimensional holistic image representations. GLAN+RS applies the Euclidean distance function to re-rank the candidate captions. We denote GLAN+RS as our proposed model and compare it with several state-of-the-art models.
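
For reference, the beam search decoding mentioned in the model details can be sketched as follows; the `decode_step` interface, token ids and default beam width are hypothetical placeholders rather than the authors' code.

```python
import heapq
import math

def beam_search(decode_step, start_id, end_id, beam_width=3, max_len=20):
    """decode_step(tokens) is assumed to return a dict {token_id: probability}
    over the next word given the partial caption `tokens`."""
    beams = [(0.0, [start_id])]                  # (accumulated negative log-prob, tokens)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == end_id:             # finished captions are carried forward
                candidates.append((score, tokens))
                continue
            for token_id, prob in decode_step(tokens).items():
                candidates.append((score - math.log(prob + 1e-12), tokens + [token_id]))
        beams = heapq.nsmallest(beam_width, candidates)   # keep the best m partial captions
        if all(tokens[-1] == end_id for _, tokens in beams):
            break
    return [tokens for _, tokens in sorted(beams)]        # the m best captions, most probable first
```
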

3.4 Performance Comparison of GLAN-RS with the State-of-the-Art Models

To verify the superiority of our proposed model, we compare its performance with several state-of-the-art methods, including Google NIC [4], GLSTM [11], Attention [3] and MRNN [2].

Table 1. Comparison results on MSCOCO dataset by using VggNet to extract image features. (-) indicates an unknown metric.
Table 2. Comparison results on MSCOCO dataset by using IV3 network to extract image features. (-) indicates an unknown metric.

Tables 1 and 2 list the comparison results, where the visual embedding features are encoded with VggNet and the IV3 network respectively. From the tables, we can draw the following conclusions:

  1. GLAN performs best for almost all the listed evaluation criteria when using VggNet and the IV3 network respectively to extract the visual embedding features. This may be due to the fact that our model employs global and local attention features to learn a deep image-to-word mapping, which exploits semantic context to generate high-quality image captions.

  2. GLAN+IV3 outperforms GLAN+VGG, which means that a robust image representation benefits the performance.

  3. The attention mechanism is able to capture the latent visual representation, achieving a higher BLEU1 score and generating high-level captions.

  4. The reranking strategy makes it possible to select the caption with the highest accumulated similarity to the reference captions.

  5. GLAN+RS shows a remarkable improvement on all evaluation criteria. It significantly outperforms state-of-the-art approaches such as MRNN and Google NIC, with an improvement of 20% in BLEU4 score and 13 points in CIDEr score.

Fig. 7. The illustration of captions generated by the GLAN-RS model.

3.5 Generation Results

To qualitatively validate the effectiveness of GLAN-RS, we select several images and generate captions for them with the GLAN-RS model, as shown in Fig. 7.

As shown in Fig. 7, our proposed model can detect activity phrases such as “catch a frisbee”, “ride a snowboard” and “pulling a carriage” in different instances, which indicates that GLAN-RS is able to capture the key action information in an image. Our model notices the “red fire hydrant” and the “donut with sprinkles” in instances (D) and (C), which validates that our model is able to exploit latent detail information. Notably, the phrase “at night” is generated for instance (F), which shows that GLAN-RS benefits from its distinctive encoder-decoder framework, which combines global and regional attention features to capture semantically related visual concept representations.

4 Conclusion

In this paper, we propose a novel global-local feature attention network with reranking strategy (GLAN-RS) for image caption generation. We utilize the global image feature and local convolutional attention maps to exploit visual representations. The consensus caption is selected by a nearest neighbor approach combined with a reranking strategy. Experimental results on the benchmark MSCOCO dataset demonstrate the superiority of GLAN-RS in generating high-quality image captions.