1 Introduction

Image captioning is a potent and useful tool for automatically describing or explaining the overall situation of an image [5, 22, 24]. However, generating qualitatively detailed and distinctive captions is still an open issue. Although captions with unique expressions are in most cases more useful than those containing only safe, generic ones, current evaluation metrics do not adequately reflect this aspect. After the numerical performance of earlier work reached a certain level, several studies began to investigate how to generate detailed and accurate captions [4].

In this paper, we propose a Distinctive-attribute Extraction (DaE) method that extracts attributes which explicitly encourage an RNN to generate a caption describing the significant meaning of an image. The main contributions of this paper are as follows: (i) We propose a semantics extraction method based on TF-IDF analysis of captions. (ii) We propose a scheme to infer distinctive attributes with a model trained on this semantic information. (iii) We perform quantitative and qualitative evaluations, demonstrating that the proposed method improves the performance of a base caption generation model by a substantial margin while describing images more distinctively.

2 Related Work

Combinations of CNNs and RNNs are widely used in image captioning networks [5, 6, 8, 22, 23, 24]. The CNN serves as an image encoder, and the output of its last hidden layer is fed into an RNN decoder that generates sentences. Recent approaches can be grouped into two paradigms: top-down approaches include attention-based mechanisms, whereas many bottom-up methods use semantic concepts. For the latter, Fang et al. [6] used multiple instance learning (MIL) to train word detectors with words that commonly occur in captions. The word detector outputs guided a language model to generate descriptions that include the detected words. Wu et al. [23] predicted attributes by treating the problem as multi-label classification; a CNN framework was used, and outputs from different proposal sub-regions were aggregated. Gan et al. [8] proposed the Semantic Concept Network (SCN), which integrates semantic concepts into the LSTM network. SCN factorizes each weight matrix of the attribute-integrated LSTM model to reduce the number of parameters.

Fig. 1. An overview of the proposed framework, including the semantic information extraction procedure and the distinctive-attribute prediction model

More recently, Dai et al. [4] proposed the Contrastive Learning (CL) method, which encourages the distinctiveness of captions. In addition to true image-caption pairs, this method learns from mismatched pairs whose captions describe other images.

3 Distinctive-Attribute Extraction

In this section, we describe a semantic information processing and extraction method that affects the quality of generated captions. We propose a method to generate captions that represent the unique situation of an image. Unlike CL [4], which improves a target method by adding mismatched pairs to the training set, our method belongs to the bottom-up approaches that use semantic attributes. We assign larger weights to the attributes that are more informative and distinctive for describing the image. As illustrated in Fig. 1, there are two main steps: semantic information extraction and distinctive-attribute prediction. First, we extract meaningful information from reference captions. Next, we train the distinctive-attribute prediction model with image-information (\({D}_{g}\)) pairs. After obtaining distinctive attributes (\({D}_{p}\)) from images, we apply these attributes to a caption generation network to verify their effect. For this network, we use SCN-LSTM [8], a tag-integrated network.

3.1 Semantic Information Extraction by TF-IDF

Most of the previous methods constructed semantic information, i.e., ground-truth attributes, in a binary form [6, 8, 23, 25]. They first determined a vocabulary using the K most common words in the training captions; the vocabulary included nouns, verbs, and adjectives. If a word in the vocabulary appeared in the reference captions, the corresponding element of the attribute vector was set to 1. In contrast to these methods, we weight semantic information according to its significance. Informative and distinctive words are weighted more, and the weight scores are estimated from the reference captions by the TF-IDF scheme, which is widely used in text mining tasks.

Fig. 2. Examples of images and their reference captions taken from the MS COCO dataset [2, 15]

Figure 2 shows samples from the COCO dataset. In Fig. 2(a), the word "surfboard" appears in 3 out of 5 captions and is a keyword that characterizes the image. Intuitively, such words should receive high scores. To implement this concept, we use an average term frequency \({TF}_{av}(w,d)\), the number of times word w occurs in document d divided by the number of captions for the image. Another common word, "man", also appears in many other images and is therefore less useful for distinguishing one image from another. To reflect this, we apply inverse document frequency weighting, \(IDF(w)=\log \{({N}_{d}+1)/(DF(w)+1)\}+1\), where \({N}_{d}\) is the total number of documents and DF(w) is the number of documents that contain word w. The "1" is added to the numerator and denominator to prevent division by zero [19]. A semantic information vector is then derived by multiplying the two terms, \(\textit{TF-IDF}(w,d)={TF}_{av}(w,d) \times IDF(w)\). We apply L2 normalization to the TF-IDF vector of each image to improve training; the normalized vector is the ground-truth distinctive-attribute vector \({D}_{g}\). We apply stemming with the Porter Stemmer [20] before extracting TF-IDF, as sketched below.
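
The following is a minimal sketch of how the ground-truth vectors \({D}_{g}\) could be computed from reference captions, assuming NLTK's Porter stemmer and simple whitespace tokenization; the function and variable names are ours, and the vocabulary is taken as given (its construction is described next).

```python
import math
from collections import Counter

from nltk.stem import PorterStemmer  # Porter stemming, applied before TF-IDF


def ground_truth_attributes(captions_per_image, vocab):
    """captions_per_image: one list of caption strings per image.
    vocab: list of stemmed vocabulary words.
    Returns one L2-normalized TF-IDF vector D_g per image."""
    stem = PorterStemmer().stem
    vocab_set = set(vocab)

    # Each image's set of captions is treated as one "document".
    docs = [[[stem(w) for w in cap.lower().split()] for cap in caps]
            for caps in captions_per_image]

    # DF(w): number of documents (images) whose captions contain w.
    n_docs = len(docs)
    df = Counter()
    for caps in docs:
        df.update({w for cap in caps for w in cap} & vocab_set)

    # IDF(w) = log((N_d + 1) / (DF(w) + 1)) + 1, smoothed as in scikit-learn.
    idf = {w: math.log((n_docs + 1) / (df[w] + 1)) + 1 for w in vocab}

    d_g = []
    for caps in docs:
        counts = Counter(w for cap in caps for w in cap)
        # TF_av(w, d): occurrences of w averaged over the number of captions.
        tf_idf = [counts[w] / len(caps) * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in tf_idf)) or 1.0
        d_g.append([v / norm for v in tf_idf])  # L2 normalization per image
    return d_g
```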

The next step is to construct the vocabulary from the words in the reference captions. The vocabulary should contain enough characteristic words to represent each image; at the same time, the semantic information should remain learnable so that it can be predicted accurately. We determine the words to be included in the vocabulary based on their IDF scores, which indicate the uniqueness of each word. The vocabulary contains the words whose IDF is lower than an IDF threshold (\({th}_{IDF}\)), regardless of the part of speech; a thresholding sketch is given below. We observe the performance of the attribute prediction model and of the overall captioning model while changing this threshold in Sect. 4.3.
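
A sketch of the vocabulary construction under the same assumptions as the previous snippet; `docs` holds the stemmed captions per image, and the helper name is ours.

```python
import math
from collections import Counter


def build_vocabulary(docs, th_idf):
    """Keep the stemmed words whose IDF stays below th_idf (Eq. (2));
    raising the threshold admits rarer, more distinctive words."""
    n_docs = len(docs)
    df = Counter()
    for caps in docs:
        df.update({w for cap in caps for w in cap})
    idf = {w: math.log((n_docs + 1) / (c + 1)) + 1 for w, c in df.items()}
    return sorted(w for w, v in idf.items() if v < th_idf)


# Sect. 4.1 uses th_IDF = 7; Sect. 4.3 varies it from 5 to 11, e.g.:
# vocab_7 = build_vocabulary(docs, th_idf=7)
```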

3.2 Distinctive-Attribute Prediction Model

In the distinctive-attribute prediction model, convolutional layers are followed by four fully connected layers (FCs). We use the ResNet-152 [10] architecture for the CNN layers, and the output of its 2048-way pool5 layer is fed into the stack of fully connected layers. The training data for each image consist of the input image I and the ground-truth distinctive-attribute vector \(\mathbf{D }_{g,i} = [{D}_{g,i1}, {D}_{g,i2}, \dots , {D}_{g,i{N}_{w}}]\), where \({N}_{w}\) is the number of words in the vocabulary and i is the index of the image. Our goal is to predict attribute scores as similar as possible to \({D}_{g}\). The cost function to be minimized is the mean squared error:

$$\begin{aligned} C = \frac{1}{M} \frac{1}{{N}_{w}} \sum _{i} \sum _{w} [{D}_{g,iw} - {D}_{p,iw}] ^{2} \end{aligned}$$
(1)

where \(\mathbf{D }_{p,i} = [{D}_{p,i1}, {D}_{p,i2}, \dots , {D}_{p,i{N}_{w}}]\) is the predicted attribute score vector for the ith image and M denotes the number of training images. The first three FCs have 2048 channels each; the fourth has \({N}_{w}\) channels. We use ReLU [17] as the nonlinear activation function for all FCs and adopt batch normalization [11] right after each FC, before the activation. Training is regularized by dropout with ratio 0.3 on the first three FCs. Each FC is initialized with Xavier initialization [9]. Note that, unlike the attribute predictors described in previous work [8, 23], our network does not contain a softmax final layer. Instead, we use the output of the activation function of the fourth FC layer as the final predicted score \(\mathbf{D }_{p,i}\), as sketched below.
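
A sketch of the predictor in Keras (the framework named in Sect. 4.1); the layer sizes, batch normalization, dropout, Xavier initialization, and MSE loss follow the description above, while the function name and the choice of tf.keras are ours.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_attribute_predictor(n_words, feat_dim=2048):
    """Four FCs on top of ResNet-152 pool5 features; the output of the
    fourth FC's ReLU is the predicted distinctive-attribute vector D_p."""
    x_in = keras.Input(shape=(feat_dim,))      # 2048-way pool5 feature
    x = x_in
    for _ in range(3):                         # three 2048-channel FCs
        x = layers.Dense(2048, kernel_initializer="glorot_uniform")(x)  # Xavier init
        x = layers.BatchNormalization()(x)     # BN after each FC, before activation
        x = layers.Activation("relu")(x)
        x = layers.Dropout(0.3)(x)             # dropout on the first three FCs only
    x = layers.Dense(n_words, kernel_initializer="glorot_uniform")(x)
    d_p = layers.Activation("relu")(x)         # no softmax; the ReLU output is D_p

    model = keras.Model(x_in, d_p)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=3e-3),
                  loss="mse")                  # Eq. (1): mean squared error
    return model
```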

4 Results

4.1 Experiment Settings

Our results are evaluated on the popular MS COCO dataset [2, 15], which contains 82,783 training images, and 40,504 and 40,775 images for validation and testing, respectively. The model described in Sect. 3.2 is implemented in Keras [3], and we use the scikit-learn toolkit [19] to implement the TF-IDF scheme. We set the IDF threshold to 7 in this experiment. The mini-batch size is fixed at 128, Adam optimization [13] is used with a learning rate of \(3 \times {10}^{-3}\), and training is stopped after 100 epochs. For the prediction model, we train 5 identical models with different initializations and then ensemble them by averaging their outputs, as sketched below. The SCN-LSTM training procedure follows [8], and we use the public implementation [7] released by Gan, the author of the published paper [8].
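
A sketch of the training and ensembling procedure with the settings above, reusing the hypothetical `build_attribute_predictor` from the Sect. 3.2 sketch; `x_train` and `d_g_train` are assumed to hold the pool5 features and the \({D}_{g}\) targets as NumPy arrays.

```python
import numpy as np


def train_ensemble(x_train, d_g_train, n_models=5):
    """Train five identically configured predictors; each model instance
    gets its own random initialization."""
    models = []
    for _ in range(n_models):
        model = build_attribute_predictor(n_words=d_g_train.shape[1])
        model.fit(x_train, d_g_train, batch_size=128, epochs=100, verbose=2)
        models.append(model)
    return models


def predict_attributes(models, x):
    """Ensemble by averaging the models' predicted attribute scores."""
    return np.mean([m.predict(x) for m in models], axis=0)
```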

Table 1. COCO evaluation server results using 5 and 40 reference captions. DaE improves the performance by significant margins across all metrics
Table 2. Results of published image captioning models tested on the COCO evaluation server
Table 3. Several images with their extracted attributes and generated captions. The captions produced by DaE+SCN-LSTM are more detailed, with more distinctive and accurate attributes

4.2 Evaluation

First, we compare our method with SCN [7]. We evaluate both results on the online COCO testing server [2] and list them in Table 1. For SCN, we use the pre-trained weights provided by the author. The vocabulary size of the proposed scheme is 938, smaller than SCN's 999 [7]. The results of both methods are obtained by ensembling 5 models. The widely used metrics BLEU-1,2,3,4 [18], METEOR [1], ROUGE-L [14], and CIDEr [21] are selected to evaluate overall captioning performance. DaE improves the performance of SCN-LSTM by significant margins across all metrics. Specifically, DaE improves CIDEr from 0.967 to 0.981 with 5 references and from 0.971 to 0.990 with 40 references; the increase is greater with 40 references, which contain more varied expressions. The results for other published models tested on the COCO evaluation server are summarized in Table 2. With 40 references, our method surpasses the state-of-the-art Adaptive Attention + CL [4] in terms of all four BLEU scores. The qualitative evaluation is shown in Table 3, where we list the top eight attributes. For DaE, words stemmed with the Porter Stemmer [20] are displayed as they are. The scores in parentheses next to the tags and the distinctive attributes have different meanings: the former are probabilities, the latter are distinctiveness values of the words. The attributes extracted by DaE include words that are important for representing the situation in an image; as a result, the captions generated from them are more detailed than those of SCN. The result of the proposed method in (a), "A woman cutting a piece of fruit with a knife", explains exactly what the main subject is doing. While SCN assigns a high probability to the general word "food", DaE extracts more distinctive words such as "fruit" and "apple". Among the verbs, "cut", the most specific action a viewer would be interested in, receives a high distinctiveness score. In the case of (b), "wine" and "drink" are the words with the first and third highest distinctiveness scores from DaE; accordingly, the characteristic phrase "drinking wine" is added. More examples are given in Appendix A.

4.3 Vocabulary Construction

To analyze DaE in more detail, we conduct experiments with differently constructed vocabularies. We set seven different IDF threshold values, \({th}_{IDF}\), from 5 to 11.

$$\begin{aligned} {Vocab}_{i} = \{ w \;|\; IDF(w) < i \}, \quad i = {th}_{IDF}. \end{aligned}$$
(2)

The vocabulary contains only the words whose IDF is smaller than \({th}_{IDF}\). The number of vocabulary words is shown in the second row of Table 4(a) and (b). Semantic information for the images is extracted according to each vocabulary and used to train the proposed prediction model. The widely used splits [12] of the COCO dataset are applied for the evaluation. We evaluate the prediction by treating it as a multi-label, multi-class classification problem: the distinctiveness scores between 0 and 1 are divided into four classes, (0.0, 0.25], (0.25, 0.5], (0.5, 0.75], and (0.75, 1.0], and the macro-averaged F1 score is computed, as sketched below. The performance of the prediction model is shown in the third row. Each extracted distinctive-attribute vector is fed into SCN-LSTM to generate a caption, and the resulting CIDEr is shown in the fourth row. The CIDEr scores increase from \({Vocab}_{5}\) to \({Vocab}_{7}\) and then decrease monotonically; the maximum value of 0.996 is obtained with \({Vocab}_{7}\). The vocabulary size and the prediction performance are thus in a trade-off in this experiment: with a high \({th}_{IDF}\), captions can be generated from a more diverse vocabulary, but the captioning performance is not maximized because the distinctive-attribute prediction is relatively inaccurate. \({Vocab}_{6}\) and \({Vocab}_{9}\) have almost the same CIDEr. In this case, if the vocabulary contains more words, the captions for some images can be represented more diversely and accurately. Table 5 shows examples corresponding to this case. In (a), \({Vocab}_{6}\) does not include the stemmed word "carriag", whereas \({Vocab}_{9}\) contains it and DaE extracts it as the word with the seventh highest value. This word leads to the phrase "pulling a carriage" being included in the caption, which describes the situation well. "Tarmac" in (b) and "microwav" in (c) play a similar role.
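
A sketch of how this binned F1 evaluation could be computed with scikit-learn; the handling of exact-zero scores (which fall into the first class here) is our assumption, as the paper does not specify it.

```python
import numpy as np
from sklearn.metrics import f1_score


def binned_macro_f1(d_g, d_p):
    """Treat attribute prediction as multi-label, multi-class classification:
    scores are binned into (0, .25], (.25, .5], (.5, .75], (.75, 1] and the
    macro-averaged F1 is computed over all image-word entries."""
    bins = [0.25, 0.5, 0.75]
    y_true = np.digitize(np.asarray(d_g).ravel(), bins, right=True)
    y_pred = np.digitize(np.asarray(d_p).ravel(), bins, right=True)
    return f1_score(y_true, y_pred, average="macro")
```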

Table 4. Results of experiments with differently constructed vocabularies

Table 4(b) presents the experimental results without stemming. The maximum value is 0.911, lower than the maximum achieved with stemming. When stemming is applied, the distinctiveness and significance of a word can be expressed better because different tenses and forms are mapped to the same stem. In addition, the vocabulary size required to achieve the same performance is smaller when stemming is applied.

Table 5. Several cases in which more diverse and accurate captions are generated using \({Vocab}_{9}\) than using \({Vocab}_{6}\), although their CIDErs are similar

5 Conclusion

In this study, we propose a Distinctive-attribute Extraction (DaE) method for image captioning. In particular, the TF-IDF scheme is used to extract meaningful information from reference captions. An attribute prediction model is then trained on the extracted information and used to infer the semantic attributes for generating a description. DaE improves the performance of the SCN-LSTM scheme by significant margins across all metrics; moreover, more detailed and distinctive captions are generated. The proposed method can be plugged into various models to improve their performance.