1 Introduction

Image captioning is a potent and useful tool for automatically describing or explaining the overall situation of an image [5, 22, 24]. However, generating qualitatively detailed and distinctive captions is still an open issue. Although captions with unique expressions are in most cases more useful than those containing only safe, generic ones, current evaluation metrics do not adequately reflect this aspect. After the numerical performance of earlier work reached a certain level, several studies began to investigate how to generate detailed and accurate captions [4].

In this paper, we propose a Distinctive-attribute Extraction (DaE) method that extracts attributes which explicitly encourage an RNN to generate a caption describing the significant meaning of an image. The main contributions of this paper are as follows: (i) We propose a semantics extraction method based on TF-IDF analysis of captions. (ii) We propose a scheme to infer distinctive attributes with a model trained on this semantic information. (iii) We perform quantitative and qualitative evaluations, demonstrating that the proposed method improves the performance of a base caption generation model by a substantial margin while describing images more distinctively.

2 Related Work

Combinations of CNNs and RNNs are widely used in image captioning networks [5, 6, 8, 22, 23, 24]. The CNN serves as an image encoder, and the output of its last hidden layer is fed into an RNN decoder that generates sentences. Recent approaches can be grouped into two paradigms: top-down approaches include attention-based mechanisms, whereas many bottom-up methods use semantic concepts. For the latter, Fang et al. [6] used multiple instance learning (MIL) to train word detectors with words that commonly occur in captions. The word detector outputs guided a language model to generate descriptions that include the detected words. Wu et al. [23] predicted attributes by treating the problem as multi-label classification; a CNN framework was used, and outputs from different proposal sub-regions were aggregated. Gan et al. [8] proposed the Semantic Concept Network (SCN), which integrates semantic concepts into the LSTM network. SCN factorizes each weight matrix of the attribute-integrated LSTM model to reduce the number of parameters.

Fig. 1. An overview of the proposed framework, including the semantic information extraction procedure and the distinctive-attribute prediction model

More recently, Dai et al. [4] proposed the Contrastive Learning (CL) method, which encourages the distinctiveness of captions. In addition to true image-caption pairs, this method learns from mismatched pairs whose captions describe other images.

3 Distinctive-Attribute Extraction

In this section, we describe a semantic information processing and extraction method that affects the quality of generated captions. We propose a method to generate captions that represent the unique situation of an image. Unlike CL [4], which improves a target method by adding mismatched pairs to the training set, our method belongs to the bottom-up approaches that use semantic attributes. We assign larger weights to the attributes that are more informative and distinctive for describing the image. As illustrated in Fig. 1, there are two main steps: semantic information extraction and distinctive-attribute prediction. First, we extract meaningful information from reference captions. Next, we train the distinctive-attribute prediction model with image-information (\({D}_{g}\)) pairs. After obtaining distinctive attributes (\({D}_{p}\)) from images, we apply these attributes to a caption generation network to verify their effect. For this network, we use SCN-LSTM [8], a tag-integrated network.

3.1 Semantic Information Extraction by TF-IDF

Most of the previous methods constructed semantic information, i.e., ground-truth attributes, in a binary form [6, 8, 23, 25]. They first determined a vocabulary using the K most common words in the training captions; the vocabulary included nouns, verbs, and adjectives. If a word in the vocabulary appeared in the reference captions, the corresponding element of the attribute vector was set to 1. In contrast to these methods, we weight semantic information according to its significance. Informative and distinctive words are weighted more, and the weight scores are estimated from the reference captions by the TF-IDF scheme, which is widely used in text mining tasks.

Fig. 2. Examples of images and their reference captions taken from the MS COCO dataset [2, 15]

Figure 2 shows samples from the COCO dataset. In Fig. 2(a), the word "surfboard" appears in 3 out of 5 captions and is a keyword that characterizes the image. Intuitively, such words should receive high scores. To implement this concept, we use an average term frequency \({TF}_{av}(w,d)\), the number of times word w occurs in document d divided by the number of captions for the image. Another common word, "man", also appears in many other images and is therefore less useful for distinguishing one image from another. To reflect this, we apply inverse document frequency weighting, \(IDF(w)=\log \{({N}_{d}+1)/(DF(w)+1)\}+1\), where \({N}_{d}\) is the total number of documents and DF(w) is the number of documents that contain word w. The "1" is added to the numerator and denominator to prevent division by zero [19]. A semantic information vector is then derived by multiplying the two terms, \(\textit{TF-IDF}(w,d)={TF}_{av}(w,d) \times IDF(w)\). We apply L2 normalization to the TF-IDF vector of each image to improve training; the normalized vector is the ground-truth distinctive-attribute vector \({D}_{g}\). We apply stemming with the Porter Stemmer [20] before extracting TF-IDF, as sketched below.
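
The following is a minimal sketch of how the ground-truth vectors \({D}_{g}\) could be computed from reference captions, assuming NLTK's Porter stemmer and simple whitespace tokenization; the function and variable names are ours, and the vocabulary is taken as given (its construction is described next).

```python
import math
from collections import Counter

from nltk.stem import PorterStemmer  # Porter stemming, applied before TF-IDF


def ground_truth_attributes(captions_per_image, vocab):
    """captions_per_image: one list of caption strings per image.
    vocab: list of stemmed vocabulary words.
    Returns one L2-normalized TF-IDF vector D_g per image."""
    stem = PorterStemmer().stem
    vocab_set = set(vocab)

    # Each image's set of captions is treated as one "document".
    docs = [[[stem(w) for w in cap.lower().split()] for cap in caps]
            for caps in captions_per_image]

    # DF(w): number of documents (images) whose captions contain w.
    n_docs = len(docs)
    df = Counter()
    for caps in docs:
        df.update({w for cap in caps for w in cap} & vocab_set)

    # IDF(w) = log((N_d + 1) / (DF(w) + 1)) + 1, smoothed as in scikit-learn.
    idf = {w: math.log((n_docs + 1) / (df[w] + 1)) + 1 for w in vocab}

    d_g = []
    for caps in docs:
        counts = Counter(w for cap in caps for w in cap)
        # TF_av(w, d): occurrences of w averaged over the number of captions.
        tf_idf = [counts[w] / len(caps) * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in tf_idf)) or 1.0
        d_g.append([v / norm for v in tf_idf])  # L2 normalization per image
    return d_g
```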

The next step is to construct the vocabulary from the words in the reference captions. The vocabulary should contain enough characteristic words to represent each image; at the same time, the semantic information should remain learnable so that it can be predicted accurately. We determine the words to be included in the vocabulary based on their IDF scores, which indicate the uniqueness of each word. The vocabulary contains the words whose IDF is lower than an IDF threshold (\({th}_{IDF}\)), regardless of the part of speech; a thresholding sketch is given below. We observe the performance of the attribute prediction model and of the overall captioning model while changing this threshold in Sect. 4.3.
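
A sketch of the vocabulary construction under the same assumptions as the previous snippet; `docs` holds the stemmed captions per image, and the helper name is ours.

```python
import math
from collections import Counter


def build_vocabulary(docs, th_idf):
    """Keep the stemmed words whose IDF stays below th_idf (Eq. (2));
    raising the threshold admits rarer, more distinctive words."""
    n_docs = len(docs)
    df = Counter()
    for caps in docs:
        df.update({w for cap in caps for w in cap})
    idf = {w: math.log((n_docs + 1) / (c + 1)) + 1 for w, c in df.items()}
    return sorted(w for w, v in idf.items() if v < th_idf)


# Sect. 4.1 uses th_IDF = 7; Sect. 4.3 varies it from 5 to 11, e.g.:
# vocab_7 = build_vocabulary(docs, th_idf=7)
```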

3.2 Distinctive-Attribute Prediction Model

In the distinctive-attribute prediction model, convolutional layers are followed by four fully connected layers (FCs). We use the ResNet-152 [10] architecture for the CNN layers, and the output of its 2048-way pool5 layer is fed into the stack of fully connected layers. The training data for each image consist of the input image I and the ground-truth distinctive-attribute vector \(\mathbf{D }_{g,i} = [{D}_{g,i1}, {D}_{g,i2}, \dots , {D}_{g,i{N}_{w}}]\), where \({N}_{w}\) is the number of words in the vocabulary and i is the index of the image. Our goal is to predict attribute scores as similar as possible to \({D}_{g}\). The cost function to be minimized is the mean squared error:

$$\begin{aligned} C = \frac{1}{M} \frac{1}{{N}_{w}} \sum _{i} \sum _{w} [{D}_{g,iw} - {D}_{p,iw}] ^{2} \end{aligned}$$
(1)

where \(\mathbf{D }_{p,i} = [{D}_{p,i1}, {D}_{p,i2}, \dots , {D}_{p,i{N}_{w}}]\) is the predicted attribute score vector for the ith image and M denotes the number of training images. The first three FCs have 2048 channels each; the fourth has \({N}_{w}\) channels. We use ReLU [17] as the nonlinear activation function for all FCs and adopt batch normalization [11] right after each FC, before the activation. Training is regularized by dropout with ratio 0.3 on the first three FCs. Each FC is initialized with Xavier initialization [9]. Note that, unlike the attribute predictors described in previous work [8, 23], our network does not contain a softmax final layer. Instead, we use the output of the activation function of the fourth FC layer as the final predicted score \(\mathbf{D }_{p,i}\), as sketched below.
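
A sketch of the predictor in Keras (the framework named in Sect. 4.1); the layer sizes, batch normalization, dropout, Xavier initialization, and MSE loss follow the description above, while the function name and the choice of tf.keras are ours.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_attribute_predictor(n_words, feat_dim=2048):
    """Four FCs on top of ResNet-152 pool5 features; the output of the
    fourth FC's ReLU is the predicted distinctive-attribute vector D_p."""
    x_in = keras.Input(shape=(feat_dim,))      # 2048-way pool5 feature
    x = x_in
    for _ in range(3):                         # three 2048-channel FCs
        x = layers.Dense(2048, kernel_initializer="glorot_uniform")(x)  # Xavier init
        x = layers.BatchNormalization()(x)     # BN after each FC, before activation
        x = layers.Activation("relu")(x)
        x = layers.Dropout(0.3)(x)             # dropout on the first three FCs only
    x = layers.Dense(n_words, kernel_initializer="glorot_uniform")(x)
    d_p = layers.Activation("relu")(x)         # no softmax; the ReLU output is D_p

    model = keras.Model(x_in, d_p)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=3e-3),
                  loss="mse")                  # Eq. (1): mean squared error
    return model
```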

4 Results

4.1 Experiment Settings

Our results are evaluated on the popular MS COCO dataset [2, 15], which contains 82,783 training images, and 40,504 and 40,775 images for validation and testing, respectively. The model described in Sect. 3.2 is implemented in Keras [3], and we use the scikit-learn toolkit [19] to implement the TF-IDF scheme. We set the IDF threshold to 7 in this experiment. The mini-batch size is fixed at 128, Adam optimization [13] is used with a learning rate of \(3 \times {10}^{-3}\), and training is stopped after 100 epochs. For the prediction model, we train 5 identical models with different initializations and then ensemble them by averaging their outputs, as sketched below. The SCN-LSTM training procedure follows [8], and we use the public implementation [7] released by Gan, the author of the published paper [8].
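
A sketch of the training and ensembling procedure with the settings above, reusing the hypothetical `build_attribute_predictor` from the Sect. 3.2 sketch; `x_train` and `d_g_train` are assumed to hold the pool5 features and the \({D}_{g}\) targets as NumPy arrays.

```python
import numpy as np


def train_ensemble(x_train, d_g_train, n_models=5):
    """Train five identically configured predictors; each model instance
    gets its own random initialization."""
    models = []
    for _ in range(n_models):
        model = build_attribute_predictor(n_words=d_g_train.shape[1])
        model.fit(x_train, d_g_train, batch_size=128, epochs=100, verbose=2)
        models.append(model)
    return models


def predict_attributes(models, x):
    """Ensemble by averaging the models' predicted attribute scores."""
    return np.mean([m.predict(x) for m in models], axis=0)
```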

Table 1. COCO evaluation server results using 5 and 40 reference captions. DaE improves the performance by significant margins across all metrics
Table 2. Results of published image captioning models tested on the COCO evaluation server
Table 3. Several images with their extracted attributes and generated captions. The captions produced by DaE+SCN-LSTM are more detailed, with more distinctive and accurate attributes

4.2 Evaluation

First, we compare our method with SCN [7]. We evaluate both results on the online COCO testing server [2] and list them in Table 1. For SCN, we use the pre-trained weights provided by the author. The vocabulary size of the proposed scheme is 938, smaller than SCN's 999 [7]. The results of both methods are obtained by ensembling 5 models. The widely used metrics BLEU-1,2,3,4 [18], METEOR [1], ROUGE-L [14], and CIDEr [21] are selected to evaluate overall captioning performance. DaE improves the performance of SCN-LSTM by significant margins across all metrics. Specifically, DaE improves CIDEr from 0.967 to 0.981 with 5 references and from 0.971 to 0.990 with 40 references; the increase is greater with 40 references, which contain more varied expressions. The results for other published models tested on the COCO evaluation server are summarized in Table 2. With 40 references, our method surpasses the state-of-the-art Adaptive Attention + CL [4] in terms of all four BLEU scores. The qualitative evaluation is shown in Table 3, where we list the top eight attributes. For DaE, words stemmed with the Porter Stemmer [20] are displayed as they are. The scores in parentheses next to the tags and the distinctive attributes have different meanings: the former are probabilities, the latter are distinctiveness values of the words. The attributes extracted by DaE include words that are important for representing the situation in an image; as a result, the captions generated from them are more detailed than those of SCN. The result of the proposed method in (a), "A woman cutting a piece of fruit with a knife", explains exactly what the main subject is doing. While SCN assigns a high probability to the general word "food", DaE extracts more distinctive words such as "fruit" and "apple". Among the verbs, "cut", the most specific action a viewer would be interested in, receives a high distinctiveness score. In the case of (b), "wine" and "drink" are the words with the first and third highest distinctiveness scores from DaE; accordingly, the characteristic phrase "drinking wine" is added. More examples are given in Appendix A.

4.3 Vocabulary Construction

To analyze DaE in more detail, we conduct experiments with differently constructed vocabularies. We set seven different IDF threshold values, \({th}_{IDF}\), from 5 to 11.

$$\begin{aligned} {Vocab}_{i} = \{ w \;|\; IDF(w) < i \}, \quad i = {th}_{IDF}. \end{aligned}$$
(2)

The vocabulary contains only the words whose IDF is smaller than \({th}_{IDF}\). The number of vocabulary words is shown in the second row of Table 4(a) and (b). Semantic information for the images is extracted according to each vocabulary and used to train the proposed prediction model. The widely used splits [12] of the COCO dataset are applied for the evaluation. We evaluate the prediction by treating it as a multi-label, multi-class classification problem: the distinctiveness scores between 0 and 1 are divided into four classes, (0.0, 0.25], (0.25, 0.5], (0.5, 0.75], and (0.75, 1.0], and the macro-averaged F1 score is computed, as sketched below. The performance of the prediction model is shown in the third row. Each extracted distinctive-attribute vector is fed into SCN-LSTM to generate a caption, and the resulting CIDEr is shown in the fourth row. The CIDEr scores increase from \({Vocab}_{5}\) to \({Vocab}_{7}\) and then decrease monotonically; the maximum value of 0.996 is obtained with \({Vocab}_{7}\). The vocabulary size and the prediction performance are thus in a trade-off in this experiment: with a high \({th}_{IDF}\), captions can be generated from a more diverse vocabulary, but the captioning performance is not maximized because the distinctive-attribute prediction is relatively inaccurate. \({Vocab}_{6}\) and \({Vocab}_{9}\) have almost the same CIDEr. In this case, if the vocabulary contains more words, the captions for some images can be represented more diversely and accurately. Table 5 shows examples corresponding to this case. In (a), \({Vocab}_{6}\) does not include the stemmed word "carriag", whereas \({Vocab}_{9}\) contains it and DaE extracts it as the word with the seventh highest value. This word leads to the phrase "pulling a carriage" being included in the caption, which describes the situation well. "Tarmac" in (b) and "microwav" in (c) play a similar role.
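
A sketch of how this binned F1 evaluation could be computed with scikit-learn; the handling of exact-zero scores (which fall into the first class here) is our assumption, as the paper does not specify it.

```python
import numpy as np
from sklearn.metrics import f1_score


def binned_macro_f1(d_g, d_p):
    """Treat attribute prediction as multi-label, multi-class classification:
    scores are binned into (0, .25], (.25, .5], (.5, .75], (.75, 1] and the
    macro-averaged F1 is computed over all image-word entries."""
    bins = [0.25, 0.5, 0.75]
    y_true = np.digitize(np.asarray(d_g).ravel(), bins, right=True)
    y_pred = np.digitize(np.asarray(d_p).ravel(), bins, right=True)
    return f1_score(y_true, y_pred, average="macro")
```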

Table 4. Results of experiments with differently constructed vocabularies

Table 4(b) presents the experimental results without stemming. The maximum value is 0.911, lower than the maximum achieved with stemming. When stemming is applied, the distinctiveness and significance of a word can be expressed better because different tenses and forms are mapped to the same stem. In addition, the vocabulary size required to achieve the same performance is smaller when stemming is applied.

Table 5. Several cases in which more diverse and accurate captions are generated using \({Vocab}_{9}\) than using \({Vocab}_{6}\), although their CIDErs are similar

5 Conclusion

In this study, we propose a Distinctive-attribute Extraction (DaE) method for image captioning. In particular, the TF-IDF scheme is used to extract meaningful information from reference captions. An attribute prediction model is then trained on the extracted information and used to infer the semantic attributes for generating a description. DaE improves the performance of the SCN-LSTM scheme by significant margins across all metrics; moreover, more detailed and distinctive captions are generated. The proposed method can be plugged into various models to improve their performance.