
1 Introduction

In the past few years, deep neural networks have made significant progress in the image processing area, for tasks such as image classification [1,2,3] and object detection [4,5,6]. However, tasks like image classification and object detection are far from the end of image understanding. One ultimate goal of image understanding is to understand the whole image scene rather than individual objects. Image captioning follows this path by extracting the details of individual objects and their relationships from an image, so that the system can automatically generate a sentence describing the image. This problem is extremely important, as well as difficult, because it connects two major artificial intelligence fields: computer vision and natural language processing.

Previous studies are mostly inspired by work on machine translation, where the task is to translate a source sentence written in one language (like French) into a target sentence written in a different language (like English), keeping logic and syntax precise. Associating image captioning with machine translation makes sense because they can be placed in the same framework, called the Encoder-Decoder framework. In the encoding step, we encode the source information, which is an image in the image captioning task and a source sentence in the translation task, into a target vector. In the decoding step, sentences are generated by decoding the target vector. The core of this framework is how to encode the source information (image or sentence) and how to decode the target vectors.

Previous works largely agree in the decoding step: they usually use a recurrent neural network (RNN) based on long short-term memory (LSTM) [7] units as the decoder. As far as the encoding step is concerned, the work is divided into two major classes: CNN-RNN models [8,9,10] and attention based models [11, 12]. CNN-RNN models represent an image as a single feature vector from the top layer of a pre-trained convolutional network, whereas attention based models use a set of vectors representing the image's subregions as the source. The biggest drawback of CNN-RNN models is that they can hardly align different visual parts of the input image to words in captions. Attention based models like [11] allow the model to attend to any visual part of the input image, but most of the subregions it attends to are meaningless.

Our model also follows the Encoder-Decoder framework; however, it differs substantially from previous models, as shown in Fig. 1. Our motivation in describing an image is to find out its contents, rather than focusing on meaningless regions associated with it. The locations of different objects in an image are also useful information for describing it, since they reflect the spatial relationships of the objects. For these reasons, we combine object detection with image captioning to focus on the truly meaningful information in the image and generate better sentences more efficiently. The encoding part of our model consists of two steps. First, we use an object detection model to detect objects in the image; then a deep convolutional neural network extracts their spatial relationships. All this information is represented as a set of feature vectors, which is then fed into the decoding part, where the description sentence is generated.

Fig. 1.

An overview of the proposed framework. The encoding part first extracts the information of objects (left) and their spatial relationships (right) in the image, then the decoding part generates words based on these features.

In order to measure the performance, we evaluate our model on the COCO dataset using seven standard metrics: BLEU-1, 2, 3, 4, METEOR, CIDEr and ROUGE-L. Experimental results show that the proposed model performs better than the baseline soft attention model [11] and is comparable to the benchmark ATT model [12].

2 Our Model

This section describes the details of the proposed model, which consists of two main parts: encoding and decoding. The input to our model is a single image I, while the output is a descriptive sentence S consisting of K encoded words: \( S = \{ w_1, w_2, \ldots, w_K \} \).

In the encoding part, we first use a model that recognizes objects in the input image, followed by a deep CNN that extracts their locations, which reflect the associated spatial relationships. All this information is represented as a set of feature vectors referred to as annotation vectors. The encoding part produces L annotation vectors, each of which is a D-dimensional representation of an object and its spatial location in the input image: \( \mathbf{A} = \{ \mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_L \}, \; \mathbf{A}_i \in \mathbb{R}^{D} \).

In the decoding part, all these annotation vectors are fed into a deep Recurrent Neural Network model to generate a description sentence.
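To make this interface concrete, the following is a minimal Python sketch of the encode/decode contract described above. The functions are hypothetical stubs (not the actual implementation), and the shapes L = 6 and D = 8192 are taken from the experimental settings in Sect. 3.2.

```python
import numpy as np

# Placeholder shapes, following Sect. 3.2: L = 6 annotation vectors, D = 8192.
L, D = 6, 8192

def encode(image):
    """Hypothetical encoder stub: detect objects, extract their spatial
    locations, and return the L x D matrix of annotation vectors A_1..A_L."""
    return np.zeros((L, D), dtype=np.float32)

def decode(A):
    """Hypothetical decoder stub: an attention-based LSTM that emits words
    until the end-of-sentence token '<end>' is produced."""
    return ["a", "placeholder", "caption", "<end>"]

caption = decode(encode(np.zeros((224, 224, 3), dtype=np.float32)))
```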

2.1 Encoding Part

Our core insight is that when human beings try to describe an image with a sentence (a combination of words), it is natural to first find the objects and their relationships in the image. To imitate this, our encoding part has two steps: first we use an object detection model to detect objects in the image, followed by a deep convolutional neural network to get their spatial locations.

Object Detection.

In the past few years, significant progress has been made in object detection. These advances are driven by the success of region proposal methods (e.g. [13]) and region-based convolutional neural networks (R-CNN) [5]. In our model, we choose Faster R-CNN [6] as the object detection model due to its efficiency and effectiveness. Faster R-CNN is composed of two modules. The first module is a deep fully convolutional network that proposes regions; the second module is the Fast R-CNN detector [5], which uses the proposed regions. To generate region proposals, the authors of [6] slide a small network over the convolutional feature map output by the last shared convolutional layer. For each sliding window of the input convolutional feature map, this small network maps it to a lower-dimensional feature. This feature is then fed into two sibling fully-connected layers: a box-regression layer (reg) and a box-classification layer (cls). After training, Faster R-CNN takes an image (of any size) as input and produces a set of rectangular object proposals, each with an objectness score. We sort these boxes by score in descending order and choose the top-n boxes as the object regions of the input image. Each rectangular object region is mapped to a feature vector by the fully connected layers of the Faster R-CNN model (in our implementation, the 'fc7' layer). More explicitly, for every input image we detect n objects, and each object is represented as a d-dimensional vector: \( \{ \mathbf{obj}_1, \mathbf{obj}_2, \ldots, \mathbf{obj}_n \}, \; \mathbf{obj}_i \in \mathbb{R}^{d} \). The images in Fig. 2 show some results of the object detection part: each detected object has an objectness score on it, the captions below each image are generated by the proposed model, and the words in red align to the objects detected in each image.
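A minimal NumPy sketch of the proposal selection step, assuming the proposal boxes, objectness scores, and 'fc7' features have already been produced by Faster R-CNN; the array names and placeholder values are illustrative, not from the paper.

```python
import numpy as np

def select_top_objects(boxes, scores, fc7_feats, n=5):
    """Keep the n proposals with the highest objectness scores.

    boxes:     (P, 4) array of rectangular proposals [x1, y1, x2, y2]
    scores:    (P,)   array of objectness scores
    fc7_feats: (P, d) array of 'fc7' features, one row per proposal
    Returns the top-n boxes and their d-dimensional object vectors obj_1..obj_n.
    """
    order = np.argsort(scores)[::-1][:n]   # sort descending, keep the top-n
    return boxes[order], fc7_feats[order]

# Example with random placeholder proposals (P = 300, d = 4096).
boxes = np.random.rand(300, 4)
scores = np.random.rand(300)
fc7 = np.random.rand(300, 4096).astype(np.float32)
top_boxes, obj_vectors = select_top_objects(boxes, scores, fc7, n=5)
```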

Fig. 2.

Some results of the object detection part. The captions are generated by the proposed model.

Object Localization.

This part is designed to extract information about the objects' spatial locations, which in turn reflect their spatial relationships. Junqi et al. [14] also considered the locations of different localized regions. However, they simply appended each box's center x and y coordinates, width, height, and area ratio with respect to the entire image to the end of each localized region's feature vector. In this paper, the way we extract the location information of each object is completely different from [14]. From the object detection part, we know that for each input image the output is n rectangular object regions, each with an objectness score. For each object in the image, we keep the region inside its bounding box unchanged and set the remaining regions to the mean value of the training set. This gives a new image of the same size as the original that contains only the bounding-box region of one object, as shown in Fig. 1. Since we detect n objects per image, we obtain n new images for each input image. Each new image is then fed into the VGG net [2] and the feature vector of its 'fc7' layer is extracted, which yields the vectorized representation of the object's location. In this way we obtain another n t-dimensional vectors, each representing the spatial location of one object: \( \{ \mathbf{loc}_1, \mathbf{loc}_2, \ldots, \mathbf{loc}_n \}, \; \mathbf{loc}_i \in \mathbb{R}^{t} \).
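A minimal NumPy sketch of how such a localization image can be built under the description above; `vgg_fc7` in the final comment is a hypothetical feature extractor, and the mean pixel used in the example stands in for the training-set mean.

```python
import numpy as np

def localization_image(image, box, mean_pixel):
    """Build the 'new image' described above: keep the object's bounding-box
    region and set everything else to the mean value of the training set.

    image:      (H, W, 3) array
    box:        [x1, y1, x2, y2] integer coordinates of one detected object
    mean_pixel: (3,) mean pixel value of the training set
    """
    x1, y1, x2, y2 = box
    new_image = np.empty_like(image)
    new_image[:] = mean_pixel                      # fill with the mean value
    new_image[y1:y2, x1:x2] = image[y1:y2, x1:x2]  # restore the object region
    return new_image

# Example with a random image; in practice loc_i would then be extracted as
# loc_i = vgg_fc7(localization_image(image, box_i, mean_pixel))  # hypothetical
img = np.random.rand(224, 224, 3)
masked = localization_image(img, [50, 60, 150, 180], img.mean(axis=(0, 1)))
```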

Each annotation vector \( \mathbf{A}_i \) consists of two parts: the vector \( \mathbf{obj}_i \), which represents the feature of the object and describes the content of the image, and the vector \( \mathbf{loc}_i \), which represents the feature of the object's location:

$$ \mathbf{A}_i = \left[ \mathbf{obj}_i \, ; \, \mathbf{loc}_i \right], \quad \mathbf{A}_i \in \mathbb{R}^{D}, \quad D = d + t $$
(1)

2.2 Decoding Part

In this paper, we describe a decoding part based on an LSTM network with an attention mechanism. The attention mechanism was first used in the neural machine translation area by [15]. Following the same mechanism, the authors of [11, 16, 17] introduced it into the image processing domain, and [11] was the first to apply it to the image captioning task. The key idea of the attention mechanism is that when a sentence is used to describe an image, not every word in the sentence is "translated" from the whole image; rather, each word relates to only a few subregions of the image. It can be viewed as a form of alignment from the words of the sentence to subregions of the image. The feature vectors of these subregions are referred to as annotation vectors. In our implementation, the subregions are the bounding boxes of the detected objects, and the annotation vectors are the \( \{ \mathbf{A}_i \} \) already discussed in the encoding part.

In the decoding part we follow [11] and use a long short-term memory (LSTM) network [7] as the decoder. The LSTM network produces one word at every step j, conditioned on a context vector \( \mathbf{z}_j \), the previous hidden state \( \mathbf{h}_{j-1} \) and the previously generated word \( \mathbf{w}_{j-1} \), using the following formulations:

$$ \mathbf{In}_j = \sigma\left( W_i E\mathbf{w}_{j-1} + U_i \mathbf{h}_{j-1} + Z_i \mathbf{z}_j + \mathbf{b}_i \right) $$
(2)
$$ \mathbf{f}_j = \sigma\left( W_f E\mathbf{w}_{j-1} + U_f \mathbf{h}_{j-1} + Z_f \mathbf{z}_j + \mathbf{b}_f \right) $$
(3)
$$ \mathbf{c}_j = \mathbf{f}_j \odot \mathbf{c}_{j-1} + \mathbf{In}_j \odot \tanh\left( W_c E\mathbf{w}_{j-1} + U_c \mathbf{h}_{j-1} + Z_c \mathbf{z}_j + \mathbf{b}_c \right) $$
(4)
$$ \mathbf{o}_j = \sigma\left( W_o E\mathbf{w}_{j-1} + U_o \mathbf{h}_{j-1} + Z_o \mathbf{z}_j + \mathbf{b}_o \right) $$
(5)
$$ \mathbf{h}_j = \mathbf{o}_j \odot \tanh\left( \mathbf{c}_j \right) $$
(6)

Here \( \mathbf{In}_j, \mathbf{f}_j, \mathbf{c}_j, \mathbf{o}_j, \mathbf{h}_j \) represent the state of the input gate, forget gate, cell, output gate and hidden layer, respectively. \( W_{\cdot}, U_{\cdot}, Z_{\cdot} \) and \( \mathbf{b}_{\cdot} \) are learned weight matrices and biases, \( E \) is an embedding matrix, \( \odot \) denotes element-wise multiplication, and \( \sigma \) is the logistic sigmoid activation. The context vector \( \mathbf{z}_j \) is generated from the annotation vectors \( \mathbf{A}_i \), \( i = 1, \ldots, n \), corresponding to the feature vectors of the different objects.
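A minimal NumPy sketch of one decoder step as written in Eqs. (2)-(6); the parameter dictionary `p` and the helper `sigmoid` are illustrative conventions rather than the actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(Ew_prev, h_prev, c_prev, z, p):
    """One decoder step, Eqs. (2)-(6). `p` holds the learned parameters
    W_*, U_*, Z_*, b_* for each gate g in {i, f, c, o}."""
    def gate(g):
        # W_g E w_{j-1} + U_g h_{j-1} + Z_g z_j + b_g
        return (p['W_' + g] @ Ew_prev + p['U_' + g] @ h_prev
                + p['Z_' + g] @ z + p['b_' + g])
    i = sigmoid(gate('i'))                    # input gate In_j, Eq. (2)
    f = sigmoid(gate('f'))                    # forget gate f_j, Eq. (3)
    c = f * c_prev + i * np.tanh(gate('c'))   # cell state c_j, Eq. (4)
    o = sigmoid(gate('o'))                    # output gate o_j, Eq. (5)
    h = o * np.tanh(c)                        # hidden state h_j, Eq. (6)
    return h, c
```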

There are two versions in [11] for computing the context vector \( \mathbf{z}_j \), and we use the "soft" version, that is

$$ \mathbf{z}_j = \sum_{i=1}^{n} \alpha_{ji} \, \mathbf{A}_i $$
(7)

where \( \alpha_{ji} \) is a scalar weighting of annotation vector \( \mathbf{A}_i \) at time step j, defined as follows:

$$ e_{ji} = f_{att}\left( \mathbf{A}_i, \mathbf{h}_{j-1} \right) $$
(8)
$$ \alpha_{ji} = \frac{\exp\left( e_{ji} \right)}{\sum_{k=1}^{n} \exp\left( e_{jk} \right)} $$
(9)
$$ \sum_{i=1}^{n} \alpha_{ji} = 1 $$
(10)

where \( f_{att} \) is a multilayer perceptron conditioned on the previous hidden state \( h_{j - 1} \).
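A minimal NumPy sketch of this soft attention step, Eqs. (7)-(10). Here \( f_{att} \) is taken to be a one-hidden-layer MLP whose weight names `W_a`, `W_h` and `v` are placeholders, not from the paper.

```python
import numpy as np

def soft_attention(A, h_prev, W_a, W_h, v):
    """Soft attention over the n x D matrix of annotation vectors A,
    conditioned on the previous hidden state h_prev."""
    e = np.tanh(A @ W_a + h_prev @ W_h) @ v     # e_ji = f_att(A_i, h_{j-1}), Eq. (8)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                        # softmax weights, Eqs. (9)-(10)
    z = alpha @ A                               # context vector z_j, Eq. (7)
    return z, alpha
```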

We predict the next word \( w_j \) with a softmax layer, whose inputs are the context vector, the previously generated word and the decoder state \( \mathbf{h}_j \):

$$ p\left( w_j \right) \propto \exp\left( L_o \left( E\mathbf{w}_{j-1} + L_h \mathbf{h}_j + L_z \mathbf{z}_j \right) \right) $$
(11)

where \( L_o \), \( E \), \( L_h \), \( L_z \) are learned parameters. The positive weight \( \alpha_{ji} \) can be viewed as the probability that the word generated at time step j is "translated" from annotation vector \( \mathbf{A}_i \). Since \( \mathbf{A}_i \) contains both the content information and the spatial location information of object i, \( \alpha_{ji} \) learns both the content relationships and the spatial relationships of the objects in the input image when predicting the next word \( \mathbf{w}_j \). The examples in Fig. 3 show how the spatial relationships of objects are reflected in the generated sentences.
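A minimal NumPy sketch of this output layer, Eq. (11); the argument names are placeholders and the max-subtraction is only for numerical stability.

```python
import numpy as np

def word_distribution(Ew_prev, h, z, L_o, L_h, L_z):
    """Distribution over the vocabulary, Eq. (11). L_o, L_h, L_z are learned
    matrices; Ew_prev is the embedding of the previously generated word."""
    logits = L_o @ (Ew_prev + L_h @ h + L_z @ z)
    logits -= logits.max()                   # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()               # p(w_j), one entry per word
```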

Fig. 3.

Illustration of how the proposed model uses the information of objects and their spatial relationships in the image to generate sentences. The baseline results are generated by NeuralTalk2 (version 2.0 of Deep VS [10]). "Ours" are generated by the proposed model. Words in red align to objects that are not recognized by the baseline model, and the blue words show the objects' spatial relationships.

2.3 Training

This section describes the training of the proposed model. The training data for each image consist of the input image features \( \{ \mathbf{A}_i \} \) and the output caption word sequence \( \{ w_k \} \). The parameters of the encoding part are fixed, so we only need to learn the parameters of the decoding part, which are all the attention model parameters \( \Theta_{Att} = \{ W_{\cdot}, U_{\cdot}, Z_{\cdot}, b_{\cdot} \} \) jointly with the RNN parameters \( \Theta_{RNN} \).

We train our model using maximum likelihood with a regularization term on the attention weights, by minimizing a loss function over the training set. The loss function is the negative log probability of the ground truth words \( \{ w_1, w_2, \ldots, w_K \} \) plus a penalty on the attention weights:

$$ \mathrm{LOSS} = - \sum_{j} \log\left( p\left( w_j \right) \right) + \lambda \sum_{i=1}^{n} \left( 1 - \sum_{j} \alpha_{ji} \right)^{2} $$
(12)

where \( w_{j} \) is the ground truth word and \( \lambda > 0 \) is a balancing factor between the cross entropy loss and a penalty on the attention weights. We use stochastic gradient descent with momentum 0.9 to train the parameters of our network.
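A minimal NumPy sketch of this loss, Eq. (12), assuming the per-step word distributions and attention weights have already been collected; the argument names are placeholders.

```python
import numpy as np

def caption_loss(word_probs, target_ids, alphas, lam=1.0):
    """Eq. (12): negative log-likelihood of the ground-truth words plus the
    penalty on the attention weights.

    word_probs: (K, V) predicted distributions p(w_j) for each time step j
    target_ids: (K,)   indices of the ground-truth words w_1..w_K
    alphas:     (K, n) attention weights alpha_{ji}
    lam:        the balancing factor lambda > 0
    """
    nll = -np.sum(np.log(word_probs[np.arange(len(target_ids)), target_ids]))
    penalty = np.sum((1.0 - alphas.sum(axis=0)) ** 2)   # sum over objects i
    return nll + lam * penalty
```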

3 Experiments

In this section, we first describe the dataset used in our experiments as well as the experimental methodology, followed by a detailed discussion of the results. We report all results using the Microsoft COCO caption evaluation tool [18], including BLEU-1, 2, 3, 4, METEOR, CIDEr and ROUGE-L.

3.1 Dataset

We use the COCO dataset [21], which contains 123,287 color images, to evaluate the proposed method. For each image, there are at least five captions given by different AMT workers. To make our results comparable to other methods, we use the commonly adopted splits of [18], which assign 5000 images for validation, 5000 for testing, and keep the rest for training. For each sentence, we limit the length to 50 words and truncate the rest. Each sentence ends with the token "<end>". Words that appear less than five times are replaced with the token "<unk>". Building the dictionary in this way results in a final vocabulary of 10,020 unique words for the COCO dataset.
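A short Python sketch of this preprocessing, assuming captions have already been tokenized into lists of words; the function name and argument defaults are illustrative.

```python
from collections import Counter

def build_vocab(captions, max_len=50, min_count=5):
    """Truncate captions to max_len words, append '<end>', and replace words
    seen fewer than min_count times with '<unk>', as described above."""
    captions = [c[:max_len] + ['<end>'] for c in captions]
    counts = Counter(w for c in captions for w in c)
    vocab = {w for w, n in counts.items() if n >= min_count} | {'<unk>', '<end>'}
    captions = [[w if w in vocab else '<unk>' for w in c] for c in captions]
    return captions, sorted(vocab)
```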

3.2 Experimental Setting

In the encoding part of the proposed model, we use Faster R-CNN [6] pre-trained on the COCO dataset for object detection and the VGG net [2] pre-trained on the ImageNet dataset for feature extraction. For each input image, Faster R-CNN outputs a set of rectangular object proposals, each with an objectness score. We sort these scores and select the top-5 regions as the objects detected in the image. Each object is then represented as a 4096-dimensional vector by the 'fc7' layer of Faster R-CNN. Likewise, the spatial location of each object is represented as a 4096-dimensional vector by the 'fc7' layer of the VGG net. We concatenate these two vectors to form the annotation vectors \( \{ \mathbf{A}_i \} \), which therefore have dimension 8192. We also use the 'fc7' layer of the VGG net to extract features of the whole image and repeat this vector twice. Thus, for each input image, the output of the proposed encoding part is a 6 × 8192 matrix, as sketched below.
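A minimal NumPy sketch of how this 6 × 8192 encoder output can be assembled from the pieces described above; the function and argument names are placeholders.

```python
import numpy as np

def build_annotation_matrix(obj_vecs, loc_vecs, whole_image_fc7):
    """Assemble the encoder output described above.

    obj_vecs:        (5, 4096) 'fc7' features of the top-5 detected objects
    loc_vecs:        (5, 4096) 'fc7' features of their localization images
    whole_image_fc7: (4096,)   'fc7' feature of the whole image
    Returns a 6 x 8192 matrix: five annotation vectors [obj_i ; loc_i] plus
    one row made of the whole-image vector repeated twice.
    """
    A = np.concatenate([obj_vecs, loc_vecs], axis=1)   # (5, 8192)
    whole = np.tile(whole_image_fc7, 2)[None, :]       # (1, 8192)
    return np.concatenate([A, whole], axis=0)          # (6, 8192)
```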

The proposed decoding part is the LSTM network [7] with the attention mechanism. We map each word to a 1000-dimensional vector and set the hidden layers of the LSTM to 1000 dimensions. We use tanh as the nonlinear activation function. For training, we use stochastic gradient descent with momentum 0.9 for model updating, with a mini-batch size of 100; we set the initial learning rate to 0.01 and halve it every 20,000 iterations. During testing, we use the beam search method, which keeps the top-k best sentences and expands them with new words until the end-of-sentence symbol is reached. By comparing the results for different values of k, we find that k = 4 gives the best results.
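A minimal beam-search sketch in Python, under the assumption of a decoder interface `step_fn(word_id, state)` that runs one LSTM step and returns (log-probabilities over the vocabulary, new state); the start and end token ids are placeholders.

```python
import numpy as np

def beam_search(step_fn, init_state, start_id, end_id, k=4, max_len=50):
    """Keep the k best partial sentences and expand them with new words until
    the end-of-sentence symbol (end_id) is reached, as described above."""
    beams = [([start_id], 0.0, init_state, False)]   # (words, score, state, done)
    for _ in range(max_len):
        candidates = []
        for words, score, state, done in beams:
            if done:                                 # finished beams stay as-is
                candidates.append((words, score, state, True))
                continue
            log_probs, new_state = step_fn(words[-1], state)
            for w in np.argsort(log_probs)[::-1][:k]:  # k most likely next words
                candidates.append((words + [int(w)], score + log_probs[w],
                                   new_state, int(w) == end_id))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
        if all(done for _, _, _, done in beams):
            break
    return beams[0][0][1:]                           # best sentence, start token dropped
```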

3.3 Evaluation Results

Following previous works, we evaluate the proposed model using the standard metrics BLEU-1, 2, 3, 4, METEOR, CIDEr and ROUGE-L (the higher the better); all metrics are computed with the code released by the COCO Evaluation Server [18]. The results are shown in Table 1, where "Ours" indicates the performance of the proposed model. Comparing the results of [8,9,10, 19], which are CNN-RNN models, with the results of [11, 14, 20], which are attention based models, we find that encoding each image into a set of annotation vectors representing different subregions of the input image works better than encoding the whole image into a single feature vector. Since the proposed model is directly developed from the soft-attention approach [11], this comparison is the most relevant. The major difference between the proposed model and the soft-attention approach [11] is that we use the information of objects and their spatial locations in the image for sentence generation. From Table 1, it is clear that the proposed model achieves better performance than the soft-attention approach [11], which indicates that this additional information really helps machines achieve better image understanding. Compared with the best model to date, the ATT model [12], our results are still comparable, even though it uses a stronger CNN (GoogLeNet) than the VGG16 used in our model. We also argue that the proposed model requires less computation than the ATT model, since GoogLeNet is a deeper model than VGG16.

Table 1. Results on the MSCOCO dataset.

4 Conclusion

In this paper, we present a multimodal neural network that automatically learns to describe the content of images. Our model first extracts the information of objects and their spatial locations in an image, and then a deep recurrent neural network (RNN) based on LSTM units with an attention mechanism generates a description sentence. Each word of the description is automatically aligned to different objects in the input image when it is generated. The proposed model compares favorably with benchmark algorithms, in part because its design is inspired by the human visual system. We hope that this paper will serve as a reference for researchers to facilitate the design and implementation of image captioning systems.