1 Introduction

Image captioning is highly useful in the era of big data and represents a major advance in enabling computers to quickly extract information from images. Image captioning automatically generates a comprehensive, fluent descriptive sentence based on the content of an image. Typical applications include searching for desired items through a text description, finding the source of a paper or article through a picture, multi-object recognition in images or videos, automatic semantic annotation of medical images, and object recognition in autonomous driving.

Early image captioning technology was mainly derived from classical machine learning algorithms: for example, image descriptors were extracted and classifiers were used to detect targets, and the detected targets and their attributes were then used to generate captions. In recent years, many kinds of models have been proposed [1]; one of them combines statistical features with a neural network model based on the encoder-decoder framework. The HAF model is a baseline based on reinforcement learning [2]; when generating a caption, REN is applied to the CIDEr reward by assigning different word-level weights to words according to their importance. It has also been proposed to use the language model as a large label space to complete image captioning [3], including using attention regions to generate words, but this approach suffers from attention drift.

This paper proposes a knowledge distillation architecture to improve the performance of an autoregressive teacher model with good generalization ability. The purpose is to provide more data as a reference for training and to introduce more unlabeled data so that the soft targets correspond to the ground truth as closely as possible. Comparing this method with two Encoder-Decoder architectures, the results show that the model achieves a measurable improvement in accuracy.

The rest of this paper is organized as follows: Section 2 gives an overview of image captioning; Section 3 introduces the Encoder-Decoder architecture; Section 4 presents the proposed knowledge distillation structure; Section 5 reports the experiments and results; and the final section concludes the paper.

2 Overview of Image Caption

Image captioning is the automatic generation of image descriptions in natural language, and it has attracted increasing attention in the AI industry. Image captioning can be regarded as a major challenge among the core problems of computer vision, because image understanding is much more difficult than image classification. It requires not only computer vision technology, but also natural language processing technology to generate meaningful sentences for images [4].

Fig. 1. ASG2Caption model [5].

A novel ASG2Caption model [5], shown in Fig. 1, was proposed to recognize the graph structure. Its encoder first encodes basic information with an embedding layer, and a role-aware graph encoder is then introduced, which contains a role-aware node embedding to distinguish node intentions through an MR-GCN. An attention model with a CNN over images and an LSTM over sentences was proposed with three stimulus-driven factors: color, dimension, and location. A CNN-LSTM model combined with the attention principle was considered in [6]. Image caption generation with an LSTM was proposed by Verma [7]. The paper [8] proposes a lightweight Bifurcate-CNN.

3 Encoder-Decoder Architecture

Depending on the input and output sequences, RNNs are designed into a variety of structures to serve different application fields. Encoder-Decoder is one of the most important structures in the current AI industry. Since the input and output of a sequence transduction model are of variable length, researchers designed a structure consisting of two main parts to handle this type of input and output. The first is the encoder, a network that takes a variable-length sequence as input and converts it into a fixed-shape encoded state, i.e., a feature vector representation of the content. The second is the decoder, a network with a structure similar to the encoder but operating in the opposite direction, which maps the fixed-shape encoded state to a variable-length output sequence. An encoder-decoder architecture was employed for caption generation [9]. Seq2Seq can overcome the shortcomings of RNNs: applications such as machine translation and chatbots need a direct conversion from one sequence to another, but a plain RNN forces the input and output to have fixed sizes, whereas the Seq2Seq model has no such restriction, so the input and output lengths can vary in any setting.
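A minimal sketch of this encoder-decoder idea for captioning is shown below, assuming PyTorch, a VGG16 backbone (as in the experiments of this paper), and a toy vocabulary size; it is illustrative only, not the authors' exact model.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Encodes an image into a fixed-shape state vector."""
    def __init__(self, embed_size=256):
        super().__init__()
        backbone = models.vgg16(weights=None)           # VGG16 convolutional backbone (assumption)
        self.features = backbone.features               # feature extractor
        self.fc = nn.Linear(512 * 7 * 7, embed_size)    # map pooled features to a fixed-size code

    def forward(self, images):                          # images: (B, 3, 224, 224)
        x = self.features(images).flatten(1)
        return self.fc(x)                               # (B, embed_size) fixed-shape encoded state


class LSTMDecoder(nn.Module):
    """Maps the fixed-shape code to a variable-length word sequence."""
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_code, captions):            # captions: (B, T) word indices
        # Prepend the image code as the first "token" of the input sequence.
        inputs = torch.cat([image_code.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                         # (B, T+1, vocab_size) word scores
```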

An encoder-decoder based on fusion methods can be adopted to complete the caption text task [10]; in the feature extraction part, a VGG16 + Faster R-CNN framework is used, and the fusion method is used to train a BLSTM. The Gated Recurrent Unit (GRU) is used for effective sentence generation [11]. When the time interval is too large or too small, the gradients of an RNN are likely to decay or explode. Although gradient clipping can cope with gradient explosion, it cannot solve the difficulty of gradient decay. The root cause of the RNN's difficulty in practical applications is that its gradients decay whenever long time distances must be handled. The LSTM allows an RNN to selectively forget some past information through gating, with the purpose of building a more global model of long-term conditions and relationships while retaining useful past memories. The GRU further reduces the vanishing-gradient problem while retaining the advantages of modeling long sequences.
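As a hedged sketch of the gradient-clipping step mentioned above, the following training step (assuming a PyTorch model and teacher forcing) bounds exploding gradients with a norm cap; it does not address vanishing gradients, which is why gated units such as the LSTM and GRU are used instead.

```python
import torch


def train_step(model, images, inputs, targets, criterion, optimizer, max_norm=5.0):
    """One training step; inputs/targets are the caption shifted by one word (teacher forcing)."""
    optimizer.zero_grad()
    scores = model(images, inputs)                       # (B, T, vocab_size) word scores
    loss = criterion(scores.reshape(-1, scores.size(-1)),
                     targets.reshape(-1))                # cross-entropy over the vocabulary
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # cap the gradient norm
    optimizer.step()
    return loss.item()
```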

4 Knowledge Distillation Structure

Conceptual Captions is a dataset proposed in [12]. Compared with the classic COCO dataset, Conceptual Captions contains more images, more image styles, and richer annotation content. Conceptual Captions was built by extracting and filtering target content from web pages on the internet, where image data, image captions, and other related information serve as search and filtering cues.

Fig. 2. The different knowledge distillation [13].

Fig. 3. The proposed knowledge distillation structure.

Like most other artificial intelligence methods, image captioning relies mainly on deep neural networks with many layers, which introduces high computational costs. To reduce this cost, the large-scale model used to represent large-scale knowledge can be migrated to a small-scale model; the former is regarded as the teacher and the latter as the student [13, 14], as shown in Fig. 2. The problems to be solved are deciding which knowledge in the teacher model to transfer and how to carry out the transfer process. This method is called knowledge distillation. Its main principle is to use the mapped core knowledge as the learning goal: what the smaller student model must retain is the output layer of the larger teacher model.
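A minimal sketch of this teacher-student transfer is given below, assuming the standard temperature-softened KL-divergence distillation loss between teacher and student word distributions; it illustrates the general principle rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Match the student's output layer to the teacher's softened output layer."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)    # teacher's "soft" labels
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)   # student's predictions
    # KL divergence, scaled by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
```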

To further improve the semantic information transmitted in the feature representation after distillation, this paper proposes a knowledge distillation architecture to improve the results of an autoregressive teacher model with good generalization performance, as shown in Fig. 3. The teacher model and student model use a network structure similar to U-Net, which helps to train the model more efficiently on less data and to exploit features for higher accuracy. Meanwhile, in the loss section, a label normalization method is used to handle unmatched image-sentence pairs, making the distillation process more efficient. In addition, more data can be provided as a reference for training, and more unlabeled data can be introduced so that the soft targets match the ground truth as closely as possible.
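One possible form of such a loss section is sketched below: the student is trained against both the teacher's soft targets and the ground-truth words, with label smoothing standing in as one plausible realization of the label normalization used for unmatched image-sentence pairs. The weight `alpha`, temperature, and smoothing value are illustrative assumptions, not reported settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def student_loss(student_logits, teacher_logits, target_words,
                 alpha=0.5, temperature=2.0, smoothing=0.1):
    # Hard loss: cross-entropy against the ground-truth caption, with label smoothing
    # as a stand-in for label normalization on noisy image-sentence pairs (assumption).
    ce = nn.CrossEntropyLoss(label_smoothing=smoothing)
    hard = ce(student_logits.reshape(-1, student_logits.size(-1)),
              target_words.reshape(-1))
    # Soft loss: match the teacher's softened word distribution (see distillation_loss above).
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    soft_targets, reduction="batchmean") * temperature ** 2
    return alpha * soft + (1.0 - alpha) * hard
```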

5 Experimental Results

To analyze and compare the results of the structure proposed in this paper, we selected a subset of the data from Microsoft COCO Caption and Flickr8K. Each image has five corresponding sentences. All backbones and detectors adopt VGG16. The multiple descriptions of an image are independent of each other and use different grammars; they describe different aspects of the same image, or simply express the same content with different grammar [15].

Fig. 4. The comparison of BLEU scores for five models.

In the evaluation process, BLEU is used as the evaluation metric. BLEU counts the n-gram matches between each candidate sentence and the ground truth; the more matches, the higher the candidate's score. Figure 4 shows the BLEU-1 to BLEU-4 results for five different methods. These results show that extracting image features at different semantic levels has a positive impact on improving image captioning results. The way information is fed into the framework, such as through an LSTM or an Attention LSTM, may also affect the results. Comparing the method proposed in this paper with the Attention LSTM and Encoder-Decoder algorithms, the experimental results show that the proposed knowledge distillation architecture strengthens the semantic information transmitted in the feature representation after distillation, and trains the model more efficiently on less data to obtain higher accuracy.
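A small sketch of the BLEU-1 to BLEU-4 computation described above is given below, assuming the NLTK implementation; the candidate and reference sentences are toy examples, not data from the experiments.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "man", "rides", "a", "horse", "on", "the", "beach"]]  # ground-truth caption(s)
candidate = ["a", "man", "is", "riding", "a", "horse"]                    # generated caption

smooth = SmoothingFunction().method1  # avoids zero scores when higher-order n-grams do not match
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))                            # uniform n-gram weights
    score = sentence_bleu(references, candidate,
                          weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```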

6 Conclusions

Image captioning technology is a comprehensive technology for generating descriptions of images in real-life settings. Most recent image captioning methods belong to the DNN encoder-decoder architecture. The teacher-student knowledge distillation framework proposed in this paper can train the model more efficiently on less data and can exploit features of different levels in different domains to improve the metrics of a teacher model with good generalization performance. The next step will be to study how to improve the mapping capability of multimodal spaces.