1 Introduction

Image captioning is highly useful in the era of big data and represents a major advance in enabling computers to quickly extract information from images. Image captioning automatically generates a comprehensive, fluent descriptive sentence based on the content of an image. Typical applications include searching for desired items through a text description, finding the source of a paper or article through a picture, multi-object recognition in images or videos, automatic semantic annotation of medical images, and object recognition in autonomous driving.

Early image captioning technology was mainly derived from classical machine learning algorithms: for example, image descriptors were extracted and classifiers were used to detect targets, and the detected targets and their attributes were then used to generate captions. In recent years, many kinds of models have been proposed [1]; one of them combines statistical features with a neural network model based on the encoder-decoder framework. The HAF model is a baseline based on reinforcement learning [2]; when generating a caption, REN is applied to the CIDEr reward by assigning different word-level weights to words according to their importance. It has also been proposed to use the language model as a large label space to complete image captioning [3], including using attention regions to generate words, but this approach suffers from attention drift.

This paper proposes a knowledge distillation architecture to improve the performance of an autoregressive teacher model with good generalization ability. The purpose is to provide more data as a reference for training and to introduce more unlabeled data so that the soft targets correspond to the ground truth as closely as possible. Comparing this method with two Encoder-Decoder architectures, the results show that the model achieves a measurable improvement in accuracy.

The rest of this paper is organized as follows: Section 2 gives an overview of image captioning; Section 3 introduces the Encoder-Decoder architecture; Section 4 presents the proposed knowledge distillation structure; Section 5 reports the experiments and results; and the final section concludes the paper.

2 Overview of Image Caption

Image captioning is the automatic generation of image descriptions in natural language, and it has attracted increasing attention in the AI industry. Image captioning can be regarded as a major challenge among the core problems of computer vision, because image understanding is much more difficult than image classification. It requires not only computer vision technology, but also natural language processing technology to generate meaningful sentences for images [4].

Fig. 1. ASG2Caption model [5].

A novel ASG2Caption model [5], shown in Fig. 1, was proposed to recognize the graph structure. Its encoder first encodes basic information with an embedding layer, and a role-aware graph encoder is then introduced, which contains a role-aware node embedding to distinguish node intentions through an MR-GCN. An attention model with a CNN over images and an LSTM over sentences was proposed with three stimulus-driven factors: color, dimension, and location. A CNN-LSTM model combined with the attention principle was considered in [6]. Image caption generation with an LSTM was proposed by Verma [7]. The paper [8] proposes a lightweight Bifurcate-CNN.

3 Encoder-Decoder Architecture

Depending on the input and output sequences, RNNs are designed into a variety of structures to serve different application fields. Encoder-Decoder is one of the most important structures in the current AI industry. Since the input and output of a sequence transduction model are of variable length, researchers designed a structure consisting of two main parts to handle this type of input and output. The first is the encoder, a network that takes a variable-length sequence as input and converts it into a fixed-shape encoded state, i.e., a feature vector representation of the content. The second is the decoder, a network with a structure similar to the encoder but operating in the opposite direction, which maps the fixed-shape encoded state to a variable-length output sequence. An encoder-decoder architecture was employed for caption generation [9]. Seq2Seq can overcome the shortcomings of RNNs: applications such as machine translation and chatbots need a direct conversion from one sequence to another, but a plain RNN forces the input and output to have fixed sizes, whereas the Seq2Seq model has no such restriction, so the input and output lengths can vary in any setting.
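A minimal sketch of this encoder-decoder idea for captioning is shown below, assuming PyTorch, a VGG16 backbone (as in the experiments of this paper), and a toy vocabulary size; it is illustrative only, not the authors' exact model.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Encodes an image into a fixed-shape state vector."""
    def __init__(self, embed_size=256):
        super().__init__()
        backbone = models.vgg16(weights=None)           # VGG16 convolutional backbone (assumption)
        self.features = backbone.features               # feature extractor
        self.fc = nn.Linear(512 * 7 * 7, embed_size)    # map pooled features to a fixed-size code

    def forward(self, images):                          # images: (B, 3, 224, 224)
        x = self.features(images).flatten(1)
        return self.fc(x)                               # (B, embed_size) fixed-shape encoded state


class LSTMDecoder(nn.Module):
    """Maps the fixed-shape code to a variable-length word sequence."""
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_code, captions):            # captions: (B, T) word indices
        # Prepend the image code as the first "token" of the input sequence.
        inputs = torch.cat([image_code.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                         # (B, T+1, vocab_size) word scores
```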

An encoder-decoder based on fusion methods can be adopted to complete the caption text task [10]; in the feature extraction part, a VGG16 + Faster R-CNN framework is used, and the fusion method is used to train a BLSTM. The Gated Recurrent Unit (GRU) is used for effective sentence generation [11]. When the time interval is too large or too small, the gradients of an RNN are likely to decay or explode. Although gradient clipping can cope with gradient explosion, it cannot solve the difficulty of gradient decay. The root cause of the RNN's difficulty in practical applications is that its gradients decay whenever long time distances must be handled. The LSTM allows an RNN to selectively forget some past information through gating, with the purpose of building a more global model of long-term conditions and relationships while retaining useful past memories. The GRU further reduces the vanishing-gradient problem while retaining the advantages of modeling long sequences.
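As a hedged sketch of the gradient-clipping step mentioned above, the following training step (assuming a PyTorch model and teacher forcing) bounds exploding gradients with a norm cap; it does not address vanishing gradients, which is why gated units such as the LSTM and GRU are used instead.

```python
import torch


def train_step(model, images, inputs, targets, criterion, optimizer, max_norm=5.0):
    """One training step; inputs/targets are the caption shifted by one word (teacher forcing)."""
    optimizer.zero_grad()
    scores = model(images, inputs)                       # (B, T, vocab_size) word scores
    loss = criterion(scores.reshape(-1, scores.size(-1)),
                     targets.reshape(-1))                # cross-entropy over the vocabulary
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # cap the gradient norm
    optimizer.step()
    return loss.item()
```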

4 Knowledge Distillation Structure

Conceptual Captions is a dataset proposed in [12]. Compared with the classic COCO dataset, Conceptual Captions contains more images, more image styles, and richer annotation content. Conceptual Captions was built by extracting and filtering target content from web pages on the internet, where image data, image captions, and other related information serve as search and filtering cues.

Fig. 2. The different knowledge distillation [13].

Fig. 3. The proposed knowledge distillation structure.

Like most other artificial intelligence methods, image captioning relies mainly on deep neural networks with many layers, which introduces high computational costs. To reduce this cost, the large-scale model used to represent large-scale knowledge can be migrated to a small-scale model; the former is regarded as the teacher and the latter as the student [13, 14], as shown in Fig. 2. The problems to be solved are deciding which knowledge in the teacher model to transfer and how to carry out the transfer process. This method is called knowledge distillation. Its main principle is to use the mapped core knowledge as the learning goal: what the smaller student model must retain is the output layer of the larger teacher model.
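A minimal sketch of this teacher-student transfer is given below, assuming the standard temperature-softened KL-divergence distillation loss between teacher and student word distributions; it illustrates the general principle rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Match the student's output layer to the teacher's softened output layer."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)    # teacher's "soft" labels
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)   # student's predictions
    # KL divergence, scaled by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
```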

To further improve the semantic information transmitted in the feature representation after distillation, this paper proposes a knowledge distillation architecture to improve the results of an autoregressive teacher model with good generalization performance, as shown in Fig. 3. The teacher model and student model use a network structure similar to U-Net, which helps to train the model more efficiently on less data and to exploit features for higher accuracy. Meanwhile, in the loss section, a label normalization method is used to handle unmatched image-sentence pairs, making the distillation process more efficient. In addition, more data can be provided as a reference for training, and more unlabeled data can be introduced so that the soft targets match the ground truth as closely as possible.
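One possible form of such a loss section is sketched below: the student is trained against both the teacher's soft targets and the ground-truth words, with label smoothing standing in as one plausible realization of the label normalization used for unmatched image-sentence pairs. The weight `alpha`, temperature, and smoothing value are illustrative assumptions, not reported settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def student_loss(student_logits, teacher_logits, target_words,
                 alpha=0.5, temperature=2.0, smoothing=0.1):
    # Hard loss: cross-entropy against the ground-truth caption, with label smoothing
    # as a stand-in for label normalization on noisy image-sentence pairs (assumption).
    ce = nn.CrossEntropyLoss(label_smoothing=smoothing)
    hard = ce(student_logits.reshape(-1, student_logits.size(-1)),
              target_words.reshape(-1))
    # Soft loss: match the teacher's softened word distribution (see distillation_loss above).
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    soft_targets, reduction="batchmean") * temperature ** 2
    return alpha * soft + (1.0 - alpha) * hard
```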

5 Experimental Results

To analyze and compare the results of the structure proposed in this paper, we selected a subset of the data from Microsoft COCO Caption and Flickr8K. Each image has five corresponding sentences. All backbones and detectors adopt VGG16. The multiple descriptions of an image are independent of each other and use different grammars; they describe different aspects of the same image, or simply express the same content with different grammar [15].

Fig. 4. The comparison of BLEU scores for five models.

In the evaluation process, BLEU is used as the evaluation metric. BLEU counts the n-gram matches between each candidate sentence and the ground truth; the more matches, the higher the candidate's score. Figure 4 shows the BLEU-1 to BLEU-4 results for five different methods. These results show that extracting image features at different semantic levels has a positive impact on improving image captioning results. The way information is fed into the framework, such as through an LSTM or an Attention LSTM, may also affect the results. Comparing the method proposed in this paper with the Attention LSTM and Encoder-Decoder algorithms, the experimental results show that the proposed knowledge distillation architecture strengthens the semantic information transmitted in the feature representation after distillation, and trains the model more efficiently on less data to obtain higher accuracy.
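A small sketch of the BLEU-1 to BLEU-4 computation described above is given below, assuming the NLTK implementation; the candidate and reference sentences are toy examples, not data from the experiments.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "man", "rides", "a", "horse", "on", "the", "beach"]]  # ground-truth caption(s)
candidate = ["a", "man", "is", "riding", "a", "horse"]                    # generated caption

smooth = SmoothingFunction().method1  # avoids zero scores when higher-order n-grams do not match
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))                            # uniform n-gram weights
    score = sentence_bleu(references, candidate,
                          weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```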

6 Conclusions

Image captioning technology is a comprehensive technology for generating descriptions of images in real-life settings. Most recent image captioning methods belong to the DNN encoder-decoder architecture. The teacher-student knowledge distillation framework proposed in this paper can train the model more efficiently on less data and can exploit features of different levels in different domains to improve the metrics of a teacher model with good generalization performance. The next step will be to study how to improve the mapping capability of multimodal spaces.