Abstract
We propose a collaborative learning method for natural image captioning. Many existing methods use pretrained image classification CNNs to extract feature representations for caption generation, which ignores the gap in feature representations between different computer vision tasks. To address this problem, our method exploits the similarity between the image captioning and pix2pix image reconstruction tasks to narrow this gap. Specifically, our framework consists of two modules: 1) a pix2pix module (P2PM), which uses a shared feature extractor to obtain feature representations and a U-Net architecture to encode the image into a latent code and then decode it back to the original image; and 2) a natural language generation module (NLGM), which generates descriptions from the feature representations extracted by the P2PM. Consequently, both the feature representations and the generated captions improve during collaborative training. Experimental results on the MSCOCO 2017 dataset demonstrate the effectiveness of our approach compared with baseline methods.
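The abstract names the components but not their implementation. As a rough illustration only, the following minimal PyTorch-style sketch shows how a shared feature extractor, a reconstruction decoder (standing in for the P2PM) and an LSTM caption generator (standing in for the NLGM) could be trained jointly so that both losses update the shared features; every class name, layer size and the equal loss weighting below are our assumptions, not the authors' code.

import torch
import torch.nn as nn

class SharedFeatureExtractor(nn.Module):
    # Shared CNN backbone whose features feed both modules.
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 4, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        return self.net(x)  # (B, feat_dim, H/4, W/4)

class P2PM(nn.Module):
    # Decoder that reconstructs the input image from the shared features
    # (a simplified stand-in for the paper's U-Net encoder-decoder).
    def __init__(self, feat_dim=256):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, feats):
        return self.decoder(feats)

class NLGM(nn.Module):
    # LSTM caption generator conditioned on globally pooled image features.
    def __init__(self, feat_dim=256, vocab=10000, embed=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.init_h = nn.Linear(feat_dim, hidden)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats, tokens):
        pooled = feats.mean(dim=(2, 3))         # global average pooling
        h0 = self.init_h(pooled).unsqueeze(0)   # image-conditioned initial state
        out, _ = self.lstm(self.embed(tokens), (h0, torch.zeros_like(h0)))
        return self.out(out)                    # (B, T, vocab)

# One collaborative training step on a dummy batch: the reconstruction loss
# and the captioning loss both backpropagate into the shared extractor.
extractor, p2pm, nlgm = SharedFeatureExtractor(), P2PM(), NLGM()
opt = torch.optim.Adam(
    list(extractor.parameters()) + list(p2pm.parameters()) + list(nlgm.parameters()),
    lr=1e-4)
images = torch.randn(4, 3, 64, 64)          # dummy images
caps = torch.randint(0, 10000, (4, 12))     # dummy caption token ids

feats = extractor(images)
recon_loss = nn.functional.l1_loss(p2pm(feats), images)
logits = nlgm(feats, caps[:, :-1])          # teacher forcing: predict next token
cap_loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), caps[:, 1:].reshape(-1))
loss = cap_loss + recon_loss                # equal weighting is an assumption
opt.zero_grad()
loss.backward()
opt.step()

In this sketch, the single summed loss realizes the collaborative coupling: gradients from the captioning objective and the reconstruction objective meet in the shared extractor, which is the mechanism the abstract credits for narrowing the representation gap.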
This work was supported by grant no. 61862050 from the National Natural Science Foundation of China and grant no. 2020AAC03031 from the Natural Science Foundation of Ningxia, China.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Wang, R., Liu, L. (2022). Collaborative Learning Method for Natural Image Captioning. In: Wang, Y., Zhu, G., Han, Q., Wang, H., Song, X., Lu, Z. (eds) Data Science. ICPCSEE 2022. Communications in Computer and Information Science, vol 1628. Springer, Singapore. https://doi.org/10.1007/978-981-19-5194-7_19
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-5193-0
Online ISBN: 978-981-19-5194-7