Abstract
Describing images in natural language is a challenging task for computer vision. Image captioning is the task of creating such image descriptions. Deep learning architectures that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are well suited to this task. However, traditional RNNs suffer from exploding and vanishing gradients and tend to produce non-descriptive sentences. To address these problems, we propose a model based on the encoder–decoder structure that uses CNNs to extract features from reference images and gated recurrent units (GRUs) to generate the descriptions. Our model applies part-of-speech (PoS) analysis and the likelihood function to generate weights in the GRU. The method also performs knowledge transfer during the validation phase using the k-nearest neighbors (kNN) technique. Our experimental results on the Flickr30k and MS-COCO datasets indicate that the proposed PoS-based model yields scores competitive with those of high-end models. The system predicts more descriptive captions and closely approximates the expected captions for both the predicted and kNN-selected captions.
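The GRU decoder mentioned in the abstract can be illustrated with a single recurrent update step. The following is a minimal sketch in pure Python, not the authors' implementation: the 2-dimensional identity weight matrices and the input vector are hypothetical values chosen only to make the gate equations concrete (in the paper, inputs would be word embeddings conditioned on CNN image features, and the PoS-derived weights would modulate these parameters).

```python
# Minimal sketch of one GRU decoder step (illustrative parameters only).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def gru_step(x, h, p):
    """One GRU update: z is the update gate, r is the reset gate."""
    z = [sigmoid(v) for v in add(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(v) for v in add(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]          # reset-gated state
    h_tilde = [math.tanh(v) for v in add(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    # Interpolate between the old state and the candidate state.
    return [(1 - zi) * hi + zi * hti for zi, hi, hti in zip(z, h, h_tilde)]

# Toy 2-d parameters (hypothetical identity matrices, for illustration).
I = [[1.0, 0.0], [0.0, 1.0]]
params = {"Wz": I, "Uz": I, "Wr": I, "Ur": I, "Wh": I, "Uh": I}

h = gru_step([0.5, -0.5], [0.0, 0.0], params)
```

Because the GRU has only two gates (update and reset) instead of the LSTM's three, it has fewer parameters per unit, which is one common motivation for choosing it as a decoder.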
Cite this article
do Carmo Nogueira, T., Vinhal, C.D.N., da Cruz Júnior, G. et al. A reference-based model using deep learning for image captioning. Multimedia Systems 29, 1665–1681 (2023). https://doi.org/10.1007/s00530-022-00937-3