A reference-based model using deep learning for image captioning

  • Special Issue Paper
  • Published in: Multimedia Systems

Abstract

Describing images in natural language is a challenging task for computer vision, and image captioning is the task of generating such descriptions automatically. Deep learning architectures that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs) perform well on this task. However, traditional RNNs suffer from problems such as exploding gradients, vanishing gradients, and non-descriptive sentences. To address these problems, we propose a model based on the encoder–decoder structure that uses CNNs to extract features from reference images and gated recurrent units (GRUs) to generate the descriptions. Our model applies part-of-speech (PoS) analysis and the likelihood function to generate the GRU weights, and it performs knowledge transfer during the validation phase using the k-nearest neighbors (kNN) technique. Experimental results on the Flickr30k and MS-COCO datasets indicate that the proposed PoS-based model yields scores competitive with those of state-of-the-art models. The system predicts more descriptive captions and closely approximates the expected captions for both the predicted and kNN-selected captions.
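The architecture summarized above pairs a CNN image encoder with a GRU decoder and consults nearest-neighbor reference captions. The sketch below illustrates that general structure in PyTorch as a minimal example, not the authors' implementation: the ResNet-50 backbone, all layer sizes, and the helper knn_reference_captions are assumptions made for illustration, and the paper's PoS- and likelihood-based weighting of the GRU is not reproduced here.

```python
# Minimal CNN-encoder / GRU-decoder captioning sketch with a kNN lookup over
# image features. Illustrative only; assumes a recent PyTorch and torchvision.
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """Extracts a fixed-length feature vector from an image with a pretrained CNN."""
    def __init__(self, feature_dim=256):
        super().__init__()
        cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(cnn.fc.in_features, feature_dim)

    def forward(self, images):                        # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)      # (B, 2048)
        return self.fc(feats)                         # (B, feature_dim)


class Decoder(nn.Module):
    """GRU language model conditioned on the image feature via its initial hidden state."""
    def __init__(self, vocab_size, feature_dim=256, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feature_dim, hidden_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):            # captions: (B, T) token ids
        h0 = torch.tanh(self.init_h(features)).unsqueeze(0)  # (1, B, hidden_dim)
        emb = self.embed(captions)                            # (B, T, embed_dim)
        hidden, _ = self.gru(emb, h0)
        return self.out(hidden)                               # (B, T, vocab_size) logits


def knn_reference_captions(query_feat, ref_feats, ref_captions, k=5):
    """Return the captions of the k reference images closest to the query feature."""
    dists = torch.cdist(query_feat.unsqueeze(0), ref_feats).squeeze(0)  # (N,)
    idx = dists.topk(k, largest=False).indices
    return [ref_captions[i] for i in idx.tolist()]
```

In this sketch, the decoder is trained with the usual cross-entropy loss over caption tokens, and knn_reference_captions retrieves descriptions of visually similar reference images, which is one simple way to realize the reference/kNN selection step described in the abstract.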

Author information

Corresponding author

Correspondence to Tiago do Carmo Nogueira.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

do Carmo Nogueira, T., Vinhal, C.D.N., da Cruz Júnior, G. et al. A reference-based model using deep learning for image captioning. Multimedia Systems 29, 1665–1681 (2023). https://doi.org/10.1007/s00530-022-00937-3
