Abstract
Describing images in natural language is a challenging task for computer vision. Image captioning is the task of creating such image descriptions. Deep learning architectures that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are well suited to this task. However, traditional RNNs suffer from exploding and vanishing gradients and tend to produce non-descriptive sentences. To address these problems, we propose a model based on the encoder–decoder structure that uses CNNs to extract features from reference images and gated recurrent units (GRUs) to generate the descriptions. Our model applies part-of-speech (PoS) analysis and the likelihood function to generate weights in the GRU. The method also performs knowledge transfer during the validation phase using the k-nearest neighbors (kNN) technique. Our experimental results on the Flickr30k and MS-COCO datasets indicate that the proposed PoS-based model yields scores competitive with those of high-end models. The system predicts more descriptive captions and closely approximates the expected captions for both the predicted and kNN-selected captions.
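The GRU decoder mentioned in the abstract can be illustrated with a single recurrent update step. The following is a minimal sketch in pure Python, not the authors' implementation: the 2-dimensional identity weight matrices and the input vector are hypothetical values chosen only to make the gate equations concrete (in the paper, inputs would be word embeddings conditioned on CNN image features, and the PoS-derived weights would modulate these parameters).

```python
# Minimal sketch of one GRU decoder step (illustrative parameters only).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def gru_step(x, h, p):
    """One GRU update: z is the update gate, r is the reset gate."""
    z = [sigmoid(v) for v in add(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(v) for v in add(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]          # reset-gated state
    h_tilde = [math.tanh(v) for v in add(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    # Interpolate between the old state and the candidate state.
    return [(1 - zi) * hi + zi * hti for zi, hi, hti in zip(z, h, h_tilde)]

# Toy 2-d parameters (hypothetical identity matrices, for illustration).
I = [[1.0, 0.0], [0.0, 1.0]]
params = {"Wz": I, "Uz": I, "Wr": I, "Ur": I, "Wh": I, "Uh": I}

h = gru_step([0.5, -0.5], [0.0, 0.0], params)
```

Because the GRU has only two gates (update and reset) instead of the LSTM's three, it has fewer parameters per unit, which is one common motivation for choosing it as a decoder.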
Cite this article
do Carmo Nogueira, T., Vinhal, C.D.N., da Cruz Júnior, G. et al. A reference-based model using deep learning for image captioning. Multimedia Systems 29, 1665–1681 (2023). https://doi.org/10.1007/s00530-022-00937-3