
Automatic image captioning system based on augmentation and ranking mechanism

  • Original Paper
  • Published:
Signal, Image and Video Processing

Abstract

Automatically producing syntactically and semantically accurate captions remains an open research challenge. This paper proposes an effective pretrained Augmentation–Ranking (A–R) image captioning model that enhances image properties and produces appropriate captions. A novel augmentation strategy improves the convolutional neural network (CNN) stage, while ranking and feedback propagation improve the Long Short-Term Memory (LSTM) stage, so that the model addresses complexity, vanishing gradients and loss of context during training. The augmented CNN expands the training image set, and the ranking LSTM uses ranks to identify semantically appropriate captions; blending the two enhances the overall captioning system. The proposed A–R model is examined with greedy and beam search under both maximum and average pooling, and the outcomes are compared with state-of-the-art models such as the bidirectional recurrent neural network, Google NIC and Bi-LSTM combined with a semantic attention mechanism. Evaluated on the Flickr 8k and Flickr 30k datasets using BLEU, METEOR and CIDEr, the proposed model, with reduced complexity, generates captions that are accurate and syntactically and semantically correct, achieving an accuracy of 74.87% and outperforming all baseline models.
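The abstract describes an encoder–decoder pipeline: augmented images feed a CNN encoder, and an LSTM decoder generates captions with greedy or beam search. The sketch below illustrates that general structure in PyTorch; it is a hypothetical reconstruction, not the authors' code, and the augmentation transforms, ResNet-50 backbone, layer sizes, vocabulary size and greedy decoder are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision import transforms, models

# Augmentation stage: expands the training images before the CNN encoder.
# The specific transforms are assumptions; the paper does not list them.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

class CNNEncoder(nn.Module):
    """ResNet-50 backbone with the classifier replaced by a linear projection."""
    def __init__(self, embed_dim=256):
        super().__init__()
        backbone = models.resnet50(weights=None)  # pretrained weights optional
        backbone.fc = nn.Identity()               # keep the 2048-d pooled feature
        self.backbone = backbone
        self.proj = nn.Linear(2048, embed_dim)

    def forward(self, images):                    # (B, 3, 224, 224) -> (B, embed_dim)
        return self.proj(self.backbone(images))

class LSTMDecoder(nn.Module):
    """Generates a caption token by token, seeded with the image embedding."""
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def greedy_decode(self, image_feat, start_id=1, end_id=2, max_len=20):
        tokens, state = [start_id], None
        inp = image_feat.unsqueeze(1)             # image embedding acts as the first input
        for _ in range(max_len):
            out, state = self.lstm(inp, state)
            next_id = self.fc(out[:, -1]).argmax(dim=-1).item()
            if next_id == end_id:
                break
            tokens.append(next_id)
            inp = self.embed(torch.tensor([[next_id]]))
        return tokens

# Usage with a random stand-in image tensor.
encoder, decoder = CNNEncoder().eval(), LSTMDecoder().eval()
with torch.no_grad():
    caption_ids = decoder.greedy_decode(encoder(torch.randn(1, 3, 224, 224)))
print(caption_ids)   # token ids; a real system maps these back to vocabulary words

Beam search, which the paper also evaluates, would replace the argmax step by keeping the k highest-scoring partial captions at each step; it is omitted here for brevity.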


Availability of data and materials

The Flickr 8k and Flickr 30k datasets used in this work are publicly available online.


Funding

This work was funded under grant WOS-A/ET-6/2021.

Author information


Contributions

This research was carried out by B. S. Revathi under the guidance of A. Meena Kowshalya.

Corresponding author

Correspondence to B. S. Revathi.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Revathi, B.S., Kowshalya, A.M. Automatic image captioning system based on augmentation and ranking mechanism. SIViP 18, 265–274 (2024). https://doi.org/10.1007/s11760-023-02725-6


