Abstract
Image captioning is the challenging task of generating descriptive sentences for images. Automatically annotating images with semantic concepts has made significant progress, yet existing frameworks have clear limitations, particularly in concept detection. Incomplete labelling caused by biased annotations, the use of synonyms in training captions, and the large gap between positive and negative concept samples all contribute to the problem and create a barrier to accurate image captioning. Unequal sample occurrences and missing training captions limit a model's ability to produce rich and varied image descriptions and lead to insufficient concept generation. To overcome these limitations, a novel approach is designed that automatically generates images using a Weighted Stacked Generative Adversarial Network (WSGAN). The generated images rectify the uneven distribution of concepts, thereby broadening the coverage of the training set. The proposed approach couples the WSGAN with a Deep Learning (DL) captioning model based on Gated Recurrent Units (GRU) and a Visual Attention Mechanism (VAM); this GRU-VAM model generates text captions for images. The model is trained by combining the MS COCO dataset with original and machine-generated images in numerous permutations. The WSGAN-generated images correct the imbalance and incompleteness of the training dataset, improving the model's ability to capture a wider variety of concepts. In testing and evaluation, the proposed WSGAN-GRU-VAM demonstrates significant improvements in image captioning metrics over existing models.
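The attention-driven GRU decoding described above can be illustrated with a minimal sketch: an additive (Bahdanau-style) visual attention step pools image region features into a context vector, which then drives one GRU update of the decoder state. This is not the authors' implementation; the dimensions, weight names, and random toy features are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(features, hidden, W_f, W_h, v):
    # Additive attention: score each image region against the decoder state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # shape (regions,)
    alpha = softmax(scores)                               # attention weights
    context = alpha @ features                            # weighted sum of region features
    return context, alpha

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    # One GRU update: gates decide how much of the previous state to keep.
    z = 1.0 / (1.0 + np.exp(-(x @ Wz + h @ Uz)))   # update gate
    r = 1.0 / (1.0 + np.exp(-(x @ Wr + h @ Ur)))   # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)       # candidate state
    return (1.0 - z) * h + z * h_tilde

# Toy dimensions: 5 image regions, 8-dim features, 8-dim hidden state.
R, D, H = 5, 8, 8
features = rng.normal(size=(R, D))
hidden = np.zeros(H)
W_f, W_h, v = rng.normal(size=(D, H)), rng.normal(size=(H, H)), rng.normal(size=H)
Wz, Wr, Wh = (rng.normal(size=(D, H)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(H, H)) for _ in range(3))

# One decoding step: attend over the image, then update the decoder state.
context, alpha = attention(features, hidden, W_f, W_h, v)
hidden = gru_step(context, hidden, Wz, Uz, Wr, Ur, Wh, Uh)
print(alpha)  # attention weights over the 5 regions; they sum to 1
```

In a full captioning model this step would repeat once per generated word, with the word embedding concatenated to the context vector and the hidden state projected onto the vocabulary.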
WSGAN-GRU-VAM outperforms well-known image captioning algorithms such as EnsCaption, Fast RF-UIC, RAGAN, and SAT-GPT-3 across several essential metrics. Average increases in BLEU (8%), METEOR (7%), CIDEr (9%), and ROUGE-L (6%) reflect the model's capacity to produce image captions with enhanced linguistic accuracy, relevance, and coherence.
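The BLEU score reported above can be illustrated with a minimal sketch of modified unigram precision (BLEU-1) with a brevity penalty. The example sentences are hypothetical, and this single-reference form is a simplification of the full corpus-level, multi-reference BLEU used in captioning benchmarks.

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Modified unigram precision with brevity penalty (single reference)."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # Clip each candidate word's count by its count in the reference,
    # so repeating a matched word cannot inflate the score.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("a dog runs on the grass", "a dog is running on the grass")
print(round(score, 3))
```

METEOR, CIDEr, and ROUGE-L follow the same spirit but add stemming/synonym matching, TF-IDF weighting of n-grams, and longest-common-subsequence recall, respectively.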
Data Availability
The models used in the present research are available from the corresponding authors on reasonable request.
References
Stefanini, M., Cornia, M., Baraldi, L., Cascianelli, S., Fiameni, G., & Cucchiara, R. (2022). From show to tell: A survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 539–559.
Ghandi, T., Pourreza, H., & Mahyar, H. (2023). Deep learning approaches on image captioning: A review. ACM Computing Surveys, 56(3), 1–39.
Chun, P. J., Yamane, T., & Maemura, Y. (2022). A deep learning-based image captioning method to automatically generate comprehensive explanations of bridge damage. Computer-Aided Civil and Infrastructure Engineering, 37(11), 1387–1401.
Castro, R., Pineda, I., Lim, W., & Morocho-Cayamcela, M. E. (2022). Deep learning approaches based on transformer architectures for image captioning tasks. IEEE Access, 10, 33679–33694.
Sharma, H., Agrahari, M., Singh, S. K., Firoj, M., & Mishra, R. K. (2020, February). Image captioning: a comprehensive survey. In 2020 international conference on power electronics & IoT applications in renewable energy and its control (PARC) (pp. 325–328). IEEE.
Oluwasammi, A., Aftab, M. U., Qin, Z., Ngo, S. T., Doan, T. V., Nguyen, S. B., & Nguyen, G. H. (2021). Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning. Complexity, 2021, 1–19.
Alzubi, J. A., Jain, R., Nagrath, P., Satapathy, S., Taneja, S., & Gupta, P. (2021). Deep image captioning using an ensemble of CNN and LSTM based deep neural networks. Journal of Intelligent & Fuzzy Systems, 40(4), 5761–5769.
Wang, Y., Xiao, B., Bouferguene, A., Al-Hussein, M., & Li, H. (2022). Vision-based method for semantic information extraction in construction by integrating deep learning object detection and image captioning. Advanced Engineering Informatics, 53, 101699.
Ming, Y., Hu, N., Fan, C., Feng, F., Zhou, J., & Yu, H. (2022). Visuals to text: A comprehensive review on automatic image captioning. IEEE/CAA Journal of Automatica Sinica, 9(8), 1339–1365.
Humaira, M., Shimul, P., Jim, M. A. R. K., Ami, A. S., & Shah, F. M. (2021). A hybridized deep learning method for Bengali image captioning. International Journal of Advanced Computer Science and Applications. https://doi.org/10.14569/IJACSA.2021.0120287
Makav, B., & Kılıç, V. (2019, November). A new image captioning approach for visually impaired people. In 2019 11th international conference on Electrical and Electronics Engineering (ELECO) (pp. 945–949). IEEE.
Hoxha, G., Melgani, F., & Demir, B. (2020). Toward remote sensing image retrieval under a deep image captioning perspective. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13, 4462–4475.
Yu, J., Li, J., Yu, Z., & Huang, Q. (2019). Multimodal transformer with multi-view visual representation for image captioning. IEEE Transactions on Circuits and Systems for Video Technology, 30(12), 4467–4480.
Sumbul, G., Nayak, S., & Demir, B. (2020). SD-RSIC: Summarization-driven deep remote sensing image captioning. IEEE Transactions on Geoscience and Remote Sensing, 59(8), 6922–6934.
Puscasiu, A., Fanca, A., Gota, D. I., & Valean, H. (2020, May). Automated image captioning. In 2020 IEEE international conference on automation, quality and testing, robotics (AQTR) (pp. 1–6). IEEE.
Xiong, Y., Du, B., & Yan, P. (2019). Reinforced transformer for medical image captioning. In Machine Learning in Medical Imaging: 10th International workshop, MLMI 2019, held in conjunction with MICCAI 2019, Shenzhen, China, October 13, 2019, Proceedings 10 (pp. 673–680). Springer International Publishing.
Xu, N., Zhang, H., Liu, A. A., Nie, W., Su, Y., Nie, J., & Zhang, Y. (2019). Multi-level policy and reward-based deep reinforcement learning framework for image captioning. IEEE Transactions on Multimedia, 22(5), 1372–1383.
Omri, M., Abdel-Khalek, S., Khalil, E. M., Bouslimi, J., & Joshi, G. P. (2022). Modeling of hyperparameter tuned deep learning model for automated image captioning. Mathematics, 10(3), 288.
Amirian, S., Rasheed, K., Taha, T. R., & Arabnia, H. R. (2019, December). Image captioning with generative adversarial network. In 2019 international conference on computational science and computational intelligence (CSCI) (pp. 272–275). IEEE.
Liu, X., Xu, Q., & Wang, N. (2019). A survey on deep neural network-based image captioning. The Visual Computer, 35(3), 445–470.
Sharma, H., & Jalal, A. S. (2020). Incorporating external knowledge for image captioning using CNN and LSTM. Modern Physics Letters B, 34(28), 2050315.
He, S., Liao, W., Tavakoli, H. R., Yang, M., Rosenhahn, B., & Pugeault, N. (2020). Image captioning through image transformer. In Proceedings of the Asian conference on computer vision.
Ueda, A., Yang, W., & Sugiura, K. (2023). Switching text-based image encoders for captioning images with text. IEEE Access. https://doi.org/10.1109/ACCESS.2023.3282444
Yang, M., Liu, J., Shen, Y., Zhao, Z., Chen, X., Wu, Q., & Li, C. (2020). An ensemble of generation-and retrieval-based image captioning with dual generator generative adversarial network. IEEE Transactions on Image Processing, 29, 9627–9640.
Zhang, M., Yang, Y., Zhang, H., Ji, Y., Shen, H. T., & Chua, T. S. (2018). More is better: Precise and detailed image captioning using online positive recall and missing concepts mining. IEEE Transactions on Image Processing, 28(1), 32–44.
Yang, R., Cui, X., Qin, Q., Deng, Z., Lan, R., & Luo, X. (2023). Fast RF-UIC: A fast unsupervised image captioning model. Displays, 79, 102490.
Lee, D. I., Lee, J. H., Jang, S. H., Oh, S. J., & Doo, I. C. (2023). Crop disease diagnosis with deep learning-based image captioning and object detection. Applied Sciences, 13(5), 3148.
Deepak, G., Gali, S., Sonker, A., Jos, B. C., Daya Sagar, K. V., & Singh, C. (2023). Automatic image captioning system using a deep learning approach. Soft Computing. https://doi.org/10.1007/s00500-023-08544-8
Selivanov, A., Rogov, O. Y., Chesakov, D., Shelmanov, A., Fedulova, I., & Dylov, D. V. (2023). Medical image captioning via generative pretrained transformers. Scientific Reports, 13(1), 4171.
MS COCO Captions Dataset. Papers With Code. https://paperswithcode.com/dataset/coco-captions
Funding
The authors state that they did not receive any funding for this study.
Author information
Authors and Affiliations
Contributions
JNC: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Resources, Data Curation, Writing – Original Draft, Writing – Review & Editing, Visualization, Supervision. GK: Conceptualization, Validation, Investigation, Resources, Writing – Review & Editing, Supervision.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Consent for Publication
Not applicable.
Ethical Approval
Not applicable.
Informed Consent
All individual participants provided informed consent.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chandar, J.N., Kavitha, G. Advanced Generative Deep Learning Techniques for Accurate Captioning of Images. Wireless Pers Commun (2024). https://doi.org/10.1007/s11277-024-11037-y