Abstract
Image captioning is the challenging task of generating descriptive sentences for images. Automatically annotating images with semantic concepts has made significant progress, yet existing frameworks have clear limitations, particularly in concept detection. Incomplete labelling caused by biased annotations, the use of synonyms in training captions, and the large gap between positive and negative concept samples all contribute to the problem and create a barrier to accurate image captioning. Unequal sample occurrences and missing training captions limit a model's ability to produce rich and varied image descriptions and lead to insufficient concept generation. To overcome these limitations, a novel approach is designed that automatically generates images using a Weighted Stacked Generative Adversarial Network (WSGAN). The generated images rectify the uneven distribution of concepts, thereby broadening the coverage of the training set. The proposed approach couples the WSGAN with a Deep Learning (DL) captioning model based on Gated Recurrent Units (GRU) and a Visual Attention Mechanism (VAM); this GRU-VAM model generates text captions for images. The model is trained by combining the MS COCO dataset with original and machine-generated images in numerous permutations. The WSGAN-generated images correct the imbalance and incompleteness of the training dataset, improving the model's ability to capture a wider variety of concepts. In testing and evaluation, the proposed WSGAN-GRU-VAM demonstrates significant improvements in image captioning metrics over existing models.
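The attention-driven GRU decoding described above can be illustrated with a minimal sketch: an additive (Bahdanau-style) visual attention step pools image region features into a context vector, which then drives one GRU update of the decoder state. This is not the authors' implementation; the dimensions, weight names, and random toy features are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(features, hidden, W_f, W_h, v):
    # Additive attention: score each image region against the decoder state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # shape (regions,)
    alpha = softmax(scores)                               # attention weights
    context = alpha @ features                            # weighted sum of region features
    return context, alpha

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    # One GRU update: gates decide how much of the previous state to keep.
    z = 1.0 / (1.0 + np.exp(-(x @ Wz + h @ Uz)))   # update gate
    r = 1.0 / (1.0 + np.exp(-(x @ Wr + h @ Ur)))   # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)       # candidate state
    return (1.0 - z) * h + z * h_tilde

# Toy dimensions: 5 image regions, 8-dim features, 8-dim hidden state.
R, D, H = 5, 8, 8
features = rng.normal(size=(R, D))
hidden = np.zeros(H)
W_f, W_h, v = rng.normal(size=(D, H)), rng.normal(size=(H, H)), rng.normal(size=H)
Wz, Wr, Wh = (rng.normal(size=(D, H)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(H, H)) for _ in range(3))

# One decoding step: attend over the image, then update the decoder state.
context, alpha = attention(features, hidden, W_f, W_h, v)
hidden = gru_step(context, hidden, Wz, Uz, Wr, Ur, Wh, Uh)
print(alpha)  # attention weights over the 5 regions; they sum to 1
```

In a full captioning model this step would repeat once per generated word, with the word embedding concatenated to the context vector and the hidden state projected onto the vocabulary.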
WSGAN-GRU-VAM outperforms well-known image captioning algorithms such as EnsCaption, Fast RF-UIC, RAGAN, and SAT-GPT-3 across several essential metrics. Average increases in BLEU (8%), METEOR (7%), CIDEr (9%), and ROUGE-L (6%) reflect the model's capacity to produce image captions with enhanced linguistic accuracy, relevance, and coherence.
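The BLEU score reported above can be illustrated with a minimal sketch of modified unigram precision (BLEU-1) with a brevity penalty. The example sentences are hypothetical, and this single-reference form is a simplification of the full corpus-level, multi-reference BLEU used in captioning benchmarks.

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Modified unigram precision with brevity penalty (single reference)."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # Clip each candidate word's count by its count in the reference,
    # so repeating a matched word cannot inflate the score.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("a dog runs on the grass", "a dog is running on the grass")
print(round(score, 3))
```

METEOR, CIDEr, and ROUGE-L follow the same spirit but add stemming/synonym matching, TF-IDF weighting of n-grams, and longest-common-subsequence recall, respectively.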
Data Availability
The models used in the present research are available from the corresponding authors on reasonable request.
References
Stefanini, M., Cornia, M., Baraldi, L., Cascianelli, S., Fiameni, G., & Cucchiara, R. (2022). From show to tell: A survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 539–559.
Ghandi, T., Pourreza, H., & Mahyar, H. (2023). Deep learning approaches on image captioning: A review. ACM Computing Surveys, 56(3), 1–39.
Chun, P. J., Yamane, T., & Maemura, Y. (2022). A deep learning-based image captioning method to automatically generate comprehensive explanations of bridge damage. Computer-Aided Civil and Infrastructure Engineering, 37(11), 1387–1401.
Castro, R., Pineda, I., Lim, W., & Morocho-Cayamcela, M. E. (2022). Deep learning approaches based on transformer architectures for image captioning tasks. IEEE Access, 10, 33679–33694.
Sharma, H., Agrahari, M., Singh, S. K., Firoj, M., & Mishra, R. K. (2020, February). Image captioning: a comprehensive survey. In 2020 international conference on power electronics & IoT applications in renewable energy and its control (PARC) (pp. 325–328). IEEE.
Oluwasammi, A., Aftab, M. U., Qin, Z., Ngo, S. T., Doan, T. V., Nguyen, S. B., & Nguyen, G. H. (2021). Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning. Complexity, 2021, 1–19.
Alzubi, J. A., Jain, R., Nagrath, P., Satapathy, S., Taneja, S., & Gupta, P. (2021). Deep image captioning using an ensemble of CNN and LSTM based deep neural networks. Journal of Intelligent & Fuzzy Systems, 40(4), 5761–5769.
Wang, Y., Xiao, B., Bouferguene, A., Al-Hussein, M., & Li, H. (2022). Vision-based method for semantic information extraction in construction by integrating deep learning object detection and image captioning. Advanced Engineering Informatics, 53, 101699.
Ming, Y., Hu, N., Fan, C., Feng, F., Zhou, J., & Yu, H. (2022). Visuals to text: A comprehensive review on automatic image captioning. IEEE/CAA Journal of Automatica Sinica, 9(8), 1339–1365.
Humaira, M., Shimul, P., Jim, M. A. R. K., Ami, A. S., & Shah, F. M. (2021). A hybridized deep learning method for Bengali image captioning. International Journal of Advanced Computer Science and Applications. https://doi.org/10.14569/IJACSA.2021.0120287
Makav, B., & Kılıç, V. (2019, November). A new image captioning approach for visually impaired people. In 2019 11th international conference on Electrical and Electronics Engineering (ELECO) (pp. 945–949). IEEE.
Hoxha, G., Melgani, F., & Demir, B. (2020). Toward remote sensing image retrieval under a deep image captioning perspective. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13, 4462–4475.
Yu, J., Li, J., Yu, Z., & Huang, Q. (2019). Multimodal transformer with multi-view visual representation for image captioning. IEEE Transactions on Circuits and Systems for Video Technology, 30(12), 4467–4480.
Sumbul, G., Nayak, S., & Demir, B. (2020). SD-RSIC: Summarization-driven deep remote sensing image captioning. IEEE Transactions on Geoscience and Remote Sensing, 59(8), 6922–6934.
Puscasiu, A., Fanca, A., Gota, D. I., & Valean, H. (2020, May). Automated image captioning. In 2020 IEEE international conference on automation, quality and testing, robotics (AQTR) (pp. 1–6). IEEE.
Xiong, Y., Du, B., & Yan, P. (2019). Reinforced transformer for medical image captioning. In Machine Learning in Medical Imaging: 10th International workshop, MLMI 2019, held in conjunction with MICCAI 2019, Shenzhen, China, October 13, 2019, Proceedings 10 (pp. 673–680). Springer International Publishing.
Xu, N., Zhang, H., Liu, A. A., Nie, W., Su, Y., Nie, J., & Zhang, Y. (2019). Multi-level policy and reward-based deep reinforcement learning framework for image captioning. IEEE Transactions on Multimedia, 22(5), 1372–1383.
Omri, M., Abdel-Khalek, S., Khalil, E. M., Bouslimi, J., & Joshi, G. P. (2022). Modeling of hyperparameter tuned deep learning model for automated image captioning. Mathematics, 10(3), 288.
Amirian, S., Rasheed, K., Taha, T. R., & Arabnia, H. R. (2019, December). Image captioning with generative adversarial network. In 2019 international conference on computational science and computational intelligence (CSCI) (pp. 272–275). IEEE.
Liu, X., Xu, Q., & Wang, N. (2019). A survey on deep neural network-based image captioning. The Visual Computer, 35(3), 445–470.
Sharma, H., & Jalal, A. S. (2020). Incorporating external knowledge for image captioning using CNN and LSTM. Modern Physics Letters B, 34(28), 2050315.
He, S., Liao, W., Tavakoli, H. R., Yang, M., Rosenhahn, B., & Pugeault, N. (2020). Image captioning through image transformer. In Proceedings of the Asian conference on computer vision.
Ueda, A., Yang, W., & Sugiura, K. (2023). Switching text-based image encoders for captioning images with text. IEEE Access. https://doi.org/10.1109/ACCESS.2023.3282444
Yang, M., Liu, J., Shen, Y., Zhao, Z., Chen, X., Wu, Q., & Li, C. (2020). An ensemble of generation-and retrieval-based image captioning with dual generator generative adversarial network. IEEE Transactions on Image Processing, 29, 9627–9640.
Zhang, M., Yang, Y., Zhang, H., Ji, Y., Shen, H. T., & Chua, T. S. (2018). More is better: Precise and detailed image captioning using online positive recall and missing concepts mining. IEEE Transactions on Image Processing, 28(1), 32–44.
Yang, R., Cui, X., Qin, Q., Deng, Z., Lan, R., & Luo, X. (2023). Fast RF-UIC: A fast unsupervised image captioning model. Displays, 79, 102490.
Lee, D. I., Lee, J. H., Jang, S. H., Oh, S. J., & Doo, I. C. (2023). Crop disease diagnosis with deep learning-based image captioning and object detection. Applied Sciences, 13(5), 3148.
Deepak, G., Gali, S., Sonker, A., Jos, B. C., Daya Sagar, K. V., & Singh, C. (2023). Automatic image captioning system using a deep learning approach. Soft Computing. https://doi.org/10.1007/s00500-023-08544-8
Selivanov, A., Rogov, O. Y., Chesakov, D., Shelmanov, A., Fedulova, I., & Dylov, D. V. (2023). Medical image captioning via generative pretrained transformers. Scientific Reports, 13(1), 4171.
MS COCO Captions Dataset. Papers With Code. https://paperswithcode.com/dataset/coco-captions
Funding
The authors state that they did not receive any funding for this study.
Author information
Authors and Affiliations
Contributions
JNC: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Resources, Data Curation, Writing – Original Draft, Writing – Review & Editing, Visualization, Supervision. GK: Conceptualization, Validation, Investigation, Resources, Writing – Review & Editing, Supervision.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Consent for Publication
Not applicable.
Ethical Approval
Not applicable.
Informed Consent
All individual participants provided informed consent.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chandar, J.N., Kavitha, G. Advanced Generative Deep Learning Techniques for Accurate Captioning of Images. Wireless Pers Commun (2024). https://doi.org/10.1007/s11277-024-11037-y