
Collaborative strategy network for spatial attention image captioning

Published in Applied Intelligence


Abstract

Automatic image captioning lies at the intersection of computer vision and natural language processing. Although reinforcement-learning-based image captioning has made significant progress in recent years, the mismatch between the metric optimized during training and the metrics used at test time remains a problem. Because reinforcement learning optimizes a single metric, the generated captions tend to be monotonous and undistinctive, and the model fails to reflect the diversity among images. To address these problems, we design a novel image captioning model based on lightweight spatial attention and a generative adversarial network. The lightweight spatial attention module discards coarse-grained maximum pooling after convolution and instead transforms the spatial information so that key information in the feature map is preserved. The adversarial game between the generator and the discriminator is then used to optimize the model's evaluation metrics. Finally, we design a discriminator network that cooperates with reinforcement learning to update the model parameters and to reduce the inconsistency between training and evaluation metrics. We verify the effectiveness of the proposed model on the MS-COCO and Flickr30k datasets, and the experimental results show that it achieves state-of-the-art performance.
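The abstract describes two mechanisms only at a high level: a lightweight spatial attention layer that avoids coarse max pooling, and a discriminator whose feedback is combined with reinforcement learning to align training with the evaluation metrics. The PyTorch snippet below is a purely illustrative sketch of those two ideas, not the authors' released code; the module name, the use of a smooth log-sum-exp statistic in place of max pooling, and the mixing weight alpha are our assumptions.

    import torch
    import torch.nn as nn

    class LightweightSpatialAttention(nn.Module):
        """Re-weight a CNN feature map with a per-location attention mask.
        A smooth log-sum-exp channel statistic is used together with the
        channel mean instead of a hard max, so fine spatial detail survives."""

        def __init__(self, kernel_size=7):
            super().__init__()
            # 2 input channels: mean descriptor and log-sum-exp descriptor
            self.conv = nn.Conv2d(2, 1, kernel_size,
                                  padding=kernel_size // 2, bias=False)

        def forward(self, feats):
            # feats: (B, C, H, W) feature map from the CNN encoder
            avg_desc = feats.mean(dim=1, keepdim=True)              # (B, 1, H, W)
            lse_desc = torch.logsumexp(feats, dim=1, keepdim=True)  # smooth stand-in for max pooling
            attn = torch.sigmoid(self.conv(torch.cat([avg_desc, lse_desc], dim=1)))
            return feats * attn                                     # same shape as the input

    def mixed_reward(metric_score, disc_prob, alpha=0.7):
        """Reward for a self-critical policy-gradient update: a language
        metric (e.g. CIDEr) blended with the discriminator's probability
        that the sampled caption is human-written. alpha is illustrative."""
        return alpha * metric_score + (1.0 - alpha) * disc_prob

Applied to a (batch, channels, H, W) encoder feature map, the module returns a tensor of the same shape, re-weighted location by location; the blended reward can then drive a standard self-critical policy-gradient update of the caption generator.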


Availability of Data and Material

The authors confirm that all data and materials support the published claims and comply with field standards.

Code Availability

We are preparing to upload the model code to GitHub.


Funding

This work was supported by the National Natural Science Foundation of China (Nos. 61866014, 61663014, 61966005, 61962017, 617512134) and the Guangxi Natural Science Foundation (Nos. 2018GXNSFDA281019, 2017GXNSFAA198315, 2016GXNSFAA380156, 2018GXNSFDA294011).

Author information

Corresponding author

Correspondence to Jing Yang.

Ethics declarations

Ethics approval

This paper complies with the ethical standards of the journal.

Consent to Participate

All authors have reviewed the manuscript and consented to its submission to the journal.

Consent for Publication

If the paper is accepted, all authors consent to its publication in the journal.

Conflict of Interest

No conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Zhou, D., Yang, J. & Bao, R. Collaborative strategy network for spatial attention image captioning. Appl Intell 52, 9017–9032 (2022). https://doi.org/10.1007/s10489-021-02943-w
