
Tell and guess: cooperative learning for natural image caption generation with hierarchical refined attention

Published in: Multimedia Tools and Applications

Abstract

Automatically generating a natural language description of an image is one of the most fundamental and challenging problems in Multimedia Intelligence, because it translates information between two different modalities, and such translation requires the ability to understand both. Existing image captioning models have already achieved remarkable performance. However, they rely heavily on the Encoder-Decoder framework, a unidirectional translation that is hard to improve further. In this paper, we design the “Tell and Guess” Cooperative Learning model with a Hierarchical Refined Attention mechanism (CL-HRA), which improves performance bidirectionally to generate more informative captions. The Cooperative Learning (CL) method combines an image caption module (ICM) with an image retrieval module (IRM): the ICM is responsible for the “Tell” function, generating informative natural language descriptions for a given image, while the IRM “Guesses”, trying to select that image from a lineup of images based on the generated description. Such cooperation mutually improves the learning of the two modules. In addition, the Hierarchical Refined Attention (HRA) learns to selectively attend to high-level attributes and low-level visual features, and incorporates them into CL to bridge the gap between image and caption. HRA pays different attention at different semantic levels to refine the visual representation, while CL, with its human-like mindset, is more interpretable and generates captions more closely related to the corresponding image. Experimental results on the Microsoft COCO dataset show the effectiveness of CL-HRA in terms of several popular image caption generation metrics.
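To make the abstract's two ideas concrete, the sketch below shows one plausible way to implement a two-level attention module over region features and attribute embeddings, together with a joint “Tell” (captioning) plus “Guess” (in-batch image retrieval) loss, in PyTorch. The module names, dimensions, temperature, and loss weight are illustrative assumptions on our part; this is a minimal sketch, not the authors' released implementation.

```python
# Illustrative sketch only: names, dimensions, and hyperparameters are
# assumptions, not the CL-HRA reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalRefinedAttention(nn.Module):
    """Attends separately to low-level region features and high-level
    attribute embeddings, then fuses the two refined context vectors."""

    def __init__(self, feat_dim, attr_dim, hidden_dim):
        super().__init__()
        self.visual_score = nn.Linear(feat_dim + hidden_dim, 1)
        self.attr_score = nn.Linear(attr_dim + hidden_dim, 1)
        self.fuse = nn.Linear(feat_dim + attr_dim, hidden_dim)

    def forward(self, region_feats, attr_embeds, h):
        # region_feats: (B, R, feat_dim), attr_embeds: (B, A, attr_dim)
        # h: (B, hidden_dim) current decoder hidden state
        def attend(feats, score_fn):
            h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)
            logits = score_fn(torch.cat([feats, h_exp], dim=-1)).squeeze(-1)
            alpha = F.softmax(logits, dim=-1)                # attention weights
            return (alpha.unsqueeze(-1) * feats).sum(dim=1)  # weighted summary

        low = attend(region_feats, self.visual_score)   # low-level visual context
        high = attend(attr_embeds, self.attr_score)     # high-level attribute context
        return torch.tanh(self.fuse(torch.cat([low, high], dim=-1)))


def tell_and_guess_loss(tell_logits, target_tokens, caption_embeds, image_embeds,
                        guess_weight=0.5, temperature=0.07):
    """'Tell': word-level cross-entropy for the caption module.
    'Guess': each caption embedding must pick its own image out of the
    in-batch lineup, phrased as a retrieval-style cross-entropy."""
    tell = F.cross_entropy(tell_logits.reshape(-1, tell_logits.size(-1)),
                           target_tokens.reshape(-1), ignore_index=0)
    scores = (F.normalize(caption_embeds, dim=-1)
              @ F.normalize(image_embeds, dim=-1).t())       # (B, B) similarities
    targets = torch.arange(scores.size(0), device=scores.device)
    guess = F.cross_entropy(scores / temperature, targets)
    return tell + guess_weight * guess


if __name__ == "__main__":
    B, R, A, T, V = 4, 36, 10, 12, 1000
    hra = HierarchicalRefinedAttention(feat_dim=2048, attr_dim=300, hidden_dim=512)
    ctx = hra(torch.randn(B, R, 2048), torch.randn(B, A, 300), torch.randn(B, 512))
    loss = tell_and_guess_loss(torch.randn(B, T, V), torch.randint(1, V, (B, T)),
                               torch.randn(B, 512), torch.randn(B, 512))
    print(ctx.shape, loss.item())
```

In this sketch the retrieval term plays the “Guess” role: gradients from the lineup loss flow back into whatever network produced the caption embeddings, which is the intended bidirectional effect; the actual interfaces between the ICM and IRM are not specified here.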




Acknowledgements

This work has been supported in part by NSFC (No. 61751209, U1611461), the Hikvision-Zhejiang University Joint Research Center, the Chinese Knowledge Center of Engineering Science and Technology (CKCEST), and the Engineering Research Center of Digital Library, Ministry of Education.

Author information

Corresponding author

Correspondence to Siliang Tang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhang, W., Tang, S., Su, J. et al. Tell and guess: cooperative learning for natural image caption generation with hierarchical refined attention. Multimed Tools Appl 80, 16267–16282 (2021). https://doi.org/10.1007/s11042-020-08832-7

