Deep Captioning with Attention-Based Visual Concept Transfer Mechanism for Enriching Description


In this paper, we propose a novel deep captioning framework called Attention-based multimodal recurrent neural network with Visual Concept Transfer Mechanism (A-VCTM). The proposed A-VCTM has three advantages. (1) A multimodal layer integrates the visual representation and the context representation, directly connecting context information with visual information. (2) An attention mechanism leads the model to focus on the image regions corresponding to the next word to be generated. (3) We propose a visual concept transfer mechanism to generate novel visual concepts and enrich the description sentences. Qualitative and quantitative results on two standard benchmarks, MSCOCO and Flickr30K, show the effectiveness and practicability of the proposed A-VCTM framework.
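To make the first two components concrete, the following is a minimal sketch of soft attention over region features followed by a simple multimodal fusion step. It is not the A-VCTM implementation: the dot-product scoring, the sum-then-tanh fusion, and the names `attend` and `multimodal_fuse` are all assumptions chosen for illustration, and the toy vectors are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(regions, context):
    """Soft attention (assumed dot-product scoring, not the paper's formula).

    regions: list of region feature vectors, all of the same length
    context: context vector of that same length
    Returns the attention weights and the weighted sum of the regions.
    """
    scores = [sum(r_i * c_i for r_i, c_i in zip(r, context)) for r in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    attended = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
    return weights, attended

def multimodal_fuse(visual, context):
    """Hypothetical fusion: element-wise sum followed by tanh."""
    return [math.tanh(v + c) for v, c in zip(visual, context)]

# Toy example: three 2-d region features and one 2-d context vector.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = [2.0, 0.0]
weights, attended = attend(regions, context)
fused = multimodal_fuse(attended, context)
```

In this toy setup the regions that align with the context vector receive larger weights, so the fused vector is dominated by them. A real model would learn projection matrices for both modalities before fusing them.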








This work was supported in part by the National Natural Science Foundation of China under Grant 61673402, Grant 61273270, and Grant 60802069, in part by the Natural Science Foundation of Guangdong under Grant 2017A030311029 and Grant 2016B010109002, in part by the Science and Technology Program of Guangzhou under Grant 201704020180 and Grant 201604020024, and in part by the Fundamental Research Funds for the Central Universities of China.

Author information



Corresponding author

Correspondence to Haifeng Hu.


Cite this article

Zhang, J., Hu, H. Deep Captioning with Attention-Based Visual Concept Transfer Mechanism for Enriching Description. Neural Process Lett 50, 1891–1905 (2019).



  • Image captioning
  • Visual concepts transfer mechanism
  • Attention mechanism
  • Multimodal fusion