Image captioning: from structural tetrad to translated sentences

  • Rui Guo
  • Shubo Ma
  • Yahong Han


Generating semantic descriptions for images has become increasingly prevalent in recent years. A sentence that contains objects together with their attributes and the activity or scene involved is more informative and expresses more of the image semantics. In this paper, we focus on generating image descriptions from structural words that we have recognized, i.e., a semantically layered structural tetrad of <object, attribute, activity, scene>. We propose to use a deep machine translation method to generate semantically meaningful descriptions. In particular, the generated sentences describe objects with their attributes, such as color and size, and the corresponding activities or scenes. We propose to use a multi-task learning method to recognize the structural words. Taking the word sequence as the source language, we train an LSTM encoder-decoder machine translation model to output the target caption. To demonstrate the effectiveness of the multi-task learning method for generating structural words, we conduct experiments on the benchmark datasets aPascal and aYahoo. We also use the UIUC Pascal, Flickr8k, Flickr30k, and MSCOCO datasets to show that translating structural words into sentences achieves promising performance compared with state-of-the-art image captioning methods in terms of language generation metrics.
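The idea of treating the structural tetrad as a source-language sentence can be illustrated with a small sketch. It only shows how one tetrad might be linearized into a token sequence and mapped to integer ids before being fed to an encoder; it is not the authors' implementation. The names `Tetrad`, `build_vocab`, and `encode`, the slot order, the special tokens, and the example words are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Special tokens are an assumption; the paper's abstract does not specify them.
SPECIALS = ["<pad>", "<unk>", "<bos>", "<eos>"]


@dataclass
class Tetrad:
    """Hypothetical container for one <object, attribute, activity, scene> tetrad."""
    obj: str
    attribute: str
    activity: str
    scene: str

    def to_source_tokens(self) -> List[str]:
        # Keep the slot order of the tetrad and skip slots that were not recognized.
        slots = [self.obj, self.attribute, self.activity, self.scene]
        return [s for s in slots if s]


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every token, reserving the first ids for specials."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to ids, fall back to <unk>, and append <eos>."""
    unk = vocab["<unk>"]
    return [vocab.get(t, unk) for t in tokens] + [vocab["<eos>"]]


if __name__ == "__main__":
    # Illustrative words only; not taken from any of the datasets named above.
    tetrad = Tetrad(obj="dog", attribute="brown", activity="running", scene="grass")
    source = tetrad.to_source_tokens()
    vocab = build_vocab([source])
    print(source)
    print(encode(source, vocab))
```

In a full pipeline, the id sequence would be the encoder input and the reference caption (for example "a brown dog is running on the grass") would be the decoder target; the recognition of the four slots themselves is a separate multi-task step that this sketch does not cover.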


Image description · Structural words · Multi-task · LSTM · Machine translation



This work is supported by the NSFC (under Grants U1509206, 61472276, and 61876130) and the Tianjin Natural Science Foundation (no. 15JCYBJC15400).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. National Engineering Research Center of Turbo Generator Vibration, School of Energy and Environment, Southeast University, Nanjing, China
  2. College of Intelligence and Computing, Tianjin University, Tianjin, China
