
Multi-modal gated recurrent units for image description

  • Xuelong Li
  • Aihong Yuan
  • Xiaoqiang Lu

Abstract

Using a natural language sentence to describe the content of an image is a challenging but very important task. It is challenging because a description must not only capture the objects contained in the image and the relationships among them, but also be relevant and grammatically correct. In this paper, we propose a multi-modal embedding model based on gated recurrent units (GRUs) that can generate a variable-length description for a given image. In the training step, we apply a convolutional neural network (CNN) to extract the image feature. The feature is then fed into the multi-modal GRU together with the corresponding sentence representation, and the multi-modal GRU learns the inter-modal relations between image and sentence. In the testing step, when an image is fed into our multi-modal GRU model, a sentence describing the image content is generated. The experimental results demonstrate that our multi-modal GRU model achieves state-of-the-art performance on the Flickr8K, Flickr30K and MS COCO datasets.
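To make the described pipeline concrete, the following is a minimal sketch, not the authors' implementation: a single-layer multi-modal GRU captioner in PyTorch that projects a pretrained CNN feature into the word-embedding space, feeds it to the GRU together with the caption words during training, and decodes greedily at test time. The class name, dimensions, and the greedy decoder are assumptions made for illustration.

```python
# A minimal sketch (assumptions, not the authors' code): CNN feature + multi-modal GRU captioner.
import torch
import torch.nn as nn

class MultiModalGRUCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, img_feat_dim=4096):
        super().__init__()
        # Project the CNN image feature (e.g. a VGG fc7 vector) into the word-embedding space.
        self.img_proj = nn.Linear(img_feat_dim, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # next-word logits

    def forward(self, img_feat, captions):
        # img_feat: (B, img_feat_dim) from a pretrained CNN; captions: (B, T) word indices.
        img_emb = self.img_proj(img_feat).unsqueeze(1)   # (B, 1, embed_dim)
        word_emb = self.word_embed(captions)             # (B, T, embed_dim)
        seq = torch.cat([img_emb, word_emb], dim=1)      # image acts as the first "token"
        hidden, _ = self.gru(seq)
        return self.out(hidden)                          # (B, T+1, vocab_size) logits

    @torch.no_grad()
    def generate(self, img_feat, start_id, end_id, max_len=20):
        # Greedy decoding: condition the GRU on the image, then feed back each prediction.
        out, h = self.gru(self.img_proj(img_feat).unsqueeze(1), None)
        token = torch.full((img_feat.size(0),), start_id,
                           dtype=torch.long, device=img_feat.device)
        words = []
        for _ in range(max_len):
            out, h = self.gru(self.word_embed(token).unsqueeze(1), h)
            token = self.out(out[:, -1]).argmax(dim=-1)  # most probable next word
            words.append(token)
            if (token == end_id).all():
                break
        return torch.stack(words, dim=1)                 # (B, <=max_len) predicted word ids
```

In such a setup, training would minimize the cross-entropy between the returned logits and the caption shifted by one position; the sketch omits the CNN itself and the data pipeline, which the paper obtains from a pretrained image-classification network and the captioning datasets, respectively.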

Keywords

Image description · Gated recurrent unit · Convolutional neural network · Multi-modal embedding

Notes

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61761130079, in part by the Key Research Program of Frontier Sciences, CAS under Grant QYZDY-SSW-JSC044, in part by the National Natural Science Foundation of China under Grant 61472413, in part by the National Natural Science Foundation of China under Grant 61772510, and in part by the Young Top-notch Talent Program of Chinese Academy of Sciences under Grant QYZDB-SSW-JSC015.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Center for OPTical IMagery Analysis and Learning (OPTIMAL), Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an, People’s Republic of China
  2. University of Chinese Academy of Sciences, Beijing, People’s Republic of China
