The Visual Computer, Volume 35, Issue 3, pp 445–470

A survey on deep neural network-based image captioning

  • Xiaoxiao Liu
  • Qingyang Xu
  • Ning Wang
Survey

Abstract

Image captioning is a hot topic in image understanding, and it is composed of two natural parts ("look" and "language expression") that correspond to two of the most important fields of artificial intelligence ("machine vision" and "natural language processing"). With the development of deep neural networks and better-labeled datasets, image captioning techniques have advanced rapidly. In this survey, image captioning approaches and improvements based on deep neural networks are introduced, including the characteristics of the specific techniques. The earliest deep neural network-based approach to image captioning was the retrieval-based method, which uses a search technique to find an appropriate description for a given image. The template-based method separates the image captioning process into object detection and sentence generation. Recently, end-to-end learning-based methods have been shown to be effective at image captioning and can generate more flexible and fluent sentences. The image captioning methods are reviewed in detail, and some remaining challenges are discussed.
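As a concrete illustration of the end-to-end approach summarized above, the sketch below pairs pooled CNN image features with an LSTM language model, in the spirit of the encoder-decoder captioners this survey reviews. The class name, layer sizes, and toy usage are illustrative assumptions, not the exact architecture of any particular surveyed model.

```python
# Minimal sketch of an end-to-end captioner: pooled CNN features feed an LSTM decoder
# as the first "word"; subsequent steps consume the caption tokens. Illustrative only.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)   # project CNN features into word-embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):
        # img_feats: (B, feat_dim) pooled image features; captions: (B, T) token ids
        img_token = self.img_proj(img_feats).unsqueeze(1)   # image acts as the first input step
        words = self.embed(captions)                        # (B, T, embed_dim)
        inputs = torch.cat([img_token, words], dim=1)       # (B, T+1, embed_dim)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                             # (B, T+1, vocab_size) logits

# Toy usage with random tensors; in practice a pretrained CNN (e.g., a ResNet) supplies img_feats,
# and training minimizes cross-entropy between the logits and the shifted ground-truth caption.
decoder = CaptionDecoder(vocab_size=10000)
feats = torch.randn(4, 2048)
caps = torch.randint(0, 10000, (4, 12))
logits = decoder(feats, caps)   # shape (4, 13, 10000)
```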

Keywords

Image captioning · Image understanding · Object detection · Language model · Attention mechanism · Dense captioning

Notes

Acknowledgements

The authors would like to thank the two anonymous reviewers and the editor-in-chief for their comments, which helped improve the paper. This work is supported by the National Natural Science Foundation of China (under Grants 61603214, 61573213, 51009017 and 51379002), the Shandong Provincial Key Research and Development Plan (2018GGX101039), the Shandong Provincial Natural Science Foundation (ZR2015PF009, 2016ZRE2703), the Fund for Dalian Distinguished Young Scholars (under Grant 2016RJ10), the Innovation Support Plan for Dalian High-level Talents (under Grant 2015R065), and the Fundamental Research Funds for the Central Universities (under Grants 3132016314 and 3132018126).

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. School of Mechanical, Electrical and Information Engineering, Shandong University, Weihai, People's Republic of China
  2. Marine Engineering College, Dalian Maritime University, Dalian, People's Republic of China
