Multimedia Tools and Applications, Volume 78, Issue 22, pp 32187–32237

Survey of deep learning and architectures for visual captioning—transitioning between media and natural languages

Chiranjib Sur

Abstract

Deep learning architectures have been among the most intensively researched topics of this decade because of their ability to scale up and solve problems that could not be solved before. Meanwhile, many NLP applications have emerged, and there is a need to understand how the underlying concepts have gradually evolved since the perceptron was introduced in 1958. This document provides a detailed description of this lineage, starting from the artificial neural network and its roots in computational neuroscience, and of how researchers, reflecting on the drawbacks of earlier architectures, paved the way for modern deep learning. Modern deep learning has grown far beyond what was envisioned decades ago and has been extended to architectures of exceptional capability, scalability and precision. This document provides an overview of that continuing line of work and deals specifically with applications related to natural language processing and to visual and media content.
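
As a concrete illustration of the media-to-language transition named in the title, below is a minimal sketch of the CNN-encoder/RNN-decoder pattern that underlies many of the captioning systems surveyed (in the spirit of Show and Tell-style models): a fixed-length visual feature vector conditions a recurrent language model that emits a caption token by token. The module name, dimensions, and random toy data here are illustrative assumptions for exposition, not the configuration of any particular published model.

```python
import torch
import torch.nn as nn

# Minimal sketch of a CNN-encoder / RNN-decoder captioning model.
# All names, dimensions, and the random stand-in data are illustrative
# assumptions, not taken from any specific system covered in the survey.

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)  # map CNN features into word-embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):
        # The projected image feature acts as the first "word" of the
        # sequence, so the LSTM state is conditioned on the visual content.
        img_tok = self.img_proj(img_feats).unsqueeze(1)           # (B, 1, E)
        seq = torch.cat([img_tok, self.embed(captions)], dim=1)   # (B, T+1, E)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                                   # (B, T+1, V) logits

# Toy forward/backward pass on random stand-in data.
B, T, V = 4, 7, 1000
model = CaptionDecoder(vocab_size=V)
feats = torch.randn(B, 2048)          # stand-in for pooled CNN image features
caps = torch.randint(0, V, (B, T))    # stand-in for tokenized ground-truth captions
logits = model(feats, caps)
# Each position predicts the next caption token; the image slot predicts word 0.
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, V), caps.reshape(-1))
loss.backward()
print(float(loss))
```

Attention-based variants of this pattern replace the single pooled feature vector with a grid or set of region features that the decoder re-weights at every step; the survey treats that family in detail.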

Keywords

Neural network · Deep learning · Natural language processing · Visual features · Representation learning · Sequential memory network


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

Computer & Information Science & Engineering Department, University of Florida, Gainesville, USA