Encyclopedia of Machine Learning and Data Mining

2017 Edition
| Editors: Claude Sammut, Geoffrey I. Webb

Deep Learning

  • Jürgen Schmidhuber
Reference work entry
DOI: https://doi.org/10.1007/978-1-4899-7687-1_909

Abstract

Deep learning artificial neural networks have won numerous contests in pattern recognition and machine learning. They are now widely used by the world's most valuable public companies. I review the most popular algorithms for feedforward and recurrent networks and their history.


Recommended Reading

  1. Aizenberg I, Aizenberg NN, Vandewalle JPL (2000) Multi-valued and universal binary neurons: theory, learning and applications. Springer, Boston. First work to introduce the term "Deep Learning" to neural networks
  2. AMAmemory (2015) Answer at reddit AMA (Ask Me Anything) on "memory networks" etc (with references). http://www.reddit.com/r/MachineLearning/comments/2xcyrl/i_am_j%C3%BCrgen_schmidhuber_ama/cp0q12t
  3. Amari S-I (1998) Natural gradient works efficiently in learning. Neural Comput 10(2):251–276
  4. Baird H (1990) Document image defect models. In: Proceedings of IAPR workshop on syntactic and structural pattern recognition, Murray Hill
  5. Baldi P, Pollastri G (2003) The principled design of large-scale recursive neural network architectures – DAG-RNNs and the protein structure prediction problem. J Mach Learn Res 4:575–602
  6. Ballard DH (1987) Modular learning in neural networks. In: Proceedings of AAAI, Seattle, pp 279–284
  7. Barlow HB, Kaushal TP, Mitchison GJ (1989) Finding minimum entropy codes. Neural Comput 1(3):412–423
  8. Bayer J, Wierstra D, Togelius J, Schmidhuber J (2009) Evolving memory cell structures for sequence learning. In: Proceedings of ICANN, vol 2. Springer, Berlin/New York, pp 755–764
  9. Behnke S (1999) Hebbian learning and competition in the neural abstraction pyramid. In: Proceedings of IJCNN, vol 2. Washington, pp 1356–1361
  10. Behnke S (2003) Hierarchical neural networks for image interpretation. Lecture notes in computer science, vol LNCS 2766. Springer, Berlin/New York
  11. Bengio Y, Lamblin P, Popovici D, Larochelle H (2007) Greedy layer-wise training of deep networks. In: Cowan JD, Tesauro G, Alspector J (eds) Proceedings of NIPS 19, MIT Press, Cambridge, pp 153–160
  12. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
  13. Bryson AE (1961) A gradient method for optimizing multi-stage allocation processes. In: Proceedings of Harvard university symposium on digital computers and their applications, Harvard University Press, Cambridge
  14. Bryson A, Ho Y (1969) Applied optimal control: optimization, estimation, and control. Blaisdell Publishing Company, Washington
  15. Cho K, Ilin A, Raiko T (2012) Tikhonov-type regularization for restricted Boltzmann machines. In: Proceedings of ICANN 2012, Springer, Berlin/New York, pp 81–88
  16. Ciresan DC, Meier U, Gambardella LM, Schmidhuber J (2010) Deep big simple neural nets for handwritten digit recognition. Neural Comput 22(12):3207–3220
  17. Ciresan DC, Meier U, Masci J, Gambardella LM, Schmidhuber J (2011) Flexible, high performance convolutional neural networks for image classification. In: Proceedings of IJCAI, pp 1237–1242
  18. Ciresan DC, Giusti A, Gambardella LM, Schmidhuber J (2012a) Deep neural networks segment neuronal membranes in electron microscopy images. In: Proceedings of NIPS, Quebec City, pp 2852–2860
  19. Ciresan DC, Meier U, Masci J, Schmidhuber J (2012b) Multi-column deep neural network for traffic sign classification. Neural Netw 32:333–338
  20. Ciresan DC, Meier U, Schmidhuber J (2012c) Multi-column deep neural networks for image classification. In: Proceedings of CVPR 2012. Long preprint arXiv:1202.2745v1 [cs.CV]
  21. Ciresan DC, Giusti A, Gambardella LM, Schmidhuber J (2013) Mitosis detection in breast cancer histology images with deep neural networks. In: Proceedings of MICCAI, vol 2. Nagoya, pp 411–418
  22. Coates A, Huval B, Wang T, Wu DJ, Ng AY, Catanzaro B (2013) Deep learning with COTS HPC systems. In: Proceedings of ICML'13
  23. Dechter R (1986) Learning while searching in constraint-satisfaction problems. University of California, Computer Science Department, Cognitive Systems Laboratory. First paper to introduce the term "Deep Learning" to machine learning; compare a popular G+ post on this. https://plus.google.com/100849856540000067209/posts/7N6z251w2Wd?pid=6127540521703625346&oid=100849856540000067209
  24. Dreyfus SE (1962) The numerical solution of variational problems. J Math Anal Appl 5(1):30–45
  25. Dreyfus SE (1973) The computational solution of optimal control problems with time lag. IEEE Trans Autom Control 18(4):383–385
  26. Fan B, Wang L, Soong FK, Xie L (2015) Photo-real talking head with deep bidirectional LSTM. In: Proceedings of ICASSP 2015, Brisbane
  27. Farabet C, Couprie C, Najman L, LeCun Y (2013) Learning hierarchical features for scene labeling. IEEE Trans Pattern Anal Mach Intell 35(8):1915–1929
  28. Fernandez S, Graves A, Schmidhuber J (2007a) An application of recurrent neural networks to discriminative keyword spotting. In: Proceedings of ICANN, vol 2, pp 220–229
  29. Fernandez S, Graves A, Schmidhuber J (2007b) Sequence labelling in structured domains with hierarchical recurrent neural networks. In: Proceedings of IJCAI
  30. Fu KS (1977) Syntactic pattern recognition and applications. Springer, Berlin
  31. Fukushima K (1979) Neural network model for a mechanism of pattern recognition unaffected by shift in position – neocognitron. Trans IECE J62-A(10):658–665
  32. Gers FA, Schmidhuber J (2001) LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Trans Neural Netw 12(6):1333–1340
  33. Gerstner W, Kistler WK (2002) Spiking neuron models. Cambridge University Press, Cambridge
  34. Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier networks. In: Proceedings of AISTATS, vol 15. Fort Lauderdale, pp 315–323
  35. Goodfellow IJ, Warde-Farley D, Mirza M, Courville A, Bengio Y (2013) Maxout networks. In: Proceedings of ICML, Atlanta
  36. Goodfellow IJ, Bulatov Y, Ibarz J, Arnoud S, Shet V (2014b) Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082v4
  37. Goller C, Küchler A (1996) Learning task-dependent distributed representations by backpropagation through structure. In: IEEE international conference on neural networks 1996, vol 1, pp 347–352
  38. Graves A, Fernandez S, Gomez FJ, Schmidhuber J (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural nets. In: Proceedings of ICML'06, Pittsburgh, pp 369–376
  39. Graves A, Liwicki M, Fernandez S, Bertolami R, Bunke H, Schmidhuber J (2009) A novel connectionist system for improved unconstrained handwriting recognition. IEEE Trans Pattern Anal Mach Intell 31(5):855–868
  40. Graves A, Mohamed A-R, Hinton GE (2013) Speech recognition with deep recurrent neural networks. In: Proceedings of ICASSP, Vancouver, pp 6645–6649
  41. Hannun A, Case C, Casper J, Catanzaro B, Diamos G, Elsen E, Prenger R, Satheesh S, Sengupta S, Coates A, Ng AY (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint http://arxiv.org/abs/1412.5567
  42. Hanson SJ, Pratt LY (1989) Comparing biases for minimal network construction with back-propagation. In: Touretzky DS (ed) Proceedings of NIPS, vol 1. Morgan Kaufmann, San Mateo, pp 177–185
  43. Hanson SJ (1990) A stochastic version of the delta rule. Phys D: Nonlinear Phenom 42(1):265–272
  44. Hastie TJ, Tibshirani RJ (1990) Generalized additive models, vol 43. CRC Press
  45. Hebb DO (1949) The organization of behavior. Wiley, New York
  46. Herrero J, Valencia A, Dopazo J (2001) A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 17(2):126–136
  47. Hinton G, Salakhutdinov R (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
  48. Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14(8):1771–1800
  49. Hinton GE, Osindero S, Teh Y-W (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
  50. Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR (2012b) Improving neural networks by preventing co-adaptation of feature detectors. Technical report. arXiv:1207.0580
  51. Hochreiter S (1991) Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut fuer Informatik, Lehrstuhl Prof. Brauer, Tech. Univ. Munich. Advisor: J. Schmidhuber
  52. Hochreiter S, Schmidhuber J (1997a) Flat minima. Neural Comput 9(1):1–42
  53. Hochreiter S, Schmidhuber J (1997b) Long short-term memory. Neural Comput 9(8):1735–1780. Based on TR FKI-207-95, TUM (1995)
  54. Hochreiter S, Schmidhuber J (1999) Feature extraction through LOCOCODE. Neural Comput 11(3):679–714
  55. Hodgkin AL, Huxley AF (1952) A quantitative description of membrane current and its application to conduction and excitation in nerve. J Physiol 117(4):500
  56. Hutter M (2005) Universal artificial intelligence: sequential decisions based on algorithmic probability. Springer, Berlin
  57. Ivakhnenko AG, Lapa VG (1965) Cybernetic predicting devices. CCM Information Corporation, New York
  58. Ivakhnenko AG (1971) Polynomial theory of complex systems. IEEE Trans Syst Man Cybern (4):364–378
  59. Jaeger H (2004) Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science 304:78–80
  60. Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231
  61. Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J AI Res 4:237–285
  62. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of CVPR, Columbus
  63. Kelley HJ (1960) Gradient theory of optimal flight paths. ARS J 30(10):947–954
  64. Khan SH, Bennamoun M, Sohel F, Togneri R (2014) Automatic feature learning for robust shadow detection. In: Proceedings of CVPR, Columbus
  65. Koikkalainen P, Oja E (1990) Self-organizing hierarchical feature maps. In: Proceedings of IJCNN, pp 279–284
  66. Koutnik J, Greff K, Gomez F, Schmidhuber J (2014) A clockwork RNN. In: Proceedings of ICML, vol 32, pp 1845–1853. arXiv:1402.3511 [cs.NE]
  67. Kramer M (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE J 37:233–243
  68. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proceedings of NIPS, Nevada, p 4
  69. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Back-propagation applied to handwritten zip code recognition. Neural Comput 1(4):541–551
  70. LeCun Y, Denker JS, Solla SA (1990b) Optimal brain damage. In: Touretzky DS (ed) Proceedings of NIPS 2, Morgan Kaufmann, San Mateo, pp 598–605
  71. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. See critique by J. Schmidhuber (2015) http://people.idsia.ch/~juergen/deep-learning-conspiracy.html
  72. Lee S, Kil RM (1991) A Gaussian potential function network with hierarchically self-organizing learning. Neural Netw 4(2):207–224
  73. Li X, Wu X (2015) Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition. In: Proceedings of ICASSP 2015. http://arxiv.org/abs/1410.4281
  74. Linnainmaa S (1970) The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis, University of Helsinki
  75. Linnainmaa S (1976) Taylor expansion of the accumulated rounding error. BIT Numer Math 16(2):146–160
  76. Maas AL, Hannun AY, Ng AY (2013) Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of ICML, Atlanta
  77. Maass W (2000) On the computational power of winner-take-all. Neural Comput 12:2519–2535
  78. MacKay DJC (1992) A practical Bayesian framework for backprop networks. Neural Comput 4:448–472
  79. Maclin R, Shavlik JW (1995) Combining the predictions of multiple classifiers: using competitive learning to initialize neural networks. In: Proceedings of IJCAI, pp 524–531
  80. Martens J, Sutskever I (2011) Learning recurrent neural networks with Hessian-free optimization. In: Proceedings of ICML, pp 1033–1040
  81. Masci J, Giusti A, Ciresan DC, Fricout G, Schmidhuber J (2013) A fast learning algorithm for image segmentation with max-pooling convolutional networks. In: Proceedings of ICIP13, pp 2713–2717
  82. McCulloch W, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 7:115–133
  83. Mohamed A, Hinton GE (2010) Phone recognition using restricted Boltzmann machines. In: Proceedings of ICASSP, Dallas, pp 4354–4357
  84. Moller MF (1993) Exact calculation of the product of the Hessian matrix of feed-forward network error functions and a vector in O(N) time. Technical report PB-432, Computer Science Department, Aarhus University
  85. Montavon G, Orr G, Mueller K (2012) Neural networks: tricks of the trade. Lecture notes in computer science, vol LNCS 7700. Springer, Berlin/Heidelberg
  86. Moody JE (1992) The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems. In: Proceedings of NIPS'4, Morgan Kaufmann, San Mateo, pp 847–854
  87. Mozer MC, Smolensky P (1989) Skeletonization: a technique for trimming the fat from a network via relevance assessment. In: Proceedings of NIPS 1, Morgan Kaufmann, San Mateo, pp 107–115
  88. Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: Proceedings of ICML, Dallas
  89. Oh K-S, Jung K (2004) GPU implementation of neural networks. Pattern Recognit 37(6):1311–1314
  90. Pascanu R, Mikolov T, Bengio Y (2013b) On the difficulty of training recurrent neural networks. In: ICML'13: JMLR: W&CP, vol 28
  91. Pearlmutter BA (1994) Fast exact multiplication by the Hessian. Neural Comput 6(1):147–160
  92. Raina R, Madhavan A, Ng A (2009) Large-scale deep unsupervised learning using graphics processors. In: Proceedings of ICML, Montreal, pp 873–880
  93. Ranzato MA, Huang F, Boureau Y, LeCun Y (2007) Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: Proceedings of CVPR, Minneapolis, pp 1–8
  94. Robinson AJ, Fallside F (1987) The utility driven dynamic error propagation network. Technical report CUED/F-INFENG/TR.1, Cambridge University Engineering Department
  95. Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386
  96. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representations by error propagation. In: Rumelhart DE, McClelland JL (eds) Parallel distributed processing, vol 1, MIT Press, Cambridge, pp 318–362
  97. Sak H, Senior AW, Beaufays F (2014) Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In: INTERSPEECH
  98. Sak H, Senior A, Rao K, Beaufays F, Schalkwyk J (2015) Google research blog. http://googleresearch.blogspot.ch/2015/09/google-voice-search-faster-and-more.html
  99. Schapire RE (1990) The strength of weak learnability. Mach Learn 5(2):197–227
  100. Scherer D, Mueller A, Behnke S (2010) Evaluation of pooling operations in convolutional architectures for object recognition. In: Proceedings of ICANN, Thessaloniki, pp 92–101
  101. Schmidhuber J (1989b) A local learning algorithm for dynamic feedforward and recurrent networks. Connect Sci 1(4):403–412
  102. Schmidhuber J (1992b) Learning complex, extended sequences using the principle of history compression. Neural Comput 4(2):234–242. Based on TR FKI-148-91, TUM, 1991
  103. Schmidhuber J (1992c) Learning factorial codes by predictability minimization. Neural Comput 4(6):863–879
  104. Schmidhuber J (1997) Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Netw 10(5):857–873
  105. Schmidhuber J, Wierstra D, Gagliolo M, Gomez FJ (2007) Training recurrent networks by Evolino. Neural Comput 19(3):757–779
  106. Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117. arXiv preprint arXiv:1404.7828
  107. Schmidhuber J (2015) Deep learning. Scholarpedia 10(11):32832
  108. Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45:2673–2681
  109. Sima J (1994) Loading deep networks is hard. Neural Comput 6(5):842–850
  110. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. arXiv preprint http://arxiv.org/abs/1409.1556
  111. Smolensky P (1986) Parallel distributed processing: explorations in the microstructure of cognition, chapter information processing in dynamical systems: foundations of Harmony theory, vol 1. MIT Press, Cambridge, pp 194–281
  112. Speelpenning B (1980) Compiling fast partial derivatives of functions given by algorithms. Ph.D. thesis, Department of Computer Science, University of Illinois, Urbana-Champaign
  113. Srivastava RK, Masci J, Kazerounian S, Gomez F, Schmidhuber J (2013) Compete to compute. In: Proceedings of NIPS, Nevada, pp 2310–2318
  114. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Proceedings of NIPS'2014. arXiv preprint arXiv:1409.3215 [cs.CL]
  115. Sutton R, Barto A (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
  116. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2014) Going deeper with convolutions. arXiv preprint arXiv:1409.4842 [cs.CV]
  117. Tikhonov AN, Arsenin VI, John F (1977) Solutions of ill-posed problems. Winston, New York
  118. Vaillant R, Monrocq C, LeCun Y (1994) Original approach for the localisation of objects in images. IEE Proc Vision Image Signal Process 141(4):245–250
  119. Vieira A, Barradas N (2003) A training algorithm for classification of high-dimensional data. Neurocomputing 50:461–472
  120. Vinyals O, Toshev A, Bengio S, Erhan D (2014a) Show and tell: a neural image caption generator. arXiv preprint http://arxiv.org/pdf/1411.4555v1.pdf
  121. Vinyals O, Kaiser L, Koo T, Petrov S, Sutskever I, Hinton G (2014b) Grammar as a foreign language. Preprint http://arxiv.org/abs/1412.7449
  122. Wan EA (1994) Time series prediction by using a connectionist network with internal delay lines. In: Weigend AS, Gershenfeld NA (eds) Time series prediction: forecasting the future and understanding the past. Addison-Wesley, Reading, pp 265–295
  123. Weng JJ, Ahuja N, Huang TS (1993) Learning recognition and segmentation of 3-d objects from 2-d images. In: Proceedings of the fourth international conference on computer vision, IEEE
  124. Williams RJ (1989) Complexity of exact gradient computation algorithms for recurrent neural networks. Technical report NU-CCS-89-27, Northeastern University, College of Computer Science, Boston
  125. Wiering M, van Otterlo M (2012) Reinforcement learning. Springer, Berlin/Heidelberg
  126. Werbos PJ (1974) Beyond regression: new tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University
  127. Werbos PJ (1982) Applications of advances in nonlinear sensitivity analysis. In: Proceedings of the 10th IFIP conference, 31.8–4.9, NYC, pp 762–770
  128. Werbos PJ (1988) Generalization of backpropagation with application to a recurrent gas market model. Neural Netw 1(4):339–356
  129. Yamins D, Hong H, Cadieu C, DiCarlo JJ (2013) Hierarchical modular optimization of convolutional networks achieves representations similar to macaque IT and human ventral stream. In: Proceedings of NIPS, Nevada, pp 1–9
  130. Zeiler MD, Fergus R (2013) Visualizing and understanding convolutional networks. Technical report arXiv:1311.2901 [cs.CV], NYU
  131. Zen H, Sak H (2015) Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In: Proceedings of ICASSP, Brisbane, pp 4470–4474
  132. Zimmermann H-G, Tietz C, Grothmann R (2012) Forecasting with recurrent neural networks: 12 tricks. In: Montavon G, Orr GB, Mueller K-R (eds) Neural networks: tricks of the trade, 2nd edn. Lecture notes in computer science, vol 7700. Springer, Berlin/New York, pp 687–707

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. The Swiss AI Lab IDSIA, USI & SUPSI, Manno & Lugano, Switzerland