Practical Recommendations for Gradient-Based Training of Deep Architectures

  • Yoshua Bengio
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7700)


Abstract

Learning algorithms related to artificial neural networks, and in particular Deep Learning, may seem to involve many bells and whistles, called hyper-parameters. This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on back-propagated gradients and gradient-based optimization. It also discusses how to deal with the fact that more interesting results can be obtained when many hyper-parameters are allowed to be adjusted. Overall, it describes elements of the practice used to successfully and efficiently train and debug large-scale and often deep multi-layer neural networks. It closes with open questions about the training difficulties observed with deeper architectures.
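To make the notion of a hyper-parameter concrete, here is a minimal sketch of mini-batch stochastic gradient descent for a one-hidden-layer network in NumPy. This is an illustration assumed for this summary, not code from the chapter: every named constant below (learning rate, mini-batch size, number of hidden units, number of epochs, initialization scale) is a hyper-parameter of the kind the chapter's recommendations concern, and the particular values are arbitrary placeholders.

```python
import numpy as np

# Hypothetical hyper-parameter settings; the values are placeholders,
# not recommendations from the chapter.
learning_rate = 0.01   # step size of each gradient update
minibatch_size = 32    # examples per stochastic gradient estimate
n_hidden = 100         # number of hidden units
n_epochs = 10          # passes over the training set
init_scale = 0.1       # scale of the random weight initialization

rng = np.random.RandomState(0)

# Toy regression data standing in for a real training set.
X = rng.randn(1000, 20)
y = rng.randn(1000, 1)

# One hidden layer with tanh units, linear output, squared-error loss.
W1 = init_scale * rng.randn(20, n_hidden)
b1 = np.zeros(n_hidden)
W2 = init_scale * rng.randn(n_hidden, 1)
b2 = np.zeros(1)

for epoch in range(n_epochs):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), minibatch_size):
        idx = perm[start:start + minibatch_size]
        xb, yb = X[idx], y[idx]

        # Forward pass.
        h = np.tanh(xb @ W1 + b1)
        pred = h @ W2 + b2

        # Backward pass: gradients of the mean squared error,
        # obtained by back-propagation.
        d_pred = 2.0 * (pred - yb) / len(xb)
        dW2 = h.T @ d_pred
        db2 = d_pred.sum(axis=0)
        d_h = (d_pred @ W2.T) * (1.0 - h ** 2)
        dW1 = xb.T @ d_h
        db1 = d_h.sum(axis=0)

        # Stochastic gradient descent update.
        for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
            param -= learning_rate * grad
```

Searching over such quantities, for example by grid or random search over learning_rate and n_hidden, is the kind of procedure the chapter organizes into practical recommendations.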


Keywords: Deep Learning · Grid Search · Hidden Unit · Sparse Coding · Generalization Error





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Yoshua Bengio, Université de Montréal, Canada
