On the Expressive Power of Deep Architectures

  • Yoshua Bengio
  • Olivier Delalleau
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6925)

Abstract

Deep architectures are families of functions corresponding to deep circuits. Deep Learning algorithms are based on parametrizing such circuits and tuning their parameters so as to approximately optimize some training objective. Although deep architectures were long thought too difficult to train, several successful algorithms have been proposed in recent years. We review some of the theoretical motivations for deep architectures, as well as some of their practical successes, and propose directions of investigation to address some of the remaining challenges.
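
To make the abstract's framing concrete, the following is a minimal illustrative sketch (not taken from the paper): a deep architecture treated as a parametrized circuit, i.e. a stack of affine stages with tanh nonlinearities, whose parameters are tuned by gradient descent to approximately minimize a squared-error training objective. The layer sizes, learning rate, and toy target function are arbitrary choices made only for illustration.

    # Minimal sketch (illustrative only, not from the paper): a deep architecture
    # viewed as a parametrized circuit whose parameters are tuned by gradient
    # descent to approximately optimize a training objective (squared error).
    import numpy as np

    rng = np.random.default_rng(0)

    def init_params(layer_sizes):
        """One (W, b) pair per stage of the circuit."""
        return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
                for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

    def forward(params, x):
        """Evaluate the circuit, keeping every activation for backpropagation."""
        acts = [x]
        for i, (W, b) in enumerate(params):
            z = acts[-1] @ W + b
            acts.append(np.tanh(z) if i < len(params) - 1 else z)  # linear output
        return acts

    def sgd_step(params, x, y, lr=0.05):
        """One gradient-descent step on the mean squared-error objective."""
        acts = forward(params, x)
        delta = (acts[-1] - y) / len(x)              # d(0.5 * MSE) / d(output)
        for i in reversed(range(len(params))):
            W, b = params[i]
            grad_W = acts[i].T @ delta
            grad_b = delta.sum(axis=0)
            if i > 0:                                # propagate through tanh
                delta = (delta @ W.T) * (1.0 - acts[i] ** 2)
            params[i] = (W - lr * grad_W, b - lr * grad_b)
        return 0.5 * np.mean((acts[-1] - y) ** 2)

    # Toy data: a compositional target of two inputs (arbitrary choice).
    X = rng.uniform(-1.0, 1.0, size=(256, 2))
    Y = np.sin(3 * X[:, :1]) * np.cos(3 * X[:, 1:])

    params = init_params([2, 16, 16, 1])             # a depth-3 circuit
    for _ in range(2000):
        loss = sgd_step(params, X, Y)
    print("final training loss: %.4f" % loss)

The point is only structural: depth comes from composing several parametrized stages, and learning amounts to adjusting all of their parameters jointly against the training objective.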

Keywords

Hidden Unit, Sparse Coding, Neural Information Processing Systems, Deep Neural Network, Restricted Boltzmann Machine

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Yoshua Bengio (1)
  • Olivier Delalleau (1)

  1. Dept. IRO, Université de Montréal, Montréal, Canada
