Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review

  • Tomaso Poggio
  • Hrushikesh Mhaskar
  • Lorenzo Rosasco
  • Brando Miranda
  • Qianli Liao
Open Access
Review · Special Issue on Human Inspired Computing

Abstract

The paper reviews and extends an emerging body of theoretical results on deep learning, including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represents an important special case of these conditions, although weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.
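
As an informal illustration of the kind of result being reviewed (a sketch only, omitting the precise hypotheses on activation functions and constants; see the paper for the exact statements), the central bounds compare the number N of units a network needs to approximate a function of n variables to accuracy \epsilon:

  N_{\text{shallow}} = O\!\left(\epsilon^{-n/m}\right) \quad \text{for a generic } f \in W_m^n \text{ (smoothness } m\text{)},

  N_{\text{deep}} = O\!\left((n-1)\,\epsilon^{-2/m}\right) \quad \text{for } f \text{ with a binary-tree compositional structure whose constituent functions lie in } W_m^2.

The first bound is exponential in the dimension n, while the second grows only linearly with n: this is the sense in which deep networks matched to a compositional function class can avoid the curse of dimensionality.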

Keywords

Machine learning, neural networks, deep and shallow networks, convolutional neural networks, function approximation, deep learning

Copyright information

© The Author(s) 2017

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors and Affiliations

  • Tomaso Poggio 1
  • Hrushikesh Mhaskar 2, 3
  • Lorenzo Rosasco 1
  • Brando Miranda 1
  • Qianli Liao 1

  1. Center for Brains, Minds, and Machines, McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, USA
  2. Department of Mathematics, California Institute of Technology, Pasadena, USA
  3. Institute of Mathematical Sciences, Claremont Graduate University, Claremont, USA
