Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review

Abstract

The paper reviews and extends an emerging body of theoretical results on deep learning, including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represents an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.


Acknowledgement

The authors thank O. Shamir for useful emails that prompted us to clarify our results in the context of lower bounds.

Author information

Corresponding author

Correspondence to Tomaso Poggio.

Additional information

This work was supported by the Center for Brains, Minds and Machines (CBMM), NSF STC award CCF (No. 1231216), and ARO (No. W911NF-15-1-0385).

Recommended by Associate Editor Hong Qiao

Tomaso Poggio received the Ph.D. degree in theoretical physics from the University of Genoa, Italy in 1971. He is the Eugene McDermott Professor in the Department of Brain and Cognitive Sciences, director of the Center for Brains, Minds and Machines, and a member of the Computer Science and Artificial Intelligence Laboratory at Massachusetts Institute of Technology (MIT), USA. Since 2000, he has been a member of the faculty of the McGovern Institute for Brain Research. He was a wissenschaftlicher Assistent at the Max Planck Institut für Biologische Kybernetik, Tübingen, Germany from 1972 until 1981, when he became an associate professor at MIT. He is an honorary member of the Neuroscience Research Program, a member of the American Academy of Arts and Sciences and a Founding Fellow of AAAI. He has received several awards, including the Otto-Hahn-Medaille of the Max Planck Society, the Max Planck Research Award (with M. Fahle) from the Alexander von Humboldt Foundation, the MIT 50K Entrepreneurship Competition Award, the Laurea Honoris Causa from the University of Pavia in 2000 (Volta Bicentennial), the 2003 Gabor Award, the 2009 Okawa Prize, the American Association for the Advancement of Science (AAAS) Fellowship (2009) and the Swartz Prize for Theoretical and Computational Neuroscience in 2014. He is one of the most cited computational neuroscientists (with an h-index greater than 100, based on Google Scholar).

Hrushikesh Mhaskar did his undergraduate studies at the Institute of Science, Nagpur, and received his first M.Sc. degree in mathematics from the Indian Institute of Technology, India in 1976. He received the Ph.D. degree in mathematics and an M.Sc. degree in computer science from the Ohio State University, USA in 1980. He then joined California State University, Los Angeles, and was promoted to full professor in 1990. Since retiring in 2012, he has been a visiting associate at the California Institute of Technology and a research professor at Claremont Graduate University, and has occasionally served as a consultant for Qualcomm. He has published more than 135 refereed articles in the areas of approximation theory, potential theory, neural networks, wavelet analysis, and data processing. His book Weighted Polynomial Approximation was published in 1997 by World Scientific, and the book with Dr. D. V. Pai, Fundamentals of Approximation Theory, was published by Narosa Publishers, CRC, and Alpha Science in 2000. He serves on the editorial boards of Journal of Approximation Theory, Applied and Computational Harmonic Analysis, and Jaen Journal of Approximation. In addition, he was a co-editor of a special issue of Advances in Computational Mathematics on mathematical aspects of neural networks, two volumes of Journal of Approximation Theory dedicated to the memory of G. G. Lorentz, as well as two edited collections of research articles: Wavelet Analysis and Applications, Narosa Publishers, 2001, and Frontiers in Interpolation and Approximation, Chapman and Hall/CRC, 2006. He has held visiting positions, and given several invited lectures, throughout North America, Europe, and Asia. He was awarded the Humboldt Fellowship for research in Germany four times. He was John von Neumann Distinguished Professor at the Technical University of Munich in 2011. He is listed in Outstanding Young Men of America (1985) and Who's Who in America's Teachers (1994).
His research was supported by the National Science Foundation and the U. S. Army Research Office, the Air Force Office of Scientific Research, the National Security Agency, and the Research and Development Laboratories.

Lorenzo Rosasco received the Ph.D. degree from the University of Genova, Italy in 2006, where he worked under the supervision of Alessandro Verri and Ernesto De Vito in the Statistical Learning and Image Processing Research Unit (SLIPGURU). He is an assistant professor at the University of Genova, Italy. He is also affiliated with the Massachusetts Institute of Technology (MIT), USA, where he is a visiting professor, and with the Istituto Italiano di Tecnologia (IIT), Italy, where he is an external collaborator. He is leading the efforts to establish the Laboratory for Computational and Statistical Learning (LCSL), born from a collaborative agreement between IIT and MIT. During his Ph.D. studies, he was a visiting student at the Toyota Technological Institute at Chicago, USA (working with Steve Smale) and at the Center for Biological and Computational Learning (CBCL) at MIT, working with Tomaso Poggio. Between 2006 and 2009, he was a postdoctoral fellow at CBCL working with Tomaso Poggio.

His research interests include theory and algorithms for machine learning. He has developed and analyzed methods to learn from small as well as large samples of high dimensional data, using analytical and probabilistic tools, within a multidisciplinary approach drawing concepts and techniques primarily from computer science but also from statistics, engineering and applied mathematics.

Brando Miranda received the B.Sc. degree in electrical engineering and computer science (EECS) and the M.Eng. degree (supervised by Professor Tomaso Poggio) in machine learning from Massachusetts Institute of Technology (MIT), USA in 2014 and 2016, respectively.

His research interests include machine learning, statistics, neural networks, theories in deep learning and applied mathematics.

Qianli Liao is a Ph.D. candidate in electrical engineering and computer science (EECS) at Massachusetts Institute of Technology (MIT), USA, supervised by Professor Tomaso Poggio.

His research interests include machine learning, optimization, neural networks, theoretical deep learning, computer vision, visual object/face recognition, biologically-plausible and brain-inspired learning.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article

Poggio, T., Mhaskar, H., Rosasco, L. et al. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. Int. J. Autom. Comput. 14, 503–519 (2017). https://doi.org/10.1007/s11633-017-1054-2


Keywords

  • Machine learning
  • neural networks
  • deep and shallow networks
  • convolutional neural networks
  • function approximation
  • deep learning