
Journal of Statistical Physics, Volume 168, Issue 6, pp 1223–1247

Why Does Deep and Cheap Learning Work So Well?

  • Henry W. Lin
  • Max Tegmark
  • David Rolnick

Abstract

We show how the success of deep learning could depend not only on mathematics but also on physics: although well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can frequently be approximated through “cheap learning” with exponentially fewer parameters than generic ones. We explore how properties frequently encountered in physics such as symmetry, locality, compositionality, and polynomial log-probability translate into exceptionally simple neural networks. We further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine learning, a deep neural network can be more efficient than a shallow one. We formalize these claims using information theory and discuss the relation to the renormalization group. We prove various “no-flattening theorems” showing when efficient linear deep networks cannot be accurately approximated by shallow ones without efficiency loss; for example, we show that n variables cannot be multiplied using fewer than \(2^n\) neurons in a single hidden layer.
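To make the abstract's multiplication claim concrete: exactly multiplying n inputs with a single hidden layer requires on the order of \(2^n\) neurons, yet any smooth nonlinearity with a nonvanishing second derivative lets just four hidden neurons approximate the product of two inputs arbitrarily well. The sketch below is our own illustration of that construction, not code from the paper; the softplus nonlinearity and the scale parameter lam are choices we introduce for the example.

```python
import numpy as np

def softplus(u):
    # Smooth nonlinearity with nonzero second derivative at the origin:
    # softplus''(0) = 1/4, which is what the construction needs.
    return np.log1p(np.exp(u))

def approx_product(x, y, lam=1e-2):
    """Approximate x*y with a single hidden layer of 4 softplus neurons.

    A second-order Taylor expansion of the four hidden units gives
    4 * lam**2 * sigma''(0) * x*y plus O(lam**4) corrections, so dividing
    by 4 * lam**2 * sigma''(0) recovers x*y as lam -> 0.
    """
    hidden = (softplus(lam * (x + y)) + softplus(-lam * (x + y))
              - softplus(lam * (x - y)) - softplus(-lam * (x - y)))
    return hidden / (4.0 * lam**2 * 0.25)  # softplus''(0) = 1/4

if __name__ == "__main__":
    x, y = 1.3, -0.7
    print(approx_product(x, y), x * y)  # agree to roughly 1e-4
```

The accuracy gain as lam shrinks comes at the price of a growing output weight, and the \(2^n\) lower bound in the abstract applies to a single hidden layer; a deep network can instead compose such pairwise gates to multiply many variables cheaply.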

Keywords

Artificial neural networks · Deep learning · Statistical physics

Acknowledgements

This work was supported by the Foundational Questions Institute http://fqxi.org/, the Rothberg Family Fund for Cognitive Science, and NSF Grant 1122374. We thank Scott Aaronson, Frank Ban, Yoshua Bengio, Rico Jonschkowski, Tomaso Poggio, Bart Selman, Viktoriya Krakovna, Krishanu Sankar, and Boya Song for helpful discussions and suggestions; Frank Ban, Fernando Perez, Jared Jolton, and the anonymous referee for helpful corrections; and the Center for Brains, Minds, and Machines (CBMM) for hospitality.

Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Department of Physics, Harvard University, Cambridge, USA
  2. Department of Physics, Massachusetts Institute of Technology, Cambridge, USA
  3. Department of Mathematics, Massachusetts Institute of Technology, Cambridge, USA
