
Machine learning from a continuous viewpoint, I

Abstract

We present a continuous formulation of machine learning, as a problem in the calculus of variations and differential-integral equations, in the spirit of classical numerical analysis. We demonstrate that conventional machine learning models and algorithms, such as the random feature model, the two-layer neural network model and the residual neural network model, can all be recovered (in a scaled form) as particular discretizations of different continuous formulations. We also present examples of new models, such as the flow-based random feature model, and new algorithms, such as the smoothed particle method and spectral method, that arise naturally from this continuous formulation. We discuss how the issues of generalization error and implicit regularization can be studied under this framework.
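To make the discretization statement above concrete, the following is a minimal sketch in the integral-representation and flow notation standard in this literature; the symbols $\rho$, $\sigma$, $g$, $\theta$, $m$ and $L$ are illustrative and not necessarily the paper's exact notation. A two-layer network can be read as a Monte Carlo discretization of an expectation over a distribution of parameters, and a residual network as a forward-Euler discretization of a flow; the $1/m$ and $1/L$ factors are the "scaled form" in which the conventional models reappear:

\[
f(x;\rho) = \int a\,\sigma(b^{\top}x + c)\,\rho(\mathrm{d}a,\mathrm{d}b,\mathrm{d}c)
\;\approx\;
f_m(x) = \frac{1}{m}\sum_{j=1}^{m} a_j\,\sigma(b_j^{\top}x + c_j),
\]
\[
\frac{\mathrm{d}z(\tau)}{\mathrm{d}\tau} = g\bigl(z(\tau),\theta(\tau)\bigr),\quad z(0)=x,
\qquad\text{discretized as}\qquad
z_{l+1} = z_l + \frac{1}{L}\,g(z_l,\theta_l),\quad l = 0,\dots,L-1.
\]

In this reading, the random feature model keeps the sampled inner parameters $(b_j, c_j)$ fixed and trains only the outer coefficients $a_j$, so the conventional models named in the abstract correspond to different continuous objects (a fixed feature distribution, a free parameter distribution, and a flow) discretized in different ways.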

Acknowledgements

This work was supported by a gift to Princeton University from iFlytek and by Office of Naval Research (ONR) Grant No. N00014-13-1-0338. The authors are grateful to Jianfeng Lu, Stephan Wojtowytsch, Lexing Ying and Shuhai Zhao for helpful discussions.

Author information

Corresponding author: Weinan E.

Cite this article

E, W., Ma, C. & Wu, L. Machine learning from a continuous viewpoint, I. Sci. China Math. 63, 2233–2266 (2020). https://doi.org/10.1007/s11425-020-1773-8
