
Machine learning from a continuous viewpoint, I

Abstract

We present a continuous formulation of machine learning, as a problem in the calculus of variations and differential-integral equations, in the spirit of classical numerical analysis. We demonstrate that conventional machine learning models and algorithms, such as the random feature model, the two-layer neural network model and the residual neural network model, can all be recovered (in a scaled form) as particular discretizations of different continuous formulations. We also present examples of new models, such as the flow-based random feature model, and new algorithms, such as the smoothed particle method and spectral method, that arise naturally from this continuous formulation. We discuss how the issues of generalization error and implicit regularization can be studied under this framework.
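To make the discretization statement above concrete, the following is a minimal sketch in the integral-representation and flow notation standard in this literature; the symbols $\rho$, $\sigma$, $g$, $\theta$, $m$ and $L$ are illustrative and not necessarily the paper's exact notation. A two-layer network can be read as a Monte Carlo discretization of an expectation over a distribution of parameters, and a residual network as a forward-Euler discretization of a flow; the $1/m$ and $1/L$ factors are the "scaled form" in which the conventional models reappear:

\[
f(x;\rho) = \int a\,\sigma(b^{\top}x + c)\,\rho(\mathrm{d}a,\mathrm{d}b,\mathrm{d}c)
\;\approx\;
f_m(x) = \frac{1}{m}\sum_{j=1}^{m} a_j\,\sigma(b_j^{\top}x + c_j),
\]
\[
\frac{\mathrm{d}z(\tau)}{\mathrm{d}\tau} = g\bigl(z(\tau),\theta(\tau)\bigr),\quad z(0)=x,
\qquad\text{discretized as}\qquad
z_{l+1} = z_l + \frac{1}{L}\,g(z_l,\theta_l),\quad l = 0,\dots,L-1.
\]

In this reading, the random feature model keeps the sampled inner parameters $(b_j, c_j)$ fixed and trains only the outer coefficients $a_j$, so the conventional models named in the abstract correspond to different continuous objects (a fixed feature distribution, a free parameter distribution, and a flow) discretized in different ways.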

Acknowledgements

This work was supported by a gift to Princeton University from iFlytek and by Office of Naval Research (ONR) Grant No. N00014-13-1-0338. The authors are grateful to Jianfeng Lu, Stephan Wojtowytsch, Lexing Ying and Shuhai Zhao for helpful discussions.

Author information

Corresponding author: Weinan E.

Cite this article

E, W., Ma, C. & Wu, L. Machine learning from a continuous viewpoint, I. Sci. China Math. 63, 2233–2266 (2020). https://doi.org/10.1007/s11425-020-1773-8
