Abstract
We present a continuous formulation of machine learning, as a problem in the calculus of variations and differential-integral equations, in the spirit of classical numerical analysis. We demonstrate that conventional machine learning models and algorithms, such as the random feature model, the two-layer neural network model and the residual neural network model, can all be recovered (in a scaled form) as particular discretizations of different continuous formulations. We also present examples of new models, such as the flow-based random feature model, and new algorithms, such as the smoothed particle method and spectral method, that arise naturally from this continuous formulation. We discuss how the issues of generalization error and implicit regularization can be studied under this framework.
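As a minimal illustration of the discretization viewpoint summarized above (written in generic notation that is standard in this literature, not necessarily the exact conventions used in the body of the paper, with \(\sigma\), \(g\), \(\rho\) and \(\theta\) serving as placeholder symbols), a two-layer network can be read as a Monte Carlo discretization of an integral representation,
\[
f(x;\rho)=\int a\,\sigma(w^{\top}x)\,\rho(da,dw)
\;\approx\;\frac{1}{m}\sum_{j=1}^{m} a_j\,\sigma(w_j^{\top}x),
\qquad (a_j,w_j)\sim\rho,
\]
which is the scaled form of a two-layer network (or, if the \(w_j\) are fixed and only the \(a_j\) are trained, a random feature model); and a residual network can be read as a time discretization of a flow,
\[
\frac{dz(t)}{dt}=g\big(z(t),\theta(t)\big),\qquad z(0)=x,\qquad t\in[0,1],
\]
which, under a forward Euler discretization with step size \(1/L\), becomes the residual update \(z_{l+1}=z_l+\tfrac{1}{L}\,g(z_l,\theta_l)\) for \(l=0,\dots,L-1\).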
Acknowledgements
This work was supported by a gift to Princeton University from iFlytek and by the Office of Naval Research (Grant No. N00014-13-1-0338). The authors are grateful to Jianfeng Lu, Stephan Wojtowytsch, Lexing Ying and Shuhai Zhao for helpful discussions.