Abstract
A fairly comprehensive analysis is presented for the gradient descent dynamics of training two-layer neural network models in the setting where the parameters in both layers are updated. General initialization schemes as well as general regimes for the network width and training data size are considered. In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast, regardless of the quality of the labels. In addition, it is proved that throughout the training process the functions represented by the neural network model are uniformly close to those of a kernel method. For general values of the network width and training data size, sharp estimates of the generalization error are established for target functions in the appropriate reproducing kernel Hilbert space.
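The over-parametrized claim above can be illustrated numerically. The following is a minimal sketch (not the paper's exact setup; the NTK-style 1/√m scaling, network sizes, and learning rate are illustrative assumptions): full-batch gradient descent on a two-layer ReLU network with both layers updated, fitting arbitrary random labels. With width m much larger than the sample size n, the training loss is driven to near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5, 3, 1000            # samples, input dim, width (m >> n, illustrative)
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere
y = rng.normal(size=n)          # arbitrary ("random-quality") labels

W = rng.normal(size=(m, d))     # inner-layer weights (trained)
a = rng.normal(size=m)          # outer-layer weights (trained)

def net(X, W, a):
    """f(x) = (1/sqrt(m)) * sum_k a_k * relu(w_k . x)."""
    h = np.maximum(X @ W.T, 0.0)            # (n, m) hidden activations
    return h @ a / np.sqrt(m), h

lr, losses = 1.0, []
for _ in range(5000):
    pred, h = net(X, W, a)
    r = pred - y                            # residuals
    losses.append(0.5 * np.mean(r ** 2))
    # exact full-batch gradients of the mean squared loss w.r.t. both layers
    grad_a = h.T @ r / (n * np.sqrt(m))
    active = (h > 0).astype(float)          # ReLU derivative
    grad_W = ((r[:, None] * active).T @ X) * a[:, None] / (n * np.sqrt(m))
    a -= lr * grad_a
    W -= lr * grad_W

print(f"training loss: {losses[0]:.3f} -> {losses[-1]:.2e}")
```

In this regime the weights move only slightly from initialization, consistent with the uniform closeness to a kernel method stated in the abstract.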
Acknowledgements
This work was supported by a gift to Princeton University from iFlytek and the Office of Naval Research (ONR) (Grant No. N00014-13-1-0338).
Cite this article
E, W., Ma, C. & Wu, L. A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. Sci. China Math. 63, 1235–1258 (2020). https://doi.org/10.1007/s11425-019-1628-5