Abstract
A fairly comprehensive analysis is presented for the gradient descent dynamics of training two-layer neural network models in the setting where the parameters in both layers are updated. General initialization schemes as well as general regimes for the network width and training data size are considered. In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast, regardless of the quality of the labels. In addition, it is proved that throughout the training process the functions represented by the neural network model are uniformly close to those of a kernel method. For general values of the network width and training data size, sharp estimates of the generalization error are established for target functions in the appropriate reproducing kernel Hilbert space.
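The over-parametrized claim above can be illustrated numerically. The following is a minimal sketch (not the paper's exact setup; the NTK-style 1/√m scaling, network sizes, and learning rate are illustrative assumptions): full-batch gradient descent on a two-layer ReLU network with both layers updated, fitting arbitrary random labels. With width m much larger than the sample size n, the training loss is driven to near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5, 3, 1000            # samples, input dim, width (m >> n, illustrative)
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere
y = rng.normal(size=n)          # arbitrary ("random-quality") labels

W = rng.normal(size=(m, d))     # inner-layer weights (trained)
a = rng.normal(size=m)          # outer-layer weights (trained)

def net(X, W, a):
    """f(x) = (1/sqrt(m)) * sum_k a_k * relu(w_k . x)."""
    h = np.maximum(X @ W.T, 0.0)            # (n, m) hidden activations
    return h @ a / np.sqrt(m), h

lr, losses = 1.0, []
for _ in range(5000):
    pred, h = net(X, W, a)
    r = pred - y                            # residuals
    losses.append(0.5 * np.mean(r ** 2))
    # exact full-batch gradients of the mean squared loss w.r.t. both layers
    grad_a = h.T @ r / (n * np.sqrt(m))
    active = (h > 0).astype(float)          # ReLU derivative
    grad_W = ((r[:, None] * active).T @ X) * a[:, None] / (n * np.sqrt(m))
    a -= lr * grad_a
    W -= lr * grad_W

print(f"training loss: {losses[0]:.3f} -> {losses[-1]:.2e}")
```

In this regime the weights move only slightly from initialization, consistent with the uniform closeness to a kernel method stated in the abstract.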
Acknowledgements
This work was supported by a gift to Princeton University from iFlytek and the Office of Naval Research (ONR) (Grant No. N00014-13-1-0338).
Cite this article
E, W., Ma, C. & Wu, L. A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. Sci. China Math. 63, 1235–1258 (2020). https://doi.org/10.1007/s11425-019-1628-5