
A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics

Science China Mathematics

Abstract

A fairly comprehensive analysis is presented for the gradient descent dynamics of training two-layer neural network models in the setting where the parameters of both layers are updated. General initialization schemes as well as general regimes for the network width and training data size are considered. In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels. In addition, it is proved that throughout the training process the functions represented by the neural network model remain uniformly close to those of a kernel method. For general values of the network width and training data size, sharp estimates of the generalization error are established for target functions in the appropriate reproducing kernel Hilbert space.
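To make the comparison concrete, the following minimal, self-contained sketch (not the authors' code) contrasts the two models studied in the paper: a two-layer ReLU network whose parameters in both layers are updated by gradient descent, and the corresponding random feature model in which the inner-layer weights stay frozen at their shared random initialization. The synthetic data, the NTK-style 1/sqrt(m) output scaling, and all hyperparameters (d, n, m, step size, iteration count) are illustrative assumptions and need not match the paper's setup.

```python
# Minimal sketch: two-layer ReLU network (both layers trained) versus the
# random feature model (inner layer frozen), trained by gradient descent on
# the same data from the same random initialization.  Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

d, n, m = 5, 50, 1000        # input dimension, sample size, network width
lr, steps = 1.0, 5000        # gradient descent step size and iteration count

# Synthetic data on the unit sphere; target is a single noisy ReLU neuron.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.maximum(X @ rng.standard_normal(d), 0.0) + 0.1 * rng.standard_normal(n)

# Shared random initialization; the 1/sqrt(m) output scaling is an assumption.
B0 = rng.standard_normal((m, d)) / np.sqrt(d)   # inner-layer weights
a0 = rng.standard_normal(m)                     # outer-layer weights

def forward(a, B):
    """f(x) = (1/sqrt(m)) * sum_k a_k * relu(b_k . x), evaluated on X."""
    return np.maximum(X @ B.T, 0.0) @ a / np.sqrt(m)

def loss(a, B):
    return 0.5 * np.mean((forward(a, B) - y) ** 2)

def grads(a, B):
    H = np.maximum(X @ B.T, 0.0)                  # (n, m) hidden activations
    r = (H @ a / np.sqrt(m) - y) / n              # residuals, averaged over samples
    ga = H.T @ r / np.sqrt(m)                     # gradient w.r.t. outer layer a
    gB = ((X @ B.T > 0) * np.outer(r, a / np.sqrt(m))).T @ X  # w.r.t. inner layer B
    return ga, gB

a_nn, B_nn = a0.copy(), B0.copy()   # two-layer network: both layers trained
a_rf = a0.copy()                    # random feature model: B frozen at B0

for _ in range(steps):
    ga, gB = grads(a_nn, B_nn)
    a_nn, B_nn = a_nn - lr * ga, B_nn - lr * gB
    ga_rf, _ = grads(a_rf, B0)
    a_rf = a_rf - lr * ga_rf

print(f"two-layer network training loss  : {loss(a_nn, B_nn):.3e}")
print(f"random feature training loss     : {loss(a_rf, B0):.3e}")
print(f"max |f_nn - f_rf| on training set: "
      f"{np.max(np.abs(forward(a_nn, B_nn) - forward(a_rf, B0))):.3e}")
```

In the over-parametrized regime (m large relative to n), one expects, in line with the results summarized above, the network's training loss to decrease toward zero while its predictions stay uniformly close to those of the random feature model throughout training.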



Acknowledgements

This work was supported by a gift to Princeton University from iFlytek and the Office of Naval Research (ONR) (Grant No. N00014-13-1-0338).

Author information

Corresponding author

Correspondence to Weinan E.


About this article


Cite this article

E, W., Ma, C. & Wu, L. A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. Sci. China Math. 63, 1235–1258 (2020). https://doi.org/10.1007/s11425-019-1628-5

