
On the universal consistency of an over-parametrized deep neural network estimate learned by gradient descent

Published in: Annals of the Institute of Statistical Mathematics

Abstract

Estimation of a multivariate regression function from independent and identically distributed data is considered. An estimate is defined which fits a deep neural network, consisting of a large number of fully connected neural networks computed in parallel, to the data via gradient descent. The estimate is over-parametrized in the sense that the number of its parameters is much larger than the sample size. It is shown that, with a suitable random initialization of the network, a sufficiently small gradient descent step size, and a number of gradient descent steps that slightly exceeds the reciprocal of this step size, the estimate is universally consistent. This means that the expected \(L_2\) error converges to zero for all distributions of the data for which the response variable is square integrable.
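
The following is a minimal numerical sketch (Python/NumPy) of the kind of estimate the abstract describes: a large number of fully connected networks is evaluated in parallel and averaged, all weights are initialized at random, and gradient descent with a small step size is run for slightly more than the reciprocal of that step size many steps. The architecture, the hyperparameter choices, and the restriction of the update to the outer weights are illustrative assumptions for this sketch, not the authors' exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n i.i.d. observations of (X, Y) with a square integrable response.
n, d = 200, 3
X = rng.uniform(-1.0, 1.0, size=(n, d))
Y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)

# K fully connected subnetworks (one hidden layer each), computed in parallel
# and averaged; K * width * (d + 2) parameters, far more than the sample size n.
K, width = 500, 8

def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))            # logistic squasher

# Random initialization of all weights.
W1 = rng.standard_normal((K, width, d))        # inner weights
b1 = rng.standard_normal((K, width))           # inner biases
w2 = rng.standard_normal((K, width)) / width   # outer weights

def predict(x, w_out):
    """Average of the K subnetworks evaluated at the rows of x."""
    h = sigma(np.einsum('kwd,nd->nkw', W1, x) + b1)
    return np.einsum('nkw,kw->n', h, w_out) / K

# Gradient descent with a small step size, run for slightly more than
# 1 / step_size steps.  For brevity only the outer weights are updated here,
# a simplification of gradient descent over all network weights.
step_size = 1e-3
n_steps = int(1.1 / step_size)

H = sigma(np.einsum('kwd,nd->nkw', W1, X) + b1)  # hidden features, fixed under this update
for _ in range(n_steps):
    resid = np.einsum('nkw,kw->n', H, w2) / K - Y
    grad_w2 = 2.0 * np.einsum('n,nkw->kw', resid, H) / (n * K)
    w2 -= step_size * grad_w2

print('empirical L2 risk:', np.mean((predict(X, w2) - Y) ** 2))
```

Because the inner weights stay at their random initialization in this simplified update, the sketch behaves like a random-feature regression; it is meant only to illustrate the scale of the over-parametrization and the coupling between step size and number of steps.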



Acknowledgements

The authors would like to thank an anonymous referee for many invaluable comments which helped to improve an early version of this manuscript.

Author information

Corresponding author

Correspondence to Selina Drews.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (PDF)

About this article


Cite this article

Drews, S., Kohler, M. On the universal consistency of an over-parametrized deep neural network estimate learned by gradient descent. Ann Inst Stat Math 76, 361–391 (2024). https://doi.org/10.1007/s10463-024-00898-6


  • DOI: https://doi.org/10.1007/s10463-024-00898-6
