Abstract
Estimation of a multivariate regression function from independent and identically distributed data is considered. An estimate is defined that fits a deep neural network, consisting of a large number of fully connected neural networks computed in parallel, to the data via gradient descent. The estimate is over-parametrized in the sense that the number of its parameters is much larger than the sample size. It is shown that, with a suitable random initialization of the network, a sufficiently small gradient descent step size, and a number of gradient descent steps that slightly exceeds the reciprocal of this step size, the estimate is universally consistent. This means that the expected \(L_2\) error converges to zero for all distributions of the data in which the response variable is square integrable.
Acknowledgements
The authors would like to thank an anonymous referee for many invaluable comments which helped to improve an early version of this manuscript.
Cite this article
Drews, S., Kohler, M. On the universal consistency of an over-parametrized deep neural network estimate learned by gradient descent. Ann Inst Stat Math 76, 361–391 (2024). https://doi.org/10.1007/s10463-024-00898-6