Abstract
Estimation of a multivariate regression function from independent and identically distributed data is considered. An estimate is defined that fits a deep neural network, consisting of a large number of fully connected neural networks computed in parallel, to the data via gradient descent. The estimate is over-parametrized in the sense that the number of its parameters is much larger than the sample size. It is shown that, with a suitable random initialization of the network, a sufficiently small gradient descent step size, and a number of gradient descent steps that slightly exceeds the reciprocal of this step size, the estimate is universally consistent. This means that the expected \(L_2\) error converges to zero for all distributions of the data in which the response variable is square integrable.
Acknowledgements
The authors would like to thank an anonymous referee for many invaluable comments which helped to improve an early version of this manuscript.
Cite this article
Drews, S., Kohler, M. On the universal consistency of an over-parametrized deep neural network estimate learned by gradient descent. Ann Inst Stat Math 76, 361–391 (2024). https://doi.org/10.1007/s10463-024-00898-6