Deep learning theory of distribution regression with CNNs

Abstract

We establish a deep learning theory for distribution regression with deep convolutional neural networks (DCNNs). Deep learning based on structured deep neural networks has proven powerful in practical applications. Generalization analysis for regression with DCNNs has been carried out only very recently. However, for the distribution regression problem, in which the input variables are probability measures, there has been no mathematical model or theoretical analysis of DCNN-based learning. One of the difficulties is that the classical neural network structure requires the input variable to be a Euclidean vector; when the input samples are probability distributions, the traditional neural network structure cannot be used directly, so a well-defined DCNN framework for distribution regression is desirable. In this paper, we overcome this difficulty and establish a novel DCNN-based learning theory for a two-stage distribution regression model. First, we realize an approximation theory for functionals defined on the set of Borel probability measures with the proposed DCNN framework. Then, we show that the hypothesis space is well defined by rigorously proving its compactness. Furthermore, in the hypothesis space induced by the general DCNN framework with distribution inputs, using a two-stage error decomposition technique, we derive a novel DCNN-based two-stage oracle inequality and optimal learning rates (up to a logarithmic factor) for the proposed distribution regression algorithm.
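
To make the two-stage sampling setting concrete, the following minimal Python sketch generates synthetic two-stage data: each first-stage observation is a (distribution, label) pair, and each distribution is only accessible through a second-stage sample drawn from it, which is summarized into a fixed-length vector that a DCNN-based regressor could take as input. The Gaussian family, the histogram summary `summarize`, and the target functional are illustrative assumptions, not the construction analyzed in the paper.

```python
import numpy as np

# Minimal, purely illustrative sketch of two-stage sampled data for
# distribution regression (assumptions, not the paper's construction):
# the i-th first-stage observation is a pair (x_i, y_i) where x_i is a
# probability distribution, but x_i is only accessible through a
# second-stage sample of n points drawn from it.
rng = np.random.default_rng(0)
m, n, d = 50, 200, 16          # first-stage size, second-stage size, summary length

def summarize(sample, d):
    """Map a second-stage sample to a fixed-length vector (here a normalized
    histogram on a fixed grid) that a DCNN-based regressor could take as input."""
    hist, _ = np.histogram(sample, bins=d, range=(-4.0, 4.0), density=True)
    return hist

X, y = [], []
for i in range(m):
    theta = rng.uniform(-1.0, 1.0)                     # parameter of the i-th distribution
    sample = rng.normal(loc=theta, scale=1.0, size=n)  # second-stage sample from x_i
    X.append(summarize(sample, d))
    y.append(theta ** 2 + 0.01 * rng.normal())         # noisy label of the functional theta^2

X, y = np.stack(X), np.array(y)
print(X.shape, y.shape)        # (50, 16) (50,) -- inputs/labels for a DCNN regressor
```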

References

  1. Barron, A.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39, 930–945 (1993)

  2. Bauer, F., Pereverzev, S., Rosasco, L.: On regularization algorithms in learning theory. J. Complex. 23(1), 52–72 (2007)

  3. Stone, C.J.: Optimal global rates of convergence for nonparametric regression. Ann. Stat. 10(4), 1040–1053 (1982)

  4. Christmann, A., Steinwart, I.: Consistency and robustness of kernel-based regression in convex risk minimization. Bernoulli 13(3), 799–819 (2007)

  5. Chui, C.K., Li, X., Mhaskar, H.N.: Neural networks for localized approximation. Math. Comput. 63(208), 607–623 (1994)

  6. Chui, C.K., Lin, S.B., Zhou, D.X.: Deep neural networks for rotation-invariance approximation and learning. Anal. Appl. 17, 737–772 (2019)

  7. Cucker, F., Smale, S.: On the mathematical foundations of learning. Bulletin Amer. Math. Soc. 39, 1–49 (2002)

  8. Cucker, F., Zhou, D.X.: Learning theory: an approximation theory viewpoint. Cambridge University Press, Cambridge (2007)

  9. DiBenedetto, E.: Real analysis. Birkhäuser, Boston (2002)

  10. Dong, S.N., Sun, W.C.: Distributed learning and distribution regression of coefficient regularization. J. Approx. Theory 263, 105523 (2021)

  11. Dong, S.N., Sun, W.C.: Learning rate of distribution regression with dependent samples. J. Complex. 101679 (2022)

  12. Fang, Z.Y., Guo, Z.C., Zhou, D.X.: Optimal learning rates for distribution regression. J. Complex. 56, 101426 (2020)

  13. Feng, Y.L., Huang, X.L., Shi, L., Yang, Y.N., Suykens, J.A.K.: Learning with the maximum correntropy criterion induced losses for regression. J. Mach. Learn. Res. 16(30), 993–1034 (2015)

  14. Guo, Z.C., Shi, L.: Optimal rates for coefficient-based regularized regression. Appl. Comput. Harmon. Anal. 47(3), 662–701 (2019)

  15. Guo, Z.C., Lin, S.B., Shi, L.: Distributed learning with multi-penalty regularization. Appl. Comput. Harmon. Anal. 46(3), 478–499 (2019)

  16. Hayakawa, S., Suzuki, T.: On the minimax optimality and superiority of deep neural network learning over sparse parameter spaces. Neural Netw. 123, 343–361 (2020)

  17. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989)

  18. Klusowski, J.M., Barron, A.R.: Approximation by combinations of ReLU and squared ReLU ridge functions with \(\ell ^ 1\) and \(\ell ^ 0\) controls. IEEE Trans. Inf. Theory 64(12), 7649–7656 (2018)

  19. Lin, Y.V., Pinkus, A.: Fundamentality of ridge functions. J. Approx. Theory 75, 295–311 (1993)

  20. Maiorov, V.E.: On best approximation by ridge functions. J. Approx. Theory 99, 68–94 (1999)

  21. Mao, T., Shi, Z.J., Zhou, D.X.: Theory of deep convolutional neural networks III: approximating radial functions. Neural Netw. 144, 778–790 (2021)

  22. Mücke, N.: Stochastic gradient descent meets distribution regression. In: International Conference on Artificial Intelligence and Statistics, pp. 2143–2151. PMLR (2021)

  23. Mhaskar, H.N.: Approximation properties of a multilayered feedforward artificial neural network. Adv. Comput. Math. 1, 61–80 (1993)

  24. Mhaskar, H.N.: Dimension independent bounds for general shallow networks. Neural Netw. 123, 142–152 (2020)

  25. Póczos, B., Singh, A., Rinaldo, A., Wasserman, L.: Distribution-free distribution regression. In: Artificial Intelligence and Statistics, pp. 507–515. PMLR (2013)

  26. Steinwart, I., Christmann, A.: Support vector machines. Springer, New York (2008)

  27. Schmidt-Hieber, J.: Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 48(4), 1875–1897 (2020)

  28. Smale, S., Zhou, D.X.: Shannon sampling and function reconstruction from point values. Bull. Am. Math. Soc. 41(3), 279–305 (2004)

  29. Smale, S., Zhou, D.X.: Learning theory estimates via integral operators and their approximations. Constr. Approx. 26(2), 153–172 (2007)

  30. Szabó, Z., Gretton, A., Póczos, B., Sriperumbudur, B.: Two-stage sampled learning theory on distributions. In: Artificial Intelligence and Statistics, pp. 948–957. PMLR (2015)

  31. Szabó, Z., Sriperumbudur, B.K., Póczos, B., Gretton, A.: Learning theory for distribution regression. J. Mach. Learn. Res. 17, 5272–5311 (2016)

  32. Villani, C.: Optimal transport: old and new, vol. 338. Springer, Berlin (2009)

  33. Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017)

  34. Yu, Z., Ho, D.W.C., Shi, Z.J., Zhou, D.X.: Robust kernel-based distribution regression. Inverse Problems 37(10), 105014 (2021)

  35. Zhao, P., Zhou, Z.H.: Label distribution learning by optimal transport. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1 (2018)

  36. Zhou, D.X.: Universality of deep convolutional neural networks. Appl. Comput. Harmon. Anal. 48(2), 787–794 (2020)

  37. Zhou, D.X.: Deep distributed convolutional neural networks: universality. Anal. Appl. 16, 895–919 (2018)

  38. Zhou, D.X.: Theory of deep convolutional neural networks: Downsampling. Neural Netw. 124, 319–327 (2020)

  39. Zweig, A., Bruna, J.: A functional perspective on learning symmetric functions with neural networks. In: International Conference on Machine Learning, pp. 13023–13032. PMLR (2021)

Acknowledgements

The first version of the paper was written when the second author worked at City University of Hong Kong, supported partially by the Laboratory for AI-Powered Financial Technologies, the Research Grants Council of Hong Kong [Projects #C1013-21GF and #11308121], the Germany/Hong Kong Joint Research Scheme [Project No. G-CityU101/20], the CityU Strategic Interdisciplinary Research Grant [Project No. 7020010], National Science Foundation of China [Project No. 12061160462], and Hong Kong Institute for Data Science. The first author would like to thank Zhongjie Shi for helpful communications and discussions on related topics. The authors would like to thank the anonymous referee for his/her careful review, which helped improve the quality of the paper.

Author information

Corresponding author

Correspondence to Zhan Yu.

Ethics declarations

Conflicts of interest statement

The authors declare no competing interests.

Additional information

Communicated by: Siddhartha Mishra

Appendix

1.1 The sequence factorization lemma

Lemma 1

([36]) Let \(s\ge 2\) and \(W=(W_k)_{k=-\infty }^{\infty }\) be a sequence supported in \(\{0,1,...,\mathcal {M}\}\) with \(\mathcal {M}\ge 0\). Then there exists a finite sequence of filters \(\{\omega ^{(j)}\}_{j=1}^p\) each supported in \(\{0,1,...,s\}\) with \(p\le \lceil \frac{\mathcal {M}}{s-1}\rceil \) such that the following convolutional factorization holds true

$$\begin{aligned}W=\omega ^{(p)}*\omega ^{(p-1)}*\cdots *\omega ^{(2)}*\omega ^{(1)}. \end{aligned}$$
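
As a numerical illustration of this factorization (not part of the proof), the following sketch factors the symbol polynomial \(\widetilde{W}(z)=\sum _{k=0}^{\mathcal {M}} W_k z^k\) of a concrete example sequence into factors of degree at most \(s\) and verifies that convolving the resulting short filters recovers \(W\). The example is chosen so that all roots of the symbol are real; in general, complex-conjugate root pairs must be kept inside the same factor so that the filters stay real.

```python
import numpy as np

# Numerical illustration of Lemma 1 (example data only): factor the symbol
# polynomial W_0 + W_1 z + ... + W_M z^M of a sequence W into factors of
# degree at most s, and verify that convolving the resulting short filters
# recovers W.
s = 2
# The example W has symbol (z-1)(z+1)(z-2)(z+3), so all roots are real; in
# general, complex-conjugate root pairs must be grouped together so that the
# factor filters remain real.
W = np.array([6.0, -1.0, -7.0, 1.0, 1.0])        # W_0,...,W_M with M = 4

roots = np.roots(W[::-1])                        # np.roots expects highest degree first

filters = []
for k in range(0, len(roots), s):                # group the roots into blocks of size <= s
    coeffs = np.real_if_close(np.poly(roots[k:k + s]))[::-1]  # monic factor, low degree first
    filters.append(coeffs)
filters[0] = W[-1] * filters[0]                  # absorb the leading coefficient W_M (here 1)

recovered = filters[0]
for f in filters[1:]:
    recovered = np.convolve(recovered, f)        # convolution of filters = product of symbols
print(np.allclose(recovered, W))                 # expected: True
```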

1.2 The Cauchy bound of polynomial roots and Vieta’s formula of polynomial coefficients

Lemma 2

If \(w=\left\{ w_{j}\right\} _{j \in \mathbb {Z}}\) is a real sequence supported in \(\{0, \ldots , K\}\) with \(w_{K}=1\), then all the complex roots of its symbol \(\widetilde{w}(z)=\sum _{j=0}^{K} w_{j} z^{j}\) are located in the disk of radius \(1+\max _{j=0, \ldots , K-1}\left| w_{j}\right| \), the Cauchy bound of \(\widetilde{w}\). If we factorize \(\widetilde{w}\) into monic polynomials of degree at most s, then all the coefficients of these factor polynomials are bounded by \(s^{\frac{s}{2}}\left( 1+\max _{j=0, \ldots , K-1}\left| w_{j}\right| \right) ^{s} \le \) \(s^{\frac{s}{2}}\left( 1+\Vert w\Vert _{\infty }\right) ^{s}\).
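
The following short numerical check (an illustrative sketch on a hypothetical filter, not part of the lemma) verifies the Cauchy bound for a monic symbol.

```python
import numpy as np

# Quick numerical check of the Cauchy bound in Lemma 2 on a hypothetical
# filter w supported on {0,...,K} with w_K = 1.
w = np.array([1.0, 3.0, -2.0, 0.5, 1.0])          # w_0,...,w_K, K = 4
cauchy_bound = 1.0 + np.max(np.abs(w[:-1]))       # 1 + max_{j<K} |w_j|

roots = np.roots(w[::-1])                          # roots of the symbol polynomial
print(np.max(np.abs(roots)) <= cauchy_bound)       # expected: True
```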

1.3 The covering number bound of the function set \(\mathcal {H}_{J+1}^l\)

Lemma 3

For any \(l\in \{1,2,...,(2N+3)\widetilde{d}\}\) with \(\widetilde{d}=\lfloor (d+J s) / d\rfloor \), the covering number of \(\mathcal {H}_{J+1}^l\) satisfies

$$\begin{aligned}\log \mathcal {N}(\mathcal {H}_{J+1}^l,\epsilon ,\Vert \cdot \Vert _{C(\Omega )})\le \mathcal {A}_1 N \log \frac{\widehat{R}}{\epsilon }+\mathcal {A}_2 N \log N, \ 0<\epsilon \le 1, \end{aligned}$$

where \(\mathcal {A}_1\), \(\mathcal {A}_2\) and \(\widehat{R}\) are defined as in Lemma 6.

Proof

Since the function set \(\mathcal {H}_{J+1}^l\) inherits the substructure up to the \((J+1)\)-th layer of the hypothesis space \(\mathcal {H}_{R, N}^{\mathcal {P}(\Omega )}\), the proof follows procedures similar to those in the proofs of Lemma 4 and Lemma 6.

For any \(l\in \{1,2,...,(2N+3)\widetilde{d}\}\) and any \(H\in \mathcal {H}_{J+1}^l\), choose another function \(\widetilde{H}\in \mathcal {H}_{J+1}^l\) induced by \(\widetilde{\omega }^{(j)}\) for \(j=1,2,...,J\), \(\widetilde{b}^{(j)}\) for \(j=1,2,...,J+1\), and \(\widetilde{F}^{(J+1)}\), subject to the restrictions \(\left\| \omega ^{(j)}-\widetilde{\omega }^{(j)}\right\| _{\infty }\le \epsilon \) for \(j=1,2,...,J\), \(\left\| b^{(j)}-\widetilde{b}^{(j)}\right\| _{\infty }\le \epsilon \) for \(j=1,2,...,J+1\), \(\left\| F^{(J+1)}-\widetilde{F}^{(J+1)}\right\| _{\infty }\le \epsilon \), and \(\left\| c-\widetilde{c}\right\| _{\infty }\le \epsilon \). Then, arguing as in the derivation of (3.18) in Lemma 6, we have

$$\begin{aligned} \nonumber \left\| H-\widetilde{H}\right\| _{C(\Omega )}=&\left\| \left( F^{\left( J+1\right) } \mathfrak {D}_{d}(h^{\left( J\right) }(x))-b^{\left( J+1\right) }\right) _l-\left( \widetilde{F}^{\left( J+1\right) } \mathfrak {D}_{d}(\widetilde{h}^{\left( J\right) }(x))-\widetilde{b}^{\left( J+1\right) }\right) _l\right\| _{C(\Omega )}\\ \nonumber \le&\left\| \left( F^{\left( J+1\right) } \mathfrak {D}_{d}(h^{\left( J\right) }(x))-b^{\left( J+1\right) }\right) -\left( \widetilde{F}^{\left( J+1\right) } \mathfrak {D}_{d}(\widetilde{h}^{\left( J\right) }(x))-\widetilde{b}^{\left( J+1\right) }\right) \right\| _{l^{\infty }(\Omega )}\\ \le&6J^2((s+1)R)^J\epsilon \le 6d^2((s+1)R)^{d-1}\epsilon :=\check{\epsilon }. \end{aligned}$$
(1.1)

We observe that

$$\check{\epsilon }=6d^2((s+1)R)^{d-1}\epsilon <95d^2N^4((s+1)R)^{d+1}\epsilon =\widehat{\epsilon }$$

which is defined in (3.19), and that the number of free parameters up to the \((J+1)\)-th layer of \(\mathcal {H}_{J+1}^{l}\) is no larger than the number of free parameters up to the \((J+2)\)-th layer of \(\mathcal {H}_{R,N}^{\mathcal {P}(\Omega )}\). Hence, by taking an \(\epsilon \)-net of the set of free parameters in \(\omega ^{(j)}\) for \(j=1,2,...,J\), \(b^{(j)}\) for \(j=1,2,...,J+1\), and \(F^{(J+1)}\), we derive that

$$\mathcal {N}\left(\mathcal {H}_{J+1}^l,\check{\epsilon },\Vert \cdot \Vert _{C(\Omega )}\right)\le N^{\mathcal {A}_2N}\left( \frac{\widehat{R}}{\check{\epsilon }}\right) ^{\mathcal {A}_1N},$$

which completes the proof after taking the logarithm.\(\square \)
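
To illustrate the parameter-net argument behind this covering number estimate, the sketch below covers a toy one-parameter function class in the sup norm by taking an \(\epsilon \)-net of its parameter set; the class \(\{x\mapsto \sin (wx): |w|\le 1\}\) and all numerical choices are assumptions made purely for illustration, not objects from the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's construction) of the parameter-net
# argument behind the covering number bound: for the toy class
# {f_w(x) = sin(w x) : |w| <= 1} on Omega = [0, 1], the map w -> f_w is
# 1-Lipschitz into C(Omega), so an eps-net of the parameter interval
# induces an eps-net of the function class in the sup norm.
eps = 0.05
grid = np.linspace(0.0, 1.0, 1001)                 # dense grid approximating Omega

w_net = np.arange(-1.0, 1.0 + eps, eps)            # eps-net of the parameter set [-1, 1]
net_functions = np.sin(np.outer(w_net, grid))      # functions induced by the net

# Any f_w is within eps of some net function, since
# |sin(w x) - sin(w' x)| <= |w - w'| |x| <= |w - w'| on [0, 1].
w_test = np.random.default_rng(1).uniform(-1.0, 1.0, size=100)
test_functions = np.sin(np.outer(w_test, grid))
dists = np.abs(test_functions[:, None, :] - net_functions[None, :, :]).max(axis=2)
print(np.all(dists.min(axis=1) <= eps))            # expected: True
print("covering number at scale eps is at most", len(w_net))
```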

About this article

Cite this article

Yu, Z., Zhou, DX. Deep learning theory of distribution regression with CNNs. Adv Comput Math 49, 51 (2023). https://doi.org/10.1007/s10444-023-10054-y
