Deep learning theory of distribution regression with CNNs

Abstract

We establish a deep learning theory for distribution regression with deep convolutional neural networks (DCNNs). Deep learning based on structured deep neural networks has proven powerful in practical applications. Generalization analysis for regression with DCNNs has been carried out only very recently. However, for the distribution regression problem, in which the input variables are probability measures, there has been no mathematical model or theoretical analysis of DCNN-based learning. One of the difficulties is that the classical neural network structure requires the input variable to be a Euclidean vector; when the input samples are probability distributions, the traditional neural network structure cannot be used directly, so a well-defined DCNN framework for distribution regression is desirable. In this paper, we overcome this difficulty and establish a novel DCNN-based learning theory for a two-stage distribution regression model. First, we realize an approximation theory for functionals defined on the set of Borel probability measures with the proposed DCNN framework. Then, we show that the hypothesis space is well defined by rigorously proving its compactness. Furthermore, in the hypothesis space induced by the general DCNN framework with distribution inputs, using a two-stage error decomposition technique, we derive a novel DCNN-based two-stage oracle inequality and optimal learning rates (up to a logarithmic factor) for the proposed distribution regression algorithm.
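
To make the two-stage sampling setting concrete, the following minimal Python sketch generates synthetic two-stage data: each first-stage observation is a (distribution, label) pair, and each distribution is only accessible through a second-stage sample drawn from it, which is summarized into a fixed-length vector that a DCNN-based regressor could take as input. The Gaussian family, the histogram summary `summarize`, and the target functional are illustrative assumptions, not the construction analyzed in the paper.

```python
import numpy as np

# Minimal, purely illustrative sketch of two-stage sampled data for
# distribution regression (assumptions, not the paper's construction):
# the i-th first-stage observation is a pair (x_i, y_i) where x_i is a
# probability distribution, but x_i is only accessible through a
# second-stage sample of n points drawn from it.
rng = np.random.default_rng(0)
m, n, d = 50, 200, 16          # first-stage size, second-stage size, summary length

def summarize(sample, d):
    """Map a second-stage sample to a fixed-length vector (here a normalized
    histogram on a fixed grid) that a DCNN-based regressor could take as input."""
    hist, _ = np.histogram(sample, bins=d, range=(-4.0, 4.0), density=True)
    return hist

X, y = [], []
for i in range(m):
    theta = rng.uniform(-1.0, 1.0)                     # parameter of the i-th distribution
    sample = rng.normal(loc=theta, scale=1.0, size=n)  # second-stage sample from x_i
    X.append(summarize(sample, d))
    y.append(theta ** 2 + 0.01 * rng.normal())         # noisy label of the functional theta^2

X, y = np.stack(X), np.array(y)
print(X.shape, y.shape)        # (50, 16) (50,) -- inputs/labels for a DCNN regressor
```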

References

  1. Barron, A.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39, 930–945 (1993)

  2. Bauer, F., Pereverzev, S., Rosasco, L.: On regularization algorithms in learning theory. J. Complex. 23(1), 52–72 (2007)

  3. Stone, C.J.: Optimal global rates of convergence for nonparametric regression. Ann. Stat. 10(4), 1040–1053 (1982)

  4. Christmann, A., Steinwart, I.: Consistency and robustness of kernel-based regression in convex risk minimization. Bernoulli 13(3), 799–819 (2007)

  5. Chui, C.K., Li, X., Mhaskar, H.N.: Neural networks for localized approximation. Math. Comput. 63(208), 607–623 (1994)

  6. Chui, C.K., Lin, S.B., Zhou, D.X.: Deep neural networks for rotation-invariance approximation and learning. Anal. Appl. 17, 737–772 (2019)

  7. Cucker, F., Smale, S.: On the mathematical foundations of learning. Bulletin Amer. Math. Soc. 39, 1–49 (2002)

  8. Cucker, F., Zhou, D.X.: Learning theory: an approximation theory viewpoint. Cambridge University Press, Cambridge (2007)

  9. DiBenedetto, E.: Real analysis. Birkhäuser, Boston (2002)

  10. Dong, S.N., Sun, W.C.: Distributed learning and distribution regression of coefficient regularization. J. Approx. Theory 263, 105523 (2021)

  11. Dong, S.N., Sun, W.C.: Learning rate of distribution regression with dependent samples. J. Complex. 101679 (2022)

  12. Fang, Z.Y., Guo, Z.C., Zhou, D.X.: Optimal learning rates for distribution regression. J. Complex. 56, 101426 (2020)

  13. Feng, Y.L., Huang, X.L., Shi, L., Yang, Y.N., Suykens, J.A.K.: Learning with the maximum correntropy criterion induced losses for regression. J. Mach. Learn. Res. 16(30), 993–1034 (2015)

  14. Guo, Z.C., Shi, L.: Optimal rates for coefficient-based regularized regression. Appl. Comput. Harmon. Anal. 47(3), 662–701 (2019)

  15. Guo, Z.C., Lin, S.B., Shi, L.: Distributed learning with multi-penalty regularization. Appl. Comput. Harmon. Anal. 46(3), 478–499 (2019)

  16. Hayakawa, S., Suzuki, T.: On the minimax optimality and superiority of deep neural network learning over sparse parameter spaces. Neural Netw. 123, 343–361 (2020)

  17. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989)

  18. Klusowski, J.M., Barron, A.R.: Approximation by combinations of ReLU and squared ReLU ridge functions with \(\ell ^ 1\) and \(\ell ^ 0\) controls. IEEE Trans. Inf. Theory 64(12), 7649–7656 (2018)

  19. Lin, Y.V., Pinkus, A.: Fundamentality of ridge functions. J. Approx. Theory 75, 295–311 (1993)

  20. Maiorov, V.E.: On best approximation by ridge functions. J. Approx. Theory 99, 68–94 (1999)

  21. Mao, T., Shi, Z.J., Zhou, D.X.: Theory of deep convolutional neural networks III: approximating radial functions. Neural Netw. 144, 778–790 (2021)

  22. Mücke, N.: Stochastic gradient descent meets distribution regression. In: International Conference on Artificial Intelligence and Statistics, pp. 2143–2151. PMLR (2021)

  23. Mhaskar, H.N.: Approximation properties of a multilayered feedforward artificial neural network. Adv. Comput. Math. 1, 61–80 (1993)

  24. Mhaskar, H.N.: Dimension independent bounds for general shallow networks. Neural Netw. 123, 142–152 (2020)

  25. Póczos, B., Singh, A., Rinaldo, A., Wasserman, L.: Distribution-free distribution regression. In: Artificial Intelligence and Statistics, pp. 507–515. PMLR (2013)

  26. Steinwart, I., Christmann, A.: Support vector machines. Springer, New York (2008)

  27. Schmidt-Hieber, J.: Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 48(4), 1875–1897 (2020)

  28. Smale, S., Zhou, D.X.: Shannon sampling and function reconstruction from point values. Bull. Am. Math. Soc. 41(3), 279–305 (2004)

  29. Smale, S., Zhou, D.X.: Learning theory estimates via integral operators and their approximations. Constr. Approx. 26(2), 153–172 (2007)

  30. Szabó, Z., Gretton, A., Póczos, B., Sriperumbudur, B.: Two-stage sampled learning theory on distributions. In: Artificial Intelligence and Statistics, pp. 948–957. PMLR (2015)

  31. Szabó, Z., Sriperumbudur, B.K., Póczos, B., Gretton, A.: Learning theory for distribution regression. J. Mach. Learn. Res. 17, 5272–5311 (2016)

  32. Villani, C.: Optimal transport: old and new, vol. 338. Springer, Berlin (2009)

  33. Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017)

  34. Yu, Z., Ho, D.W.C., Shi, Z.J., Zhou, D.X.: Robust kernel-based distribution regression. Inverse Problems 37(10), 105014 (2021)

  35. Zhao, P., Zhou, Z.H.: Label distribution learning by optimal transport. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1 (2018)

  36. Zhou, D.X.: Universality of deep convolutional neural networks. Appl. Comput. Harmon. Anal. 48(2), 787–794 (2020)

  37. Zhou, D.X.: Deep distributed convolutional neural networks: universality. Anal. Appl. 16, 895–919 (2018)

  38. Zhou, D.X.: Theory of deep convolutional neural networks: Downsampling. Neural Netw. 124, 319–327 (2020)

  39. Zweig, A., Bruna, J.: A functional perspective on learning symmetric functions with neural networks. In: International Conference on Machine Learning, pp. 13023–13032. PMLR (2021)

Acknowledgements

The first version of the paper was written when the second author worked at City University of Hong Kong, supported partially by the Laboratory for AI-Powered Financial Technologies, the Research Grants Council of Hong Kong [Projects #C1013-21GF and #11308121], the Germany/Hong Kong Joint Research Scheme [Project No. G-CityU101/20], the CityU Strategic Interdisciplinary Research Grant [Project No. 7020010], National Science Foundation of China [Project No. 12061160462], and Hong Kong Institute for Data Science. The first author would like to thank Zhongjie Shi for helpful communications and discussions on related topics. The authors would like to thank the anonymous referee for his/her careful review, which helped improve the quality of the paper.

Author information

Corresponding author

Correspondence to Zhan Yu.

Ethics declarations

Conflicts of interest statement

The authors declare no competing interests.

Additional information

Communicated by: Siddhartha Mishra

Appendix

1.1 The sequence factorization lemma

Lemma 1

([36]) Let \(s\ge 2\) and \(W=(W_k)_{k=-\infty }^{\infty }\) be a sequence supported in \(\{0,1,...,\mathcal {M}\}\) with \(\mathcal {M}\ge 0\). Then there exists a finite sequence of filters \(\{\omega ^{(j)}\}_{j=1}^p\) each supported in \(\{0,1,...,s\}\) with \(p\le \lceil \frac{\mathcal {M}}{s-1}\rceil \) such that the following convolutional factorization holds true

$$\begin{aligned}W=\omega ^{(p)}*\omega ^{(p-1)}*\cdots *\omega ^{(2)}*\omega ^{(1)}. \end{aligned}$$
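
As a numerical illustration of this factorization (not part of the proof), the following sketch factors the symbol polynomial \(\widetilde{W}(z)=\sum _{k=0}^{\mathcal {M}} W_k z^k\) of a concrete example sequence into factors of degree at most \(s\) and verifies that convolving the resulting short filters recovers \(W\). The example is chosen so that all roots of the symbol are real; in general, complex-conjugate root pairs must be kept inside the same factor so that the filters stay real.

```python
import numpy as np

# Numerical illustration of Lemma 1 (example data only): factor the symbol
# polynomial W_0 + W_1 z + ... + W_M z^M of a sequence W into factors of
# degree at most s, and verify that convolving the resulting short filters
# recovers W.
s = 2
# The example W has symbol (z-1)(z+1)(z-2)(z+3), so all roots are real; in
# general, complex-conjugate root pairs must be grouped together so that the
# factor filters remain real.
W = np.array([6.0, -1.0, -7.0, 1.0, 1.0])        # W_0,...,W_M with M = 4

roots = np.roots(W[::-1])                        # np.roots expects highest degree first

filters = []
for k in range(0, len(roots), s):                # group the roots into blocks of size <= s
    coeffs = np.real_if_close(np.poly(roots[k:k + s]))[::-1]  # monic factor, low degree first
    filters.append(coeffs)
filters[0] = W[-1] * filters[0]                  # absorb the leading coefficient W_M (here 1)

recovered = filters[0]
for f in filters[1:]:
    recovered = np.convolve(recovered, f)        # convolution of filters = product of symbols
print(np.allclose(recovered, W))                 # expected: True
```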

1.2 The Cauchy bound of polynomial roots and Vieta’s formula of polynomial coefficients

Lemma 2

If \(w=\left\{ w_{j}\right\} _{j \in \mathbb {Z}}\) is a real sequence supported in \(\{0, \ldots , K\}\) with \(w_{K}=1\), then all the complex roots of its symbol \(\widetilde{w}(z)=\sum _{j=0}^{K} w_{j} z^{j}\) are located in the disk of radius \(1+\max _{j=0, \ldots , K-1}\left| w_{j}\right| \), the Cauchy bound of \(\widetilde{w}\). If we factorize \(\widetilde{w}\) into monic polynomials of degree at most s, then all the coefficients of these factor polynomials are bounded by \(s^{\frac{s}{2}}\left( 1+\max _{j=0, \ldots , K-1}\left| w_{j}\right| \right) ^{s} \le \) \(s^{\frac{s}{2}}\left( 1+\Vert w\Vert _{\infty }\right) ^{s}\).
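
The following short numerical check (an illustrative sketch on a hypothetical filter, not part of the lemma) verifies the Cauchy bound for a monic symbol.

```python
import numpy as np

# Quick numerical check of the Cauchy bound in Lemma 2 on a hypothetical
# filter w supported on {0,...,K} with w_K = 1.
w = np.array([1.0, 3.0, -2.0, 0.5, 1.0])          # w_0,...,w_K, K = 4
cauchy_bound = 1.0 + np.max(np.abs(w[:-1]))       # 1 + max_{j<K} |w_j|

roots = np.roots(w[::-1])                          # roots of the symbol polynomial
print(np.max(np.abs(roots)) <= cauchy_bound)       # expected: True
```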

1.3 The covering number bound of the function set \(\mathcal {H}_{J+1}^l\)

Lemma 3

For any \(l\in \{1,2,...,(2N+3)\widetilde{d}\}\) with \(\widetilde{d}=\lfloor (d+J s) / d\rfloor \), the covering number of \(\mathcal {H}_{J+1}^l\) satisfies

$$\begin{aligned}\log \mathcal {N}(\mathcal {H}_{J+1}^l,\epsilon ,\Vert \cdot \Vert _{C(\Omega )})\le \mathcal {A}_1 N \log \frac{\widehat{R}}{\epsilon }+\mathcal {A}_2 N \log N, \ 0<\epsilon \le 1, \end{aligned}$$

where \(\mathcal {A}_1\), \(\mathcal {A}_2\) and \(\widehat{R}\) are defined as in Lemma 6.

Proof

Since the function set \(\mathcal {H}_{J+1}^l\) inherits the substructure up to the \((J+1)\)-th layer of the hypothesis space \(\mathcal {H}_{R, N}^{\mathcal {P}(\Omega )}\), the proof follows procedures similar to those in the proofs of Lemma 4 and Lemma 6.

For any \(l\in \{1,2,...,(2N+3)\widetilde{d}\}\) and any \(H\in \mathcal {H}_{J+1}^l\), choose another function \(\widetilde{H}\in \mathcal {H}_{J+1}^l\) induced by \(\widetilde{\omega }^{(j)}\) for \(j=1,2,...,J\), \(\widetilde{b}^{(j)}\) for \(j=1,2,...,J+1\), and \(\widetilde{F}^{(J+1)}\), subject to the restrictions \(\left\| \omega ^{(j)}-\widetilde{\omega }^{(j)}\right\| _{\infty }\le \epsilon \) for \(j=1,2,...,J\), \(\left\| b^{(j)}-\widetilde{b}^{(j)}\right\| _{\infty }\le \epsilon \) for \(j=1,2,...,J+1\), \(\left\| F^{(J+1)}-\widetilde{F}^{(J+1)}\right\| _{\infty }\le \epsilon \), and \(\left\| c-\widetilde{c}\right\| _{\infty }\le \epsilon \). Then, arguing as in the derivation of (3.18) in Lemma 6, we have

$$\begin{aligned} \nonumber \left\| H-\widetilde{H}\right\| _{C(\Omega )}=&\left\| \left( F^{\left( J+1\right) } \mathfrak {D}_{d}(h^{\left( J\right) }(x))-b^{\left( J+1\right) }\right) _l-\left( \widetilde{F}^{\left( J+1\right) } \mathfrak {D}_{d}(\widetilde{h}^{\left( J\right) }(x))-\widetilde{b}^{\left( J+1\right) }\right) _l\right\| _{C(\Omega )}\\ \nonumber \le&\left\| \left( F^{\left( J+1\right) } \mathfrak {D}_{d}(h^{\left( J\right) }(x))-b^{\left( J+1\right) }\right) -\left( \widetilde{F}^{\left( J+1\right) } \mathfrak {D}_{d}(\widetilde{h}^{\left( J\right) }(x))-\widetilde{b}^{\left( J+1\right) }\right) \right\| _{l^{\infty }(\Omega )}\\ \le&6J^2((s+1)R)^J\epsilon \le 6d^2((s+1)R)^{d-1}\epsilon :=\check{\epsilon }. \end{aligned}$$
(1.1)

We observe that

$$\check{\epsilon }=6d^2((s+1)R)^{d-1}\epsilon <95d^2N^4((s+1)R)^{d+1}\epsilon =\widehat{\epsilon }$$

which is defined in (3.19), and that the number of free parameters up to the \((J+1)\)-th layer of \(\mathcal {H}_{J+1}^{l}\) is no larger than the number of free parameters up to the \((J+2)\)-th layer of \(\mathcal {H}_{R,N}^{\mathcal {P}(\Omega )}\). Hence, by taking an \(\epsilon \)-net of the set of free parameters in \(\omega ^{(j)}\) for \(j=1,2,...,J\), \(b^{(j)}\) for \(j=1,2,...,J+1\), and \(F^{(J+1)}\), we derive that

$$\mathcal {N}\left(\mathcal {H}_{J+1}^l,\check{\epsilon },\Vert \cdot \Vert _{C(\Omega )}\right)\le N^{\mathcal {A}_2N}\left( \frac{\widehat{R}}{\check{\epsilon }}\right) ^{\mathcal {A}_1N},$$

which completes the proof after taking the logarithm.\(\square \)
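
To illustrate the parameter-net argument behind this covering number estimate, the sketch below covers a toy one-parameter function class in the sup norm by taking an \(\epsilon \)-net of its parameter set; the class \(\{x\mapsto \sin (wx): |w|\le 1\}\) and all numerical choices are assumptions made purely for illustration, not objects from the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's construction) of the parameter-net
# argument behind the covering number bound: for the toy class
# {f_w(x) = sin(w x) : |w| <= 1} on Omega = [0, 1], the map w -> f_w is
# 1-Lipschitz into C(Omega), so an eps-net of the parameter interval
# induces an eps-net of the function class in the sup norm.
eps = 0.05
grid = np.linspace(0.0, 1.0, 1001)                 # dense grid approximating Omega

w_net = np.arange(-1.0, 1.0 + eps, eps)            # eps-net of the parameter set [-1, 1]
net_functions = np.sin(np.outer(w_net, grid))      # functions induced by the net

# Any f_w is within eps of some net function, since
# |sin(w x) - sin(w' x)| <= |w - w'| |x| <= |w - w'| on [0, 1].
w_test = np.random.default_rng(1).uniform(-1.0, 1.0, size=100)
test_functions = np.sin(np.outer(w_test, grid))
dists = np.abs(test_functions[:, None, :] - net_functions[None, :, :]).max(axis=2)
print(np.all(dists.min(axis=1) <= eps))            # expected: True
print("covering number at scale eps is at most", len(w_net))
```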

About this article

Cite this article

Yu, Z., Zhou, DX. Deep learning theory of distribution regression with CNNs. Adv Comput Math 49, 51 (2023). https://doi.org/10.1007/s10444-023-10054-y
