
Optimal Rates for Regularization of Statistical Inverse Learning Problems

Abstract

We consider a statistical inverse learning (also called inverse regression) problem, where we observe the image of a function f through a linear operator A at i.i.d. random design points \(X_i\), superposed with additive noise. The distribution of the design points is unknown and can be very general. We analyze simultaneously the direct (estimation of Af) and the inverse (estimation of f) learning problems. In this general framework, we obtain strong and weak minimax optimal rates of convergence (as the number of observations n grows large) for a large class of spectral regularization methods over regularity classes defined through appropriate source conditions. This improves on or completes previous results obtained in related settings. The optimality of the obtained rates is shown not only in the exponent in n but also in the explicit dependence of the constant factor on the noise variance and on the radius of the source condition set.
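
In symbols (generic notation chosen here for this summary; the appendices below use the operators \(\bar{S}_x\), \(\bar{B}\), etc., of the main text), the observation model described above is of regression type:

$$\begin{aligned} Y_i = (A f)(X_i) + \varepsilon _i, \qquad i=1,\ldots ,n, \end{aligned}$$

with \(X_1,\ldots ,X_n\) drawn i.i.d. from the unknown design distribution and \(\varepsilon _i\) centered noise; the direct problem is the estimation of Af, the inverse problem that of f itself.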

Notes

  1. This can be extended to the case where g is only approximated in \(L^2(\nu )\) by a sequence of functions in \({\mathcal H}_K\). For the present discussion, only the case \(g \in {\mathcal H}_K\) is of interest.

References

  1. F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. J. Complexity, 23(1):52–72, 2007.

  2. R. Bhatia. Matrix Analysis. Springer, 1997.

  3. R. Bhatia and J. Holbrook. Fréchet derivatives of the power function. Indiana University Mathematics Journal, 49 (3):1155–1173, 2000.

  4. N. H. Bingham, C. M. Goldie, and J. L. Teugels. Regular Variation, volume 27 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1987.

  5. N. Bissantz, T. Hohage, A. Munk, and F. Ruymgaart. Convergence rates of general regularization methods for statistical inverse problems and applications. SIAM J. Numer. Analysis, 45(6):2610–2636, 2007.

  6. G. Blanchard and N. Krämer. Convergence rates of kernel conjugate gradient for random design regression. Analysis and Applications, 2016.

  7. G. Blanchard and P. Massart. Discussion of “2004 IMS medallion lecture: Local Rademacher complexities and oracle inequalities in risk minimization”, by V. Koltchinskii. Annals of Statistics, 34(6):2664–2671, 2006.

  8. P. Bühlmann and B. Yu. Boosting with the \(l_2\)-loss: Regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003.

  9. A. Caponnetto. Optimal rates for regularization operators in learning theory. Technical report, MIT, 2006.

  10. A. Caponnetto and Y. Yao. Cross-validation based adaptation for regularization operators in learning theory. Analysis and Applications, 8(2):161–183, 2010.

  11. F. Cucker and S. Smale. Best choices for regularization parameters in learning theory: on the bias-variance problem. Foundations of Computational Mathematics, 2(4):413–428, 2002.

  12. E. De Vito and A. Caponnetto. Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2006.

  13. E. De Vito, L. Rosasco, and A. Caponnetto. Discretization error analysis for Tikhonov regularization. Analysis and Applications, 4(1):81–99, 2006.

  14. E. De Vito, L. Rosasco, A. Caponnetto, and U. De Giovannini. Learning from examples as an inverse problem. J. of Machine Learning Research, 6:883–904, 2005.

  15. R. DeVore, G. Kerkyacharian, D. Picard, and V. Temlyakov. Mathematical methods for supervised learning. Foundations of Computational Mathematics, 6(1):3–58, 2006.

  16. L. Dicker, D. Foster, and D. Hsu. Kernel methods and regularization techniques for nonparametric regression: Minimax optimality and adaptation. Technical report, Rutgers University, 2015.

  17. H. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer Academic Publishers, 2000.

  18. K. Fukumizu, F. R. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8:361–383, 2007.

  19. L. L. Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral algorithms for supervised learning. Neural Computation, 20(7):1873–1897, 2008.

  20. F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural network architectures. Neural Computation, 7(2):219–269, 1993.

  21. L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-free Theory of Nonparametric Regression. Springer, 2002.

  22. P. Halmos and V. Sunder. Bounded Integral Operators on \(L^2\)-Spaces. Springer, 1978.

  23. L. Hörmander. The Analysis of Linear Partial Differential Operators I. Springer, 1983.

  24. S. Loustau. Inverse statistical learning. Electron. J. Statist., 7:2065–2097, 2013.

  25. S. Loustau and C. Marteau. Minimax fast rates for discriminant analysis with errors in variables. Bernoulli, 21(1):176–208, 2015.

  26. P. Mathé and S. Pereverzev. Geometry of linear ill-posed problems in variable Hilbert scales. Inverse Problems, 19(3):789, 2003.

  27. S. Mendelson and J. Neeman. Regularization in kernel learning. The Annals of Statistics, 38(1):526–565, 2010.

  28. F. O’Sullivan. Convergence characteristics of methods of regularization estimators for nonlinear operator equations. SIAM J. Numer. Anal., 27(6):1635–1649, 1990.

  29. I. F. Pinelis and A. I. Sakhanenko. Remarks on inequalities for probabilities of large deviations. Theory Probab. Appl., 30(1):143–148, 1985.

  30. S. Smale and D. Zhou. Shannon sampling II: Connections to learning theory. Appl. Comput. Harmon. Analysis, 19(3):285–302, 2005.

  31. S. Smale and D. Zhou. Learning theory estimates via integral operators and their approximation. Constructive Approximation, 26(2):153–172, 2007.

  32. I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.

  33. I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. Proceedings of the 22nd Annual Conference on Learning Theory, pages 79–93, 2009.

  34. V. Temlyakov. Approximation in learning theory. Constructive Approximation, 27(1):33–74, 2008.

  35. A. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.

  36. G. Wahba. Spline Models for Observational Data, volume 59. SIAM CBMS-NSF Series in Applied Mathematics, 1990.

  37. C. Wang and D.-X. Zhou. Optimal learning rates for least squares regularized regression with unbounded sampling. Journal of Complexity, 27(1):55–67, 2011.

  38. Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.

Author information

Corresponding author

Correspondence to Nicole Mücke.

Additional information

Communicated by Tomaso Poggio.

This research was supported by the DFG via Research Unit 1735 Structural Inference in Statistics.

Appendices

Appendix A: Proof of Concentration Inequalities

Proposition A.1

Let \((Z, {\mathcal B}, {\mathbb {P}})\) be a probability space and \(\xi \) a random variable on Z with values in a real separable Hilbert space \({\mathcal H}\). Assume that there are two positive constants L and \(\sigma \) such that for any \(m\ge 2\)

$$\begin{aligned} {\mathbb {E}}\big [ \left\| \xi - {\mathbb {E}}[\xi ] \right\| _{{\mathcal H}}^m \big ] \le \frac{1}{2}m!\sigma ^2L^{m-2}. \end{aligned}$$
(A.1)

If the sample \(z_1,\ldots ,z_n\) is drawn i.i.d. from Z according to \({\mathbb {P}}\), then, for any \(0<\eta <1\), with probability greater than \(1-\eta \)

$$\begin{aligned} \Big \Vert \frac{1}{n}\sum _{j=1}^n\xi (z_j)-{\mathbb {E}}[\xi ] \Big \Vert _{\mathcal H} \le \delta (n,\eta ), \end{aligned}$$
(A.2)

where

$$\begin{aligned} \delta (n, \eta ) := 2 \log (2\eta ^{-1}) \left( \frac{L}{n} + \frac{\sigma }{\sqrt{n}} \right) . \end{aligned}$$

In particular, (A.1) holds if

$$\begin{aligned} \left\| \xi (z) \right\| _{{\mathcal H}}&\le \frac{L}{2} \quad \mathrm{a.s.},\\ {\mathbb {E}}\big [ \left\| \xi \right\| ^2_{{\mathcal H}} \big ]&\le \sigma ^2. \end{aligned}$$

Proof

See [9, 10]; the statement follows from the original result of [29, Corollary 1]. \(\square \)
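
For the "in particular" part, here is the standard one-line verification (added for completeness): if \(\Vert \xi (z)\Vert _{{\mathcal H}} \le L/2\) a.s., then \(\Vert \xi - {\mathbb {E}}[\xi ]\Vert _{{\mathcal H}} \le L\) a.s., and hence, for \(m\ge 2\),

$$\begin{aligned} {\mathbb {E}}\big [ \left\| \xi - {\mathbb {E}}[\xi ] \right\| _{{\mathcal H}}^m \big ] \le L^{m-2}\, {\mathbb {E}}\big [ \left\| \xi - {\mathbb {E}}[\xi ] \right\| _{{\mathcal H}}^2 \big ] \le L^{m-2}\, {\mathbb {E}}\big [ \left\| \xi \right\| _{{\mathcal H}}^2 \big ] \le \sigma ^2 L^{m-2} \le \frac{1}{2}m!\,\sigma ^2 L^{m-2}, \end{aligned}$$

using \({\mathbb {E}}\Vert \xi - {\mathbb {E}}[\xi ]\Vert ^2_{{\mathcal H}} = {\mathbb {E}}\Vert \xi \Vert ^2_{{\mathcal H}} - \Vert {\mathbb {E}}[\xi ]\Vert ^2_{{\mathcal H}} \le {\mathbb {E}}\Vert \xi \Vert ^2_{{\mathcal H}}\) and \(m!\ge 2\).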

Proof of Proposition 5.2

Define \(\xi _1: {\mathcal X}\times {\mathbb {R}}\longrightarrow {\mathcal H}_1\) by

$$\begin{aligned} \xi _1 (x,y)&:= (\bar{B} + \lambda )^{-1/2} (y-\bar{S}_{x}f_{\rho }) \bar{F}_x \\&= (\bar{B} + \lambda )^{-1/2} \bar{S}_{x}^\star (y-\bar{S}_{x}f_{\rho }). \end{aligned}$$

Abusing notation we also denote \(\xi _1\) the random variable \(\xi _1(X,Y)\) where \((X,Y) \sim \rho \). The model assumption 2.10 implies

$$\begin{aligned} {\mathbb {E}}[ \xi _1 ]&= (\bar{B} + \lambda )^{-1/2} \int _{{\mathcal X}} \bar{F}_x\int _{{\mathbb {R}}}(y-\bar{S}_{x}f_{\rho }) \rho (\mathrm{d}y|x)\nu (\mathrm{d}x) \\&= (\bar{B} + \lambda )^{-1/2} \int _{{\mathcal X}} \bar{F}_x(\bar{S}_{x}f_{\rho }- \bar{S}_{x}f_{\rho }) \nu (\mathrm{d}x)\\&= 0, \end{aligned}$$

and therefore

$$\begin{aligned} \frac{1}{n}\sum _{j=1}^n\xi _1 (x_j, y_j)- {\mathbb {E}}[\xi _1 ]&= \frac{1}{n}\sum _{j=1}^n (\bar{B} + \lambda )^{-1/2} ( y_{j}- \bar{S}_{x_j}f_{\rho }) \bar{F}_{x_j} \\&= (\bar{B} + \lambda )^{-1/2} \bar{S}_{\mathbf{x}}^{\star } \left( \mathbf{y}- \bar{S}_{\mathbf{x}}f_{\rho }\right) \\&= (\bar{B} + \lambda )^{-1/2}\left( \bar{S}_{\mathbf{x}}^{\star }\mathbf{y}- \bar{B}_{\mathbf{x}}f_{\rho }\right) . \end{aligned}$$

Moreover, by Assumption 2.11, for \(m\ge 2\):

$$\begin{aligned} {\mathbb {E}}[\left\| \xi _1\right\| ^m_{{\mathcal H}_1}]&= \int _{{\mathcal X}\times {\mathbb {R}}} \left\| (\bar{B} + \lambda )^{-1/2} \bar{S}_{x}^\star (y-\bar{S}_{x}f_{\rho })\right\| ^m_{{\mathcal H}_1} \rho (\mathrm{d}x,\mathrm{d}y)\\&= \int _{{\mathcal X}\times {\mathbb {R}}} | \langle \bar{S}_{x}(\bar{B} + \lambda )^{-1} \bar{S}_{x}^\star (y-\bar{S}_{x}f_{\rho }), (y-\bar{S}_{x}f_{\rho }) \rangle _{{\mathbb {R}}} |^{\frac{m}{2}} \rho (\mathrm{d}x,\mathrm{d}y)\\&\le \int _{{\mathcal X}} \left\| \bar{S}_{x}(\bar{B} + \lambda )^{-1} \bar{S}_{x}^\star \right\| ^{\frac{m}{2}} \int _{{\mathbb {R}}} \left|y-\bar{S}_{x}f_{\rho } \right|^m \rho (\mathrm{d}y|x) \nu (\mathrm{d}x) \\&\le \frac{1}{2}m! \sigma ^2 M^{m-2} \int _{{\mathcal X}} \left\| \bar{S}_{x}(\bar{B} + \lambda )^{-1} \bar{S}_{x}^\star \right\| ^{\frac{m}{2}} \nu (\mathrm{d}x). \end{aligned}$$

Setting \(A:=\bar{S}_{x}(\bar{B} + \lambda )^{-1/2}\), we have using \(\left\| \cdot \right\| \le \mathrm {tr}(\cdot )\) for positive operators:

$$\begin{aligned} \left\| \bar{S}_{x}(\bar{B} + \lambda )^{-1} \bar{S}_{x}^\star \right\| ^{\frac{m}{2}}&= \left\| AA^\star \right\| ^{\frac{m}{2}} \le \left\| AA^\star \right\| ^{\frac{m}{2}-1}\mathrm {tr}(AA^\star ) \\&= \left\| AA^\star \right\| ^{\frac{m}{2}-1} \mathrm {tr}(A^\star A). \end{aligned}$$

Firstly, observe that

$$\begin{aligned} \left\| AA^\star \right\| ^{\frac{m}{2}-1} \le \left( \frac{1}{\lambda }\right) ^{\frac{m}{2}-1}, \end{aligned}$$

since our main assumption 2.1 implies \(\left\| \bar{S}_x\right\| \le 1\). Secondly, by linearity of \(\mathrm {tr}(\cdot )\),

$$\begin{aligned} \int _{{\mathcal X}} \mathrm {tr}(A^\star A) \nu (dx) = \int _{{\mathcal X}} \mathrm {tr}( (\bar{B} + \lambda )^{-1/2} \bar{B}_{x}(\bar{B} + \lambda )^{-1/2} ) \nu (\mathrm{d}x) = {\mathcal N}(\lambda ). \end{aligned}$$

Thus,

$$\begin{aligned} {\mathbb {E}}[\left\| \xi _1\right\| ^m_{{\mathcal H}_1}] \le \frac{1}{2}m! (\sigma \sqrt{{\mathcal N}(\lambda ) })^2 \left( \frac{M}{\sqrt{\lambda }}\right) ^{m-2} . \end{aligned}$$

As a result, Proposition A.1 implies with probability at least \(1-\eta \)

$$\begin{aligned} \left\| (\bar{B} + \lambda )^{-1/2} ( \bar{B}_{\mathbf{x}}f_{\rho }-\bar{S}_{\mathbf{x}}^{\star }\mathbf{y}) \right\| _{{\mathcal H}_1} \le \delta _1\left( n,\eta \right) , \end{aligned}$$

where

$$\begin{aligned} \delta _1(n,\eta ) = 2\log (2\eta ^{-1}) \left( \frac{M}{n\sqrt{\lambda }} + \frac{\sigma }{\sqrt{ n}}\sqrt{{\mathcal N}(\lambda )}\right) . \end{aligned}$$

For the second part of the proposition, we introduce similarly

$$\begin{aligned} \xi '_1(x,y):= (y-\bar{S}_x f_{\rho }) \bar{F}_x = \bar{S}^\star _x (y-\bar{S}_x f_{\rho }), \end{aligned}$$

which satisfies

$$\begin{aligned} {\mathbb {E}}[ \xi _1']=0; \quad \frac{1}{n}\sum _{j=1}^n\xi '_1 (x_j, y_j)- {\mathbb {E}}[\xi '_1 ]= \bar{S}_{\mathbf{x}}^{\star }\mathbf{y}- \bar{B}_{\mathbf{x}}f_{\rho }, \end{aligned}$$

and

$$\begin{aligned} {\mathbb {E}}\big [\left\| \xi '_1\right\| ^m_{{\mathcal H}_1}\big ] \le {\mathbb {E}}\big [ \left|y-\bar{S}_x f_{\rho } \right|^m\big ] \le \frac{1}{2}m! \sigma ^2 M^{m-2}. \end{aligned}$$

Applying Proposition A.1 yields the result. \(\square \)
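
Spelling out this last application (an added remark; the exact formulation is that of Proposition 5.2 in the main text): since \({\mathbb {E}}[\xi '_1]=0\), centered and uncentered moments coincide, so Proposition A.1 applies with \(L=M\) and variance proxy \(\sigma \) and yields, with probability at least \(1-\eta \),

$$\begin{aligned} \left\| \bar{S}_{\mathbf{x}}^{\star }\mathbf{y}- \bar{B}_{\mathbf{x}}f_{\rho } \right\| _{{\mathcal H}_1} \le 2\log (2\eta ^{-1}) \left( \frac{M}{n} + \frac{\sigma }{\sqrt{n}} \right) . \end{aligned}$$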

Proof of Proposition 5.3

We proceed as above by defining \(\xi _2: {\mathcal X}\longrightarrow {\mathrm {HS}}({\mathcal H}_1)\) (the latter denoting the space of Hilbert–Schmidt operators on \({\mathcal H}_1\)) by

$$\begin{aligned} \xi _2(x):= (\bar{B}+\lambda )^{-1} \bar{B}_x, \end{aligned}$$

where \(\bar{B}_x := \bar{F}_x \otimes \bar{F}_x^\star \). We also use the same notation \(\xi _2\) for the random variable \(\xi _2(X)\) with \(X\sim \nu \). Then,

$$\begin{aligned} {\mathbb {E}}[\xi _2] = (\bar{B}+\lambda )^{-1} \int _{{\mathcal X}} \bar{B}_x \nu (\mathrm{d}x) = (\bar{B}+\lambda )^{-1}\bar{B}, \end{aligned}$$

and therefore

$$\begin{aligned} \frac{1}{n}\sum _{j=1}^n\xi _2(x_j) - {\mathbb {E}}[\xi _2]= (\bar{B}+\lambda )^{-1}(\bar{B}- \bar{B}_{\mathbf{x}}). \end{aligned}$$

Furthermore, since \(\bar{B}_x\) is of trace class and \((\bar{B}+\lambda )^{-1}\) is bounded, we have using Assumption 2.1

$$\begin{aligned} \left\| \xi _2 (x)\right\| _{{\mathrm {HS}}} \le \left\| (\bar{B}+\lambda )^{-1}\right\| \left\| \bar{B}_x\right\| _{{\mathrm {HS}}} \le \lambda ^{-1} =: L_2/2, \end{aligned}$$

uniformly for any \(x \in {\mathcal X}\). Moreover,

$$\begin{aligned} {\mathbb {E}}\big [ \left\| \xi _2\right\| ^2_{{\mathrm {HS}}} \big ]&= \int _{{\mathcal X}} \mathrm {tr}( \bar{B}_x(\bar{B}+\lambda )^{-2}\bar{B}_x ) \nu (\mathrm{d}x) \\&\le \left\| (\bar{B}+\lambda )^{-1}\right\| \int _{{\mathcal X}} \left\| \bar{B}_x\right\| \mathrm {tr}((\bar{B}+\lambda )^{-1} \bar{B}_x )\nu (\mathrm{d}x) \\&\le \frac{{\mathcal N}(\lambda )}{\lambda } = :\sigma _2^2. \end{aligned}$$

Thus, Proposition A.1 applies and gives with probability at least \(1-\eta \)

$$\begin{aligned} \left\| (\bar{B}+\lambda )^{-1}(\bar{B}- \bar{B}_{\mathbf{x}}) \right\| _{{\mathrm {HS}}} \le \delta _2 (n,\eta ) \end{aligned}$$

with

$$\begin{aligned} \delta _2(n,\eta ) = 2\log (2\eta ^{-1}) \left( \frac{2}{n \lambda } + \sqrt{\frac{{\mathcal N}(\lambda )}{n\lambda }} \right) . \end{aligned}$$

\(\square \)

Proof of Proposition 5.4

We write the Neumann series identity

$$\begin{aligned} (\bar{B}_{\mathbf{x}} + \lambda )^{-1}(\bar{B} + \lambda ) = (I - H_{\mathbf{x}}(\lambda ) )^{-1} = \sum _{j=0}^{\infty } H_{\mathbf{x}}(\lambda )^j, \end{aligned}$$

with

$$\begin{aligned} H_{\mathbf{x}}(\lambda ) = (\bar{B}+ \lambda )^{-1}(\bar{B} - \bar{B}_{\mathbf{x}}). \end{aligned}$$

It is well known that the series converges in norm provided that \(\left\| H_{\mathbf{x}}(\lambda ) \right\| < 1 \). In fact, applying Proposition 5.3 gives with probability at least \(1-\eta \):

$$\begin{aligned} \left\| H_{\mathbf{x}}(\lambda ) \right\| \le 2\log (2\eta ^{-1}) \left( \frac{2}{n \lambda } + \sqrt{\frac{{\mathcal N}(\lambda )}{n\lambda }} \right) . \end{aligned}$$

Put \(C_\eta := 2\log (2\eta ^{-1}) >1\) for any \(\eta \in (0,1)\). Assumption (5.2) reads \(\sqrt{n\lambda } \ge 4 C_\eta \sqrt{\max ({\mathcal N}(\lambda ),1)}\), implying \(\sqrt{n\lambda } \ge 4 C_\eta \ge 4\) and therefore \(\frac{2}{n\lambda } \le \frac{1}{2\sqrt{n\lambda }} \le \frac{1}{8C_\eta }\), as well as \(\sqrt{{\mathcal N}(\lambda )/(n\lambda )} \le \frac{1}{4C_\eta }\); hence

$$\begin{aligned} C_\eta \left( \frac{2}{n \lambda } + \sqrt{\frac{{\mathcal N}(\lambda )}{n\lambda }} \right) \le C_\eta \left( \frac{1}{8C_\eta } + \frac{1}{4C_\eta }\right) < \frac{1}{2}. \end{aligned}$$
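
Since this bound is strictly smaller than 1, the series indeed converges, and (this is the step implicitly used to conclude)

$$\begin{aligned} \big \Vert (I - H_{\mathbf{x}}(\lambda ))^{-1} \big \Vert \le \sum _{j=0}^{\infty } \big \Vert H_{\mathbf{x}}(\lambda ) \big \Vert ^{j} = \frac{1}{1-\Vert H_{\mathbf{x}}(\lambda )\Vert } < 2 \qquad \text {whenever } \Vert H_{\mathbf{x}}(\lambda )\Vert < \frac{1}{2}. \end{aligned}$$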

Thus, with probability at least \(1-\eta \):

$$\begin{aligned} \left\| (\bar{B}_{\mathbf{x}} + \lambda )^{-1}(\bar{B} + \lambda ) \right\| \le 2. \end{aligned}$$

\(\square \)

Proof of Proposition 5.5

Defining \(\xi _3: {\mathcal X}\longrightarrow {\mathrm {HS}}({\mathcal H}_1)\) by

$$\begin{aligned} \xi _3(x):= \bar{F}_x \otimes \bar{F}_x^\star = \bar{B}_x \end{aligned}$$

and denoting also, as before, \(\xi _3\) for the random variable \(\xi _3(X)\) (with \(X\sim \nu \)), we have \({\mathbb {E}}[\xi _3] = \bar{B} \) and therefore

$$\begin{aligned} \frac{1}{n}\sum _{j=1}^n\xi _3(x_j) - {\mathbb {E}}[\xi _3]= (\bar{B}_{\mathbf{x}}- \bar{B}). \end{aligned}$$

Furthermore, by Assumption 2.1

$$\begin{aligned} \left\| \xi _3 (x)\right\| _{{\mathrm {HS}}} = \left\| \bar{F}_x\right\| ^2 \le 1 =: \frac{L_3}{2} \quad \mathrm{a.s.}, \end{aligned}$$

also leading to \({\mathbb {E}}[ \left\| \xi _3\right\| ^2_{{\mathrm {HS}}} ] \le 1 =: \sigma _3^2\). Thus, Proposition A.1 applies and gives with probability at least \(1-\eta \)

$$\begin{aligned} \left\| \bar{B}- \bar{B}_{\mathbf{x}} \right\| _{{\mathrm {HS}}} \le 6 \log (2\eta ^{-1}) \frac{1}{\sqrt{n}}. \end{aligned}$$
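
Here Proposition A.1 has been applied with \(L_3 = 2\) and \(\sigma _3 = 1\); the constant 6 comes from the elementary bound (valid since \(n\ge 1\))

$$\begin{aligned} 2\log (2\eta ^{-1}) \left( \frac{2}{n} + \frac{1}{\sqrt{n}}\right) \le 2\log (2\eta ^{-1})\, \frac{3}{\sqrt{n}} = 6 \log (2\eta ^{-1}) \frac{1}{\sqrt{n}}. \end{aligned}$$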

\(\square \)

Appendix B: Perturbation Result

The estimate of the following proposition is crucial for proving the upper bound when the source condition is of Hölder type r with \(r>1\). We remark that for \(r>1\) the function \(t \mapsto t^r\) is not operator monotone. One might naively expect estimate (B.1) to hold with a constant C given by the Lipschitz constant of the scalar function \(t^r\); as shown in [3], this is false even for finite-dimensional positive matrices. The point of Proposition B.1 is that (B.1) still holds for some larger constant depending on r and on the upper bound of the spectrum. We do not expect this result to be particularly novel, but we could not track down a proof in the literature, where occasionally erroneous statements about related issues can be found. For this reason we provide a self-contained proof for completeness.

Proposition B.1

Let \(B_1, B_2\) be two nonnegative self-adjoint operators on some Hilbert space with \(||B_j|| \le a\), \(j=1,2\), for some \(a>0\). Assume that \(B_1\) and \(B_2\) belong to the Schatten class \({\mathcal S}^p\) for some \(1 \le p \le \infty \). If \(r > 1\), then

$$\begin{aligned} ||B_1^r-B_2^r||_p \le C_{r}a^{r-1} || B_1 - B_2 ||_p , \end{aligned}$$
(B.1)

for some \(C_{r}<\infty \). This inequality also holds in operator norm for non-compact bounded (nonnegative and self-adjoint) \(B_j\).

Proof

We extend the proof of [18], given there for the case \(r=3/2\) and in operator norm. We also restrict ourselves to the case \(a=1\); the general case follows by applying the result to \(B_j/a\), which yields the factor \(a^{r-1}\) in (B.1). On \({\mathcal D}:=\{z: |z| \le 1 \}\), we consider the functions \(f(z)=(1-z)^{r}\) and \(g(z)=(1-z)^{r-1}\). The proof is based on the power series expansions

$$\begin{aligned} f(z)=\sum _{n= 0}^{\infty }b_nz^n \quad \text{ and } \quad g(z)= \sum _{n = 0}^{\infty }c_nz^n, \end{aligned}$$

which converge absolutely on \({\mathcal D}\). To ensure absolute convergence on the boundary \(|z|=1\), notice that

$$\begin{aligned} c_n=\frac{1}{n!}g^{(n)}(0) = \frac{(-1)^n}{n!}\prod _{j=1}^{n}(r-j), \end{aligned}$$

so that all coefficients \(c_n\) with \(n\ge r\) have the same sign \(s := (-1)^{\lfloor r \rfloor }\) (if r is an integer, these coefficients vanish, which does not alter the argument below). This implies, for any \(N>r\):

$$\begin{aligned} \sum _{n=0}^N |c_n| =&\sum _{n=0}^{\lfloor r \rfloor }|c_n| + s \sum _{n=\lfloor r \rfloor +1}^Nc_n = \sum _{n=0}^{\lfloor r \rfloor }|c_n| + s \lim _{z \nearrow 1}\sum _{n=\lfloor r \rfloor +1}^Nc_nz^n \\ \le&\sum _{n=0}^{\lfloor r \rfloor }|c_n| + s\lim _{z \nearrow 1}\Big ( g(z)- \sum _{n=0}^{\lfloor r \rfloor } c_n \Big ) \\ =&2\sum _{i=0}^{ \lfloor r/2 \rfloor }|c_{\lfloor r \rfloor - 2i} | . \end{aligned}$$

A bound for \(\sum _n |b_n|\) can be derived analogously. Since \(f(1-B_j) = B_j^r\), we obtain

$$\begin{aligned} \Vert B_1^r- B_2^r \Vert _p \le \sum _{n=0}^{\infty } |b_n| \Vert (I-B_1)^n - (I-B_2)^n \Vert _p. \end{aligned}$$

Using the algebraic identity \( T^{n+1} - S^{n+1} = T(T^n-S^n) + (T-S)S^n \), the triangle inequality, and the bound \(\Vert TS\Vert _p \le \Vert T\Vert \Vert S\Vert _p\) for \(S \in {\mathcal S}^p\) and T bounded, a straightforward induction shows that

  • for \(j=1,2\), \(B_j \in {\mathcal S}^p\) implies \((I-B_1)^n - (I-B_2)^n \in {\mathcal S}^p\), and

  • \(\Vert (I-B_1)^n - (I-B_2)^n \Vert _p \le n\Vert B_1 - B_2 \Vert _p\).

From \(f'(z)=-r g(z)\) we have the relation \(|b_n|= \frac{r}{n}|c_{n-1}|\), \(n \ge 1\). Collecting all pieces leads to

$$\begin{aligned} \Vert B_1^r- B_2^r \Vert _p \le \Vert B_1 - B_2\Vert _p \sum _{n=0}^{\infty }n|b_n| = r \Vert B_1 - B_2\Vert _p \sum _{n=0}^{\infty }|c_n|. \end{aligned}$$

\(\square \)
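
As a concrete illustration (a numerical value implied by the proof above, not claimed to be optimal): for \(r=3/2\) and \(a=1\), the case considered in [18], one has \(g(z)=(1-z)^{1/2}\), so \(c_0 = 1\) and \(c_n < 0\) for all \(n\ge 1\), whence

$$\begin{aligned} \sum _{n\ge 1}|c_n| = -\sum _{n\ge 1}c_n = c_0 - g(1) = 1, \qquad \text {so that} \qquad \Vert B_1^{3/2}-B_2^{3/2} \Vert _p \le \frac{3}{2} \Big ( \sum _{n\ge 0}|c_n| \Big ) \Vert B_1 - B_2 \Vert _p = 3\, \Vert B_1-B_2 \Vert _p . \end{aligned}$$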

Appendix C: Auxiliary Technical Lemmata

Lemma C.1

Let X be a nonnegative real random variable such that the following holds:

$$\begin{aligned} {\mathbb {P}}[X > F(t)] \le t, \text { for all } t \in (0,1], \end{aligned}$$
(C.1)

where F is a monotone non-increasing function \((0,1] \rightarrow {\mathbb {R}}_+\). Then

$$\begin{aligned} {\mathbb {E}}[X] \le \int _{0}^1 F(u) \mathrm{d}u. \end{aligned}$$

Proof

An intuitive, non-rigorous proof is as follows. Let G be the tail distribution function of X, then it is well known that \({\mathbb {E}}[X] = \int _{{\mathbb {R}}_+} G\). Now it seems clear that \(\int _{{\mathbb {R}}_+} G = \int _0^1 G^{-1} \), where \(G^{-1}\) is the upper quantile function for X. Finally, F is an upper bound on \(G^{-1}\).

Now for a rigorous proof, we can assume without loss of generality that F is left continuous: replacing F by its left limit at every point of (0, 1] can only make it larger, since it is non-increasing, so (C.1) is still satisfied; moreover, since a monotone function has at most countably many discontinuity points, this operation does not change the value of the integral \(\int _0^1 F\). Define the following pseudo-inverse for \(x\in {\mathbb {R}}_+\):

$$\begin{aligned} F^{\dagger }(x) := \inf \left\{ t\in (0,1] : F(t) < x\right\} , \end{aligned}$$

with the convention \(\inf \emptyset = 1\). Denote \(\widetilde{U} := F^{\dagger }(X)\). From the definition of \(F^\dagger \) and the monotonicity of F it holds that \(F^{\dagger }(x) < t \Rightarrow x>F(t)\) for all \((x,t) \in {\mathbb {R}}_+ \times (0,1]\). Hence, for any \(t\in (0,1]\)

$$\begin{aligned} {\mathbb {P}}[ \widetilde{U} < t] \le {\mathbb {P}}[ X > F(t) ] \le t, \end{aligned}$$

implying that for all \(t \in [0,1]\), \({\mathbb {P}}[ \widetilde{U} \le t] \le t\), i.e., \(\widetilde{U}\) is stochastically larger than a uniform variable on [0, 1]. Furthermore, by left continuity of F, one can readily check that \(F(F^\dagger (x)) \ge x\) if \(x \le F(0)\). Since \({\mathbb {P}}[X > F(0)] = 0\), we can replace X by \(\widetilde{X} := \min (X,F(0))\) without changing its distribution (nor that of \(\widetilde{U}\)). With this modification, it then holds that \( F(\widetilde{U}) = F(F^\dagger (\widetilde{X})) \ge \widetilde{X}\). Hence,

$$\begin{aligned} {\mathbb {E}}[X] = {\mathbb {E}}[\widetilde{X}] \le {\mathbb {E}}[F(\widetilde{U})] \le {\mathbb {E}}[F(U)] = \int _0^1 F(u) \mathrm{d}u, \end{aligned}$$

where U is a uniform variable on [0, 1]; the second inequality holds since F is non-increasing and \(\widetilde{U}\) is stochastically larger than U. \(\square \)
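
A simple instance, included here for illustration: if (C.1) holds with \(F(t) = a + b \log t^{-1}\) for all \(t \in (0,1]\) (with \(a,b \ge 0\)), then Lemma C.1 gives

$$\begin{aligned} {\mathbb {E}}[X] \le \int _0^1 \big ( a + b \log u^{-1} \big ) \mathrm{d}u = a + b\, \Gamma (2) = a + b. \end{aligned}$$

Corollary C.2 below refines this computation to moments of order p and to bounds that switch at the level \(t_0\).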

Corollary C.2

Let X be a nonnegative random variable and \(t_0 \in (0,1)\) such that the following holds:

$$\begin{aligned} {\mathbb {P}}[X > a + b \log t^{-1}] \le t, \text { for all } t \in (t_0,1], \text { and } \end{aligned}$$
(C.2)
$$\begin{aligned} {\mathbb {P}}[X > a' + b' \log t^{-1}] \le t, \text { for all } t \in (0,1], \end{aligned}$$
(C.3)

where \(a,b,a',b'\) are nonnegative numbers. Then, for any positive \(p\le \frac{1}{2} \log t_0^{-1}\):

$$\begin{aligned} {\mathbb {E}}[X^p] \le C_p \left( a^p + b^p \Gamma (p+1) + t_0 \left( (a')^p + 2(b' \log t_0^{-1})^p \right) \right) , \end{aligned}$$

with \(C_p:= \max (2^{p-1},1)\).

Proof

Let \(F(t) := {\mathbf {1}\{t \in (t_0,1]\}} (a + b \log t^{-1}) + {\mathbf {1}\{t \in (0,t_0]\}} (a' + b' \log t^{-1})\). Then F is nonnegative, non-increasing on (0, 1] and

$$\begin{aligned} {\mathbb {P}}[ X^p > F^p(t)] \le t \end{aligned}$$

for all \(t \in (0,1]\). Applying Lemma C.1, we find

$$\begin{aligned} {\mathbb {E}}[X^p] \le \int _0^{t_0} (a' + b' \log t^{-1})^p \mathrm{d}t + \int _{t_0}^1 (a + b \log t^{-1})^p \mathrm{d}t. \end{aligned}$$
(C.4)

Using \((x+y)^{p} \le C_p (x^{p}+y^{p})\) for \(x,y \ge 0\), where \(C_p = \max (2^{p-1},1)\), we upper bound the second integral in (C.4) via

$$\begin{aligned} \int _{t_0}^1 (a + b \log t^{-1})^p \mathrm{d}t \le C_p \left( a^p + b^p \int _0^1 (\log t^{-1})^p \mathrm{d}t\right) = C_p \left( a^p + b^p \Gamma (p+1)\right) . \end{aligned}$$

Concerning the first integral in (C.4), we write similarly

$$\begin{aligned}&\int ^{t_0}_0 (a' + b' \log t^{-1})^p \mathrm{d}t \le C_p \left( t_0 (a')^p + (b')^p \int _0^{t_0} (\log t^{-1})^p \mathrm{d}t\right) \\&\quad = C_p \left( t_0 (a')^p + (b')^p \Gamma (p+1,\log t_0^{-1})\right) , \end{aligned}$$

by the change of variable \(u = \log t^{-1}\), where \(\Gamma \) is the incomplete gamma function. We use the following coarse bound: It can be checked that \(t\mapsto t^p\mathrm{e}^{-\frac{t}{2}}\) is decreasing for \(t \ge 2p\), hence putting \(x:= \log t_0^{-1}\),

$$\begin{aligned} \Gamma (p+1,x) = \int _{x}^\infty t^p \mathrm{e}^{-t}\mathrm{d}t \le x^p \mathrm{e}^{-\frac{x}{2}} \int _x^\infty \mathrm{e}^{-\frac{t}{2}} \mathrm{d}t = 2 x^p \mathrm{e}^{-x} = 2 t_0 (\log t_0^{-1})^{p}, \end{aligned}$$

provided \(x = \log t_0^{-1} \ge 2p\). Collecting all the above pieces we get the conclusion. \(\square \)

Cite this article

Blanchard, G., Mücke, N. Optimal Rates for Regularization of Statistical Inverse Learning Problems. Found Comput Math 18, 971–1013 (2018). https://doi.org/10.1007/s10208-017-9359-7
