Abstract
We consider a statistical inverse learning (also called inverse regression) problem, where we observe the image of a function f through a linear operator A at i.i.d. random design points \(X_i\), corrupted by additive noise. The distribution of the design points is unknown and can be very general. We analyze simultaneously the direct (estimation of Af) and the inverse (estimation of f) learning problems. In this general framework, we obtain strong and weak minimax optimal rates of convergence (as the number of observations n grows large) for a large class of spectral regularization methods over regularity classes defined through appropriate source conditions. This improves on or completes previous results obtained in related settings. The optimality of the obtained rates is shown not only in the exponent in n but also in the explicit dependence of the constant factor on the noise variance and on the radius of the source condition set.
Notes
This can be extended to the case where g is only approximated in \(L^2(\nu )\) by a sequence of functions in \({\mathcal H}_K\). For the present discussion, only the case \(g \in {\mathcal H}_K\) is of interest.
References
F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. J. Complexity, 23(1):52–72, 2007.
R. Bhatia. Matrix Analysis. Springer, 1997.
R. Bhatia and J. Holbrook. Fréchet derivatives of the power function. Indiana University Mathematics Journal, 49 (3):1155–1173, 2000.
N. H. Bingham, C. M. Goldie, and J. L. Teugels. Regular Variation, volume 27 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1987.
N. Bissantz, T. Hohage, A. Munk, and F. Ruymgaart. Convergence rates of general regularization methods for statistical inverse problems and applications. SIAM J. Numer. Analysis, 45(6):2610–2636, 2007.
G. Blanchard and N. Krämer. Convergence rates of kernel conjugate gradient for random design regression. Analysis and Applications, 2016.
G. Blanchard and P. Massart. Discussion of “2004 IMS medallion lecture: Local Rademacher complexities and oracle inequalities in risk minimization”, by V. Koltchinskii. Annals of Statistics, 34(6):2664–2671, 2006.
P. Bühlmann and B. Yu. Boosting with the \(l_2\)-loss: Regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003.
A. Caponnetto. Optimal rates for regularization operators in learning theory. Technical report, MIT, 2006.
A. Caponnetto and Y. Yao. Cross-validation based adaptation for regularization operators in learning theory. Analysis and Applications, 8(2):161–183, 2010.
F. Cucker and S. Smale. Best choices for regularization parameters in learning theory: on the bias-variance problem. Foundations of Computational Mathematics, 2(4):413–428, 2002.
E. De Vito and A. Caponnetto. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2006.
E. De Vito, L. Rosasco, and A. Caponnetto. Discretization error analysis for Tikhonov regularization. Analysis and Applications, 4(1):81–99, 2006.
E. De Vito, L. Rosasco, A. Caponnetto, and U. De Giovannini. Learning from examples as an inverse problem. J. of Machine Learning Research, 6:883–904, 2005.
R. DeVore, G. Kerkyacharian, D. Picard, and V. Temlyakov. Mathematical methods for supervised learning. Foundations of Computational Mathematics, 6(1):3–58, 2006.
L. Dicker, D. Foster, and D. Hsu. Kernel methods and regularization techniques for nonparametric regression: Minimax optimality and adaptation. Technical report, Rutgers University, 2015.
H. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer Academic Publishers, 2000.
K. Fukumizu, F. R. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8:361–383, 2007.
L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral algorithms for supervised learning. Neural Computation, 20(7):1873–1897, 2008.
F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural network architectures. Neural Computation, 7(2):219–269, 1993.
L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-free Theory of Nonparametric Regression. Springer, 2002.
P. Halmos and V. Sunder. Bounded Integral Operators on \(L^2\)-Spaces. Springer, 1978.
L. Hörmander. The Analysis of Linear Partial Differential Operators I. Springer, 1983.
S. Loustau. Inverse statistical learning. Electron. J. Statist., 7:2065–2097, 2013.
S. Loustau and C. Marteau. Minimax fast rates for discriminant analysis with errors in variables. Bernoulli, 21(1):176–208, 2015.
P. Mathé and S. Pereverzev. Geometry of linear ill-posed problems in variable Hilbert scales. Inverse Problems, 19(3):789, 2003.
S. Mendelson and J. Neeman. Regularization in kernel learning. The Annals of Statistics, 38(1):526–565, 2010.
F. O’Sullivan. Convergence characteristics of methods of regularization estimators for nonlinear operator equations. SIAM J. Numer. Anal., 27(6):1635–1649, 1990.
I. F. Pinelis and A. I. Sakhanenko. Remarks on inequalities for probabilities of large deviations. Theory Probab. Appl., 30(1):143–148, 1985.
S. Smale and D. Zhou. Shannon sampling II: Connections to learning theory. Appl. Comput. Harmon. Analysis, 19(3):285–302, 2005.
S. Smale and D. Zhou. Learning theory estimates via integral operators and their approximation. Constructive Approximation, 26(2):153–172, 2007.
I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. Proceedings of the 22nd Annual Conference on Learning Theory, pages 79–93, 2009.
V. Temlyakov. Approximation in learning theory. Constructive Approximation, 27(1):33–74, 2008.
A. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.
G. Wahba. Spline Models for Observational Data, volume 59. SIAM CBMS-NSF Series in Applied Mathematics, 1990.
C. Wang and D.-X. Zhou. Optimal learning rates for least squares regularized regression with unbounded sampling. Journal of Complexity, 27(1):55–67, 2011.
Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
Additional information
Communicated by Tomaso Poggio.
This research was supported by the DFG via Research Unit 1735 Structural Inference in Statistics.
Appendices
Appendix A: Proof of Concentration Inequalities
Proposition A.1
Let \((Z, {\mathcal B}, {\mathbb {P}})\) be a probability space and \(\xi \) a random variable on Z with values in a real separable Hilbert space \({\mathcal H}\). Assume that there are two positive constants L and \(\sigma \) such that for any \(m\ge 2\)
If the sample \(z_1,\ldots ,z_n\) is drawn i.i.d. from Z according to \({\mathbb {P}}\), then, for any \(0<\eta <1\), with probability greater than \(1-\eta \)
where
In particular, (A.1) holds if
Proof
See [9, 10]; the statement follows from the original result of [29] (Corollary 1). \(\square \)
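For orientation, Hilbert-space Bernstein-type bounds of this kind typically take the following form (this is our recollection of the standard statement from the cited sources; the exact constants in (A.1) may be stated slightly differently):

```latex
% Moment condition on the random variable \xi, for constants L, \sigma > 0:
\mathbb{E}\big[\, \|\xi\|_{\mathcal H}^{m} \,\big]
  \;\le\; \tfrac{1}{2}\, m!\, \sigma^{2} L^{m-2}
  \qquad (m \ge 2).
% Conclusion: with probability at least 1 - \eta,
\Big\| \frac{1}{n}\sum_{i=1}^{n} \xi(z_i) \;-\; \mathbb{E}[\xi] \Big\|_{\mathcal H}
  \;\le\; 2\Big( \frac{L}{n} + \frac{\sigma}{\sqrt{n}} \Big)\log\frac{2}{\eta}.
```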
Proof of Proposition 5.2
Define \(\xi _1: {\mathcal X}\times {\mathbb {R}}\longrightarrow {\mathcal H}_1\) by
Abusing notation we also denote \(\xi _1\) the random variable \(\xi _1(X,Y)\) where \((X,Y) \sim \rho \). The model assumption 2.10 implies
and therefore
Moreover, by Assumption 2.11, for \(m\ge 2\):
Setting \(A:=\bar{S}_{x}(\bar{B} + \lambda )^{-1/2}\), we have using \(\left\| \cdot \right\| \le \mathrm {tr}(\cdot )\) for positive operators:
Firstly, observe that
since our main assumption 2.1 implies \(\left\| \bar{S}_x\right\| \le 1\). Secondly, by linearity of \(\mathrm {tr}(\cdot )\),
Thus,
As a result, Proposition A.1 implies with probability at least \(1-\eta \)
where
For the second part of the proposition, we introduce similarly
which satisfies
and
Applying Proposition A.1 yields the result. \(\square \)
Proof of Proposition 5.3
We proceed as above by defining \(\xi _2: {\mathcal X}\longrightarrow {\mathrm {HS}}({\mathcal H}_1)\) (the latter denoting the space of Hilbert–Schmidt operators on \({\mathcal H}_1\)) by
where \(\bar{B}_x := \bar{F}_x \otimes \bar{F}_x^\star \). We also use the same notation \(\xi _2\) for the random variable \(\xi _2(X)\) with \(X\sim \nu \). Then,
and therefore
Furthermore, since \(\bar{B}_x\) is of trace class and \((\bar{B}+\lambda )^{-1}\) is bounded, we have using Assumption 2.1
uniformly for any \(x \in {\mathcal X}\). Moreover,
Thus, Proposition A.1 applies and gives with probability at least \(1-\eta \)
with
\(\square \)
Proof of Proposition 5.4
We write the Neumann series identity
with
It is well known that the series converges in norm provided that \(\left\| H_{\mathbf{x}}(\lambda ) \right\| < 1 \). In fact, applying Proposition 5.3 gives with probability at least \(1-\eta \):
Put \(C_\eta := 2\log (2\eta ^{-1}) >1\) for any \(\eta \in (0,1)\). Assumption (5.2) reads \(\sqrt{n\lambda } \ge 4 C_\eta \sqrt{\max ({\mathcal N}(\lambda ),1)}\), implying \(\sqrt{n\lambda } \ge 4 C_\eta \ge 4\) and therefore \(\frac{2}{n\lambda } \le \frac{1}{2\sqrt{n\lambda }} \le \frac{1}{8C_\eta }\), hence
Thus, with probability at least \(1-\eta \):
\(\square \)
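Schematically, the Neumann series mechanism used in the proof above is the following elementary bound, with \(H\) standing for \(H_{\mathbf{x}}(\lambda )\):

```latex
\|H\| \le \tfrac{1}{2}
\;\Longrightarrow\;
(I - H)^{-1} \;=\; \sum_{n \ge 0} H^{n},
\qquad
\big\| (I - H)^{-1} \big\|
  \;\le\; \sum_{n \ge 0} \|H\|^{n}
  \;=\; \frac{1}{1 - \|H\|}
  \;\le\; 2 .
```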
Proof of Proposition 5.5
Defining \(\xi _3: {\mathcal X}\longrightarrow {\mathrm {HS}}({\mathcal H}_1)\) by
and denoting also, as before, \(\xi _3\) for the random variable \(\xi _3(X)\) (with \(X\sim \nu \)), we have \({\mathbb {E}}[\xi _3] = \bar{B} \) and therefore
Furthermore, by Assumption 2.1
also leading to \({\mathbb {E}}[ \left\| \xi _3\right\| ^2_{{\mathrm {HS}}} ] \le 1 =: \sigma _3^2\). Thus, Proposition A.1 applies and gives with probability at least \(1-\eta \)
\(\square \)
Appendix B: Perturbation Result
The estimate of the following proposition is crucial for proving the upper bound in the case where the source condition is of Hölder type r with \(r>1\). We remark that for \(r>1\) the function \(t \mapsto t^r\) is not operator monotone. One might naively expect estimate (B.1) to hold with a constant C given by the Lipschitz constant of the scalar function \(t^r\); as shown in [3], this is false even for finite-dimensional positive matrices. The point of Proposition B.1 is that (B.1) still holds with some larger constant depending on r and on an upper bound for the spectrum. We do not expect this result to be particularly novel, but we were unable to track down a proof in the literature, where occasionally erroneous statements about related issues can be found. For this reason, we provide a self-contained proof for the sake of completeness.
Proposition B.1
Let \(B_1, B_2\) be two nonnegative self-adjoint operators on some Hilbert space with \(||B_j|| \le a\), \(j=1,2\), for some \(a>0\). Assume that the \(B_j\) belong to the Schatten class \({\mathcal S}^p\) for some \(1 \le p \le \infty \). If \(r > 1\), then
for some \(C_{r}<\infty \). This inequality also holds in operator norm for non-compact bounded (nonnegative and self-adjoint) \(B_j\).
Proof
We extend the proof of [18], given there in the case \(r=3/2\) in operator norm. We also restrict ourselves to the case \(a=1\). On \({\mathcal D}:=\{z: |z| \le 1 \}\), we consider the functions \(f(z)=(1-z)^{r}\) and \(g(z)=(1-z)^{r-1}\). The proof is based on the power series expansions
which converge absolutely on \({\mathcal D}\). To ensure absolute convergence on the boundary \(|z|=1\), notice that
so that all coefficients \(c_n\) for \(n\ge r\) have the same sign \(s := (-1)^{\lfloor r \rfloor }\) (if r is an integer these coefficients vanish without altering the argument below) implying for any \(N>r\):
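As a numerical illustration (not part of the proof), the constant-sign property of the coefficients can be checked directly; the function `series_coeff` below and the sample value \(r=2.5\) are our own choices for the sketch:

```python
import math

def series_coeff(r, n):
    # c_n: coefficient of z^n in the power series of f(z) = (1 - z)^r,
    # i.e. c_n = (-1)^n * C(r, n), with C(r, n) the generalized binomial
    # coefficient r (r - 1) ... (r - n + 1) / n!.
    binom = 1.0
    for k in range(n):
        binom *= (r - k) / (k + 1)
    return (-1) ** n * binom

# For a non-integer exponent, e.g. r = 2.5, all coefficients with index n > r
# share one sign -- this is what makes the boundary summation argument work.
r = 2.5
coeffs = [series_coeff(r, n) for n in range(3, 40)]
signs = {math.copysign(1.0, c) for c in coeffs}
assert len(signs) == 1  # constant sign for all computed n > r
```

The same check passes for other non-integer exponents, e.g. \(r=1.5\) (where the common sign is positive), consistent with the absolute-convergence argument on the boundary.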
A bound for \(\sum _n |b_n|\) can be derived analogously. Since \(f(1-B_j) = B_j^r\), we obtain
Using the algebraic identity \( T^{n+1} - S^{n+1} = T(T^n-S^n) + (T-S)S^n \), the triangle inequality, and the bound \(\Vert TS\Vert _p \le \Vert T\Vert \Vert S\Vert _p\) for \(S \in {\mathcal S}^p\) and T bounded, one easily verifies by induction that
-
for \(j=1,2\), \(B_j \in {\mathcal S}^p\) imply \((I-B_1)^n - (I-B_2)^n \in {\mathcal S}^p\) and
-
\(\Vert (I-B_1)^n - (I-B_2)^n \Vert _p \le n\Vert B_1 - B_2 \Vert _p\).
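For instance, the induction step for the second item follows from the algebraic identity above with \(T = I-B_1\), \(S = I-B_2\) (so that \(\Vert T\Vert , \Vert S\Vert \le 1\) and \(T - S = B_2 - B_1\)):

```latex
\big\| T^{n+1} - S^{n+1} \big\|_p
  \;\le\; \|T\|\, \big\| T^{n} - S^{n} \big\|_p
          \;+\; \big\| T - S \big\|_p\, \|S\|^{n}
  \;\le\; n\, \|B_1 - B_2\|_p + \|B_1 - B_2\|_p
  \;=\; (n+1)\, \|B_1 - B_2\|_p .
```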
From \(f'(z)=-r g(z)\) we have the relation \(|b_n|= \frac{r}{n}|c_{n-1}|\), \(n \ge 1\). Collecting all pieces leads to
\(\square \)
Appendix C: Auxiliary Technical Lemmata
Lemma C.1
Let X be a nonnegative real random variable such that the following holds:
where F is a monotone non-increasing function \((0,1] \rightarrow {\mathbb {R}}_+\). Then
Proof
An intuitive, non-rigorous proof is as follows. Let G be the tail distribution function of X, then it is well known that \({\mathbb {E}}[X] = \int _{{\mathbb {R}}_+} G\). Now it seems clear that \(\int _{{\mathbb {R}}_+} G = \int _0^1 G^{-1} \), where \(G^{-1}\) is the upper quantile function for X. Finally, F is an upper bound on \(G^{-1}\).
Now for a rigorous proof, we can assume without loss of generality that F is left continuous: Replacing F by its left limit in all points of (0, 1] can only make it larger since it is non-increasing, hence (C.1) is still satisfied; moreover, since a monotone function has an at most countable number of discontinuity points, this operation does not change the value of the integral \(\int _0^1 F\). Define the following pseudo-inverse for \(x\in {\mathbb {R}}_+\):
with the convention \(\inf \emptyset = 1\). Denote \(\widetilde{U} := F^{\dagger }(X)\). From the definition of \(F^\dagger \) and the monotonicity of F it holds that \(F^{\dagger }(x) < t \Rightarrow x>F(t)\) for all \((x,t) \in {\mathbb {R}}_+ \times (0,1]\). Hence, for any \(t\in (0,1]\)
implying that for all \(t \in [0,1]\), \({\mathbb {P}}[ \widetilde{U} \le t] \le t\), i.e., \(\widetilde{U}\) is stochastically larger than a uniform variable on [0, 1]. Furthermore, by left continuity of F, one can readily check that \(F(F^\dagger (x)) \ge x\) if \(x \le F(0)\). Since \({\mathbb {P}}[X > F(0)] = 0\), we can replace X by \(\widetilde{X} := \min (X,F(0))\) without changing its distribution (nor that of \(\widetilde{U}\)). With this modification, it then holds that \( F(\widetilde{U}) = F(F^\dagger (\widetilde{X})) \ge \widetilde{X}\). Hence,
where U is a uniform variable on [0, 1], and the second equality holds since F is non-increasing. \(\square \)
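The conclusion \({\mathbb {E}}[X] \le \int _0^1 F\) can be sanity-checked numerically with a concrete choice (ours, not the paper's): take \(X \sim \mathrm{Exp}(1)\) and \(F(t) = \log t^{-1}\), for which \({\mathbb {P}}[X > F(t)] = t\), so the tail condition (C.1) holds with equality, and both sides equal 1:

```python
import math
import random

random.seed(0)

def F(t):
    # Tail bound function for X ~ Exp(1): P[X > F(t)] = e^{-log(1/t)} = t.
    return math.log(1.0 / t)

# Monte Carlo estimate of E[X] for X ~ Exp(1) (true value: 1).
n = 200_000
mean_x = sum(random.expovariate(1.0) for _ in range(n)) / n

# Midpoint-rule approximation of the bound \int_0^1 F(t) dt (true value: 1).
m = 100_000
bound = sum(F((i + 0.5) / m) for i in range(m)) / m
```

Both quantities come out close to 1, with the Monte Carlo mean below the integral bound up to sampling error, as the lemma predicts.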
Corollary C.2
Let X be a nonnegative random variable and \(t_0 \in (0,1)\) such that the following holds:
where \(a,b,a',b'\) are nonnegative numbers. Then for any positive \(p\le \frac{1}{2} \log t_0^{-1}\):
with \(C_p:= \max (2^{p-1},1)\).
Proof
Let \(F(t) := {\mathbf {1}\{t \in (t_0,1]\}} (a + b \log t^{-1}) + {\mathbf {1}\{t \in (0,t_0]\}} (a' + b' \log t^{-1})\). Then F is nonnegative, non-increasing on (0, 1] and
for all \(t \in (0,1]\). Applying Lemma C.1, we find
Using \((x+y)^{p} \le C_p (x^{p}+y^{p})\) for \(x,y \ge 0\), where \(C_p = \max (2^{p-1},1)\), we upper bound the second integral in (C.4) via
Concerning the first integral in (C.4), we write similarly
by the change of variable \(u = \log t^{-1}\), where \(\Gamma \) denotes the upper incomplete gamma function. We use the following coarse bound: It can be checked that \(t\mapsto t^p\mathrm{e}^{-\frac{t}{2}}\) is decreasing for \(t \ge 2p\), hence putting \(x:= \log t_0^{-1}\),
provided \(x = \log t_0^{-1} \ge 2p\). Collecting all the above pieces we get the conclusion. \(\square \)
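The monotonicity claim invoked in this proof follows from a one-line derivative computation:

```latex
\frac{d}{dt}\, \big( t^{p} \mathrm{e}^{-t/2} \big)
  \;=\; t^{p-1}\Big( p - \frac{t}{2} \Big)\, \mathrm{e}^{-t/2}
  \;\le\; 0
  \qquad \text{for } t \ge 2p .
```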
Blanchard, G., Mücke, N. Optimal Rates for Regularization of Statistical Inverse Learning Problems. Found Comput Math 18, 971–1013 (2018). https://doi.org/10.1007/s10208-017-9359-7
Keywords
- Reproducing kernel Hilbert space
- Spectral regularization
- Inverse problem
- Statistical learning
- Minimax convergence rates