Abstract
We consider a statistical inverse learning (also called inverse regression) problem, where we observe the image of a function f through a linear operator A at i.i.d. random design points \(X_i\), corrupted by additive noise. The distribution of the design points is unknown and can be very general. We analyze simultaneously the direct (estimation of Af) and the inverse (estimation of f) learning problems. In this general framework, we obtain strong and weak minimax optimal rates of convergence (as the number of observations n grows large) for a large class of spectral regularization methods over regularity classes defined through appropriate source conditions. This improves on or completes previous results obtained in related settings. The optimality of the obtained rates is shown not only in the exponent in n but also in the explicit dependence of the constant factor on the noise variance and on the radius of the source condition set.
Notes
This can be extended to the case where g is only approximated in \(L^2(\nu )\) by a sequence of functions in \({\mathcal H}_K\). For the present discussion, only the case \(g \in {\mathcal H}_K\) is of interest.
References
F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. J. Complexity, 23(1):52–72, 2007.
R. Bhatia. Matrix Analysis. Springer, 1997.
R. Bhatia and J. Holbrook. Fréchet derivatives of the power function. Indiana University Mathematics Journal, 49 (3):1155–1173, 2000.
N. H. Bingham, C. M. Goldie, and J. L. Teugels. Regular Variation, volume 27 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1987.
N. Bissantz, T. Hohage, A. Munk, and F. Ruymgaart. Convergence rates of general regularization methods for statistical inverse problems and applications. SIAM J. Numer. Analysis, 45(6):2610–2636, 2007.
G. Blanchard and N. Krämer. Convergence rates of kernel conjugate gradient for random design regression. Analysis and Applications, 2016.
G. Blanchard and P. Massart. Discussion of “2004 IMS medallion lecture: Local Rademacher complexities and oracle inequalities in risk minimization”, by V. Koltchinskii. Annals of Statistics, 34(6):2664–2671, 2006.
P. Bühlmann and B. Yu. Boosting with the \(l_2\)-loss: Regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003.
A. Caponnetto. Optimal rates for regularization operators in learning theory. Technical report, MIT, 2006.
A. Caponnetto and Y. Yao. Cross-validation based adaptation for regularization operators in learning theory. Analysis and Applications, 8(2):161–183, 2010.
F. Cucker and S. Smale. Best choices for regularization parameters in learning theory: on the bias-variance problem. Foundations of Computational Mathematics, 2(4):413–428, 2002.
E. De Vito and A. Caponnetto. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2006.
E. De Vito, L. Rosasco, and A. Caponnetto. Discretization error analysis for Tikhonov regularization. Analysis and Applications, 4(1):81–99, 2006.
E. De Vito, L. Rosasco, A. Caponnetto, and U. De Giovannini. Learning from examples as an inverse problem. J. of Machine Learning Research, 6:883–904, 2005.
R. DeVore, G. Kerkyacharian, D. Picard, and V. Temlyakov. Mathematical methods for supervised learning. Foundations of Computational Mathematics, 6(1):3–58, 2006.
L. Dicker, D. Foster, and D. Hsu. Kernel methods and regularization techniques for nonparametric regression: Minimax optimality and adaptation. Technical report, Rutgers University, 2015.
H. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer Academic Publishers, 2000.
K. Fukumizu, F. R. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8:361–383, 2007.
L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral algorithms for supervised learning. Neural Computation, 20(7):1873–1897, 2008.
F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural network architectures. Neural Computation, 7(2):219–269, 1993.
L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-free Theory of Nonparametric Regression. Springer, 2002.
P. Halmos and V. Sunder. Bounded Integral Operators on \(L^2\)-Spaces. Springer, 1978.
L. Hörmander. The Analysis of Linear Partial Differential Operators I. Springer, 1983.
S. Loustau. Inverse statistical learning. Electron. J. Statist., 7:2065–2097, 2013.
S. Loustau and C. Marteau. Minimax fast rates for discriminant analysis with errors in variables. Bernoulli, 21(1):176–208, 2015.
P. Mathé and S. Pereverzev. Geometry of linear ill-posed problems in variable Hilbert scales. Inverse Problems, 19(3):789, 2003.
S. Mendelson and J. Neeman. Regularization in kernel learning. The Annals of Statistics, 38(1):526–565, 2010.
F. O’Sullivan. Convergence characteristics of methods of regularization estimators for nonlinear operator equations. SIAM J. Numer. Anal., 27(6):1635–1649, 1990.
I. F. Pinelis and A. I. Sakhanenko. Remarks on inequalities for probabilities of large deviations. Theory Probab. Appl., 30(1):143–148, 1985.
S. Smale and D. Zhou. Shannon sampling II: Connections to learning theory. Appl. Comput. Harmon. Analysis, 19(3):285–302, 2005.
S. Smale and D. Zhou. Learning theory estimates via integral operators and their approximation. Constructive Approximation, 26(2):153–172, 2007.
I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. Proceedings of the 22nd Annual Conference on Learning Theory, pages 79–93, 2009.
V. Temlyakov. Approximation in learning theory. Constructive Approximation, 27(1):33–74, 2008.
A. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.
G. Wahba. Spline Models for Observational Data, volume 59. SIAM CBMS-NSF Series in Applied Mathematics, 1990.
C. Wang and D.-X. Zhou. Optimal learning rates for least squares regularized regression with unbounded sampling. Journal of Complexity, 27(1):55–67, 2011.
Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
Additional information
Communicated by Tomaso Poggio.
This research was supported by the DFG via Research Unit 1735 Structural Inference in Statistics.
Appendices
Appendix A: Proof of Concentration Inequalities
Proposition A.1
Let \((Z, {\mathcal B}, {\mathbb {P}})\) be a probability space and \(\xi \) a random variable on Z with values in a real separable Hilbert space \({\mathcal H}\). Assume that there are two positive constants L and \(\sigma \) such that for any \(m\ge 2\)
If the sample \(z_1,\ldots ,z_n\) is drawn i.i.d. from Z according to \({\mathbb {P}}\), then, for any \(0<\eta <1\), with probability greater than \(1-\eta \)
where
In particular, (A.1) holds if
Proof
See [9, 10]; the statement follows from the original result of [29] (Corollary 1). \(\square \)
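For orientation, Hilbert-space Bernstein-type bounds of this kind typically take the following form (this is our recollection of the standard statement from the cited sources; the exact constants in (A.1) may be stated slightly differently):

```latex
% Moment condition on the random variable \xi, for constants L, \sigma > 0:
\mathbb{E}\big[\, \|\xi\|_{\mathcal H}^{m} \,\big]
  \;\le\; \tfrac{1}{2}\, m!\, \sigma^{2} L^{m-2}
  \qquad (m \ge 2).
% Conclusion: with probability at least 1 - \eta,
\Big\| \frac{1}{n}\sum_{i=1}^{n} \xi(z_i) \;-\; \mathbb{E}[\xi] \Big\|_{\mathcal H}
  \;\le\; 2\Big( \frac{L}{n} + \frac{\sigma}{\sqrt{n}} \Big)\log\frac{2}{\eta}.
```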
Proof of Proposition 5.2
Define \(\xi _1: {\mathcal X}\times {\mathbb {R}}\longrightarrow {\mathcal H}_1\) by
Abusing notation we also denote \(\xi _1\) the random variable \(\xi _1(X,Y)\) where \((X,Y) \sim \rho \). The model assumption 2.10 implies
and therefore
Moreover, by Assumption 2.11, for \(m\ge 2\):
Setting \(A:=\bar{S}_{x}(\bar{B} + \lambda )^{-1/2}\), we have using \(\left\| \cdot \right\| \le \mathrm {tr}(\cdot )\) for positive operators:
Firstly, observe that
since our main assumption 2.1 implies \(\left\| \bar{S}_x\right\| \le 1\). Secondly, by linearity of \(\mathrm {tr}(\cdot )\),
Thus,
As a result, Proposition A.1 implies with probability at least \(1-\eta \)
where
For the second part of the proposition, we introduce similarly
which satisfies
and
Applying Proposition A.1 yields the result. \(\square \)
Proof of Proposition 5.3
We proceed as above by defining \(\xi _2: {\mathcal X}\longrightarrow {\mathrm {HS}}({\mathcal H}_1)\) (the latter denoting the space of Hilbert–Schmidt operators on \({\mathcal H}_1\)) by
where \(\bar{B}_x := \bar{F}_x \otimes \bar{F}_x^\star \). We also use the same notation \(\xi _2\) for the random variable \(\xi _2(X)\) with \(X\sim \nu \). Then,
and therefore
Furthermore, since \(\bar{B}_x\) is of trace class and \((\bar{B}+\lambda )^{-1}\) is bounded, we have using Assumption 2.1
uniformly for any \(x \in {\mathcal X}\). Moreover,
Thus, Proposition A.1 applies and gives with probability at least \(1-\eta \)
with
\(\square \)
Proof of Proposition 5.4
We write the Neumann series identity
with
It is well known that the series converges in norm provided that \(\left\| H_{\mathbf{x}}(\lambda ) \right\| < 1 \). In fact, applying Proposition 5.3 gives with probability at least \(1-\eta \):
Put \(C_\eta := 2\log (2\eta ^{-1}) >1\) for any \(\eta \in (0,1)\). Assumption (5.2) reads \(\sqrt{n\lambda } \ge 4 C_\eta \sqrt{\max ({\mathcal N}(\lambda ),1)}\), implying \(\sqrt{n\lambda } \ge 4 C_\eta \ge 4\) and therefore \(\frac{2}{n\lambda } \le \frac{1}{2\sqrt{n\lambda }} \le \frac{1}{8C_\eta }\), hence
Thus, with probability at least \(1-\eta \):
\(\square \)
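Schematically, the Neumann series mechanism used in the proof above is the following elementary bound, with \(H\) standing for \(H_{\mathbf{x}}(\lambda )\):

```latex
\|H\| \le \tfrac{1}{2}
\;\Longrightarrow\;
(I - H)^{-1} \;=\; \sum_{n \ge 0} H^{n},
\qquad
\big\| (I - H)^{-1} \big\|
  \;\le\; \sum_{n \ge 0} \|H\|^{n}
  \;=\; \frac{1}{1 - \|H\|}
  \;\le\; 2 .
```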
Proof of Proposition 5.5
Defining \(\xi _3: {\mathcal X}\longrightarrow {\mathrm {HS}}({\mathcal H}_1)\) by
and denoting also, as before, \(\xi _3\) for the random variable \(\xi _3(X)\) (with \(X\sim \nu \)), we have \({\mathbb {E}}[\xi _3] = \bar{B} \) and therefore
Furthermore, by Assumption 2.1
also leading to \({\mathbb {E}}[ \left\| \xi _3\right\| ^2_{{\mathrm {HS}}} ] \le 1 =: \sigma _3^2\). Thus, Proposition A.1 applies and gives with probability at least \(1-\eta \)
\(\square \)
Appendix B: Perturbation Result
The estimate of the following proposition is crucial for proving the upper bound in the case where the source condition is of Hölder type r with \(r>1\). We remark that for \(r>1\) the function \(t \mapsto t^r\) is not operator monotone. One might naively expect estimate (B.1) to hold with a constant C given by the Lipschitz constant of the scalar function \(t^r\); as shown in [3], this is false even for finite-dimensional positive matrices. The point of Proposition B.1 is that (B.1) still holds with some larger constant depending on r and on an upper bound for the spectrum. We do not expect this result to be particularly novel, but we were unable to track down a proof in the literature, where occasionally erroneous statements about related issues can be found. For this reason, we provide a self-contained proof for the sake of completeness.
Proposition B.1
Let \(B_1, B_2\) be two nonnegative self-adjoint operators on some Hilbert space with \(||B_j|| \le a\), \(j=1,2\), for some \(a>0\). Assume that the \(B_j\) belong to the Schatten class \({\mathcal S}^p\) for some \(1 \le p \le \infty \). If \(r > 1\), then
for some \(C_{r}<\infty \). This inequality also holds in operator norm for non-compact bounded (nonnegative and self-adjoint) \(B_j\).
Proof
We extend the proof of [18], given there in the case \(r=3/2\) in operator norm. We also restrict ourselves to the case \(a=1\). On \({\mathcal D}:=\{z: |z| \le 1 \}\), we consider the functions \(f(z)=(1-z)^{r}\) and \(g(z)=(1-z)^{r-1}\). The proof is based on the power series expansions
which converge absolutely on \({\mathcal D}\). To ensure absolute convergence on the boundary \(|z|=1\), notice that
so that all coefficients \(c_n\) for \(n\ge r\) have the same sign \(s := (-1)^{\lfloor r \rfloor }\) (if r is an integer these coefficients vanish without altering the argument below) implying for any \(N>r\):
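As a numerical illustration (not part of the proof), the constant-sign property of the coefficients can be checked directly; the function `series_coeff` below and the sample value \(r=2.5\) are our own choices for the sketch:

```python
import math

def series_coeff(r, n):
    # c_n: coefficient of z^n in the power series of f(z) = (1 - z)^r,
    # i.e. c_n = (-1)^n * C(r, n), with C(r, n) the generalized binomial
    # coefficient r (r - 1) ... (r - n + 1) / n!.
    binom = 1.0
    for k in range(n):
        binom *= (r - k) / (k + 1)
    return (-1) ** n * binom

# For a non-integer exponent, e.g. r = 2.5, all coefficients with index n > r
# share one sign -- this is what makes the boundary summation argument work.
r = 2.5
coeffs = [series_coeff(r, n) for n in range(3, 40)]
signs = {math.copysign(1.0, c) for c in coeffs}
assert len(signs) == 1  # constant sign for all computed n > r
```

The same check passes for other non-integer exponents, e.g. \(r=1.5\) (where the common sign is positive), consistent with the absolute-convergence argument on the boundary.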
A bound for \(\sum _n |b_n|\) can be derived analogously. Since \(f(1-B_j) = B_j^r\), we obtain
Using the algebraic identity \( T^{n+1} - S^{n+1} = T(T^n-S^n) + (T-S)S^n \), the triangle inequality, and the bound \(\Vert TS\Vert _p \le \Vert T\Vert \Vert S\Vert _p\) for \(S \in {\mathcal S}^p\) and T bounded, one easily verifies by induction that
-
for \(j=1,2\), \(B_j \in {\mathcal S}^p\) imply \((I-B_1)^n - (I-B_2)^n \in {\mathcal S}^p\) and
-
\(\Vert (I-B_1)^n - (I-B_2)^n \Vert _p \le n\Vert B_1 - B_2 \Vert _p\).
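For instance, the induction step for the second item follows from the algebraic identity above with \(T = I-B_1\), \(S = I-B_2\) (so that \(\Vert T\Vert , \Vert S\Vert \le 1\) and \(T - S = B_2 - B_1\)):

```latex
\big\| T^{n+1} - S^{n+1} \big\|_p
  \;\le\; \|T\|\, \big\| T^{n} - S^{n} \big\|_p
          \;+\; \big\| T - S \big\|_p\, \|S\|^{n}
  \;\le\; n\, \|B_1 - B_2\|_p + \|B_1 - B_2\|_p
  \;=\; (n+1)\, \|B_1 - B_2\|_p .
```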
From \(f'(z)=-r g(z)\) we have the relation \(|b_n|= \frac{r}{n}|c_{n-1}|\), \(n \ge 1\). Collecting all pieces leads to
\(\square \)
Appendix C: Auxiliary Technical Lemmata
Lemma C.1
Let X be a nonnegative real random variable such that the following holds:
where F is a monotone non-increasing function \((0,1] \rightarrow {\mathbb {R}}_+\). Then
Proof
An intuitive, non-rigorous proof is as follows. Let G be the tail distribution function of X, then it is well known that \({\mathbb {E}}[X] = \int _{{\mathbb {R}}_+} G\). Now it seems clear that \(\int _{{\mathbb {R}}_+} G = \int _0^1 G^{-1} \), where \(G^{-1}\) is the upper quantile function for X. Finally, F is an upper bound on \(G^{-1}\).
Now for a rigorous proof, we can assume without loss of generality that F is left continuous: Replacing F by its left limit in all points of (0, 1] can only make it larger since it is non-increasing, hence (C.1) is still satisfied; moreover, since a monotone function has an at most countable number of discontinuity points, this operation does not change the value of the integral \(\int _0^1 F\). Define the following pseudo-inverse for \(x\in {\mathbb {R}}_+\):
with the convention \(\inf \emptyset = 1\). Denote \(\widetilde{U} := F^{\dagger }(X)\). From the definition of \(F^\dagger \) and the monotonicity of F it holds that \(F^{\dagger }(x) < t \Rightarrow x>F(t)\) for all \((x,t) \in {\mathbb {R}}_+ \times (0,1]\). Hence, for any \(t\in (0,1]\)
implying that for all \(t \in [0,1]\), \({\mathbb {P}}[ \widetilde{U} \le t] \le t\), i.e., \(\widetilde{U}\) is stochastically larger than a uniform variable on [0, 1]. Furthermore, by left continuity of F, one can readily check that \(F(F^\dagger (x)) \ge x\) if \(x \le F(0)\). Since \({\mathbb {P}}[X > F(0)] = 0\), we can replace X by \(\widetilde{X} := \min (X,F(0))\) without changing its distribution (nor that of \(\widetilde{U}\)). With this modification, it then holds that \( F(\widetilde{U}) = F(F^\dagger (\widetilde{X})) \ge \widetilde{X}\). Hence,
where U is a uniform variable on [0, 1], and the second equality holds since F is non-increasing. \(\square \)
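The conclusion \({\mathbb {E}}[X] \le \int _0^1 F\) can be sanity-checked numerically with a concrete choice (ours, not the paper's): take \(X \sim \mathrm{Exp}(1)\) and \(F(t) = \log t^{-1}\), for which \({\mathbb {P}}[X > F(t)] = t\), so the tail condition (C.1) holds with equality, and both sides equal 1:

```python
import math
import random

random.seed(0)

def F(t):
    # Tail bound function for X ~ Exp(1): P[X > F(t)] = e^{-log(1/t)} = t.
    return math.log(1.0 / t)

# Monte Carlo estimate of E[X] for X ~ Exp(1) (true value: 1).
n = 200_000
mean_x = sum(random.expovariate(1.0) for _ in range(n)) / n

# Midpoint-rule approximation of the bound \int_0^1 F(t) dt (true value: 1).
m = 100_000
bound = sum(F((i + 0.5) / m) for i in range(m)) / m
```

Both quantities come out close to 1, with the Monte Carlo mean below the integral bound up to sampling error, as the lemma predicts.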
Corollary C.2
Let X be a nonnegative random variable and \(t_0 \in (0,1)\) such that the following holds:
where \(a,b,a',b'\) are nonnegative numbers. Then for any positive \(p\le \frac{1}{2} \log t_0^{-1}\):
with \(C_p:= \max (2^{p-1},1)\).
Proof
Let \(F(t) := {\mathbf {1}\{t \in (t_0,1]\}} (a + b \log t^{-1}) + {\mathbf {1}\{t \in (0,t_0]\}} (a' + b' \log t^{-1})\). Then F is nonnegative, non-increasing on (0, 1] and
for all \(t \in (0,1]\). Applying Lemma C.1, we find
Using \((x+y)^{p} \le C_p (x^{p}+y^{p})\) for \(x,y \ge 0\), where \(C_p = \max (2^{p-1},1)\), we upper bound the second integral in (C.4) via
Concerning the first integral in (C.4), we write similarly
by the change of variable \(u = \log t^{-1}\), where \(\Gamma \) denotes the upper incomplete gamma function. We use the following coarse bound: It can be checked that \(t\mapsto t^p\mathrm{e}^{-\frac{t}{2}}\) is decreasing for \(t \ge 2p\), hence putting \(x:= \log t_0^{-1}\),
provided \(x = \log t_0^{-1} \ge 2p\). Collecting all the above pieces we get the conclusion. \(\square \)
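The monotonicity claim invoked in this proof follows from a one-line derivative computation:

```latex
\frac{d}{dt}\, \big( t^{p} \mathrm{e}^{-t/2} \big)
  \;=\; t^{p-1}\Big( p - \frac{t}{2} \Big)\, \mathrm{e}^{-t/2}
  \;\le\; 0
  \qquad \text{for } t \ge 2p .
```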
Blanchard, G., Mücke, N. Optimal Rates for Regularization of Statistical Inverse Learning Problems. Found Comput Math 18, 971–1013 (2018). https://doi.org/10.1007/s10208-017-9359-7
Keywords
- Reproducing kernel Hilbert space
- Spectral regularization
- Inverse problem
- Statistical learning
- Minimax convergence rates