Abstract
We study linear ill-posed inverse problems with noisy data in the framework of statistical learning. The corresponding linear operator equation is assumed to fit a given Hilbert scale, generated by some unbounded self-adjoint operator. Approximate reconstructions from random noisy data are obtained with general regularization schemes in such a way that these belong to the domain of the generator. The analysis thus has to distinguish two cases, the regular one, when the true solution also belongs to the domain of the generator, and the ‘oversmoothing’ one, when this is not the case. Rates of convergence for the regularized solutions will be expressed in terms of certain distance functions. For solutions with smoothness given in terms of source conditions with respect to the scale-generating operator, the error bounds can then be made explicit in terms of the sample size.
1 Introduction
We consider learning in linear inverse problems in Hilbert space. Within the classical framework of supervised learning, we are given data \(\left\{ (x_i,y_i)\right\} _{i=1}^m\) which follow the model
$$\begin{aligned} y_i = g(x_i)+\varepsilon _i,\qquad i=1,\ldots ,m, \end{aligned}$$(1)
where \(\varepsilon _i\) is the observational noise, and m denotes the sample size. The function g is unknown, belonging to some reproducing kernel Hilbert space, say \(\mathcal {H}^{\prime }\). The goal is to learn it from the given data. To be more precise, we assume that the random observations \(\left\{ (x_i,y_i)\right\} _{i=1}^m\) are independent and follow some unknown probability distribution \(\rho\), defined on the sample space \(Z=X\times Y\). Further, we assume that the input space X is a Polish space, and that the output space Y is a real separable Hilbert space.
In inverse learning, the function g from (1) is driven by some element f in a Hilbert space \(\mathcal {H}\) via a mapping \(A:\mathcal {H}\rightarrow \mathcal {H}^{\prime }\) as
$$\begin{aligned} g = A f. \end{aligned}$$(2)
In the present study, this mapping is assumed to be a bounded linear (smoothing) mapping, and it is also assumed to be injective, so that the correspondence between g and f is one-to-one. The unique solution of (2) is denoted by \(f_\rho\). The literature for this setup is scarce; we mention (Blanchard & Mücke, 2018) and the related study Rastogi et al. (2020), in which the underlying mapping A is assumed to be non-linear.
Often the sought-after element \(f_\rho\) is known to have additional features, e.g. smoothness, which the standard approaches for reconstructing an approximation of \(f_\rho\) do not take into account. Therefore, we shall analyze such inverse learning problems in scales of Hilbert spaces. This topic has a long history within the classical setup of regularization theory, starting from (Natterer, 1984), see also the monograph (Engl et al., 1996, Chapt. 8). In most cases, the scale of Hilbert spaces is assumed to be a scale of Sobolev spaces, and the smoothing properties of the underlying operator A are measured with respect to this scale. This allows for a mathematical analysis even if the singular value decomposition of A cannot be used to design a regularization scheme. Also, solution smoothness, i.e., the smoothness of \(f_\rho\), is described by assuming that it belongs to some space within this scale. Recently, regularization in Hilbert scales has gained interest in statistical inverse problems, especially for the Bayesian approach to such problems, where we mention the studies (Gugushvili et al., 2020), and more recently (Agapiou & Mathé, 2022). To the best of our knowledge, inverse learning problems in scales have not been studied yet.
Here we highlight the following prototypical example.
Example
Let \(A:\mathscr {L}^2_0(0,1) \rightarrow \mathcal {H}^1_0(0,1)\) be the integration operator
$$\begin{aligned} (Af)(x) = \int _0^x f(t)\, dt,\qquad x\in (0,1), \end{aligned}$$
where \(\mathcal {H}^1_0(0,1)\) denotes the Sobolev space of absolutely continuous functions g with \(g(0) = g(1)=0\), and \(\mathscr {L}^2_0(0,1)\) consists of functions which integrate to zero, so that \((Af)(1)=0\). Thus, we are looking for the derivative of a given function, one of the most classical inverse problems. In the above formulation, the operator A is injective. Moreover, it is known that the Sobolev space \(\mathcal {H}':= \mathcal {H}^1_0(0,1)\) is a reproducing kernel Hilbert space. Details are given in Blanchard and Mücke (2018). Therefore, a suitable scale of Hilbert spaces is the class of Sobolev spaces \(\mathcal {H}^s_0(0,1),\ s \in [0,p]\) for some \(p\ge 1\). For such an analysis to work we assume that the given operator A ‘fits the scale’, which will be expressed in terms of a link condition. For the above example, the operator A has step one, meaning that elements from \(\mathscr {L}^2_0(0,1)\) are mapped to \(\mathcal {H}^1_0(0,1)\). Moreover, in this context, smoothness is also given relative to this scale, e.g., \(f_\rho \in \mathcal {H}^s_0(0,1)\) for some \(0 < s \le p\). This is significantly different from other works, where smoothness is relative to the underlying covariance operator and hence cannot be verified.
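To make the example concrete, the following small numerical sketch (added here for illustration; it is not part of the original exposition, and the grid size, sample size and noise level are arbitrary choices) recovers the derivative f from noisy point values of \(g=Af\) by penalized least squares with a first-difference penalty, a crude finite-dimensional stand-in for the Hilbert-scale penalty \(\left\| {Lf}\right\| _{\mathcal {H}}^2\) discussed below.

```python
# Illustrative sketch only: numerical differentiation as an inverse learning problem.
import numpy as np

rng = np.random.default_rng(0)

n = 200                                   # discretization level of (0, 1)
t = (np.arange(n) + 0.5) / n              # grid for f
h = 1.0 / n

f_true = np.sin(2 * np.pi * t)            # integrates to zero over (0, 1)
A = np.tril(np.ones((n, n))) * h          # (A f)(t_k) ~ integral of f up to t_k
g_true = A @ f_true                       # g(0) = g(1) = 0 up to discretization error

m, sigma = 50, 0.01                       # sample size and noise level (arbitrary)
idx = rng.integers(0, n, size=m)          # random design points x_i
y = g_true[idx] + sigma * rng.standard_normal(m)

S = np.zeros((m, n)); S[np.arange(m), idx] = 1.0   # sampling operator S_x
D = (np.eye(n) - np.eye(n, k=-1)) / h              # first differences, surrogate for L

lam = 1e-3
# normal equations of (1/m) * ||S A f - y||^2 + lam * ||D f||^2
lhs = (S @ A).T @ (S @ A) / m + lam * D.T @ D
rhs = (S @ A).T @ y / m
f_hat = np.linalg.solve(lhs, rhs)

print("relative L2 error:", np.linalg.norm(f_hat - f_true) / np.linalg.norm(f_true))
```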
Further examples of Hilbert scales relevant for learning in reproducing kernel Hilbert spaces can be found in Mücke and Reiss (2020).
More generally, in the present study we shall assume that there is an unbounded self-adjoint operator \(L:\mathcal {D}(L) \subset \mathcal {H}\rightarrow \mathcal {H}\), which generates a scale of Hilbert spaces \(\mathcal {H}_{s}:= \mathcal {D}(L^{s}),\ s\ge 0\). Both the operator equation (2) and the solution smoothness are assumed to fit this scale, by assumptions made below.
We highlight one specific means of reconstruction, often called penalized least squares. In this standard approach the estimator \(f_{\textbf{z},\lambda }\) is the minimizer of
where \(\lambda\) is a positive regularization parameter which balances the error term and the penalty \(\left\| {f}\right\| _{\mathcal {H}}^2\). This penalty controls the norm (in \(\mathcal {H}\)) of the minimizer, but it cannot enforce additional properties. Here, we implement such additional properties by requiring that all considered minimizers \(f_{\textbf{z},\lambda }\) belong to \(\mathcal {D}(L)\). In the analysis of inverse problems this setup has a long history, starting from the above-mentioned study (Natterer, 1984), and it has since been frequently considered both for linear (Böttcher et al., 2006; Mair, 1994; Mathé & Tautenhahn, 2006, 2007; Nair, 1999, 2002; Nair et al., 2005; Neubauer, 1988; Tautenhahn, 1996) and for non-linear mappings A (Hofmann & Mathé, 2018, 2020). The additional information \(f_{\textbf{z},\lambda }\in \mathcal D(L)\) is taken into account by replacing the above minimization problem (4) by
with minimizer \(f_{\textbf{z},\lambda }\in \mathcal {D}(L)\), and we may formally introduce \(u_{\textbf{z},\lambda }:= L f_{\textbf{z},\lambda }\in \mathcal {H}\).
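For the reader's orientation we restate the two minimization problems in what we take to be their standard explicit form (added here since the displays (4) and (5) are referred to repeatedly below):
$$\begin{aligned} f_{\textbf{z},\lambda }&= \mathop {\mathrm{arg\,min}}\limits _{f\in \mathcal {H}}\left\{ \frac{1}{m}\sum _{i=1}^m \left\| {(Af)(x_i)-y_i}\right\| _{Y}^2 + \lambda \left\| {f}\right\| _{\mathcal {H}}^2\right\} \qquad \text {(cf. (4))},\\ f_{\textbf{z},\lambda }&= \mathop {\mathrm{arg\,min}}\limits _{f\in \mathcal {D}(L)}\left\{ \frac{1}{m}\sum _{i=1}^m \left\| {(Af)(x_i)-y_i}\right\| _{Y}^2 + \lambda \left\| {Lf}\right\| _{\mathcal {H}}^2\right\} \qquad \text {(cf. (5))}. \end{aligned}$$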
In the regular case, when \(f_\rho \in \mathcal {D}(L)\), we let \(u_\rho := Lf_\rho \in \mathcal {H}\). With this notation we can rewrite (2) as
Then the Tikhonov minimization problem (5) would reduce to the standard one
albeit for a different operator \(A L^{-1}\). Accordingly, the error bounds relate as
Therefore, error bounds for \(u_\rho - u_{\textbf{z},\lambda }\) in the weak norm (in \(\mathcal {H}_{-1}\)) yield bounds for \(f_\rho - f_{\textbf{z},\lambda }\). The latter bounds (in the weak norm) are not known from previous studies. In the oversmoothing case, i.e., when \(f_\rho \not \in \mathcal {D}(L)\), such a one-to-one correspondence cannot be established, and additional efforts are required.
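The reduction just described can be spelled out as follows (an expanded restatement using only the definitions above):
$$\begin{aligned} g = A f_\rho = (AL^{-1})\,u_\rho ,\qquad u_\rho = Lf_\rho , \end{aligned}$$
and, since \(\left\| {v}\right\| _{-1}=\left\| {L^{-1}v}\right\| _{\mathcal {H}}\) by the definition of the scale,
$$\begin{aligned} \left\| {f_\rho - f_{\textbf{z},\lambda }}\right\| _{\mathcal {H}} = \left\| {L^{-1}\left( u_\rho - u_{\textbf{z},\lambda }\right) }\right\| _{\mathcal {H}} = \left\| {u_\rho - u_{\textbf{z},\lambda }}\right\| _{-1}. \end{aligned}$$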
For ‘classical’ inverse problems the fundamental features of regularization in Hilbert scales are known. The questions that we address here ask whether these features carry over to inverse learning.
- Do regularization schemes which are known to provide optimal rates of reconstruction (as the noise level tends to zero) have analogs, with similar results, in inverse learning (as the sample size tends to infinity)?
- Are optimal rates of reconstruction obtained when the true solution does not belong to \(\mathcal {D}(L)\) (oversmoothing case)?
- Will the use of a smoothness-promoting operator \(L^{-1}\) delay saturation?
In order to answer these questions we shall discuss rates of convergence for general (spectral) regularization schemes in Hilbert scales, under a quite general noise condition, see Assumption 2. As mentioned before, in order to treat regularization in Hilbert scales we shall link the given operator A to the scale, which is done in Assumption 4. Then we pursue a novel approach. Instead of assuming smoothness of the sought-after \(f_\rho\), we shall measure the violation of smoothness relative to a fixed benchmark smoothness, expressed in terms of a distance function, introduced in Definitions 6 and 7, respectively. Later, in Sect. 4 we shall see how smoothness relative to the given Hilbert scale translates to the behavior of the distance function, and hence to the resulting convergence rates.
The paper is organized as follows. The basic definitions, assumptions, and notation required in our analysis are presented in Sect. 2. In Sect. 3, we discuss bounds of the reconstruction error in the direct learning setting and in the inverse problem setting by means of distance functions. This section comprises two main results: the first is devoted to convergence rates in the oversmoothing case, while the second focuses on the regular case. When smoothness is specified in terms of source conditions, a program carried out in Sect. 4, we can bound the distance functions, and this in turn yields convergence rates in terms of the sample size m. In case both the smoothness and the link condition are of power type, we establish the optimality of the obtained error bounds in the regular case in Sect. 5. Proofs are given in the appendices, where we also recall and prove the probabilistic estimates which provide the tools for obtaining the error bounds.
2 Notation and assumptions
In this section, we introduce some basic concepts, definitions, notation, and assumptions required in our analysis.
We assume that X is a Polish space, so that the probability distribution \(\rho\) admits the disintegration
$$\begin{aligned} \rho (x,y) = \rho (y|x)\,\nu (x), \end{aligned}$$
where \(\rho (y|x)\) is the conditional probability distribution of y given x, and \(\nu (x)\) is the marginal probability distribution. We consider random observations \(\left\{ (x_i,y_i)\right\} _{i=1}^m\) which follow the model \(y= A(f)(x)+\varepsilon\) with centered noise \(\varepsilon\). We assume throughout the paper that the operator A is injective.
Assumption 1
(The true solution) The conditional expectation w.r.t. \(\rho\) of y given x exists (a.s.), and there exists \(f_\rho \in \mathcal {H}~\) such that
The element \(f_\rho\) is the true solution which we aim at estimating.
Assumption 2
(Noise condition) There exist some constants \(M,\varSigma\) such that for almost all \(x\in X\),
This is usually referred to as a Bernstein-type assumption.
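Since the display is not reproduced above, we note the form such Bernstein-type moment conditions typically take (this specific formulation is our assumption, not a quotation from the text): for all integers \(l\ge 2\) and almost all \(x\in X\),
$$\begin{aligned} \int _Y \left\| {y - (Af_\rho )(x)}\right\| _{Y}^{l}\, d\rho (y|x) \;\le \; \frac{1}{2}\, l!\, \varSigma ^2 M^{l-2}. \end{aligned}$$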
We recall the operator \(L:\mathcal {D}(L) \subset \mathcal {H}\rightarrow \mathcal {H}\), which is assumed to be unbounded and self-adjoint. By spectral theory, the operator \(L^s : \mathcal {D}(L^s) \rightarrow \mathcal {H}\) is well-defined for \(s \in \mathbb {R}\), and the spaces \(\mathcal {H}_s := \mathcal {D}(L^s), s \ge 0\), equipped with the inner product \(\langle { f},{g} \rangle _{\mathcal {H}_s}=\langle { L^s f},{L^s g} \rangle _{\mathcal {H}},\quad f, g \in \mathcal {H}_s\), are Hilbert spaces. For \(s < 0\), the space \(\mathcal {H}_s\) is defined as the completion of \(\mathcal {H}\) under the norm \(\left\| {f}\right\| _{s} := \langle { f},{f} \rangle _{s}^{1/2}\). The collection \(\left\{ \mathcal {H}_s,\ s\in \mathbb {R}\right\}\) of Hilbert spaces is called the Hilbert scale induced by L. The following interpolation inequality is an important tool for the analysis:
$$\begin{aligned} \left\| {f}\right\| _{r} \le \left\| {f}\right\| _{t}^{\frac{s-r}{s-t}}\,\left\| {f}\right\| _{s}^{\frac{r-t}{s-t}}, \end{aligned}$$
which holds for any \(t< r < s\), see e.g. (Engl et al., 1996, Chapt. 8).
2.1 Reproducing Kernel Hilbert spaces and related operators
We start with the concept of reproducing kernel Hilbert spaces, see the seminal study (Aronszajn, 1950); such a space is characterized by a symmetric, positive semi-definite kernel, and each of its functions satisfies the reproducing property. We consider vector-valued reproducing kernel Hilbert spaces, following (Micchelli & Pontil, 2005), which generalize real-valued reproducing kernel Hilbert spaces.
Definition 1
(Vector-valued reproducing kernel Hilbert space) For a non-empty set X and a real separable Hilbert space \((Y,\langle { \cdot },{\cdot } \rangle _{Y})\), a Hilbert space \(\mathcal {H}\) of functions from X to Y is said to be a vector-valued reproducing kernel Hilbert space if the linear functional \(F_{x,y}:\mathcal {H}\rightarrow \mathbb {R}\), defined by
$$\begin{aligned} F_{x,y}(f) = \langle { f(x)},{y} \rangle _{Y},\qquad f\in \mathcal {H}, \end{aligned}$$
is continuous for every \(x \in X\) and \(y\in Y\).
Definition 2
(Operator-valued positive semi-definite kernel) Let \(\mathcal {L}(Y)\) denote the Banach space of bounded linear operators from Y to Y. A function \(K:X\times X\rightarrow \mathcal {L}(Y)\) is said to be an operator-valued positive semi-definite kernel if
(i) \(K(x,x')^*=K(x',x) \qquad \forall ~x,x'\in X.\)
(ii) \(\sum \limits _{i,j=1}^N\langle { y_i},{K(x_i,x_j)y_j} \rangle _{Y}\ge 0 \qquad \forall ~\{x_i\}_{i=1}^N\subset X \text { and } \{y_i\}_{i=1}^N\subset Y.\)
For a given operator-valued positive semi-definite kernel \(K:X \times X \rightarrow \mathcal {L}(Y)\), we can construct a unique vector-valued reproducing kernel Hilbert space \((\mathcal {H},\langle { \cdot },{\cdot } \rangle _{\mathcal {H}})\) of functions from X to Y as follows:
(i) We define the linear function
$$\begin{aligned} K_x: Y \rightarrow \mathcal {H}: y \mapsto K_xy, \end{aligned}$$
where \(K_xy:X \rightarrow Y:x' \mapsto (K_xy)(x')=K(x',x)y\) for \(x,x'\in X\) and \(y\in Y\).
(ii) The span of the set \(\{K_xy:x\in X, y\in Y\}\) is dense in \(\mathcal {H}\).
(iii) Reproducing property:
$$\begin{aligned} \langle { f(x)},{y} \rangle _{Y}=\langle { f},{K_xy} \rangle _{\mathcal {H}},\qquad x\in X,~y \in Y,~\forall ~f\in \mathcal {H}, \end{aligned}$$
in other words, \(f(x) = K_x^* f\).
Moreover, there is a one-to-one correspondence between operator-valued positive semi-definite kernels and vector-valued reproducing kernel Hilbert spaces, see (Micchelli & Pontil, 2005).
Assumption 3
The space \(\mathcal {H}'\) is assumed to be a vector-valued reproducing kernel Hilbert space of functions \(f:X\rightarrow Y\) corresponding to the kernel \(K:X\times X\rightarrow \mathcal {L}(Y)\) such that
(i) \(K_x:Y\rightarrow \mathcal {H}'\) is a Hilbert–Schmidt operator for \(x\in X\) with
$$\begin{aligned} \kappa '^2:=\sup _{x \in X} \left\| {K_x}\right\| _{HS}^2 = {\sup _{x \in X}{\text {tr}}(K_x^*K_x)}<\infty . \end{aligned}$$
(ii) For \(y,y'\in Y\), the real-valued function \(\varsigma :X\times X \rightarrow \mathbb {R}:(x,x')\mapsto \langle { K_{x}y},{K_{x'}y'} \rangle _{\mathcal {H}'}\) is measurable.
Example
If the set Y is a bounded subset of \(\mathbb {R}\), then the reproducing kernel Hilbert space becomes a real-valued reproducing kernel Hilbert space. The corresponding kernel is then a symmetric, positive semi-definite function \(K:X \times X \rightarrow \mathbb {R}\) with the reproducing property \(f(x)=\langle { f},{K_x} \rangle _{\mathcal {H}}\). In this case Assumption 3 simplifies to the condition that the kernel is measurable and \(\kappa '^2:=\sup _{x \in X} \left\| {K_x}\right\| _{\mathcal {H}'}^2=\sup _{x \in X}K(x,x)<\infty\).
Now we introduce some relevant operators used in the convergence analysis. We use the notation \(\textbf{x}=(x_1,\ldots ,x_m)\), \(\textbf{y}=(y_1,\ldots ,y_m)\), \(\textbf{z}=(z_1,\ldots ,z_m)\) for the sample vectors. The product Hilbert space \(Y^m\) is equipped with the inner product \(\langle { \textbf{y}},{\textbf{y}'} \rangle _{m} = \frac{1}{m}\sum _{i=1}^m \langle { y_i},{y'_i} \rangle _{Y},\) and the corresponding norm \(\left\| {\textbf{y}}\right\| _{m}^2=\frac{1}{m}\sum _{i=1}^m\left\| {y_i}\right\| _{Y}^2\). We define the sampling operator \(S_\textbf{x}:\mathcal {H}' \rightarrow Y^m:g\mapsto (g(x_1),\ldots ,g(x_m))\); then the adjoint \(S_\textbf{x}^*:Y^m\rightarrow \mathcal {H}'\) is given by
$$\begin{aligned} S_\textbf{x}^*\textbf{y}= \frac{1}{m}\sum _{i=1}^m K_{x_i} y_i. \end{aligned}$$
We observe that under Assumption 3 the canonical injection map \(I_\nu : \mathcal {H}' \rightarrow \mathscr {L}^2(X,\nu ;Y)\) is norm bounded by \(\kappa ^\prime\), and so is its empirical version \(S_\textbf{x}\).
We denote the population operators \(B_\nu := I_\nu AL^{-1}:\mathcal {H}\rightarrow \mathscr {L}^2(X,\nu ;Y)\), \(T_\nu := B_\nu ^*B_\nu :\mathcal {H}\rightarrow \mathcal {H}\), \(L_\nu := A^*I_\nu ^*I_\nu A:\mathcal {H}\rightarrow \mathcal {H}\), and their empirical versions \(B_\textbf{x}=S_\textbf{x}A L^{-1}:\mathcal {H}\rightarrow Y^m\), \(T_\textbf{x}=B_\textbf{x}^*B_\textbf{x}:\mathcal {H}\rightarrow \mathcal {H}\), \(L_\textbf{x}=A^*S_\textbf{x}^*S_\textbf{x}A:\mathcal {H}\rightarrow \mathcal {H}\). The operators \(T_\nu\), \(T_\textbf{x}\), \(L_\nu\), \(L_\textbf{x}\) are positive, self-adjoint and depend on the kernel. Under Assumption 3, the operators \(B_\textbf{x}\), \(B_\nu\) are bounded by \(\kappa :=\kappa ' \left\| {AL^{-1}}\right\| _{\mathcal {H}\rightarrow \mathcal {H}'}\), and the operators \(L_\textbf{x}\), \(L_\nu\) are bounded by \(\tilde{\kappa }^2\) for \(\tilde{\kappa }:=\kappa ' \left\| {A}\right\| _{\mathcal {H}\rightarrow \mathcal {H}'}\), i.e., \(\left\| {B_\textbf{x}}\right\| _{\mathcal {H}\rightarrow Y^m}\le \kappa\), \(\left\| {B_\nu }\right\| _{\mathcal {H}\rightarrow \mathscr {L}^2(X,\nu ;Y)}\le \kappa\), \(\left\| {L_\textbf{x}}\right\| _{\mathcal {L}(\mathcal {H})}\le \tilde{\kappa }^2\) and \(\left\| {L_\nu }\right\| _{\mathcal {L}(\mathcal {H})}\le \tilde{\kappa }^2\).
2.2 Link condition
The subsequent analysis will frequently use the notion of an index function.
Definition 3
(Index function) A function \(\varphi : \mathbb {R}^+ \rightarrow \mathbb {R}^+\) is said to be an index function if it is continuous and strictly increasing with \(\varphi (0) = 0\).
An index function is called sub-linear whenever the mapping \(t\mapsto t/\varphi (t),\ t>0,\) is nondecreasing. Further, we require this index function to belong to the following class of functions.
The representation \(\varphi =\varphi _2\varphi _1\) is not unique; therefore \(\varphi _2\) may be assumed to be a Lipschitz function with Lipschitz constant 1. We shall also rely upon the following important result for such Lipschitz continuous index functions \(\varphi _2\), needed in our analysis (Peller, 2016, Corollary 1.2.2):
Example
Power-type functions \(\varphi (t)=t^r\) with \(r>0\), and logarithmic functions \(\varphi (t)=t^p\log ^{-\nu }\left( \frac{1}{t}\right) ,\ p,\nu \ge 0\), are examples of functions in the class \(\mathcal {F}\).
The following assumption is used to relate smoothness in terms of the operator L to the covariance operator \(T_\nu\).
Assumption 4
(link condition) There exist a power \(q > 1\) and an index function \(\varrho\), for which the function \(\varrho ^{2}\) is sub-linear. There is a constant \(1 \le \beta <\infty\) such that
The function \(t\mapsto \varphi (t):=\varrho ^{q-1}(t)\) belongs to the class \(\mathcal {F}\).
Only the left inequality will be used in the regular case. In the oversmoothing case, where we need to relate the effective dimensions, we require the right-hand inequality as well. Both inequalities are also used to show the optimality of the rates.
As shown in Böttcher et al. (2006), Assumption 4 implies the range identity \(\mathcal {R}(L^{-q}) = \mathcal {R}(\varrho ^{q}(T_\nu ))\). In the context of a comparison of operators we mention the well-known Heinz Inequality, see (Engl et al., 1996, Prop. 8.21). This asserts that for every exponent \(0< a \le 1\) it holds true
Applying this to the above link condition we obtain the following:
Proposition 1
Under Assumption 4 we have
and
Moreover, we have that
Remark 1
It is heuristically clear that the function \(\varrho ^{2}\) cannot increase faster than linearly, because the operator \(T_\nu = L^{-1} L_\nu L^{-1}\) has \(L^{-2}\) in it. Therefore, requiring sub-linearity is not a strong restriction. More details will be given in Sect. 5.
Link conditions as in Assumption 4 imply decay rates for the singular numbers of the operators involved, by Weyl’s Monotonicity Theorem (Bhatia, 1997, Cor. III.2.3). In our case, this yields that \(s_{j}(\varrho (T_\nu )) = \varrho (s_{j}(T_\nu ))\asymp s_{j}(L^{-1})\). For classical spaces, e.g. Sobolev spaces, when \(L:= (I + \varDelta )^{1/2}\), then \(s_{j}(L^{-1}) \asymp 1/j\) (one spatial dimension). For the above index function \(\varrho\) this means that \(s_{j}(T_\nu ) \asymp \varrho ^{-1}(1/j)\).
Example
(Finitely smoothing covariance operators) If the function \(\varrho\), and hence its inverse, is of power type, then this implies a power-type decay of the singular numbers of \(T_\nu\). In this case, the operator \(T_\nu\) is called finitely smoothing.
Example
(Infinitely smoothing covariance operators) If, on the other hand, the function \(\varrho\) is logarithmic, as e.g., \(\varrho (t)=\left( \log \frac{1}{t}\right) ^{-\frac{1}{\mu }}\), then \(s_{j}(T_\nu ) \asymp e^{-j^{\mu }}\). In this case, the operator \(T_\nu\) is called infinitely smoothing.
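A one-line check of the stated asymptotics (added here for convenience): solving \(\varrho (t)=s\) for t gives
$$\begin{aligned} \left( \log \tfrac{1}{t}\right) ^{-\frac{1}{\mu }} = s \;\Longleftrightarrow \; t = e^{-s^{-\mu }}, \qquad \text {so that}\quad s_{j}(T_\nu ) \asymp \varrho ^{-1}(1/j) = e^{-j^{\mu }}. \end{aligned}$$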
2.3 Effective dimension
The concept of the effective dimension, as introduced in Zhang (2002), proved to be important for deriving fast rates of convergence under Hölder-type source conditions, see (Blanchard & Mücke, 2018; Caponnetto & De Vito, 2007; Guo et al., 2017), and also for general source conditions, see (Lin et al., 2020; Shuai et al., 2020; Rastogi & Sampath, 2017). For the trace-class operator \(T_\nu\) the effective dimension is defined as
$$\begin{aligned} \mathcal {N}_{T_\nu }(\lambda ) := {\text {tr}}\left( (T_\nu +\lambda I)^{-1}T_\nu \right) ,\qquad \lambda >0. \end{aligned}$$
It is known that for an infinite-dimensional operator \(T_\nu\) the function \(\lambda \mapsto \mathcal {N}_{T_\nu }(\lambda )\) is continuous and decreases from \(\infty\) to zero as \(\lambda\) ranges over \(0< \lambda < \infty\) (see for details Blanchard and Mathé, 2012; Blanchard and Mücke, 2020; Lin et al., 2015; Shuai et al., 2020; Zhang, 2002). We shall also use the fact, which follows from spectral calculus, that the function \(\lambda \mapsto \lambda \mathcal {N}_{T_\nu }(\lambda )\) is increasing.
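As an illustration (added here; the eigenvalue sequence is synthetic and not taken from the paper), the two monotonicity properties just mentioned can be checked numerically for a power-type spectrum:

```python
# Illustrative sketch: effective dimension N_T(lam) = sum_j s_j / (s_j + lam).
import numpy as np

s = 1.0 / np.arange(1, 10_001) ** 2          # synthetic spectrum s_j ~ j^{-2}
lambdas = np.logspace(-6, 0, 25)

def eff_dim(lam: float) -> float:
    # N_T(lam) = sum_j s_j / (s_j + lam)
    return float(np.sum(s / (s + lam)))

N = np.array([eff_dim(lam) for lam in lambdas])
assert np.all(np.diff(N) <= 0)               # N_T(lam) decreases in lam
assert np.all(np.diff(lambdas * N) >= 0)     # lam * N_T(lam) increases in lam
print(N[[0, 12, 24]])
```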
We have the trivial bound
In the subsequent analysis, we shall need a relationship between the effective dimensions \(\mathcal {N}_{T_\nu }(\lambda )\) and \(\mathcal {N}_{L_\nu }(\lambda )\) of the operators \(T_\nu\) and \(L_\nu\), respectively. For this, the following assumption, introduced in Lin et al. (2015), will be used. There, it was shown that it is satisfied for both moderately ill-posed and severely ill-posed operators.
Assumption 5
There exists a constant C such that for \(0 < t \le \left\| {L_\nu }\right\| _{\mathcal L(\mathcal {H})}\) we have
Proposition 2
Suppose Assumptions 4 and 5 hold true. Suppose that the function \(\varrho\) from the link condition, Assumption 4, is such that the function \(t\mapsto \left( \varrho ^{2q}\right) ^{-1}(t)\) is operator concave, and that there is some \(n\in \mathbb {N}\) for which the function \(t\mapsto \varrho ^{-1}(t)/t^n\) is concave. Then, there is \(\widetilde{C}\) for which we have that
Remark 2
For a power type function \(\varrho (t):= t^{a}\) the above concavity assumptions hold true whenever \(2aq \ge 1\) and \(n \le 1/a \le n+1\). In particular, the number n is uniquely determined.
2.4 Regularization schemes
General regularization schemes were introduced and discussed in ill-posed inverse problems and learning theory (see Shuai & Pereverzev, 2013, Sect. 2.2, and Bauer et al., 2007, Sect. 3.1, for a brief discussion). Using the notation from Sect. 2.1, the Tikhonov regularization scheme from (5) can be re-expressed as follows:
with minimizer given as
The following definition extends this by replacing the operator \((T_\textbf{x}+\lambda I)^{-1}\) by some operator function \(g_\lambda (T_\textbf{x})\).
Definition 4
(Spectral regularization) We say that a family of functions \(g_\lambda :[0,\kappa ^2]\rightarrow \mathbb {R}\), \(0<\lambda \le a\), is a regularization scheme if there exist constants \(D,B,\gamma\) such that
- \(\sup \limits _{t\in [0,\kappa ^2]}\left|t g_\lambda (t) \right|\le D\),
- \(\sup \limits _{t\in [0,\kappa ^2]}\left|g_\lambda (t) \right|\le \frac{B}{\lambda }\),
- \(\sup \limits _{t\in [0,\kappa ^2]}\left|r_\lambda (t) \right|\le \gamma \qquad \text {for}\quad r_\lambda (t)=1-g_\lambda (t)t\).
- For some constant \(\gamma _p\) (independent of \(\lambda\)), the maximal p satisfying the condition
$$\begin{aligned} \sup \limits _{t\in [0,\kappa ^2]}\left|r_\lambda (t) \right|t^p\le \gamma _p\lambda ^p \end{aligned}$$
is said to be the qualification of the regularization scheme \(g_\lambda\).
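As a standard worked example (not reproduced from the text), Tikhonov regularization satisfies these requirements:
$$\begin{aligned} g_\lambda (t)=\frac{1}{t+\lambda }:\qquad t\,g_\lambda (t)=\frac{t}{t+\lambda }\le 1,\qquad g_\lambda (t)\le \frac{1}{\lambda },\qquad r_\lambda (t)=\frac{\lambda }{t+\lambda }\le 1, \end{aligned}$$
so one may take \(D=B=\gamma =1\); moreover \(r_\lambda (t)\,t = \frac{\lambda t}{t+\lambda }\le \lambda\), whereas \(\sup _t r_\lambda (t)t^{p}\) is only of order \(\lambda\) for \(p>1\), so the qualification of Tikhonov regularization is \(p=1\) (with \(\gamma _1=1\)).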
Definition 5
The qualification p covers the index function \(\varphi\) if the function \(t\rightarrow \frac{t^p}{\varphi (t)}\) is nondecreasing.
We mention the following result.
Proposition 3
Suppose \(\varphi\) is a nondecreasing index function and the qualification, say \(p\ge 1\), of the regularization \(g_\lambda\) covers \(\varphi\). Then
Also, we have that
Most of the linear (spectral) regularization schemes (Tikhonov regularization, Landweber iteration or spectral cut-off) satisfy the properties of general regularization. Inspired by the representation for the minimizer of the Tikhonov functional (5) we consider a general regularized solution in Hilbert scales corresponding to the above regularization \(g_\lambda\) in the form
where by spectral calculus the real-valued function \(g_\lambda\) is applied to the self-adjoint operator \(T_\textbf{x}\).
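The following schematic sketch (an added illustration; it assumes that the regularized solution takes the usual spectral form \(f_{\textbf{z},\lambda }= L^{-1}g_\lambda (T_\textbf{x})B_\textbf{x}^{*}\textbf{y}\), with finite matrices standing in for the operators) shows how such a filter \(g_\lambda\) is applied in practice.

```python
# Schematic sketch only: applying a spectral filter g_lambda to a finite surrogate of T_x.
import numpy as np

def spectral_solution(B, Linv, y, lam, scheme="tikhonov"):
    """B: (m, n) surrogate of B_x, Linv: (n, n) surrogate of L^{-1}, y: shape (m,)."""
    m = B.shape[0]
    T = B.T @ B / m                        # empirical operator T_x = B_x^* B_x
    By = B.T @ y / m                       # B_x^* y (empirical inner product carries 1/m)
    w, U = np.linalg.eigh(T)               # spectral decomposition of T_x
    if scheme == "tikhonov":
        g = 1.0 / (w + lam)                # g_lambda(t) = 1 / (t + lambda)
    elif scheme == "cutoff":
        g = np.where(w >= lam, 1.0 / np.maximum(w, lam), 0.0)   # spectral cut-off
    else:
        raise ValueError(scheme)
    return Linv @ (U @ (g * (U.T @ By)))   # f = L^{-1} g_lambda(T_x) B_x^* y
```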
3 Convergence analysis
The analysis will distinguish between two cases, the ‘regular’ one, when \(f_\rho \in \mathcal {D}(L)\), and the ‘low smoothness’ case, when \(f_\rho \not \in \mathcal {D}(L)\). In either case, we shall first utilize the concept of distance functions. This will later allow us to establish convergence rates in a more classical style.
For the asymptotical analysis, we shall require the standard assumption relating the sample size m and the parameter \(\lambda\) such that
It will be seen that, asymptotically, the condition (12) is always satisfied for the parameter which is optimally chosen under known smoothness.
Since the function \(\mathcal {N}_{T_\nu }(\lambda )\) is decreasing in \(\lambda\), for \(\lambda \le 1\) we have that \(\mathcal {N}_{T_\nu }(1)\le \mathcal {N}_{T_\nu }(\lambda )\). Hence condition (12) yields that
Several probabilistic quantities will be used to express the error bounds. Precisely, for an index function \(\zeta\) we let
and
In case that \(\zeta (t) = t^{r}\) we abbreviate \(\varXi ^{t^{r}}\) by \(\varXi ^{r}\) and \(\varXi ^{t}\) by \(\varXi\), not to be confused with the power. High probability bounds for these quantities are known, and these are given correspondingly in Propositions 4 and 5 in Appendix C.
3.1 The oversmoothing case
As mentioned before, we shall use distance functions, which measure the violation of a benchmark smoothness. Here the benchmark will be \(f_\rho \in \mathcal {D}(L)\).
Definition 6
We define the distance function \(d : [0, \infty )\rightarrow [0, \infty )\) by
The distance function is positive, decreasing, convex and continuous for all \(0 \le R < \infty\). It tends to 0 as \(R \rightarrow \infty\), see (Hofmann, 2006). Hence, the unique minimizer exists and will be denoted by \(f_\rho ^R\).
Notice the following: If \(f_\rho \in \mathcal {D}(L)\) then for some R the minimizer \(f_\rho ^R\) of the distance function will obey \(f_\rho ^R=f_\rho\).
Remark 3
In a rudimentary form, this approach was given in (Baumeister, 1987, Thm. 6.8). It was then introduced in regularization theory in Hofmann (2006). Within learning theory, such a concept was also used in the study (Smale & Zhou, 2003).
Theorem 1
Let \(\textbf{z}\) be i.i.d. samples drawn according to the probability measure \(\rho\). Suppose the Assumptions 1–5 hold true. Let \(g_\lambda\) be a regularization with corresponding regularized solution \(f_{\textbf{z},\lambda }\) (see (11)). Suppose that the qualification p of the regularization \(g_\lambda\) covers the function \(\varrho\) (from Assumption 4), and that, for some \(n\ge 1\), the function \(\varrho ^{-1}(t)/t^n\) is concave and the function \(\left( \varrho ^{2q}\right) ^{-1}(t)\) is operator concave. Then for all \(0<\eta <1\), and for \(\lambda\) satisfying the condition (12), the following upper bound holds true with confidence \(1-\eta\):
where C depends on B, D, \(c_p\), \(\kappa\), n, \(\beta\), \(\widetilde{C}\).
The bound from Theorem 1 is valid for all \(R\ge \varSigma +\kappa M/\mathcal {N}_{T_\nu }(1)\), and we shall now optimize the bound from Theorem 1 with respect to the choice of \(R\ge \varSigma +\kappa M/\mathcal {N}_{T_\nu }(1)\).
First, if \(f_\rho \in \mathcal {D}(L)\) then there is \(\bar{R}\ge \varSigma +\kappa M/\mathcal {N}_{T_\nu }(1)\) such that \(d(\bar{R}) = 0\), and
where C depends on B, D, \(c_p\), \(\kappa\), n, \(\beta\), \(\widetilde{C}\).
Otherwise, in the low smoothness case, \(f_\rho \not \in \mathcal {D}(L)\), we introduce the following function
which is a non-vanishing, decreasing function, and hence the inverse \(\varGamma ^{-1}\) exists, and it is decreasing. Given \(\lambda >0\), by letting \(R = R(\lambda )\) solve the equation \(\varGamma (R) = \varrho (\lambda )\), we find that
where C depends on B, D, \(c_p\), \(\kappa\), n, \(\beta\), \(\widetilde{C}\).
The above dependency \(\lambda \mapsto R(\lambda )\) can be made explicit when assuming that \(f_\rho\) has some smoothness measured in terms of a source condition, see Sect. 4 below. For Theorem 1 we then get the error bound (19), but the parameter \(\lambda\) has to obey (12).
3.2 The regular case
Here we analyze the rates of convergence in the case when the underlying true solution \(f_\rho\) belongs to the domain of the operator L. Again, we shall choose a benchmark smoothness, here in the form of \(f_\rho \in \mathcal {D}(L^q)\) for some \(q\ge 1\). This benchmark smoothness is determined by the user. With respect to this benchmark we introduce the following distance function.
Definition 7
Given \(q\ge 1\) we define the distance function \(d_{q} : [0, \infty )\rightarrow [0, \infty )\) by
Theorem 2
Let \(\textbf{z}\) be i.i.d. samples drawn according to the probability measure \(\rho\). Suppose the Assumptions 1–4 hold true. Let \(g_\lambda\) be a regularization with corresponding regularized solution \(f_{\textbf{z},\lambda }\) (see (11)). Let \(\zeta\) be any index function such that the qualification \(\frac{1}{2}\) covers \(\zeta\). Suppose that the qualification p of the regularization \(g_\lambda\) covers the function \(\zeta \varphi\) (with \(\varphi\) from Assumption 4). Then for all \(0<\eta <1\), and for \(\lambda\) satisfying the condition (12), the following upper bound holds true with confidence \(1-\eta\):
Consequently, we find that
and
where C depends on B, D, \(c_p\), \(\kappa\), and \(C'=2\kappa M+\varSigma\).
The bound from Theorem 2 is valid for all \(R\ge 1\), and we shall now optimize the bound from Theorem 2 with respect to the choice of \(R\ge 1\).
First, if \(f_\rho \in \mathcal {R}\left( L^{-q}\right)\), then \(d_{q}(\bar{R}) = 0\) for some \(\bar{R}\), and we find that
Otherwise, in case that \(f_\rho \not \in \mathcal {R}\left( L^{-q}\right)\), we introduce the following function
which is a non-vanishing, decreasing function, and hence the inverse \(\varGamma _{q}^{-1}\) exists and is decreasing. We finally arrive at the main result by letting \(R = R(\lambda )\) solve the equation \(\varGamma _{q}(R) = \varphi (\lambda )\), and we find that
4 Smoothness in terms of source-wise representation
So far, convergence results have been established in terms of distance functions. We will now specify the smoothness of the true solution in terms of the bounded, injective and self-adjoint operator \(L^{-1}\). This is natural for regularization in Hilbert scales.
Assumption 6
(General source condition) For an index function \(\theta\), the true solution \(f_\rho\) belongs to the class \(\varOmega (\theta ,R^\dagger )\) with
$$\begin{aligned} \varOmega (\theta ,R^\dagger ) := \left\{ f\in \mathcal {H}:\ f = \theta (L^{-1})v,\ \left\| {v}\right\| _{\mathcal {H}}\le R^\dagger \right\} . \end{aligned}$$
Notice that elements from \(\varOmega (\theta ,R^\dagger )\) belong to the range of \(\theta (L^{-1})\) which coincides with the domain of \(\theta (L)\), since \(L^{-1}\) was assumed to be bounded.
We aim at bounding the distance functions d(R) and \(d_q(R)\) from the oversmoothing and regular cases, respectively.
For a better understanding, we shall highlight the obtained general bounds when the considered index functions are of power type, and we specify the function \(\theta (t) := t^{r}\), which represents the smoothness, as well as \(\varrho (t) = t^a\), representing the link, for this purpose. It will be seen that the index function \(t\mapsto \theta (\varrho (t)),\ t>0\) is relevant in the subsequent corollaries, which here reads as \(\theta (\varrho (t)) = t^{ar},\ t>0\). Also, in the regular case with benchmark smoothness \(f_\rho \in \mathcal R(L^{-q})\), the function \(t \mapsto \frac{\iota ^q}{\theta }(t)\) appears, and this is required to be an index function. Within the power type context, this reads as \(r < q\), and it simply means that the actual smoothness is not beyond the benchmark.
Finally, we emphasize that the rates will depend on the decay of the effective dimension of the covariance operator \(T_\nu\), which was introduced in Sect. 2.3. Therefore, we will present the obtained bounds under specified decay rates for the effective dimension in Sect. 4.3. The resulting overall rates are summarized in Tables 1 and 2, respectively.
4.1 The oversmoothing case
Here the benchmark source condition \(f_\rho \in \mathcal R(L^{-1})\) (\(q=1\)) is linear, represented by the identity function \(\iota :t \mapsto t\), and we shall thus assume that the index function \(\theta\) is sub-linear. The obtained bounds will rely on the results from (Hofmann & Mathé, 2007, Theorem 5.9). Under Assumption 6 we find that
In order to minimize the bound from Theorem 1, we balance \(d(R) = R \varrho (\lambda )\), resulting in
Thus, for this value of \(R(\lambda )\) under the condition (12), the bound (19) reduces to
The following corollary is a consequence of Theorem 1; it explicitly provides us with an error bound in terms of the sample size m.
Corollary 1
Suppose that the unknown \(f_\rho\) obeys Assumption 6 for a sub-linear function \(\theta\). Under the same assumptions as in Theorem 1, and with the a-priori choice of the regularization parameter \(\lambda ^{*} = \lambda ^{*}(m)\) obtained from solving the equation \(\mathcal N_{T_\nu }(\lambda ^{*}) = m \lambda ^{*}\), for all \(0<\eta <1\) the following error estimate holds with confidence \(1-\eta\):
where C depends on B, D, \(c_p\), \(\kappa\), n, \(\beta\), \(\widetilde{C}\), M, \(\varSigma\), and \(R^\dagger\).
Evidently, the above parameter choice satisfies condition (12).
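The a-priori choice of Corollary 1 can be computed numerically once (an estimate of) the spectrum of \(T_\nu\) is available; the following sketch (an added illustration with a synthetic spectrum) solves \(\mathcal N_{T_\nu }(\lambda ^*)=m\lambda ^*\) by bisection.

```python
# Illustrative sketch: solving N(lambda) = m * lambda for a synthetic spectrum.
import numpy as np

s = 1.0 / np.arange(1, 5_001) ** 1.5      # synthetic eigenvalues of T_nu

def eff_dim(lam: float) -> float:
    return float(np.sum(s / (s + lam)))

def lambda_star(m: int, lo: float = 1e-12, hi: float = 1.0, iters: int = 80) -> float:
    # h(lam) = m * lam - N(lam) is increasing; it is negative at lo and positive at hi
    for _ in range(iters):
        mid = np.sqrt(lo * hi)            # bisection on a logarithmic scale
        if m * mid - eff_dim(mid) > 0:
            hi = mid
        else:
            lo = mid
    return hi

for m in (100, 1_000, 10_000):
    lam = lambda_star(m)
    print(m, lam, eff_dim(lam) / (m * lam))   # the ratio should be close to 1
```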
4.2 The regular case
In this case the benchmark is given by the index function \(\iota ^{q}\), and we shall assume that the given smoothness, measured in terms of \(\theta\), is such that the function \(\iota ^{q}/\theta\), for \(0 < t \le \kappa ^2\), is an index function. However, the definition of the distance function \(R \mapsto d_{q}(R)\) is non-standard. The target norm is \(\left\| {L(f - f_\rho )}\right\| _{\mathcal {H}}\), and, in order to apply the result from (Hofmann & Mathé, 2007, Theorem 5.9), we have to ‘rescale’ the given smoothness (in terms of the operator \(L^{-1}\)) by a factor \(L^{-1}\). If Assumption 6 holds true with an index function \(\theta\) for which the quotient \(\iota ^{q}/\theta\) is an index function (and hence so is \(\iota ^{q-1}/(\theta /\iota )\)), then this results in the bound
According to Theorem 2 we balance
This yields
Inserting this bound into Theorem 2 we find that
provided that (12) holds.
The optimization of the bound in the inequality (25) depends on which of the last two summands is dominant. We can then balance the remaining two terms. This results in the following corollaries for the different choices of the regularization parameter:
Corollary 2
Suppose that the unknown \(f_\rho\) obeys Assumption 6 for an index function \(\theta\), and that the related functions \(\frac{\iota ^q}{\theta }(t)\) and \(\frac{\iota ^q}{\theta }\left( \varrho (t)\right) \sqrt{\frac{\mathcal {N}_{T_\nu }(t)}{t}}\) are index functions. Under the same assumptions of Theorem 2, and for the a-priori choice of the regularization parameter \(\lambda ^*=\varphi ^{-1}\left( \frac{1}{\sqrt{m}}\right)\), for all \(0<\eta <1\), the following upper bound holds with confidence \(1-\eta\):
where C depends on B, D, \(c_p\), \(\kappa\), M, \(\varSigma\), and \(R^\dagger\).
Corollary 3
Suppose that the unknown \(f_\rho\) obeys Assumption 6 for an index function \(\theta\), and that the related functions \(\frac{\iota ^q}{\theta }(t)\) and \(\frac{\iota ^q}{\theta }\left( \varrho (t)\right) \sqrt{\frac{\mathcal {N}_{T_\nu }(t)}{t}}\) are index functions. Under the same assumptions of Theorem 2, and for the a-priori choice of the regularization parameter \(\lambda ^*\) as solution to the equation \(\frac{\varTheta ^{2}(\varrho (\lambda ^{*}))}{\varrho ^{2}(\lambda ^{*})} \lambda ^{*}m = \mathcal N_{T_\nu }(\lambda ^{*})\), for all \(0<\eta <1\), the following upper bound holds with confidence \(1-\eta\):
where C depends on B, D, \(c_p\), \(\kappa\), M, \(\varSigma\), and \(R^\dagger\).
Since by assumption the function \(t \mapsto \frac{\varTheta ^{2}(\varrho (t))}{\varrho ^{2}(t)}\) is an index function, condition (12) holds for m large enough.
4.3 Taking the behavior of effective dimension into account
Below, to be specific, we consider the following two behaviors of the decay of the effective dimension of the covariance operator \(T_\nu\), namely power type and logarithmic type, both known to hold true in many situations.
Assumption 7
(Polynomial decay) There exists some positive constant \(c>0\) such that
Assumption 8
(Logarithmic decay) There exists some positive constant \(c>0\) such that
Remark 4
We mention that a polynomial decay of the eigenvalues of the covariance operator \(T_\nu\) yields a power-type behavior of the effective dimension, see (Caponnetto & De Vito, 2007). In some situations this behavior is not evident. Shuai et al. (2020) showed that for the kernel \(K_1(x,x') = xx' + e^{-8(x-x')^2}\) with uniform sampling on [0, 1], the effective dimension exhibits a log-type behavior (Assumption 8); on the other hand, the kernel \(K_2(x,x') = \min \{x,x'\}-xx'\) exhibits a power-type behavior (Assumption 7).
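A rough numerical illustration of these two regimes (added here; the grid size and the values of \(\lambda\) are arbitrary) approximates the effective dimension from the eigenvalues of the normalized Gram matrices of the two kernels on a uniform grid of [0, 1].

```python
# Illustrative sketch: approximate effective dimensions of the two kernels above.
import numpy as np

n = 400
x = (np.arange(n) + 0.5) / n
X, Xp = np.meshgrid(x, x, indexing="ij")

K1 = X * Xp + np.exp(-8.0 * (X - Xp) ** 2)     # expected log-type effective dimension
K2 = np.minimum(X, Xp) - X * Xp                # expected power-type effective dimension

for name, K in [("K1", K1), ("K2", K2)]:
    ev = np.clip(np.linalg.eigvalsh(K / n), 0.0, None)   # approximate covariance spectrum
    for lam in (1e-2, 1e-3, 1e-4):
        print(name, lam, round(float(np.sum(ev / (ev + lam))), 2))
```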
Below, we summarize the convergence rates under the specific behaviors of the effective dimension, Assumptions 7 and 8, respectively, in Tables 1 and 2. We confine ourselves to the power type case, when both the link condition and the source condition are of power type, i.e., \(\varrho (t)=t^a\) and \(\theta (t)=t^r\) for parameters \(a,r>0\). The qualification of the regularization is denoted by p as before. Also, the benchmark smoothness is q, where either \(q=1\) (oversmoothing case) or \(q>1\) (regular case). Notice that, due to the sub-linearity condition for \(\varrho ^{2}\), we must have \(0 < a \le 1/2\). Also, throughout the analysis we assume that the qualification covers the given smoothness, i.e., \(a q\le p\). The bounds presented in the tables are consequences of Corollaries 1–3, respectively. Therefore, Assumptions 1–6 are assumed to be satisfied.
The tables are structured as follows. In the first column we present the rates of convergence \(\varepsilon (m)\) for the error estimates of the form:
In the second column, the corresponding order of the regularization parameter choice \(\lambda ^*\) in terms of m is indicated. In the third column, we highlight the smoothness of the true solution \(f_\rho\). In the fourth column, we emphasize additional constraints, specifically on the benchmark smoothness.
The first row corresponds to the oversmoothing case, and the last two rows correspond to the regular case. In the regular case, we observe that the validity of the rates of the convergence depends on the benchmark smoothness through aq. At the intersection point, when \(a q = a r+\frac{b+1}{2}\), both rates coincide. As we will see in the next section the rates of convergence in the regular case (\(q>1\)) are optimal provided that the benchmark smoothness is chosen appropriately.
5 Optimality of the error bounds
We shall discuss the optimality of the previously obtained error bounds in the regular case, and we shall use the known optimality results from (Blanchard & Mücke, 2018). However, at present the smoothness is measured with respect to the operator \(L^{-1}\), whereas in Blanchard and Mücke (2018) this was done with respect to the operator \(L_\nu := A^{ *} I_\nu ^{*} I_\nu A = LT_\nu L\). Therefore, the following ‘recipe’ will be used.
1. Transfer smoothness as given in terms of \(L^{-1}\) to smoothness in terms of \(L_\nu\), and
2. knowing the decay of the singular numbers of the operator \(T_\nu\) inherent in Assumption 7, find the decay of the singular numbers of \(L_\nu\).
In order to keep the analysis simple and transparent, we confine ourselves to power type smoothness \(\theta (t)=t^{r},\ 0 < r \le q\) in Assumption 6, as well as to a power type link in Assumption 4 with \(\varrho (t) := t^{a}\) for some \(a>0\).
In the subsequent subsections, we shall sketch the proof of the lower bounds step by step, reaching the optimality assertion at the end. In order to get there, additional assumptions have to be made, a lifting condition (Assumption 9), and a singular number decay condition (Assumption 10).
5.1 Relating smoothness
The link condition is crucial, and the subsequent arguments are of interpolation type, applying Heinz Inequality within the present context. To this end, we require that q is chosen such that \(aq\ge 1/2\). In this case Assumption 4 yields, by applying Heinz Inequality (9) with exponent \(1/(2aq)\le 1\) that
Letting \(v:= L^{-1}u\) we find that (see Footnote 1)
First, we see from this that \(a < 1/2\), because otherwise \(L_\nu\) would be continuously invertible. Also, the relation (26) would allow transferring smoothness r with respect to \(L^{-1}\) to \(L_\nu\) as long as \(0 < r \le \frac{1}{2a} - 1\). In order to treat higher smoothness (in terms of \(L^{-1}\)) a lifting condition is unavoidable. This must be consistent with the link from (26). Thus we look for a factor z such that \(t^{(\frac{1}{2a} -1)z} = t^{q}\), yielding \(z:= \frac{2aq}{1 - 2a}\).
Assumption 9
(lifting condition) We have that
Remark 5
The strengthening of the original link condition, Assumption 4, towards a lifting condition has been discussed in more detail in Mathé (2019).
Having this lifting, and applying Heinz Inequality (9) (with exponent r/q) yields
and a source-wise representation as in Assumption 6 yields a corresponding source-wise representation with respect to the operator \(L_\nu\) (with different constant).
5.2 Relating effective dimensions
Here we shall use the following consequence of Assumption 4. Indeed, turning from squared norms to quadratic forms we see that
The Weyl Monotonicity Theorem (Bhatia, 1997, Cor. III.2.3) yields that then \(s_{j}(L^{-2q}) \asymp s_{j}(T_\nu ^{2aq}),\ j=1,2,\dots\), or simplified that \(s_{j}(L^{-1}) \asymp s_{j}^{a}(T_\nu ),\ j=1,2,\dots\) by spectral calculus. Here \(s_{j}(L^{-1})\) and \(s_{j}(T_\nu )\) denote the singular numbers of the operators. Similarly, we obtain from (26) that \(s_{j}(L_\nu ) \asymp s_{j}^{\frac{1 - 2a}{a}}(L^{-1})\), and a fortiori that \(s_{j}(L_\nu ) \asymp s_{j}^{1 - 2a}(T_\nu )\).
5.3 Lower bound
In order to show the optimality of the error bounds as discussed in Table 1, we shall ensure that the decay of the effective dimension cannot be faster than asserted in Assumption 7.
Assumption 10
(decay of singular numbers) There is a constant \(c>0\) such that the singular numbers of the operator \(T_\nu\) obey
Notice that this yields \(\mathcal N_{T_\nu }(\lambda ) \ge c \lambda ^{-b}\), so that this is the limiting case for which Assumption 7 holds. Hence, the assumed decay of the singular numbers of \(T_\nu\) is best possible by order. The following is reported in Blanchard and Mücke (2018) for the problem (2): under smoothness r with respect to the operator \(L_\nu\), and with the decay of the singular numbers \(s_{j}(L_\nu )\) not faster than \(j^{-1/b}\), the optimal rate is of the order \(\left( \frac{1}{\sqrt{m}}\right) ^{\frac{2r}{2r + b +1}}\). In the present context, we have to assign \(r\leftarrow \frac{ar}{1-2a}\) and \(b\leftarrow \frac{b}{1-2a}\). This yields a lower bound of the order
for the range \(\frac{ar}{1-2a}\le p\).
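Carrying out the substitution explicitly (a check added here, not quoted from the original):
$$\begin{aligned} \frac{2\cdot \frac{ar}{1-2a}}{2\cdot \frac{ar}{1-2a}+\frac{b}{1-2a}+1} = \frac{2ar}{2ar+b+(1-2a)} = \frac{2ar}{2ar+b+1-2a}, \end{aligned}$$
so that the lower bound is of the order \(\left( \frac{1}{\sqrt{m}}\right) ^{\frac{2ar}{2ar+b+1-2a}}\).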
This corresponds to the upper bound for \(a\le \frac{1}{2}\), \(a q\le {p}\), \(r \le q\le r +\frac{b+1}{2a}\), as discussed in the last row of Table 1, and it shows that the rate is of optimal order.
6 Conclusion
We investigated regularization schemes in Hilbert scales for linear inverse (learning) problems. Regularized solutions are constructed under the requirement that these belong to \(\mathcal {D}(L)\), for the (unbounded) operator L which generates the scale. Clearly, this may be extended to the case that the regularized solutions belong to \(\mathcal {D}(L^s)\) for some \(s>0\), simply by considering \(L^s\) as a generator of the (same) scale.
We draw the following conclusions. Some arguments consider the case of power type conditions, and for this we refer to Tables 1 and 2 for details.
Optimal rates: In the regular case, we can achieve the optimal rates of convergence provided that the benchmark smoothness q is chosen in the appropriate region (see Sect. 5.3). In contrast, in the mis-specified (oversmoothing) case we can only prove sub-optimal rates of convergence. By now, no techniques are known which are capable of improving the rates in this case.
Saturation effects: In case \(q=r\), we observe from the above analysis that optimal rates can be proven for the range \(ar\le p\), provided that the scheme has qualification p. For standard regularization schemes, this would hold for the range \(\frac{ar}{1-2a}\le p\), only. Hence, the saturation effect is delayed here.
Convergence rates without source condition: Typically, rates of convergence are shown under smoothness in terms of source conditions. Here we establish error bounds by using the concept of distance functions, measuring the violation of a benchmark source condition. When specifying smoothness as a source condition, we use known bounds of the considered distance function. This provides us with convergence rates in terms of the sample size.
Source conditions: When studying kernel methods, the smoothness of the true solution is usually measured in terms of a source condition with respect to the covariance operator, and hence can hardly be verified. We consider source conditions in terms of the Hilbert scale. This has a clear meaning, and it is independent of the choice of kernel. However, the chosen kernel comes into play when requiring the validity of a link condition.
Availability of data and material
Not applicable.
Code availability
Not applicable.
Notes
We shall suppress the recalculations of the corresponding constants.
We use that \(\left( \varrho ^{2q}\right) ^{-1}(t^{2q}) =\varrho ^{-1}(t)\).
References
Agapiou, S., & Mathé, P. (2022). Designing truncated priors for direct and inverse Bayesian problems. Electronic Journal of Statistics, 16(1), 158–200.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404.
Bauer, F., Pereverzev, S., & Rosasco, L. (2007). On regularization algorithms in learning theory. Journal of Complexity, 23(1), 52–72.
Baumeister, J. (1987). Stable solution of inverse problems. Advanced lectures in mathematics. Friedrich Vieweg & Sohn.
Bhatia, R. (1997). Matrix analysis. In Grad. texts Math. (Vol. 169). Springer-Verlag, New York.
Blanchard, G., & Mathé, P. (2012). Discrepancy principle for statistical inverse problems with application to conjugate gradient iteration. Inverse Problems, 28(11), 115011.
Blanchard, G., Mathé, P., & Mücke, N. (2019). Lepskii principle in supervised learning. arXiv:1905.10764.
Blanchard, G., & Mücke, N. (2018). Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18(4), 971–1013.
Blanchard, G., & Mücke, N. (2020). Kernel regression, minimax rates and effective dimensionality: Beyond the regular case. Analysis and Applications, 18(04), 683–696.
Böttcher, A., Hofmann, B., Tautenhahn, U., & Yamamoto, M. (2006). Convergence rates for Tikhonov regularization from different kinds of smoothness conditions. Applicable Analysis, 85(5), 555–578.
Caponnetto, A., & De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3), 331–368.
Engl, H. W., Hanke, M., & Neubauer, A. (1996). Regularization of inverse problems, volume 375. Math. Appl. Kluwer Academic Publishers Group.
Gugushvili, S., van der Vaart, A., & Yan, D. (2020). Bayesian linear inverse problems in regularity scales. Annales de l’Institut Henri Poincaré Probabilités et Statistiques, 56(3), 2081–2107.
Guo, Z.-C., Lin, S.-B., & Zhou, D.-X. (2017). Learning theory of distributed spectral algorithms. Inverse Problems, 33, 74009.
Hofmann, B. (2006). Approximate source conditions in Tikhonov–Phillips regularization and consequences for inverse problems with multiplication operators. Mathematical Methods in the Applied Sciences, 29(3), 351–371.
Hofmann, B., & Mathé, P. (2007). Analysis of profile functions for general linear regularization methods. SIAM Journal on Numerical Analysis, 45(3), 1122–1141.
Hofmann, B., & Mathé, P. (2018). Tikhonov regularization with oversmoothing penalty for non-linear ill-posed problems in Hilbert scales. Inverse Problems, 34(1), 15007.
Hofmann, B., & Mathé, P. (2020). A priori parameter choice in Tikhonov regularization with oversmoothing penalty for non-linear ill-posed problems. In J. Cheng, S. Lu, & M. Yamamoto (Eds.), Inverse problems related top (pp. 169–176). Springer.
Lin, J., Rudi, A., Rosasco, L., & Cevher, V. (2020). Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 48(3), 868–890.
Lin, K., Shuai, L., & Mathé, P. (2015). Oracle-type posterior contraction rates in Bayesian inverse problems. Inverse Problems Imaging, 9(3), 895–915.
Mair, B. A. (1994). Tikhonov regularization for finitely and infinitely smoothing operators. SIAM Journal on Mathematical Analysis, 25(1), 135–147.
Mathé, P. (2019). Bayesian inverse problems with non-commuting operators. Mathematics of Computation, 88(320), 2897–2912.
Mathé, P., & Pereverzev, S. V. (2003). Geometry of linear ill-posed problems in variable Hilbert scales. Inverse Problems, 19(3), 789–803.
Mathé, P., & Tautenhahn, U. (2006). Interpolation in variable Hilbert scales with application to inverse problems. Inverse Problems, 22(6), 2271–2297.
Mathé, P., & Tautenhahn, U. (2007). Error bounds for regularization methods in Hilbert scales by using operator monotonicity. Far East Journal of Mathematical Sciences, 24(1), 1.
Micchelli, C. A., & Pontil, M. (2005). On learning vector-valued functions. Neural Computation, 17(1), 177–204.
Mücke, N., & Reiss, E. (2020). Stochastic gradient descent in Hilbert scales: Smoothness, preconditioning and earlier stopping. arXiv:2006.10840.
Nair, M. T. (1999). On Morozov’s method for Tikhonov regularization as an optimal order yielding algorithm. Journal of Analytical and Applied, 18, 37–46.
Nair, M. T. (2002). Optimal order results for a class of regularization methods using unbounded operators. Integral Equations and Operator Theory, 44(1), 79–92.
Nair, M. T., Pereverzev, S. V., & Tautenhahn, U. (2005). Regularization in Hilbert scales under general smoothing conditions. Inverse Problem, 21(6), 1851–1869.
Natterer, F. (1984). Error bounds for Tikhonov regularization in Hilbert scales. Applicable Analysis, 18(1–2), 29–37.
Neubauer, A. (1988). An a posteriori parameter choice for tikhonov regularization in Hilbert scales leading to optimal convergence rates. SIAM Journal on Numerical Analysis, 25(6), 1313–1326.
Peller, V. V. (2016). Multiple operator integrals in perturbation theory. Bulletin of Mathematical Sciences, 6(1), 15–88.
Rastogi, A., Blanchard, G. & Mathé, P. (2020). Convergence analysis of Tikhonov regularization for non-linear statistical inverse learning problems. Electronic Journal of Statistics, 14(2), 2798–2841.
Rastogi, A., & Sampath, S. (2017). Optimal rates for the regularized learning algorithms under general source condition. Frontiers in Applied Mathematics and Statistics, 3, 3.
Shuai, L., Mathé, P., & Pereverzev, S. V. (2020). Balancing principle in supervised learning for a general regularization scheme. Applied and Computational Harmonic Analysis, 48(1), 123–148.
Shuai, L., & Pereverzev, S. (2013). Regularization theory for ill-posed problems: Selected topics (Vol. 58). Walter de Gruyter.
Smale, S., & Zhou, D.-X. (2003). Estimating the approximation error in learning theory. Analysis and Applications, 01(01), 17–41.
Tautenhahn, U. (1996). Error estimates for regularization methods in Hilbert scales. SIAM Journal on Numerical Analysis, 33(6), 2120–2130.
Zhang, T. (2002). Effective dimension and generalization of kernel learning. In Proceedings of 15th International Conference Neural Information Processing System, (pp. 454–461), MIT Press, Cambridge, MA.
Funding
Open Access funding enabled and organized by Projekt DEAL. This research has been partially funded by Deutsche Forschungsgemeinschaft (DFG) under Collaborative Research Centre SFB1294 (SFB-1294/1 - 318763901) and The Berlin Mathematics Research Center MATH+ (EXC-2046/1 - 390685689).
Author information
Authors and Affiliations
Contributions
All authors listed, have made substantial, direct and intellectual contribution to the work, and approved it for publication.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Ethics approval
Not applicable.
Additional information
Editor: Steve Hanneke.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Proofs of Sect. 2
Proof (Proof of Proposition 1)
The first assertions are a consequence of Heinz Inequality (9) with \(a:= 1/q < 1\). For the last one, we argue as follows. Since \(\varrho ^{2}\) is assumed to be sub-linear, we find that
which completes the proof. \(\square\)
For proving Proposition 2 we start with the following technical result.
Lemma 1
Suppose that the function \(\varrho\) from the link condition, Assumption 4 is such that the function \(t\mapsto \left( \varrho ^{2q}\right) ^{-1}(t)\) is operator concave, and that there is some \(n\in \mathbb {N}\) for which the function \(t\mapsto \varrho ^{-1}(t)/t^n\) is concave. Under Assumption 4 we have that
Proof
The proof is based on two consequences of Assumption 4, which, in terms of the partial ordering for self-adjoint operators in Hilbert space can be restated as
Since the operator concave function \(t\mapsto \left( \varrho ^{2q}\right) ^{-1}(t)\) respects the partial ordering, we obtain (see Footnote 2) that
Letting \(u:= Lv\in \mathcal {H}\), and since by construction \(T_\nu = L^{-1}L_\nu L^{-1}\) we deduce that
The sub-linearity of \(\varrho ^{2}\) implies that the function \(t\mapsto \varrho ^{-1}(t)/t^{2}\) is non-decreasing, such that the operator \(\varrho ^{-1}(\beta L^{-1})L^2\) is bounded, and hence the above inequality extends to \(v\in \mathcal {H}\). Next we apply the Weyl Monotonicity Theorem (Bhatia, 1997, Cor. III.2.3) to see that
Applying this theorem to the first inequality in Proposition 1 we also find that
To proceed we shall use the sub-linearity of the function \(\varrho ^2\), and the concavity of the function \(\varsigma (t):=\varrho ^{-1}(t)/t^n\). This yields that \(\varsigma (\beta t)\le \beta \varsigma (t),\ \beta \ge 1\) and overall, we find that
This, together with the inequalities (28) gives
and the proof is complete. \(\square\)
Proof (Proof of Proposition 2)
Since the function \(t\mapsto t/\varrho ^{2}(t)\) is assumed to be an index function, we find from Lemma 1 that the assertion
holds true. This yields
As a consequence of (Lin et al., 2015, Prop. 6) there is \(\widetilde{C}\) such that
This, together with (29), implies that
Since the function \(\lambda \mapsto \lambda \mathcal {N}_{L_\nu }(\lambda )\) is non-decreasing, we continue to bound
which completes the proof. \(\square\)
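To give a feeling for the quantities appearing above (a standard example, not used in the proof, and stated with the usual definition of the effective dimension): if the eigenvalues \(\sigma _j\) of \(T_\nu\) decay polynomially, say \(\sigma _j\asymp j^{-b}\) for some \(b>1\), then
$$\begin{aligned} \mathcal {N}_{T_\nu }(\lambda )=\textrm{trace}\left( (T_\nu +\lambda I)^{-1}T_\nu \right) =\sum _{j\ge 1}\frac{\sigma _j}{\sigma _j+\lambda }\asymp \lambda ^{-1/b},\qquad \lambda \searrow 0, \end{aligned}$$
and Proposition 2 relates the effective dimension of \(L_\nu\) at the shifted argument \(\lambda /\varrho ^{2}(\lambda )\) to this quantity.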
Proof (Proof of Proposition 3)
The first assertion is a restatement of (Mathé and Pereverzev, 2003, Proposition 3). For the second assertion, we use that \((\lambda + \sigma )^{p} \le 2^{p-1} (\lambda ^{p} + \sigma ^{p})\) for \(p\ge 1\), which follows from the convexity of \(t\mapsto t^{p}\). This yields
which implies the second assertion and completes the proof. \(\square\)
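For completeness, the elementary inequality used above can be obtained as follows, assuming \(p\ge 1\):
$$\begin{aligned} \left( \frac{\lambda +\sigma }{2}\right) ^{p}\le \frac{\lambda ^{p}+\sigma ^{p}}{2}\qquad \Longrightarrow \qquad (\lambda +\sigma )^{p}\le 2^{p-1}\left( \lambda ^{p}+\sigma ^{p}\right) ,\qquad \lambda ,\sigma \ge 0, \end{aligned}$$
where the first step is Jensen's inequality for the convex function \(t\mapsto t^{p}\).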
B Proofs of Sect. 3
Proof (Proof of Theorem 1)
For the minimizer \(f_\rho ^R\) of the distance function defined in (18), the error can be expressed as follows:
By using Proposition 1 the error for the regularized solution can be bounded as
We shall bound each summand on the right in (31).
- \(I_{1}\)::
-
By Lemma 2 we find that
$$\begin{aligned} \left\| {L^{-1}r_\lambda (T_\textbf{x}) L}\right\| _{\mathcal {L}(\mathcal {H})} \le 1+(B + D) \left( \varXi ^\varrho \varXi ^\upsilon + \varXi \varrho (\lambda )(\varrho (\lambda )+1)\frac{\varLambda }{\sqrt{\lambda }}\right) \end{aligned}$$
with \(\varXi ^{\varrho }\), \(\varLambda\) as in (14), (15) and \(\upsilon (t):= t/\varrho (t),\ t>0\). From the estimates of Propositions 4, 5 we get with confidence \(1-\eta /2\) that
$$\begin{aligned} \left\| {L^{-1} r_\lambda (T_\textbf{x})L}\right\| _{\mathcal {L}(\mathcal {H})} \le&1+(B+D)\Big \{(2\kappa +1)^8+2(2\kappa +1)^4(\varrho (\lambda )+1)\nonumber \\&\times \left( \frac{\tilde{\kappa }\varrho (\lambda )}{m\lambda }+\sqrt{\frac{\tilde{\kappa }\varrho ^2(\lambda )\mathcal {N}_{L_\nu }(\lambda )}{m\lambda }}\right) \Big \} \log ^4\left( \frac{4}{\eta }\right) , \end{aligned}$$
(33)
For \(\vartheta (\lambda ):=\frac{\lambda }{\varrho ^2(\lambda )}\), by using that \(\lambda \mapsto \lambda \mathcal {N}_{L_\nu }(\lambda )\) is an increasing function, and \(\lambda \le \vartheta (\lambda )\), for \(\lambda\) small enough, we get
$$\begin{aligned} \lambda \mathcal {N}_{L_\nu }(\lambda ) \le \vartheta (\lambda ) \mathcal {N}_{L_\nu }\left( \vartheta (\lambda )\right) . \end{aligned}$$
This together with Proposition 2 implies that
$$\begin{aligned} \varrho ^2(\lambda )\mathcal {N}_{L_\nu }(\lambda ) \le \mathcal {N}_{L_\nu }\left( \frac{\lambda }{\varrho ^2(\lambda )}\right) \le 2\beta ^{n+1}\widetilde{C}\mathcal {N}_{T_\nu }(\lambda ). \end{aligned}$$
(34)
Under the condition (12) from the estimates (13), (33), (34) we get with confidence \(1-\eta /2\):
$$\begin{aligned} \left\| {L^{-1} r_\lambda (T_\textbf{x})L}\right\| _{\mathcal {L}(\mathcal {H})} \le&1+(B+D)\beta ^{n+1}\widetilde{C}C_{\kappa ,\tilde{\kappa }}\log ^4\left( \frac{4}{\eta }\right) , \end{aligned}$$
(35)
where \(C_{\kappa ,\tilde{\kappa }}\) depends on \(\kappa ,\tilde{\kappa }\).
- \(I_{2}\)::
-
By construction of \(f_\rho ^R\) we have that \(f_\rho ^R= L^{-1}v\) with \(\left\| {v}\right\| _{\mathcal {H}}\le R\). Using the fact that the qualification \(p\) covers \(\varrho\), we bound
$$\begin{aligned} \left\| {\varrho (T_\nu ) r_\lambda (T_\textbf{x})Lf_\rho ^R}\right\| _{\mathcal {H}}&\le R \varXi ^{\varrho } \left\| {\varrho (T_\textbf{x}+\lambda I) r_\lambda (T_\textbf{x})}\right\| _{\mathcal {L}(\mathcal {H})} \le 2R \varXi ^{\varrho } \varrho (\lambda ). \end{aligned}$$
(36)
- \(I_{3}\)::
-
For the last summand we argue
$$\begin{aligned}&\left\| {\varrho (T_\nu ) g_\lambda (T_\textbf{x})B_\textbf{x}^*(S_\textbf{x}Af_\rho -\textbf{y})}\right\| _{\mathcal {H}} \nonumber \\&\quad \le \varXi ^{\frac{1}{2}}\varXi ^{\varrho } \varPsi \left\| { g_\lambda (T_\textbf{x})\varrho (T_\textbf{x}+\lambda I)(T_\textbf{x}+\lambda I)^{\frac{1}{2}}}\right\| _{\mathcal {L}(\mathcal {H})} \nonumber \\&\quad \le \varXi ^{\frac{1}{2}}\varXi ^{\varrho }\varPsi \sup \limits _{t\in [0,\kappa ^2]}\varrho (t+\lambda )(t+\lambda )^{\frac{1}{2}} \left|g_\lambda (t) \right|\nonumber \\&\quad \le \varXi ^{\frac{1}{2}}\varXi ^{\varrho } \varPsi \left( \sup \limits _{t\in [0,\kappa ^2]}\varrho (t+\lambda )(t+\lambda )^{-\frac{1}{2}}\right) \left\{ \lambda \sup \limits _{t\in [0,\kappa ^2]}\left|g_\lambda (t) \right|+\sup \limits _{t\in [0,\kappa ^2]}\left|t g_\lambda (t) \right|\right\} \nonumber \\&\quad \le \varXi ^{\frac{1}{2}}\varXi ^{\varrho } \varPsi \left\{ B+D\right\} \varrho (\lambda )\lambda ^{-\frac{1}{2}}, \end{aligned}$$
(37)
where \(\varXi ^{1/2}\) and \(\varPsi\) are as in (14) and (17).
Summarizing, using the estimates of Propositions 4, 5, and (35)–(37), we get with confidence \(1-\eta\) that
For any parameter choice \(\lambda\) satisfying the condition (12), using the inequality (13) we get that
and
This implies
provided that \(R \ge \varSigma +\kappa M/\mathcal {N}_{T_\nu }(1)\). Inserting the bound from inequality (39) into the estimate (38) completes the proof. \(\square\)
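As a concrete illustration of the residual calculus behind the bound (36) (a sketch, assuming Tikhonov regularization with \(r_\lambda (t)=\lambda /(t+\lambda )\) and a power-type \(\varrho (t)=t^{\mu }\), \(\mu \le 1\), neither of which is fixed in the text):
$$\begin{aligned} \varrho (t+\lambda )\left|r_\lambda (t) \right|=(t+\lambda )^{\mu }\,\frac{\lambda }{t+\lambda }=\lambda (t+\lambda )^{\mu -1}\le \lambda ^{\mu }=\varrho (\lambda ),\qquad t\ge 0, \end{aligned}$$
so that \(\sup _{t\in [0,\kappa ^2]}\varrho (t+\lambda )\left|r_\lambda (t) \right|\le 2\varrho (\lambda )\) holds with room to spare in this special case.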
Proof (Proof of Theorem 2)
For the minimizer \(f_\rho ^R\) of the distance function defined in (20), the error can be expressed as follows:
First, we estimate the error in the interpolation norm for some index function \(\zeta\):
- \(I_1\)::
-
We bound
$$\begin{aligned}&\left\| {\zeta (T_\nu ) r_\lambda (T_\textbf{x})}\right\| _{\mathcal {L}(\mathcal {H})} \le \left\| {\zeta (T_\nu +\lambda I) r_\lambda (T_\textbf{x})}\right\| _{\mathcal {L}(\mathcal {H})} \nonumber \\&\quad \le \varXi ^\zeta \left\| {\zeta (T_\textbf{x}+\lambda I) r_\lambda (T_\textbf{x})}\right\| _{\mathcal {L}(\mathcal {H})} \le \varXi ^\zeta c_p\zeta (\lambda ). \end{aligned}$$
(41)
- \(I_{2}\)::
-
For the minimizer \(f_\rho ^R= L^{-q}g\) of the distance function (20), we observe from Proposition 1 that there is \(v\in \mathcal {H}\) such that \(Lf_\rho ^R=L^{-(q-1)}g = \varphi (T_\nu )v\), \(\left\| {v}\right\| _{\mathcal {H}}\le R\). Thus, assuming the factorization \(\varphi = \varphi _{1} \varphi _{2}\) (with \(\varphi _{1}\) being sub-linear and \(\varphi _{2}\) Lipschitz with constant one), we continue bounding
$$\begin{aligned}&r_\lambda (T_\textbf{x})Lf_\rho ^R=r_\lambda (T_\textbf{x})\varphi (T_\nu )v \nonumber \\&\quad = r_\lambda (T_\textbf{x})\varphi _2(T_\textbf{x})\varphi _1(T_\nu )v+r_\lambda (T_\textbf{x})(\varphi _2(T_\nu )-\varphi _2(T_\textbf{x}))\varphi _1(T_\nu )v. \end{aligned}$$
Then we get
$$\begin{aligned}&\left\| {\zeta (T_\nu )r_\lambda (T_\textbf{x})Lf_\rho ^R}\right\| _{\mathcal {H}}=\left\| {\zeta (T_\nu )r_\lambda (T_\textbf{x})\varphi (T_\nu )v}\right\| _{\mathcal {H}}\nonumber \\&\quad \le \varXi ^\zeta \Big \{\left\| {\zeta (T_\textbf{x}+\lambda I)r_\lambda (T_\textbf{x})\varphi _2(T_\textbf{x})\varphi _1(T_\nu )v}\right\| _{\mathcal {H}}\nonumber \\&\qquad +\left\| {\zeta (T_\textbf{x}+\lambda I)r_\lambda (T_\textbf{x})(\varphi _2(T_\nu )-\varphi _2(T_\textbf{x}))\varphi _1(T_\nu )v}\right\| _{\mathcal {H}}\Big \}\nonumber \\&\quad \le R \varXi ^\zeta \left\{ \left\| {\zeta (T_\textbf{x}+\lambda I)r_\lambda (T_\textbf{x})\varphi _2(T_\textbf{x})\varphi _1(T_\textbf{x}+\lambda I)}\right\| _{\mathcal {L}(\mathcal {H})}\right. \nonumber \\&\qquad \times \left\| {\left( \frac{1}{\varphi _1}\right) (T_\textbf{x}+\lambda I)\varphi _1(T_\nu +\lambda I)}\right\| _{\mathcal {L}(\mathcal {H})} \nonumber \\&\qquad \left. +\varphi _1(\kappa ^2)\left\| {\zeta (T_\textbf{x}+\lambda I)r_\lambda (T_\textbf{x})}\right\| _{\mathcal {L}(\mathcal {H})}\left\| {T_\nu -T_\textbf{x}}\right\| _{\mathcal {L}(\mathcal {H})} \right\} \nonumber \\&\quad \le R\varXi ^\zeta \left\{ \varXi ^{\varphi _1}\sup \limits _{t\in [0,\kappa ^2]}\left\{ \left|r_\lambda (t) \right|\varphi _2(t)\zeta (t+\lambda )\varphi _1(t+\lambda )\right\} \right. \nonumber \\&\qquad \left. +\varphi _1(\kappa ^2)\left\| {T_\nu -T_\textbf{x}}\right\| _{\mathcal {L}(\mathcal {H})}\sup \limits _{t\in [0,\kappa ^2]}\left\{ \left|r_\lambda (t) \right|\zeta (t+\lambda )\right\} \right\} \nonumber \\&\quad \le R 2^q c_p\zeta (\lambda )\varXi ^\zeta \left\{ \varXi ^{\varphi _1}\varphi (\lambda )+\varphi _1(\kappa ^2)\left\| {T_\nu -T_\textbf{x}}\right\| _{\mathcal {L}(\mathcal {H})}\right\} , \end{aligned}$$
(42)
because of the qualification of the regularization.
- \(I_{3}\)::
-
From the arguments used in (37), we get
$$\begin{aligned} \left\| {\zeta (T_\nu ) g_\lambda (T_\textbf{x})B_\textbf{x}^*(S_\textbf{x}A f_\rho -\textbf{y})}\right\| _{\mathcal {H}} \le \varXi ^{\frac{1}{2}}\varXi ^{\zeta } \varPsi \left\{ B+D\right\} \zeta (\lambda )\lambda ^{-\frac{1}{2}}. \end{aligned}$$(43)
Overall, using Propositions 4–5 and (41)–(43) in (40) we obtain with confidence \(1-\eta\) that
The fact that \(\mathcal {N}_{T_\nu }(\lambda )\) is a decreasing function of \(\lambda\), together with the inequality (12), implies that
This, together with (44) yields the first result.
For the last two estimates in Theorem 2, by using Proposition 1 we get
and
These two upper bounds can now be estimated from the general bound by letting \(\zeta :=\varrho\) and \(\zeta (t):=t^\frac{1}{2}\), respectively. We also use that \(\varrho ^2\) is sub-linear, and this completes the proof. \(\square\)
C Probabilistic bounds
In the following we present standard perturbation inequalities from learning theory, which measure the effect of random sampling in a probabilistic sense. The two propositions below can be proved using the arguments given in Step 2.1 of (Caponnetto and De Vito, 2007, Thm. 4).
Proposition 4
Suppose Assumptions 1–3 hold true. Then for \(m \in \mathbb {N}\) and \(0<\eta <1\), each of the following estimates holds with confidence \(1-\eta\),
and
In the following proposition, the probabilistic estimate of the first term can be established under the condition (12) on the regularization parameter \(\lambda\), and the sample size m. The last two estimates are obtained by using (Blanchard, 2019, Prop. A.2).
Proposition 5
Suppose Assumption 3 and the condition (12) hold true. Let \(\zeta : \mathbb {R}^+ \rightarrow \mathbb {R}^+\) be a nondecreasing and sub-linear function. Then for \(m \in \mathbb {N}\) and \(0<\eta <1\), each of the following estimates holds with confidence \(1-\eta\),
for \(0\le s \le 1\) and
Lemma 2
Suppose Assumption 4 holds true. Let \(g_\lambda\) be any regularization with residual function \(r_\lambda\). Then for \(\upsilon (t)=t/\varrho (t)\), we have that
Proof
For \(L_\textbf{x}=A^*S_\textbf{x}^*S_\textbf{x}A\) and \(L_\nu =A^*I_\nu ^*I_\nu A\), and using that \(T_\textbf{x}- T_\nu = L^{-1} \left( L_\textbf{x}- L_\nu \right) L^{-1}\), the proof is based on the following decomposition
and this yields the estimate
We observe that \(\left\| {g_\lambda (T_\textbf{x})(T_\textbf{x}+\lambda I)}\right\| _{\mathcal {L}(\mathcal {H})} \le B+D\). For the function \(\upsilon (t)=t/\varrho (t)\), we can bound \(I_{1}\) as
It remains to bound the second and third factors. From Proposition 1 we find that
Again, under Assumption 4 we find that
which finally yields that \(I_{1} \le \varXi ^\upsilon \varXi ^\varrho (B+D)\).
The terms \(I_2\), \(I_3\) can be bounded as
and
This completes the proof. \(\square\)
Keywords
- Statistical inverse problem
- Spectral regularization
- Hilbert scales
- Reproducing kernel Hilbert space
- Minimax convergence rates