1 Introduction

We consider learning in linear inverse problems in Hilbert space. Within the classical framework of supervised learning, we are given data \(\left\{ (x_i,y_i)\right\} _{i=1}^m\) which follow the model

$$\begin{aligned} y_{i} = g(x_{i}) + \varepsilon _{i},\quad i=1,\ldots ,m, \end{aligned}$$
(1)

where \(\varepsilon _i\) is the observational noise, and m denotes the sample size. The function g is unknown, belonging to some reproducing kernel Hilbert space, say \(\mathcal {H}^{\prime }\). The goal is to learn it from the given data. To be more precise, we assume that the random observations \(\left\{ (x_i,y_i)\right\} _{i=1}^m\) are independent and follow some unknown probability distribution \(\rho\), defined on the sample space \(Z=X\times Y\). Further, we assume that the input space X is a Polish space, and that the output space Y is a real separable Hilbert space.

In inverse learning, the function g from (1) is driven by some element f in a Hilbert space \(\mathcal {H}\) via a mapping \(A:\mathcal {H}\rightarrow \mathcal {H}^{\prime }\) as

$$\begin{aligned} A(f) = g,\qquad \text {for}\quad f\in \mathcal {H}\quad \text {and} \quad g\in \mathcal {H}'. \end{aligned}$$
(2)

In the present study, this mapping is assumed to be a bounded linear (smoothing) mapping, and it is also assumed to be injective, so that the correspondence between g and f is unique. The unique solution of (2) is denoted by \(f_\rho\). The literature for this setup is scarce; we mention (Blanchard & Mücke, 2018), and the related study Rastogi et al. (2020), in which the underlying mapping A is assumed to be non-linear.

Often the sought-for element \(f_\rho\) is known to have additional features, e.g. smoothness, which the standard approaches for reconstructing an approximation of \(f_\rho\) do not take into account. Therefore, we shall analyze such inverse learning problems in scales of Hilbert spaces. This topic has a long history within the classical setup of regularization theory, starting from (Natterer, 1984), see also the monograph (Engl et al., 1996, Chapt. 8). In most cases, the scale of Hilbert spaces is assumed to be a scale of Sobolev spaces, and the smoothing properties of the underlying operator A are measured with respect to this scale. This allows for a mathematical analysis even if the singular value decomposition of A cannot be used to design a regularization scheme. Also, solution smoothness, i.e., the smoothness of \(f_\rho\), is described by assuming that it belongs to some space within this scale. Recently, regularization in Hilbert scales has gained interest in statistical inverse problems, especially for the Bayesian approach to such problems, where we mention the studies (Gugushvili et al., 2020), and more recently (Agapiou & Mathé, 2022). To the best of our knowledge, inverse learning problems in scales have not yet been studied.

Here we highlight the following prototypical example.

Example

Let \(A:\mathscr {L}^2_0(0,1) \rightarrow \mathcal {H}^1_0(0,1)\) be the integration operator

$$\begin{aligned} (Af)(x) := \int _0^x f(t)\,dt,\quad x\in (0,1), \end{aligned}$$
(3)

where \(\mathcal {H}^1_0(0,1)\) denotes the Sobolev space of absolutely continuous functions g with \(g(0) = g(1)=0\), and \(\mathscr {L}^2_0(0,1)\) consists of elements which integrate to zero, i.e., \((Af)(1)=0\). Thus, we are looking for the derivative of a given function, one of the most classical inverse problems. In the above formulation, the operator A is injective. Moreover, it is known that the Sobolev space \(\mathcal {H}':= \mathcal {H}^1_0(0,1)\) is a reproducing kernel Hilbert space. Details are given in Blanchard and Mücke (2018). Therefore, a suitable scale of Hilbert spaces is the class of Sobolev spaces \(\mathcal {H}^s_0(0,1),\ s \in [0,p]\) for some \(p\ge 1\). For such an analysis to work we assume that the given operator A ‘fits the scale’, which will be expressed in terms of a link condition. For the above example, the operator A has step one, meaning that elements from \(\mathscr {L}^2_0(0,1)\) are mapped to \(\mathcal {H}^1_0(0,1)\). Moreover, in this context, smoothness is also given relative to this scale, as e.g., \(f_\rho \in \mathcal {H}^s_0(0,1)\) for some \(0 < s \le p\). This is significantly different from other works, where smoothness is relative to the underlying covariance operator, and hence cannot be verified.
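For illustration, the following minimal numerical sketch (in Python, with an arbitrary grid size and test function, not part of the analysis) discretizes the integration operator (3) as a cumulative sum and shows its smoothing effect: a rough mean-zero input is mapped to a much smoother output with (nearly) vanishing boundary values.

```python
# Illustrative sketch: discretize (Af)(x) = int_0^x f(t) dt on a uniform grid.
# Grid size and the test function are arbitrary choices for demonstration only.
import numpy as np

n = 200
h = 1.0 / n
t = (np.arange(n) + 0.5) * h                      # midpoints of the grid cells

# a rough, mean-zero function f in L^2_0(0,1)
f = np.sign(np.sin(6 * np.pi * t))
f = f - np.mean(f)

# A realised as a lower-triangular cumulative-sum matrix
A = h * np.tril(np.ones((n, n)))
g = A @ f

print("boundary values of Af:", g[0], g[-1])      # both close to 0, since f has mean zero
print("total variation of f :", np.sum(np.abs(np.diff(f))))
print("total variation of Af:", np.sum(np.abs(np.diff(g))))  # much smaller: A smooths
```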

Further examples of Hilbert scales relevant for learning in reproducing kernel Hilbert spaces can be found in Mücke and Reiss (2020).

More generally, in the present study we shall assume that there is an unbounded self-adjoint operator \(L:\mathcal {D}(L) \subset \mathcal {H}\rightarrow \mathcal {H}\), which generates a scale of Hilbert spaces \(\mathcal {H}_{s}:= \mathcal {D}(L^{s}),\ s\ge 0\). Both the operator equation (2) and the solution smoothness are assumed to fit this scale via assumptions made below.

We highlight one specific means of reconstruction, often called penalized least squares. In this standard approach the estimator \(f_{\textbf{z},\lambda }\) is the minimizer of

$$\begin{aligned} \frac{1}{m}\sum _{i=1}^m\left\| { A(f)(x_i)-y_i}\right\| _{Y}^2+\lambda \left\| {f}\right\| _{\mathcal {H}}^2, \end{aligned}$$
(4)

where \(\lambda\) is a positive regularization parameter which balances the error term and the penalty \(\left\| {f}\right\| _{\mathcal {H}}^2\). This penalty controls the norm (in \(\mathcal {H}\)) of the minimizer, but it cannot enforce additional properties. Here, we implement such additional properties by assuming that all considered minimizers \(f_{\textbf{z},\lambda }\) belong to \(\mathcal {D}(L)\). In the analysis of inverse problems this setup has a long history, starting from the above-mentioned study (Natterer, 1984), and it has since been frequently considered both for linear (Böttcher et al., 2006; Mair, 1994; Mathé & Tautenhahn, 2006, 2007; Nair, 1999, 2002; Nair et al., 2005; Neubauer, 1988; Tautenhahn, 1996), and for non-linear mappings A (Hofmann & Mathé, 2018, 2020). The additional information \(f_{\textbf{z},\lambda }\in \mathcal D(L)\) is taken into account by replacing the above minimization problem (4) by

$$\begin{aligned} \frac{1}{m}\sum _{i=1}^m\left\| { A(f)(x_i)-y_i}\right\| _{Y}^2+\lambda \left\| {L f}\right\| _{\mathcal {H}}^2, \end{aligned}$$
(5)

with minimizer \(f_{\textbf{z},\lambda }\in \mathcal {D}(L)\), and we may formally introduce \(u_{\textbf{z},\lambda }:= L f_{\textbf{z},\lambda }\in \mathcal {H}\).

In the regular case, when \(f_\rho \in \mathcal {D}(L)\), we let \(u_\rho := Lf_\rho \in \mathcal {H}\). With this notation we can rewrite (2) as

$$\begin{aligned} g = A f = A L^{-1} u,\quad u\in \mathcal {D}(L). \end{aligned}$$

Then the Tikhonov minimization problem (5) would reduce to the standard one

$$\begin{aligned} \frac{1}{m}\sum \limits _{i=1}^m\left\| { (AL^{-1})(u)(x_i)-y_i}\right\| _{Y}^2+\lambda \left\| {u}\right\| _{\mathcal {H}}^2, \end{aligned}$$
(6)

albeit for a different operator \(A L^{-1}\). Accordingly, the error bounds relate as

$$\begin{aligned} \left\| { f_\rho - f_{\textbf{z},\lambda }}\right\| _{\mathcal {H}} = \left\| {L^{-1}(u_\rho - u_{\textbf{z},\lambda })}\right\| _{\mathcal {H}}. \end{aligned}$$

Therefore, error bounds for \(u_\rho - u_{\textbf{z},\lambda }\) in the weak norm (in \(\mathcal {H}_{-1}\)) yield bounds for \(f_\rho - f_{\textbf{z},\lambda }\). The latter bounds (in the weak norm) are not known from previous studies. In the oversmoothing case, i.e., when \(f_\rho \not \in \mathcal {D}(L)\), such a one-to-one correspondence cannot be established, and additional efforts are required.
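In the regular case this correspondence can be verified on a small discretized model. The sketch below (with purely illustrative matrices: M stands in for the sampled forward operator and L for the scale generator) confirms that minimizing (5) directly and minimizing (6) followed by the back-substitution \(f = L^{-1}u\) produce the same estimator.

```python
# Illustrative check of the substitution u = L f (not from the paper):
# the Hilbert-scale problem (5) and the standard problem (6) for A L^{-1}
# have the same minimiser after mapping back, f = L^{-1} u.
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 80, 40, 1e-3

M = rng.standard_normal((m, n))        # stand-in for the sampled operator S_x A
L = np.diag(np.arange(1, n + 1.0))     # a simple invertible scale generator
y = rng.standard_normal(m)

# (5): minimise (1/m)||M f - y||^2 + lam ||L f||^2  via the normal equations
f_direct = np.linalg.solve(M.T @ M / m + lam * L.T @ L, M.T @ y / m)

# (6): minimise (1/m)||M L^{-1} u - y||^2 + lam ||u||^2, then set f = L^{-1} u
B = M @ np.linalg.inv(L)
u = np.linalg.solve(B.T @ B / m + lam * np.eye(n), B.T @ y / m)
f_via_u = np.linalg.inv(L) @ u

print(np.max(np.abs(f_direct - f_via_u)))   # agrees up to rounding error
```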

For ‘classical’ inverse problems the fundamental features of regularization in Hilbert scales are known. The questions that we address here ask whether these features are retained in inverse learning.

  • Do regularization schemes which are known to provide optimal rates of reconstruction (as the noise level tends to zero) have analogs here with similar results in inverse learning (as the sample size tends to infinity)?

  • Are optimal rates of reconstruction obtained when the true solution does not belong to \(\mathcal {D}(L)\) (oversmoothing case)?

  • Will the use of a smoothness promoting operator \(L^{-1}\) delay saturation?

In order to answer these questions we shall discuss rates of convergence for general (spectral) regularization schemes in Hilbert scales, and under a quite general noise condition, see Assumption 2. As mentioned before, in order to treat regularization in Hilbert scales we shall link the given operator A to the scale, which is done in Assumption 4. Then we pursue a novel approach. Instead of assuming smoothness of the sought-for \(f_\rho\) we shall measure the violation of smoothness relative to a fixed benchmark smoothness, which will be expressed in terms of a distance function, introduced in Definitions 6 and 7, respectively. Later, in Sect. 4 we shall see how smoothness relative to the given Hilbert scale translates into the behavior of the distance function, and hence into the resulting convergence rates.

The paper is organized as follows. The basic definitions, assumptions, and notation required in our analysis are presented in Sect. 2. In Sect. 3, we discuss bounds of the reconstruction error in the direct learning setting and the inverse problem setting by means of distance functions. This section comprises two main results: the first is devoted to convergence rates in the oversmoothing case, while the second focuses on the regular case. When smoothness is specified in terms of source conditions, a program performed in Sect. 4, we can bound the distance functions, which in turn yields convergence rates in terms of the sample size m. In case both the smoothness and the link condition are of power type, we establish the optimality of the obtained error bounds in the regular case in Sect. 5. Proofs are given in the appendices. Also, we recall and prove probabilistic estimates which provide the tools for obtaining the error bounds.

2 Notation and assumptions

In this section, we introduce some basic concepts, definitions, notation, and assumptions required in our analysis.

We assume that X is a Polish space; therefore the probability distribution \(\rho\) allows for a disintegration as

$$\begin{aligned} \rho (x,y)=\rho (y|x)\nu (x), \end{aligned}$$

where \(\rho (y|x)\) is the conditional probability distribution of y given x, and \(\nu (x)\) is the marginal probability distribution. We consider random observations \(\left\{ (x_i,y_i)\right\} _{i=1}^m\) which follow the model \(y= A(f)(x)+\varepsilon\) with centered noise \(\varepsilon\). We assume throughout the paper that the operator A is injective.

Assumption 1

(The true solution) The conditional expectation w.r.t. \(\rho\) of y given x exists (a.s.), and there exists \(f_\rho \in \mathcal {H}~\) such that

$$\begin{aligned} \int _Y y d\rho (y|x)= g_\rho (x) = A(f_\rho )(x), \text { for all } x\in X. \end{aligned}$$

The element \(f_\rho\) is the true solution which we aim at estimating.

Assumption 2

(Noise condition) There exist some constants \(M,\varSigma\) such that for almost all \(x\in X\),

$$\begin{aligned} \int _Y\left( e^{\left\| {y-A(f_\rho )(x)}\right\| _{Y}/M}-\frac{\left\| {y-A(f_\rho )(x)}\right\| _{Y}}{M}-1\right) d\rho (y|x)\le \frac{\varSigma ^2}{2M^2}. \end{aligned}$$

This is usually referred to as a Bernstein-type assumption.

We recall the operator \(L:\mathcal {D}(L) \subset \mathcal {H}\rightarrow \mathcal {H}\), which is assumed to be unbounded and self-adjoint. By spectral theory, the operator \(L^s : \mathcal {D}(L^s) \rightarrow \mathcal {H}\) is well-defined for \(s \in \mathbb {R}\), and the spaces \(\mathcal {H}_s := \mathcal {D}(L^s), s \ge 0\), equipped with the inner product \(\langle { f},{g} \rangle _{\mathcal {H}_s}=\langle { L^s f},{L^s g} \rangle _{\mathcal {H}},\quad f, g \in \mathcal {H}_s\), are Hilbert spaces. For \(s < 0\), the space \(\mathcal {H}_s\) is defined as the completion of \(\mathcal {H}\) under the norm \(\left\| {f}\right\| _{s} := \langle { f},{f} \rangle _{s}^{1/2}\). The collection \(\left\{ \mathcal {H}_s,\ s\in \mathbb {R}\right\}\) of Hilbert spaces is called the Hilbert scale induced by L. The following interpolation inequality is an important tool for the analysis:

$$\begin{aligned} \left\| {f}\right\| _{\mathcal {H}_r}\le \left\| {f}\right\| _{\mathcal {H}_t}^{\frac{s-r}{s-t}}\left\| {f}\right\| _{\mathcal {H}_s}^{\frac{r-t}{s-t}},\qquad f\in \mathcal {H}_s, \end{aligned}$$
(7)

which holds for any \(t< r < s\), see e.g. (Engl et al., 1996, Chapt. 8).
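The following small numerical check (with an arbitrary diagonal model for L, purely illustrative) verifies the interpolation inequality (7) for one choice of \(t< r < s\).

```python
# Illustrative check of the interpolation inequality (7) in a diagonal model.
import numpy as np

rng = np.random.default_rng(1)
n = 50
ell = np.arange(1, n + 1.0)            # assumed eigenvalues of L
f = rng.standard_normal(n) / ell**2    # a vector with some decay, so that f lies in H_s

def hs_norm(f, s):
    # ||f||_{H_s} = ||L^s f||_H in the diagonal model
    return np.linalg.norm(ell**s * f)

t, r, s = 0.0, 0.7, 1.5                # any t < r < s
lhs = hs_norm(f, r)
rhs = hs_norm(f, t)**((s - r) / (s - t)) * hs_norm(f, s)**((r - t) / (s - t))
print(lhs <= rhs + 1e-12, lhs, rhs)
```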

2.1 Reproducing Kernel Hilbert spaces and related operators

We start with the concept of reproducing kernel Hilbert spaces, see the seminal study (Aronszajn, 1950); such a space can be characterized by a symmetric, positive semidefinite kernel, and each of its functions satisfies the reproducing property. We consider vector-valued reproducing kernel Hilbert spaces, following (Micchelli & Pontil, 2005), which are the generalization of real-valued reproducing kernel Hilbert spaces.

Definition 1

(Vector-valued reproducing kernel Hilbert space) For a non-empty set X and a real separable Hilbert space \((Y,\langle { \cdot },{\cdot } \rangle _{Y})\), a Hilbert space \(\mathcal {H}\) of functions from X to Y is said to be a vector-valued reproducing kernel Hilbert space if the linear functional \(F_{x,y}:\mathcal {H}\rightarrow \mathbb {R}\), defined by

$$\begin{aligned} F_{x,y}(f)=\langle { y},{f(x)} \rangle _{Y} \qquad \forall f \in \mathcal {H},\end{aligned}$$

is continuous for every \(x \in X\) and \(y\in Y\).

Definition 2

(Operator-valued positive semi-definite kernel) Suppose \(\mathcal {L}(Y)\) is the Banach space of bounded linear operators from Y to Y. A function \(K:X\times X\rightarrow \mathcal {L}(Y)\) is said to be an operator-valued positive semi-definite kernel if

  1. (i)

     \(K(x,x')^*=K(x',x) \qquad \forall ~x,x'\in X.\)

  2. (ii)

     \(\sum \limits _{i,j=1}^N\langle { y_i},{K(x_i,x_j)y_j} \rangle _{Y}\ge 0 \qquad \forall ~\{x_i\}_{i=1}^N\subset X \text { and } \{y_i\}_{i=1}^N\subset Y.\)

For a given operator-valued positive semi-definite kernel \(K:X \times X \rightarrow \mathcal {L}(Y)\), we can construct a unique vector-valued reproducing kernel Hilbert space \((\mathcal {H},\langle { \cdot },{\cdot } \rangle _{\mathcal {H}})\) of functions from X to Y as follows:

  1. (i)

    We define the linear operator

    $$\begin{aligned} K_x: Y \rightarrow \mathcal {H}: y \mapsto K_xy, \end{aligned}$$

    where \(K_xy:X \rightarrow Y:x' \mapsto (K_xy)(x')=K(x',x)y\) for \(x,x'\in X\) and \(y\in Y\).

  2. (ii)

    The span of the set \(\{K_xy:x\in X, y\in Y\}\) is dense in \(\mathcal {H}\).

  3. (iii)

    Reproducing property

    $$\begin{aligned} \langle { f(x)},{y} \rangle _{Y}=\langle { f},{K_xy} \rangle _{\mathcal {H}},\qquad x\in X,~y \in Y,~\forall ~f\in \mathcal {H}, \end{aligned}$$

    in other words, \(f(x) = K_x^* f\).

Moreover, there is a one-to-one correspondence between operator-valued positive semi-definite kernels and vector-valued reproducing kernel Hilbert spaces, see (Micchelli & Pontil, 2005).

Assumption 3

The space \(\mathcal {H}'\) is assumed to be a vector-valued reproducing kernel Hilbert space of functions \(f:X\rightarrow Y\) corresponding to the kernel \(K:X\times X\rightarrow \mathcal {L}(Y)\) such that

  1. (i)

     \(K_x:Y\rightarrow \mathcal {H}'\) is a Hilbert–Schmidt operator for \(x\in X\) with

    $$\begin{aligned} \kappa '^2:=\sup _{x \in X} \left\| {K_x}\right\| _{HS}^2 = {\sup _{x \in X}{\text {tr}}(K_x^*K_x)}<\infty . \end{aligned}$$
  2. (ii)

    For \(y,y'\in Y\), the real-valued function \(\varsigma :X\times X \rightarrow \mathbb {R}:(x,x')\mapsto \langle { K_{x}y},{K_{x'}y'} \rangle _{\mathcal {H}'}\) is measurable.

Example

In case the set Y is a bounded subset of \(\mathbb {R}\), the reproducing kernel Hilbert space becomes a real-valued reproducing kernel Hilbert space. The corresponding kernel is then a symmetric, positive semi-definite function \(K:X \times X \rightarrow \mathbb {R}\) with the reproducing property \(f(x)=\langle { f},{K_x} \rangle _{\mathcal {H}}\). In this case Assumption 3 simplifies to the condition that the kernel is measurable and \(\kappa '^2:=\sup _{x \in X} \left\| {K_x}\right\| _{\mathcal {H}'}^2=\sup _{x \in X}K(x,x)<\infty\).
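As a purely illustrative check for the real-valued case, the snippet below evaluates \(\kappa '^2=\sup _{x}K(x,x)\) for a Gaussian kernel (an assumed example kernel, not one used in the analysis) and confirms the positive semi-definiteness of the associated Gram matrix.

```python
# Illustrative check for a real-valued kernel (assumed Gaussian kernel).
import numpy as np

def K(x, xp, gamma=8.0):
    return np.exp(-gamma * (x - xp) ** 2)

x = np.linspace(0.0, 1.0, 60)
gram = K(x[:, None], x[None, :])
print("kappa'^2 =", K(x, x).max())                                   # = 1 for this kernel
print("smallest Gram eigenvalue:", np.linalg.eigvalsh(gram).min())   # >= -1e-12 (psd)
```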

Now we introduce some relevant operators used in the convergence analysis. We introduce the notation for the vectors \(\textbf{x}=(x_1,\ldots ,x_m)\), \(\textbf{y}=(y_1,\ldots ,y_m)\), \(\textbf{z}=(z_1,\ldots ,z_m)\). The product Hilbert space \(Y^m\) is equipped with the inner product \(\langle { \textbf{y}},{\textbf{y}'} \rangle _{m} = \frac{1}{m}\sum _{i=1}^m \langle { y_i},{y'_i} \rangle _{Y},\) and the corresponding norm \(\left\| {\textbf{y}}\right\| _{m}^2=\frac{1}{m}\sum _{i=1}^m\left\| {y_i}\right\| _{Y}^2\). We define the sampling operator \(S_\textbf{x}:\mathcal {H}' \rightarrow Y^m:g\mapsto (g(x_1),\ldots ,g(x_m))\); then the adjoint \(S_\textbf{x}^*:Y^m\rightarrow \mathcal {H}'\) is given by

$$\begin{aligned} S_\textbf{x}^*\textbf{y}=\frac{1}{m}\sum _{i=1}^m K_{x_i} y_i. \end{aligned}$$

We observe that under Assumption 3 we have

$$\begin{aligned} {}\left\| {f}\right\| _{\mathscr {L}^2(X,\nu ;Y)}^2=\int _X\left\| {f(x)}\right\| _{Y}^2d\nu (x) =\int _X\left\| {K_x^*f}\right\| _{Y}^2d\nu (x) \le \kappa '^2\left\| {f}\right\| _{\mathcal {H}^\prime }^2, \end{aligned}$$

and

$$\begin{aligned} \left\| {S_\textbf{x}f}\right\| _{m}^2 =\frac{1}{m}\sum _{i=1}^m\left\| {f(x_i)}\right\| _{Y}^2 =\frac{1}{m}\sum _{i=1}^m\left\| {K_{x_i}^*f}\right\| _{Y}^2 \le \kappa '^2\left\| {f}\right\| _{\mathcal {H}^\prime }^2. \end{aligned}$$

In particular, the canonical injection map \(I_\nu : \mathcal {H}' \rightarrow \mathscr {L}^2(X,\nu ;Y)\) is norm bounded by \(\kappa ^\prime\), and so is the empirical version \(S_\textbf{x}\).

We denote the population operators \(B_\nu := I_\nu AL^{-1}:\mathcal {H}\rightarrow \mathscr {L}^2(X,\nu ;Y)\), \(T_\nu := B_\nu ^*B_\nu :\mathcal {H}\rightarrow \mathcal {H}\), \(L_\nu := A^*I_\nu ^*I_\nu A:\mathcal {H}\rightarrow \mathcal {H}\), and their empirical versions \(B_\textbf{x}=S_\textbf{x}A L^{-1}:\mathcal {H}\rightarrow Y^m\), \(T_\textbf{x}=B_\textbf{x}^*B_\textbf{x}:\mathcal {H}\rightarrow \mathcal {H}\), \(L_\textbf{x}=A^*S_\textbf{x}^*S_\textbf{x}A:\mathcal {H}\rightarrow \mathcal {H}\). The operators \(T_\nu\), \(T_\textbf{x}\), \(L_\nu\), \(L_\textbf{x}\) are positive, self-adjoint and depend on the kernel. Under Assumption 3, the operators \(B_\textbf{x}\), \(B_\nu\) are bounded by \(\kappa :=\kappa ' \left\| {AL^{-1}}\right\| _{\mathcal {H}\rightarrow \mathcal {H}'}\) and the operators \(L_\textbf{x}\), \(L_\nu\) are bounded by \(\tilde{\kappa }^2\) for \(\tilde{\kappa }:=\kappa ' \left\| {A}\right\| _{\mathcal {H}\rightarrow \mathcal {H}'}\), i.e., \(\left\| {B_\textbf{x}}\right\| _{\mathcal {H}\rightarrow Y^m}\le \kappa\), \(\left\| {B_\nu }\right\| _{\mathcal {H}\rightarrow \mathscr {L}^2(X,\nu ;Y)}\le \kappa\), \(\left\| {L_\textbf{x}}\right\| _{\mathcal {L}(\mathcal {H})}\le \tilde{\kappa }^2\) and \(\left\| {L_\nu }\right\| _{\mathcal {L}(\mathcal {H})}\le \tilde{\kappa }^2\).
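A discretized sketch may help to see these operators at work. In the snippet below (all matrices are illustrative stand-ins: A is the discretized integration operator, L a diagonal scale generator, and S the point-evaluation map at the samples), the empirical covariance \(T_\textbf{x}=B_\textbf{x}^*B_\textbf{x}\) is formed with the factor 1/m stemming from the weighted inner product on \(Y^m\), and its nonzero spectrum coincides with that of \(B_\textbf{x}B_\textbf{x}^*\).

```python
# Illustrative discretized model of the empirical operators (not from the paper).
import numpy as np

rng = np.random.default_rng(2)
n, m = 120, 30
h = 1.0 / n
A = h * np.tril(np.ones((n, n)))                 # discretized integration operator
L = np.diag(np.arange(1, n + 1.0))               # illustrative scale generator
idx = rng.choice(n, size=m, replace=False)
S = np.zeros((m, n)); S[np.arange(m), idx] = 1.0 # point evaluations at the samples

B = S @ A @ np.linalg.inv(L)                     # B_x as an m x n matrix
T_x = B.T @ B / m                                # T_x = B_x^* B_x (1/m from <.,.>_m)
G   = B @ B.T / m                                # 'Gram side' on Y^m

ev_T = np.sort(np.linalg.eigvalsh(T_x))[::-1][:m]
ev_G = np.sort(np.linalg.eigvalsh(G))[::-1]
print(np.max(np.abs(ev_T - ev_G)))               # nonzero eigenvalues coincide
```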

2.2 Link condition

The subsequent analysis will frequently use the notion of an index function.

Definition 3

(Index function) A function \(\varphi : \mathbb {R}^+ \rightarrow \mathbb {R}^+\) is said to be an index function if it is continuous and strictly increasing with \(\varphi (0) = 0\).

An index function is called sub-linear whenever the mapping \(t\mapsto t/\varphi (t),\ t>0,\) is nondecreasing. Further, we require this index function to belong to the following class of functions.

$$\begin{aligned} \mathcal {F}&=\{\varphi =\varphi _1\varphi _2:\varphi _1,\varphi _2:[0,\kappa ^2]\rightarrow [0,\infty ),\varphi _1~\text {nondecreasing continuous sub-linear},\nonumber \\&\quad \varphi _2~\text { nondecreasing Lipschitz},~\varphi _1(0)=\varphi _2(0)=0\}. \end{aligned}$$
(8)

The representation \(\varphi =\varphi _1\varphi _2\) is not unique; therefore \(\varphi _2\) can be assumed to be a Lipschitz function with Lipschitz constant 1. We shall also rely upon the following important result for such Lipschitz continuous index functions \(\varphi _2\), needed in our analysis (Peller, 2016, Corollary 1.2.2):

$$\begin{aligned} \left\| {\varphi _2(T_\textbf{x})-\varphi _2(T_\nu )}\right\| _{HS}\le \left\| {T_\textbf{x}-T_\nu }\right\| _{HS}. \end{aligned}$$

Example

Power-type functions \(\varphi (t)=t^r\) with \(r>0\), and logarithmic functions \(\varphi (t)=t^p\log ^{-\nu }\left( \frac{1}{t}\right) ,\ p,\nu \ge 0\), are examples of functions in the class \(\mathcal {F}\).

The following assumption is used to relate smoothness in terms of the operator L to the covariance operator \(T_\nu\).

Assumption 4

(link condition) There exist a power \(q > 1\) and an index function \(\varrho\), for which the function \(\varrho ^{2}\) is sub-linear. There is a constant \(1 \le \beta <\infty\) such that

$$\begin{aligned} \left\| {L^{-q}u}\right\| _{\mathcal {H}}\le \left\| {\varrho ^{q}(T_\nu )u}\right\| _{\mathcal {H}} \le \beta ^{q}\left\| {L^{-q}u}\right\| _{\mathcal {H}},\quad u\in \mathcal {H}. \end{aligned}$$

The function \(t\mapsto \varphi (t):=\varrho ^{q-1}(t)\) belongs to the class \(\mathcal {F}\).

Only the left inequality will be used for the regular case. For the oversmoothing case, where we need to relate the effective dimensions, we require the right inequality as well. Also, both inequalities are used to show the optimality of the rates.

As shown in Böttcher et al. (2006), Assumption 4 implies the range identity \(\mathcal {R}(L^{-q}) = \mathcal {R}(\varrho ^{q}(T_\nu ))\). In the context of a comparison of operators we mention the well-known Heinz Inequality, see (Engl et al., 1996, Prop. 8.21). It asserts that for every exponent \(0< a \le 1\) it holds that

$$\begin{aligned} \left\| {Gu}\right\| _{\mathcal {H}}\le \left\| {Hu}\right\| _{\mathcal {H}},\ u\in \mathcal {H}\quad \text {yields}\quad \left\| {G^{a}u}\right\| _{\mathcal {H}}\le \left\| {H^{a}u}\right\| _{\mathcal {H}},\ u\in \mathcal {H}. \end{aligned}$$
(9)

Applying this to the above link condition we obtain the following:

Proposition 1

Under Assumption 4 we have

$$\begin{aligned} \left\| {L^{-1}u}\right\| _{\mathcal {H}}\le \left\| {\varrho (T_\nu )u}\right\| _{\mathcal {H}} \le \beta \left\| {L^{-1}u}\right\| _{\mathcal {H}},\quad u\in \mathcal {H}\end{aligned}$$

and

$$\begin{aligned} \left\| {L^{-(q-1)}u}\right\| _{\mathcal {H}}\le \left\| {\varrho ^{q-1}(T_\nu )u}\right\| _{\mathcal {H}} \le \beta ^{(q-1)}\left\| {L^{-(q-1)}u}\right\| _{\mathcal {H}},\quad u\in \mathcal {H}. \end{aligned}$$

Moreover, we have that

$$\begin{aligned} \left\| {\varrho (T_\nu ) \left( T_\nu + \lambda I\right) ^{-1/2}}\right\| _{\mathcal {L}(\mathcal {H})}\le \frac{\varrho (\lambda )}{\sqrt{\lambda }},\quad 0 < \lambda \le 1. \end{aligned}$$
(10)

Remark 1

It is heuristically clear that the function \(\varrho ^{2}\) cannot increase faster than linearly, because the operator \(T_\nu = L^{-1} L_\nu L^{-1}\) has \(L^{-2}\) in it. Therefore, requiring sub-linearity is not a strong restriction. More details will be given in Sect. 5.

Link conditions as in Assumption 4 imply decay rates for the singular numbers of the operators involved, by Weyl’s Monotonicity Theorem (Bhatia, 1997, Cor. III.2.3). In our case, this yields that \(s_{j}(\varrho (T_\nu )) = \varrho (s_{j}(T_\nu ))\asymp s_{j}(L^{-1})\). For classical spaces, e.g. Sobolev spaces, when \(L:= (I + \varDelta )^{1/2}\), then \(s_{j}(L^{-1}) \asymp 1/j\) (one spatial dimension). For the above index function \(\varrho\) this means that \(s_{j}(T_\nu ) \asymp \varrho ^{-1}(1/j)\).

Example

(Finitely smoothing covariance operators) In case the function \(\varrho\), and hence its inverse, is of power type, this implies a power-type decay of the singular numbers of \(T_\nu\). In this case, the operator \(T_\nu\) is called finitely smoothing.

Example

(Infinitely smoothing covariance operators) If, on the other hand, the function \(\varrho\) is logarithmic, as e.g., \(\varrho (t)=\left( \log \frac{1}{t}\right) ^{-\frac{1}{\mu }}\), then \(s_{j}(T_\nu ) \asymp e^{-j^{\mu }}\). In this case, the operator \(T_\nu\) is called infinitely smoothing.

2.3 Effective dimension

The concept of the effective dimension, as introduced in Zhang (2002), proved to be important for deriving fast rates of convergence under Hölder’s source condition, see (Blanchard & Mücke, 2018; Caponnetto & De Vito, 2007; Guo et al., 2017), and also for general source conditions, see (Lin et al., 2020; Shuai et al., 2020; Rastogi & Sampath, 2017). For the trace–class operator \(T_\nu\) its effective dimension is defined as,

$$\begin{aligned} \mathcal {N}_{T_\nu }(\lambda ):={\text {Tr}}\left( (T_\nu +\lambda I)^{-1}T_\nu \right) , \text { for }\lambda >0.\end{aligned}$$

It is known that, for an infinite-dimensional operator \(T_\nu\), the function \(\lambda \mapsto \mathcal {N}_{T_\nu }(\lambda )\) is continuous and decreasing from \(\infty\) to zero for \(0< \lambda < \infty\) (see for details Blanchard and Mathé, 2012; Blanchard and Mücke, 2020; Lin et al., 2015; Shuai et al., 2020; Zhang, 2002). However, we shall use, and this follows from spectral calculus, that the function \(\lambda \mapsto \lambda \mathcal {N}_{T_\nu }(\lambda )\) is increasing.

We have the trivial bound

$$\begin{aligned} \mathcal {N}_{T_\nu }(\lambda )\le \left\| {(T_\nu +\lambda I)^{-1}}\right\| _{\mathcal {L}(\mathcal {H})}{\text {Tr}}\left( T_\nu \right) \le \frac{\kappa ^2}{\lambda }. \end{aligned}$$
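For a diagonal model with assumed eigenvalues of \(T_\nu\), the effective dimension and the above monotonicity properties can be computed directly, as in the following illustrative snippet.

```python
# Illustrative computation of N(lambda) = Tr((T + lambda I)^{-1} T)
# for a diagonal covariance operator with assumed eigenvalues.
import numpy as np

s = 1.0 / np.arange(1, 2001, dtype=float) ** 2       # assumed eigenvalues of T_nu
def eff_dim(lam):
    return np.sum(s / (s + lam))

for lam in np.logspace(-6, 0, 7):
    print(f"lam={lam:8.1e}  N={eff_dim(lam):10.3f}  "
          f"trivial bound Tr(T)/lam={s.sum() / lam:12.1f}  "
          f"lam*N={lam * eff_dim(lam):8.5f}")
# N(lam) decreases in lam, lam*N(lam) increases, and N(lam) <= Tr(T)/lam.
```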

In the subsequent analysis, we shall need a relationship between the effective dimensions \(\mathcal {N}_{T_\nu }(\lambda )\) and \(\mathcal {N}_{L_\nu }(\lambda )\) of the operators \(T_\nu\) and \(L_\nu\), respectively. For this, the following assumption, introduced in Lin et al. (2015), will be used. There, it was shown that it is satisfied for both moderately ill-posed and severely ill-posed operators.

Assumption 5

There exists a constant C such that for \(0 < t \le \left\| {L_\nu }\right\| _{\mathcal L(\mathcal {H})}\) we have

$$\begin{aligned} t^{-1}\sum \limits _{s_j(L_\nu )< t}s_j(L_\nu )<C\,\#\left\{ j:\ s_j(L_\nu )\ge t\right\} . \end{aligned}$$

Proposition 2

Suppose Assumptions 4 and 5 hold true. Suppose that the function \(\varrho\) from the link condition, Assumption 4, is such that the function \(t\mapsto \left( \varrho ^{2q}\right) ^{-1}(t)\) is operator concave, and that there is some \(n\in \mathbb {N}\) for which the function \(t\mapsto \varrho ^{-1}(t)/t^n\) is concave. Then, there is \(\widetilde{C}\) for which we have that

$$\begin{aligned} \mathcal {N}_{L_\nu }\left( \frac{\lambda }{\varrho ^2(\lambda )}\right) \le 2\beta ^{n+1}\widetilde{C}\mathcal {N}_{T_\nu }(\lambda ),\quad 0<\lambda \le \left\| {T_\nu }\right\| _{\mathcal L(\mathcal {H})}. \end{aligned}$$

Remark 2

For a power type function \(\varrho (t):= t^{a}\) the above concavity assumptions hold true whenever \(2aq \ge 1\) and \(n \le 1/a \le n+1\). In particular, the number n is uniquely determined.

2.4 Regularization schemes

General regularization schemes were introduced and discussed in ill-posed inverse problems and learning theory (see Shuai & Pereverzev, 2013, Sect. 2.2, and Bauer et al., 2007, Sect. 3.1, for a brief discussion). By using the notation from Sect. 2.1, the Tikhonov regularization scheme from (5) can be re-expressed as follows:

$$\begin{aligned} f_{\textbf{z},\lambda }= \mathop {{\textrm{argmin}}}\limits _{f\in \mathcal {D}(L)}\left\{ \left\| {S_\textbf{x}A(f)-\textbf{y}}\right\| _{m}^2+\lambda \left\| {L f}\right\| _{\mathcal {H}}^2\right\} , \end{aligned}$$

with minimizer given as

$$\begin{aligned} f_{\textbf{z},\lambda }=L^{-1}(T_\textbf{x}+\lambda I)^{-1}B_\textbf{x}^*\textbf{y}. \end{aligned}$$

The following definition extends this by replacing the operator \((T_\textbf{x}+\lambda I)^{-1}\) by some operator function \(g_\lambda (T_\textbf{x})\).

Definition 4

(Spectral regularization) We say that a family of functions \(g_\lambda :[0,\kappa ^2]\rightarrow \mathbb {R}\), \(0<\lambda \le a\), is a regularization scheme if there exist constants \(D,B,\gamma\) such that

  •  \(\sup \limits _{t\in [0,\kappa ^2]}\left|t g_\lambda (t) \right|\le D\).

  •  \(\sup \limits _{t\in [0,\kappa ^2]}\left|g_\lambda (t) \right|\le \frac{B}{\lambda }\).

  •  \(\sup \limits _{t\in [0,\kappa ^2]}\left|r_\lambda (t) \right|\le \gamma \qquad \text {for}\quad r_\lambda (t)=1-g_\lambda (t)t\).

  • For some constant \(\gamma _p\) (independent of \(\lambda\)), the maximal p satisfying the condition:

    $$\begin{aligned} \sup \limits _{t\in [0,\kappa ^2]}\left|r_\lambda (t) \right|t^p\le \gamma _p\lambda ^p \end{aligned}$$

    is said to be the qualification of the regularization scheme \(g_\lambda\).

Definition 5

The qualification p covers the index function \(\varphi\) if the function \(t\rightarrow \frac{t^p}{\varphi (t)}\) is nondecreasing.

We mention the following result.

Proposition 3

Suppose \(\varphi\) is a nondecreasing index function and the qualification, say \(p\ge 1\), of the regularization \(g_\lambda\) covers \(\varphi\). Then

$$\begin{aligned} \sup \limits _{\sigma \in [0,\kappa ^2]}\left|r_\lambda (\sigma ) \right|\varphi (\sigma )\le c_p\varphi (\lambda ),\quad c_p=\max (\gamma ,\gamma _p). \end{aligned}$$

Also, we have that

$$\begin{aligned} \sup \limits _{\sigma \in [0,\kappa ^2]}\left|r_\lambda (\sigma ) \right|\varphi (\lambda + \sigma )\le 2^{p}c_p\varphi (\lambda ). \end{aligned}$$

Most of the linear (spectral) regularization schemes (Tikhonov regularization, Landweber iteration or spectral cut-off) satisfy the properties of general regularization schemes. Inspired by the representation of the minimizer of the Tikhonov functional (5), we consider a general regularized solution in Hilbert scales corresponding to the above regularization \(g_\lambda\) in the form

$$\begin{aligned} f_{\textbf{z},\lambda }=L^{-1}g_\lambda (T_\textbf{x})B_\textbf{x}^*\textbf{y}, \end{aligned}$$
(11)

where by spectral calculus the real-valued function \(g_\lambda\) is applied to the self-adjoint operator \(T_\textbf{x}\).
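The following sketch (with illustrative stand-in matrices and an arbitrary regularization parameter, not the paper's operators) implements (11) for Tikhonov regularization and spectral cut-off via an eigendecomposition of \(T_\textbf{x}\), and checks that for Tikhonov the spectral formula agrees with the direct solve \(L^{-1}(T_\textbf{x}+\lambda I)^{-1}B_\textbf{x}^*\textbf{y}\).

```python
# Illustrative sketch of the regularized solution (11) for two choices of g_lambda.
import numpy as np

rng = np.random.default_rng(3)
n, m, lam = 100, 40, 1e-6
h = 1.0 / n
A = h * np.tril(np.ones((n, n)))                    # discretized integration operator
L = np.diag(np.arange(1, n + 1.0))                  # illustrative scale generator
idx = np.sort(rng.choice(n, size=m, replace=False))
S = np.zeros((m, n)); S[np.arange(m), idx] = 1.0    # point evaluation at the samples

f_true = np.sin(2 * np.pi * (np.arange(n) + 0.5) * h)
y = S @ A @ f_true + 0.001 * rng.standard_normal(m)

B = S @ A @ np.linalg.inv(L)                        # B_x
T = B.T @ B / m                                     # T_x = B_x^* B_x
b = B.T @ y / m                                     # B_x^* y

def g_tikhonov(t, lam): return 1.0 / (t + lam)
def g_cutoff(t, lam):   return np.where(t >= lam, 1.0 / np.maximum(t, lam), 0.0)

evals, V = np.linalg.eigh(T)
Linv = np.linalg.inv(L)
for g in (g_tikhonov, g_cutoff):
    f_hat = Linv @ (V @ (g(evals, lam) * (V.T @ b)))          # formula (11)
    print(g.__name__, np.linalg.norm(f_hat - f_true) / np.linalg.norm(f_true))

# sanity check: for Tikhonov, (11) coincides with L^{-1}(T_x + lam I)^{-1} B_x^* y
f_direct = Linv @ np.linalg.solve(T + lam * np.eye(n), b)
f_spec = Linv @ (V @ (g_tikhonov(evals, lam) * (V.T @ b)))
print(np.max(np.abs(f_direct - f_spec)))
```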

3 Convergence analysis

The analysis will distinguish between two cases, the ‘regular’ one, when \(f_\rho \in \mathcal {D}(L)\), and the ‘low smoothness’ case, when \(f_\rho \not \in \mathcal {D}(L)\). In either case, we shall first utilize the concept of distance functions. This will later allow us to establish convergence rates in a more classical style.

For the asymptotical analysis, we shall require the standard assumption relating the sample size m and the parameter \(\lambda\) such that

$$\begin{aligned} \mathcal {N}_{T_\nu }(\lambda ) \le m\lambda \qquad \text {and}\qquad 0<\lambda \le 1. \end{aligned}$$
(12)

It will be seen that asymptotically the condition (12) is always satisfied for the parameter which is chosen optimally under known smoothness.

Since the function \(\mathcal {N}_{T_\nu }(\lambda )\) is decreasing in \(\lambda\), for \(\lambda \le 1\) we have that \(\mathcal {N}_{T_\nu }(1)\le \mathcal {N}_{T_\nu }(\lambda )\). Hence condition (12) yields that

$$\begin{aligned} \mathcal {N}_{T_\nu }(1) \le m\lambda . \end{aligned}$$
(13)

Several probabilistic quantities will be used to express the error bounds. Precisely, for an index function \(\zeta\) we let

$$\begin{aligned} \varXi ^{\zeta }=\varXi ^{\zeta }(\lambda )&:= \left\| {\left( \frac{1}{\zeta }\right) (T_\textbf{x}+\lambda I)\zeta (T_\nu +\lambda I)}\right\| _{\mathcal {L}(\mathcal {H})}, \end{aligned}$$
(14)
$$\begin{aligned} \varLambda = \varLambda (\lambda )&: =\left\| {(L_\nu +\lambda I)^{-1/2}(L_\nu -L_\textbf{x})}\right\| _{HS},\end{aligned}$$
(15)
$$\begin{aligned} \varUpsilon = \varUpsilon (\lambda )&: =\left\| {(T_\nu +\lambda I)^{-1/2}(T_\nu -T_\textbf{x})}\right\| _{HS}, \end{aligned}$$
(16)

and

$$\begin{aligned} \varPsi = \varPsi (\lambda )&:= \left\| {(T_\nu +\lambda I)^{-{1/2}}B_\textbf{x}^*(S_\textbf{x}Af_\rho -\textbf{y})}\right\| _{\mathcal {H}}. \end{aligned}$$
(17)

In case that \(\zeta (t) = t^{r}\) we abbreviate \(\varXi ^{t^{r}}\) by \(\varXi ^{r}\) and \(\varXi ^{t}\) by \(\varXi\), not to be confused with the power. High probability bounds for these quantities are known, and these are given correspondingly in Propositions 4 and 5 in Appendix C.

3.1 The oversmoothing case

As mentioned before, we shall use distance functions, which measure the violation of a benchmark smoothness. Here the benchmark will be \(f_\rho \in \mathcal {D}(L)\).

Definition 6

We define the distance function \(d : [0, \infty )\rightarrow [0, \infty )\) by

$$\begin{aligned} d(R)=\inf \left\{ \left\| {f_\rho - f}\right\| _{\mathcal {H}}:f= L^{-1}v \text { and }\left\| {v}\right\| _{\mathcal {H}} \le R\right\} ,\quad R>0. \end{aligned}$$
(18)

The distance function is positive, decreasing, convex and continuous for all \(0 \le R < \infty\). It tends to 0 as \(R \rightarrow \infty\), see (Hofmann, 2006). Hence, the unique minimizer exists and will be denoted by \(f_\rho ^R\).

Notice the following: If \(f_\rho \in \mathcal {D}(L)\) then for some R the minimizer \(f_\rho ^R\) of the distance function will obey \(f_\rho ^R=f_\rho\).

Remark 3

In a rudimentary form, this approach was given in (Baumeister, 1987, Thm. 6.8). It was then introduced in regularization theory in Hofmann (2006). Within learning theory, such a concept was also used in the study (Smale & Zhou, 2003).

Theorem 1

Let \(\textbf{z}\) be i.i.d. samples drawn according to the probability measure \(\rho\). Suppose Assumptions 1–5 hold true. Let \(g_\lambda\) be a regularization with corresponding regularized solution \(f_{\textbf{z},\lambda }\) (see (11)). Suppose that the qualification p of the regularization \(g_\lambda\) covers the function \(\varrho\) (from Assumption 4), and that the functions \(\varrho ^{-1}(t)/t^n\) and \(\left( \varrho ^{2q}\right) ^{-1}(t)\) are concave and operator concave, respectively, for some \(n\ge 1\). Then for all \(0<\eta <1\), and for \(\lambda\) satisfying the condition (12), the following upper bound holds true with confidence \(1-\eta\):

$$\begin{aligned} \left\| {f_{\textbf{z},\lambda }-f_\rho }\right\| _{\mathcal {H}}\le C\left\{ d(R)+ 2R \varrho \left( \lambda \right) \right\} \log ^4\left( \frac{4}{\eta }\right) ,\quad R\ge \varSigma +\kappa M/\mathcal {N}_{T_\nu }(1), \end{aligned}$$

where C depends on B, D, \(c_p\), \(\kappa\), n, \(\beta\), \(\widetilde{C}\).

The bound from Theorem 1 is valid for all \(R\ge \varSigma +\kappa M/\mathcal {N}_{T_\nu }(1)\), and we shall now optimize it with respect to the choice of such R.

First, if \(f_\rho \in \mathcal {D}(L)\) then there is \(\bar{R}\ge \varSigma +\kappa M/\mathcal {N}_{T_\nu }(1)\) such that \(d(\bar{R}) = 0\), and

$$\begin{aligned} \left\| {f_{\textbf{z},\lambda }-f_\rho }\right\| _{\mathcal {H}}\le C \bar{R}~\varrho (\lambda )\log ^4\left( \frac{4}{\eta }\right) , \end{aligned}$$

where C depends on B, D, \(c_p\), \(\kappa\), n, \(\beta\), \(\widetilde{C}\).

Otherwise, in the low smoothness case, \(f_\rho \not \in \mathcal {D}(L)\), we introduce the following function

$$\begin{aligned} \varGamma (R) := \frac{d(R)}{R}, \qquad R\ge \varSigma +\kappa M/\mathcal {N}_{T_\nu }(1), \end{aligned}$$

which is a non-vanishing decreasing function, and hence its inverse \(\varGamma ^{-1}\) exists and is decreasing. Given \(\lambda >0\), by letting \(R = R(\lambda )\) solve the equation \(\varGamma (R) = \varrho (\lambda )\) we find that

$$\begin{aligned} \left\| {f_{\textbf{z},\lambda }-f_\rho }\right\| _{\mathcal {H}}\le C R(\lambda )\varrho (\lambda )\log ^4\left( \frac{4}{\eta }\right) , \end{aligned}$$
(19)

where C depends on B, D, \(c_p\), \(\kappa\), n, \(\beta\), \(\widetilde{C}\).

The above dependency \(\lambda \mapsto R(\lambda )\) can be made explicit when assuming that \(f_\rho\) has some smoothness measured in terms of a source condition, see Sect. 4 below. For Theorem 1 we obtain the error bound (19), but the parameter \(\lambda\) has to obey (12).

3.2 The regular case

Here we analyze the rates of convergence in the case when the underlying true solution \(f_\rho\) belongs to the domain of the operator L. Again, we shall choose a benchmark smoothness, here in the form of \(f_\rho \in \mathcal {D}(L^q)\) for some \(q\ge 1\). This benchmark smoothness is determined by the user. With respect to this benchmark we introduce the following distance function.

Definition 7

Given \(q\ge 1\) we define the distance function \(d_{q} : [0, \infty )\rightarrow [0, \infty )\) by

$$\begin{aligned} d_q(R)=\inf \left\{ \left\| {L(f-f_\rho )}\right\| _{\mathcal {H}}:f= L^{-q}v \text { and }\left\| {v}\right\| _{\mathcal {H}} \le R\right\} . \end{aligned}$$
(20)

Theorem 2

Let \(\textbf{z}\) be i.i.d. samples drawn according to the probability measure \(\rho\). Suppose Assumptions 1–4 hold true. Let \(g_\lambda\) be a regularization with corresponding regularized solution \(f_{\textbf{z},\lambda }\) (see (11)). Let \(\zeta\) be any index function such that the qualification \(\frac{1}{2}\) covers \(\zeta\). Suppose that the qualification p of the regularization \(g_\lambda\) covers the function \(\zeta \varphi\) (with \(\varphi\) from Assumption 4). Then for all \(0<\eta <1\), and for \(\lambda\) satisfying the condition (12), the following upper bound holds true with confidence \(1-\eta\):

$$\begin{aligned} \left\| {\zeta (T_\nu )L\left( f_{\textbf{z},\lambda }-f_\rho \right) }\right\| _{\mathcal {H}}\,\le\,&C\zeta (\lambda )\left\{ d_{q}(R)+R\left( \varphi (\lambda )+\frac{1}{\sqrt{m}}\right) +C'\sqrt{\frac{\mathcal {N}_{T_\nu }(\lambda )}{m\lambda }}\right\} \\&\times \log ^4\left( \frac{4}{\eta }\right) , \end{aligned}$$

Consequently, we find that

$$\begin{aligned} \left\| {f_{\textbf{z},\lambda }-f_\rho }\right\| _{\mathcal {H}}\, \le \, C \varrho (\lambda )\left\{ d_{q}(R)+R\left( \varphi (\lambda )+\frac{1}{\sqrt{m}}\right) +C'\sqrt{\frac{\mathcal {N}_{T_\nu }(\lambda )}{m\lambda }}\right\} \log ^4\left( \frac{4}{\eta }\right) \end{aligned}$$

and

$$\begin{aligned} \left\| {I_\nu A(f_{\textbf{z},\lambda }-f_\rho )}\right\| _{\mathscr {L}^2(X,\nu ;Y)} \,\le\,&C\sqrt{\lambda }\left\{ d_{q}(R)+R\left( \varphi (\lambda )+\frac{1}{\sqrt{m}}\right) +C'\sqrt{\frac{\mathcal {N}_{T_\nu }(\lambda )}{m\lambda }}\right\} \\&\times \log ^4\left( \frac{4}{\eta }\right) , \end{aligned}$$

where C depends on B, D, \(c_p\), \(\kappa\), and \(C'=2\kappa M+\varSigma\).

The bound from Theorem 2 is valid for all \(R\ge 1\), and we shall now optimize it with respect to the choice of \(R\ge 1\).

First, if \(f_\rho \in \mathcal {R}\left( L^{-q}\right)\) then \(d_{q}(\bar{R}) = 0\) for some \(\bar{R}\), and we find that

$$\begin{aligned} \left\| {f_{\textbf{z},\lambda }-f_\rho }\right\| _{\mathcal {H}}\le C\varrho \left( \lambda \right) \left\{ \bar{R}\left( \varphi (\lambda )+\frac{1}{\sqrt{m}}\right) +C'\sqrt{\frac{\mathcal {N}_{T_\nu }(\lambda )}{m\lambda }}\right\} \log ^4\left( \frac{4}{\eta }\right) . \end{aligned}$$

Otherwise, in case that \(f_\rho \not \in \mathcal {R}\left( L^{-q}\right)\) we introduce the following function

$$\begin{aligned} \varGamma _{q}(R) := \frac{d_{q}(R)}{R}, \qquad R \ge 1, \end{aligned}$$
(21)

which is a non-vanishing decreasing function, and hence the inverse \(\varGamma _{q}^{-1}\) exists and is decreasing. We finally get the main result by letting \(R = R(\lambda )\) solve the equation \(\varGamma _{q}(R) = \varphi (\lambda )\), and we find that

$$\begin{aligned} \left\| {f_{\textbf{z},\lambda }-f_\rho }\right\| _{\mathcal {H}}\le C\varrho \left( \lambda \right) \left\{ R(\lambda )\left( \varphi (\lambda )+\frac{1}{\sqrt{m}}\right) +C'\sqrt{\frac{\mathcal {N}_{T_\nu }(\lambda )}{m\lambda }}\right\} \log ^4\left( \frac{4}{\eta }\right) . \end{aligned}$$

4 Smoothness in terms of source-wise representation

So far convergence results were established in terms of distance functions. We will now specify the smoothness of the true solution in terms of the bounded, injective and self-adjoint operator \(L^{-1}\). This is the genuine setting for regularization in Hilbert scales.

Assumption 6

(General source condition) For an index function \(\theta\), the true solution \(f_\rho\) belongs to the class \(\varOmega (\theta ,R^\dagger )\) with

$$\begin{aligned} \varOmega (\theta ,R^\dagger ):=\left\{ f \in \mathcal {H}: f= \theta (L^{-1})v \text { and }\left\| {v}\right\| _{\mathcal {H}} \le R^\dagger \right\} . \end{aligned}$$

Notice that elements from \(\varOmega (\theta ,R^\dagger )\) belong to the range of \(\theta (L^{-1})\) which coincides with the domain of \(\theta (L)\), since \(L^{-1}\) was assumed to be bounded.

We aim at bounding the distance functions d(R) and \(d_q(R)\) from the oversmoothing and regular cases, respectively.

For a better understanding, we shall highlight the obtained general bounds when the considered index functions are of power type, and for this purpose we specify the function \(\theta (t) := t^{r}\), which represents the smoothness, as well as \(\varrho (t) = t^a\), representing the link. It will be seen that the index function \(t\mapsto \theta (\varrho (t)),\ t>0\), is relevant in the subsequent corollaries; here it reads \(\theta (\varrho (t)) = t^{ar},\ t>0\). Also, in the regular case with benchmark smoothness \(f_\rho \in \mathcal R(L^{-q})\), the function \(t \mapsto \frac{\iota ^q}{\theta }(t)\) appears, and this is required to be an index function. Within the power type context, this reads as \(r < q\), and it simply means that the actual smoothness is not beyond the benchmark.

Finally, we emphasize that the rates will depend on the decay of the effective dimension of the covariance operator \(T_\nu\), which was introduced in Sect. 2.3. Therefore, we will highlight the obtained bounds under specified decay rates for the effective dimension in Sect. 4.3. The obtained overall rates will be highlighted in Tables 1 and 2, respectively.

4.1 The oversmoothing case

Here the benchmark source condition \(f_\rho \in \mathcal R(L^{-1})\) (\(q=1\)) is linear, represented by the identity function \(\iota :t \mapsto t\), and we shall thus assume that the index function \(\theta\) is sub-linear. The obtained bounds will rely on the results from (Hofmann & Mathé, 2007, Theorem 5.9). Under Assumption 6 we find that

$$\begin{aligned} d(R) \le R \left( \left( \frac{\iota }{\theta }\right) ^{-1}\left( \frac{R^{\dag }}{R}\right) \right) ,\quad R>0. \end{aligned}$$

In order to minimize the bound from Theorem 1, we balance \(d(R) = R \varrho (\lambda )\), resulting in

$$\begin{aligned} R(\lambda ) = R^{\dag }\frac{\theta \left( \varrho (\lambda )\right) }{\varrho (\lambda )},\quad \lambda >0. \end{aligned}$$
(22)

Thus, for this value of \(R(\lambda )\) under the condition (12), the bound (19) reduces to

$$\begin{aligned} \left\| {f_{\textbf{z},\lambda }- f_\rho }\right\| _{\mathcal {H}} \le C R(\lambda ) \varrho (\lambda ) \log ^4(4/\eta )\le C R^{\dag }\theta \left( \varrho (\lambda )\right) \log ^4(4/\eta ). \end{aligned}$$
(23)
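The balancing step can be checked numerically for power-type functions. In the snippet below (with arbitrarily chosen values of a, r and \(R^{\dag }\)), the choice (22) indeed equates the bound on d(R) with \(R\varrho (\lambda )\).

```python
# Illustrative check of the balancing d(R) = R * varrho(lam) for power-type
# theta(t) = t**r, varrho(t) = t**a, using the bound d(R) <= R*((iota/theta)^{-1})(Rdag/R).
import numpy as np

a, r, Rdag = 0.4, 0.6, 3.0                         # assumed illustrative values
lam = np.logspace(-6, -1, 6)
R_lam = Rdag * lam**(a * r) / lam**a               # candidate choice (22)
d_bound = R_lam * (Rdag / R_lam)**(1.0 / (1.0 - r))
print(np.max(np.abs(d_bound - R_lam * lam**a)))    # ~0: the balance holds
```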

The following corollary is a consequence of Theorem 1, which explicitly provides us with an error bound in terms of the sample size m.

Corollary 1

Suppose that the unknown \(f_\rho\) obeys Assumption 6 for a sub-linear function \(\theta\). Under the same assumptions as in Theorem 1, and with the a-priori choice of the regularization parameter \(\lambda ^{*} = \lambda ^{*}(m)\) obtained from solving the equation \(\mathcal N_{T_\nu }(\lambda ^{*}) = m \lambda ^{*}\), for all \(0<\eta <1\) the following error estimate holds with confidence \(1-\eta\):

$$\begin{aligned} \left\| {f_{\textbf{z},\lambda }-f_\rho }\right\| _{\mathcal {H}} \le C \theta \left( \varrho \left( \lambda ^*\right) \right) \log ^4\left( \frac{4}{\eta }\right) , \end{aligned}$$

where C depends on B, D, \(c_p\), \(\kappa\), n, \(\beta\), \(\widetilde{C}\), M, \(\varSigma\), and \(R^\dagger\).

Evidently, the above parameter choice satisfies condition (12).
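Numerically, this parameter choice can be computed by bisection, since \(\mathcal N_{T_\nu }(\lambda )\) is decreasing and \(m\lambda\) is increasing in \(\lambda\). The sketch below uses assumed eigenvalues of \(T_\nu\) and is purely illustrative.

```python
# Illustrative computation of the a-priori choice solving N(lam) = m * lam.
import numpy as np

s = 1.0 / np.arange(1, 5001, dtype=float) ** 2      # assumed eigenvalues of T_nu
def eff_dim(lam):
    return np.sum(s / (s + lam))

def lambda_star(m, lo=1e-12, hi=1.0, iters=80):
    # N(lam) > m*lam for small lam, N(lam) < m*lam for lam = 1: bisect on a log scale
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if eff_dim(mid) > m * mid else (lo, mid)
    return np.sqrt(lo * hi)

for m in (100, 1000, 10000):
    lam = lambda_star(m)
    print(m, lam, eff_dim(lam), m * lam)            # the last two nearly agree
```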

4.2 The regular case

In this case the benchmark is given by the index function \(\iota ^{q}\), and we shall assume that the given smoothness, measured in terms of \(\theta\), is such that the function \(\iota ^{q}/\theta\), for \(0 < t \le \kappa ^2\), is an index function. However, the definition of the distance function \(R \mapsto d_{q}(R)\) is non-standard. The target norm is \(\left\| {L(f - f_\rho )}\right\| _{\mathcal {H}}\), and, in order to apply the result from (Hofmann & Mathé, 2007, Theorem 5.9), we have to ‘rescale’ the given smoothness (in terms of the operator \(L^{-1}\)) by a factor \(L^{-1}\). If Assumption 6 holds true with an index function \(\theta\) for which the quotient \(\iota ^{q}/\theta\) is an index function (and then so is the function \(\iota ^{q-1}/(\theta /\iota )\)), this results in the bound

$$\begin{aligned} d_{q}(R) \le R~\left[ \left( \frac{\iota ^{q}}{\theta }\right) ^{-1}\left( \frac{R^\dagger }{R}\right) \right] ^{q-1},\quad R>0. \end{aligned}$$
(24)

According to Theorem 2 we balance

$$\begin{aligned} d_{q}(R) = R \varphi (\lambda ). \end{aligned}$$

This yields

$$\begin{aligned} R(\lambda ) = R^{\dag } \frac{\theta \left( \varrho (\lambda )\right) }{\varrho ^{q}(\lambda )},\quad \lambda >0. \end{aligned}$$

Inserting this bound into Theorem 2 we find that

$$\begin{aligned}&\left\| {f_{\textbf{z},\lambda }-f_\rho }\right\| _{\mathcal {H}}\nonumber \\&\quad \le C\varrho (\lambda )\left\{ R^\dagger \frac{\theta \left( \varrho (\lambda )\right) }{\varrho (\lambda )}\left( 1+\frac{1}{\sqrt{m}\varphi (\lambda )}\right) +C'\sqrt{\frac{\mathcal {N}_{T_\nu }(\lambda )}{m\lambda }}\right\} \log ^4\left( \frac{4}{\eta }\right) \nonumber \\&\quad = C\varrho (\lambda )\left\{ R^\dagger \frac{\theta \left( \varrho (\lambda )\right) }{\varrho (\lambda )}+\frac{1}{\sqrt{m}}\left( R^\dagger \frac{\theta \left( \varrho (\lambda )\right) }{\varrho ^q(\lambda )}+C'\sqrt{\frac{\mathcal {N}_{T_\nu }(\lambda )}{\lambda }}\right) \right\} \log ^4\left( \frac{4}{\eta }\right) \end{aligned}$$
(25)

provided that (12) holds.

The optimization of the bound in inequality (25) depends on which of the last two summands is dominant. We then balance the remaining (two) terms. This results in the following corollaries for the different choices of the regularization parameter:

Corollary 2

Suppose that the unknown \(f_\rho\) obeys Assumption 6 for an index function \(\theta\), and that the related functions \(\frac{\iota ^q}{\theta }(t)\) and \(\frac{\iota ^q}{\theta }\left( \varrho (t)\right) \sqrt{\frac{\mathcal {N}_{T_\nu }(t)}{t}}\) are index functions. Under the same assumptions as in Theorem 2, and for the a-priori choice of the regularization parameter \(\lambda ^*=\varphi ^{-1}\left( \frac{1}{\sqrt{m}}\right)\), for all \(0<\eta <1\), the following upper bound holds with confidence \(1-\eta\):

$$\begin{aligned} \left\| {f_{\textbf{z},\lambda }-f_\rho }\right\| _{\mathcal {H}}\le C\theta \left( \varrho (\lambda ^*)\right) \log ^4\left( \frac{4}{\eta }\right) , \end{aligned}$$

where C depends on B, D, \(c_p\), \(\kappa\), M, \(\varSigma\), and \(R^\dagger\).

Corollary 3

Suppose that the unknown \(f_\rho\) obeys Assumption 6 for an index function \(\theta\), and that the related functions \(\frac{\iota ^q}{\theta }(t)\) and \(\frac{\iota ^q}{\theta }\left( \varrho (t)\right) \sqrt{\frac{\mathcal {N}_{T_\nu }(t)}{t}}\) are index functions. Under the same assumptions as in Theorem 2, and for the a-priori choice of the regularization parameter \(\lambda ^*\) as the solution to the equation \(\frac{\theta ^{2}(\varrho (\lambda ^{*}))}{\varrho ^{2}(\lambda ^{*})} \lambda ^{*}m = \mathcal N_{T_\nu }(\lambda ^{*})\), for all \(0<\eta <1\), the following upper bound holds with confidence \(1-\eta\):

$$\begin{aligned} \left\| {f_{\textbf{z},\lambda }-f_\rho }\right\| _{\mathcal {H}}\le C\theta \left( \varrho (\lambda ^*)\right) \log ^4\left( \frac{4}{\eta }\right) , \end{aligned}$$

where C depends on B, D, \(c_p\), \(\kappa\), M, \(\varSigma\), and \(R^\dagger\).

Since by assumption the function \(t \mapsto \frac{\theta ^{2}(\varrho (t))}{\varrho ^{2}(t)}\) is an index function, condition (12) holds for m large enough.

4.3 Taking the behavior of effective dimension into account

Below, to be specific, we consider the following two behaviors of the decay of the effective dimension of the covariance operator \(T_\nu\), namely power type and logarithmic type, known to hold true in many situations.

Assumption 7

(Polynomial decay) There exists some positive constant \(c>0\) such that

$$\begin{aligned} \mathcal {N}_{T_\nu }(\lambda ) \le c\lambda ^{-b},\quad \text { for }0\le b<1,~\forall \lambda >0. \end{aligned}$$

Assumption 8

(Logarithmic decay) There exists some positive constant \(c>0\) such that

$$\begin{aligned} \mathcal {N}_{T_\nu }(\lambda )\le c\log \left( \frac{1}{\lambda }\right) , \quad \forall \lambda >0. \end{aligned}$$

Remark 4

We mention that a polynomial decay of the eigenvalues of the covariance operator \(T_\nu\) yields a power-type behavior of the effective dimension, see (Caponnetto & De Vito, 2007). In some situations this behavior is not evident. Shuai et al. (2020) showed that for the Gaussian-type kernel \(K_1(x,x') = xx' + e^{-8(x-x')^2}\) with uniform sampling on [0, 1], the effective dimension exhibits a log-type behavior (Assumption 8); on the other hand, the kernel \(K_2(x,x') = \min \{x,x'\}-xx'\) exhibits a power-type behavior (Assumption 7).

Below, we shall summarize the convergence rates under the specific behavior of the effective dimension, Assumptions 7 and 8, respectively, in Tables 1 and 2. We confine to the power type case, when both the link condition and the source condition are of power type, i.e., \(\varrho (t)=t^a\) and \(\theta (t)=t^r\) for parameters \(a,r>0\). The qualification of the regularization is denoted by p as before. Also, the benchmark smoothness is q, where either \(q=1\) (oversmoothing case) or \(q>1\) (regular case). Notice that due to the sub-linearity condition for \(\varrho ^{2}\) we must have \(0 < a \le 1/2\). Also, throughout the analysis, we assume that the qualification covers the given smoothness, i.e., \(a q\le p\). The bounds presented in the tables are consequences of Corollaries 1–3, respectively. Therefore, Assumptions 1–6 are assumed to be satisfied.

Table 1 (under Assumption 7, and for \(a\le \frac{1}{2}\), \(a q\le {p}\)): convergence rates of the regularized solution \(f_{\textbf{z},\lambda }\)
Table 2 (under Assumption 8, and for \(a\le \frac{1}{2}\), \(a q\le {p}\)): convergence rates of the regularized solution \(f_{\textbf{z},\lambda }\)

The tables are structured as follows. In the first column we present the rates of convergence \(\varepsilon (m)\) for the error estimates of the form:

$$\begin{aligned} \mathbb {P}_{\textbf{z}\in Z^m}\left\{ \left\| {f_{\textbf{z},\lambda }-f_\rho }\right\| _{\mathcal {H}}\le C\varepsilon (m)\log ^4\left( \frac{4}{\eta }\right) \right\} \ge 1-\eta . \end{aligned}$$

In the second column, the corresponding order of the regularization parameter choice \(\lambda ^*\) in terms of m is indicated. In the third column, we highlight the smoothness of the true solution \(f_\rho\). In the fourth column, we emphasize additional constraints, specifically on the benchmark smoothness.

The first row corresponds to the oversmoothing case, and the last two rows correspond to the regular case. In the regular case, we observe that the validity of the rates of the convergence depends on the benchmark smoothness through aq. At the intersection point, when \(a q = a r+\frac{b+1}{2}\), both rates coincide. As we will see in the next section the rates of convergence in the regular case (\(q>1\)) are optimal provided that the benchmark smoothness is chosen appropriately.

5 Optimality of the error bounds

We shall discuss the optimality of the previously obtained error bounds, in the regular case, and we shall use the known optimality results from (Blanchard & Mücke, 2018). However, at present the smoothness is measured with respect to the operator \(T_\nu\), whereas in Blanchard and Mücke (2018) this was done with respect to the operator \(L_\nu := A^{ *} I_\nu ^{*} I_\nu A = LT_\nu L\). Therefore, the following ‘recipe’ will be used.

  1. 1.

    Transfer smoothness as given in terms of \(L^{-1}\) to smoothness in terms of \(L_\nu\), and

  2. 2.

    Knowing the decay of the singular numbers of the operator \(T_\nu\) inherent in Assumption 7, find the decay of the singular numbers of \(L_\nu\).

In order to keep the analysis simple and transparent, we confine to power type smoothness \(\theta (t)=t^{r},\ 0 < r \le q\) in Assumption 6, as well as to power type link in Assumption 4 with \(\varrho (t) := t^{a}\) for some \(a>0\).

In the subsequent subsections, we shall sketch the proof of the lower bounds step by step, reaching the optimality assertion at the end. In order to get there, additional assumptions have to be made, a lifting condition (Assumption 9), and a singular number decay condition (Assumption 10).

5.1 Relating smoothness

The link condition is crucial, and the subsequent arguments are of interpolation type, applying Heinz Inequality within the present context. To this end, we require that q is chosen such that \(aq\ge 1/2\). In this case Assumption 4 yields, by applying Heinz Inequality (9) with exponent \(1/(2aq)\le 1\) that

$$\begin{aligned} \left\| {I_\nu A L^{-1}u}\right\| _{\mathscr {L}^2(X,\nu ;Y)} = \left\| {T_\nu ^{1/2}u}\right\| _{\mathcal {H}} \asymp \left\| {L^{-\frac{1}{2a}}u}\right\| _{\mathcal {H}},\quad u\in \mathcal {H}. \end{aligned}$$

Letting \(v:= L^{-1}u\) we find that

$$\begin{aligned} \left\| {L_\nu ^{1/2} v}\right\| _{\mathcal {H}} = \left\| {I_\nu A v}\right\| _{\mathscr {L}^2(X,\nu ;Y)} \asymp \left\| {L^{-(\frac{1}{2a} -1)}v}\right\| _{\mathcal {H}},\quad v\in \mathcal {H}. \end{aligned}$$
(26)

First, we see from this that \(a < 1/2\), because otherwise \(L_\nu\) would be continuously invertible. Also, the relation (26) would allow transferring smoothness r with respect to \(L^{-1}\) to \(L_\nu\) as long as \(0 < r \le \frac{1}{2a} - 1\). In order to treat higher smoothness (in terms of \(L^{-1}\)) a lifting condition is unavoidable. This must be consistent with the link from (26). Thus we look for a factor z such that \(t^{(\frac{1}{2a} -1)z} = t^{q}\), yielding \(z:= \frac{2aq}{1 - 2a}\).

Assumption 9

(lifting condition) We have that

$$\begin{aligned} \left\| {L^{-q}u}\right\| _{\mathcal {H}} \asymp \left\| {L_\nu ^{\frac{aq}{1-2a}}u}\right\| _{\mathcal {H}},\quad u\in \mathcal {H}. \end{aligned}$$

Remark 5

The strengthening of the original link condition, Assumption 4, towards a lifting condition has been discussed in more detail in Mathé (2019).

Having this lifting, and applying Heinz Inequality (9) (with exponent r/q) yields

$$\begin{aligned} \left\| {L^{-r}v}\right\| _{\mathcal {H}} \asymp \left\| {L_\nu ^{\frac{a r}{1-2a}}v}\right\| _{\mathcal {H}},\quad v\in \mathcal {H}, \end{aligned}$$
(27)

and a source-wise representation as in Assumption 6 yields a corresponding source-wise representation with respect to the operator \(L_\nu\) (with different constant).

5.2 Relating effective dimensions

Here we shall use the following consequence of Assumption 4. Indeed, turning from squared norms to quadratic forms we see that

$$\begin{aligned} \langle { L^{-2q}u},{u} \rangle \asymp \langle { T_\nu ^{2aq}u},{u} \rangle ,\quad u\in \mathcal {H}. \end{aligned}$$

The Weyl Monotonicity Theorem (Bhatia, 1997, Cor. III.2.3) yields that then \(s_{j}(L^{-2q}) \asymp s_{j}(T_\nu ^{2aq}),\ j=1,2,\dots\), or simplified that \(s_{j}(L^{-1}) \asymp s_{j}^{a}(T_\nu ),\ j=1,2,\dots\) by spectral calculus. Here \(s_{j}(L^{-1})\) and \(s_{j}(T_\nu )\) denote the singular numbers of the operators. Similarly, we obtain from (26) that \(s_{j}(L_\nu ) \asymp s_{j}^{\frac{1 - 2a}{a}}(L^{-1})\), and a fortiori that \(s_{j}(L_\nu ) \asymp s_{j}^{1 - 2a}(T_\nu )\).

5.3 Lower bound

In order to show the optimality of the error bounds as discussed in Table 1, we shall ensure that the decay of the effective dimension cannot be faster than asserted in Assumption 7.

Assumption 10

(decay of singular number) There is a constant \(c>0\) such that the singular numbers of the operator \(T_\nu\) obey

$$\begin{aligned} s_{j}(T_\nu ) \ge c j^{-1/b},\quad j=1,2,\dots \end{aligned}$$

Notice that this yields \(\mathcal N_{T_\nu }(\lambda ) \ge c \lambda ^{-b}\), so that this is the limiting case for which Assumption 7 holds. Hence, the assumed decay of the singular numbers of \(T_\nu\) is best possible by order. The following is reported in Blanchard and Mücke (2018) for the problem (2): under smoothness r with respect to the operator \(L_\nu\), and with the decay of the singular numbers \(s_{j}(L_\nu )\) not faster than \(j^{-1/b}\), the optimal rate is of the order \(\left( \frac{1}{\sqrt{m}}\right) ^{\frac{2r}{2r + b +1}}\). In the present context, we have to assign \(r\leftarrow \frac{ar}{1-2a}\) and \(b\leftarrow \frac{b}{1-2a}\). This yields a lower bound of the order

$$\begin{aligned} \left( \frac{1}{\sqrt{m}}\right) ^{\frac{2ar/(1 - 2a)}{2ar/(1 - 2a) + b/(1 - 2a) + 1}} = \left( \frac{1}{\sqrt{m}}\right) ^{\frac{2ar}{2ar + b + 1 - 2a}} \end{aligned}$$

for the range \(\frac{ar}{1-2a}\le p\).
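The exponent arithmetic behind this substitution can be verified symbolically, as in the following short (illustrative) check.

```python
# Symbolic check: substituting r -> a*r/(1-2a) and b -> b/(1-2a) into 2r/(2r+b+1)
# yields 2ar/(2ar + b + 1 - 2a).
import sympy as sp

a, r, b = sp.symbols('a r b', positive=True)
expo = 2 * (a * r / (1 - 2 * a)) / (2 * (a * r / (1 - 2 * a)) + b / (1 - 2 * a) + 1)
print(sp.simplify(expo - 2 * a * r / (2 * a * r + b + 1 - 2 * a)))   # prints 0
```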

This corresponds to the upper bound for \(a\le \frac{1}{2}\), \(a q\le {p}\), \(r \le q\le r +\frac{b+1}{2a}\), as discussed in the last row of Table 1, and it shows that the rate is of optimal order.

6 Conclusion

We investigated regularization schemes in Hilbert scales for linear inverse (learning) problems. Regularized solutions are constructed under the requirement that they belong to \(\mathcal {D}(L)\), for the (unbounded) operator L which generates the scale. Clearly, this may be extended to the case that the regularized solutions belong to \(\mathcal {D}(L^s)\) for some \(s>0\), simply by considering \(L^s\) as a generator of the (same) scale.

We draw the following conclusions. Some arguments concern the case of power type conditions, and for these we refer to Tables 1 and 2 for details.

Optimal rates: In the regular case, we can achieve the optimal rates of convergence provided that the benchmark smoothness q is chosen in the appropriate region (see Sect. 5.3). In contrast, in the mis-specified (oversmoothing) case we can only prove sub-optimal rates of convergence. By now, no techniques are known which are capable of improving the rates in this case.

Saturation effects: In case \(q=r\), we observe from the above analysis that optimal rates can be proven for the range \(ar\le p\), provided that the scheme has qualification p. For standard regularization schemes, this would hold only for the range \(\frac{ar}{1-2a}\le p\). Hence, the saturation effect is delayed here.

Convergence rates without source condition: Typically, rates of convergence are shown under smoothness in terms of source conditions. Here we establish error bounds by using the concept of distance functions, measuring the violation of a benchmark source condition. When specifying smoothness as a source condition, we use known bounds of the considered distance function. This provides us with convergence rates in terms of the sample size.

Source conditions: When studying kernel methods, the smoothness of the true solution is measured in terms of a source condition with respect to the covariance operator, and hence can hardly be verified. We consider source conditions in terms of the Hilbert scale. This has a clear meaning, and it is independent of the choice of kernel. However, the chosen kernel comes into play when requiring the validity of a link condition.