Inverse learning in Hilbert scales

We study linear ill-posed inverse problems with noisy data in the framework of statistical learning. The corresponding linear operator equation is assumed to fit a given Hilbert scale, generated by some unbounded self-adjoint operator. Approximate reconstructions from random noisy data are obtained with general regularization schemes in such a way that these belong to the domain of the generator. The analysis thus has to distinguish two cases: the regular one, when the true solution also belongs to the domain of the generator, and the 'oversmoothing' one, when this is not the case. Rates of convergence for the regularized solutions will be expressed in terms of certain distance functions. For solutions with smoothness given in terms of source conditions with respect to the scale-generating operator, the error bounds can be made explicit in terms of the sample size.


Introduction
Let A : H → 𝓗 be a linear injective operator between infinite-dimensional Hilbert spaces H and 𝓗, endowed with the inner products ⟨·, ·⟩_H and ⟨·, ·⟩_𝓗, respectively. Here 𝓗 is a space of functions from a Polish space X to a real separable Hilbert space Y. We study linear ill-posed operator problems governed by the operator equation
(1) A(f) = g, for f ∈ H and g ∈ 𝓗.
We observe noisy values of g at some points, and the foremost objective is to estimate the true solution f. The problem of interest can be described as follows: given data {(x_i, y_i)}_{i=1}^m under the model
(2) y_i = A(f)(x_i) + ε_i, i = 1, …, m,
where ε_i is the observational noise and m denotes the sample size, determine (approximately) the underlying element f ∈ H with g := A(f) being the regression function.
For classical inverse problems, the observational noise is assumed to be deterministic. Here we assume that the random observations {(x_i, y_i)}_{i=1}^m are independent and follow some unknown probability distribution ρ defined on the sample space Z = X × Y; hence we are in the context of statistical inverse problems.
The reconstruction of the unknown true solution will be based on spectral regularization schemes. Various schemes can be used to stably estimate the true solution. Tikhonov regularization is widely considered in the literature. This scheme consists of an error term measuring the fitness to the data and a penalty term controlling the complexity of the reconstruction. In this study we enforce smoothness of the approximate solution by introducing an unbounded, linear, self-adjoint, strictly positive operator L : D(L) ⊂ H → H with a dense domain of definition D(L) ⊂ H, and then we define the Tikhonov regularization scheme in Hilbert scales as
(3) f_{z,λ} := argmin_{f ∈ D(L)} { (1/m) Σ_{i=1}^m ‖A(f)(x_i) − y_i‖²_Y + λ‖Lf‖²_H },
where λ is a positive regularization parameter and the operator L influences the properties of the approximate solution. Standard Tikhonov regularization corresponds to L := Id : H → H, the identity mapping.
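On a discretized toy problem, a minimizer of a functional of the form (3) can be computed from its normal equations. The following is a minimal sketch; the smoothing operator A (a cumulative sum), the penalty operator L (a discrete Laplacian-type matrix), and all sizes and constants are our own illustrative choices, not taken from the paper.

```python
import numpy as np

# Toy discretization of the Hilbert-scale Tikhonov scheme (3).
rng = np.random.default_rng(0)
n, m = 50, 200                                   # grid size, sample size
t = np.linspace(0.0, 1.0, n)
A = np.tril(np.ones((n, n))) / n                 # smoothing (integration-like) operator
L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # roughness-penalizing operator
f_true = np.sin(2.0 * np.pi * t)

x_idx = rng.integers(0, n, size=m)               # random design points on the grid
y = (A @ f_true)[x_idx] + 0.01 * rng.standard_normal(m)

# Minimize (1/m) sum_i |(A f)(x_i) - y_i|^2 + lam * ||L f||^2 via normal equations.
S = A[x_idx, :]                                  # point evaluations of A f
lam = 1e-6
lhs = S.T @ S / m + lam * (L.T @ L)              # positive definite, hence invertible
rhs = S.T @ y / m
f_hat = np.linalg.solve(lhs, rhs)
```

Since the discrete Laplacian chosen here is positive definite, the penalized normal equations always have a unique solution, mirroring the strict positivity required of L.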
In many practical problems, the operator L is chosen as a differential operator in some appropriate function space, e.g., an L²-space. Notice from (3) that the reconstruction f_{z,λ} belongs to D(L), so that formally we may introduce u_{z,λ} := Lf_{z,λ} ∈ H. In the regular case, when f_ρ ∈ D(L), we let u_ρ := Lf_ρ ∈ H. With this notation we can rewrite (1) as (AL^{−1}) u = g. Also, the Tikhonov minimization problem then reduces to the standard one, albeit for the different operator AL^{−1}. Accordingly, the error bounds relate as ‖f_ρ − f_{z,λ}‖_H = ‖u_ρ − u_{z,λ}‖_{−1}. Therefore, error bounds for u_ρ − u_{z,λ} in the weak norm, in H_{−1}, yield bounds for f_ρ − f_{z,λ}. The latter bounds are not known from previous studies. We are also interested in the oversmoothing case, when f_ρ ∉ D(L), for which we provide a detailed error analysis here. The above relation will, however, implicitly be utilized in the subsequent proofs.
We review literature related to the considered problem. Regularization schemes in Hilbert scales are widely considered in classical inverse problems (with deterministic noise), starting from F. Natterer [26] and continued in [9,18,20,21,23,24,25,27,31]. G. Blanchard and N. Mücke [7] considered general regularization schemes for linear inverse problems in statistical learning and provided (upper and lower) rates of convergence under Hölder-type source conditions. Here we consider general (spectral) regularization schemes in Hilbert scales for statistical inverse problems. We discuss rates of convergence for general regularization under certain noise conditions, approximate source conditions, and a specific link condition between the operator A governing equation (1) and the smoothness-promoting operator L as used, e.g., in (3). We study error estimates by using the concept of reproducing kernel Hilbert spaces. The concept of the effective dimension plays an important role in the convergence analysis.
The key points of our results can be described as follows:
• We do not restrict ourselves to white or coloured Hilbertian noise. We consider general centered noise satisfying certain moment conditions, see Assumption 2.
• We consider general regularization schemes in Hilbert scales. It is well known that Tikhonov regularization suffers from a saturation effect. In contrast, this saturation is delayed for Tikhonov regularization in Hilbert scales.
• The analysis uses the concept of link conditions, see Assumption 4, required to transfer information in terms of properties of the operator L to the covariance operator.
• We analyze the regular case, i.e., when the true solution belongs to the domain of the operator L.
• We also focus on the oversmoothing case, when the true solution does not belong to the domain of the operator L.
The paper is organized as follows. The basic definitions, assumptions, and notation required in our analysis are presented in Section 2. In Section 3 we discuss bounds on the reconstruction error in the direct learning setting and in the inverse problem setting by means of distance functions. This section comprises two main results: the first is devoted to convergence rates in the oversmoothing case, while the second focuses on the regular case. When specifying smoothness in terms of source conditions we can bound the distance functions, and this gives rise to convergence rates in terms of the sample size m. This program is carried out in Section 4. In case both the smoothness and the link condition are of power type, we establish the optimality of the obtained error bounds in the regular case in Section 5.
In the Appendix, we present probabilistic estimates which provide the tools to obtain the error bounds.

Notation and Assumptions
In this section, we introduce some basic concepts, definitions, notation, and assumptions required in our analysis.
We assume that X is a Polish space; therefore the probability distribution ρ allows for the disintegration
dρ(x, y) = dρ(y|x) dν(x),
where ρ(y|x) is the conditional probability distribution of y given x, and ν is the marginal probability distribution. We consider random observations {(x_i, y_i)}_{i=1}^m which follow the model y = A(f)(x) + ε with centered noise ε. We assume throughout the paper that the operator A is injective.
Assumption 1 (The true solution). The conditional expectation w.r.t. ρ of y given x exists (a.s.), and there exists f_ρ ∈ H such that
∫_Y y dρ(y|x) = A(f_ρ)(x), for almost all x ∈ X.
The element f_ρ is the true solution which we aim to estimate.

Assumption 2 (Noise condition). There exist constants M, Σ > 0 such that for almost all x ∈ X,
∫_Y ‖y − A(f_ρ)(x)‖_Y^p dρ(y|x) ≤ (1/2) p! M^{p−2} Σ², for all p ≥ 2.
This assumption is usually referred to as a Bernstein-type assumption.
We return to the unbounded operator L. By spectral theory, the operator L^s : D(L^s) → H is well defined for s ∈ R, and the spaces H_s := D(L^s), s ≥ 0, equipped with the inner product ⟨f, g⟩_{H_s} = ⟨L^s f, L^s g⟩_H, f, g ∈ H_s, are Hilbert spaces. For s < 0, the space H_s is defined as the completion of H under the norm ‖f‖_s := ⟨f, f⟩_s^{1/2}. The family (H_s)_{s∈R} is called the Hilbert scale induced by L. The following interpolation inequality is an important tool for the analysis:
‖f‖_r ≤ ‖f‖_t^{(s−r)/(s−t)} ‖f‖_s^{(r−t)/(s−t)}, f ∈ H_s,
which holds for any t < r < s [11, Chapt. 8].
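This interpolation inequality can be checked numerically for a diagonal surrogate of the generator L, where ‖f‖_s is computed in the eigenbasis. The spectrum and coefficient vector below are our own toy choices; the inequality itself is exact (it follows from Hölder's inequality), so the check must pass for any such choice.

```python
import numpy as np

# Check ||f||_r <= ||f||_t^((s-r)/(s-t)) * ||f||_s^((r-t)/(s-t)) for t < r < s.
rng = np.random.default_rng(1)
lam_L = 1.0 + 10.0 * rng.random(100)    # eigenvalues of a diagonal surrogate for L
f = rng.standard_normal(100)            # coefficients of f in the eigenbasis of L

def hs_norm(f, s):
    # ||f||_s = ||L^s f||_H, computed spectrally
    return np.sqrt(np.sum((lam_L ** s) ** 2 * f ** 2))

t, r, s = -1.0, 0.5, 2.0
lhs = hs_norm(f, r)
rhs = hs_norm(f, t) ** ((s - r) / (s - t)) * hs_norm(f, s) ** ((r - t) / (s - t))
```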

2.1.
Reproducing kernel Hilbert space and related operators. We start with the concept of reproducing kernel Hilbert spaces. Such a space is a subspace of L²(X, ν; Y) (the space of square-integrable functions from X to Y with respect to the probability distribution ν) which can be characterized by a symmetric, positive semi-definite kernel, and each of its functions satisfies the reproducing property. Here we discuss vector-valued reproducing kernel Hilbert spaces, following [22], which generalize real-valued reproducing kernel Hilbert spaces [1].
Definition 2.1 (Vector-valued reproducing kernel Hilbert space). For a non-empty set X and a real separable Hilbert space (Y, ⟨·, ·⟩_Y), a Hilbert space H of functions from X to Y is said to be a vector-valued reproducing kernel Hilbert space if, for every x ∈ X and y ∈ Y, the linear functional F_{x,y} : H → R defined by F_{x,y}(f) := ⟨y, f(x)⟩_Y is continuous. For a given operator-valued positive semi-definite kernel K : X × X → L(Y) we can construct a unique vector-valued reproducing kernel Hilbert space (H, ⟨·, ·⟩_H) of functions from X to Y: one defines the linear operator K_x : Y → H through ⟨K_x y, f⟩_H = ⟨y, f(x)⟩_Y, in other words f(x) = K_x^* f. Moreover, there is a one-to-one correspondence between operator-valued positive semi-definite kernels and vector-valued reproducing kernel Hilbert spaces, see [22].
We make the following assumption concerning the Hilbert space 𝓗.
Assumption 3 (Reproducing kernel Hilbert space). The space 𝓗 is assumed to be a vector-valued reproducing kernel Hilbert space of functions f : X → Y such that
(i) K_x : Y → 𝓗 is a Hilbert–Schmidt operator for every x ∈ X, with κ² := sup_{x∈X} ‖K_x‖²_{HS} < ∞;
(ii) for y, y′ ∈ Y, the real-valued function ς : X × X → R, (x, x′) ↦ ⟨K_x y, K_{x′} y′⟩_𝓗, is measurable.
Example 2.3. In case Y is a bounded subset of R, the reproducing kernel Hilbert space becomes a real-valued reproducing kernel Hilbert space. The corresponding kernel is the symmetric, positive semi-definite function K : X × X → R with the reproducing property f(x) = ⟨f, K_x⟩_𝓗. In this case, Assumption 3 simplifies to the condition that the kernel is measurable and κ² := sup_{x∈X} K(x, x) < ∞.
Now we introduce the relevant operators used in the convergence analysis. We write x = (x_1, …, x_m), y = (y_1, …, y_m), z = (z_1, …, z_m). The product Hilbert space Y^m is equipped with the inner product ⟨y, y′⟩_m = (1/m) Σ_{i=1}^m ⟨y_i, y′_i⟩_Y and the corresponding norm ‖y‖_m = ⟨y, y⟩_m^{1/2}. We define the sampling operator S_x : 𝓗 → Y^m, g ↦ (g(x_1), …, g(x_m)); its adjoint S_x^* : Y^m → 𝓗 is given by S_x^* y = (1/m) Σ_{i=1}^m K_{x_i} y_i. Let I_ν denote the canonical injection map 𝓗 → L²(X, ν; Y). We then observe that, under Assumption 3, both operators S_x and I_ν are bounded by κ, since ‖g(x)‖_Y = ‖K_x^* g‖_Y ≤ κ‖g‖_𝓗 for every x ∈ X. We denote the empirical and population operators B_x := S_x A L^{−1}, B_ν := I_ν A L^{−1}, T_x := B_x^* B_x, T_ν := B_ν^* B_ν, and L_x := A^* S_x^* S_x A, L_ν := A^* I_ν^* I_ν A : H → H. The operators T_ν, T_x, L_ν, L_x are positive, self-adjoint, and depend on the kernel. Under Assumption 3, the operators B_x, B_ν are bounded by κ̄ := κ‖AL^{−1}‖ and the operators L_x, L_ν are bounded by κ̃² for κ̃ := κ‖A‖, i.e., ‖B_x‖_{H→Y^m} ≤ κ̄, ‖B_ν‖_{H→L²(X,ν;Y)} ≤ κ̄, ‖L_x‖_{L(H)} ≤ κ̃², and ‖L_ν‖_{L(H)} ≤ κ̃².
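For a real-valued kernel, the operator S_x S_x^* on Y^m (with the normalized inner product above) is represented by the scaled Gram matrix K(x_i, x_j)/m. A small sketch with a Gaussian kernel, which is an illustrative choice of ours (the paper's kernel is general); for this kernel κ² = sup_x K(x, x) = 1:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 30
x = rng.random(m)

def K(u, v):
    # Gaussian kernel matrix K(u_i, v_j); positive semi-definite by construction
    return np.exp(-(u[:, None] - v[None, :]) ** 2)

# S_x S_x* on Y^m is the scaled Gram matrix; it is PSD and, since K(x,x) = 1,
# its trace equals m * (1/m) = 1 <= kappa^2.
G = K(x, x) / m
eig = np.linalg.eigvalsh(G)
```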
2.2. Link condition. In the subsequent analysis we shall derive convergence rates by using approximate source conditions, which are related to a certain benchmark smoothness. This benchmark smoothness is chosen by the user. In order to have handy arguments for deriving the convergence rates, we fix an (integer) power q ≥ 1. We shall use a link condition to transfer smoothness in terms of the operator L to the covariance operator T_ν. This link condition will involve an index function.
Definition 2.4 (Index function).A function ϕ : R + → R + is said to be an index function if it is continuous and strictly increasing with ϕ(0) = 0.
An index function is called sub-linear whenever the mapping t → t/ϕ(t), t > 0, is nondecreasing. Further, we require the index function to belong to the following class of functions.
The representation ϕ = ϕ₂ϕ₁ is not unique; therefore ϕ₂ can be assumed to be a Lipschitz function with Lipschitz constant 1. We shall also use an important result needed in our analysis [28, Corollary 1.2.2].

Assumption 4 (link condition). There exist a power q > 1 and an index function ℓ, for which the function ℓ² is sub-linear, and constants 1 ≤ β < ∞ such that
(1/β) ‖ℓ^q(T_ν) u‖_H ≤ ‖L^{−q} u‖_H ≤ β ‖ℓ^q(T_ν) u‖_H, u ∈ H.
The function t → ϕ(t) := ℓ^{q−1}(t) belongs to the class F.
As shown in [9], Assumption 4 implies the range identity R(L^{−q}) = R(ℓ^q(T_ν)). In the context of a comparison of operators we mention the well-known Heinz inequality, see [11, Prop. 8.21], which asserts that a comparison ‖Gu‖_H ≤ ‖Hu‖_H, u ∈ H, for non-negative self-adjoint operators G, H : H → H yields, for every exponent 0 < q ≤ 1, that ‖G^q u‖_H ≤ ‖H^q u‖_H, u ∈ H. Applying this to the above link condition we obtain the following:
Proposition 2.6. Under Assumption 4 we have
β^{−1/q} ‖ℓ(T_ν) u‖_H ≤ ‖L^{−1} u‖_H ≤ β^{1/q} ‖ℓ(T_ν) u‖_H, u ∈ H.
Moreover, there is a constant c > 0 such that ‖T_ν^{1/2} u‖_H ≤ c ‖L^{−1} u‖_H, u ∈ H.
Proof. The first assertion is a consequence of the Heinz inequality. For the last one we argue as follows: since ℓ² is assumed to be sub-linear, the mapping t → t/ℓ²(t) is nondecreasing; hence t ≤ (‖T_ν‖/ℓ²(‖T_ν‖)) ℓ²(t) for 0 < t ≤ ‖T_ν‖, which, combined with the first assertion, completes the proof.
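In the commuting (diagonal) case the Heinz inequality reduces to an entrywise statement, which is easy to verify numerically; its real content is that the implication persists for non-commuting operators. A toy check, with spectra of our own choosing:

```python
import numpy as np

# Diagonal positive operators G, H with g_j <= h_j entrywise, so ||G u|| <= ||H u||
# for every u; Heinz then gives ||G^q u|| <= ||H^q u|| for 0 < q <= 1.
rng = np.random.default_rng(3)
g = 0.1 + rng.random(50)             # spectrum of G
h = g + rng.random(50)               # spectrum of H, h_j >= g_j
u = rng.standard_normal(50)
q = 0.4

# premise of the comparison
assert np.linalg.norm(g * u) <= np.linalg.norm(h * u)

lhs = np.linalg.norm(g ** q * u)     # ||G^q u||
rhs = np.linalg.norm(h ** q * u)     # ||H^q u||
```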
Remark 2.7. From the assertion it is heuristically clear that the function ℓ² cannot increase faster than linearly. More details will be given in Section 5.
Example 2.8 (Finitely smoothing). In case the function ℓ, and hence its inverse, is of power type, this implies a power-type decay of the singular numbers of T_ν. In this case the operator T_ν is called finitely smoothing.
Example 2.9 (Infinitely smoothing). If, on the other hand, the function ℓ is logarithmic, e.g., ℓ(t) = log^{−a}(1/t) for some a > 0, then the singular numbers of T_ν decay exponentially. In this case the operator T_ν is called infinitely smoothing.

2.3.
Effective dimension. We now introduce the concept of the effective dimension, which is an important ingredient for deriving rates of convergence under Hölder source conditions [7,10,12] and general source conditions [16,29]. The effective dimension of the trace-class operator T_ν is defined as
N_{T_ν}(λ) := Tr((T_ν + λI)^{−1} T_ν), λ > 0.
For an infinite-dimensional operator T_ν, the function λ → N_{T_ν}(λ) is continuous and decreasing from ∞ to zero for 0 < λ < ∞ (see [5,8,15,16,32] for details).
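For a diagonal surrogate of T_ν, the trace reduces to a scalar sum, so the basic properties (monotonicity in λ and the bound N(λ) ≤ Tr(T_ν)/λ) can be observed directly. The spectrum below is an illustrative choice of ours:

```python
import numpy as np

# N(lam) = trace(T (T + lam I)^{-1}) = sum_j s_j / (s_j + lam) for a diagonal
# surrogate of T_nu with summable (trace-class) spectrum s_j = j^{-2}.
def effective_dimension(s, lam):
    return float(np.sum(s / (s + lam)))

s = 1.0 / np.arange(1, 10001) ** 2
lams = [1e-1, 1e-2, 1e-3, 1e-4]
N = [effective_dimension(s, lam) for lam in lams]   # increases as lam decreases
```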
The integral operator T_ν is a trace-class operator; hence the effective dimension is finite, and we have N_{T_ν}(λ) ≤ Tr(T_ν)/λ. In the subsequent analysis we shall need a relationship between the effective dimensions N_{T_ν}(λ) and N_{L_ν}(λ). For this, the link condition (Assumption 4) is crucial. The arguments will be based on operator monotonicity and concavity. Below, for an operator T, we denote by s_j(T), j = 1, 2, …, the singular numbers of T.
The following assumption was introduced in [15], where it was shown to be satisfied for both moderately ill-posed and severely ill-posed operators.
Assumption 5. There exists a constant C such that for 0 < t ≤ ‖L_ν‖_{L(H)} we have the bound stated in [15]. The relation between the effective dimensions is established in the following proposition, whose proof is given in Appendix A.
Proposition 2.10. Suppose Assumptions 4 and 5 hold true. Suppose the function ℓ from the link condition is such that the function t → (ℓ^{2q})^{−1}(t) is operator concave, and that there is some n ∈ N for which the function t → ℓ^{−1}(t)/t^n is concave. Then there is a constant C for which the corresponding comparison of N_{T_ν} and N_{L_ν} holds.
Remark 2.11. For a power-type function ℓ(t) := t^a the above concavity assumptions hold true whenever 2aq ≥ 1 and n ≤ 1/a ≤ n + 1. In particular, the number n is uniquely determined.
2.4. Regularization schemes. General regularization schemes were introduced and discussed for ill-posed inverse problems and in learning theory (see [17, Section 2.2] and [2, Section 3.1] for a brief discussion). Using the notation from § 2.1, the Tikhonov regularization scheme from (3) can be re-expressed as
u_{z,λ} := argmin_{u ∈ H} { ‖B_x u − y‖²_m + λ‖u‖²_H },
and its minimizer is given by u_{z,λ} = (T_x + λI)^{−1} B_x^* y, so that f_{z,λ} = L^{−1} u_{z,λ}. We consider the following definition.
Definition 2.12 (General regularization). We say that a family of functions g_λ, λ > 0, defined on the relevant spectral interval, is a (general) regularization scheme if there are constants B, D < ∞ such that sup_t |t g_λ(t)| ≤ D and sup_t |g_λ(t)| ≤ B/λ, with residual r_λ(t) := 1 − t g_λ(t) uniformly bounded. For some constant γ_p (independent of λ), the maximal p satisfying the condition |r_λ(t)| t^p ≤ γ_p λ^p is said to be the qualification of the regularization scheme g_λ.
Definition 2.13. The qualification p covers the index function ϕ if the function t → t^p/ϕ(t) is nondecreasing.
We mention the following result.
Proposition 2.14. Suppose ϕ is a nondecreasing index function and the qualification, say p ≥ 1, of the regularization g_λ covers ϕ. Then
sup_t |r_λ(t)| ϕ(t) ≤ c_p ϕ(λ),
and a corresponding bound holds with t replaced by t + λ.
Proof. The first assertion is a restatement of [19, Proposition 3]. For the second assertion we stress that (λ + σ)^p ≤ 2^{p−1}(λ^p + σ^p), which follows from convexity. This yields the second assertion and completes the proof.
Essentially all linear regularization schemes (Tikhonov regularization, Landweber iteration, spectral cut-off) satisfy the properties of general regularization. Inspired by the representation of the minimizer of the Tikhonov functional, we consider a general regularized solution in Hilbert scales, corresponding to the above regularization, in the form
(7) f_{z,λ} = L^{−1} g_λ(T_x) B_x^* y.
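The classical filters can be written down explicitly. The sketch below shows the textbook Tikhonov and spectral cut-off filter functions (shown here only for intuition), together with a numerical check that the Tikhonov residual r_λ(t) = λ/(t + λ) has qualification at least one:

```python
import numpy as np

# Spectral filters g_lam and residuals r_lam(t) = 1 - t * g_lam(t).
def g_tikhonov(t, lam):
    return 1.0 / (t + lam)

def g_cutoff(t, lam):
    # keep spectral components above the threshold lam, discard the rest
    return np.where(t >= lam, 1.0 / np.maximum(t, lam), 0.0)

t = np.linspace(1e-6, 1.0, 1000)
lam = 0.01
r_tik = 1.0 - t * g_tikhonov(t, lam)   # equals lam / (t + lam)
qual_1 = np.max(np.abs(r_tik) * t)     # qualification-1 check: <= lam
```

Tikhonov regularization saturates at qualification one, while spectral cut-off has arbitrarily high qualification; this is the saturation effect that motivates working in Hilbert scales.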

Convergence analysis
Here we study the convergence of general regularization schemes in the Hilbert scale for the linear statistical inverse problem, based on the prior assumptions and the link condition. The analysis will distinguish two cases: the 'regular' one, when f_ρ ∈ D(L), and the 'low smoothness' one, when f_ρ ∉ D(L). In either case we shall first utilize the concept of distance functions; this will later give rise to convergence rates in a more classical style.
For the asymptotic analysis we shall require the standard assumption relating the sample size m and the parameter λ, namely
(8) N_{T_ν}(λ) ≤ mλ and 0 < λ ≤ 1.
It will be seen that condition (8) is asymptotically always satisfied for the parameter which is optimally chosen under known smoothness.
Several probabilistic quantities will be used to express the error bounds. Precisely, for an index function ζ we introduce quantities Ξ_ζ and Ψ (see (10)–(13)). In case ζ(t) = t^r we abbreviate Ξ_{t^r} by Ξ_r, and Ξ_t by Ξ (not to be confused with the power). High-probability bounds for these quantities are known, and will be given in Propositions B.1 and B.2, respectively.
3.1. The oversmoothing case. As mentioned before, we shall use distance functions; these are sometimes called 'approximate source conditions', because they measure the violation of a benchmark smoothness. Here the benchmark will be f_ρ ∈ D(L).

Definition 3.1 (Approximate source condition). We define the distance function
d(R) := min{ ‖f − f_ρ‖_H : f ∈ D(L), ‖Lf‖_H ≤ R }, R > 0.
We denote by f_ρ^R the element which realizes the above minimization problem. Notice the following: if f_ρ ∈ D(L), then for some R the minimizer f_ρ^R of the distance function will obey f_ρ^R = f_ρ.
Remark 3.2. In a rudimentary form, this approach was given in [3, Thm. 6.8]. It was then introduced in regularization theory in [13]. Within learning theory, such a concept was also used in the study [30].
Theorem 3.3. Let z be i.i.d. samples drawn according to the probability measure ρ. Suppose Assumptions 1–5 hold true. Suppose that the qualification p of the regularization g_λ covers the function ℓ (for ℓ(t) from Assumption 4), that t → ℓ^{−1}(t)/t^n is concave for some n ≥ 1, and that t → (ℓ^{2q})^{−1}(t) is operator concave. Then for all 0 < η < 1, and for λ satisfying condition (8), the following upper bound holds for the regularized solution f_{z,λ} from (7) with confidence 1 − η, where C depends on B, D, c_p, κ, n, β, and the constant from Assumption 5.
Proof. For the minimizer f_ρ^R of the distance function defined in (14), the error can be decomposed as in (15). By using Proposition 2.6, the error for the regularized solution can be bounded accordingly, and we shall bound each summand on the right in (15). Using the fact that p covers ℓ, we bound I₃. For the last summand we argue with the quantities Ξ_{1/2} and Ψ, which were introduced in (10) and (13).
The bound from Theorem 3.3 is valid for all R ≥ Σ + κM/N_{T_ν}(1), and we shall now optimize it with respect to the choice of such R. First, if f_ρ ∈ D(L), then there is R ≥ Σ + κM/N_{T_ν}(1) such that d(R) = 0, and the bound simplifies accordingly, with C depending on B, D, c_p, κ, n, β, and the constant from Assumption 5. Otherwise, in the low-smoothness case f_ρ ∉ D(L), we introduce a function Γ, which is non-vanishing and decreasing; hence the inverse Γ^{−1} exists and is decreasing. Given λ > 0, by letting R = R(λ) solve the equation Γ(R) = ℓ(λ), we obtain a corresponding bound, with C depending on the same quantities. The dependency λ → R(λ) can be made explicit when assuming that f_ρ has some smoothness measured in terms of a source condition; see Section 4 below. For Theorem 3.3 we get the error bound (23), but the parameter λ has to obey (8). We will get an explicit error bound in terms of the sample size m in Corollary 4.1.

The regular case.
Here we analyze the rates of convergence in the case when the underlying true solution f ρ belongs to the domain of the operator L. Again, we shall choose a benchmark smoothness function.
With respect to this benchmark we introduce the following distance function.
Definition 3.4 (Approximate source condition). Given q ≥ 1, we define the distance function d_q(R), measuring the violation of the benchmark f_ρ ∈ R(L^{−q}).
Theorem 3.5. Let z be i.i.d. samples drawn according to the probability measure ρ. Suppose Assumptions 1–4 hold true. Let ζ be any index function such that the qualification 1/2 covers ζ. Suppose that the qualification p of the regularization g_λ covers the function ζϕ (for ϕ(t) from Assumption 4). Then for all 0 < η < 1, and for λ satisfying condition (8), the following upper bound holds for the regularized solution f_{z,λ} from (7) with confidence 1 − η. Consequently, we obtain corresponding bounds with confidence 1 − η, where C depends on B, D, c_p, κ, and C′ := 2κM + Σ.
Proof. For the minimizer f_ρ^R of the distance function defined in (24), the error can be decomposed as in (25). First, we estimate the error in the interpolation norm for some index function ζ. For I₂: for the minimizer f_ρ^R = L^{−q} g of the distance function (24), we observe from Proposition 2.6 that a corresponding representation holds. Thus, assuming that the function ϕ = ϕ₂ϕ₁ with ϕ₁ sub-linear and ϕ₂ Lipschitz (with constant one), we continue bounding, and the qualification of the regularization yields the estimate. For I₃: from the arguments used in (20) we get the analogous bound. Overall, using Propositions B.1–B.2 and (26)–(28) in (25), we obtain the bound (29) with confidence 1 − η. The fact that N_{T_ν}(λ) is a decreasing function of λ, together with inequality (8), implies a bound which, combined with (29), yields the first result.
For the last two estimates in Theorem 3.5, by using Proposition 2.6 we obtain the corresponding inequalities. These two upper bounds can now be derived from the general bound by letting ζ := ℓ and ζ(t) := t^{1/2}, respectively. We also use that ℓ² is sub-linear, and this completes the proof.
The bound from Theorem 3.5 is valid for all R ≥ 1, and we shall now optimize it with respect to the choice of R ≥ 1. First, if f_ρ ∈ R(L^{−q}), then d_q(R) = 0 for some R, and the bound simplifies. Otherwise, in case f_ρ ∉ R(L^{−q}), we introduce a function Γ_q, which is non-vanishing and decreasing; hence the inverse Γ_q^{−1} exists and is decreasing. We finally get the main result by letting R = R(λ) solve the equation Γ_q(R) = ϕ(λ).

Smoothness in terms of source-wise representation
Here we shall specify the smoothness of the true solution in terms of the bounded, linear, injective, self-adjoint operator L^{−1}.
Assumption 6 (General source condition). For an index function θ, the true solution f_ρ belongs to the class
Ω(θ, R†) := { f ∈ H : f = θ(L^{−1}) v, ‖v‖_H ≤ R† }.
In the special case when the function θ(t) := t^r is a power function, such a source-wise representation is called Hölder type.
We aim at bounding the distance functions d(R) and d q (R) from the oversmoothing and regular cases, respectively.
4.1. The oversmoothing case. Here the benchmark source condition is linear, and we shall thus assume that the index function θ is sub-linear. The obtained bounds rely on results from [14, Theorem 5.9]. We denote by ι : t → t the identity function, representing the benchmark smoothness index function. Under Assumption 6 we obtain a bound for the distance function d(R). In order to minimize the bound from Theorem 3.3, we balance the competing terms; for the resulting value R(λ), under condition (8), the bound (23) reduces accordingly. The following corollary is a consequence of Theorem 3.3; it provides the explicit error bound under a choice of λ in terms of the sample size m.
Corollary 4.1. Under the assumptions of Theorem 3.3 and Assumption 6 with a sub-linear function θ, and with the a-priori choice of the regularization parameter λ* obtained from solving the equation N_{T_ν}(λ*) = mλ*, for all 0 < η < 1 the corresponding error estimate holds with confidence 1 − η, where C depends on B, D, c_p, κ, n, β, M, Σ, R†, and the constant from Assumption 5.
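The equation N_{T_ν}(λ*) = mλ* always has a unique solution, since the left-hand side decreases in λ while the right-hand side increases. With the usual diagonal surrogate for T_ν (the spectrum below is an illustrative choice of ours), λ* can be found by bisection:

```python
import numpy as np

def effective_dimension(s, lam):
    # N(lam) = sum_j s_j / (s_j + lam) for a diagonal surrogate of T_nu
    return float(np.sum(s / (s + lam)))

def choose_lambda(s, m, lo=1e-12, hi=1.0, iters=200):
    # N(lam) is decreasing, m*lam is increasing: the crossing point is unique.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if effective_dimension(s, mid) > m * mid:
            lo = mid          # N still above the line m*lam: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)

s = 1.0 / np.arange(1, 5001) ** 2    # polynomially decaying toy spectrum
m = 1000
lam_star = choose_lambda(s, m)
```

By construction N_{T_ν}(λ) ≤ mλ holds for all λ ≥ λ*, so this a-priori choice automatically satisfies condition (8).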
We observe that the above parameter choice evidently satisfies condition (8).
4.2. The regular case. In this case the benchmark is given by the index function ι^q, and we shall assume that the given smoothness, measured in terms of θ, is such that ι^q/θ is an index function for 0 < t ≤ κ̃². However, the definition of the distance function R → d_q(R) is non-standard. The target norm is ‖L(f − f_ρ)‖_H and, in order to apply the result from [14, Theorem 5.9], we have to 'rescale' the given smoothness (in terms of the operator L^{−1}) by the factor ‖L^{−1}‖. If Assumption 6 holds true with an index function θ for which the quotient ι^q/θ is an index function (and then so is ι^{q−1}/(θ/ι)), this results in a bound for d_q(R). According to Theorem 3.5 we balance the terms, which yields the value of R(λ). Inserting this bound into Theorem 3.5, we find the corresponding estimate with confidence 1 − η, provided that (8) holds.
The optimization of the bound in inequality (34) depends on which of the last two summands is dominant; we then balance the remaining two terms. This results in the following corollaries for the different choices of the regularization parameter.
Corollary 4.2. Suppose ι^q/θ(t) and (ι^q/θ)(ℓ(t)) N_{T_ν}(t)/t are index functions. Then, under the assumptions of Theorem 3.5 and Assumption 6, with the a-priori choice of the regularization parameter λ* = ϕ^{−1}(1/√m), for all 0 < η < 1 the corresponding upper bound holds with confidence 1 − η, where C depends on B, D, c_p, κ, M, Σ, and R†.
Since by assumption the relevant composed function is an index function, condition (8) holds for m large enough.

4.3.
Taking the behavior of the effective dimension into account. Below, to be specific, we consider the following two behaviors of the decay of the effective dimension, power type and logarithmic type, which are known to hold true in many situations.
Assumption 7 (Polynomial decay condition). There exists a positive constant c > 0 such that N_{T_ν}(λ) ≤ c λ^{−b} for some b ∈ (0, 1].
Assumption 8 (Logarithmic decay condition). There exists a positive constant c > 0 such that N_{T_ν}(λ) ≤ c log(1/λ).
Remark 4.4. We mention that a polynomial decay of the eigenvalues of the covariance operator T_ν yields the polynomial-type behavior of the effective dimension, see [10]. In some situations, however, this behavior is not evident. Lu et al. [16] showed that for the Gaussian-type kernel K₁(x, x′) = xx′ + e^{−8(x−x′)²} with uniform sampling on [0, 1], the effective dimension exhibits log-type behavior (Assumption 8); on the other hand, the kernel K₂(x, x′) = min{x, x′} − xx′ exhibits power-type behavior (Assumption 7).
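Both regimes are easy to observe with diagonal surrogates: a polynomial spectrum s_j = j^{−1/b} produces N(λ) of order λ^{−b}, while an exponential spectrum produces N(λ) of order log(1/λ). The spectra and truncation lengths below are illustrative choices of ours:

```python
import numpy as np

def N(s, lam):
    # effective dimension for a diagonal surrogate spectrum s
    return float(np.sum(s / (s + lam)))

b = 0.5
s_poly = np.arange(1.0, 2e5 + 1) ** (-1.0 / b)   # s_j = j^{-2}: polynomial decay
s_exp = np.exp(-np.arange(1.0, 60.0))            # s_j = e^{-j}: exponential decay

lams = [1e-2, 1e-3, 1e-4]
ratio_poly = [N(s_poly, lam) * lam ** b for lam in lams]          # nearly constant
ratio_log = [N(s_exp, lam) / np.log(1.0 / lam) for lam in lams]   # nearly constant
```

That the two ratio lists stay nearly constant across three decades of λ is exactly the power-type and log-type behavior asserted in Assumptions 7 and 8.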
Table 1. Convergence rates of the regularized solution f_{z,λ} for a ≤ 1/2, aq ≤ p under Assumption 7.

Table 2. Convergence rates of the regularized solution f_{z,λ} for a ≤ 1/2, aq ≤ p under Assumption 8.
In Tables 1 and 2 we present the convergence rates under the specific behaviors of the effective dimension (Assumptions 7 and 8, respectively). For a clear picture of the error analysis, we present the error bounds in the particular case when both the link condition and the source condition are of power type, i.e., ℓ(t) = t^a and θ(t) = t^r for parameters a, r > 0. The qualification of the regularization is denoted by p, as before. The benchmark smoothness is q, where either q = 1 (oversmoothing case) or q > 1 (regular case). Notice that, due to the sub-linearity condition on ℓ², we must have 0 < a ≤ 1/2. Throughout the analysis we assume that the qualification covers the given smoothness, i.e., aq ≤ p. The bounds presented in the tables are consequences of Corollaries 4.1–4.3, respectively; therefore Assumptions 1–6 are assumed to be satisfied for the following results.
The tables are structured as follows. In the first column we present the rate of convergence ε(m) for error estimates of the form ‖f_{z,λ*} − f_ρ‖_H = O(ε(m)), holding with high probability. In the second column, the corresponding order of the regularization parameter choice λ* in terms of m is indicated. In the third and fourth columns we highlight the smoothness of the true solution f_ρ and the benchmark smoothness, respectively. The fifth column presents the parameter involved in the link condition. In the last column we emphasize additional constraints, specifically on the benchmark smoothness.
The first row corresponds to the oversmoothing case, and the last two rows correspond to the regular case. In the regular case we observe that the validity of the rates of convergence depends on the benchmark smoothness through aq. At the intersection point, when aq = ar + (b+1)/2, both rates coincide. As we will see in the next section, the rates of convergence in the regular case (q > 1) are optimal, provided that the benchmark smoothness is chosen appropriately.

Optimality of the error bounds
We shall discuss the optimality of the previously obtained error bounds in the regular case, using the known optimality results from [7]. However, at present the smoothness is measured with respect to the operator T_ν, whereas in [7] this was done with respect to the operator L_ν := A^* I_ν^* I_ν A = L T_ν L. Therefore, the following 'recipe' will be used.
(1) Transfer smoothness as given in terms of L −1 to smoothness in terms of L ν , and (2) Knowing the decay of the singular numbers of the operator T ν inherent in Assumption 7, find the decay of the singular numbers of L ν .
In order to keep the analysis simple and transparent, we confine ourselves to power-type smoothness θ(t) = t^r, 0 < r ≤ q, in Assumption 6, as well as to a power-type link in Assumption 4 with ℓ(t) := t^a for some a > 0.
5.1. Relating smoothness. The link condition is crucial, and the subsequent arguments are of interpolation type, applying the Heinz inequality in the present context. To this end, we require that q is chosen such that aq ≥ 1/2. In this case, Assumption 4 yields, by applying the Heinz inequality with exponent 1/(2aq) ≤ 1, a comparison of ‖T_ν^{1/2} u‖_H and ‖L^{−1/(2a)} u‖_H. Letting v := L^{−1} u we arrive at (35). First, we see from this that a < 1/2, because otherwise L_ν would be continuously invertible. Also, relation (35) allows transferring smoothness r with respect to L^{−1} to L_ν as long as 0 < r ≤ 1/(2a) − 1. In order to treat higher smoothness (in terms of L^{−1}), a lifting condition is unavoidable. This must be consistent with the link from (35). Thus we look for a factor z such that t^{(1/(2a)−1)z} = t^q, yielding z := 2aq/(1 − 2a).
Assumption 9 (lifting condition). The corresponding lifted comparison holds with the factor z from above. Having this lifting, and applying the Heinz inequality (with exponent r/q), yields (36), and a source-wise representation as in Assumption 6 yields a corresponding source-wise representation with respect to the operator L_ν (with a different constant).
5.2. Relating effective dimensions. Here we shall use the following consequence of the link condition in Assumption 4: by squaring the norms we obtain a two-sided operator comparison. The Weyl monotonicity theorem [4, Cor. III.2.3] then yields s_j(L^{−2q}) ≍ s_j(T_ν^{2aq}), j = 1, 2, …, or simplified, s_j(L^{−1}) ≍ s_j^a(T_ν), j = 1, 2, …, by spectral calculus. Here s_j(L^{−1}) and s_j(T_ν) denote the singular numbers of the operators. Similarly, we obtain from (35) that s_j(L_ν) ≍ s_j^{(1−2a)/a}(L^{−1}), and a fortiori that s_j(L_ν) ≍ s_j^{1−2a}(T_ν).
5.3. Lower bound. In order to show the optimality of the error bounds as discussed in Table 1, we shall ensure that the decay of the effective dimension cannot be faster than asserted in Assumption 7.
Assumption 10. There is a constant c > 0 such that the singular numbers of the operator T_ν obey s_j(T_ν) ≥ c j^{−1/b}, j = 1, 2, …
Notice that this yields N_{T_ν}(λ) ≥ c λ^{−b} (up to a constant), so that this is the limiting case for which Assumption 7 holds. The following is reported in [7] for problem (1): under smoothness r with respect to the operator L_ν, and with the singular numbers s_j(L_ν) decaying no faster than j^{−1/b}, the optimal rate is of the stated order. This corresponds to the upper bound discussed in the last row of Table 1, and it shows that the rate is of optimal order.

Conclusion
We summarize the above findings. We investigated regularization in Hilbert scales for the considered inverse problem with general centered noise, assumed to obey a Bernstein-type moment condition. This noise condition is not required when the output space is bounded. We analyzed regularization in a Hilbert scale generated by an unbounded operator L. To do so, we used a link condition to transfer information from $L^{-1}$ to $T_\nu$, the underlying covariance operator.
In the main body, we established error bounds in terms of distance functions, which measure the deviation of the regression function from some benchmark smoothness. These error bounds were then specified for smoothness given in terms of solution smoothness with respect to the operator $L^{-1}$, by bounding the corresponding distance functions. The error estimates are stated as exponential deviation inequalities in terms of the sample size, which hold non-asymptotically in the probabilistic sense. We discussed convergence rates for both the oversmoothing and the regular case under different behaviors of the effective dimension in the reproducing kernel approach. In particular, for the regular case, optimal convergence rates can be achieved with an appropriate choice of benchmark smoothness and an a-priori parameter choice. Although we mainly focused on bounding the reconstruction error $\|f_{z,\lambda} - f_\rho\|_H$, error estimates for the prediction error $\|I_\nu A(f_{z,\lambda} - f_\rho)\|_{L^2(X,\nu;Y)}$ can be derived similarly in terms of the sample size using Theorem 3.5. The optimal parameter choice depends on the unknown parameters a, b, r, reflecting the link condition, the decay of the effective dimension, and the solution smoothness. Therefore a data-driven parameter choice may be required to apply the regularization algorithms in practice. This will be a topic of future research.
Appendix A. Proof of Proposition 2.10

We start with the following technical result.
Lemma A.1. Suppose that the function $\varphi$ from the link condition is such that the function $t \mapsto \varphi^{2q}(\varphi^{-1}(t))$ is operator concave, and that there is some $n \in \mathbb{N}$ for which the function $t \mapsto \varphi^{-1}(t)/t^{n}$ is concave. Under Assumption 4 we have that

Proof. The proof is based on two consequences of Assumption 4, which, in terms of the partial ordering for self-adjoint operators in Hilbert space, can be restated as

Applying the operator concave function $t \mapsto \varphi^{2q}(\varphi^{-1}(t))$ respects the partial ordering, and we obtain that

Letting $u := Lv \in H$, and since by construction

The sub-linearity of $\varphi^{2}$ implies that the function $t \mapsto \varphi^{-1}(t)/t^{2}$ is non-decreasing, such that the operator $\varphi^{-1}(\beta L \ldots)$, $j = 1, 2, \dots$
Applying this theorem to the first inequality in Proposition 2.6 we also find that

To proceed we shall use the sub-linearity of the function $\varphi^{2}$ and the concavity of the function $\varsigma(t) := \varphi^{-1}(t)/t^{n}$. Since $\varsigma$ is concave and non-negative, we have $\varsigma(t) = \varsigma\bigl(\tfrac{1}{\beta}\,\beta t\bigr) \geq \tfrac{1}{\beta}\,\varsigma(\beta t)$, which yields $\varsigma(\beta t) \leq \beta\,\varsigma(t)$ for $\beta \geq 1$, and overall we find that
This, together with the inequalities (37), gives

and the proof is complete.
Since the function $\lambda \mapsto \lambda N_{L_\nu}(\lambda)$ is non-decreasing, we continue to bound (39), which completes the proof.

Appendix B. Probabilistic bounds
In the following proposition, we present standard perturbation inequalities from learning theory, which measure the effect of random sampling in the probabilistic sense. The following two propositions can be proved using the arguments given in Step 2.1 of [10, Thm. 4].

Proof. For $L_x = A^* S_x^* S_x A$ and $L_\nu = A^* I_\nu^* I_\nu A$, with the fact that $T_x - T_\nu = L^{-1}(L_x - L_\nu)L^{-1}$, the proof is based on the following decomposition

and this yields the estimate

We observe that $\|g_\lambda(T_x)(T_x + \lambda I)\|_{\mathcal{L}(H)} \leq B + D$. For the function $\upsilon(t) = t/\varphi(t)$, we can bound $I_1$ as

The terms $I_2$, $I_3$ can be bounded as
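The sampling effect these bounds quantify can be seen in a finite-dimensional toy model (all objects below are illustrative stand-ins, not the operators $T_x$, $T_\nu$ of the paper): the empirical covariance of m draws deviates from the population covariance in operator norm with an error that shrinks as m grows.

```python
import random

def op_norm_sym2(m11, m12, m22):
    """Operator norm (largest |eigenvalue|) of a symmetric 2x2 matrix."""
    tr, det = m11 + m22, m11 * m22 - m12 * m12
    disc = max(tr * tr / 4 - det, 0.0) ** 0.5
    return max(abs(tr / 2 + disc), abs(tr / 2 - disc))

def cov_error(m, rng):
    """||T_x - T_nu|| for m samples of a 2D Gaussian with T_nu = diag(1, 1/4)."""
    xs = [(rng.gauss(0, 1), rng.gauss(0, 0.5)) for _ in range(m)]
    c11 = sum(x * x for x, _ in xs) / m
    c22 = sum(y * y for _, y in xs) / m
    c12 = sum(x * y for x, y in xs) / m
    return op_norm_sym2(c11 - 1.0, c12, c22 - 0.25)

rng = random.Random(0)
# Average over repetitions to expose the O(1/sqrt(m)) trend.
err_small = sum(cov_error(50, rng) for _ in range(200)) / 200
err_large = sum(cov_error(5000, rng) for _ in range(200)) / 200
assert err_large < err_small
```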

Example 2.5. The polynomial function $\varphi(t) = t^{r}$ and the logarithmic function $\varphi(t) = t^{p}\log^{-\nu}(1/t)$ are examples of functions in the class $\mathcal{F}$.
In the present context, we have to assign $r \leftarrow \frac{ar}{1-2a}$ and $b \leftarrow \frac{b}{1-2a}$. This yields a lower bound of the order