Maximal Spaces for Approximation Rates in $\ell^1$-regularization

We study Tikhonov regularization for possibly nonlinear inverse problems with weighted $\ell^1$-penalization. The forward operator, mapping from a sequence space to an arbitrary Banach space, typically an $L^2$-space, is assumed to satisfy a two-sided Lipschitz condition with respect to a weighted $\ell^2$-norm and the norm of the image space. We show that in this setting approximation rates of arbitrarily high H\"older-type order in the regularization parameter can be achieved, and we characterize maximal subspaces of sequences on which these rates are attained. On these subspaces the method also converges with optimal rates in terms of the noise level with the discrepancy principle as parameter choice rule. Our analysis includes the case that the penalty term is not finite at the exact solution ('oversmoothing'). As a standard example we discuss wavelet regularization in Besov spaces $B^r_{1,1}$. In this setting we demonstrate in numerical simulations for a parameter identification problem in a differential equation that our theoretical results correctly predict improved rates of convergence for piecewise smooth unknown coefficients.


Introduction
Institute for Numerical and Applied Mathematics, University of Göttingen, Germany. E-mail: p.miller@math.uni-goettingen.de

In this paper we analyze numerical solutions of ill-posed operator equations $F(x) = g$ with a (possibly nonlinear) forward operator $F$ mapping sequences $x = (x_j)_{j\in\Lambda}$ indexed by a countable set $\Lambda$ to a Banach space $Y$. We assume that only indirect, noisy observations $g^{\mathrm{obs}} \in Y$ of the unknown solution $x^\dagger \in \mathbb{R}^\Lambda$ are available, satisfying a deterministic error bound $\|g^{\mathrm{obs}} - F(x^\dagger)\|_Y \le \delta$. For a fixed sequence of positive weights $(r_j)_{j\in\Lambda}$ and a regularization parameter $\alpha > 0$ we consider Tikhonov regularization of the form
$$\hat{x}_\alpha \in \operatorname*{argmin}_{x \in D} \Big[ \big\|F(x) - g^{\mathrm{obs}}\big\|_Y^2 + \alpha \sum_{j\in\Lambda} r_j |x_j| \Big], \qquad (1)$$
where $D \subset \mathbb{R}^\Lambda$ denotes the domain of $F$. Usually, $x^\dagger$ is a sequence of coefficients with respect to some Riesz basis. One of the reasons why such schemes have become popular is that the penalty term $\alpha \sum_{j\in\Lambda} r_j |x_j|$ promotes sparsity of the estimators $\hat{x}_\alpha$ in the sense that only a finite number of coefficients of $\hat{x}_\alpha$ are non-zero. The latter holds true if $(r_j)_{j\in\Lambda}$ does not decay too fast relative to the ill-posedness of $F$ (see Proposition 3 below). In contrast to [29] and related works, we do not require that $(r_j)_{j\in\Lambda}$ is uniformly bounded away from zero. In particular, this allows us to consider Besov $B^0_{1,1}$-norm penalties given by wavelet coefficients. For an overview of the use of this method for a variety of linear and nonlinear inverse problems in different fields of application we refer to the survey paper [26] and to the special issue [27].
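Minimizers of (1) are commonly computed by the iterative (soft-)thresholding algorithm of Daubechies, Defrise & De Mol discussed below. As a minimal sketch, assuming a linear forward operator on finitely many coefficients (the operator, weights and data here are toy choices, not taken from the paper):

```python
import numpy as np

def ista(A, g_obs, r, alpha, n_iter=500):
    """Minimize ||A x - g_obs||^2 + alpha * sum_j r_j |x_j|
    by iterative soft thresholding (a proximal gradient sketch)."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the data-term gradient
    step = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * 2.0 * A.T @ (A @ x - g_obs)   # gradient step on the data term
        x = np.sign(z) * np.maximum(np.abs(z) - step * alpha * r, 0.0)  # soft threshold
    return x

# For A = I the minimizer is the coefficientwise soft threshold of the data:
x_hat = ista(np.eye(2), np.array([1.0, 0.0]), np.ones(2), alpha=0.2)
```

For $A = I$, $r_j = 1$ and $\alpha = 0.2$ this returns $(0.9,\, 0)$: the small coefficient is set exactly to zero, illustrating the sparsity promotion of the penalty.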
Main contributions: The focus of this paper is on error bounds, i.e. rates of convergence of $\hat{x}_\alpha$ to $x^\dagger$ in some norm as the noise level $\delta$ tends to $0$. Although most results of this paper are formulated for general operators on weighted $\ell^1$-spaces, we are mostly interested in the case that the $x_j$ are wavelet coefficients and $F$ is the composition of a corresponding wavelet synthesis operator $S$ and an operator $G$ defined on a function space. We will assume that $G$ is finitely smoothing in the sense that it satisfies a two-sided Lipschitz condition with respect to function spaces whose smoothness indices differ by a constant $a > 0$ (see Assumption 2 below and Assumption 3 for a corresponding condition on $F$). The class of operators satisfying this condition includes in particular the Radon transform and nonlinear parameter identification problems for partial differential equations with distributed measurements. In this setting Besov $B^r_{1,1}$-norms can be written in the form of the penalty term in (1). In a previous paper [24] we have already addressed sparsity-promoting penalties in the form of Besov $B^0_{p,1}$-norms with $p \in [1,2]$. For $p > 1$ only group sparsity in the levels is enforced, but not sparsity of the wavelet coefficients within each level. As a main result of this paper we demonstrate that the analysis in [24], as well as other works to be discussed below, does not capture the full potential of the estimators (1), i.e. the most commonly used case $p = 1$: Even though the error bounds in [24] are optimal in a minimax sense, more precisely in a worst case scenario in $B^s_{p,\infty}$-balls, we will derive faster rates of convergence for an important class of functions, which includes piecewise smooth functions. The crucial point is that such functions also belong to Besov spaces with larger smoothness index $s$, but smaller integrability index $p < 1$.
These results confirm the intuition that estimators of the form (1), which enforce sparsity also within each wavelet level, should perform well for signals which allow accurate approximations by sparse wavelet expansions.
Furthermore, we prove a converse result, i.e. we characterize the maximal sets on which the estimators (1) achieve a given approximation rate. These maximal sets turn out to be weak weighted $\ell^t$-sequence spaces or real interpolation spaces of Besov spaces, respectively.
Finally, we also treat the oversmoothing case that $\sum_{j\in\Lambda} r_j |x_j^\dagger| = \infty$, i.e. that the penalty term forces the estimators $\hat{x}_\alpha$ to be smoother than the exact solution $x^\dagger$. For wavelet $B^r_{1,1}$ Besov norm penalties, this case may be rather unlikely for $r = 0$, except maybe for delta peaks. However, in the case of the Radon transform our theory requires us to choose $r > \frac12$, and more generally, mildly ill-posed problems in higher spatial dimensions require larger values of $r$ (see eq. (7a) below for details). Then it becomes much more likely that the penalty term fails to be finite at the exact solution, and it is desirable to derive error bounds also for this situation. So far, however, this case has only rarely been considered in variational regularization theory.
Previous works on the convergence analysis of (1): In the seminal paper [11] Daubechies, Defrise & De Mol established the regularizing property of estimators of the form (1) and suggested the so-called iterative thresholding algorithm to compute them. Concerning error bounds, the most favorable case is that the true solution $x^\dagger$ is sparse. In this case the convergence rate is linear in the noise level $\delta$, and sparsity of $x^\dagger$ is not only sufficient but (under mild additional assumptions) even necessary for a linear convergence rate ([21]). However, usually it is more realistic to assume that $x^\dagger$ is only approximately sparse in the sense that it can be well approximated by sparse vectors. More general rates of convergence for linear operators $F$ were derived in [4] based on variational source conditions. The rates were characterized in terms of the growth of the norms of the preimages of the unit vectors under $F^*$ (or relaxations) and the decay of $x^\dagger$. Relaxations of the first condition were studied in [16,17,15]. For error bounds in the Bregman divergence with respect to the $\ell^1$-norm we refer to [5]. In the context of statistical regression by wavelet shrinkage, maximal sets of signals for which a certain rate of convergence is achieved have been studied in detail (see [9]).
In the oversmoothing case one difficulty is that neither variational source conditions nor source conditions based on the range of the adjoint operator are applicable. Whereas oversmoothing in Hilbert scales has been analyzed in numerous papers (see, e.g., [22,23,30]), the literature on oversmoothing for more general variational regularization is sparse. The special case of diagonal operators in $\ell^1$-regularization has been discussed in [20]. In a very recent work, Chen, Hofmann & Yousept [7] have studied oversmoothing for finitely smoothing operators in scales of Banach spaces generated by sectorial operators.
Plan of the remainder of this paper: In the following section we introduce our setting and assumptions and discuss two examples for which these assumptions are satisfied in the wavelet-Besov space setting (2). Sections 3-5 deal with a general sequence space setting. In Section 3 we introduce a scale of weak sequence spaces which can be characterized by the approximation properties of some hard thresholding operator. These weak sequence spaces turn out to be the maximal sets of solutions on which the method (1) attains certain Hölder-type approximation rates. This is shown for the non-oversmoothing case in Section 4 and for the oversmoothing case in Section 5. In Section 6 we interpret our results in the previous sections in the Besov space setting, before we discuss numerical simulations confirming the predicted convergence rates in Section 7.

Setting, Assumptions, and Examples
In the following we describe our setting in detail including assumptions which are used in many of the following results. None of these assumptions is to be understood as a standing assumption, but each assumption is referenced whenever it is needed.

Motivating example: regularization by wavelet Besov norms
In this subsection, which may be skipped on first reading, we provide more details on the motivating example (2): Suppose the operator $F$ is the composition of a forward operator $G$ mapping functions on a domain $\Omega$ to elements of the Hilbert space $Y$ and a wavelet synthesis operator $S$. We assume that $\Omega$ is either a bounded Lipschitz domain in $\mathbb{R}^d$ or the $d$-dimensional torus $(\mathbb{R}/\mathbb{Z})^d$, and that we have a system $(\phi_{j,k})_{(j,k)\in\Lambda}$ of real-valued wavelet functions on $\Omega$.
Here the index set $\Lambda := \{(j,k) : j \in \mathbb{N}_0,\ k \in \Lambda_j\}$ is composed of a family of finite sets $(\Lambda_j)_{j\in\mathbb{N}_0}$ corresponding to levels $j \in \mathbb{N}_0$, and the growth of the cardinalities of these sets is described by the inequalities $2^{jd} \le |\Lambda_j| \le C_\Lambda\, 2^{jd}$ for some constant $C_\Lambda \ge 1$ and all $j \in \mathbb{N}_0$.
For $p, q \in (0,\infty)$ and $s \in \mathbb{R}$ we introduce the sequence spaces
$$b^s_{p,q} := \Big\{ x \in \mathbb{R}^\Lambda : \|x\|_{b^s_{p,q}} := \Big( \sum_{j\in\mathbb{N}_0} 2^{jq\left(s + \frac d2 - \frac dp\right)} \Big( \sum_{k\in\Lambda_j} |x_{j,k}|^p \Big)^{q/p} \Big)^{1/q} < \infty \Big\} \qquad (3)$$
with the usual replacements for $p = \infty$ or $q = \infty$. It is easy to see that the $b^s_{p,q}$ are Banach spaces if $p, q \ge 1$. Otherwise, if $p \in (0,1)$ or $q \in (0,1)$, they are quasi-Banach spaces, i.e. they satisfy all properties of a Banach space except for the triangle inequality, which only holds true in the weaker form $\|x + y\|_{\omega,p} \le C\big(\|x\|_{\omega,p} + \|y\|_{\omega,p}\big)$ with some $C > 1$. We need the following assumption on the relation of the Besov sequence spaces to a family of Besov function spaces $B^s_{p,q}(\Omega)$ via the wavelet synthesis operator $(Sx)(r) := \sum_{(j,k)\in\Lambda} x_{j,k}\, \phi_{j,k}(r)$.
Assumption 1 Let $s_{\max} > 0$. Suppose that $(\phi_{j,k})_{(j,k)\in\Lambda}$ is a family of real-valued functions on $\Omega$ such that the synthesis operator $S : b^s_{p,q} \to B^s_{p,q}(\Omega)$ is a norm isomorphism for all $s \in (-s_{\max}, s_{\max})$ and $p, q \in (0,\infty]$ satisfying $\sigma_p - s_{\max} < s$, where $\sigma_p := d\,\max\{\tfrac1p - 1,\, 0\}$.

Note that $p \ge 1$ implies $\sigma_p = 0$, and therefore $S$ is a quasi-norm isomorphism for $|s| < s_{\max}$ in this case. We refer to the monograph [32] for the definition of Besov spaces $B^s_{p,q}(\Omega)$, different types of Besov spaces on domains with boundaries, and the verification of Assumption 1.
As the main assumption on the forward operator $G$ in function space we suppose that it is finitely smoothing in the following sense:

Assumption 2 Let $a > 0$, let $D_G \subset B^{-a}_{2,2}(\Omega)$ be non-empty and closed, $Y$ a Banach space and $G : D_G \to Y$ a map. Assume that there exists a constant $L \ge 1$ with
$$\frac1L\, \|f_1 - f_2\|_{B^{-a}_{2,2}} \le \|G(f_1) - G(f_2)\|_Y \le L\, \|f_1 - f_2\|_{B^{-a}_{2,2}} \qquad \text{for all } f_1, f_2 \in D_G.$$

Recall that $B^{-a}_{2,2}(\Omega)$ coincides with the Sobolev space $H^{-a}(\Omega)$ with equivalent norms. The first of these inequalities is violated for infinitely smoothing forward operators such as for the backward heat equation or for electrical impedance tomography. In the setting of Assumptions 1 and 2 and for some fixed $r \ge 0$ we study the estimators
$$\hat{f}_\alpha := S\hat{x}_\alpha \quad \text{with} \quad \hat{x}_\alpha \in \operatorname*{argmin}_{x \in S^{-1}(D_G)} \Big[ \big\|G(Sx) - g^{\mathrm{obs}}\big\|_Y^2 + \alpha\, \|x\|_{b^r_{1,1}} \Big]. \qquad (4)$$
We recall two examples of forward operators satisfying Assumption 2 from [24], where further examples are discussed.
The Radon transform, which occurs in computed tomography (CT) and positron emission tomography (PET), among others, is defined by
$$(Rf)(\Theta, s) := \int_{\{y \,:\, y\cdot\Theta = s\}} f(y)\, \mathrm{d}y, \qquad \Theta \in S^{d-1},\ s \in \mathbb{R}.$$
It satisfies Assumption 2 with $a = \frac{d-1}{2}$.

General sequence spaces setting
Let $p \in (0,\infty)$, and let $\omega = (\omega_j)_{j\in\Lambda}$ be a sequence of positive reals indexed by some countable set $\Lambda$. We consider the weighted sequence spaces
$$\ell^p_\omega := \Big\{ x \in \mathbb{R}^\Lambda : \|x\|_{\omega,p} := \Big( \sum_{j\in\Lambda} \omega_j^p\, |x_j|^p \Big)^{1/p} < \infty \Big\}.$$
Note that the Besov sequence spaces $b^s_{p,q}$ defined in (3) are of this form if $p = q < \infty$; more precisely, $b^s_{p,p} = \ell^p_{\omega_{s,p}}$ with equal norms for $(\omega_{s,p})_{(j,k)} = 2^{j(s + \frac d2 - \frac dp)}$. Moreover, the penalty term in (1) is given by $\alpha\, \|\cdot\|_{r,1}$ with the sequence of weights $r = (r_j)_{j\in\Lambda}$. Therefore, we obtain the penalty term $\alpha\, \|\cdot\|_{b^r_{1,1}}$ in (4) for the choice $r_{j,k} := 2^{j(r - \frac d2)}$. We formulate a two-sided Lipschitz condition for forward operators $F$ on general sequence spaces and argue that it follows from Assumptions 1 and 2 in the Besov space setting.
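The identity $b^s_{p,p} = \ell^p_{\omega_{s,p}}$ is easy to verify numerically. A small sketch (the coefficient arrays and parameters are arbitrary illustrative choices):

```python
import numpy as np

def besov_seq_norm(levels, s, p, q, d=1):
    """Besov sequence norm (3): levels[j] holds the coefficients x_{j,k}, k in Lambda_j."""
    terms = [2.0 ** (j * q * (s + d / 2 - d / p)) * np.sum(np.abs(c) ** p) ** (q / p)
             for j, c in enumerate(levels)]
    return sum(terms) ** (1.0 / q)

def weighted_lp_norm(levels, s, p, d=1):
    """Flat weighted l^p norm with weights omega_{s,p,(j,k)} = 2^(j(s + d/2 - d/p))."""
    total = sum(np.sum((2.0 ** (j * (s + d / 2 - d / p)) * np.abs(c)) ** p)
                for j, c in enumerate(levels))
    return total ** (1.0 / p)

levels = [np.array([1.0]), np.array([0.5, -0.25]), np.array([0.1, 0.2, 0.0, -0.3])]
n_besov = besov_seq_norm(levels, s=0.7, p=1.0, q=1.0)   # level-by-level evaluation
n_flat = weighted_lp_norm(levels, s=0.7, p=1.0)         # flat weighted evaluation
```

For $p = q$ the two evaluations agree exactly, as stated above.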
Assumption 3 $a = (a_j)_{j\in\Lambda}$ is a sequence of positive real numbers with $a_j r_j^{-1} \to 0$. Moreover, $D_F \subseteq \ell^2_a$ is closed with $D_F \cap \ell^1_r \neq \emptyset$, and there exists a constant $L > 0$ with
$$\frac1L\, \|x_1 - x_2\|_{a,2} \le \|F(x_1) - F(x_2)\|_Y \le L\, \|x_1 - x_2\|_{a,2} \qquad \text{for all } x_1, x_2 \in D_F.$$

Suppose Assumptions 1 and 2 hold true and that eqs. (7) are satisfied (in particular $a + r > \frac d2$). With $a_{j,k} := 2^{-ja}$ and $r_{j,k} := 2^{j(r - \frac d2)}$, the set $D_F := S^{-1}(D_G)$ is closed, and $F := G \circ S : D_F \to Y$ satisfies the two-sided Lipschitz condition above.
In some of the results we also need the following assumption on the domain D F of the map F .
Assumption 4 $D_F$ is a ball $\{x \in \ell^2_a : \|x\|_{\omega,p} \le \rho\}$ in some $\ell^p_\omega$ space, centered at the origin.

Concerning the closedness condition in Assumption 3, note that such balls are always closed in $\ell^2_a$, as the following argument shows: Let $x^{(k)} \to x$ in $\ell^2_a$ with $\|x^{(k)}\|_{\omega,p} \le \rho$ for all $k$. Then $x^{(k)}$ converges pointwise to $x$, and hence for every finite subset $\Gamma \subset \Lambda$ we have $\sum_{j\in\Gamma} \omega_j^p |x_j|^p = \lim_{k\to\infty} \sum_{j\in\Gamma} \omega_j^p |x_j^{(k)}|^p \le \rho^p$; taking the supremum over all finite $\Gamma$ yields $\|x\|_{\omega,p} \le \rho$. In the case that $D_F$ is a ball centered at some reference solution $x_0 \neq 0$, we may replace the operator $F$ by the operator $x \mapsto F(x + x_0)$. This is equivalent to using the penalty term $\alpha\, \|x - x_0\|_{r,1}$ in (1) with the original operator $F$, i.e. Tikhonov regularization with initial guess $x_0$. Without such a shift, Assumption 4 is violated.

Existence and uniqueness of minimizers
We briefly address the question of existence and uniqueness of minimizers in (1). Existence follows by a standard argument of the direct method of the calculus of variations, as often used in Tikhonov regularization (see, e.g., [31, Thm. 3.22]).

Proposition 3 Suppose Assumption 3 holds true. Then for every $g^{\mathrm{obs}} \in Y$ and $\alpha > 0$ there exists a solution to the minimization problem in (1). If $D_F = \ell^2_a$ and $F$ is linear, then the minimizer is unique.
Proof Let $(x^{(n)})_{n\in\mathbb{N}}$ be a minimizing sequence of the Tikhonov functional. Then $\|x^{(n)}\|_{r,1}$ is bounded. The compactness of the embedding $\ell^1_r \subset \ell^2_a$ (see Proposition 31 in the appendix) implies the existence of a subsequence (w.l.o.g. again the full sequence) converging in $\|\cdot\|_{a,2}$ to some $x \in \ell^2_a$. Then $x \in D_F$ as $D_F$ is closed. The second inequality in Assumption 3 implies $F(x^{(n)}) \to F(x)$ in $Y$. Moreover, for any finite subset $\Gamma \subset \Lambda$ we have $\sum_{j\in\Gamma} r_j |x_j| = \lim_{n\to\infty} \sum_{j\in\Gamma} r_j |x_j^{(n)}| \le \liminf_n \|x^{(n)}\|_{r,1}$, and hence $\|x\|_{r,1} \le \liminf_n \|x^{(n)}\|_{r,1}$. This shows that $x$ minimizes the Tikhonov functional.
In the linear case the uniqueness follows from strict convexity.
Note that Proposition 3 also yields the existence of minimizers in (4) under Assumptions 1 and 2 and eqs. (7). If $F = A : \ell^2_a \to Y$ is linear and satisfies Assumption 3, the usual argument (see, e.g., [29, Lem. 2.1]) shows sparsity of the minimizers as follows: By the first order optimality condition there exists $\xi \in \partial \|\cdot\|_{r,1}(\hat{x}_\alpha)$ such that $\xi$ belongs to the range of the adjoint $A^*$; in particular $\xi \in \ell^2_{a^{-1}}$ and hence $a_j^{-1}|\xi_j| \to 0$. Since $a_j r_j^{-1} \to 0$, we have $a_j \le r_j$ for all but finitely many $j$. Hence, we obtain $|\xi_j| < r_j$, forcing $(\hat{x}_\alpha)_j = 0$, for all but finitely many $j$. Note that for this argument to work it is enough to require that $a_j r_j^{-1}$ is bounded from above. Also the existence of minimizers can be shown under this weaker assumption, using the weak*-topology on $\ell^1_r$ (see [14, Prop. 2.2]).

Weak sequence spaces
In this section we introduce spaces of sequences whose bounded sets will provide the source sets for the convergence analysis in the next sections. We define a specific thresholding map and analyze its approximation properties. Let us first introduce a scale of spaces, part of which interpolates between the spaces $\ell^1_r$ and $\ell^2_a$ involved in our setting. For $t \in (0,2]$ we define weights
$$\omega_t := r^{\frac{2-t}{t}}\, a^{\frac{2t-2}{t}}, \quad \text{i.e.} \quad \omega_{t,j} = r_j^{\frac{2-t}{t}}\, a_j^{\frac{2t-2}{t}} \ \text{ for } j \in \Lambda.$$
Note that $\omega_1 = r$ and $\omega_2 = a$. The next proposition captures the interpolation inequalities we will need later.

Proposition 4 (Interpolation inequality) Let $u, v, t \in (0,2]$ and $\theta \in (0,1)$ with $\frac1t = \frac{1-\theta}{u} + \frac{\theta}{v}$. Then
$$\|x\|_{\omega_t,t} \le \|x\|_{\omega_u,u}^{1-\theta}\, \|x\|_{\omega_v,v}^{\theta}.$$

Proof We use Hölder's inequality with the conjugate exponents $\frac{u}{(1-\theta)t}$ and $\frac{v}{\theta t}$:
$$\sum_{j\in\Lambda} \omega_{t,j}^t\, |x_j|^t = \sum_{j\in\Lambda} \big(\omega_{u,j}|x_j|\big)^{(1-\theta)t}\, \big(\omega_{v,j}|x_j|\big)^{\theta t} \le \|x\|_{\omega_u,u}^{(1-\theta)t}\, \|x\|_{\omega_v,v}^{\theta t}.$$

Remark 5 In the setting of Proposition 4 real interpolation theory yields the stronger statement $\ell^t_{\omega_t} = \big(\ell^u_{\omega_u}, \ell^v_{\omega_v}\big)_{\theta,t}$ with equivalent quasi-norms (see, e.g., [19, Theorem 2]). The stated interpolation inequality is a consequence.
For $t \in (0,2)$ we define a weak version of the space $\ell^t_{\omega_t}$.

Definition 6 (Source sets) Let $t \in (0,2)$. We define
$$k_t := \big\{ x \in \mathbb{R}^\Lambda : \|x\|_{k_t} < \infty \big\}, \qquad \|x\|_{k_t} := \sup_{\alpha > 0}\, \alpha\, \Big( \sum_{\{j \,:\, |x_j| > \alpha r_j a_j^{-2}\}} r_j^2\, a_j^{-2} \Big)^{1/t}.$$

Remark 7
The functionals $\|\cdot\|_{k_t}$ are quasi-norms. The quasi-Banach spaces $k_t$ are weighted Lorentz spaces. They appear as real interpolation spaces between weighted $\ell^p$ spaces.
To be more precise, $k_t = \big(\ell^u_{\omega_u}, \ell^v_{\omega_v}\big)_{\theta,\infty}$ with equivalence of quasi-norms for $u, v, t$ and $\theta$ as in Proposition 4 (see [19, Theorem 2]).

Remark 8 Remark 5 and Remark 7 predict an embedding $\ell^t_{\omega_t} \subset k_t$.
Indeed, the Markov-type inequality
$$\alpha^t \sum_{\{j \,:\, |x_j| > \alpha r_j a_j^{-2}\}} r_j^2\, a_j^{-2} \;\le\; \sum_{\{j \,:\, |x_j| > \alpha r_j a_j^{-2}\}} \big(\omega_{t,j}\,|x_j|\big)^t \;\le\; \|x\|_{\omega_t,t}^t$$
shows $\|x\|_{k_t} \le \|x\|_{\omega_t,t}$. For $a_j = r_j = 1$ we obtain the weak $\ell^t$-spaces $k_t = \ell^{t,\infty}$ that appear in nonlinear approximation theory (see e.g. [10], [8]). We finish this section by defining a specific nonlinear thresholding procedure depending on $r$ and $a$ whose approximation theory is characterized by the spaces $k_t$. This characterization is the core of the proofs in the following sections. The statement is [10, Theorem 7.1] for weighted sequence spaces. For the sake of completeness we present an elementary proof based on a partition trick that already appears in the proof of [10, Theorem 4.2]. Let $\alpha > 0$. We consider the map $T_\alpha : \mathbb{R}^\Lambda \to \mathbb{R}^\Lambda$,
$$\big(T_\alpha(x)\big)_j := \begin{cases} x_j & \text{if } |x_j| > \alpha\, r_j\, a_j^{-2}, \\ 0 & \text{otherwise.} \end{cases}$$
If $a_j r_j^{-1}$ is bounded from above, then $a_j^{-2} r_j^2$ is bounded away from zero. Hence, in this case the set of $j \in \Lambda$ with $a_j^{-2} r_j \alpha < |x_j|$ is finite, i.e. $T_\alpha(x)$ has only finitely many nonvanishing coefficients whenever $x \in \ell^2_a$.

Lemma 9 (Approximation rates for $T_\alpha$) Let $t \in (0,2)$ and $x \in k_t$. Then there is a constant $C$ depending only on $t$ such that for all $\alpha > 0$
$$\|(I - T_\alpha)x\|_{a,2}^2 \le C\, \alpha^{2-t}\, \|x\|_{k_t}^t \qquad \text{and, if } t < 1, \qquad \|(I - T_\alpha)x\|_{r,1} \le C\, \alpha^{1-t}\, \|x\|_{k_t}^t.$$

Proof We use a partitioning to estimate the first inequality; a similar estimation yields the second.

Corollary 10 Assume $a_j r_j^{-1}$ is bounded from above. Let $0 < t < p \le 2$. Then $k_t \subset \ell^p_{\omega_p}$. More precisely, there is a constant $M > 0$ depending on $t$, $p$ and $\sup_{j\in\Lambda} a_j r_j^{-1}$ such that $\|\cdot\|_{\omega_p,p} \le M\, \|\cdot\|_{k_t}$.
Proof Let $x \in k_t$. The assumption implies the existence of a constant $c > 0$ with $c \le a_j^{-2} r_j^2$ for all $j \in \Lambda$. Let $\alpha > 0$. Then a partitioning argument as in the proof of Lemma 9 yields the claim.

Remark 11 (Connection to best $N$-term approximation) For a better understanding of the source sets we sketch another characterization of $k_t$. For $x \in \mathbb{R}^\Lambda$ let $S(x) := \sum_{j\in\operatorname{supp}(x)} r_j^2\, a_j^{-2}$. Note that for $a_j = r_j = 1$ we simply have $S(x) = \#\operatorname{supp}(x)$. Then for $N > 0$ one defines the best approximation error by
$$e_N(x) := \inf\big\{ \|x - z\|_{a,2} : z \in \mathbb{R}^\Lambda,\ S(z) \le N \big\}.$$
Using arguments similar to those in the proof of Lemma 22 one can show that $x \in k_t$ if and only if $\sup_{N>0} N^{\frac1t - \frac12}\, e_N(x) < \infty$.
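The thresholding map $T_\alpha$ and the approximation rates of Lemma 9 are easy to observe numerically. A sketch with trivial weights $a_j = r_j = 1$ and the illustrative sequence $x_j = j^{-3/2}$ (our own toy choice), which lies in $k_t$ for $t = 2/3$, so that $\|(I - T_\alpha)x\|_{a,2}^2$ should scale like $\alpha^{2-t} = \alpha^{4/3}$:

```python
import numpy as np

def T_alpha(x, a, r, alpha):
    """Hard thresholding: keep x_j exactly where |x_j| > alpha * r_j / a_j**2."""
    return np.where(np.abs(x) > alpha * r / a ** 2, x, 0.0)

j = np.arange(1, 1001, dtype=float)
x = j ** -1.5                 # approximately sparse: polynomially decaying coefficients
a = np.ones_like(x)
r = np.ones_like(x)

def tail_sq(alpha):
    """Squared l^2 error of the thresholded approximation."""
    return np.sum((x - T_alpha(x, a, r, alpha)) ** 2)

# The ratio tail_sq(alpha) / alpha^(4/3) should be roughly constant in alpha:
ratios = [tail_sq(al) / al ** (4.0 / 3.0) for al in (1e-2, 1e-3)]
```

Only finitely many coefficients survive each threshold, and the error ratios at the two values of $\alpha$ nearly coincide, in line with the predicted Hölder rate.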

Convergence Rates via Variational Source Conditions
We prove rates of convergence for the regularization scheme (1) based on variational source conditions. The latter are necessary and often sufficient conditions for rates of convergence for Tikhonov regularization and other regularization methods ([31,13,25]). For $\ell^1$-norms these conditions are typically of the form
$$\beta\, \|x - x^\dagger\|_{r,1} \;\le\; \|x\|_{r,1} - \|x^\dagger\|_{r,1} + \psi\big(\|F(x) - F(x^\dagger)\|_Y\big) \qquad \text{for all } x \in D, \qquad (9)$$
with $\beta \in [0,1]$ and $\psi : [0,\infty) \to [0,\infty)$ a concave, strictly increasing function with $\psi(0) = 0$. The common starting point of the verifications of (9) in the references [4,16,15,24], which have already been discussed in the introduction, is a splitting of the left hand side of (9) into two summands according to a partition of the index set into low level and high level indices. The key difference to our verification in [24] is that this partition will be chosen adaptively to $x^\dagger$ below. This possibility is already mentioned, but not further exploited, in the works cited above.

variational source conditions
We start with a Bernstein-type inequality.
The following lemma characterizes variational source conditions (9) for the embedding operator $\ell^1_r \hookrightarrow \ell^2_a$ (if $a_j r_j^{-1} \to 0$) and power-type functions $\psi$ with $\beta = 1$ and $\beta = 0$ in terms of the weak sequence spaces $k_t$ in Definition 6:

Lemma 13 (variational source condition for embedding operator) Assume $x^\dagger \in \ell^1_r$ and $t \in (0,1)$. The following statements are equivalent:
(i) $x^\dagger \in k_t$.
(ii) There exists a constant $K > 0$ such that (10) holds for all $x \in \ell^1_r$.
(iii) There exists a constant $K > 0$ such that (11) holds.
More precisely, the constants in (ii) and (iii) can be controlled in terms of $\|x^\dagger\|_{k_t}$, and vice versa.

Proof First we assume (i). For $\alpha > 0$ we consider $P_\alpha$ as defined in Lemma 12. Let $x \in D \cap \ell^1_r$. By splitting all three norm terms on the left hand side of (10) by $\|\cdot\|_{r,1} = \|P_\alpha \cdot\|_{r,1} + \|(I - P_\alpha)\cdot\|_{r,1}$ and using the triangle inequality for the $(I - P_\alpha)$ terms and the reverse triangle inequality for the $P_\alpha$ terms (see [4, Lemma 5.1]), we obtain a bound whose first summand we handle with Lemma 12. Note that $P_\alpha x^\dagger = T_\alpha(x^\dagger)$. Hence, Lemma 9 yields a bound on the remaining terms. Inserting the last two inequalities and choosing $\alpha$ appropriately proves (ii) and (iii).

It remains to show that (iii) implies (i). Let $\alpha > 0$. We define $x$ by thresholding $x^\dagger$. Then $|x_j| \le |x_j^\dagger|$ for all $j \in \Lambda$; hence $x \in \ell^1_r$. Estimating both sides of (11) for this choice of $x$ and rearranging terms yields (i).

Theorem 14 (variational source condition) Suppose Assumption 3 holds true and let $t \in (0,1)$, $\rho > 0$ and $x^\dagger \in D$. If $\|x^\dagger\|_{k_t} \le \rho$, then the variational source condition (12) holds true with a constant $C_{\mathrm{vsc}}$ depending only on $t$, $L$ and $\rho$.

Proof Corollary 10 implies $x^\dagger \in D \cap \ell^1_r$. The first claim follows from the first inequality in Assumption 3 together with Lemma 13. The second inequality in Assumption 3 together with Assumption 4 implies statement (iii) in Lemma 13 with $K = L^{\frac{2-2t}{2-t}}\, C_{\mathrm{vsc}}$. Therefore, Lemma 13 yields the second claim.

Rates of Convergence
In this section we formulate and discuss bounds on the reconstruction error which follow from the variational source condition (12).

Theorem 15 (Convergence rates) Suppose Assumption 3 holds true, and let $t \in (0,1)$, $\rho > 0$ and $\|x^\dagger\|_{k_t} \le \rho$.
1. (error splitting) Every minimizer $\hat{x}_\alpha$ of (1) satisfies the error splitting (15) for all $\alpha > 0$ with a constant $C_e$ depending only on $t$ and $L$.
2. (rates with a-priori choice of $\alpha$) If $\delta > 0$ and $\alpha$ is chosen according to the a-priori rule (16), then every minimizer $\hat{x}_\alpha$ of (1) satisfies the rate (17) with a constant $C_p$ depending only on $c_1$, $c_2$, $t$ and $L$.
3. (rates with discrepancy principle) Let $\delta > 0$ and let $\hat{x}_\alpha$ be chosen by the discrepancy principle (18) with parameters $1 < \tau_1 \le \tau_2$; then the rate (19) holds. Here $C_d > 0$ denotes a constant depending only on $\tau_2$, $t$ and $L$.
We discuss our results in the following series of remarks:

Remark 16
The proof of Theorem 15 makes no use of the second inequality in Assumption 3.

Remark 17 (error bounds in intermediate norms)
Invoking the interpolation inequalities of Proposition 4 allows us to combine the bounds in the norms $\|\cdot\|_{r,1}$ and $\|\cdot\|_{a,2}$ into bounds in $\|\cdot\|_{\omega_p,p}$ for $p \in (t,1]$. In the setting of Theorem 15(2.) or (3.) we obtain the corresponding rate with $C = C_p$ or $C = C_d$, respectively.
Remark 18 (Limit $t \to 1$) Let us consider the limiting case $t = 1$ by assuming only $x^\dagger \in \ell^1_r \cap D_F$. Then it is well known that the parameter choice $\alpha \sim \delta^2$, as well as the discrepancy principle as in Theorem 15(3.), leads to bounds of the form $\|F(\hat{x}_\alpha) - g^{\mathrm{obs}}\|_Y \le C\delta$ with bounded $\|\hat{x}_\alpha\|_{r,1}$. As above, Assumption 3 allows us to transfer these to a bound $\|x^\dagger - \hat{x}_\alpha\|_{a,2} \le \tilde{C}\delta$. Interpolating as in the last remark yields bounds in the intermediate norms.

Remark 19 (Limit $t \to 0$) Note that in the limit $t \to 0$ the convergence rates get arbitrarily close to the linear convergence rate $O(\delta)$, i.e., in contrast to standard quadratic Tikhonov regularization in Hilbert spaces, no saturation effect occurs. This is also the reason why we always obtain optimal rates with the discrepancy principle, even for smooth solutions $x^\dagger$. As already mentioned in the introduction, the formal limiting rate for $t \to 0$, i.e. a linear convergence rate in $\delta$, occurs if and only if $x^\dagger$ is sparse, as shown by different methods in [21].
We finish this subsection by showing that the convergence rates (15), (17), and (19) are optimal in a minimax sense.
Proposition 20 (Optimality) Suppose that Assumption 3 holds true. Assume furthermore that there are $c_0 > 0$, $q \in (0,1)$ such that for every $\eta \in (0, c_0)$ there is $j \in \Lambda$ satisfying $q\eta \le a_j r_j^{-1} \le \eta$. Let $p \in (0,2]$, $t \in (0,p)$ and $\rho > 0$. Suppose $D$ contains all $x \in k_t$ with $\|x\|_{k_t} \le \rho$. Consider an arbitrary reconstruction method described by a mapping $R : Y \to \ell^1_r$ approximating the inverse of $F$. Then the worst case error under the a-priori information $\|x^\dagger\|_{k_t} \le \rho$ is bounded from below as in (20).

Proof It is a well-known fact that the left hand side of (20) is bounded from below by $\frac12\, \Omega(2\delta, \rho)$ with the modulus of continuity $\Omega(\delta, \rho)$. By assumption there exists $j_0 \in \Lambda$ with $a_{j_0} r_{j_0}^{-1}$ of the required size. Choosing $x$ with $x_j = 0$ for $j \neq j_0$ such that $\|x\|_{k_t} = \rho$ and $\|x\|_{a,2} \le 2L^{-1}\delta$, we estimate $\Omega(2\delta, \rho)$ from below, which yields the claim.

Note that for $\Lambda = \mathbb{N}$ the additional assumption in Proposition 20 is satisfied if $a_j r_j^{-1} \sim \tilde{q}^j$ for $\tilde{q} \in (0,1)$ or if $a_j r_j^{-1} \sim j^{-\kappa}$ for $\kappa > 0$, but violated if $a_j r_j^{-1} \sim \exp(-j^2)$.
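For a diagonal linear operator $(Ax)_j = a_j x_j$ the minimizer of (1) is an explicit coordinatewise soft threshold, which makes the discrepancy principle easy to illustrate. A sketch under these assumptions (the operator, weights, noise and the parameter $\tau$ are our own toy choices):

```python
import numpy as np

def argmin_diag(g, a, r, alpha):
    """Coordinatewise minimizer of sum_j (a_j x_j - g_j)^2 + alpha * sum_j r_j |x_j|."""
    z = g / a
    thresh = alpha * r / (2.0 * a ** 2)
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def discrepancy_alpha(g_obs, a, r, delta, tau=1.5, alpha0=10.0, q=0.8, max_iter=10000):
    """Decrease alpha geometrically until the residual drops below tau * delta."""
    alpha = alpha0
    for _ in range(max_iter):
        x = argmin_diag(g_obs, a, r, alpha)
        if np.linalg.norm(a * x - g_obs) <= tau * delta:
            break
        alpha *= q
    return alpha, x

a = np.array([1.0, 0.5, 0.25, 0.125])      # decaying singular values: ill-posedness
r = np.ones(4)
x_true = np.array([1.0, -1.0, 0.0, 0.0])   # sparse exact solution
delta = 0.05
g_obs = a * x_true + delta * np.array([1.0, 0.0, 0.0, 0.0])  # deterministic noise
alpha_star, x_hat = discrepancy_alpha(g_obs, a, r, delta)
```

The returned reconstruction satisfies the residual bound of the discrepancy principle and keeps the zero coefficients of the sparse exact solution exactly zero.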

Converse Result
As a main result, we now prove that the condition $x^\dagger \in k_t$ is necessary and sufficient for the Hölder-type approximation rate $O(\alpha^{1-t})$:

Theorem 21 (converse result for exact data) Suppose Assumptions 3 and 4 hold true. Let $x^\dagger \in D_F \cap \ell^1_r$, $t \in (0,1)$, and let $(\hat{x}_\alpha)_{\alpha>0}$ be minimizers of (1) for exact data $g^{\mathrm{obs}} = F(x^\dagger)$. Then the following statements are equivalent:
(i) $x^\dagger \in k_t$;
(ii) $\|x^\dagger - \hat{x}_\alpha\|_{r,1} \le C_2\, \alpha^{1-t}$ for all $\alpha > 0$;
(iii) $\|F(x^\dagger) - F(\hat{x}_\alpha)\|_Y \le C_3\, \alpha^{\frac{2-t}{2}}$ for all $\alpha > 0$.
More precisely, we can choose $C_2 := c\, \|x^\dagger\|_{k_t}^t$ and $C_3 := \sqrt{2C_2}$, and conversely bound $\|x^\dagger\|_{k_t}$ in terms of $C_3$, with a constant $c > 0$ that depends on $L$ and $t$ only.
Proof (i) ⇒ (ii): By Theorem 15(1.) for $\delta = 0$. (ii) ⇒ (iii): As $\hat{x}_\alpha$ is a minimizer of (1) we have $\|F(\hat{x}_\alpha) - F(x^\dagger)\|_Y^2 \le \alpha\big(\|x^\dagger\|_{r,1} - \|\hat{x}_\alpha\|_{r,1}\big) \le \alpha\, \|x^\dagger - \hat{x}_\alpha\|_{r,1}$. Multiplying by 2 and taking square roots on both sides yields (iii). (iii) ⇒ (i): The strategy is to prove that $\|F(x^\dagger) - F(\hat{x}_\alpha)\|_Y$ is an upper bound on $\|x^\dagger - T_\alpha(x^\dagger)\|_{a,2}$ up to a constant and a linear change of $\alpha$, and then to proceed using Lemma 9.
As an intermediate step we first consider the minimization problem
$$z_\alpha \in \operatorname*{argmin}_{z\in\mathbb{R}^\Lambda} \Big[ \|x^\dagger - z\|_{a,2}^2 + \alpha\, \|z\|_{r,1} \Big]. \qquad (21)$$
The minimizer can be calculated in each coordinate separately by soft thresholding. It remains to find a bound on $\|x^\dagger - z_\alpha\|_{a,2}$ in terms of $\|F(x^\dagger) - F(\hat{x}_\alpha)\|_Y$. Let $\alpha > 0$, $\beta := 2L^2\alpha$ and $z_\alpha$ be given by (21). Using Assumption 3 and subtracting $\alpha\|z_\alpha\|_{r,1}$, and then — since $z_\alpha \in D_F$ due to Assumption 4 — inserting $z_\alpha$ into the minimality property of $\hat{x}_\beta$, using the other inequality in Assumption 3, subtracting $\beta\|z_\alpha\|_{r,1}$ and dividing by $\beta$, we end up with the bound (22) on $\|x^\dagger - z_\alpha\|_{a,2}^2$. We insert the last inequality into (22), subtract $\frac14\|x^\dagger - z_\alpha\|_{a,2}^2$, multiply by 4 and take the square root. Together with the first step, the hypothesis (iii) and the definition of $\beta$ we achieve the desired bound. Finally, Lemma 9 yields $x^\dagger \in k_t$ with the stated bound on $\|x^\dagger\|_{k_t}$.

Convergence analysis for $x^\dagger \notin \ell^1_r$

We turn to the oversmoothed setting where the unknown solution $x^\dagger$ does not admit a finite penalty value. An important ingredient of most variational convergence proofs for Tikhonov regularization is a comparison of the Tikhonov functional at the minimizer and at the exact solution. In the oversmoothing case such a comparison is obviously not useful. As a substitute, one may use a family of approximations of $x^\dagger$ at which the penalty functional is finite; see also [22] and [23], where this idea is used and the approximations are called auxiliary elements. Here we will use $T_\alpha(x^\dagger)$ for this purpose. We first show that the spaces $k_t$ can not only be characterized in terms of the approximation errors $\|(I - T_\alpha)(\cdot)\|_{\omega_p,p}$ as in Lemma 9, but also in terms of $\|T_\alpha(\cdot)\|_{r,1}$:

Lemma 22 Let $t \in (1,2)$. Then $x \in k_t$ if and only if $\sup_{\alpha>0} \alpha^{t-1}\, \|T_\alpha x\|_{r,1} < \infty$. More precisely, we can bound
$$\|T_\alpha x\|_{r,1} \le C\, \alpha^{1-t}\, \|x\|_{k_t}^t \qquad \text{for all } \alpha > 0,$$
with a constant $C$ depending only on $t$, and conversely.

Proof As in the proof of Lemma 9 we use a partitioning. Assuming $x \in k_t$ we obtain the upper bound; vice versa, an analogous estimate bounds $\|x\|_{k_t}$.

The following lemma provides a bound on the minimal value of the Tikhonov functional. From this we deduce bounds on the distance between $T_\alpha(x^\dagger)$ and the minimizers of (1) in $\|\cdot\|_{a,2}$ and in $\|\cdot\|_{r,1}$.
Lemma 23 (preparatory bounds) Let $t \in (1,2)$, $\delta \ge 0$ and $\rho > 0$. Suppose Assumptions 3 and 4 hold true. Assume $x^\dagger \in D_F$ with $\|x^\dagger\|_{k_t} \le \rho$ and $g^{\mathrm{obs}} \in Y$ with $\|g^{\mathrm{obs}} - F(x^\dagger)\|_Y \le \delta$. Then there exist constants $C_t$, $C_a$ and $C_r$ depending only on $t$ and $L$ such that the bounds (23), (24) and (25) hold for all $\alpha > 0$ and all minimizers $\hat{x}_\alpha$ of (1).
Proof Due to Assumption 4 we have $T_\alpha(x^\dagger) \in D$. Therefore, we may insert $T_\alpha(x^\dagger)$ into (1) to start with (26). Lemma 22 provides the bound $\alpha\, \|T_\alpha(x^\dagger)\|_{r,1} \le C_1\, \rho^t\, \alpha^{2-t}$ for the second summand on the right hand side, with a constant $C_1$ depending only on $t$.
In the following we estimate the first summand on the right hand side. Let $\varepsilon > 0$. By the second inequality in Assumption 3 and Lemma 9 we obtain a bound with a constant $C_2$ depending on $L$ and $t$. Inserting this into (26) yields (23). We then use (27), the first inequality in Assumption 3, and neglect the penalty term in (23) to estimate (24), with a constant depending only on $t$ and $L$. Neglecting the data fidelity term in (23) yields (25).

The next result is a converse type result for image space bounds with exact data. In particular, we see that Hölder-type image space error bounds are determined by Hölder-type bounds on the whole Tikhonov functional at the minimizers, and vice versa.
Theorem 24 (converse result for exact data) Suppose Assumptions 3 and 4 hold true. Let $t \in (1,2)$, $x^\dagger \in D_F$ and $(\hat{x}_\alpha)_{\alpha>0}$ a choice of minimizers in (1) with $g^{\mathrm{obs}} = F(x^\dagger)$. The following statements are equivalent: (i) $x^\dagger \in k_t$; (ii) the Tikhonov functional at $\hat{x}_\alpha$ is bounded by $C_2\, \alpha^{2-t}$ for all $\alpha > 0$; (iii) $\|F(x^\dagger) - F(\hat{x}_\alpha)\|_Y \le C_3\, \alpha^{\frac{2-t}{2}}$ for all $\alpha > 0$. More precisely, we can choose $C_2 = C_t\, \|x^\dagger\|_{k_t}^t$ with $C_t$ from Lemma 23 and $C_3 = \sqrt{2C_2}$, and conversely bound $\|x^\dagger\|_{k_t}$ in terms of these constants.

The following theorem shows that we obtain order optimal convergence rates on $k_t$ also in the case of oversmoothing (see Proposition 20).
Theorem 25 (rates of convergence) Suppose Assumptions 3 and 4 hold true. Let $t \in (1,2)$, $p \in (t,2]$ and $\rho > 0$. Assume $x^\dagger \in D_F$ with $\|x^\dagger\|_{k_t} \le \rho$.
1. (bias bound) Let $\alpha > 0$. For exact data $g^{\mathrm{obs}} = F(x^\dagger)$ every minimizer $\hat{x}_\alpha$ of (1) satisfies a bias bound with a constant $C_b$ depending only on $p$, $t$ and $L$.
2. (rate with a-priori choice of $\alpha$) Let $\delta > 0$, let $g^{\mathrm{obs}} \in Y$ satisfy $\|g^{\mathrm{obs}} - F(x^\dagger)\|_Y \le \delta$ and $0 < c_1 < c_2$. If $\alpha$ is chosen according to the corresponding a-priori rule, then every minimizer $\hat{x}_\alpha$ of (1) satisfies the rate with a constant $C_c$ depending only on $c_1$, $c_2$, $p$, $t$ and $L$.
3. (rate with discrepancy principle) Let $\delta > 0$ and $g^{\mathrm{obs}} \in Y$ satisfy $\|g^{\mathrm{obs}} - F(x^\dagger)\|_Y \le \delta$, and let $\hat{x}_\alpha$ be chosen by the discrepancy principle. Here $C_d > 0$ denotes a constant depending only on $\tau_1$, $\tau_2$, $p$, $t$ and $L$.
Proof 1. By Proposition 4 we have $\|\cdot\|_{\omega_p,p} \le \|\cdot\|_{a,2}^{\frac{2p-2}{p}}\, \|\cdot\|_{r,1}^{\frac{2-p}{p}}$. With this we interpolate between (24) and (25) with $\delta = 0$ to obtain a bound on $\|\hat{x}_\alpha - T_\alpha(x^\dagger)\|_{\omega_p,p}$. By Lemma 9 there is a constant $K_2$ depending only on $p$ and $t$ bounding $\|(I - T_\alpha)x^\dagger\|_{\omega_p,p}$. Hence the triangle inequality yields the claim.
2. Inserting the parameter choice rule into (24) and (25) yields corresponding bounds. As above, we interpolate these two inequalities to obtain (29). We insert the parameter choice into (29) and apply the triangle inequality as in part 1 to obtain the claim.
3. Let $\varepsilon > 0$. By Lemma 9 there exists a constant $K_4$ depending only on $t$ such that the corresponding approximation bound holds. We make use of the elementary inequality $(a+b)^2 \le (1+\varepsilon)a^2 + (1+\varepsilon^{-1})b^2$, which is proven by expanding the square and applying Young's inequality to the mixed term. Together with the second inequality in Assumption 3 we estimate the data fidelity term at $T_\beta(x^\dagger)$. By inserting $T_\beta(x^\dagger)$ into the Tikhonov functional we conclude $\|\hat{x}_\alpha\|_{r,1} \le \|T_\beta(x^\dagger)\|_{r,1}$. Together with Lemma 22 we obtain the bound (30) with a constant $K_5$ that depends only on $\tau$, $t$ and $L$. Using (30) and the first inequality in Assumption 3 we estimate $\|\hat{x}_\alpha - x^\dagger\|_{a,2}$. As above, interpolation yields a bound in $\|\cdot\|_{\omega_p,p}$. Finally, Lemma 9 together with the choice of $\beta$ yields, for a constant $K_8$ that depends only on $\tau$, $p$, $t$ and $L$, the claimed rate.

Wavelet Regularization with Besov Spaces Penalties
In the sequel we apply the results developed in the general sequence space setting to obtain convergence rates for wavelet regularization with a Besov $B^r_{1,1}$-norm penalty.

Suppose Assumptions 1 and 2 and eqs. (7) hold true. For $s > -a$ we set
$$t_s := \frac{2a + 2r}{2a + r + s}. \qquad (31)$$
Then
we obtain $b^s_{t_s,t_s} = \ell^{t_s}_{\omega_{t_s}}$ with equal norms for $\omega_{t_s}$ given by (8). For $s \in (0,\infty)$ we have $t_s \in (0,2)$, and $t_s < 1$ if and only if $s > r$. The following lemma defines and characterizes a function space $K_{t_s}$ as the counterpart of $k_{t_s}$ for $s > 0$. As spaces $b^s_{p,q}$ and $B^s_{p,q}(\Omega)$ with $p < 1$ are involved, let us first argue that within the scale $b^s_{t_s,t_s}$ for $s > 0$ the extra condition $\sigma_{t_s} - s_{\max} < s$ in Assumption 1 is always satisfied if we assume $a + r > \frac d2$. To this end let $0 < s < s_{\max}$. Then
$$\sigma_{t_s} = d\,\Big(\frac{1}{t_s} - 1\Big)_+ = d\,\frac{(s - r)_+}{2a + 2r} \le \frac{d\, s}{2a + 2r} < s.$$
Hence, $\sigma_{t_s} - s_{\max} < 0 < s$.
Lemma 26 (Maximal approximation spaces $K_{t_s}$) Let $a, s > 0$ and suppose that Assumption 1 and eqs. (7a) and (7b) hold true. We define $K_{t_s} := S(k_{t_s})$ with $\|f\|_{K_{t_s}} := \|S^{-1} f\|_{k_{t_s}}$ and $t_s$ given by (31). Let $s < u < s_{\max}$. The space $K_{t_s}$ coincides with the real interpolation space (32) with equivalent quasi-norms, and the inclusions (33) hold true with continuous embeddings.

Proof For $s < u < s_{\max}$ we have $k_{t_s} = \big(b^{-a}_{2,2}, b^u_{t_u,t_u}\big)_{\theta,\infty}$ with equivalent quasi-norms (see Remark 7). By the functor properties of real interpolation (see [3, Thm. 3.1.2]) this translates to (32). As discussed above, we use $a + r > \frac d2$ (see (7a)) to see that $u \in (\sigma_{t_u} - s_{\max}, s_{\max})$, so that $S : b^u_{t_u,t_u} \to B^u_{t_u,t_u}(\Omega)$ is well defined and bijective. By Remark 8 we have $b^s_{t_s,t_s} \subset k_{t_s}$ with continuous embedding, implying the first inclusion in (33). Moreover, we have $t_u \le \frac{2a+2r}{2a+r} \le 2$, and the remaining continuous embeddings follow.

Theorem 27 (Convergence rates) Suppose Assumptions 1 and 2 hold true with $\frac d2 - r < a < s_{\max}$ and $b^r_{1,1} \cap S^{-1}(D_G) \neq \emptyset$. Let $0 < s < s_{\max}$ with $s \neq r$, $\rho > 0$, and let $\|\cdot\|_{L^p}$ denote the usual norm on $L^p(\Omega)$ for $1 \le p := \frac{2a+2r}{2a+r}$. Assume $f^\dagger \in D_G$ with $\|f^\dagger\|_{K_{t_s}} \le \rho$. If $s < r$, assume that $D_F := S^{-1}(D_G)$ satisfies Assumption 4. Let $\delta > 0$ and $g^{\mathrm{obs}} \in Y$ satisfy $\|g^{\mathrm{obs}} - F(f^\dagger)\|_Y \le \delta$.
Proof If $s > r$ (hence $t_s \in (0,1)$) we refer to Remark 17, and if $s < r$ (hence $t_s \in (1,2)$) to Theorem 25, for the bound on the reconstruction error in the sequence space setting. Together with the embedding (34) this proves the result. $\square$

Remark 28
In view of Remark 18 we obtain the same results in the case $s = r$ by replacing $K^{t_s}$ by $B^r_{1,1}(\Omega)$.
More precisely, we can choose $C_2 := c\,\|f^\dagger\|_{K^{t_s}}^{t_s}$ and $C_3 := c\,C_2$.

Proof Statement (i) is equivalent to $x^\dagger = S^{-1} f^\dagger \in k^{t_s}$, and statement (ii) is equivalent to a bound $\|x^\dagger - \hat{x}_\alpha\|_{0,1,1} \le C_2\, \alpha^{\frac{s}{s+2a}}$. Hence, Theorem 21 yields the result. $\square$

Example 30
We consider functions $f_{\mathrm{jump}}, f_{\mathrm{kink}} : [0,1] \to \mathbb{R}$ which are $C^\infty$ with uniform bounds on all derivatives except at a finite number of points in $[0,1]$, and $f_{\mathrm{kink}} \in C^{0,1}([0,1])$. In other words, $f_{\mathrm{jump}}, f_{\mathrm{kink}}$ are piecewise smooth, $f_{\mathrm{jump}}$ has a finite number of jumps, and $f_{\mathrm{kink}}$ has a finite number of kinks. Then for $p \in (0,\infty)$, $q \in (0,\infty)$, and $s \in \mathbb{R}$ with $s > \sigma_p$, with $\sigma_p$ as in Assumption 1, we have
$$f_{\mathrm{jump}} \in B^s_{p,q}((0,1)) \iff s < \tfrac{1}{p}, \qquad f_{\mathrm{kink}} \in B^s_{p,q}((0,1)) \iff s < 1 + \tfrac{1}{p};$$
for $q = \infty$ the borderline values $s = \frac1p$ and $s = 1 + \frac1p$, respectively, are attained as well. To see this, one can use the classical definition of Besov spaces in terms of the modulus of smoothness of order $m \ge 2/p$. Therefore, as $t_s < 1$, describing the regularity of $f_{\mathrm{jump}}$ or $f_{\mathrm{kink}}$ in the scale $B^s_{t_s,t_s}(\Omega) \subset K^{t_s}$ as in Theorems 27 and 29 allows for a larger value of $s$, and hence a faster convergence rate, than describing the regularity of these functions in the Besov spaces $B^s_{1,\infty}$ as in [24]. In other words, the previous analysis in [24] provided only suboptimal rates of convergence for this important class of functions. This is also visible in the numerical simulations below.
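The mechanism behind Example 30 can be seen numerically: only wavelet coefficients whose support meets a singularity are substantially nonzero, so piecewise smooth functions have very sparse expansions. The sketch below illustrates this with a from-scratch discrete Haar transform (a simplification; the simulations below use Daubechies wavelets of order 7, and the grid size and thresholds here are our choices):

```python
import numpy as np

def haar_details(x):
    """Detail coefficients of an orthonormal discrete Haar transform,
    collected level by level from fine to coarse."""
    details, approx = [], x.astype(float)
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return np.concatenate(details)

n = 1024
t = np.arange(n) / n
f_jump = np.where(t < 0.4, 1.0, -0.5)   # piecewise constant, one jump
f_smooth = np.sin(2 * np.pi * t)        # C^infinity comparison function

# Count essentially nonzero coefficients: for the jump function only the
# few coefficients straddling the jump survive (roughly one per level),
# while the smooth function has many small but nonzero fine-scale details.
nnz_jump = int(np.sum(np.abs(haar_details(f_jump)) > 1e-10))
nnz_smooth = int(np.sum(np.abs(haar_details(f_smooth)) > 1e-10))
print(nnz_jump, nnz_smooth)
```

The handful of nonzero coefficients for the jump function, with magnitudes spread over all levels, is exactly the behavior captured by membership in $B^s_{t,t}$ with small $t$.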
Note that the largest set on which a given rate of convergence is attained is obtained by setting $r = 0$ (i.e., no oversmoothing). This is in contrast to the Hilbert space case, where oversmoothing raises the finite qualification of Tikhonov regularization. On the other hand, for larger $r$ convergence is guaranteed in a stronger $L^p$-norm.

Numerical results
For our numerical simulations we consider the problem in Example 2 in the form
$$-u'' + cu = f \quad \text{in } (0,1), \qquad (35)$$
with boundary conditions as in Example 2. The forward operator in the function space setting is $G(c) := u$ for the fixed right-hand side $f(\cdot) = \sin(4\pi\,\cdot) + 2$. The true solution $c^\dagger$ is given by a piecewise smooth function with either finitely many jumps or kinks as discussed in Example 30.
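For illustration, the forward map $c \mapsto G(c) = u$ can be discretized in a few lines. The sketch below uses a simple second-order finite-difference scheme with homogeneous Dirichlet boundary conditions as a stand-in for the quadratic finite element solver used in the actual experiments (the boundary conditions and the grid handling here are our assumptions):

```python
import numpy as np

def solve_bvp(c, n=127):
    """Finite-difference sketch of the forward map G(c) = u solving
    -u'' + c*u = f on (0,1) with homogeneous Dirichlet conditions.
    c is a callable coefficient; returns the interior grid and u."""
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1.0 - h, n)           # interior grid points
    f = np.sin(4 * np.pi * x) + 2.0          # fixed right-hand side from the text
    # tridiagonal discretization of -u'' plus the diagonal of c
    A = (np.diag(np.full(n, 2.0 / h**2) + c(x))
         + np.diag(np.full(n - 1, -1.0 / h**2), 1)
         + np.diag(np.full(n - 1, -1.0 / h**2), -1))
    return x, np.linalg.solve(A, f)

x, u = solve_bvp(lambda x: np.ones_like(x))  # constant coefficient c = 1
print(round(float(u.max()), 4))
```

Since $f \ge 1 > 0$ and the discretization matrix is an M-matrix, the computed solution is positive, which makes the sketch easy to sanity-check.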
To solve the boundary value problem (35) we used quadratic finite elements on an equidistant grid with 127 elements. The coefficient $c$ was sampled on an equidistant grid with 1024 points. For the wavelet synthesis operator $S$ we used the code PyWavelets [28] with Daubechies wavelets of order 7. The minimization problem in (4) was solved by a Gauß-Newton-type method: in each step a linearized version of (4) is minimized, yielding $x_{k+1}$, and we set $c_{k+1} = S x_{k+1}$; the initial guess was the constant $c_0 = 1$. In each Gauß-Newton step the linearized minimization problem was solved with the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) proposed and analyzed by Beck & Teboulle in [2]; we used the inertial parameter as in [6, Sec. 4]. We did not impose a constraint on the size of $\|x - x_0\|_{0,2,2}$, which is required by our theory if Assumption 3 does not hold true globally. However, the size of the domain of validity of this assumption is difficult to assess, and such a constraint is likely never active for a sufficiently good initial guess. The regularization parameter $\alpha$ was chosen by a sequential discrepancy principle with $\tau_1 = 1$ and $\tau_2 = 2$ on a grid $\alpha_j = 2^{-j}\alpha_0$. To simulate worst-case errors, we computed for each noise level $\delta$ reconstructions for several data errors $u^\delta - G(c^\dagger)$ with $\|u^\delta - G(c^\dagger)\|_{L^2} = \delta$, given by sine functions of different frequencies.

For the piecewise smooth coefficient $c^\dagger$ with jumps shown in the left panel of Fig. 1, Example 30 yields
$$c^\dagger \in B^s_{t_s,t_s}((0,1)) \subset K^{t_s} \iff s < \frac{1}{t_s} \iff s < \frac{4}{3}.$$
Hence, Theorem 27 predicts the rate $\|\hat{c}_\alpha - c^\dagger\|_{L^1} = O(\delta^e)$ for all $e < \frac{2}{5}$. (36)
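Each Gauß-Newton step above requires minimizing a linearized Tikhonov functional with a weighted $\ell^1$-penalty. A self-contained FISTA sketch for such a subproblem is given below; the random matrix stands in for the linearized forward operator, and we use the textbook inertial parameter of Beck & Teboulle rather than the variant of [6], so all concrete numbers here are illustrative:

```python
import numpy as np

def fista_weighted_l1(A, b, alpha, r, n_iter=500):
    """FISTA sketch for min_x ||A x - b||_2^2 + alpha * sum_j r_j |x_j|,
    the type of subproblem arising in each Gauss-Newton step."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1]); y = x.copy(); t = 1.0
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ y - b)           # gradient of the data fidelity term
        z = y - grad / L
        thresh = alpha * r / L                   # component-wise threshold
        x_new = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)  # soft shrinkage
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))          # inertial parameter
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 80))
x_true = np.zeros(80); x_true[[3, 17, 42]] = [1.0, -2.0, 1.5]
b = A @ x_true
x_hat = fista_weighted_l1(A, b, alpha=0.05, r=np.ones(80))
print(int(np.sum(np.abs(x_hat) > 1e-2)))
```

The soft-shrinkage step is what produces the finite-support estimators mentioned in the introduction: all components below the weighted threshold are set exactly to zero.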
In contrast, the smoothness condition $c^\dagger \in B^s_{1,\infty}((0,1))$ in our previous analysis [24], which was formulated in terms of Besov spaces with $p = 1$, is only satisfied for the smaller smoothness indices $s \le 1$, and therefore the convergence rate in [24] is only of the order $\|\hat{c}_\alpha - c^\dagger\|_{L^1} = O(\delta^{1/3})$. Our numerical results displayed in the right panel of Fig. 1 show that this previous error bound is too pessimistic, and the observed convergence rate matches the rate (36) predicted by our analysis.

Fig. 2 Left: true coefficient $c^\dagger$ with kinks in the boundary value problem (35) together with a typical reconstruction at noise level $\delta = 3.5 \cdot 10^{-5}$. Right: reconstruction error using $b^0_{1,1}$-penalization, the rate $O(\delta^{4/7})$ predicted by Theorem 27 (see eq. (37)), and the rate $O(\delta^{1/2})$ predicted by the previous analysis in [24].
Similarly, for the piecewise smooth coefficient $c^\dagger$ with kinks shown in the left panel of Fig. 2, Example 30 yields $c^\dagger \in B^s_{t_s,t_s}((0,1)) \subset K^{t_s} \iff s < 1 + \frac{1}{t_s} \iff s < \frac{8}{3}$. Hence, Theorem 27 predicts the rate $\|\hat{c}_\alpha - c^\dagger\|_{L^1} = O(\delta^e)$ for all $e < \frac{4}{7}$ (37), which again matches the numerical results shown in the right panel of Fig. 2. The rate $O(\delta^{1/2})$ obtained in [24] based on the regularity condition $c^\dagger \in B^2_{1,\infty}((0,1))$ turns out to be suboptimal for this coefficient $c^\dagger$, even though it is minimax optimal in $B^2_{1,\infty}$-balls.
Finally, for the same coefficient $c^\dagger$ with jumps as in Fig. 1, reconstructions with $r = 0$ and $r = 2$ are compared in the left panel of Fig. 3. Visually, the reconstruction quality is similar for both reconstructions. For $r = 2$ the penalization is oversmoothing, and Example 30 yields
$$c^\dagger \in B^s_{t_s,t_s}((0,1)) \subset K^{t_s} \iff s < \frac{1}{t_s} \iff s < \frac{6}{7} \qquad \text{with } t_s = \frac{8}{s+6}.$$
Hence, Theorem 27 predicts the rate $\|\hat{c}_\alpha - c^\dagger\|_{L^{4/3}} = O(\delta^e)$ for all $e < \frac{3}{10}$ (38), which once again matches the results of our numerical simulations shown in the right panel of Fig. 3. This case is not covered by the theory in [24].

Fig. 3 Left: true coefficient $c^\dagger$ with jumps in the boundary value problem (35) together with reconstructions for $r = 0$ and $r = 2$ at noise level $\delta = 3.5 \cdot 10^{-5}$ for the same data. Right: reconstruction error using $b^2_{1,1}$-penalization (oversmoothing) and the rate $O(\delta^{3/10})$ predicted by Theorem 27 (see eq. (38)). This case is not covered by the theory in [24].
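The smoothness thresholds and rate exponents quoted in this section can be reproduced with a few lines of exact rational arithmetic. We assume here that (31) takes the form $\frac{1}{t_s} = \frac12 + \frac{s+a}{2a+2r}$ (consistent with the instance $t_s = \frac{8}{s+6}$ for $a = r = 2$ above), that $p = \frac{2a+2r}{2a+r}$ and the rate exponent is $\frac{s}{s+a}$ as in Theorem 27, and that $a = 2$ for this boundary value problem:

```python
from fractions import Fraction as F

def besov_rate(a, r, s_limit):
    """For the maximal admissible smoothness s_limit, return the assumed
    t_s from 1/t_s = 1/2 + (s+a)/(2a+2r), the norm index
    p = (2a+2r)/(2a+r), and the rate exponent e = s/(s+a)."""
    s = s_limit
    t_s = 1 / (F(1, 2) + F(s + a, 2 * a + 2 * r))
    p = F(2 * a + 2 * r, 2 * a + r)
    e = F(s, s + a)
    return t_s, p, e

a = 2  # assumed degree of smoothing of the forward operator in the experiments
res_jump = besov_rate(a, 0, F(4, 3))  # jumps, r = 0: threshold s < 1/t_s
res_kink = besov_rate(a, 0, F(8, 3))  # kinks, r = 0: threshold s < 1 + 1/t_s
res_over = besov_rate(a, 2, F(6, 7))  # jumps, r = 2 (oversmoothing)
print(res_jump, res_kink, res_over)
```

The three calls reproduce the exponents $\frac47$ and $\frac{3}{10}$ quoted in the figure captions, as well as the exponent $\frac25$ for the first experiment, together with the norm indices $p = 1$ for $r = 0$ and $p = \frac43$ for $r = 2$.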

Conclusions
We have derived a converse result for approximation rates of weighted $\ell^1$-regularization. Necessary and sufficient conditions for Hölder-type approximation rates are given by a scale of weak sequence spaces. We have also shown that $\ell^1$-penalization achieves the minimax-optimal convergence rates on bounded subsets of these weak sequence spaces, i.e. that no other method can uniformly perform better on these sets. However, a converse result for noisy data, i.e. the question whether $\ell^1$-penalization achieves given convergence rates in terms of the noise level on even larger sets, remains open. Although it seems likely that the answer will be negative, a rigorous proof would probably require uniform lower bounds on the maximal effect of data noise.
A further interesting extension concerns redundant frames. Note that, lacking injectivity, the composition of a forward operator in function spaces with the synthesis operator of a redundant frame cannot satisfy the first inequality in Assumption 3. Therefore, the mapping properties of the forward operator in function space will have to be described in a different manner. (See [1, Sec. 6.2] for a related discussion.)

We have also studied the important special case of penalization by wavelet Besov norms of type $B^r_{1,1}$. In this case the maximal spaces leading to Hölder-type approximation rates can be characterized as real interpolation spaces of Besov spaces, but to the best of our knowledge they do not coincide with classical function spaces. They are slightly larger than the Besov spaces $B^s_{t,t}$ with some $t \in (0,1)$, which in turn are considerably larger than the spaces $B^s_{1,\infty}$ used in previous results. Typical elements of the difference set $B^s_{t,t} \setminus B^s_{1,\infty}$ are piecewise smooth functions with local singularities. Since such functions can be well approximated by functions with sparse wavelet expansions, good performance of $\ell^1$-wavelet penalization is intuitively expected. Our results confirm and quantify this intuition.