On convergence rates of adaptive ensemble Kalman inversion for linear ill-posed problems

In this paper we discuss a deterministic form of ensemble Kalman inversion as a regularization method for linear inverse problems. By interpreting ensemble Kalman inversion as a low-rank approximation of Tikhonov regularization, we are able to introduce a new sampling scheme based on the Nyström method that improves practical performance. Furthermore, we formulate an adaptive version of ensemble Kalman inversion where the sample size is coupled with the regularization parameter. We prove that the proposed scheme yields an order optimal regularization method under standard assumptions if the discrepancy principle is used as a stopping criterion. The paper concludes with a numerical comparison of the discussed methods for an inverse problem of the Radon transform.


Introduction
In recent years, ensemble Kalman inversion (EKI) has become a popular tool for solving inverse problems [28]. EKI has advantages over other iterative methods in situations where the evaluation of the forward operator is costly and information about its adjoint or its derivative is unavailable. While there are some recent results on the convergence of EKI as an optimization method [10,51,52,62], the regularization theory of EKI is still incomplete. In this paper, we provide an analysis of a deterministic form of EKI as a regularization method for solving linear inverse problems. That is, we consider the problem of determining a solution x* of the linear operator equation

Lx = y, (1.1)

where L : X → Y is a bounded linear operator between Hilbert spaces. We do not assume that we have access to y, but only to a noisy measurement

ŷ = y + ξ, (1.2)

where ξ is noise. Such an analysis is important for three reasons: First, it allows a theoretical comparison of EKI with established iterative regularization methods for inverse problems, such as the iteratively regularized Gauss-Newton iteration [3] or the iteratively regularized Landweber iteration [49]. Secondly, it allows the transfer of knowledge between functional-analytic regularization theory, in particular the study of finite-dimensional approximations of Tikhonov regularization [20,43], and the emerging literature on ensemble methods for the solution of inverse problems (see for example [27] or [46]). Finally, this analysis can potentially serve as the basis for a generalized analysis of EKI for nonlinear inverse problems, making use of the deterministic convergence analysis of iterative regularization methods in Hilbert spaces (see [3,22,30]).
It was already noted in [28] that, in the case of a linear operator equation, the first iteration of EKI converges to the Tikhonov-regularized solution as the sample size approaches infinity. It can be shown that, at least for the deterministic version considered in this paper, this also holds true for all subsequent iterates, where each iterate is associated with a different choice of regularization parameter. Thus, in the linear case, EKI can be completely characterized as a stochastic low-rank approximation of Tikhonov regularization. As a consequence, we can prove that under appropriate source conditions, and by adapting the sample size to the regularization parameter (this method is then called adaptive EKI), we obtain optimal convergence rates for EKI in the sense formulated, for instance, in [14]. Moreover, we show that the efficiency of EKI can be increased by the use of more sophisticated low-rank approximation schemes, such as the Nyström method (see e.g. [17]).
The paper is organized as follows:
• We continue this section by recalling some required notation and functional-analytic prerequisites (Sect. 1.1) and providing an appropriate definition of the deterministic form of EKI that is considered for the rest of this paper (Sect. 1.2).
• In Sect. 2.1, we discuss deterministic EKI as an approximation to Tikhonov regularization. In particular, we derive error estimates in dependence on the regularization parameter, which build the foundation for the subsequent formulation of an adaptive version. In Sect. 2.2 we review some results and methods for the low-rank approximation of operators, in particular the Nyström method. We show how these methods naturally lead to new versions of EKI.
• In Sect. 3 we propose an adaptive variant of EKI. The algorithm is described in Sect. 3.1 and analyzed as an iterative regularization method in Sect. 3.2, where we describe conditions under which we can prove optimal convergence rates in the zero-noise limit. This constitutes our main result. Further remarks comparing the proposed scheme with similar methods from the existing literature are given in Sect. 3.3.
• We conclude our paper in Sect. 4 with numerical experiments in the context of computerized tomography. These experiments demonstrate some advantages and shortcomings of EKI for linear inverse problems. In particular, they show that the Nyström-EKI method leads to considerable improvements in terms of numerical performance in comparison to existing sampling methods.
• The appendix reviews some prerequisites from probability theory and discusses how our exposition relates to alternative formulations of EKI that have been studied elsewhere.

Notation and terminology
We summarize basic notation first: (i) X and Y denote real separable Hilbert spaces.
(ii) L(X; Y) denotes the space of bounded linear operators from X to Y.
(iii) If L : X → Y is a linear operator, we let D(L) ⊂ X denote its domain and R(L) ⊂ Y denote its range.
(iv) We call P ∈ L(X; X) positive if ⟨Px, x⟩_X ≥ 0 for all x ∈ X.
(v) For a positive and self-adjoint operator P ∈ L(X; X), we define the P-weighted norm ‖x‖_P := ‖P^{-1/2} x‖_X for x ∈ D(P^{-1/2}), where the operator P^{-1/2} is defined as the pseudoinverse of P^{1/2}, which in turn can be defined via spectral theory; see for example [14, chapter 2.3].
(vi) Trace class: We say that an operator P ∈ L(X; X) is in the trace class if for any orthonormal basis (e_n) of X we have ∑_n |⟨Pe_n, e_n⟩_X| < ∞.

Ensemble Kalman inversion for linear inverse problems
Next, we present a particular form of the EKI iteration associated to problem (1.1). The original form of EKI [28], which we refer to as stochastic EKI, evolves a random ensemble through an iteration in which additional noise is added at each step. In the last few years, multiple variants of EKI have been developed that incorporate adaptive step sizes [10,32] or additional regularization [9]. In particular, one can also formulate a deterministic version that circumvents the addition of noise by directly transforming the ensemble mean and covariance. Such a version of EKI has for example been considered in [10]. In accordance with the literature on ensemble Kalman filtering, we will refer to this as deterministic EKI [26,57]. A more detailed discussion of its relation to the stochastic form of EKI can be found in "Appendix B". The EKI iteration involves two linear operators C_0 : X → X and R : Y → Y that encode regularity assumptions on the solution x* and the noise ξ. They have to be provided by the practitioner to represent prior information on the problem. In the rest of this article, we will assume that they satisfy the following conditions:

Assumption 1.1 Let C_0 ∈ L(X; X) and R ∈ L(Y; Y) be injective, positive and self-adjoint linear operators such that the compatibility condition (1.3) holds. Moreover, we assume that the noisy data ŷ defined in Eq. 1.2 satisfies ŷ ∈ R(R^{1/2}).
As the next proposition shows, the subspace D(C_0^{-1/2}) ⊂ X together with the norm ‖·‖_{C_0} yields a Hilbert space. This space will play an important role for our analysis in Sect. 3.

Proposition 1.2 Let C_0 ∈ L(X; X) be an injective, positive and self-adjoint bounded linear operator, and let
⟨x, z⟩_{C_0} := ⟨C_0^{-1/2} x, C_0^{-1/2} z⟩_X for x, z ∈ D(C_0^{-1/2}).
Then D(C_0^{-1/2}) equipped with the inner product ⟨·, ·⟩_{C_0} defines a Hilbert space, denoted by X_{C_0}. Moreover,
‖x‖_X ≤ ‖C_0‖^{1/2} ‖x‖_{C_0} for all x ∈ D(C_0^{-1/2}). (1.4)

Proof The bilinear form ⟨·, ·⟩_{C_0} is well-defined on D(C_0^{-1/2}) because C_0 is injective. Furthermore, this bilinear form is symmetric and positive semidefinite because C_0^{-1/2} is self-adjoint and positive. The definiteness follows from the injectivity of C_0^{-1/2}. Eq. 1.4 follows from the boundedness of C_0, since we have
‖x‖_X = ‖C_0^{1/2} C_0^{-1/2} x‖_X ≤ ‖C_0^{1/2}‖ ‖C_0^{-1/2} x‖_X = ‖C_0‖^{1/2} ‖x‖_{C_0}.
Finally, the completeness of D(C_0^{-1/2}) with respect to ‖·‖_{C_0} is a direct consequence of the completeness of X. □

Remark 1.3
At this point, we want to stress that the operator R does not correspond to the assumption that ξ = ŷ − y is a Gaussian random element of Y with covariance R. In fact, in the case where Y is infinite-dimensional, one can show that ‖ξ‖_R = ∞ with probability 1 (see [7, theorem 2.4.7]). The proper interpretation of R is that it determines a subspace Y_R ⊂ Y in which ξ is assumed to lie (see Proposition 1.2).
Before we continue with the description of the deterministic EKI iteration, we present an illustrative example for a choice of the operators C 0 and R that is often used in practice.
Example 1.4 Suppose that X and Y are L²-spaces over bounded domains with piecewise smooth boundaries, and consider the choice C_0 = (I_X − Δ)^{-1} and R = I_Y − Δ. Then the operator C_0 is compact. Here, (I_X − Δ)^{-1} is the operator which maps a given function ρ ∈ X onto the weak solution u of the equation u − Δu = ρ (subject to appropriate boundary conditions). The range of C_0^{1/2} is H¹(D), i.e. the Sobolev space of first order. It is easy to see that C_0^{-1} is positive and self-adjoint, and thus so is C_0. We also have that ‖x‖_{C_0} coincides with the first-order Sobolev norm of x, and similarly ‖ξ‖_R coincides with the first-order Sobolev norm of ξ.

The fundamental difference of ensemble methods to existing regularization methods is the use of a stochastic low-rank approximation of C_0, which reduces the effective dimension of the parameter space X. The next definition gives this notion a precise meaning.
Definition 1.5 (Low-rank approximation)
(i) Let (A^{(J)})_{J=1}^∞ be a family of bounded linear operators with A^{(J)} ∈ L(R^J; X). We say that it generates a (deterministic) low-rank approximation of C_0, of order γ, if there exists a constant ν such that
‖A^{(J)} A^{(J)*} − C_0‖ ≤ ν J^{−γ} for all J ∈ N.
(ii) Let p ∈ [1, ∞) and (A^{(J)})_{J=1}^∞ be a family of random bounded linear operators (see "Appendix A") with A^{(J)}(ω) ∈ L(R^J; X) for all ω ∈ Ω and J ∈ N. We say that it generates a stochastic low-rank approximation of C_0, of p-order γ, if there exists a constant ν_p such that
E[‖A^{(J)} A^{(J)*} − C_0‖^p]^{1/p} ≤ ν_p J^{−γ} for all J ∈ N.

Under Assumption 1.1, the following algorithm is well-defined for all k ∈ N.
Definition 1.6 (Deterministic EKI) Let (A^{(J)})_{J=1}^∞ generate a low-rank approximation of C_0, and let ŷ ∈ Y, J ∈ N, and an initial guess x_0 ∈ X be given.
• Initialization: X̂_0^{(J)} := x_0 and A_0^{(J)} := A^{(J)}.
• Update: (1.5) and (1.6), where I_J ∈ R^{J×J} denotes the identity matrix and B_k^{(J)*} : Y → R^J denotes the adjoint of B_k^{(J)}.

Note that the adjective "deterministic" in Definition 1.6 refers only to the update formula, which, in contrast to the original, stochastic EKI iteration (see Definition B.1), does not introduce additional noise. Even if a stochastic low-rank approximation is used in Definition 1.6, we will refer to the resulting method as deterministic EKI. In this case, the algorithm is defined pointwise, for every ω ∈ Ω. That is, the quantities X̂_k^{(J)}, A_k^{(J)} and B_k^{(J)} all depend on ω. For the rest of this paper, we will suppress this dependence. This allows us to treat both deterministic and stochastic low-rank approximations at once.

Remark 1.7 We have introduced the EKI update equations (Eqs. 1.5-1.6) in the so-called square-root form. It is equivalent (see e.g. [57]) to the so-called covariance form (1.7), which is more widespread in the literature on the Kalman filter. The operator C_k^{(J)} is related to A_k^{(J)} from Definition 1.6 via the identity C_k^{(J)} = A_k^{(J)} A_k^{(J)*}, which holds for all k ∈ N. The computational difference between these two formulations is that the square-root form requires the inversion of an operator on R^J, while the covariance form requires the inversion of an operator on Y.
The existing literature on EKI focuses mostly on the case where the low-rank approximation (A^{(J)})_{J=1}^∞ is generated by the so-called anomaly operator of an ensemble U^{(J)} of random elements, hence the name "ensemble Kalman inversion". That is, one uses A^{(J)} = A(U^{(J)}), where A(U^{(J)}) ∈ L(R^J; X), the ensemble anomaly (Definition 1.8), is the operator assembled from the centered and suitably rescaled ensemble members of U^{(J)}.
We will see in Sect. 2.2 that (A(U^{(J)}))_{J=1}^∞ generates a stochastic low-rank approximation of C_0 if U_1, . . . , U_J are independent Gaussian random elements with Cov U_j = C_0 for all j = 1, . . . , J. However, the more general Definition 1.6 allows us to consider other forms of low-rank approximations, in particular also deterministic ones (see Sect. 2.2).
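To make the ensemble anomaly and its Monte-Carlo convergence concrete, the following toy sketch (our own finite-dimensional example; the scaling 1/√(J−1) is one common anomaly convention, and all matrix sizes are illustrative assumptions, not taken from the paper) exhibits numerically how the sampling error of A(U^{(J)}) A(U^{(J)})* decays slowly with the ensemble size J:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# toy trace-class covariance: decaying spectrum in a random eigenbasis
lam = 1.0 / np.arange(1, n + 1) ** 2
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
C0 = Q @ np.diag(lam) @ Q.T
C0_half = Q @ np.diag(np.sqrt(lam)) @ Q.T       # symmetric square root of C0

def anomaly(U):
    """Ensemble anomaly: centered columns, scaled so that A @ A.T is the
    empirical covariance (one common convention uses 1/sqrt(J - 1))."""
    J = U.shape[1]
    return (U - U.mean(axis=1, keepdims=True)) / np.sqrt(J - 1)

errs = []
for J in (10, 100, 1000):
    U = C0_half @ rng.standard_normal((n, J))   # U_j iid Gaussian, Cov U_j = C0
    A = anomaly(U)
    errs.append(np.linalg.norm(A @ A.T - C0, 2))

# the sampling error decays only like J^{-1/2}: growing J by 100 gains ~10x
assert errs[2] < errs[0]
```

This slow J^{−1/2} decay is exactly the bottleneck that motivates the alternative low-rank schemes of Sect. 2.2.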

Remark 1.9
The update Eq. 1.5 can also be expressed as the solution to a minimization problem: for all k ∈ N, X̂_{k+1}^{(J)} is the minimizer of a Tikhonov-type functional, which is well-defined due to Assumption 1.1.

Direct EKI
Ensemble Kalman methods originated in data assimilation [15] and are traditionally applied to state estimation in dynamical systems [40,48]. Following this logic, EKI, which has been developed for the treatment of inverse problems, is often analyzed as a nonstationary regularization method with multiple steps, where the iteration number k controls the amount of regularization. For the deterministic version of EKI given by Eqs. 1.5 and 1.6, one can actually show that multiple iterations with initial covariance operator C_0 are equivalent to a single iteration with a rescaled covariance operator C̃_0 = (1/k) C_0. This result can be seen as a direct consequence of the classical equivalence of the Kalman filter to four-dimensional variational data assimilation (4D-VAR) [47].

Theorem 2.1 Suppose that Assumption 1.1 holds, let (A^{(J)})_{J=1}^∞ generate a low-rank approximation of C_0, and let (X̂_k^{(J)})_{k=1}^∞ denote the EKI iteration as defined in Definition 1.6. Then the representation (2.1) holds.

Proof Follows from [40, theorem 5.4.7] by setting M = I_X, H_ξ = L and f(ξ) = ŷ for ξ = 1, . . . , k. □
A first consequence of Theorem 2.1 is that it allows us to embed EKI into a parameter-dependent family of operators, which we will call direct EKI:

Definition 2.2 (Direct EKI) Suppose that Assumption 1.1 holds, and let α > 0. Then we define the direct EKI estimate X̂_α^{d,(J)} by (2.2).

That is, the k-th iterate of deterministic EKI is equivalent to direct EKI with the choice α = 1/k. Next, we derive error estimates between direct EKI and Tikhonov regularization in terms of the sample size J and the regularization parameter α. To this end, let us recall the notion of the Tikhonov-regularized solution of Eq. 1.1.
Definition 2.3 (Tikhonov regularization) The minimizer of the Tikhonov functional is called the Tikhonov-regularized solution of Eq. 1.1 according to the data ŷ and the regularization parameter α. It is denoted by x̂_α and explicitly represented by (2.5), where I_X : X → X is the identity operator.
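The correspondence between iterated EKI and direct EKI can be checked numerically. The following self-contained sketch (our own finite-dimensional toy problem with R = I_Y and x_0 = 0; the matrices, sizes, and the explicit Tikhonov formula with prior-weighted penalty are illustrative assumptions in the standard Kalman setting, not taken from the paper) verifies that k covariance-form Kalman steps (cf. the covariance form mentioned in Remark 1.7) coincide with a single Tikhonov step with α = 1/k:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 20, 15, 5
L = rng.standard_normal((m, n))
C0 = np.diag(1.0 / np.arange(1, n + 1) ** 2)    # prior covariance C_0
R = np.eye(m)                                   # noise weighting R = I
y = rng.standard_normal(m)                      # stand-in for the data ŷ

def tikhonov(alpha):
    """x_alpha = C0 L^T (L C0 L^T + alpha R)^{-1} ŷ, i.e. the minimizer of
    ||ŷ - L x||_R^2 + alpha ||x||_{C0}^2 (with x0 = 0)."""
    return C0 @ L.T @ np.linalg.solve(L @ C0 @ L.T + alpha * R, y)

# k deterministic steps in covariance form: Kalman mean and covariance update
x, C = np.zeros(n), C0.copy()
for _ in range(k):
    G = C @ L.T @ np.linalg.inv(L @ C @ L.T + R)   # Kalman gain
    x = x + G @ (y - L @ x)
    C = C - G @ L @ C

# k-th iterate = direct EKI / Tikhonov with alpha = 1/k
x_direct = tikhonov(1.0 / k)
assert np.allclose(x, x_direct)
```

The identity holds because each Gaussian conditioning step multiplies the same likelihood onto the posterior, so k steps accumulate a data weight of k, which is equivalent to scaling the regularization parameter to 1/k.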

Remark 2.4
We emphasize, regarding the notational difference between Eqs. 2.2 and 2.5, that I_X denotes the identity operator on X, while I_J ∈ R^{J×J} denotes the identity matrix on R^J.
If we compare Eqs. 2.2 and 2.5, we observe that the main difference between Tikhonov regularization and direct EKI is the replacement of the operator K_α (Tikhonov) by a low-rank approximation K_α(A^{(J)}) (direct EKI). In the following, the difference between the random element X̂_α^{d,(J)} and the Tikhonov-regularized solution x̂_α is estimated.

Lemma 2.6 (Tikhonov versus direct EKI) Let α > 0, p ∈ [1, ∞), and suppose that Assumption 1.1 holds. Then there exists a constant c, independent of J, such that the estimate (2.6) holds.

Proof By Eqs. 2.5 and 2.2 we have the representation (2.7). Using spectral theory, one can show (2.8) and (2.9) for all positive and self-adjoint bounded linear operators P and Q (see [14, section 2.3]). Furthermore, recall that every linear operator A satisfies the identity ‖A A*‖ = ‖A‖². Taking norms and using Eqs. 2.8, 2.9, and 1.3, we obtain an estimate for the operator difference. Taking norms in Eq. 2.7 and inserting this estimate proves the assertion. □
This lemma shows that the difference between Tikhonov regularization and direct EKI can be bounded in terms of the difference between the operators A (J ) A (J ) * and C 0 . If this difference decreases with a certain rate with respect to J , then direct EKI converges to Tikhonov regularization with the same rate.
Proposition 2.7 If (A^{(J)})_{J=1}^∞ generates a deterministic low-rank approximation of C_0, of order γ, then there exists a constant κ such that the error bound of Lemma 2.6 holds with rate J^{−γ}; if it generates a stochastic low-rank approximation of p-order γ, the analogous moment bound holds with a constant κ_p.

Proof Follows directly from Lemma 2.6 and Definition 1.5 with κ = ν · c and κ_p = ν_p · c. □

Remark 2.8
Alternatively to the above derivation, the convergence of deterministic EKI to Tikhonov regularization can be seen as a special case of the convergence of the ensemble square-root filter to the Kalman filter; see for example [35] or [40, section 5.4]. However, the results presented here are better suited to investigate convergence rates of EKI as a regularization method (see Sect. 3), since they explicitly describe the dependence of the approximation error on the regularization parameter α. We also note that different types of finite-dimensional approximations of Tikhonov regularization have been studied elsewhere, for example in [20,43].

Optimal low-rank approximations for EKI
Proposition 2.7 shows that direct EKI, and thus also EKI, converges to Tikhonov regularization with rate equal to the order of the employed low-rank approximation.
In general, convergent low-rank approximations only exist if the eigenvalues of C_0 satisfy a decay condition.

Assumption 2.9 (Decreasing eigenvalues of C_0) Let C_0 satisfy Assumption 1.1, and let (λ_n) denote its eigenvalues in decreasing order. We assume that there exists a constant η > 0 such that λ_n = O(n^{−η}) as n → ∞.

Remark 2.10
In this paper, we always assume that all eigenvalues are repeated according to their multiplicities.
Under Assumption 2.9, the Schmidt-Eckart-Young-Mirsky theorem [13,37,53] states that the best possible order of any low-rank approximation of C_0 is η, and that it is achieved by the truncated singular value decomposition.

Theorem 2.12 (Schmidt-Eckart-Young-Mirsky) Let C_0 ∈ L(X; X) be positive, self-adjoint and compact, and let (λ_n) denote its eigenvalues in decreasing order. Let A_svd^{(J)} A_svd^{(J)*} denote the J-truncated singular value decomposition of C_0. Then
‖A_svd^{(J)} A_svd^{(J)*} − C_0‖ = λ_{J+1} = min { ‖M − C_0‖ : M ∈ L(X; X), rank(M) ≤ J }.
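As a quick numerical illustration (a toy matrix of our own choosing, not from the paper), the J-truncated eigendecomposition of a positive self-adjoint C_0 attains exactly the operator-norm error λ_{J+1}:

```python
import numpy as np

rng = np.random.default_rng(0)
n, J = 60, 10
lam = 1.0 / np.arange(1, n + 1) ** 2            # decreasing eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
C0 = Q @ np.diag(lam) @ Q.T

# J-truncated decomposition: A_svd = Q_J diag(sqrt(lam_1), ..., sqrt(lam_J))
A_svd = Q[:, :J] @ np.diag(np.sqrt(lam[:J]))

# Schmidt-Eckart-Young-Mirsky: the best rank-J operator-norm error is the
# (J+1)-st eigenvalue (index J in 0-based indexing), attained by truncation
err = np.linalg.norm(A_svd @ A_svd.T - C0, 2)
assert np.isclose(err, lam[J])
```

The residual C_0 − A_svd^{(J)} A_svd^{(J)*} keeps only the discarded spectral components, so its largest eigenvalue, and hence its operator norm, is λ_{J+1}.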

Remark 2.13
Note that the optimal possible order for a low-rank approximation does not directly depend on the dimension of the underlying spaces X and Y, only on the decay of the eigenvalues of C 0 . This means that we obtain dimension-independent convergence rates as long as the eigenvalues of C 0 decay sufficiently fast.
Existing formulations of EKI generate a stochastic low-rank approximation of C_0 from the ensemble anomaly A(U^{(J)}) (see Definition 1.8) of a randomly generated ensemble U^{(J)}. For this type of approximation, we have the following result.

Theorem 2.14 (Low-rank approximation of C_0) Assume that C_0 is in the trace class and let p ∈ [1, ∞) be fixed. Moreover, for every J ∈ N, let U^{(J)} = [U_1, . . . , U_J] ∈ X^J be an ensemble of independent Gaussian random elements with Cov U_j = C_0 for all j ∈ {1, . . . , J}, and let A(U^{(J)}) be as in Definition 1.8. Then (A(U^{(J)}))_{J=1}^∞ generates a stochastic low-rank approximation of C_0, of p-order 1/2, meaning that there exists a constant ν_p such that
E[‖A(U^{(J)}) A(U^{(J)})* − C_0‖^p]^{1/p} ≤ ν_p J^{−1/2} for all J ∈ N.
In particular, for p = 1, there exists a constant c > 0 such that
E[‖A(U^{(J)}) A(U^{(J)})* − C_0‖] ≥ c J^{−1/2}, (2.10)
i.e. the rate 1/2 cannot be improved.

Proof See [31]. □
The following result on trace-class operators allows us to directly compare the order of (A(U^{(J)}))_{J=1}^∞ to the theoretical optimum defined in Theorem 2.12.

Proposition 2.16 Let C_0 be a positive and self-adjoint trace-class operator with eigenvalues (λ_n) in decreasing order. Then n λ_n → 0 as n → ∞; in particular, λ_n = o(n^{−1}).
Therefore, if C_0 is in the trace class, then according to Theorem 2.12 the optimal low-rank approximation of C_0 is given by A_svd^{(J)} A_svd^{(J)*} and is at least of order 1. However, since the low-rank approximation generated by (A(U^{(J)}))_{J=1}^∞ satisfies the lower bound Eq. 2.10, the ensemble-based low-rank approximation, while cheaper, is not of optimal order.
This leads to the question whether there exist low-rank approximations of C 0 that are of optimal order but do not require knowledge of the singular value decomposition of C 0 . The answer to this question is yes. There exist stochastic low-rank approximations that are of optimal order and only require O(J ) evaluations of C 0 [21]. An example of such a scheme is the Nyström method [17,41,44]. We will consider a special case given by algorithm 1.

Algorithm 1 Nyström method (with projection-based sketches)
Given a positive and self-adjoint operator C_0 and a target rank J ∈ N.
1: Generate iid random samples W_1, . . . , W_J ∼ N(0, I_X) and assemble them in the sketch W^{(J)} = [W_1, . . . , W_J].

It has been shown that this method leads to a stochastic low-rank approximation of optimal order.

Theorem 2.17 (Nyström low-rank approximation) Let A_nys^{(J)} be obtained from Algorithm 1 and let (λ_n) denote the decreasing eigenvalues of C_0. Then the bound (2.11) holds.

Proof It follows from lemma 4 in [12] that
‖A_nys^{(J)} A_nys^{(J)*} − C_0‖ ≤ ‖(I_X − Q^{(J)} Q^{(J)*}) C_0^{1/2}‖²,
where Q^{(J)} is as in Algorithm 1. The right-hand side can be estimated using [21, theorem 10.6] (the adaptation to our infinite-dimensional setting is straightforward), yielding Eq. 2.11. If we then choose N = J/2 in Eq. 2.11 (assuming without loss of generality that J is even), the right-hand side becomes, by Assumption 2.9, of order J^{−η}. □
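A minimal sketch of a projection-based Nyström approximation is given below. This is our own implementation of one common variant: the factorization A = (C_0 Q) B^{−1/2} with core B = Q^T C_0 Q is an assumption about the precise form of Algorithm 1, whose remaining steps are not reproduced above; what is faithful to the text is that only O(J) applications of C_0 are needed and no eigendecomposition of C_0 itself is required:

```python
import numpy as np

def nystrom(apply_C0, n, J, rng):
    """Rank-J Nystrom sketch of a positive self-adjoint operator (one common
    projection-based variant, stated here as an assumption).

    Returns a factor A with A @ A.T = (C0 Q)(Q^T C0 Q)^{-1}(C0 Q)^T,
    using only 2 J applications of C0.
    """
    W = rng.standard_normal((n, J))      # Gaussian test matrix
    Y = apply_C0(W)                      # sketch Y = C0 W
    Q, _ = np.linalg.qr(Y)               # orthonormal basis of the sketched range
    CQ = apply_C0(Q)
    B = Q.T @ CQ                         # small J x J core matrix
    evals, evecs = np.linalg.eigh(B)
    evals = np.maximum(evals, 1e-14)     # guard against round-off
    return CQ @ (evecs @ np.diag(evals ** -0.5) @ evecs.T)

rng = np.random.default_rng(0)
n, J = 80, 15
lam = 1.0 / np.arange(1, n + 1) ** 2     # eigenvalue decay as in Assumption 2.9
C0 = np.diag(lam)
A = nystrom(lambda X: C0 @ X, n, J, rng)
err = np.linalg.norm(A @ A.T - C0, 2)
# loose sanity check: the error is near-optimal (typically close to lam[J]),
# in particular far below the crude threshold lam[4]
assert err < lam[4]
```

In contrast to the ensemble anomaly, whose error decays only like J^{−1/2}, the sketch above tracks the eigenvalue tail of C_0, which is what makes Nyström-EKI cheaper in Sect. 4.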

Remark 2.18
By adapting the proof of [21, theorem 10.6], one could also show that the Nyström method is of p-order η for all p ∈ [1, ∞).
We will see in Sect. 4 that the accuracy of the Nyström method is very close to the theoretical optimum given by the truncated singular value decomposition.

Convergence of direct EKI
The ensemble anomaly, the truncated singular value decomposition, the Nyström method, or in fact any other method for the low-rank approximation of positive operators can be used inside EKI. The corresponding error estimates with respect to Tikhonov regularization then follow directly from Proposition 2.7.

Corollary 2.19
Suppose that Assumption 1.1 is satisfied. Then:
• If the low-rank approximation is generated by the ensemble anomaly (A(U^{(J)}))_{J=1}^∞ of a Gaussian ensemble and C_0 is in the trace class, then there exists a constant κ such that (2.13) holds.
• If the low-rank approximation is given by A_svd^{(J)}, the iterate X̂_α^{d,(J)} is deterministic, and if Assumption 2.9 holds, then there exists a constant κ_svd such that (2.14) holds.
• If the low-rank approximation is given by A_nys^{(J)} and Assumption 2.9 holds with η > 1/2, then there exists a constant κ_nys such that (2.15) holds.

Proof By Theorem 2.14, (A(U^{(J)}))_{J=1}^∞ generates a stochastic low-rank approximation of p-order 1/2. Thus, Eq. 2.13 follows from Proposition 2.7. The estimates Eqs. 2.14 and 2.15 then follow analogously through Theorems 2.12 and 2.17, respectively. □

Adaptive ensemble Kalman inversion
We have seen in Proposition 2.7 that direct ensemble Kalman inversion can be understood as a low-rank approximation of Tikhonov regularization. It is well known that, under a standard source condition (see Assumption 3.4 below), the Tikhonov-regularized solution of a linear equation converges to the infinite-dimensional minimum-norm solution (see Definition 3.3) in the zero-noise limit with a certain rate (see e.g. [14]). Thus, if we ensure that the error between direct EKI and Tikhonov regularization vanishes with the same rate as Tikhonov regularization converges, then direct EKI will also converge with that rate. However, since its iterates are restricted to the finite-dimensional range of A^{(J)}, direct EKI can only lead to a convergent regularization method if the sample size J is adapted to the noise level.
In this section, we describe how this can be achieved in conjunction with the discrepancy principle. The resulting method, which we call adaptive ensemble Kalman inversion, is a convergent regularization method of optimal order in a sense that will be made precise below. For this result, we require knowledge of a number δ > 0 such that ‖ξ‖_R ≤ δ. This assumption is often referred to as a deterministic noise model, and the number δ is called the deterministic noise level. For some results on regularization with random noise, see for example [6]. We start with a precise description of the adaptive EKI method in Sect. 3.1, followed by a convergence analysis of the zero-noise limit in Sect. 3.2. General remarks explaining the connection to other forms of EKI and multiscale methods are given in Sect. 3.3.

Description of the method
We start by presenting a version of direct EKI with a-posteriori parameter choice rule in the form of the discrepancy principle. We will refer to this method as adaptive EKI.
In our definition, we distinguish between the cases where the underlying low-rank approximation is deterministic and stochastic. In the stochastic case, we will use a projection onto a suitably large ball around the initial guess x_0. This projection serves to guarantee stability of the resulting iteration even in the presence of non-deterministic sampling error. In Sect. 3.2, we will see that if the radius of the ball is chosen sufficiently large, it does not negatively affect the convergence behavior.

Definition 3.1 (Adaptive EKI) Let γ > 0, b ∈ (0, 1), α_0 > 0 and J_0 ∈ N, and define
α_k := α_0 b^k, (3.2)
and choose the sample sizes (J_k)_{k=1}^∞ such that J_k^γ ≥ α_k^{−1} for all k ∈ N. (3.3)
• If (A^{(J)})_{J=1}^∞ generates a deterministic low-rank approximation of C_0, of order γ, we define the adaptive EKI iteration associated to (A^{(J)})_{J=1}^∞ as the direct EKI iterate with parameters (α_k, J_k). If the low-rank approximation is given by A_svd^{(J)} (see Theorem 2.12), we refer to the method as adaptive SVD-EKI and denote its iterates with x̂_k^{asvd}.
• Let r > 0, let B_r(x_0) denote the closed ball around x_0 with radius r, and let P_r denote the orthogonal projection onto B_r(x_0). If (A^{(J)})_{J=1}^∞ generates a stochastic low-rank approximation of C_0, of p-order γ, we define the adaptive EKI iteration associated to (A^{(J)})_{J=1}^∞ by the projected update (3.5). If the low-rank approximation is generated by the ensemble anomaly, we refer to the method as adaptive Standard-EKI and denote its iterates with X̂_k^{aeki}; if it is generated by the Nyström method, we refer to it as adaptive Nyström-EKI and denote its iterates with X̂_k^{anys}.

The exponential reduction of the regularization parameter, given by Eq. 3.2, is a typical choice for regularization methods of similar form, and can already be found in [3]. The choice of (J_k)_{k=1}^∞ is motivated by Proposition 2.7: By ensuring that J_k^γ grows at least as fast as α_k^{−1}, we make sure that the approximation error between adaptive EKI and Tikhonov regularization does not explode as k increases.
In order to ensure convergence, we choose a stopping criterion for the adaptive EKI iteration. We consider the discrepancy principle, which has the advantage that it is easy to implement and it requires only little prior information on the forward operator L. In the case where the employed low-rank approximation is stochastic, the resulting stopping index is a random variable.
The stopping index K_δ is then chosen according to the discrepancy principle: for τ > 1, we let K_δ(ω) be the first index k for which ‖ŷ − L X̂_k^a(ω)‖_R ≤ τ δ, where we set K_δ(ω) = ∞ if such a number does not exist.
For the case of Tikhonov regularization, it is known that the discrepancy principle yields a converging regularization method under standard assumptions. The main difficulty of the analysis of adaptive EKI is to show that this result also holds for the random, approximate iteration given by Definition 3.1.
Pseudo-code for the adaptive EKI method in conjunction with the discrepancy principle is given in Algorithm 2.

Algorithm 2 Adaptive EKI with the discrepancy principle
Given ŷ ∈ Y, an initial guess x_0 ∈ X, a noise level δ > 0, parameters τ > 1, α_0 > 0, J_0 ∈ N, b ∈ (0, 1), r > 0, and a deterministic or stochastic low-rank approximation (A^{(J)}) of C_0, of order γ.
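The loop of Algorithm 2 can be sketched as follows. This is a simplified finite-dimensional implementation of our own, with R = I_Y and without the projection step: the function names, the parameter coupling J_k ≈ α_k^{−1/γ}, and the SVD-based low-rank factory in the usage example are illustrative assumptions rather than the paper's exact pseudo-code:

```python
import numpy as np

def adaptive_eki(L, y_noisy, x0, delta, low_rank,
                 tau=2.0, alpha0=1.0, b=0.5, J0=5, gamma=1.0, k_max=40):
    """Adaptive EKI with the discrepancy principle (sketch, R = I).

    low_rank(J) must return a factor A with A @ A.T approximating C0
    (ensemble anomaly, truncated SVD, or Nystrom).
    """
    x = x0
    for k in range(k_max):
        alpha = alpha0 * b ** k                          # alpha_k = alpha_0 b^k
        J = J0 + int(np.ceil(alpha ** (-1.0 / gamma)))   # couple J_k to alpha_k
        A = low_rank(J)
        B = L @ A
        # square-root form: only a small J x J system is solved, not one on Y
        coeff = np.linalg.solve(B.T @ B + alpha * np.eye(A.shape[1]),
                                B.T @ (y_noisy - L @ x0))
        x = x0 + A @ coeff
        if np.linalg.norm(y_noisy - L @ x) <= tau * delta:  # discrepancy principle
            return x, k
    return x, k_max

# usage: adaptive SVD-EKI on a small synthetic problem
rng = np.random.default_rng(0)
n, m, delta = 50, 40, 1e-2
L = rng.standard_normal((m, n)) / np.sqrt(n)
lam = 1.0 / np.arange(1, n + 1) ** 2
A_full = np.diag(np.sqrt(lam))                  # exact square root of C0 = diag(lam)
x_true = np.sqrt(lam) * rng.standard_normal(n)  # smooth ground truth
xi = rng.standard_normal(m)
y_noisy = L @ x_true + delta * xi / np.linalg.norm(xi)   # noise of size exactly delta

x_rec, k_stop = adaptive_eki(L, y_noisy, np.zeros(n), delta,
                             low_rank=lambda J: A_full[:, :min(J, n)], gamma=2.0)
assert np.linalg.norm(y_noisy - L @ x_rec) <= 2.0 * delta
```

Note how the sample size J_k grows as the regularization parameter α_k shrinks, so that the low-rank error never dominates the regularization error; the iteration stops as soon as the residual falls below τδ.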

Convergence analysis
Next, we show that adaptive EKI as defined above is a convergent regularization method, where convergence is considered relative to the minimum-norm solution x† of Eq. 1.1 (Definition 3.3). Our convergence proof is based on the assumption that x† satisfies a source condition (Assumption 3.4), which is defined as follows.

Remark 3.5
Equation 3.7 can be interpreted as a smoothness assumption on the minimum-norm solution x†. Source conditions are ubiquitous in the mathematical literature on inverse problems. Typically, convergence rates for regularization methods cannot be proven without assuming some type of source condition. Beyond the condition in Eq. 3.7, also logarithmic, variational, and spectral tail conditions can be considered; see [19,24,42,50] and some more recent references [1,2].
For the subsequent convergence analysis, we focus first on the more challenging case where adaptive EKI is based on a stochastic low-rank approximation. In that case, the following additional assumptions are sufficient to obtain convergence rates.

Assumption 3.6 Let p, q ∈ [1, ∞), ε ∈ (0, τ − 1), and let (A^{(J)})_{J=1}^∞ generate a stochastic low-rank approximation of C_0, of p-order γ.
(i) The projection radius r from Definition 3.1 satisfies the lower bound (3.8), where c_RL is as in Eq. 1.3 and κ_p is as in Proposition 2.7.

Remark 3.7
Note that Eq. 3.9 together with Eqs. 3.2 and 3.3 implies that a corresponding estimate holds for all subsequent iterates. Furthermore, the condition given by Eq. 3.8 simply means that the projection radius r has to be chosen large enough in relation to the initial error. We show in Proposition 3.10 that this condition ensures that the projection in Eq. 3.5 does not increase the approximation error between adaptive EKI and Tikhonov regularization.
Our strategy to obtain convergence rates for adaptive EKI is to use the error estimate between direct EKI and Tikhonov regularization, provided by Proposition 2.7, to transfer the well-established convergence results on Tikhonov regularization to adaptive EKI. The main complication is that the discrepancy principle introduces a coupling between the regularization parameter and the sampling error, which makes it challenging to estimate ‖X̂^a_{K_δ} − x̂_{α_{K_δ}}‖_X directly. Instead, we employ a good-set strategy, similar to the one used in [4] for the analysis of the iteratively regularized Gauss-Newton method for random noise. The idea behind the good-set strategy is to define a suitable subset of Ω on which we can perform a deterministic analysis, and then to show that the probability of its complement vanishes sufficiently fast. For our purpose, we define the good set E_good(δ) ⊂ Ω by (3.11) and

k_δ := min { k ∈ N : ‖ŷ − L x̂_{α_k}‖_R ≤ (τ − ε)δ }. (3.13)

Then, the law of total expectation yields, for q ∈ [1, ∞), a decomposition of the error into its contributions on E_good(δ) and on its complement E_good(δ)^c. Since X̂^a_{K_δ} ∈ B_r(x_0) holds by Eq. 3.5, we have the bound (3.14). Hence, it suffices to estimate E[‖X̂^a_{K_δ} − x†‖_X^q | E_good(δ)] and P(E_good(δ)^c) separately.
Estimates for the first term hinge on understanding the behavior of the Tikhonov-regularized solution x̂_{α_{k_δ}}. The following lemmas summarize existing results on Tikhonov regularization that we will make use of in our theoretical analysis of adaptive EKI. To this end, we consider as an auxiliary variable the Tikhonov-regularized solution of Eq. 1.1 according to the exact data y and the regularization parameter α, defined as in Eq. 2.5 with ŷ replaced by y and denoted by x_α.

Lemma 3.8 Suppose that Assumptions 1.1 and 3.4 hold. Then the estimate (3.17) holds. Furthermore, there exist constants c_1 and c_2, independent of α, δ and ρ, such that (3.18) and (3.19) hold, and

‖y − L x_α‖_R ≤ c_2 ρ α^{μ+1/2}. (3.20)

Lemma 3.9 Suppose that Assumptions 1.1, 3.4 and 3.6 hold, and let k_δ be as in Eq. 3.13. Then: (i) k_δ is finite for every δ > 0; (ii) k_δ = O(log(δ^{−1})) as δ → 0 (Eq. 3.23); and (iii) there exists δ̄ > 0 such that x̂_{α_k} ∈ B_r(x_0) for all k ≤ k_δ and δ ≤ δ̄.

Proof (i) Using the same transformations as in the proof of Lemma 3.8, the statement follows from the proof of theorem 4.17 in [14]. Note that this proof uses a discrepancy principle where α can vary continuously; however, the same argument applies also to the discretized sequence satisfying Eq. 3.2, see [14, remark 4.18].

(ii) Taking the logarithm and using the fact that b ∈ (0, 1), we arrive at the asserted bound. This proves Eq. 3.23.

(iii) As in the proof of Lemma 3.8, let ŵ_{α_k} denote the source element associated with x̂_{α_k}. If we insert y = L x† = L(x_0 + C_0^{1/2} w†) in the second term on the right-hand side of Eq. 3.25, we obtain, after cancellation and using the definition of B, a representation of ŵ_{α_k} − w†. Using Eq. 3.1 and the spectral estimates (see e.g. [30, lemma 4.5]) in Eq. 3.25, we obtain the bound (3.26). Finally, it follows from Eq. 3.22 that for all k ≤ k_δ the term (3.27) vanishes as δ → 0. Therefore, if we choose δ̄ > 0 accordingly, it follows from Eqs. 3.26 and 3.27 that the estimate (3.28) holds for all δ ≤ δ̄. By definition of ŵ_{α_k}, Eq. 3.28 implies a bound on ‖x̂_{α_k} − x_0‖_{C_0}, and hence, by Eqs. 1.4 and 3.8, x̂_{α_k} ∈ B_r(x_0) for all k ≤ k_δ and δ ≤ δ̄. □
With this lemma, we are able to show that the projection in Eq. 3.5 cannot increase the approximation error between adaptive EKI and the corresponding Tikhonov iteration, at least for k ≤ k δ . More precisely, we have the following proposition.
Proposition 3.10 Suppose that Assumptions 1.1, 3.4 and 3.6 hold. Then the estimate (3.29) holds for all ω ∈ Ω and all k ≤ k_δ. In particular, (3.30) holds, where κ_p is as in Proposition 2.7.

Proof Let k ≤ k_δ and ω ∈ Ω. By Eq. 3.5, we have X̂_k^a = P_r X̂^{d,(J_k)}_{α_k}. By Lemma 3.9, we have x̂_{α_k} ∈ B_r(x_0). Consequently, since the orthogonal projection P_r is nonexpansive and leaves B_r(x_0) fixed, we obtain
‖P_r X̂^{d,(J_k)}_{α_k} − x̂_{α_k}‖_X ≤ ‖X̂^{d,(J_k)}_{α_k} − x̂_{α_k}‖_X.
Together with Eq. 3.31, this yields Eq. 3.29. Eq. 3.30 then follows from Eq. 3.29 and Proposition 2.7. □
The next proposition provides the desired asymptotic decay rate of the probability P(E_good(δ)^c).

Proposition 3.11 Suppose that Assumptions 1.1, 3.4 and 3.6 hold. Then P(E_good(δ)^c) = O(δ^{2μq/(2μ+1)}). (3.33)

Proof By Eq. 3.11 and the subadditivity of P, we obtain the bound (3.34). Without loss of generality, let δ ≤ δ̄, where δ̄ is as in Lemma 3.9. Using Proposition 3.10 and then Eq. 3.10 in Eq. 3.35 yields an estimate for each summand. Inserting this inequality in Eq. 3.34, we arrive at (3.36). From Lemma 3.9 we know that k_δ = O(log(δ^{−1})). Since 1 − 2μ/(2μ+1) > 0, we obtain from Eq. 3.36 the assertion (3.33). □

Finally, we show convergence of the random element X̂^a_{K_δ} on the "good set" E_good(δ). The construction of E_good(δ) allows us to apply the proof of [14, theorem 4.17], with straightforward modifications, to each individual realization X̂^a_{K_δ(ω)}(ω), for ω ∈ E_good(δ).

Proposition 3.12 Under Assumptions 1.1, 3.4 and 3.6, there exists C > 0, independent of ω and δ, such that (3.38) holds for all ω ∈ E_good(δ).
Proof Let ω ∈ E_good(δ).
• First, we show that K_δ(ω) ≤ k_δ: To see this, note that, by definition of k_δ and E_good(δ), the iterate at index k_δ satisfies the discrepancy criterion. Hence, by definition of K_δ, there must hold K_δ(ω) ≤ k_δ (3.39), and consequently also α_{K_δ(ω)} ≥ α_{k_δ} (3.40).
• Next, we show that there exists a constant c_4, independent of ρ, δ and ω, such that (3.41) holds. From Eq. 3.20, we obtain an upper bound on the exact-data residual. On the other hand, inserting Eq. 3.18 gives a lower bound, which, by the definition of K_δ and Eq. 3.40, reduces to (3.43). Combining Eqs. 3.42 and 3.43, and using that τ − ε − 1 > 0, we can rearrange the resulting inequality, which shows Eq. 3.41 for a suitable choice of c_4.
• Next, we show that there exists a constant c_5, independent of ω and δ, such that (3.44) holds. We start with the triangle inequality (3.45). By Eq. 3.19, the first term on the right-hand side satisfies (3.46); inserting Eq. 3.41 bounds it as required. For the second term on the right-hand side of Eq. 3.45, we have by Eq. 3.17 the estimate (3.47). Using Eq. 3.18, another application of the triangle inequality, and then the definition of K_δ together with Eq. 3.40, we obtain (3.48). Inserting Eq. 3.48 in Eq. 3.47 yields the claimed bound (3.44).
• Finally, we use Eq. 3.39 to estimate the first and Eq. 3.44 to estimate the second term of the right-hand side, which yields the assertion. Hence, we can choose C > 0, independently of δ and ω, such that Eq. 3.38 holds. □
With this, we arrive at convergence rates for adaptive EKI under a stochastic low-rank approximation.
Proof By Proposition 3.12, the estimate Eq. 3.38 holds for all ω ∈ E_good(δ).

This implies in particular a bound on the reconstruction error. Using this inequality and Proposition 3.11 in Eq. 3.14 yields an estimate from which Eq. 3.50 follows.
For completeness, we also formulate the convergence rate result under a deterministic low-rank approximation. In this case, the proof of Proposition 3.12 applies without change, and we obtain the following result.

Theorem 3.14 Let Assumptions 1.1 and 3.4 hold, and let (A^{(J)})_{J=1}^∞ generate a deterministic low-rank approximation of C_0 of order γ. Assume there is ε ∈ (0, τ − 1) such that Eq. 3.51 holds, where κ is as in Proposition 2.7. Then the convergence rate of Theorem 3.13 holds deterministically.

Remark 3.15
Comparing the condition Eq. 3.51 for the deterministic case to the condition Eq. 3.10 for the stochastic case, we see that the major difference is that the stochastic case requires an additional multiplicative factor δ^{−q/p}. This additional factor is used in the proof of Proposition 3.11 to ensure that P(E_good(δ)^c) = O(δ^{2μq/(2μ+1)}). Formally, we recover the deterministic case from the stochastic case in the limit p → ∞ (where p = ∞ corresponds to almost sure convergence).

Remark 3.16
The proven convergence rate is optimal for μ ∈ (0, 1/2), in the sense that, if only Assumption 3.4 is known, there exists no regularization method that satisfies a better general bound with respect to δ and μ [14, proposition 3.15].
Continuing our discussion from Sect. 2.2, we see from Theorems 3.13 and 3.14 that the three special cases of adaptive EKI defined in Definition 3.1, namely adaptive Standard-, SVD- and Nyström-EKI, are all of (stochastic) optimal order. However, the faster convergence of the SVD- and Nyström-based low-rank approximations means that the sample size J_k does not have to grow as fast as for Standard-EKI, which makes those two methods computationally cheaper.

Corollary 3.17

(i) Let p ∈ [1, ∞), let C_0 be in the trace class, and suppose that Assumption 3.6 is satisfied for γ = 1/2. Then the corresponding convergence rate holds.

(ii) Assume that C_0 satisfies Assumption 2.9 with constant η > 0, and suppose that Eq. 3.51 is satisfied for γ = η. Then the corresponding convergence rate holds.

(iii) Let p ∈ [1, ∞), assume that C_0 satisfies Assumption 2.9 with constant η > 1/2, and suppose that Assumption 3.6 is satisfied for γ = η. Then the corresponding convergence rate holds.

Proof Recall that (X^{aeki}_k)_{k=1}^∞ is a special case of adaptive EKI where the low-rank approximation is generated by (A(U^{(J)}))_{J=1}^∞ (see Theorem 2.14). Thus, if C_0 is in the trace class, Theorem 3.13 applies with γ = 1/2 and yields the desired convergence rate. The corresponding result for (x^{asvd}_k)_{k=1}^∞ follows analogously from Theorems 3.14 and 2.12, while the result for (X^{anys}_k)_{k=1}^∞ follows from Theorems 3.13 and 2.17.
As an example, suppose we know that C_0 is in the trace class, i.e. η ≥ 1. Then Corollary 3.17 implies that Standard-EKI is of optimal order if J_k ≥ b^{−2k} J_0, whereas Nyström-EKI is of optimal order if J_k ≥ b^{−k} J_0 (see Eq. 3.3). This means that Nyström-EKI achieves comparable performance with only the square root of the sample size. Furthermore, if the eigenvalues of C_0 decay faster than O(n^{−1}), Nyström-EKI can take advantage of this, whereas Standard-EKI is limited by the lower bound Eq. 2.10.
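The gap between the two growth schedules can be illustrated with a toy computation; the values b = 0.5 and J_0 = 4 below are arbitrary illustrative choices, not taken from the analysis.

```python
# Illustrative sketch (assumed values b = 0.5, J0 = 4): compare the minimal
# sample-size schedules J_k >= b^(-2k) J0 (Standard-EKI) with
# J_k >= b^(-k) J0 (Nystroem-EKI) over the first few iterations.
b = 0.5
J0 = 4
standard = [int(J0 * b ** (-2 * k)) for k in range(5)]
nystroem = [int(J0 * b ** (-k)) for k in range(5)]
print(standard)  # [4, 16, 64, 256, 1024]
print(nystroem)  # [4, 8, 16, 32, 64]
```

Each Nyström sample size is the geometric mean of J_0 and the corresponding Standard-EKI size, which is what "only the square root of the sample size" refers to.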

Relation to other versions of EKI
Note that our focus differs from the strictly Bayesian setting in which ensemble Kalman inversion is often introduced. In the Bayesian setting, it is assumed that the regularization parameter represents the available prior information, and the regularized solution is identified with the MAP estimate. In regularization theory, we are interested in showing convergence rates in the zero-noise limit, which requires parameter choice rules that select the regularization parameter α in terms of the noise level and properties of the forward operator L. Above, we have focused on the discrepancy principle. In contrast to a-priori choice rules, the discrepancy principle has the advantage that it requires little prior information on the operator L. However, its use is contingent on performing multiple steps of EKI with decreasing values of the regularization parameter. This strategy is closely related to the empirical Bayesian approach, where one assumes a Gaussian prior but treats the regularization parameter as unknown and estimates it from the data (see e.g. [59]). Our analysis shows that, by coupling the sample size to the regularization parameter, it becomes possible to obtain the optimal convergence rates in the zero-noise limit. This is also the major difference between the presented scheme and other versions of EKI.

Relation to multiscale methods
The ideas behind adaptive EKI are similar to sequential multiscale methods, where one iteratively moves from a low-dimensional coarse-scale subspace to finer scales. A related work along these lines is [39], which also applies to ensemble methods, but considers the setting where in each step an approximate solution on a different subspace is computed. Under certain conditions on the multiscale decomposition, this approach can be shown to be equivalent to Tikhonov regularization in the full space. In contrast, the idea behind adaptive EKI is only to approximate Tikhonov regularization, in a way that achieves the same convergence order in the zero-noise limit.

Fig. 1 The Shepp-Logan phantom

Localization
In some practical applications (e.g. numerical weather prediction [26]) it is only feasible to work with ensemble sizes that are orders of magnitude smaller than the parameter dimension. In these situations, localization [18] is often used to increase the effective ensemble size through incorporation of domain knowledge on the correlation structure of the parameter or observation of interest. Since adaptive EKI can be formulated both in square-root and covariance form (see Remark 1.7), it can be combined with most of the existing localization methods, such as covariance localization [25] or local analysis [45]. Note that localization for stochastic EKI has been studied in [58].

Numerical experiments
We performed numerical experiments to evaluate the performance of adaptive EKI.

Test problem
We have chosen inversion of the Radon transform L (see for instance [34]) as our test example. The analytical results show that the large-ensemble limit approximates the Tikhonov-regularized solution, which we aim to verify numerically; we also compare the different variants of EKI in terms of efficiency. As a test object, we use the classic Shepp-Logan phantom [54] with size d × d, d = 100 (see Fig. 1). This corresponds to a parameter dimension of n := dim X = d² = 10,000 and a measurement dimension of m := dim Y = 14,200.
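A minimal sketch of how such a test problem can be set up with scikit-image (the choice of 142 projection angles is an assumption, picked so that m = 100 · 142 = 14,200 matches the stated measurement dimension; the paper does not spell out its angle grid):

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, resize

# Downsample the 400x400 Shepp-Logan phantom to d x d, d = 100, as in the paper.
d = 100
phantom = resize(shepp_logan_phantom(), (d, d))

# Assumed angle grid: 142 equispaced angles in [0, 180) degrees.
angles = np.linspace(0.0, 180.0, 142, endpoint=False)
sinogram = radon(phantom, theta=angles)

print(phantom.size)   # parameter dimension n = d^2 = 10000
print(sinogram.size)  # measurement dimension m
```

With scikit-image's default circular support, the sinogram has one row per detector bin and one column per angle, giving the stated measurement dimension.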

Data simulation
We generated noise ξ_s ∼ N(0, I_m) from a standard normal distribution and then rescaled the noise by setting ξ := (‖y‖ / (10 ‖ξ_s‖)) ξ_s, thereby ensuring a signal-to-noise ratio of exactly 10. We then used ŷ = y + ξ as the noisy measurement for the tested methods. We also rescaled the measurement and the observation operator by ‖ξ‖, so that δ = ‖ŷ − y‖ = 1.
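The noise model above can be sketched in a few lines; the vector `y` below is a random placeholder for the exact data Lx*, not the paper's actual measurement.

```python
import numpy as np

# Sketch of the stated noise model: draw xi_s ~ N(0, I_m) and rescale so that
# the signal-to-noise ratio ||y|| / ||xi|| is exactly 10.
rng = np.random.default_rng(0)
m = 14200
y = rng.normal(size=m)               # placeholder for the exact data
xi_s = rng.standard_normal(m)        # xi_s ~ N(0, I_m)
xi = (np.linalg.norm(y) / (10.0 * np.linalg.norm(xi_s))) * xi_s
y_noisy = y + xi

# After rescaling measurement and operator by ||xi||, the noise level is delta = 1.
delta = np.linalg.norm(y_noisy - y) / np.linalg.norm(xi)
print(round(delta, 6))  # 1.0
```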

Considered methods
In our experiment, we set R = I_m and chose C_0 ∈ R^{n×n} equal to a discretized covariance operator of an Ornstein-Uhlenbeck process with correlation length h > 0 (we used the value h = 0.01). Such operators are often used as prior covariances for Bayesian MAP estimation in tomography, for example in [56]. They correspond to the assumption that the correlation between individual pixels decreases exponentially with distance, (C_0)_{ij} = exp(−‖q_i − q_j‖ / h), where q_i denotes the normalized position of the i-th pixel if the image is scaled to [0, 1] × [0, 1]. We compared the three different instances of adaptive EKI discussed in Sect. 3. The different values of b are used in order to ensure that the sequence of sample sizes (J_k)_{k=1}^∞ is equal for all three methods. Moreover, all methods used the discrepancy principle (see Definition 3.2) with τ = 1.2. In any case, the iterations were aborted once J_k was larger than n, since at this point the computational complexity of EKI is higher than that of Tikhonov regularization.
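A sketch of such a discretized exponential-correlation prior covariance follows; the small grid size d = 20 is chosen only to keep the dense matrix tiny (the experiment uses d = 100), and the kernel exp(−‖q_i − q_j‖/h) is our reading of the stated correlation assumption.

```python
import numpy as np

# Build C0[i, j] = exp(-||q_i - q_j|| / h) for pixel positions q_i on a
# regular grid over [0, 1] x [0, 1], with correlation length h.
d = 20        # illustration only; the paper uses d = 100
h = 0.01
xs = np.linspace(0.0, 1.0, d)
qx, qy = np.meshgrid(xs, xs, indexing="ij")
q = np.stack([qx.ravel(), qy.ravel()], axis=1)   # (d^2, 2) pixel positions
dist = np.linalg.norm(q[:, None, :] - q[None, :, :], axis=-1)
C0 = np.exp(-dist / h)
print(C0.shape)  # (400, 400)
```

By construction, C_0 is symmetric with unit diagonal, and with h = 0.01 the correlation is negligible already between non-adjacent pixels.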

Implementation
The algorithms were implemented in Python and use efficient NumPy [23] and SciPy [60] routines. We used the existing implementation of the Radon transform in the scikit-image library [61], and took advantage of the Ray framework [38].

Fig. 2 The y-axis denotes the relative reconstruction error e_rel. The Standard-EKI iteration was not able to satisfy the discrepancy principle for J_k < n

Convergence of adaptive EKI
For each iteration, we evaluated the relative reconstruction error e_rel := ‖X_k − x*‖ / ‖x*‖. The results are visualized in Fig. 2. Note that every iteration is computationally more expensive than the previous one, since the sample size J_k increases steadily. While the Nyström-EKI and SVD-EKI methods were able to satisfy the discrepancy principle after 18 iterations (with sample size J_18 = 2221), the Standard-EKI iteration was not able to satisfy the discrepancy principle for a sample size less than n. Apart from that, one clearly sees that Nyström-EKI and SVD-EKI both significantly outperform Standard-EKI. Consistent with Theorem 2.12, one may observe that SVD-EKI yields the most accurate reconstruction for a given sample size.
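The reported metric is straightforward to compute; the arrays below are placeholders chosen purely for illustration.

```python
import numpy as np

# Minimal sketch of the reported metric: relative reconstruction error
# e_rel = ||x_hat - x*|| / ||x*||.
def relative_error(x_hat, x_true):
    return np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)

# Toy example: one of four entries is wrong by 1, ||x*|| = 2.
x_true = np.ones(4)
x_hat = np.array([1.0, 1.0, 1.0, 0.0])
print(relative_error(x_hat, x_true))  # 0.5
```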

Comparison of Standard-EKI with Nyström-EKI
In Fig. 3, we visually compare the reconstruction with Standard-EKI to the reconstruction with Nyström-EKI. Both reconstructions use the same value of α and sample size J = 2000. One can see that the standard method is much noisier than the Nyström method. This noise does not come from the noisy measurement; it is introduced by the sampling process.

Convergence to Tikhonov regularization for large sample sizes
In Fig. 4, we have plotted the reconstruction with Nyström-EKI for increasing values of J. For J = 500, the reconstruction is hardly useful. However, for J = 2000 the reconstruction is already almost comparable to the Tikhonov reconstruction, although a little blurred. For higher values of J, the improvement is only marginal. This shows that the presence of noise allows considerable a-priori (that is, not using knowledge of L or ŷ) dimensionality reduction. We also repeated the experiment for a fixed regularization parameter α = 0.03 and different values of J in order to examine the convergence estimate from Sect. 2.3 numerically. In Fig. 5, we plotted the approximation error with respect to Tikhonov regularization, normalized by ‖x*‖, i.e. e_app(x; α) := ‖x − x̂_α‖ / ‖x*‖. In accordance with Proposition 2.7, the approximation error of Standard-EKI decreases like J^{−1/2}. However, it is still significant even if the sample size is close to n. With Nyström-EKI or SVD-EKI, the approximation error becomes negligible even for relatively small sample sizes.
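The O(J^{−1/2}) behaviour of Standard-EKI is the generic Monte-Carlo rate of the sample covariance; the toy diagonal covariance below is an illustrative stand-in for C_0, not the operator from the experiment.

```python
import numpy as np

# Hedged illustration of the O(J^{-1/2}) sampling error behind Standard-EKI:
# the sample covariance of J draws from N(0, C0) approximates C0 at rate
# J^{-1/2} in spectral norm.
rng = np.random.default_rng(1)
n = 30
C0 = np.diag(1.0 / np.arange(1, n + 1))   # toy covariance with decaying spectrum
errors = []
for J in (100, 10000):
    X = rng.multivariate_normal(np.zeros(n), C0, size=J)
    A = X - X.mean(axis=0)
    C_hat = A.T @ A / (J - 1)             # sample covariance from J members
    errors.append(np.linalg.norm(C_hat - C0, 2))
print(errors[0] > errors[1])  # True: the error shrinks as J grows
```

Increasing J by a factor of 100 should shrink the error by roughly a factor of 10, mirroring why Standard-EKI needs the faster-growing sample-size schedule.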

Divergence for small values of α

Keeping the sample size fixed at J = 2000, we then repeated the experiment for different values of α (see Fig. 6). One sees that the approximation error of all three methods explodes as α → 0, which demonstrates the necessity of adapting the sample size. Again, Nyström-EKI and SVD-EKI are superior to Standard-EKI.

Conclusions
We have shown that ensemble Kalman inversion is a convergent regularization method if the sample size is adapted to the regularization parameter. The interpretation of EKI as a low-rank approximation of Tikhonov regularization shows that it provides a tradeoff between exactness and computational cost by shrinking the search space in which we try to reconstruct the unknown parameter x. This approach is suited for problems where the adjoint is not available and the noise is significant, since then the optimal regularization parameter α will typically be larger and a good approximation to the Tikhonov-regularized solution can be achieved for relatively small sample sizes (see Fig. 6).

Fig. 6
The Standard-EKI, Nyström-EKI and SVD-EKI iterations for fixed sample size J and varying regularization parameter. The x-axis denotes the regularization parameter α. The y-axis denotes the scaled approximation error e_app for EKI with sample size J = 2000. As α approaches 0, the approximation error explodes.

It is important to note that the dimensionality reduction in EKI is completely a-priori: it uses no knowledge about the forward operator L or the measurement ŷ. This has the advantage that it also works in the case where the adjoint of L is not available. On the other hand, if one has access to the adjoint of L, one can compute a low-rank approximation of the whole operator C_0^{1/2} L* R^{−1} L C_0^{1/2} instead [16]. This can yield superior results, as it also allows one to exploit the spectral decay of the forward operator L [55].
While EKI was originally developed for nonlinear inverse problems, our insights from the linear case, in particular the need to adapt the sample size to the noise level, can serve as an Ansatz for an analysis of EKI as a regularization method for nonlinear inverse problems.
More generally, the basic ideas of ensemble methods are simple and constitute a very general way to obtain linear dimensionality reduction and algorithms for black-box inverse problems. Therefore, another natural direction of research is to study the resulting stochastic approximations of classical iterative regularization methods, such as the iteratively regularized Gauss-Newton iteration [3], and compare their performance to EKI for the case of nonlinear inverse problems.

A. Appendix: Random elements of Hilbert spaces
We recapitulate basic notions from probability theory on Hilbert spaces.

(v) We call a random element X : Ω → X of a Hilbert space X Gaussian if, for every continuous linear functional L ∈ X*, LX : Ω → R is a Gaussian random element of R; that is, there exist σ_L > 0 and m_L ∈ R such that for all z ∈ R the distribution function of LX at z is that of N(m_L, σ_L²).

(vi) It can be shown that for every m ∈ X and every positive and self-adjoint trace-class operator C there exists a unique Gaussian random element X with E[X] = m and Cov(X) = C. In that case, we will use the notation X ∼ N(m, C).

B. EKI with stochastic perturbations
The deterministic formulation of EKI that we considered in this paper (see Definition 1.6) is based on the ensemble-transform Kalman filter (ETKF) of Bishop, Etherton and Majumdar [5]; it was, for example, also studied in [10]. In contrast, the original formulation of EKI [28] was based on the EnKF with perturbation of measurements [8]. While the ETKF updates the current state estimate X^{(J)}_k and the ensemble anomaly A^{(J)}_k directly, the EnKF iterates a complete ensemble X^{(J)}_k and updates each ensemble member individually. We call this variant the stochastic form of EKI.

Definition B.1 (Stochastic EKI) Given are R ∈ L(Y; Y) and an ensemble X^{(J)}_0 = (X_{0,1}, …, X_{0,J}) of independent and identically distributed random elements X_{0,1}, …, X_{0,J}.
After initialization, each step updates the ensemble members individually and then sets C^{(J)}_{k+1} = C(X^{(J)}_{k+1}). In the present paper, we have focused on the deterministic version of EKI given by Definition 1.6, since it has been observed to perform more reliably in practice, as it does not introduce additional noise at every step of the iteration [10]. Nevertheless, both variants appear to be equivalent in the large-ensemble limit; in the linear case, this has been proven. While we do not know of a corresponding proof in the nonlinear case, it has been observed in numerical experiments that also in that case both the deterministic and the stochastic form of the ensemble Kalman filter converge to the same limit [46].
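As a complement to the verbal description, a single update step of the stochastic form can be sketched as follows; the dimensions, the gain formula and the role of the regularization parameter α are a plausible reading of the scheme for a linear model, not a verbatim transcription of Definition B.1.

```python
import numpy as np

# Hedged sketch of one stochastic-EKI update (EnKF with perturbed measurements)
# for a linear model y = L x.
rng = np.random.default_rng(0)
n, m, J = 5, 3, 200
L = rng.normal(size=(m, n))
y = L @ rng.normal(size=n)            # synthetic exact data
R = np.eye(m)                         # observation-error covariance
alpha = 1.0                           # regularization parameter of this step
X = rng.normal(size=(n, J))           # ensemble, one member per column
C = np.cov(X)                         # ensemble covariance C(X_k)
K = C @ L.T @ np.linalg.inv(L @ C @ L.T + alpha * R)   # Kalman gain
# each ensemble member sees its own perturbed copy of the measurement
Y_pert = y[:, None] + np.sqrt(alpha) * rng.normal(size=(m, J))
X_next = X + K @ (Y_pert - L @ X)
print(X_next.shape)  # (5, 200)
```

Unlike the ETKF, which transforms the anomaly matrix deterministically, each member here is pulled toward its own perturbed measurement, which is the source of the additional sampling noise mentioned above.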