1 Introduction

In recent years, ensemble Kalman inversion (EKI) has become a popular tool for solving inverse problems [28]. EKI has advantages over other iterative methods in situations where the evaluation of the forward operator is costly and information about its adjoint or derivative is unavailable.

While there are some recent results on the convergence of EKI as an optimization method [10, 51, 52, 62], the regularization theory of EKI is still incomplete. In this paper, we provide an analysis of a deterministic form of EKI as a regularization method for solving linear inverse problems. That is, we consider the problem of determining a solution \(x_*\) of the linear operator equation

$$\begin{aligned} y = L x_*, \end{aligned}$$
(1.1)

where \(L:\mathbb {X}\rightarrow \mathbb {Y}\) is a bounded linear operator between Hilbert spaces. We do not assume that we have access to y, but only to a noisy measurement

$$\begin{aligned} \hat{y}= y + \xi , \end{aligned}$$
(1.2)

where \(\xi \) is noise.

Such an analysis is important for three reasons: First, it allows a theoretical comparison of EKI with established iterative regularization methods for inverse problems, such as the iteratively regularized Gauss-Newton method [3] or the iteratively regularized Landweber iteration [49]. Second, it allows the transfer of knowledge between functional-analytic regularization theory, in particular the study of finite-dimensional approximations of Tikhonov regularization [20, 43], and the emerging literature on ensemble methods for the solution of inverse problems (see for example [27] or [46]). Finally, this analysis can potentially serve as the basis for a generalized analysis of EKI for nonlinear inverse problems, making use of the deterministic convergence analysis of iterative regularization methods in Hilbert spaces (see [3, 22, 30]).

It was already noted in [28] that in the case of a linear operator equation the first iteration of EKI converges to the Tikhonov regularized solution as the sample size approaches infinity. It can be shown that—at least for the deterministic version considered in this paper—this also holds true for all subsequent iterates, where each iterate is associated with a different choice of regularization parameter. Thus, in the linear case, EKI can be completely characterized as a stochastic low-rank approximation of Tikhonov regularization. As a consequence we can prove that under appropriate source conditions and by adapting the sample size to the regularization parameter (this method is then called adaptive EKI), we get optimal convergence rates for EKI in the sense formulated for instance in [14]. Moreover, we show that the efficiency of EKI can be increased by the use of more sophisticated low-rank approximation schemes, such as the Nyström method (see e.g. [17]).

The paper is organized as follows:

  • We continue this section by recalling some required notation and functional-analytic prerequisites (Sect. 1.1) and providing an appropriate definition of the deterministic form of EKI that is considered for the rest of this paper (see Sect. 1.2).

  • In Sect. 2.1, we discuss deterministic EKI as an approximation to Tikhonov regularization. In particular, we derive error estimates in dependence of the regularization parameter which build the foundation for the subsequent formulation of an adaptive version. In Sect. 2.2 we review some results and methods for the low-rank approximation of operators, in particular the Nyström method. We show how these methods naturally lead to new versions of EKI.

  • In Sect. 3 we propose an adaptive variant of EKI. The algorithm is described in Sect. 3.1 and analyzed as an iterative regularization method in Sect. 3.2, where we describe conditions under which we can prove optimal convergence rates in the zero-noise limit. This constitutes our main result. Further remarks comparing the proposed scheme with similar methods from the existing literature are given in Sect. 3.3.

  • We conclude our paper in Sect. 4 with numerical experiments in the context of computerized tomography. These experiments demonstrate some advantages and shortcomings of EKI for linear inverse problems. In particular, they show that the Nyström EKI method leads to considerable improvements in terms of numerical performance in comparison to existing sampling methods.

  • The appendix reviews some prerequisites from probability theory, and discusses how our exposition relates to alternative formulations of EKI that have been studied elsewhere.

1.1 Notation and terminology

We summarize basic notation first:

  1. (i)

    \(\mathbb {X}\) and \(\mathbb {Y}\) denote real separable Hilbert spaces.

  2. (ii)

    \(\mathcal {L}(\mathbb {X};\mathbb {Y})\) denotes the space of bounded linear operators from \(\mathbb {X}\) to \(\mathbb {Y}\).

  3. (iii)

    If \(L:\mathbb {X}\rightarrow \mathbb {Y}\) is a linear operator, we let \( \mathcal {D}(L) \subset \mathbb {X}\) denote its domain and \( \mathcal {R}(L) \subset \mathbb {Y}\) denote its range.

  4. (iv)

    We call \(P \in \mathcal {L}(\mathbb {X};\mathbb {X})\) positive if \(\left<Px,x\right>_\mathbb {X} \ge 0\) for all \(x \in \mathbb {X}\).

  5. (v)

    For a positive and self-adjoint operator \(P \in \mathcal {L}(\mathbb {X};\mathbb {X})\), we define the P-weighted norm

    $$\begin{aligned} \left\| x\right\| _P = {\left\{ \begin{array}{ll} \left\| P^{-1/2} x\right\| _\mathbb {X}, &{} \text {if } x \in \mathcal {R}(P^{1/2}) , \\ \infty , &{} \text {else}, \end{array}\right. } \end{aligned}$$

    where the operator \(P^{-1/2}\) is defined as the pseudoinverse of \(P^{1/2}\), which in turn can be defined via spectral theory, see for example [14, chapter 2.3].

  6. (vi)

    Trace class: We say that an operator \(P \in \mathcal {L}(\mathbb {X};\mathbb {X})\) is in the trace class if for any orthonormal basis \((e_n)\) of \(\mathbb {X}\) we have

    $$\begin{aligned} \sum _n \left| \left<P e_n,e_n\right>_\mathbb {X}\right| < \infty . \end{aligned}$$
  7. (vii)

    \((\Omega , \mathcal F, \mathbb {P})\) denotes a probability space.

1.2 Ensemble Kalman inversion for linear inverse problems

Next, we present a particular form of the EKI iteration associated to problem (1.1). The original form of EKI [28], which we refer to as stochastic EKI, evolves a random ensemble through an iteration where additional noise is added in each step. In the last few years, multiple variants of EKI have been developed that incorporate adaptable stepsizes [10, 32] or additional regularization [9]. In particular, one can also formulate a deterministic version that circumvents the addition of noise by directly transforming the ensemble mean and covariance. Such a version of EKI has for example been considered in [10]. In accordance with the literature on ensemble Kalman filtering, we will refer to this as deterministic EKI [26, 57]. A more detailed discussion of its relation to the stochastic form of EKI can be found in “Appendix B”.

The EKI iteration involves two linear operators \(C_{0}:\mathbb {X}\rightarrow \mathbb {X}\) and \(R: \mathbb {Y}\rightarrow \mathbb {Y}\) that characterize regularity assumptions on the solution \(x_*\) and the noise \(\xi \). They have to be provided by the practitioner to represent prior information on the problem. In the rest of this article, we will assume that they satisfy the following conditions:

Assumption 1.1

Let \(C_{0}\in \mathcal {L}(\mathbb {X};\mathbb {X})\) and \(R\in \mathcal {L}(\mathbb {Y};\mathbb {Y})\) be injective, positive and self-adjoint linear operators such that

  1. (i)

    \(C_{0}\) is compact,

  2. (ii)

    \( \mathcal {R}(L) \subset \mathcal {D}( R^{-1/2} ) \), and there exists a constant \(c_{RL}\in \mathbb {R}\) such that

    $$\begin{aligned} \left\| R^{-1/2} L\right\| _{\mathcal {L}(\mathbb {X};\mathbb {Y})} \le c_{RL}. \end{aligned}$$
    (1.3)

Moreover, we assume that the noisy data \(\hat{y}\) defined in Eq. 1.2 satisfies \(\hat{y}\in \mathcal {R}(R^{1/2}) \).

As the next proposition shows, the subspace \( \mathcal {D}( C_{0}^{-1/2} ) \subset \mathbb {X}\) together with the norm \( \left\| \cdot \right\| _{C_{0}} \) yields a Hilbert space. This space will play an important role for our analysis in Sect. 3.

Proposition 1.2

Let \(C_{0}\in \mathcal {L}(\mathbb {X};\mathbb {X})\) be an injective, positive and self-adjoint bounded linear operator. Let

$$\begin{aligned} \left<x,y\right>_{C_{0}} := \left<C_{0}^{-1/2}x,C_{0}^{-1/2}y\right>_\mathbb {X} \quad \text { for all } x,y \in \mathcal {D}(C_{0}^{-1/2}) . \end{aligned}$$

Then \( \mathcal {D}(C_{0}^{-1/2}) \) equipped with the inner product \(\left<\cdot ,\cdot \right>_{C_{0}}\) defines a Hilbert space, denoted by \(\mathbb {X}_{C_{0}}\). Moreover

$$\begin{aligned} \left\| x\right\| _\mathbb {X} \le \left\| C_{0}^{1/2}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \left\| x\right\| _{C_{0}} \quad \text { for all } x \in \mathbb {X}_{C_{0}}. \end{aligned}$$
(1.4)

Proof

The bilinear form \(\left<\cdot ,\cdot \right>_{C_{0}}\) is well-defined on \( \mathcal {D}(C_{0}^{-1/2}) = \mathcal {R}(C_{0}^{1/2}) \) because \(C_{0}\) is injective. Furthermore, this bilinear form is symmetric and positive semidefinite because \(C_{0}^{-1/2}\) is self-adjoint and positive. The definiteness follows from the injectivity of \(C_{0}^{-1/2}\). Eq. 1.4 follows from the boundedness of \(C_{0}\), since we have

$$\begin{aligned} \left\| x\right\| _\mathbb {X} \le \left\| C_{0}^{1/2}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \left\| C_{0}^{-1/2} x\right\| _\mathbb {X} = \left\| C_{0}^{1/2}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \left\| x\right\| _{C_{0}} \quad \text {for all } x \in \mathcal {D}(C_{0}^{-1/2}) . \end{aligned}$$

Finally, the completeness of \( \mathcal {D}(C_{0}^{-1/2}) \) with respect to \(\left\| \cdot \right\| _{C_{0}}\) is a direct consequence of the completeness of \(\mathbb {X}\). \(\square \)

Remark 1.3

At this point, we want to stress that the operator \(R\) does not correspond to the assumption that \(\xi = \hat{y}- y\) is a Gaussian random element of \(\mathbb {Y}\) with covariance \(R\). In fact, in the case where \(\mathbb {Y}\) is infinite-dimensional, one can show that \( \left\| \xi \right\| _{R} = \infty \) with probability 1 (see [7, theorem 2.4.7]). The proper interpretation of \(R\) is that it determines a subspace \(\mathbb {Y}_R \subset \mathbb {Y}\) in which \(\xi \) is assumed to lie (see Proposition 1.2).

Before we continue with the description of the deterministic EKI iteration, we present an illustrative example for a choice of the operators \(C_0\) and R that is often used in practice.

Example 1.4

Let \(\mathbb {X}= L^2(D)\) and \(\mathbb {Y}= L^2(E)\), where \(D \subset \mathbb {R}^{d_1}\) and \(E \subset \mathbb {R}^{d_2}\) are bounded domains with piecewise smooth boundaries, and consider the choice \(C_{0}= (\text {I}_\mathbb {X}- \Delta )^{-1}\) and \(R = \text {I}_\mathbb {Y}- \Delta \). Then the operator \(C_{0}\) is compact. Here, \((\text {I}_\mathbb {X}- \Delta )^{-1}\) is the operator which maps a given function \(\rho \in \mathbb {X}\) onto the weak solution u of the boundary value problem

$$\begin{aligned} \begin{aligned} (\text {I}_\mathbb {X}- \Delta )u&= \rho \text { in } D,\\ \frac{\partial u}{\partial n}&= 0 \text { on } \partial D. \end{aligned} \end{aligned}$$

The range of \(C_{0}^{1/2}\) is \(H^1(D)\), i.e. the Sobolev space of first order. It is easy to see that \(C_{0}^{-1}\) is positive and self-adjoint, and thus so is \(C_{0}\). We also have

$$\begin{aligned} \left\| u\right\| _{C_{0}}^2&= \int _D \left( (\text {I}_\mathbb {X}- \Delta )^{1/2} u\right) ^2 \,d \vec {x} = \int _D u \left( (\text {I}_\mathbb {X}- \Delta )u \right) \,d \vec {x} \\&= \int _D u^2 + \left| \nabla u\right| ^2 \,d \vec {x} = \left\| u\right\| _{H^1(D)}^2. \end{aligned}$$

Similarly

$$\begin{aligned} \left\| v\right\| _{R} ^2 = \left\| v\right\| _{H^{-1}(E)}^2, \end{aligned}$$

where \(H^{-1}(E)\) denotes the dual space of \(H^1(E)\).
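For concreteness, the following is a minimal one-dimensional finite-difference sketch of such a prior (not part of the original example); the grid size, the domain \(D = (0,1)\), the Neumann stencil, and the quadrature are illustrative assumptions.

```python
import numpy as np

# One-dimensional finite-difference sketch of the prior in Example 1.4
# (illustrative assumptions: D = (0, 1), n grid points, Neumann boundary conditions).
n = 200
h = 1.0 / n

# Neumann Laplacian via the standard 3-point stencil; -lap equals D^T D / h^2 for the
# forward-difference matrix D, so it is symmetric negative semidefinite.
lap = np.zeros((n, n))
for i in range(n):
    lap[i, i] = -2.0
    if i > 0:
        lap[i, i - 1] = 1.0
    if i < n - 1:
        lap[i, i + 1] = 1.0
lap[0, 0] = lap[-1, -1] = -1.0
lap /= h ** 2

C0 = np.linalg.inv(np.eye(n) - lap)   # discrete C_0 = (I - Laplacian)^{-1}

# The C_0-weighted norm ||u||_{C_0}^2 = <C_0^{-1} u, u> approximates the H^1 norm
# int u^2 + |u'|^2 dx (up to the quadrature weight h).
u = np.sin(np.pi * np.linspace(0.0, 1.0, n))
print(h * u @ ((np.eye(n) - lap) @ u))   # roughly (1 + pi^2) / 2
```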

The fundamental difference between ensemble methods and existing regularization methods is the use of a stochastic low-rank approximation of \(C_{0}\), which reduces the effective dimension of the parameter space \(\mathbb {X}\). The next definition gives this notion a precise meaning.

Definition 1.5

(Low-rank approximation) Let \(C_{0}\in \mathcal {L}(\mathbb {X};\mathbb {X})\) be a self-adjoint, positive and compact linear operator and let \(\gamma > 0\).

  1. (i)

    Let \(( \varvec{A}^{\scriptscriptstyle (J)} )_{J=1}^\infty \) be a family of bounded linear operators with \( \varvec{A}^{\scriptscriptstyle (J)} \in \mathcal {L}(\mathbb {R}^J;\mathbb {X})\) for all \(J \in \mathbb {N}\). We say that it generates a deterministic low-rank approximation of \(C_0\), of order \(\gamma \), if there exists a constant \(\nu \) such that

    $$\begin{aligned} \left\| \varvec{A}^{\scriptscriptstyle (J)} { \varvec{A}^{\scriptscriptstyle (J)} }^* - C_{0}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \le \nu J^{-\gamma } \qquad \text { for all } J \in \mathbb {N}. \end{aligned}$$
  2. (ii)

    Let \(p \in [1,\infty )\) and \(( \varvec{A}^{\scriptscriptstyle (J)} )_{J=1}^\infty \) be a family of random bounded linear operators (see “Appendix A”) with \( \varvec{A}^{\scriptscriptstyle (J)} (\omega ) \in \mathcal {L}(\mathbb {R}^J;\mathbb {X})\) for all \(\omega \in \Omega \) and \(J \in \mathbb {N}\). We say that it generates a stochastic low-rank approximation of \(C_{0}\), of p-order \(\gamma \), if there exists a constant \(\nu _p\) such that

    $$\begin{aligned} \mathbb {E}\left[ \left\| \varvec{A}^{\scriptscriptstyle (J)} { \varvec{A}^{\scriptscriptstyle (J)} }^* - C_{0}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})}^{p} \right] ^{1 / {p} } \le \nu _p J^{-\gamma } \qquad \text { for all } J \in \mathbb {N}. \end{aligned}$$

Under Assumption 1.1, the following algorithm is well-defined, for all \(k \in \mathbb {N}\).

Definition 1.6

(Deterministic EKI) Let \(( \varvec{A}^{\scriptscriptstyle (J)} )_{J=1}^\infty \) generate a low-rank approximation of \(C_{0}\), and let \(\hat{y}\in \mathbb {Y}\), \(J \in \mathbb {N}\), and an initial guess \(x_0\in \mathbb {X}\) be given.

  • Initialization: Set \({\hat{X}}^{\scriptscriptstyle (J)}_0 := x_0\) and \( \varvec{A}^{\scriptscriptstyle (J)} _0 := \varvec{A}^{\scriptscriptstyle (J)} \).

  • Iteration (\(k \rightarrow k+1\)): Let \( {\varvec{B}_k^{\scriptscriptstyle (J)}} = R^{-1/2} L {\varvec{A}_k^{\scriptscriptstyle (J)}} : \mathbb {R}^J \rightarrow \mathbb {Y}\), and set

    $$\begin{aligned}&{\hat{X}}^{\scriptscriptstyle (J)}_{k+1} = \hat{X}_k^{\scriptscriptstyle (J)} + {\varvec{A}_k^{\scriptscriptstyle (J)}} \left( {\varvec{B}_k^{\scriptscriptstyle (J)}} ^* {\varvec{B}_k^{\scriptscriptstyle (J)}} + \mathbb {I}_J\right) ^{-1} {\varvec{B}_k^{\scriptscriptstyle (J)}} ^* R^{-1/2} (\hat{y}- L \hat{X}_k^{\scriptscriptstyle (J)} ), \end{aligned}$$
    (1.5)
    $$\begin{aligned} \text {and} \quad&\varvec{A}^{\scriptscriptstyle (J)} _{k+1} = {\varvec{A}_k^{\scriptscriptstyle (J)}} \left( {\varvec{B}_k^{\scriptscriptstyle (J)}} ^* {\varvec{B}_k^{\scriptscriptstyle (J)}} + \mathbb {I}_J\right) ^{-1/2}, \end{aligned}$$
    (1.6)

    where \(\mathbb {I}_J\in \mathbb {R}^{J\times J}\) denotes the identity matrix and \( {\varvec{B}_k^{\scriptscriptstyle (J)}} ^*:\mathbb {Y}\rightarrow \mathbb {R}^J\) denotes the adjoint of \( {\varvec{B}_k^{\scriptscriptstyle (J)}} \).

Note that the adjective “deterministic” in Definition 1.6 refers only to the update formula, which—in contrast to the original, stochastic EKI iteration (see Definition B.1)—does not introduce additional noise. Even if a stochastic low-rank approximation is used in Definition 1.6, we will refer to the resulting method as deterministic EKI. In this case, the algorithm is defined pointwise, for every \(\omega \in \Omega \). That is, the quantities \(\hat{X}_k^{\scriptscriptstyle (J)} \), \( {\varvec{A}_k^{\scriptscriptstyle (J)}} \) and \( {\varvec{B}_k^{\scriptscriptstyle (J)}} \) all depend on \(\omega \). For the rest of this paper, we will suppress this dependence. This allows us to treat both deterministic and stochastic low-rank approximations at once.
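To make the square-root update concrete, here is a minimal NumPy sketch of one step of Eqs. 1.5 and 1.6 for discretized operators; the matrix representations of L and \(R^{-1/2}\) and the factor A are generic placeholders (assumptions), not tied to any particular application.

```python
import numpy as np

def eki_step(x, A, L, R_inv_sqrt, y_hat):
    """One deterministic EKI step in square-root form (Eqs. 1.5 and 1.6).

    x          : current iterate, shape (n,)
    A          : current factor A_k of the covariance, shape (n, J)
    L          : discretized forward operator, shape (m, n)
    R_inv_sqrt : matrix representing R^{-1/2}, shape (m, m)
    y_hat      : noisy data, shape (m,)
    """
    J = A.shape[1]
    B = R_inv_sqrt @ L @ A                        # B_k = R^{-1/2} L A_k
    M = B.T @ B + np.eye(J)                       # B_k^* B_k + I_J (small, J x J)
    residual = R_inv_sqrt @ (y_hat - L @ x)       # R^{-1/2}(y_hat - L x_k)
    x_new = x + A @ np.linalg.solve(M, B.T @ residual)        # Eq. (1.5)

    # Eq. (1.6): A_{k+1} = A_k M^{-1/2}, with M^{-1/2} via the eigendecomposition of M.
    w, V = np.linalg.eigh(M)
    A_new = A @ (V @ np.diag(w ** -0.5) @ V.T)
    return x_new, A_new
```

Only the small J-by-J matrix M has to be factorized here, which reflects the computational advantage of the square-root form over the covariance form discussed in Remark 1.7 below.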

Remark 1.7

We have introduced the EKI update equations (Eqs. 1.5 and 1.6) in the so-called square-root form. It is equivalent (see e.g. [57]) to the so-called covariance form, which is more widespread in the literature on the Kalman filter and given by

$$\begin{aligned} \begin{aligned}&{\hat{X}}^{\scriptscriptstyle (J)}_{k+1} = \hat{X}_k^{\scriptscriptstyle (J)} + \varvec{C}_k^{\scriptscriptstyle (J)}L^* \left( L \varvec{C}_k^{\scriptscriptstyle (J)}L^* + R\right) ^{-1} (\hat{y}- L \hat{X}_k^{\scriptscriptstyle (J)} ),\\&\varvec{C}^{\scriptscriptstyle (J)}_{k+1} = \varvec{C}_k^{\scriptscriptstyle (J)}- \varvec{C}_k^{\scriptscriptstyle (J)}L^* \left( L \varvec{C}_k^{\scriptscriptstyle (J)}L^* + R\right) ^{-1}L \varvec{C}_k^{\scriptscriptstyle (J)}. \end{aligned} \end{aligned}$$
(1.7)

The operator \(\varvec{C}_k^{\scriptscriptstyle (J)}\) is related to \( {\varvec{A}_k^{\scriptscriptstyle (J)}} \) from Definition 1.6 via the identity \(\varvec{C}_k^{\scriptscriptstyle (J)}= {\varvec{A}_k^{\scriptscriptstyle (J)}} {\varvec{A}_k^{\scriptscriptstyle (J)}} ^*\), which holds for all \(k \in \mathbb {N}\). The computational difference between these two formulations is that the square-root form requires the inversion of an operator on \(\mathbb {R}^J\), while the covariance form requires inversion of an operator on \(\mathbb {Y}\).

The existing literature on EKI focuses mostly on the case where the low-rank approximation \(( \varvec{A}^{\scriptscriptstyle (J)} )_{J=1}^\infty \) is generated by the so-called anomaly operator of an ensemble \(\varvec{U}^{\scriptscriptstyle (J)}\) of random elements—thus the name “ensemble Kalman inversion”. That is, one uses \( \varvec{A}^{\scriptscriptstyle (J)} = \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)})\), where \( \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)})\) is defined as follows:

Definition 1.8

(Ensemble anomaly) A J-tuple \(\varvec{U}^{\scriptscriptstyle (J)}= (U_1,\ldots ,U_J)\) of random elements of \(\mathbb {X}\) is called a random ensemble. We call the random element

$$\begin{aligned} \overline{U}^{\scriptscriptstyle (J)} := \frac{1}{J} \sum _{j=1}^J U_j \end{aligned}$$
(1.8)

the ensemble mean. Furthermore, we call the random continuous linear operator from \(\mathbb {R}^J\) to \(\mathbb {X}\) (see “Appendix A”) defined by

$$\begin{aligned} \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)})v := \frac{1}{\sqrt{J}} \sum _{j=1}^J v_j (U_j - \overline{U}^{\scriptscriptstyle (J)} ) \qquad \text {for all } v \in \mathbb {R}^J, \end{aligned}$$
(1.9)

the ensemble anomaly.

We will see in Sect. 2.2 that \(( \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)}))_{J=1}^\infty \) generates a stochastic low-rank approximation of \(C_{0}\) if \(U_1,\ldots ,U_J\) are independent Gaussian random elements with \( \text {Cov}\left( U_j^{(J)} \right) = C_{0}\), for all \(j=1,\ldots ,J\). However, the more general Definition 1.6 allows us to consider other forms of low-rank approximations, in particular also deterministic ones (see Sect. 2.2).
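In matrix terms, the ensemble anomaly is simply the column-wise centered ensemble scaled by \(1/\sqrt{J}\). The following short sketch illustrates Definition 1.8 for a diagonal covariance; the choice of \(C_0\) and the ensemble size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, J = 100, 20
C0 = np.diag(1.0 / np.arange(1, n + 1) ** 2)     # illustrative prior covariance

# Ensemble of J independent Gaussian elements with covariance C0 (columns of U).
U = np.sqrt(C0) @ rng.standard_normal((n, J))

# Ensemble mean (Eq. 1.8) and anomaly operator (Eq. 1.9) as an n x J matrix.
U_mean = U.mean(axis=1, keepdims=True)
A = (U - U_mean) / np.sqrt(J)

# A A^* is the (biased) sample covariance, a rank-(J-1) approximation of C0.
print(np.linalg.norm(A @ A.T - C0, ord=2))
```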

Remark 1.9

The update Eq. 1.5 can also be expressed as the solution to a minimization problem, since for all \(k \in \mathbb {N}\), \({\hat{X}}^{\scriptscriptstyle (J)}_{k+1}\) is the minimizer of the functional

$$\begin{aligned} x \in \mathcal {D}( {\varvec{A}_k^{\scriptscriptstyle (J)}} ^*) \longmapsto \left\| L x - \hat{y}\right\| ^2_R + \left\| x-\hat{X}_k^{\scriptscriptstyle (J)} \right\| ^2_{ {\varvec{A}_k^{\scriptscriptstyle (J)}} ( {\varvec{A}_k^{\scriptscriptstyle (J)}} )^*}, \end{aligned}$$
(1.10)

which is well-defined due to Assumption 1.1.

2 EKI as approximate Tikhonov regularization

2.1 Direct EKI

Ensemble Kalman methods originated in data assimilation [15] and are traditionally applied to state estimation in dynamical systems [40, 48]. Following this logic, EKI, which has been developed for the treatment of inverse problems, is often analyzed as a nonstationary regularization method with multiple steps, where the iteration number k controls the amount of regularization. For the deterministic version of EKI given by Eqs. 1.5 and 1.6, one can actually show that multiple iterations with initial covariance operator \(C_{0}\) are equivalent to a single iteration with covariance operator \(\tilde{C}_{0}= \frac{1}{k} C_{0}\). This result can be seen as a direct consequence of the classical equivalence of the Kalman filter to four-dimensional variational data assimilation (4D-VAR) [47].

Theorem 2.1

Let \( {\varvec{B}^{\scriptscriptstyle (J)}} := R^{-1/2} L \varvec{A}^{\scriptscriptstyle (J)} \), and let \((\hat{X}_k^{\scriptscriptstyle (J)} )_{k=1}^\infty \) denote the EKI iteration as defined in Definition 1.6. Then the following representation holds:

$$\begin{aligned}&\hat{X}_k^{\scriptscriptstyle (J)} = x_0+ \varvec{A}^{\scriptscriptstyle (J)} \left( {\varvec{B}^{\scriptscriptstyle (J)}} ^* {\varvec{B}^{\scriptscriptstyle (J)}} + k^{-1} \mathbb {I}_J\right) ^{-1} {\varvec{B}^{\scriptscriptstyle (J)}} ^* R^{-1/2} (\hat{y}- L x_0) \qquad \text {for all } k \in \mathbb {N}. \end{aligned}$$
(2.1)

Proof

Follows from [40, theorem 5.4.7] by setting \(M=\text {I}_\mathbb {X}\), \(H_\xi = L\) and \(f^{(\xi )} = \hat{y}\) for \(\xi =1,\ldots , k\). \(\square \)

A first consequence of Theorem 2.1 is that it allows us to embed EKI into a parameter-dependent family of operators, which we will call direct EKI:

Definition 2.2

(Direct EKI) Suppose that Assumption 1.1 holds, and let \(\alpha > 0\). Then, we define the direct EKI in the following way

$$\begin{aligned} \begin{aligned} \hat{X}^{d, \scriptscriptstyle (J)}_\alpha&:= x_0+ K_\alpha ( \varvec{A}^{\scriptscriptstyle (J)} ) (\hat{y}- L x_0),\\ \text {where} \quad K_\alpha ( \varvec{A}^{\scriptscriptstyle (J)} )&:= \varvec{A}^{\scriptscriptstyle (J)} \left( { \varvec{A}^{\scriptscriptstyle (J)} }^* L^* R^{-1} L \varvec{A}^{\scriptscriptstyle (J)} + \alpha \mathbb {I}_J\right) ^{-1} { \varvec{A}^{\scriptscriptstyle (J)} }^* L^*R^{-1}. \end{aligned} \end{aligned}$$
(2.2)

According to Eq. 2.1, we have

$$\begin{aligned} \hat{X}^{d,\scriptscriptstyle (J)} _{1/k} = \hat{X}_k^{\scriptscriptstyle (J)} . \end{aligned}$$
(2.3)

That is, the k-th iterate of deterministic EKI is equivalent to direct EKI with the choice \(\alpha = 1/k\).
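The equivalence in Eq. 2.3 is easy to verify numerically. The following sketch compares k iterations of the square-root update (Eqs. 1.5 and 1.6) with a single application of direct EKI at \(\alpha = 1/k\) on a small synthetic problem; all matrices below are random placeholders chosen only for illustration (assumptions), with \(R = \text{I}_\mathbb{Y}\) for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, J, k = 30, 20, 10, 5

L = rng.standard_normal((m, n))                # synthetic forward operator (assumption)
A = rng.standard_normal((n, J)) / np.sqrt(J)   # some low-rank factor A^(J)
x0 = np.zeros(n)
y_hat = rng.standard_normal(m)                 # synthetic data; R = I_Y for simplicity

# Direct EKI (Eq. 2.2) with alpha = 1/k.
alpha = 1.0 / k
B = L @ A
x_direct = x0 + A @ np.linalg.solve(B.T @ B + alpha * np.eye(J), B.T @ (y_hat - L @ x0))

# k steps of the square-root iteration (Eqs. 1.5 and 1.6).
x, Ak = x0.copy(), A.copy()
for _ in range(k):
    Bk = L @ Ak
    M = Bk.T @ Bk + np.eye(J)
    x = x + Ak @ np.linalg.solve(M, Bk.T @ (y_hat - L @ x))
    w, V = np.linalg.eigh(M)
    Ak = Ak @ (V @ np.diag(w ** -0.5) @ V.T)

# Theorem 2.1 / Eq. 2.3: both results agree up to round-off.
print(np.linalg.norm(x_direct - x))
```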

Next, we derive error estimates between direct EKI and Tikhonov regularization in terms of the sample size J and the regularization parameter \(\alpha \). To this end, let us recall the notion of the Tikhonov-regularized solution of Eq. 1.1.

Definition 2.3

(Tikhonov regularization) Let Assumption 1.1 hold. Then the unique minimizer of

$$\begin{aligned} x \in \mathbb {X}\longmapsto \left\| \hat{y}- Lx\right\| _R^2 + \alpha \left\| x-x_0\right\| _{C_{0}}^2 \end{aligned}$$
(2.4)

is called the Tikhonov regularized solution of Eq. 1.1 according to the data \(\hat{y}\) and the regularization parameter \(\alpha \). It is denoted by \(\hat{x}_{\alpha }\) and explicitly represented by

$$\begin{aligned} \begin{aligned} \hat{x}_{\alpha }&:= x_0+ \mathcal K_\alpha (\hat{y}- L x_0), \\ \text {where} \quad \mathcal K_\alpha&:= C_{0}^{1/2} \left( C_{0}^{1/2} L^* R^{-1} L C_{0}^{1/2} + \alpha \text {I}_\mathbb {X}\right) ^{-1} C_{0}^{1/2} L^* R^{-1}, \end{aligned} \end{aligned}$$
(2.5)

and where \(\text {I}_\mathbb {X}:\mathbb {X}\rightarrow \mathbb {X}\) is the identity operator.

Remark 2.4

We emphasize the notational difference between Eqs. 2.2 and 2.5: \(\text {I}_\mathbb {X}\) denotes the identity operator on \(\mathbb {X}\), while \(\mathbb {I}_J\in \mathbb {R}^{J \times J}\) denotes the identity matrix on \(\mathbb {R}^J\).

Example 2.5

Consider again Example 1.4. In that case Eq. 2.4 becomes

$$\begin{aligned} x \in \mathbb {X}\longmapsto \left\| \hat{y}- Lx\right\| _{H^{-1}(E)}^2 + \alpha \left\| x-x_0\right\| _{H^1(D)}^2. \end{aligned}$$

If we compare Eqs. 2.2 and 2.5, we observe that the main difference between Tikhonov regularization and direct EKI is the replacement of the operator \(\mathcal K_\alpha \) (Tikhonov) by a low-rank approximation \(K_\alpha ( \varvec{A}^{\scriptscriptstyle (J)} )\) (direct EKI). In the following, we estimate the difference between the random element \( \hat{X}^{d, \scriptscriptstyle (J)}_\alpha \) and the Tikhonov regularized solution \(\hat{x}_{\alpha }\).

Lemma 2.6

(Tikhonov versus direct EKI) Let \(\alpha > 0\) and suppose that Assumption 1.1 holds. Then there exists a constant c, independent of J, such that

$$\begin{aligned} \left\| \hat{X}^{d, \scriptscriptstyle (J)}_\alpha - \hat{x}_{\alpha }\right\| _\mathbb {X} \le c \alpha ^{-1} \left\| \varvec{A}^{\scriptscriptstyle (J)} { \varvec{A}^{\scriptscriptstyle (J)} }^* - C_{0}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \qquad \text {for all } J \in \mathbb {N}. \end{aligned}$$
(2.6)

Proof

By Eqs. 2.5 and 2.2 we have

$$\begin{aligned} \hat{X}^{d, \scriptscriptstyle (J)}_\alpha - \hat{x}_{\alpha }= (K_\alpha ( \varvec{A}^{\scriptscriptstyle (J)} ) R^{1/2} - \mathcal K_\alpha R^{1/2}) R^{-1/2}(\hat{y}- L x_0). \end{aligned}$$
(2.7)

Using spectral theory, one can show

$$\begin{aligned} \left\| (P + \alpha \text {I}_\mathbb {X})^{-1}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})}&\le \alpha ^{-1}, \end{aligned}$$
(2.8)
$$\begin{aligned} \text {and} \qquad \left\| (P + \alpha \text {I}_\mathbb {X})^{-1} - (Q + \alpha \text {I}_\mathbb {X})^{-1}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})}&\le \alpha ^{-1} \left\| P-Q\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})}, \end{aligned}$$
(2.9)

for all positive and self-adjoint bounded linear operators P and Q (see [14, section 2.3]). Furthermore, recall that every linear operator A satisfies the identity \((A^*A + \alpha \text {I})^{-1}A^* = A^* (AA^* + \alpha \text {I})^{-1}\), where \(\text {I}\) denotes the identity on the respective space, whenever one of these expressions is well-defined. With this, one can show that

$$\begin{aligned} K_\alpha ( \varvec{A}^{\scriptscriptstyle (J)} )&= \varvec{A}^{\scriptscriptstyle (J)} { \varvec{A}^{\scriptscriptstyle (J)} }^* L^* R^{-1/2} \left( {\varvec{B}^{\scriptscriptstyle (J)}} {\varvec{B}^{\scriptscriptstyle (J)}} ^* + \alpha \text {I}_\mathbb {Y}\right) ^{-1} R^{-1/2},\\ \text {and} \quad \mathcal K_\alpha&= C_{0}L^* R^{-1/2} \left( B B^* + \alpha \text {I}_\mathbb {Y}\right) ^{-1} R^{-1/2}, \end{aligned}$$

where we used the notation \( {\varvec{B}^{\scriptscriptstyle (J)}} := R^{-1/2} L \varvec{A}^{\scriptscriptstyle (J)} \) and \(B := R^{-1/2} L C_0^{1/2}\) for brevity. These identities imply

$$\begin{aligned} K_\alpha ( \varvec{A}^{\scriptscriptstyle (J)} ) - \mathcal K_\alpha&= ( \varvec{A}^{\scriptscriptstyle (J)} { \varvec{A}^{\scriptscriptstyle (J)} }^* - C_{0}) L^* R^{-1/2} \left( {\varvec{B}^{\scriptscriptstyle (J)}} {\varvec{B}^{\scriptscriptstyle (J)}} ^* + \alpha \text {I}_\mathbb {Y}\right) ^{-1} R^{-1/2} \\&\quad + C_{0}L^* R^{-1/2} \left[ \left( {\varvec{B}^{\scriptscriptstyle (J)}} {\varvec{B}^{\scriptscriptstyle (J)}} ^* + \alpha \text {I}_\mathbb {Y}\right) ^{-1} - \left( B B^* + \alpha \text {I}_\mathbb {Y}\right) ^{-1}\right] R^{-1/2}. \end{aligned}$$

Taking norms and using Eqs. 2.8, 2.9, and 1.3, we then obtain

$$\begin{aligned}&\left\| K_\alpha ( \varvec{A}^{\scriptscriptstyle (J)} )R^{1/2} - \mathcal K_\alpha R^{1/2}\right\| _{\mathcal {L}(\mathbb {Y};\mathbb {X})} \\&\quad \le c_{RL} (1 + c_{RL}^2 \left\| C_0\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})}) \alpha ^{-1} \left\| \varvec{A}^{\scriptscriptstyle (J)} { \varvec{A}^{\scriptscriptstyle (J)} }^* - C_{0}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})}. \end{aligned}$$

Taking norms in Eq. 2.7 and inserting this estimate proves the assertion. \(\square \)

This lemma shows that the difference between Tikhonov regularization and direct EKI can be bounded in terms of the difference between the operators \( \varvec{A}^{\scriptscriptstyle (J)} { \varvec{A}^{\scriptscriptstyle (J)} }^*\) and \(C_{0}\). If this difference decreases with a certain rate with respect to J, then direct EKI converges to Tikhonov regularization with the same rate.

Proposition 2.7

(Convergence of EKI to Tikhonov) Let Assumption 1.1 hold.

  1. (i)

    If \(( \varvec{A}^{\scriptscriptstyle (J)} )_{J=1}^\infty \) generates a deterministic low-rank approximation of \(C_{0}\) of order \(\gamma \), then there exists a constant \(\kappa \) such that

    $$\begin{aligned} \left\| \hat{X}^{d, \scriptscriptstyle (J)}_\alpha - \hat{x}_{\alpha }\right\| _\mathbb {X} \le \kappa \alpha ^{-1} J^{-\gamma } \qquad \text {for all } \alpha > 0 \text { and all } J \in \mathbb {N}. \end{aligned}$$
  2. (ii)

    Let \(p \in [1,\infty )\). If \(( \varvec{A}^{\scriptscriptstyle (J)} )_{J=1}^\infty \) generates a stochastic low-rank approximation of \(C_{0}\) of p-order \(\gamma \), then there exists a constant \(\kappa _p\) such that

    $$\begin{aligned} \mathbb {E}\left[ \left\| \hat{X}^{d, \scriptscriptstyle (J)}_\alpha - \hat{x}_{\alpha }\right\| _\mathbb {X}^{p} \right] ^{1 / {p} } \le \kappa _p \alpha ^{-1} J^{-\gamma } \qquad \text {for all } \alpha > 0 \text { and all } J \in \mathbb {N}. \end{aligned}$$

Proof

Follows directly from Lemma 2.6 and Definition 1.5 with \(\kappa =\nu \cdot c\) and \(\kappa _p = \nu _p \cdot c\). \(\square \)

Remark 2.8

As an alternative to the above derivation, the convergence of deterministic EKI to Tikhonov regularization can be seen as a special case of the convergence of the ensemble square-root filter to the Kalman filter, see for example [35] or [40, section 5.4]. However, the results presented here are better suited to investigate convergence rates of EKI as a regularization method (see Sect. 3), since they explicitly describe the dependence of the approximation error on the regularization parameter \(\alpha \). We also note that different types of finite-dimensional approximations of Tikhonov regularization have been studied elsewhere, for example in [20, 43].

2.2 Optimal low-rank approximations for EKI

Proposition 2.7 shows that direct EKI, and thus also EKI, converges to Tikhonov regularization with a rate equal to the order of the employed low-rank approximation. In general, low-rank approximations of a given positive order only exist if the eigenvalues of \(C_{0}\) satisfy a corresponding decay condition.

Assumption 2.9

(Decreasing eigenvalues of \(C_{0}\)) Let \(C_{0}\) satisfy Assumption 1.1, and let \((\lambda _n)\) denote its eigenvalues in decreasing order. We assume that there exists a constant \(\eta > 0\) such that

$$\begin{aligned} \lambda _n = O(n^{-\eta }). \end{aligned}$$

Remark 2.10

In this paper, we always assume that all eigenvalues are repeated according to their multiplicities.

Example 2.11

Consider Example 1.4. In this case, Assumption 2.9 is satisfied with \(\eta = 2/d_1\), where \(d_1\) is the dimension of the domain D [33].

Under Assumption 2.9, the Schmidt–Eckart–Young–Mirsky theorem [13, 37, 53] states that the best possible order of any low-rank approximation of \(C_{0}\) is \(\eta \), and that it is achieved by the truncated singular value decomposition.

Theorem 2.12

(Schmidt–Eckart–Young–Mirsky) Let \(C_{0}\in \mathcal {L}(\mathbb {X};\mathbb {X})\) be positive, self-adjoint and compact, and let \((\lambda _n)\) denote its eigenvalues in decreasing order. Let \( {\varvec{A}^{\scriptscriptstyle (J)}_\text {svd}} {\varvec{A}^{\scriptscriptstyle (J)}_\text {svd}} ^*\) denote the J-truncated singular value decomposition of \(C_{0}\). Then

$$\begin{aligned}&\left\| {\varvec{A}^{\scriptscriptstyle (J)}_\text {svd}} {\varvec{A}^{\scriptscriptstyle (J)}_\text {svd}} ^* - C_{0}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} = \lambda _{J+1} \\&\quad = \inf \left\{ \,\left\| \varvec{P} - C_{0}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})}:\, \varvec{P} \in \mathcal {L}(\mathbb {X};\mathbb {X}),~ \text {rank}(\varvec{P}) \le J\,\right\} . \end{aligned}$$

Remark 2.13

Note that the optimal possible order for a low-rank approximation does not directly depend on the dimension of the underlying spaces \(\mathbb {X}\) and \(\mathbb {Y}\), only on the decay of the eigenvalues of \(C_{0}\). This means that we obtain dimension-independent convergence rates as long as the eigenvalues of \(C_{0}\) decay sufficiently fast.
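For a discretized covariance matrix, the optimal factor \( {\varvec{A}^{\scriptscriptstyle (J)}_\text {svd}} \) of Theorem 2.12 is obtained from the leading eigenpairs. Below is a minimal sketch; the covariance with eigenvalues \(n^{-2}\) (so \(\eta = 2\)) and the random basis are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, J = 100, 10

# Illustrative covariance with eigenvalues lambda_n = n^{-2} in a random orthonormal basis.
eigvals = 1.0 / np.arange(1, n + 1) ** 2
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
C0 = Q @ np.diag(eigvals) @ Q.T

# J-truncated spectral decomposition: A_svd A_svd^* keeps the J largest eigenpairs.
w, V = np.linalg.eigh(C0)
idx = np.argsort(w)[::-1][:J]
A_svd = V[:, idx] * np.sqrt(w[idx])

# Theorem 2.12: the operator-norm error equals lambda_{J+1}.
print(np.linalg.norm(A_svd @ A_svd.T - C0, ord=2), eigvals[J])
```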

Existing formulations of EKI generate a stochastic low-rank approximation of \(C_{0}\) from the ensemble anomaly \( \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)})\) (see Definition 1.8) of a randomly generated ensemble \(\varvec{U}^{\scriptscriptstyle (J)}\). For this type of approximation, we have the following result.

Theorem 2.14

(Low-rank approximation of \(C_{0}\)) Assume that \(C_{0}\) is in the trace class and let \(p \in [1,\infty )\) be fixed. Moreover, for every \(J \in \mathbb {N}\), let \(\varvec{U}^{\scriptscriptstyle (J)}= [U_1,\ldots ,U_J] \in \mathbb {X}^J\) be an ensemble of independent Gaussian random elements with \( \text {Cov}\left( U_j \right) = C_{0}\), for all \(j \in \{ 1,\ldots , J \}\), and let \( \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)})\) be as in Definition 1.8. Then \(( \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)}))_{J=1}^\infty \) generates a stochastic low-rank approximation of \(C_{0}\), of p-order 1/2, meaning that there exists a constant \(\nu _p\) such that

$$\begin{aligned} \mathbb {E}\left[ \left\| \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)}) \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)})^* - C_{0}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})}^{p} \right] ^{1 / {p} } \le \nu _p J^{-1/2} \qquad \text {for all } J \in \mathbb {N}. \end{aligned}$$

In particular, for \(p=1\), there exists a constant \(c > 0\) such that

$$\begin{aligned} c J^{-1/2} \le \mathbb {E}\left[ \left\| \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)}) \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)})^* - C_{0}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \right] \qquad \text {for all } J \in \mathbb {N}. \end{aligned}$$
(2.10)

Proof

See [31]. \(\square \)
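The rate of Theorem 2.14 can also be observed empirically. The following small Monte Carlo sketch uses an illustrative trace-class \(C_0\) (eigenvalues \(n^{-2}\)); the ensemble sizes and number of repetitions are chosen purely for demonstration (assumptions).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
C0_sqrt = np.diag(1.0 / np.arange(1, n + 1))      # C0 has eigenvalues n^{-2}, trace class
C0 = C0_sqrt @ C0_sqrt

def mean_anomaly_error(J, reps=50):
    errs = []
    for _ in range(reps):
        U = C0_sqrt @ rng.standard_normal((n, J))                 # ensemble with covariance C0
        A = (U - U.mean(axis=1, keepdims=True)) / np.sqrt(J)      # anomaly (Eq. 1.9)
        errs.append(np.linalg.norm(A @ A.T - C0, ord=2))
    return np.mean(errs)

# The mean error should roughly halve whenever J is quadrupled (p-order 1/2).
for J in (10, 40, 160, 640):
    print(J, mean_anomaly_error(J))
```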

Example 2.15

We want to give some examples of trace-class operators on \(L^2(\mu )\), where \(\mu \) is a Radon measure on a domain \(U \subset \mathbb {R}^d\) with \({{\,\mathrm{supp}\,}}\mu = U\). Then, Mercer’s theorem (see e.g. [11, Theorem 5.6.9]) characterizes a large class of trace-class operators: An operator \(P: L^2(\mu ) \rightarrow L^2(\mu )\) is in the trace-class if it can be represented by an integrable continuous positive-definite kernel, i.e.

$$\begin{aligned} P f(x) = \int _U k(x,y)f(y) \,\text {d}\mu (y). \end{aligned}$$

The following result on trace-class operators allows us to directly compare the order of \(( \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)}))_{J=1}^\infty \) to the theoretical optimum defined in Theorem 2.12.

Proposition 2.16

Let \(C_{0}\) be a positive and self-adjoint trace-class operator with eigenvalues \((\lambda _n)\). Then

$$\begin{aligned} \lambda _n = O(n^{-1}). \end{aligned}$$

Proof

It follows from the assumptions on \(C_{0}\) that

$$\begin{aligned} \sum _{n=1}^\infty \lambda _n < \infty , \end{aligned}$$

(see e.g. [11, lemma 5.6.2]) which implies \(\lambda _n = O(n^{-1})\). \(\square \)

Therefore, if \(C_{0}\) is in the trace-class, then according to Theorem 2.12 the optimal low-rank approximation of \(C_{0}\) is given by \( {\varvec{A}^{\scriptscriptstyle (J)}_\text {svd}} {\varvec{A}^{\scriptscriptstyle (J)}_\text {svd}} ^*\) and is at least of order 1. However, since the low-rank approximation generated by \(( \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)}))_{J=1}^\infty \) satisfies the lower bound Eq. 2.10, the ensemble-based low-rank approximation, while cheaper, is not of optimal order.

This leads to the question of whether there exist low-rank approximations of \(C_{0}\) that are of optimal order but do not require knowledge of the singular value decomposition of \(C_{0}\). The answer to this question is yes: there exist stochastic low-rank approximations that are of optimal order and require only O(J) evaluations of \(C_{0}\) [21]. An example of such a scheme is the Nyström method [17, 41, 44]. We will consider a special case given by Algorithm 1.

[Algorithm 1: the Nyström low-rank approximation \( {\varvec{A}^{\scriptscriptstyle (J)}_\text {nys}} \) of \(C_{0}\)]
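Algorithm 1 itself is not reproduced above. The following is a minimal sketch of a standard randomized Nyström factorization in the spirit of [17, 21], provided only for orientation: the concrete steps (Gaussian test matrix, QR-based range finder \(\varvec{Q}^{\scriptscriptstyle (J)}\), Cholesky factorization of the core matrix) are assumptions about a typical implementation and need not coincide with Algorithm 1 in every detail.

```python
import numpy as np

def nystroem_factor(apply_C0, n, J, rng, shift=1e-12):
    """Sketch of a randomized Nystroem factorization of a positive operator C0.

    apply_C0 : callable returning C0 @ W for a matrix W with n rows
    Returns A_nys with A_nys @ A_nys.T approximating C0 with rank at most J.
    """
    Omega = rng.standard_normal((n, J))          # Gaussian test matrix
    Y = apply_C0(Omega)                          # J applications of C0
    Q, _ = np.linalg.qr(Y)                       # orthonormal range approximation Q^(J)
    Z = apply_C0(Q)                              # C0 Q
    core = Q.T @ Z                               # Q^* C0 Q, small J x J and positive
    Lc = np.linalg.cholesky(core + shift * np.eye(J))   # tiny shift for numerical safety
    # Nystroem approximation (C0 Q)(Q^* C0 Q)^{-1}(C0 Q)^* in factored form.
    return np.linalg.solve(Lc, Z.T).T

# Usage on an illustrative covariance with eigenvalues n^{-2}:
rng = np.random.default_rng(4)
n, J = 200, 20
C0 = np.diag(1.0 / np.arange(1, n + 1) ** 2)
A_nys = nystroem_factor(lambda W: C0 @ W, n, J, rng)
print(np.linalg.norm(A_nys @ A_nys.T - C0, ord=2))
```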

It has been shown that this method leads to a stochastic low-rank approximation of optimal order.

Theorem 2.17

(Nyström low-rank approximation) Let \( {\varvec{A}^{\scriptscriptstyle (J)}_\text {nys}} \) be obtained from Algorithm 1 and let \((\lambda _n)\) denote the decreasing eigenvalues of \(C_{0}\). Then

$$\begin{aligned} \mathbb {E}\left[ \left\| {\varvec{A}^{\scriptscriptstyle (J)}_\text {nys}} {\varvec{A}^{\scriptscriptstyle (J)}_\text {nys}} ^* - C_{0}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \right] \le \left( 1 + \frac{J}{J-N-1}\right) \lambda _{N+1} + \frac{e \sqrt{J}}{J-N-1} \sqrt{ \sum _{n > N} \lambda _n^2}, \end{aligned}$$
(2.11)

for all \(N \in \mathbb {N}\) with \(N \le J-2\), where e denotes Euler’s number. In particular, if Assumption 2.9 is satisfied with \(\eta >1/2\), we have

$$\begin{aligned} \mathbb {E}\left[ \left\| {\varvec{A}^{\scriptscriptstyle (J)}_\text {nys}} {\varvec{A}^{\scriptscriptstyle (J)}_\text {nys}} ^* - C_{0}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \right] = O(J^{-\eta }). \end{aligned}$$
(2.12)

Proof

It follows from lemma 4 in [12] that

$$\begin{aligned} \left\| {\varvec{A}^{\scriptscriptstyle (J)}_\text {nys}} {\varvec{A}^{\scriptscriptstyle (J)}_\text {nys}} ^* - C_{0}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \le \left\| {\varvec{Q}^{\scriptscriptstyle (J)}} {\varvec{Q}^{\scriptscriptstyle (J)}} ^* C_{0}- C_{0}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})}, \end{aligned}$$

where \( {\varvec{Q}^{\scriptscriptstyle (J)}} \) is as in algorithm 1. The right-hand side can be estimated using [21, theorem 10.6] (the adaptation to our infinite-dimensional setting is straightforward), yielding Eq. 2.11. If we then choose \(N = J/2\) in Eq. 2.11 (assuming without loss of generality that J is even), the right-hand side becomes

$$\begin{aligned}&\left( 1 + \frac{J}{J/2-1}\right) \lambda _{J/2+1} + \frac{e \sqrt{J}}{J/2-1} \sqrt{ \sum _{n > J/2} \lambda _n^2} \le O(J^{-\eta }) + \frac{e \sqrt{J}}{J/2 - 1} O(J^{-\eta + 1/2})\\&\quad = O(J^{-\eta }). \end{aligned}$$

\(\square \)

Remark 2.18

By adapting the proof of [21, theorem 10.6], one could also show that the Nyström method is of p-order \(\eta \), for all \(p \in [1,\infty )\).

We will see in Sect. 4 that the accuracy of the Nyström method is very close to the theoretical optimum given by the truncated singular value decomposition.

2.3 Convergence of direct EKI

The ensemble anomaly, truncated singular value decomposition, Nyström method, or in fact any other method for the low-rank approximation of positive operators can be used inside EKI. The corresponding error estimates with respect to Tikhonov regularization follow then directly from Proposition 2.7.

Corollary 2.19

Suppose that Assumption 1.1 is satisfied. Then:

  1. (i)

    Let \( \varvec{A}^{\scriptscriptstyle (J)} = \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)})\). If \(C_{0}\) is in the trace-class, then for all \(p \in [1,\infty )\) there exists a constant \(\kappa _p^\text {en}\) such that

    $$\begin{aligned} \mathbb {E}\left[ \left\| \hat{X}^{d, \scriptscriptstyle (J)}_\alpha - \hat{x}_{\alpha }\right\| _\mathbb {X}^{p} \right] ^{1 / {p} } \le \kappa _p^\text {en} \alpha ^{-1} J^{-1/2} \qquad \text {for all } \alpha > 0. \end{aligned}$$
    (2.13)
  2. (ii)

    Let \( \varvec{A}^{\scriptscriptstyle (J)} = {\varvec{A}^{\scriptscriptstyle (J)}_\text {svd}} \). Then \( \hat{X}^{d, \scriptscriptstyle (J)}_\alpha \) is deterministic, and if Assumption 2.9 holds, then there exists a constant \(\kappa ^\text {svd}\) such that

    $$\begin{aligned} \left\| \hat{X}^{d, \scriptscriptstyle (J)}_\alpha - \hat{x}_{\alpha }\right\| _\mathbb {X} \le \kappa ^\text {svd} \alpha ^{-1} J^{-\eta } \qquad \text {for all } \alpha > 0. \end{aligned}$$
    (2.14)
  3. (iii)

    Let \( \varvec{A}^{\scriptscriptstyle (J)} = {\varvec{A}^{\scriptscriptstyle (J)}_\text {nys}} \). If Assumption 2.9 holds with \(\eta > 1/2\), then there exists a constant \(\kappa ^\text {nys}\) such that

    $$\begin{aligned} \mathbb {E}\left[ \left\| \hat{X}^{d, \scriptscriptstyle (J)}_\alpha - \hat{x}_{\alpha }\right\| _\mathbb {X} \right] \le \kappa ^\text {nys} \alpha ^{-1} J^{-\eta } \qquad \text {for all } \alpha > 0. \end{aligned}$$
    (2.15)

Proof

Let \(p \in [1,\infty )\). By Theorem 2.14, \(( \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)}))_{J=1}^\infty \) generates a stochastic low-rank approximation of p-order 1/2. Thus, Eq. 2.13 follows from Proposition 2.7. The estimates Eqs. 2.14 and 2.15 then follow analogously through Theorems 2.12 and 2.17, respectively. \(\square \)

3 Adaptive ensemble Kalman inversion

We have seen in Proposition 2.7 that direct ensemble Kalman inversion can be understood as a low-rank approximation of Tikhonov regularization. It is well-known that, under a standard source-condition (see Assumption 3.4 below), the Tikhonov-regularized solution of a linear equation converges to the infinite-dimensional minimum-norm solution (see Definition 3.3) in the zero-noise limit with a certain rate (see e. g. [14]). Thus, if we ensure that the error between direct EKI and Tikhonov regularization vanishes with the same rate as Tikhonov regularization converges, then direct EKI will also converge with that rate. However, since its iterates are restricted to the finite-dimensional range of \( \varvec{A}^{\scriptscriptstyle (J)} \), direct EKI can only lead to a convergent regularization method if the sample size J is adapted to the noise level.

In this section, we describe how this can be achieved in conjunction with the discrepancy principle. The resulting method, which we call adaptive ensemble Kalman inversion, is a convergent regularization method of optimal order in a sense that will be made precise below. For this result, we require knowledge of a number \(\delta > 0\) such that

$$\begin{aligned} \left\| \hat{y}- y\right\| _{R} \le \delta . \end{aligned}$$
(3.1)

This assumption is often referred to as a deterministic noise model, and the number \(\delta \) is called the deterministic noise level. For some results on regularization with random noise, see for example [6].

We start with a precise description of the adaptive EKI method in Sect. 3.1, followed by a convergence analysis of the zero-noise limit in Sect. 3.2. General remarks explaining the connection to other forms of EKI and multiscale methods are given in Sect. 3.3.

3.1 Description of the method

We start by presenting a version of direct EKI with a-posteriori parameter choice rule in the form of the discrepancy principle. We will refer to this method as adaptive EKI.

In our definition, we distinguish between the cases where the underlying low-rank approximation is deterministic and stochastic. In the stochastic case, we will use a projection onto a suitably large ball around the initial guess \(x_0\). This projection serves to guarantee stability of the resulting iteration even in the presence of non-deterministic sampling error. In Sect. 3.2, we will see that if the radius of the ball is chosen sufficiently large, it does not negatively affect the convergence behavior.

Definition 3.1

(Adaptive EKI) Let \(\gamma > 0\), \(b \in (0,1)\), \(\alpha _0 > 0\) and \(J_0 \in \mathbb {N}\), and define

$$\begin{aligned} \alpha _k&= b^k \alpha _0, \end{aligned}$$
(3.2)
$$\begin{aligned} \text {and} \quad J_k&= \lceil b^{-\frac{k}{\gamma }} J_0 \rceil . \end{aligned}$$
(3.3)
  • If \(( \varvec{A}^{\scriptscriptstyle (J)} )_{J=1}^\infty \) generates a deterministic low-rank approximation of \(C_{0}\), of order \(\gamma \), we define the adaptive EKI iteration associated to \(( \varvec{A}^{\scriptscriptstyle (J)} )_{J=1}^\infty \) as

    $$\begin{aligned} \hat{x}^\text {a}_k := \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}, \qquad \text {for } k \in \mathbb {N}, \end{aligned}$$
    (3.4)

    where \( \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}\) is defined in Definition 2.2. If \( \varvec{A}^{\scriptscriptstyle (J)} = {\varvec{A}^{\scriptscriptstyle (J)}_\text {svd}} \) (see Theorem 2.12), we refer to the method as adaptive SVD-EKI and denote its iterates with \(\hat{x}^\text {asvd}_k\).

  • Let \(r > 0\) and let \({\overline{B}_r(x_0)}\) denote the closed ball around \(x_0\) with radius r. Let \(P_r\) denote the orthogonal projection on \({\overline{B}_r(x_0)}\). If \(( \varvec{A}^{\scriptscriptstyle (J)} )_{J=1}^\infty \) generates a stochastic low-rank approximation of \(C_{0}\), of p-order \(\gamma \), we define the adaptive EKI iteration associated to \(( \varvec{A}^{\scriptscriptstyle (J)} )_{J=1}^\infty \) as

    $$\begin{aligned} \hat{X}^\text {a}_k(\omega ) := P_r\left( \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) \right) , \qquad \text {for } k \in \mathbb {N}\text { and } \omega \in \Omega , \end{aligned}$$
    (3.5)

    where \( \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}\) is defined in Definition 2.2. If \( \varvec{A}^{\scriptscriptstyle (J)} = \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)})\) (see Definition 1.8), we refer to this method as adaptive Standard-EKI and denote its iterates with \(\hat{X}^\text {aeki}_k\). Similarly, if \( \varvec{A}^{\scriptscriptstyle (J)} = {\varvec{A}^{\scriptscriptstyle (J)}_\text {nys}} \) (see Theorem 2.17), we refer to the method as adaptive Nyström-EKI and denote its iterates with \(\hat{X}^\text {anys}_k\).

The exponential reduction of the regularization parameter, given by Eq. 3.2, is a typical choice for regularization methods of similar form, and can already be found in [3]. The choice of \((J_k)_{k=1}^\infty \) is motivated by Proposition 2.7: By ensuring that \(J_k^\gamma \) grows at least as fast as \(\alpha _k^{-1}\), we make sure that the approximation error between adaptive EKI and Tikhonov regularization does not explode as k increases.

In order to ensure convergence, we choose a stopping criterion for the adaptive EKI iteration. We consider the discrepancy principle, which has the advantage that it is easy to implement and requires little prior information about the forward operator L. In the case where the employed low-rank approximation is stochastic, the resulting stopping index is a random variable.

Definition 3.2

(Discrepancy principle) Let \(\delta \) be as in Eq. 3.1 and \(\tau > 1\). Then, adaptive EKI (Definition 3.1) is terminated after \(K_\delta \) iterations, where the integer random variable \(K_\delta : \Omega \rightarrow \mathbb {N}\cup \{\infty \}\) satisfies

$$\begin{aligned} \left\| \hat{y}- L \hat{X}^\text {a}_{K_\delta (\omega )}(\omega ) \right\| _{R} \le \tau \delta< \left\| \hat{y}- L \hat{X}^\text {a}_k(\omega ) \right\| _{R} \qquad \text {for all } k < K_\delta (\omega ), \end{aligned}$$
(3.6)

where we set \(K_\delta (\omega ) = \infty \) if such a number does not exist.

For the case of Tikhonov regularization, it is known that the discrepancy principle yields a convergent regularization method under standard assumptions. The main difficulty in the analysis of adaptive EKI is to show that this result also holds for the random, approximate iteration given by Definition 3.1.

Pseudo-code for the adaptive EKI method in conjunction with the discrepancy principle is given in Algorithm 2.

[Algorithm 2: adaptive EKI with the discrepancy principle]
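Algorithm 2 itself is not reproduced above. The sketch below outlines adaptive EKI with the discrepancy principle as specified in Definitions 3.1 and 3.2; the factor generator `low_rank_factor`, the default parameter values, and the use of matrix representations are assumptions made only for illustration.

```python
import numpy as np

def adaptive_eki(L, R_inv_sqrt, x0, y_hat, delta, low_rank_factor, gamma,
                 tau=1.5, b=0.5, alpha0=1.0, J0=5, r=10.0, k_max=50):
    """Sketch of adaptive EKI with the discrepancy principle (Definitions 3.1 and 3.2).

    low_rank_factor(J) must return a factor A^(J) with A A^* approximating C0,
    e.g. an ensemble anomaly or a Nystroem factor of p-order gamma.
    """
    x = x0
    for k in range(k_max):
        alpha = b ** k * alpha0                          # Eq. (3.2)
        J = int(np.ceil(b ** (-k / gamma) * J0))         # Eq. (3.3)
        A = low_rank_factor(J)

        # Direct EKI (Eq. 2.2) with regularization parameter alpha.
        B = R_inv_sqrt @ L @ A
        rhs = B.T @ (R_inv_sqrt @ (y_hat - L @ x0))
        x = x0 + A @ np.linalg.solve(B.T @ B + alpha * np.eye(J), rhs)

        # Projection onto the closed ball of radius r around x0 (Eq. 3.5).
        d = x - x0
        if np.linalg.norm(d) > r:
            x = x0 + (r / np.linalg.norm(d)) * d

        # Discrepancy principle (Eq. 3.6): stop once the R-weighted residual is small.
        if np.linalg.norm(R_inv_sqrt @ (y_hat - L @ x)) <= tau * delta:
            return x, k
    return x, k_max
```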

3.2 Convergence analysis

Next, we show that adaptive EKI as defined above is a convergent regularization method, where convergence is considered relative to the minimum-norm solution of Eq. 1.1, defined as follows.

Definition 3.3

We call \(x^\dagger \in \mathbb {X}\) an \((x_0, C_{0})\)-minimum-norm solution of \(L x = y\) if

$$\begin{aligned} x^\dagger \in {{\,\mathrm{argmin}\,}}_{x \in \mathbb {X}}\left\{ \, \left\| x - x_0\right\| _{C_{0}} :\, Lx = y\,\right\} . \end{aligned}$$

The existence and uniqueness of \(x^\dagger \) follow from [14, theorem 2.5] taking into account Proposition 1.2.

Before we continue, it is convenient to summarize the different inversion techniques and the corresponding notation.

 

  • \(\hat{X}_k^{\scriptscriptstyle (J)} \): k-th iterate of EKI with sample size J (Eq. 2.1)

  • \( \hat{X}^{d, \scriptscriptstyle (J)}_\alpha \): direct EKI with regularization parameter \(\alpha \) (Eq. 2.2)

  • \(\hat{x}^\text {a}_k\): k-th iterate of adaptive EKI with a deterministic low-rank approximation (Eq. 3.4)

  • \(\hat{X}^\text {a}_k\): k-th iterate of adaptive EKI with a stochastic low-rank approximation (Eq. 3.5)

  • \(\hat{x}_\alpha \): Tikhonov-regularized solution according to the noisy data \(\hat{y}\) (Eq. 2.5)

  • \(x_\alpha \): Tikhonov-regularized solution according to the exact data y (Eq. 3.15)

Our convergence proof is based on the assumption that \(x^\dagger \) satisfies a source condition, which is defined as follows.

Assumption 3.4

(Source condition) Let \(\mathbb {X}_{C_{0}}\) be defined as in Proposition 1.2. There exist a \((x_0, C_{0})\)-minimum-norm solution \(x^\dagger \in \mathbb {X}_{C_{0}}\) of \(Lx = y\), constants \(\mu \in (0,1/2]\) and \(\rho > 0\), and some \(v \in \mathbb {X}\) with \(\left\| v\right\| _\mathbb {X} \le \rho \) such that

$$\begin{aligned} x^\dagger - x_0= C_0^{1/2} (B^* B)^\mu v, \end{aligned}$$
(3.7)

where \(B = R^{-1/2} L C_0^{1/2}\).

Remark 3.5

Equation 3.7 can be interpreted as a smoothness assumption on the minimum-norm solution \(x^\dagger \). Source conditions are ubiquitous in the mathematical literature on inverse problems; typically, convergence rates for regularization methods cannot be proven without assuming some type of source condition. Beyond the condition in Eq. 3.7, logarithmic, variational, and spectral tail conditions can also be considered; see [19, 24, 42, 50] and the more recent references [1, 2].

For the subsequent convergence analysis, we focus first on the more challenging case where adaptive EKI is based on a stochastic low-rank approximation. In that case, the following additional assumptions are sufficient to obtain convergence rates.

Assumption 3.6

Let \(p, q \in [1, \infty )\), \(\epsilon \in (0, \tau -1)\), and let \(( \varvec{A}^{\scriptscriptstyle (J)} )_{J=1}^\infty \) generate a stochastic low-rank approximation of \(C_{0}\), of p-order \(\gamma \).

  1. (i)

    The projection radius r from Definition 3.1 satisfies

    $$\begin{aligned} r \ge 2 \left\| C_{0}^{1/2}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \left\| x_0 - x^\dagger \right\| _{C_{0}} . \end{aligned}$$
    (3.8)
  2. (ii)

    There holds

    $$\begin{aligned} \alpha _0 J_0^\gamma \ge \frac{c_{RL} \kappa _p}{\epsilon \delta ^{1+\frac{q}{p}}}, \end{aligned}$$
    (3.9)

where \(c_{RL}\) is as in Eq. 1.3 and \(\kappa _p\) is as in Proposition 2.7.

Remark 3.7

Note that Eq. 3.9 together with Eqs. 3.2 and 3.3 implies that a corresponding estimate holds for all subsequent iterates, i.e.

$$\begin{aligned} \alpha _k J_k^\gamma \ge b^k \alpha _0 \left( b^{-\frac{k}{\gamma }}\right) ^\gamma J_0^\gamma = \alpha _0 J_0^\gamma \ge \frac{c_{RL} \kappa _p}{\epsilon \delta ^{1+\frac{q}{p}}} \qquad \text {for all } k \in \mathbb {N}. \end{aligned}$$
(3.10)

Furthermore, the condition given by Eq. 3.8 simply means that the projection radius r has to be chosen large enough in relation to the initial error \( \left\| x_0 - x^\dagger \right\| _{C_{0}} \). We show in Proposition 3.10 that this condition ensures that the projection in Eq. 3.5 does not increase the approximation error between adaptive EKI and Tikhonov regularization.

Our strategy to obtain convergence rates for adaptive EKI is to use the error estimate between direct EKI and Tikhonov regularization, provided by Proposition 2.7, to transfer the well-established convergence results on Tikhonov regularization to adaptive EKI. The main complication is that the discrepancy principle introduces a coupling between the regularization parameter and the sampling error, which makes it challenging to estimate \(\left\| \hat{X}^\text {a}_{K_\delta } - \hat{x}_{\alpha _{K_\delta }}\right\| _\mathbb {X}\) directly. Instead, we employ a good-set strategy, similar to the one used in [4] for the analysis of the iteratively regularized Gauss-Newton method for random noise. The idea behind the good-set strategy is to define a suitable subset of \(\Omega \) on which we can perform a deterministic analysis, and then to show that the probability of the complement vanishes sufficiently fast. For our purpose, we define the good set \(E_\text {good}(\delta )\subset \Omega \) by

$$\begin{aligned}&E_\text {good}(\delta ):= \bigcap _{k=1}^{k_\delta } E_\text {good}^k, \end{aligned}$$
(3.11)
$$\begin{aligned} \text {where} \quad&E_\text {good}^k:= \left\{ \,\omega \in \Omega :\, \left\| \hat{X}^\text {a}_k(\omega ) - \hat{x}_{\alpha _k}\right\| _\mathbb {X} \le c_{RL}^{-1} \epsilon \delta \,\right\} \end{aligned}$$
(3.12)
$$\begin{aligned} \text {and} \quad&k_\delta := \min \left\{ \,k \in \mathbb {N}:\, \left\| \hat{y}- L \hat{x}_{\alpha _k}\right\| _{R} \le (\tau - \epsilon ) \delta \,\right\} . \end{aligned}$$
(3.13)

Then, the law of total expectation yields, for \(q \in [1, \infty )\),

$$\begin{aligned} \mathbb {E}\left[ \left\| \hat{X}^\text {a}_{K_\delta } - x^\dagger \right\| _\mathbb {X}^q \right]&= \mathbb {E}\left[ \left\| \hat{X}^\text {a}_{K_\delta } - x^\dagger \right\| _\mathbb {X}^q | E_\text {good}(\delta ) \right] \mathbb {P}(E_\text {good}(\delta ))\\&\quad + \mathbb {E}\left[ \left\| \hat{X}^\text {a}_{K_\delta } - x^\dagger \right\| _\mathbb {X}^q | E_\text {good}(\delta )^\complement \right] \mathbb {P}(E_\text {good}(\delta )^\complement ) \\&\le \mathbb {E}\left[ \left\| \hat{X}^\text {a}_{K_\delta } - x^\dagger \right\| _\mathbb {X}^q | E_\text {good}(\delta ) \right] \\&\quad + \mathbb {E}\left[ \left\| \hat{X}^\text {a}_{K_\delta } - x^\dagger \right\| _\mathbb {X}^q | E_\text {good}(\delta )^\complement \right] \mathbb {P}(E_\text {good}(\delta )^\complement ), \end{aligned}$$

where \(E_\text {good}(\delta )^\complement \) denotes the complement of \(E_\text {good}(\delta )\). Since \(\hat{X}^\text {a}_{K_\delta } \in {\overline{B}_r(x_0)}\) holds by Eq. 3.5, the triangle inequality gives \(\left\| \hat{X}^\text {a}_{K_\delta } - x^\dagger \right\| _\mathbb {X} \le r + \left\| x_0 - x^\dagger \right\| _\mathbb {X}\), and we obtain

$$\begin{aligned} \mathbb {E}\left[ \left\| \hat{X}^\text {a}_{K_\delta } - x^\dagger \right\| _\mathbb {X}^q \right] \le \mathbb {E}\left[ \left\| \hat{X}^\text {a}_{K_\delta } - x^\dagger \right\| _\mathbb {X}^q | E_\text {good}(\delta ) \right] + \left( r + \left\| x_0 - x^\dagger \right\| _\mathbb {X}\right) ^q \mathbb {P}(E_\text {good}(\delta )^\complement ). \end{aligned}$$
(3.14)

Hence, it suffices to estimate \(\mathbb {E}\left[ \left\| \hat{X}^\text {a}_{K_\delta } - x^\dagger \right\| _\mathbb {X}^q | E_\text {good}(\delta ) \right] \) and \(\mathbb {P}(E_\text {good}(\delta )^\complement )\) separately.

Estimates for the first term hinge on understanding the behavior of the Tikhonov-regularized solution \(\hat{x}_{\alpha _{k_\delta }}\). The following lemma summarizes existing results on Tikhonov regularization that we will make use of in our theoretical analysis of adaptive EKI. To this end, we consider as an auxiliary variable the Tikhonov-regularized solution of Eq. 1.1 according to the exact data y and regularization parameter \(\alpha \), defined as

$$\begin{aligned} x_\alpha := x_0+ C_{0}^{1/2} \left( C_{0}^{1/2} L^* R^{-1} L C_{0}^{1/2} + \alpha \text {I}_\mathbb {X}\right) ^{-1} C_{0}^{1/2} L^* R^{-1} (y - L x_0). \end{aligned}$$
(3.15)

(Compare Definition 2.3.)

Lemma 3.8

(Convergence and stability of Tikhonov regularization) Suppose that Assumptions 1.1 and 3.4 hold, and let \(x_0 \in \mathbb {X}\), \(\alpha > 0\) and \(\delta > 0\). Moreover, assume that

$$\begin{aligned} \left\| \hat{y}- y\right\| _{R} \le \delta . \end{aligned}$$
(3.16)

Then there holds

$$\begin{aligned} \left\| x_\alpha - x^\dagger \right\| _{C_{0}}&\le \rho ^\frac{1}{2\mu + 1} \left\| L x_\alpha - y\right\| _{R} ^\frac{2\mu }{2 \mu + 1}, \end{aligned}$$
(3.17)
$$\begin{aligned} \text {and} \quad \left\| L(x_\alpha - \hat{x}_{\alpha }) - (y - \hat{y})\right\| _{R}&\le \delta . \end{aligned}$$
(3.18)

Furthermore, there exist constants \(c_1\) and \(c_2\), independent of \(\alpha \), \(\delta \) and \(\rho \), such that

$$\begin{aligned} \left\| x_\alpha - \hat{x}_{\alpha }\right\| _{C_{0}}&\le c_1 \delta \alpha ^{-1/2}, \end{aligned}$$
(3.19)
$$\begin{aligned} \text {and} \quad \left\| y - L x_\alpha \right\| _{R}&\le c_2 \rho \alpha ^{\mu + 1/2} . \end{aligned}$$
(3.20)

Proof

Note that \(x^\dagger \) is a \((x_0, C_{0})\)-minimum-norm solution of Eq. 1.1 if and only if \(x^\dagger = x_0+ C_{0}^{1/2} w^\dagger \), where \(w^\dagger \) is a \((0,\text {I}_\mathbb {X})\)-minimum-norm solution of

$$\begin{aligned} R^{-1/2} (y - L x_0) = B w, \end{aligned}$$
(3.21)

where \(B := R^{-1/2} L C_{0}^{1/2}\). Similarly, if \(w_\alpha \) is the corresponding Tikhonov-regularized solution of Eq. 3.21, i.e.

$$\begin{aligned} w_\alpha = \left( B^*B + \alpha \text {I}_\mathbb {X}\right) ^{-1} B^*R^{-1/2}(y - L x_0), \end{aligned}$$

then \(x_\alpha = x_0+ C_{0}^{1/2} w_\alpha \). Thus, the results follow from the classical case where \(C_{0}= \text {I}_\mathbb {X}\) and \(R= \text {I}_\mathbb {Y}\): The inequalities Eqs. 3.17, 3.18 and 3.19 can be found in [14, (4.66)], [14, (4.68)] and [14, (4.70)], respectively. Equation 3.20 can be obtained from the source condition Equation 3.7 and the interpolation inequality [14, (4.64)], as in the proof of [14, theorem 4.17]. \(\square \)
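For the reader's convenience, we note that the identity \(x_\alpha = x_0+ C_{0}^{1/2} w_\alpha \) used above can be verified directly by inserting the definition of B:

$$\begin{aligned} x_0+ C_{0}^{1/2} w_\alpha&= x_0+ C_{0}^{1/2} \left( B^*B + \alpha \text {I}_\mathbb {X}\right) ^{-1} B^* R^{-1/2} (y - L x_0) \\&= x_0+ C_{0}^{1/2} \left( C_{0}^{1/2} L^* R^{-1} L C_{0}^{1/2} + \alpha \text {I}_\mathbb {X}\right) ^{-1} C_{0}^{1/2} L^* R^{-1} (y - L x_0) = x_\alpha . \end{aligned}$$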

Moreover, for the deterministic stopping time \(k_\delta \) the following auxiliary result holds.

Lemma 3.9

Given Assumptions 1.1 and 3.4, the following statements hold.

  1. (i)

    There exists a constant \(c_3\), independent of \(\alpha \), \(\delta \), and \(\rho \), such that

    $$\begin{aligned} \alpha _{k_\delta } \ge c_3 \left( \frac{\delta }{\rho } \right) ^\frac{2}{2 \mu + 1} \end{aligned}$$
    (3.22)

    for all sufficiently small \(\delta \).

  2. (ii)

    There holds

    $$\begin{aligned} k_\delta = O \left( \log (\delta ^{-1}) \right) . \end{aligned}$$
    (3.23)
  3. (iii)

    If Eq. 3.8 holds, then there exists a sufficiently small \(\bar{\delta } > 0\) such that

    $$\begin{aligned} \hat{x}_{\alpha _k} \in {\overline{B}_r(x_0)}\qquad \text {for all } k \le k_\delta \text { and all } \delta \le \bar{\delta }. \end{aligned}$$
    (3.24)

Proof

  1. (i)

    Using the same transformations as in the proof of Lemma 3.8, the statement follows from the proof of theorem 4.17 in [14]. Note that this proof uses a discrepancy principle where \(\alpha \) can vary continuously. However, the same argument applies also to the discretized sequence satisfying Eq. 3.2, see [14, remark 4.18].

  2. (ii)

    Inserting Eq. 3.2 in Eq. 3.22 yields

    $$\begin{aligned} b^{k_\delta } \alpha _0 \ge c_3 \left( \frac{\delta }{\rho } \right) ^\frac{2}{2 \mu + 1}, \end{aligned}$$

    or equivalently

    $$\begin{aligned} \left( \frac{1}{b}\right) ^{k_\delta } \le \frac{\alpha _0}{c_3} \left( \frac{\rho }{\delta } \right) ^\frac{2}{2 \mu + 1}. \end{aligned}$$

    Taking the logarithm and using the fact that \(b \in (0,1)\), we arrive at

    $$\begin{aligned} k_\delta \le \log (b^{-1})^{-1} \cdot \left[ \log \left( \frac{\alpha _0}{c_3} \right) + \frac{2}{2 \mu + 1} \log \left( \frac{\rho }{\delta } \right) \right] . \end{aligned}$$

    This proves Eq. 3.23.

  3. (iii)

    As in the proof of Lemma 3.8, let \(B = R^{-1/2} L C_{0}^{1/2}\), \(x^\dagger = x_0 + C_{0}^{1/2} w^\dagger \) and \(\hat{x}_\alpha = x_0+ C_{0}^{1/2} \hat{w} _\alpha \), such that

    $$\begin{aligned} \hat{w} _\alpha = \left( B^* B + \alpha \text {I}_\mathbb {X}\right) ^{-1} B^* R^{-1/2}(\hat{y}- L x_0). \end{aligned}$$

    Then

    $$\begin{aligned} \hat{w} _\alpha= & {} \left( B^* B + \alpha \text {I}_\mathbb {X}\right) ^{-1} B^* R^{-1/2}(\hat{y}- L x_0)\nonumber \\= & {} \left( B^* B + \alpha \text {I}_\mathbb {X}\right) ^{-1} B^* R^{-1/2}(\hat{y}- y) + \left( B^* B + \alpha \text {I}_\mathbb {X}\right) ^{-1} B^* R^{-1/2}(y - L x_0). \qquad \end{aligned}$$
    (3.25)

    If we insert \(y = L x^\dagger = L (x_0 + C_{0}^{1/2} w^\dagger )\) in the second term on the right-hand side of Eq. 3.25, we obtain after cancellation and using the definition of B,

    $$\begin{aligned} \hat{w} _\alpha&= \left( B^* B + \alpha \text {I}_\mathbb {X}\right) ^{-1} B^* R^{-1/2}(\hat{y}- y) + \left( B^* B + \alpha \text {I}_\mathbb {X}\right) ^{-1} B^* B w^\dagger . \end{aligned}$$

    Using Eq. 3.1 and the spectral estimates (see e.g. [30, lemma 4.5])

    $$\begin{aligned} \left\| \left( B^* B + \alpha \text {I}_\mathbb {X}\right) ^{-1} B^*\right\| _{\mathcal {L}(\mathbb {Y};\mathbb {X})}&\le \frac{1}{2}\alpha ^{-1/2}, \\ \text {and} \quad \left\| \left( B^* B + \alpha \text {I}_\mathbb {X}\right) ^{-1} B^* B\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})}&\le 1, \end{aligned}$$

    in Eq. 3.25, we obtain

    $$\begin{aligned} \left\| \hat{w} _\alpha \right\| _\mathbb {X} \le \frac{1}{2} \alpha ^{-1/2} \delta + \left\| w^\dagger \right\| _\mathbb {X}. \end{aligned}$$
    (3.26)

    Finally, it follows from Eq. 3.22 that

    $$\begin{aligned} \alpha ^{-1/2}_k \delta \le \alpha ^{-1/2}_{k_\delta } \delta \le c_3^{-1/2} \left( \frac{\rho }{\delta } \right) ^\frac{1}{2 \mu + 1} \delta = c_3^{-1/2} \rho ^\frac{1}{2 \mu + 1} \delta ^\frac{2 \mu }{2 \mu + 1} \qquad \text {for all } k \le k_\delta , \end{aligned}$$
    (3.27)

    which vanishes as \(\delta \rightarrow 0\). Therefore, if we set

    $$\begin{aligned} \bar{\delta }= \left( 2 c_3^{1/2} \rho ^{-\frac{1}{2 \mu + 1}} \left\| w^\dagger \right\| _\mathbb {X}\right) ^\frac{2 \mu + 1}{2 \mu }, \end{aligned}$$

    then it follows from Eqs. 3.26 and 3.27 that

    $$\begin{aligned} \left\| \hat{w} _{\alpha _k}\right\| _\mathbb {X} \le 2 \left\| w^\dagger \right\| _\mathbb {X} \qquad \text {for all } k \le k_\delta \end{aligned}$$
    (3.28)

    holds for all \(\delta \le \bar{\delta }\). By definition of \( \hat{w} _{\alpha _k}\), Eq. 3.28 implies

    $$\begin{aligned} \left\| \hat{x}_{\alpha _k} - x_0\right\| _{C_{0}} = \left\| \hat{w} _{\alpha _k}\right\| _\mathbb {X} \le 2 \left\| w^\dagger \right\| _\mathbb {X} = 2 \left\| x^\dagger - x_0\right\| _{C_{0}} , \end{aligned}$$

    and hence, by Eqs. 1.4 and 3.8,

    $$\begin{aligned} \left\| \hat{x}_{\alpha _k} - x_0\right\| _\mathbb {X} \le \left\| C_{0}^{1/2}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \left\| \hat{x}_{\alpha _k} - x_0\right\| _{C_{0}} \le 2 \left\| C_{0}^{1/2}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \left\| x^\dagger - x_0\right\| _{C_{0}} \le r, \end{aligned}$$

    for all \(k \le k_\delta \) and \(\delta \le \bar{\delta }\).

\(\square \)

With this lemma, we are able to show that the projection in Eq. 3.5 cannot increase the approximation error between adaptive EKI and the corresponding Tikhonov iteration, at least for \(k \le k_\delta \). More precisely, we have the following proposition.

Proposition 3.10

Let Assumptions 1.1, 3.4 and Eq. 3.8 hold. Let \(\delta \le \bar{\delta }\), where \(\bar{\delta }\) is as in Lemma 3.9. Then

$$\begin{aligned} \left\| \hat{X}^\text {a}_k(\omega ) - \hat{x}_{\alpha _k}\right\| _\mathbb {X} \le \left\| \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) - \hat{x}_{\alpha _k}\right\| _\mathbb {X} \qquad \text {for all } \omega \in \Omega \text { and all } k \le k_\delta . \end{aligned}$$
(3.29)

In particular,

$$\begin{aligned} \mathbb {E}\left[ \left\| \hat{X}^\text {a}_k - \hat{x}_{\alpha _k}\right\| _\mathbb {X}^{p} \right] ^{1 / {p} } \le \kappa _p \alpha _k^{-1} J_k^{- \gamma } \qquad \text {for all } k \le k_\delta , \end{aligned}$$
(3.30)

where \(\kappa _p\) is as in Proposition 2.7.

Proof

Let \(k \le k_\delta \) and \(\omega \in \Omega \). By Eq. 3.5, we have

$$\begin{aligned} \left\| \hat{X}^\text {a}_k(\omega ) - \hat{x}_{\alpha _k}\right\| _\mathbb {X} = \left\| P_r \left( \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) \right) - \hat{x}_{\alpha _k}\right\| _\mathbb {X}. \end{aligned}$$
(3.31)

By Lemma 3.9, we have \(\hat{x}_{\alpha _k} \in {\overline{B}_r(x_0)}\). Consequently, by the variational characterization of the projection onto the closed convex set \({\overline{B}_r(x_0)}\), we have

$$\begin{aligned} \left< \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) - P_r \left( \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) \right) , \hat{x}_{\alpha _k} - P_r \left( \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) \right) \right> \le 0. \end{aligned}$$
(3.32)

This yields

$$\begin{aligned} \left\| \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) - \hat{x}_{\alpha _k}\right\| _\mathbb {X}^2&= \left\| \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) - P_r \left( \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) \right) + P_r \left( \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) \right) - \hat{x}_{\alpha _k}\right\| _\mathbb {X}^2 \\&= \left\| \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) - P_r \left( \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) \right) \right\| _\mathbb {X}^2 + \left\| P_r \left( \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) \right) - \hat{x}_{\alpha _k}\right\| _\mathbb {X}^2 \\&\qquad + 2 \left< \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) - P_r \left( \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) \right) ,P_r \left( \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) \right) - \hat{x}_{\alpha _k}\right> \\&\ge \left\| P_r \left( \hat{X}^{d, \scriptscriptstyle (J_k)} _{\alpha _k}(\omega ) \right) - \hat{x}_{\alpha _k}\right\| _\mathbb {X}^2. \end{aligned}$$

Together with Eq. 3.31, this yields Eq. 3.29. Eq. 3.30 then follows from Eq. 3.29 and Proposition 2.7. \(\square \)

The next proposition provides the desired asymptotic decay rate of the probability \(\mathbb {P}(E_\text {good}(\delta )^\complement )\).

Proposition 3.11

Given Assumptions 1.1, 3.4 and 3.6, there holds

$$\begin{aligned} \mathbb {P}(E_\text {good}(\delta )^\complement ) = O(\delta ^\frac{2 \mu q}{2 \mu + 1}). \end{aligned}$$
(3.33)

Proof

By Eq. 3.11 and the subadditivity of \(\mathbb {P}\), we have

$$\begin{aligned} \mathbb {P}(E_\text {good}(\delta )^\complement ) = \mathbb {P}\left( \bigcup _{k=1}^{k_\delta }(E_\text {good}^k)^\complement \right) \le \sum _{k=1}^{k_\delta } \mathbb {P}((E_\text {good}^k)^\complement ). \end{aligned}$$
(3.34)

By Eq. 3.12 and Markov’s inequality (see Lemma A.2), we have

$$\begin{aligned} \mathbb {P}((E_\text {good}^k)^\complement )&= \mathbb {P}\left( \left\{ \,\omega \in \Omega :\, \left\| \hat{X}^\text {a}_k - \hat{x}_{\alpha _k}\right\| _\mathbb {X} > c_{RL}^{-1} \epsilon \delta \,\right\} \right) \nonumber \\&\le \frac{c_{RL}^p \mathbb {E}\left[ \left\| \hat{X}^\text {a}_k - \hat{x}_{\alpha _k}\right\| _\mathbb {X}^p \right] }{\epsilon ^p \delta ^p}. \end{aligned}$$
(3.35)

Without loss of generality, let \(\delta \le \bar{\delta }\), where \(\bar{\delta }\) is as in Lemma 3.9. Using Proposition 3.10 and then Eq. 3.10 in Eq. 3.35 yields

$$\begin{aligned} \mathbb {P}((E_\text {good}^k)^\complement ) \le \frac{c_{RL}^p \kappa _p^p \alpha _k^{-p} J_k^{-p \gamma }}{\epsilon ^p \delta ^p} \le \delta ^q. \end{aligned}$$

Inserting this inequality in Eq. 3.34, we arrive at

$$\begin{aligned} \mathbb {P}(E_\text {good}(\delta )^\complement ) \le \sum _{k=1}^{k_\delta } \delta ^q = k_\delta \delta ^q. \end{aligned}$$
(3.36)

From Lemma 3.9 we know that \(k_\delta = O(\log (\delta ^{-1}))\). Since we have \(1 - \frac{2\mu }{2 \mu + 1} > 0\), we obtain

$$\begin{aligned} k_\delta \delta ^{(1 - \frac{2\mu }{2 \mu + 1})q} \rightarrow 0 \qquad \text {as } \delta \rightarrow 0. \end{aligned}$$
(3.37)

Hence, we have from Eq. 3.36 that

$$\begin{aligned} \mathbb {P}(E_\text {good}(\delta )^\complement ) \le k_\delta \delta ^{(1 - \frac{2\mu }{2 \mu + 1})q} \delta ^\frac{2 \mu q}{2 \mu + 1} = O(\delta ^\frac{2 \mu q}{2 \mu + 1}). \end{aligned}$$

\(\square \)

Finally, we show convergence of the random element \(\hat{X}^\text {a}_{K_\delta }\) on the “good set” \(E_\text {good}(\delta )\). The construction of \(E_\text {good}(\delta )\) makes it possible to apply the proof of [14, theorem 4.17], with straightforward modifications, to each individual realization \(\hat{X}^\text {a}_{K_\delta (\omega )}(\omega )\) with \(\omega \in E_\text {good}(\delta )\).

Proposition 3.12

Given Assumptions 1.1, 3.4 and 3.6, there exists \(C > 0\), independent of \(\omega \) and \(\delta \), such that

$$\begin{aligned} \left\| \hat{X}^\text {a}_{K_\delta (\omega )}(\omega ) - x^\dagger \right\| _\mathbb {X} \le C \delta ^\frac{2 \mu }{2 \mu + 1} \qquad \text {for all } \omega \in E_\text {good}(\delta ). \end{aligned}$$
(3.38)

Proof

Let \(\omega \in E_\text {good}(\delta )\).

  • First, we show that \(K_\delta (\omega ) \le k_\delta \): To see this, note that

    $$\begin{aligned} \left\| \hat{y}- L \hat{X}^\text {a}_{k_\delta }(\omega )\right\| _{R}&\le \left\| \hat{y}- L \hat{x}_{\alpha _{k_\delta }}\right\| _{R} + \left\| L \left( \hat{x}_{\alpha _{k_\delta }} - \hat{X}^\text {a}_{k_\delta }(\omega ) \right) \right\| _{R} \\&\le \left\| \hat{y}- L \hat{x}_{\alpha _{k_\delta }}\right\| _{R} + c_{RL} \left\| \hat{x}_{\alpha _{k_\delta }} - \hat{X}^\text {a}_{k_\delta }(\omega )\right\| _{C_{0}} . \end{aligned}$$

    By definition of \(k_\delta \) and \(E_\text {good}(\delta )\), this implies

    $$\begin{aligned} \left\| \hat{y}- L \hat{X}^\text {a}_{k_\delta }(\omega )\right\| _{R} \le (\tau - \epsilon ) \delta + \epsilon \delta = \tau \delta . \end{aligned}$$

    Hence, by definition of \(K_\delta \), there must hold \(K_\delta (\omega ) \le k_\delta \).

  • Since \(K_\delta (\omega ) \le k_\delta \), we have by definition of \(E_\text {good}(\delta )\),

    $$\begin{aligned} \left\| \hat{X}^\text {a}_k(\omega ) - \hat{x}_{\alpha _k}\right\| _\mathbb {X} \le c_{RL}^{-1} \epsilon \delta \qquad \text {for all } k \le K_\delta (\omega ) , \end{aligned}$$
    (3.39)

    and consequently also

    $$\begin{aligned} \left\| L(\hat{X}^\text {a}_k(\omega ) - \hat{x}_{\alpha _k})\right\| _{R} \le \epsilon \delta \qquad \text {for all } k \le K_\delta (\omega ) . \end{aligned}$$
    (3.40)
  • Next, we show that there exists a constant \(c_4\), independent of \(\rho \), \(\delta \) and \(\omega \), such that

    $$\begin{aligned} \alpha _{ K_\delta (\omega ) }^{-1/2} \le c_4 \left( \frac{\rho }{\delta }\right) ^\frac{1}{2 \mu + 1}. \end{aligned}$$
    (3.41)

    From Eq. 3.20, we obtain

    $$\begin{aligned} \left\| y - L x_{\alpha _{ K_\delta (\omega ) - 1}}\right\| _{R}&\le c_2 \rho \alpha _{ K_\delta (\omega ) - 1}^{\mu + 1/2} \nonumber \\&= c_2 \rho (b^{-1} \alpha _{K_\delta (\omega )} )^{\mu + 1/2}. \end{aligned}$$
    (3.42)

    On the other hand,

    $$\begin{aligned} \left\| y - L x_{\alpha _{ K_\delta (\omega ) - 1}}\right\| _{R} \ge \left\| \hat{y}- L \hat{x}_{\alpha _{ K_\delta (\omega ) - 1}}\right\| _{R} - \left\| (\hat{y}- y) - L(\hat{x}_{\alpha _{ K_\delta (\omega ) - 1}} - x_{\alpha _{ K_\delta (\omega ) - 1}})\right\| _{R} . \end{aligned}$$

    Inserting Eq. 3.18 yields

    $$\begin{aligned} \left\| y - L x_{\alpha _{ K_\delta (\omega ) - 1}}\right\| _{R}&\ge \left\| \hat{y}- L \hat{x}_{\alpha _{ K_\delta (\omega ) - 1}}\right\| _{R} - \delta \\&\ge \left\| \hat{y}- L \hat{X}^\text {a}_{ K_\delta (\omega ) - 1}(\omega )\right\| _{R} - \left\| L (\hat{X}^\text {a}_{ K_\delta (\omega ) - 1}(\omega ) - \hat{x}_{\alpha _{ K_\delta (\omega ) - 1}})\right\| _{R} - \delta . \end{aligned}$$

    By the definition of \(K_\delta \) and Eq. 3.40, this reduces to

    $$\begin{aligned} \left\| y - L x_{\alpha _{ K_\delta (\omega ) - 1}}\right\| _{R} \ge \tau \delta - \epsilon \delta - \delta = (\tau - \epsilon - 1) \delta . \end{aligned}$$
    (3.43)

    Combining Eqs. 3.42 and 3.43 yields

    $$\begin{aligned} (\tau - \epsilon - 1) \delta \le c_2 \rho (b^{-1} \alpha _{K_\delta (\omega )} )^{\mu + 1/2}. \end{aligned}$$

    Since \(\tau - \epsilon - 1 > 0\), we can rearrange this inequality to

    $$\begin{aligned} \alpha _{K_\delta (\omega )} ^{-1/2} \le b^{-1/2} \left( \frac{c_2}{\tau - \epsilon - 1} \right) ^\frac{1}{2 \mu + 1}\left( \frac{\rho }{\delta }\right) ^\frac{1}{2 \mu + 1}, \end{aligned}$$

    which shows Eq. 3.41 for a suitable choice of \(c_4\).

  • Next, we show that there exists a constant \(c_5\), independent of \(\omega \) and \(\delta \), such that

    $$\begin{aligned} \left\| \hat{x}_{ \alpha _{K_\delta (\omega )} } - x^\dagger \right\| _{C_{0}} \le c_5 \delta ^\frac{2 \mu }{2 \mu + 1}. \end{aligned}$$
    (3.44)

    We start with the triangle inequality

    $$\begin{aligned} \left\| \hat{x}_{ \alpha _{K_\delta (\omega )} } - x^\dagger \right\| _{C_{0}} \le \left\| \hat{x}_{ \alpha _{K_\delta (\omega )} } - x_{ \alpha _{K_\delta (\omega )} }\right\| _{C_{0}} + \left\| x_{ \alpha _{K_\delta (\omega )} } - x^\dagger \right\| _{C_{0}} . \end{aligned}$$
    (3.45)

    By Eq. 3.19, the first term on the right-hand side satisfies

    $$\begin{aligned} \left\| \hat{x}_{ \alpha _{K_\delta (\omega )} } - x_{ \alpha _{K_\delta (\omega )} }\right\| _{C_{0}} \le c_1 \delta \alpha _{K_\delta (\omega )} ^{-1/2}. \end{aligned}$$

    Inserting Eq. 3.41 yields

    $$\begin{aligned} \left\| \hat{x}_{ \alpha _{K_\delta (\omega )} } - x_{ \alpha _{K_\delta (\omega )} }\right\| _{C_{0}} \le c_1 c_4 \rho ^\frac{1}{2 \mu + 1} \delta ^\frac{2 \mu }{2 \mu + 1}. \end{aligned}$$
    (3.46)

    For the second term on the right-hand side of Eq. 3.45, we have by Eq. 3.17:

    $$\begin{aligned} \left\| x_{ \alpha _{K_\delta (\omega )} } - x^\dagger \right\| _{C_{0}} \le \rho ^\frac{1}{2 \mu + 1} \left\| L x_{ \alpha _{K_\delta (\omega )} } - y\right\| _{R} ^\frac{2 \mu }{2 \mu + 1}. \end{aligned}$$
    (3.47)

    We then estimate, using Eq. 3.18,

    $$\begin{aligned} \left\| L x_{ \alpha _{K_\delta (\omega )} } - y\right\| _{R}&\le \left\| \hat{y}- L \hat{x}_{ \alpha _{K_\delta (\omega )} }\right\| _{R} + \left\| (y - \hat{y}) - L \left( x_{ \alpha _{K_\delta (\omega )} } - \hat{x}_{ \alpha _{K_\delta (\omega )} } \right) \right\| _{R} \\&\le \left\| \hat{y}- L \hat{x}_{ \alpha _{K_\delta (\omega )} }\right\| _{R} + \delta . \end{aligned}$$

    From this, another use of the triangle inequality yields

    $$\begin{aligned} \left\| L x_{ \alpha _{K_\delta (\omega )} } - y\right\| _{R} \le \left\| \hat{y}- L \hat{X}^\text {a}_{ K_\delta (\omega ) }(\omega )\right\| _{R} + \left\| L \left( \hat{X}^\text {a}_{ K_\delta (\omega ) } - \hat{x}_{ \alpha _{K_\delta (\omega )} } \right) \right\| _{R} + \delta . \end{aligned}$$

    Finally, using the definition of \(K_\delta \) and Eq. 3.40 yields

    $$\begin{aligned} \left\| L x_{ \alpha _{K_\delta (\omega )} } - y\right\| _{R} \le (\tau + \epsilon + 1) \delta . \end{aligned}$$
    (3.48)

    Inserting Eq. 3.48 in Eq. 3.47 yields

    $$\begin{aligned} \left\| x_{ \alpha _{K_\delta (\omega )} } - x^\dagger \right\| _{C_{0}} \le (\tau + \epsilon + 1) \rho ^\frac{1}{2 \mu + 1} \delta ^\frac{2 \mu }{2 \mu + 1}. \end{aligned}$$
    (3.49)

    Finally, inserting both Eqs. 3.46 and 3.49 in Eq. 3.45 yields Eq. 3.44 for a suitable choice of \(c_5\).

  • From the triangle inequality and Eq. 1.4, we have

    $$\begin{aligned} \left\| \hat{X}^\text {a}_{ K_\delta (\omega ) }(\omega ) - x^\dagger \right\| _\mathbb {X}&\le \left\| \hat{X}^\text {a}_{ K_\delta (\omega ) } - \hat{x}_{ \alpha _{K_\delta (\omega )} }\right\| _\mathbb {X} + \left\| \hat{x}_{ \alpha _{K_\delta (\omega )} } - x^\dagger \right\| _\mathbb {X} \\&\le \left\| \hat{X}^\text {a}_{ K_\delta (\omega ) } - \hat{x}_{ \alpha _{K_\delta (\omega )} }\right\| _\mathbb {X} + \left\| C_{0}^{1/2}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \left\| \hat{x}_{ \alpha _{K_\delta (\omega )} } - x^\dagger \right\| _{C_{0}} . \end{aligned}$$

    We can use Eq. 3.39 to estimate the first and Eq. 3.44 to estimate the second term of the right-hand side, which yields

    $$\begin{aligned} \left\| \hat{X}^\text {a}_{ K_\delta (\omega ) }(\omega ) - x^\dagger \right\| _\mathbb {X} \le c_{RL}^{-1} \epsilon \delta + \left\| C_{0}^{1/2}\right\| _{\mathcal {L}(\mathbb {X};\mathbb {X})} \cdot c_5 \delta ^\frac{2 \mu }{2 \mu + 1}. \end{aligned}$$

    Hence, we can choose \(C > 0\), independently of \(\delta \) and \(\omega \), such that Eq. 3.38 holds.

\(\square \)

With this, we arrive at convergence rates for adaptive EKI under a stochastic low-rank approximation.

Theorem 3.13

Given Assumptions 1.1, 3.4 and 3.6, there holds

$$\begin{aligned} \mathbb {E}\left[ \left\| \hat{X}^\text {a}_{K_\delta } - x^\dagger \right\| _\mathbb {X}^{q} \right] ^{1 / {q} } = O( \delta ^\frac{2 \mu }{2 \mu + 1}). \end{aligned}$$
(3.50)

Proof

By Proposition 3.12, there holds

$$\begin{aligned} \left\| \hat{X}^\text {a}_{ K_\delta (\omega ) }(\omega ) - x^\dagger \right\| _\mathbb {X}^q \le C^q \delta ^\frac{2 \mu q}{2 \mu + 1} \qquad \text {for all } \omega \in E_\text {good}(\delta ). \end{aligned}$$

This implies in particular

$$\begin{aligned} \mathbb {E}\left[ \left\| \hat{X}^\text {a}_{K_\delta } - x^\dagger \right\| _\mathbb {X}^q | E_\text {good}(\delta ) \right] \le C^q \delta ^\frac{2 \mu q}{2 \mu + 1}. \end{aligned}$$

Using this inequality and Proposition 3.11 in Eq. 3.14 yields

$$\begin{aligned} \mathbb {E}\left[ \left\| \hat{X}^\text {a}_{K_\delta } - x^\dagger \right\| _\mathbb {X}^q \right] = O(\delta ^\frac{2 \mu q}{2 \mu + 1}), \end{aligned}$$

from which Eq. 3.50 follows. \(\square \)

For completeness, we also formulate the convergence rate results under a deterministic low-rank approximation. In this case, the proof of Proposition 3.12 applies without change, and we obtain the following result.

Theorem 3.14

Let Assumptions 1.1 and 3.4 hold, and let \(( \varvec{A}^{\scriptscriptstyle (J)} )_{J=1}^\infty \) generate a deterministic low-rank approximation of \(C_{0}\) of order \(\gamma \). Assume there is \(\epsilon \in (0, \tau - 1)\) such that

$$\begin{aligned} \alpha _0 J_0^\gamma \ge \frac{c_{RL} \kappa }{\epsilon \delta }, \end{aligned}$$
(3.51)

where \(\kappa \) is as in Proposition 2.7. Then

$$\begin{aligned} \left\| \hat{x}^\text {a}_{K_\delta } - x^\dagger \right\| _\mathbb {X} = O(\delta ^\frac{2 \mu }{2 \mu + 1}). \end{aligned}$$

Remark 3.15

Comparing the condition Eq. 3.51 for the deterministic case to the condition Eq. 3.10 for the stochastic case, we see that the major difference is that the stochastic case requires an additional multiplicative factor \(\delta ^{-\frac{q}{p}}\). This additional factor is used in the proof of Proposition 3.11 to ensure that \(\mathbb {P}(E_\text {good}(\delta )^\complement ) = O(\delta ^\frac{2 \mu q}{2 \mu + 1})\). Formally, we recover the deterministic case from the stochastic case in the limit \(p \rightarrow \infty \) (where \(p=\infty \) corresponds to almost sure convergence).

Remark 3.16

The proven convergence rate is optimal for \(\mu \in (0,\frac{1}{2})\), in the sense that if only Assumption 3.4 is known, there exists no regularization method that satisfies a better general bound with respect to \(\delta \) and \(\mu \) [14, proposition 3.15].

Continuing our discussion from Sect. 2.2, we see from Theorems 3.13 and 3.14 that the three special cases of adaptive EKI defined in Definition 3.1, namely adaptive Standard-, SVD- and Nyström-EKI, are all of optimal order in the corresponding (stochastic or deterministic) sense. However, the faster convergence of the SVD- and Nyström-based low-rank approximations means that the sample size \(J_k\) does not have to grow as fast as for Standard-EKI, which makes those two methods computationally cheaper.

Corollary 3.17

Let Assumptions 1.1 and 3.4 hold.

  1. (i)

    Let \(p \in [1,\infty )\), \(C_{0}\) be in the trace class, and suppose that Assumption 3.6 is satisfied for \(\gamma =1/2\). Then there holds

    $$\begin{aligned} \mathbb {E}\left[ \left\| \hat{X}^\text {aeki}_{K_\delta } - x^\dagger \right\| _\mathbb {X}^{p} \right] ^{1 / {p} } = O( \delta ^\frac{2 \mu }{2 \mu + 1}). \end{aligned}$$
  2. (ii)

    Assume that \(C_{0}\) satisfies Assumption 2.9 with constant \(\eta >0\), and suppose that Eq. 3.51 is satisfied for \(\gamma = \eta \). Then there holds

    $$\begin{aligned} \left\| \hat{x}^\text {asvd}_{K_\delta } - x^\dagger \right\| _\mathbb {X} = O( \delta ^\frac{2 \mu }{2 \mu + 1}). \end{aligned}$$
  3. (iii)

    Let \(p \in [1,\infty )\), assume that \(C_{0}\) satisfies Assumption 2.9 with constant \(\eta >1/2\), and suppose that Assumption 3.6 is satisfied for \(\gamma =\eta \). Then there holds

    $$\begin{aligned} \mathbb {E}\left[ \left\| \hat{X}^\text {anys}_{K_\delta } - x^\dagger \right\| _\mathbb {X} \right] = O(\delta ^\frac{2 \mu }{2 \mu + 1}). \end{aligned}$$

Proof

Recall that \((\hat{X}^\text {aeki}_k)_{k=1}^\infty \) is a special case of adaptive EKI where the low-rank approximation is generated by \(( \mathcal {A} (\varvec{U}^{\scriptscriptstyle (J)}))_{J=1}^\infty \) (see Theorem 2.14). Thus, if \(C_{0}\) is in the trace class, Theorem 3.13 applies with \(\gamma =1/2\) and yields the desired convergence rate. The corresponding result for \((\hat{x}^\text {asvd}_k)_{k=1}^\infty \) follows analogously from Theorems 3.14 and 2.12, while the result for \((\hat{X}^\text {anys}_k)_{k=1}^\infty \) follows from Theorem 3.13 together with the corresponding low-rank approximation result for the Nyström method from Sect. 2.2. \(\square \)

As an example, suppose we know that \(C_{0}\) is in the trace class, i.e. \(\eta \ge 1\). Then Corollary 3.17 implies that Standard-EKI is of optimal order if \(J_k \ge b^{-2 k} J_0\), whereas Nyström-EKI is of optimal order if \(J_k \ge b^{- k} J_0\) (see Eq. 3.3). In other words, Nyström-EKI achieves comparable accuracy with only the square root of the sample size required by Standard-EKI, as illustrated by the snippet below. Furthermore, if the eigenvalues of \(C_{0}\) decay faster than \(O(n^{-1})\), Nyström-EKI can take advantage of this, whereas Standard-EKI is limited by the lower bound Eq. 2.10.
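To make this saving concrete, the following snippet compares the two growth requirements for a common regularization schedule \(\alpha _k = b^k \alpha _0\). The values \(b = 0.8\) and \(J_0 = 50\) are purely illustrative and not prescribed by the theory.

    import math

    b, J0 = 0.8, 50   # illustrative values only
    print(" k   Standard-EKI (b^{-2k} J_0)   Nystroem-EKI (b^{-k} J_0)")
    for k in (5, 10, 15, 20):
        J_standard = math.ceil(b ** (-2 * k) * J0)  # growth required for Standard-EKI
        J_nystroem = math.ceil(b ** (-k) * J0)      # growth required for Nystroem-EKI
        print(f"{k:2d}   {J_standard:26d}   {J_nystroem:25d}")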

3.3 General remarks

3.3.1 Relation to other versions of EKI

Note that our focus differs from the strictly Bayesian setting in which ensemble Kalman inversion is often introduced. In the Bayesian setting, it is assumed that the regularization parameter represents the available prior information, and the regularized solution is identified with the MAP estimate. In regularization theory, we are interested in showing convergence rates in the zero-noise limit, which requires the use of parameter choice rules that select the regularization parameter \(\alpha \) in terms of the noise level and properties of the forward operator L. Above, we have focused on the discrepancy principle. In contrast to a-priori choice rules, the discrepancy principle has the advantage that it requires only little prior information on the operator L. However, its use is contingent on performing multiple steps of EKI with decreasing values of the regularization parameter. This strategy has many similarities to the empirical Bayesian approach, where we assume a Gaussian prior but treat the regularization parameter as unknown and try to estimate it from the data (see e.g. [59]). Our analysis shows that, by coupling the sample size to the regularization parameter, it becomes possible to obtain the optimal convergence rates in the zero-noise limit. This is also the major difference between the presented scheme and other versions of EKI.

3.3.2 Relation to multiscale methods

The ideas behind adaptive EKI are similar to sequential multiscale methods, where one iteratively moves from a low-dimensional coarse-scale subspace to finer scales. A related work along these lines is [39], which also applies to ensemble methods but considers the setting where an approximate solution is computed on a different subspace in each step. Under certain conditions on the multiscale decomposition, this approach can be shown to be equivalent to Tikhonov regularization in the full space. In contrast, the idea behind adaptive EKI is only to approximate Tikhonov regularization, in a way that achieves the same convergence order in the zero-noise limit.

3.3.3 Localization

In some practical applications (e.g. numerical weather prediction [26]) it is only feasible to work with ensemble sizes that are orders of magnitude smaller than the parameter dimension. In these situations, localization [18] is often used to increase the effective ensemble size through incorporation of domain knowledge on the correlation structure of the parameter or observation of interest. Since adaptive EKI can be formulated both in square-root and covariance form (see Remark 1.7), it can be combined with most of the existing localization methods, such as covariance localization [25] or local analysis [45]. Note that localization for stochastic EKI has been studied in [58].

4 Numerical experiments

We performed numerical experiments to evaluate the performance of adaptive EKI.

4.1 Test problem

We have chosen inversion of the Radon transform L (see for instance [34]) as our test example. The analytical results show that, in the large-ensemble limit, EKI approximates the Tikhonov-regularized solution; we aim to verify this numerically and also compare the different variants of EKI in terms of efficiency. As a test object, we use the classic Shepp-Logan phantom [54] of size \(d \times d\), \(d=100\) (see Fig. 1). This corresponds to a parameter dimension of \(n := \dim \mathbb {X}= d^2 = 10{,}000\) and a measurement dimension of \(m := \dim \mathbb {Y}= 14{,}200\).

Fig. 1 The Shepp–Logan phantom
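For orientation, the test problem can be set up along the following lines. The scan geometry is not spelled out in the text, so the number of projection angles below is our assumption; 100 equally spaced angles (with circle=False) reproduce the measurement dimension \(m = 142 \cdot 100 = 14{,}200\) stated above.

    import numpy as np
    from skimage.data import shepp_logan_phantom
    from skimage.transform import radon, resize

    d = 100
    x_true = resize(shepp_logan_phantom(), (d, d))        # d x d phantom, n = d**2 = 10,000
    # Assumption: 100 equally spaced projection angles. With circle=False every
    # projection has ceil(sqrt(2) * d) = 142 detector bins, hence m = 142 * 100 = 14,200.
    theta = np.linspace(0.0, 180.0, num=100, endpoint=False)
    sinogram = radon(x_true, theta=theta, circle=False)   # shape (142, 100)
    y = sinogram.ravel()                                  # exact data y = L x_*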

4.2 Data simulation

We generated noise \(\xi _s \sim \mathcal {N}(0, \text {I}_m)\) from a standard normal distribution and then rescaled the noise by setting

$$\begin{aligned} \xi := \frac{\left\| y\right\| }{10 \left\| \xi _s\right\| } \xi _s, \end{aligned}$$

thereby ensuring a signal-to-noise ratio of exactly 10. We then used \({\hat{y}} = y + \xi \) as the noisy measurement for the tested methods. We also divided the measurement and the observation operator by \(\left\| \xi \right\| \) so that \(\delta = \left\| \hat{y}- y\right\| = 1\).
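A minimal sketch of this construction, continuing the variables from the sketch in Sect. 4.1 (the random seed is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    xi_s = rng.standard_normal(y.shape)                          # xi_s ~ N(0, I_m)
    xi = np.linalg.norm(y) / (10 * np.linalg.norm(xi_s)) * xi_s  # ||xi|| = ||y|| / 10
    y_hat = y + xi                                               # noisy measurement
    scale = np.linalg.norm(xi)
    y, y_hat = y / scale, y_hat / scale                          # now delta = ||y_hat - y|| = 1
    # the forward operator L has to be divided by the same factor
    assert np.isclose(np.linalg.norm(y_hat - y), 1.0)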

4.3 Considered methods

In our experiment, we set \(R= \mathbb {I}_m\), and chose \(C_{0}\in \mathbb {R}^{n \times n}\) equal to a discretized covariance operator of an Ornstein-Uhlenbeck process,

$$\begin{aligned} (C_{0})_{ij}&:= e^{-\left\| q_i - q_j\right\| /h^2}, \\ \text {where}\quad q_i&:= \begin{pmatrix} \frac{i ~\text {mod}~ d}{d-1}\\ \frac{\lfloor i / d \rfloor }{d-1} \end{pmatrix} \in [0,1] \times [0,1], \end{aligned}$$

with correlation length \(h > 0\) (we used the value \(h = 0.01\)). Such operators are often used as prior covariances for Bayesian MAP estimation in tomography, for example in [56]. They correspond to the assumption that the correlation between individual pixels decreases exponentially with their distance, where \(q_i\) denotes the normalized position of the i-th pixel if the image is scaled to \([0,1]\times [0,1]\) (see the sketch at the end of this subsection). We compared the three different instances of adaptive EKI discussed in Sect. 3:

  • Standard-EKI with \(\alpha _k = b^k\), \(b = \sqrt{0.8}\), \(J_k = \lceil b^{-2(k-1)} J_1 \rceil \), and \(J_1 = 50\).

  • Nyström-EKI with \(\alpha _k = b^k\), \(b = 0.8\), \(J_k = \lceil b^{-(k-1)} J_1 \rceil \), and \(J_1 = 50\).

  • SVD-EKI with \(\alpha _k = b^k\), \(b = 0.8\), \(J_k = \lceil b^{-(k-1)} J_1 \rceil \), and \(J_1 = 50\).

The different values of b are used in order to ensure that the sequence of sample sizes \((J_k)_{k=1}^\infty \) is the same for all three methods. Moreover, all methods used the discrepancy principle (see Definition 3.2) with \(\tau = 1.2\). In any case, the iterations were aborted once \(J_k\) exceeded n, since at this point the computational complexity of EKI is higher than that of Tikhonov regularization.
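The following sketch summarizes this setup: the prior covariance defined above and a schematic version of the adaptive loop with the discrepancy principle. The iterate is written in the equivalent Tikhonov form of Eq. 3.15 with \(C_{0}^{1/2}\) replaced by a low-rank factor; the callback lowrank_factor and all default values are placeholders for the concrete schemes of Sect. 2.2, and the code illustrates the structure only rather than reproducing Definition 3.1 verbatim.

    import numpy as np

    def prior_covariance(d=100, h=0.01):
        """Discretized Ornstein-Uhlenbeck-type prior covariance from above.
        Caution: the dense n x n matrix for d = 100 occupies roughly 0.8 GB;
        use a smaller d for quick tests."""
        ii = np.arange(d * d)
        # normalized pixel positions q_i in [0, 1] x [0, 1]
        q = np.stack([(ii % d) / (d - 1), (ii // d) / (d - 1)], axis=1)
        dist = np.linalg.norm(q[:, None, :] - q[None, :, :], axis=-1)
        return np.exp(-dist / h ** 2)

    def adaptive_eki(L, y_hat, x0, lowrank_factor, delta=1.0, tau=1.2,
                     alpha0=1.0, b=0.8, J1=50, k_max=100):
        """Schematic adaptive loop (R = I, L given as a dense m x n array).
        lowrank_factor(J) is a hypothetical callback returning an n x J matrix A
        with A @ A.T approximating C0 (by sampling, SVD or the Nystroem method)."""
        n = x0.size
        x_k = x0
        for k in range(1, k_max + 1):
            alpha_k = alpha0 * b ** k
            J_k = int(np.ceil(b ** (-(k - 1)) * J1))  # schedule of the Nystroem-/SVD-EKI variants above
            if J_k > n:                               # beyond this, plain Tikhonov is cheaper
                break
            A = lowrank_factor(J_k)                   # n x J_k
            LA = L @ A                                # m x J_k
            w = np.linalg.solve(LA.T @ LA + alpha_k * np.eye(J_k), LA.T @ (y_hat - L @ x0))
            x_k = x0 + A @ w                          # Tikhonov form of the k-th iterate
            if np.linalg.norm(y_hat - L @ x_k) <= tau * delta:  # discrepancy principle
                break
        return x_k, k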

4.4 Implementation

The algorithms were implemented in Python and rely on efficient NumPy [23] and SciPy [60] routines. We used the existing implementation of the Radon transform in the scikit-image library [61] and took advantage of the Ray framework [38] to parallelize the operator evaluations. The computations were performed on a Dell XPS-15-7590 laptop with 12 CPUs at 2.60 GHz and 15.3 GiB RAM.

4.5 Convergence of adaptive EKI

For each iteration, we evaluated the relative reconstruction error

$$\begin{aligned} e_\text {rel}(x) := \frac{\left\| x - x_*\right\| }{\left\| x_*\right\| }. \end{aligned}$$

The results are visualized in Fig. 2. Note that every iteration is computationally more expensive than the previous one, since the sample size \(J_k\) increases steadily. While the Nyström-EKI and SVD-EKI methods satisfied the discrepancy principle after 18 iterations (with sample size \(J_{18} = 2221\)), the Standard-EKI iteration was not able to satisfy the discrepancy principle for any sample size less than n. Apart from that, one clearly sees that Nyström-EKI and SVD-EKI both significantly outperform Standard-EKI. Consistent with Theorem 2.12, SVD-EKI yields the most accurate reconstruction for a given sample size.

Fig. 2 The adaptive Standard-EKI, Nyström-EKI and SVD-EKI iterations. The x-axis denotes the iteration number; the y-axis denotes the relative reconstruction error \(e_\text {rel}\). The Standard-EKI iteration was not able to satisfy the discrepancy principle for \(J_k < n\)

4.6 Comparison of Standard-EKI with Nyström-EKI

In Fig. 3, we visually compare the reconstruction obtained with Standard-EKI to the one obtained with Nyström-EKI. Both reconstructions use the same value of \(\alpha \) and sample size \(J=2000\). One can see that the reconstruction of the standard method is much noisier than that of the Nyström method. This noise does not stem from the noisy measurement; it is introduced by the sampling process.
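For comparison, a generic Nyström factor can be sketched as follows; this is a standard textbook variant based on uniform column sampling and may differ in details (such as the column-selection rule) from the scheme analyzed in Sect. 2.2. A factor of this form can be passed as lowrank_factor to the loop sketched in Sect. 4.3.

    import numpy as np

    def nystroem_factor(C0, J, rng=None):
        """Return A with A @ A.T = C0[:, S] @ pinv(C0[S, S]) @ C0[S, :] of rank at most J,
        where S is a uniformly sampled index set of size J."""
        rng = np.random.default_rng() if rng is None else rng
        n = C0.shape[0]
        idx = rng.choice(n, size=J, replace=False)       # sampled pivot columns S
        C_nJ = C0[:, idx]                                # n x J
        C_JJ = C_nJ[idx, :]                              # J x J core matrix
        evals, evecs = np.linalg.eigh(C_JJ)
        keep = evals > 1e-12 * max(evals.max(), 1e-300)  # drop numerically zero modes
        core = evecs[:, keep] / np.sqrt(evals[keep])     # V_+ Lambda_+^{-1/2}
        return C_nJ @ core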

Fig. 3 Reconstruction of the Shepp-Logan phantom from noisy data using EKI with sample size \(J=2000\), corresponding to 1/5 of the parameter dimension

4.7 Convergence to Tikhonov regularization for large sample sizes

In Fig. 4, we have plotted the reconstruction with Nyström-EKI for increasing values of J. For \(J = 500\), the reconstruction is hardly useful. However, for \(J=2000\) the reconstruction is already almost comparable to the Tikhonov reconstruction, although slightly blurred. For higher values of J, the improvement is only marginal. This shows that the presence of noise allows considerable a-priori dimensionality reduction, that is, dimensionality reduction that uses no knowledge of L or \(\hat{y}\).

Fig. 4 Reconstruction with Nyström-EKI for different sample sizes J, using noisy data with a signal-to-noise ratio of 10

We also repeated the experiment for the fixed regularization parameter \(\alpha =0.03\) and different values of J in order to examine the convergence estimate from Sect. 2.3 numerically. In Fig. 5, we plotted the approximation error with respect to Tikhonov regularization, normalized by \(\left\| x_*\right\| \), i.e.

$$\begin{aligned} e_\text {app}(x; \alpha ) := \frac{\left\| x - \hat{x}_{\alpha }\right\| }{\left\| x_*\right\| }. \end{aligned}$$

In accordance with Proposition 2.7, the approximation error of Standard-EKI decreases like \(J^{-1/2}\). However, it is still significant even when the sample size is close to n. With Nyström-EKI or SVD-EKI, the approximation error becomes negligible even for relatively small sample sizes.

Fig. 5 The Standard-EKI, Nyström-EKI and SVD-EKI iterations for fixed regularization parameter \(\alpha \) and varying sample size. The x-axis denotes the sample size J; the y-axis denotes the relative approximation error \(e_\text {app}\)

4.8 Divergence for small values of \(\alpha \)

Keeping the sample size fixed at \(J=2000\), we then repeated the experiment for different values of \(\alpha \) (see Fig. 6). One sees that the approximation error of all three methods explodes as \(\alpha \rightarrow 0\), which demonstrates the necessity of adapting the sample size. Again, Nyström-EKI and SVD-EKI are superior to Standard-EKI.

Fig. 6 The Standard-EKI, Nyström-EKI and SVD-EKI iterations for fixed sample size J and varying regularization parameter. The x-axis denotes the regularization parameter \(\alpha \); the y-axis denotes the relative approximation error \(e_\text {app}\) for EKI with sample size \(J=2000\). As \(\alpha \) approaches 0, the approximation error explodes

5 Conclusions

We have shown that ensemble Kalman inversion is a convergent regularization method if the sample size is adapted to the regularization parameter. The interpretation of EKI as a low-rank approximation of Tikhonov regularization shows that it provides a trade-off between accuracy and computational cost by shrinking the search space in which we try to reconstruct the unknown parameter x. This approach is well suited for problems where the adjoint is not available and the noise is significant, since then the optimal regularization parameter \(\alpha \) will typically be larger and a good approximation to the Tikhonov-regularized solution can be achieved with relatively small sample sizes (see Fig. 6).

It is important to note that the dimensionality reduction in EKI is completely a-priori: it uses no knowledge about the forward operator L or the measurement \(\hat{y}\). This has the advantage that it also works in the case where the adjoint of L is not available. On the other hand, if one has access to the adjoint of L, one can instead compute a low-rank approximation of the whole operator \(C_{0}^{1/2} L^* R^{-1} L C_{0}^{1/2}\) [16]. This can yield superior results, as it also exploits the spectral decay of the forward operator L [55].

While EKI was originally developed for nonlinear inverse problems, our insights from the linear case—in particular the need for adapting the sample size to the noise level—can serve as an Ansatz for an analysis of EKI as a regularization method for nonlinear inverse problems.

More generally, the basic ideas of ensemble methods are simple and constitute a very general way to obtain linear dimensionality reduction and algorithms for black-box inverse problems. Therefore, another natural direction of research is to study the resulting stochastic approximations of classical iterative regularization methods, such as the iteratively regularized Gauss-Newton iteration [3], and to compare their performance to EKI for nonlinear inverse problems.