Multilevel ensemble Kalman filtering for spatio-temporal processes

We design and analyse the performance of a multilevel ensemble Kalman filter method (MLEnKF) for filtering settings where the underlying state-space model is an infinite-dimensional spatio-temporal process. We consider underlying models that need to be simulated by numerical methods, with discretization in both space and time. The multilevel Monte Carlo (MLMC) sampling strategy, achieving variance reduction through pairwise coupling of ensemble particles on neighboring resolutions, is used in the sample-moment step of MLEnKF to produce an efficient hierarchical filtering method for spatio-temporal models. Under sufficient regularity, MLEnKF is proven to be more efficient for weak approximations than EnKF, asymptotically in the large-ensemble and fine-numerical-resolution limit. Numerical examples support our theoretical findings.


Introduction
Filtering refers to the sequential estimation of the state u and/or parameters of a system through sequential incorporation of online data y. The most complete estimation of the state u_n at time n is given by its probability distribution conditional on the observations up to the given time, P(du_n | y_1, . . . , y_n) [27,2]. For linear Gaussian systems, the analytical solution may be given in closed form via update formulae for the mean and covariance known as the Kalman filter [31]. More generally, however, closed form solutions typically are not known. One must therefore resort to either algorithms which approximate the probabilistic solution by leveraging ideas from control theory in the data assimilation community [32,27], or Monte Carlo methods to approximate the filtering distribution itself [2,15,11]. The ensemble Kalman filter (EnKF) [9,17,35] combines elements of both approaches. In the linear Gaussian case it converges to the Kalman filter solution in the large-ensemble limit [41], and even in the nonlinear case, under suitable assumptions it converges [37,36] to a limit which is optimal among those which incorporate the data linearly and use a single update iteration [36,40,44]. In the case of spatially extended models approximated on a numerical grid, the state space itself may become very high-dimensional and even the linear solves may become intractable, due to the cost of computing the covariance matrix. Therefore, one may be inclined to use EnKF even for linear Gaussian problems in which the solution is computationally intractable despite being given in closed form by the Kalman filter.
The Multilevel Monte Carlo method (MLMC) is a hierarchical and variance-reduction based approximation method initially developed for weak approximations of random fields and stochastic differential equations [21,18,19]. Recently, a number of works have emerged which extend the MLMC framework to the setting of Monte Carlo algorithms designed for Bayesian inference. Examples include Markov chain Monte Carlo [14,22], sequential Monte Carlo samplers [6,26,42], particle filters [25,20], and EnKF [23]. The filtering papers thus far [25,20,23] consider only finite-dimensional SDE forward models. In this work, we develop a new multilevel ensemble Kalman filtering method (MLEnKF) for the setting of infinite-dimensional state-space models with evolution in continuous time. The method consists of a hierarchy of pairwise coupled EnKF-like ensembles on different finite-resolution levels of the underlying infinite-dimensional evolution model that all depend on the same Kalman gain in the update step. The method presented in this work may be viewed as an extension of the finite-dimensional state-space MLEnKF method [23], which only considered a hierarchy of time-discretization resolution levels.
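The telescoping identity underlying MLMC can be sketched in a few lines. The toy coupled sampler below is purely illustrative: the shared-noise model, the assumed 2^(-level) bias decay, and the chosen sample sizes are our own stand-ins for a coupled pair of numerical solvers, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pair(level, n_samples):
    """Pairwise-coupled toy samples of a QoI on levels (level-1, level).

    Hypothetical stand-in for a coupled solver pair: both resolutions
    are driven by the same noise xi, and the discretization bias decays
    like 2**-level (an assumed rate, for illustration only).
    """
    xi = rng.standard_normal(n_samples)            # shared driving noise
    fine = xi + 2.0 ** (-level)
    coarse = np.zeros(n_samples) if level == 0 else xi + 2.0 ** (-(level - 1))
    return coarse, fine

def mlmc_estimate(L, M):
    """Telescoping estimator: E[phi_L] ~ sum_l mean(phi_l - phi_{l-1})."""
    est = 0.0
    for level in range(L + 1):
        coarse, fine = sample_pair(level, M[level])
        est += np.mean(fine - coarse)  # coupling => small variance per level
    return est

# Many samples on coarse levels, few on fine: the coupled differences
# shrink with level, so the fine, expensive levels need few samples.
print(mlmc_estimate(L=5, M=[4000, 2000, 1000, 500, 250, 125]))
```

Because the fine and coarse samples at each level share the driving noise, the level differences have small variance, which is what permits the strongly decreasing sample sizes across levels.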
Under sufficient regularity, the large-ensemble limit of EnKF is equal in distribution to the so-called mean-field EnKF (MFEnKF), cf. [37,36,34]. In nonlinear settings, however, MFEnKF is not equal in distribution to the Bayes filter, which is the exact filter distribution. More precisely, the error of EnKF approximating the Bayes filter may be decomposed into a statistical error, due to the finite ensemble size, and a Gaussian bias that is introduced by the Kalman-filter-like update step in EnKF. While the update-step bias error in EnKF is difficult both to quantify and deal with, the statistical error can, in theory, be reduced to arbitrary magnitude. However, the high computational cost of simulations in high-dimensional state space often imposes small ensemble size as a practical constraint. By making use of hierarchical variance-reduction techniques, the MLEnKF method developed in this work is capable of obtaining a much smaller statistical error than EnKF at the same fixed cost.
In addition to designing an MLEnKF method for spatio-temporal processes, we provide an asymptotic performance analysis of the method that is applicable under sufficient regularity of the filtering problem and L^p-strong convergence of the numerical method approximating the underlying model dynamics. Sections 5 and 6 are devoted to a detailed analysis and practical implementation of MLEnKF applied to linear and semilinear stochastic reaction-diffusion equations. In particular, we describe how the pairwise coupling of EnKF-like hierarchies should be implemented for one specific numerical solver (the exponential-Euler method), and provide numerical evidence for the efficiency gains of MLEnKF over EnKF.
Since particle filters are known to often perform better than EnKF, we also include a few remarks on how we believe such methods would compare to MLEnKF in filtering settings with spatial processes. Due to the poor scaling of particle ensemble size in high dimensions, which can even be exponential [7,38], particle filters are typically not used for spatial processes, or even modestly high-dimensional processes. There has been some work in the past few years which overcomes this issue either for particular examples [5] or by allowing for some bias [4,48,45,49]. But particle filters cannot yet be considered practically applicable for general spatial processes. If there is a well-defined limit of the model as the state-space dimension d grows such that the effective dimension of the target density with respect to the proposal remains finite or even small, then useful particle filters can be developed [33,39]. As noted in [10], the key criterion which needs to be satisfied is that the proposal and the target are not mutually singular in the limit. MLMC has been applied recently to particle filters, in the context where the approximation arises due to time discretization of a finite-dimensional SDE [25,20]. It is an interesting open problem to design multilevel particle filters for spatial processes: Both the range of applicability and the asymptotic performance of such a method versus MLEnKF when applied to spatial processes are topics that remain to be studied.
The rest of the paper is organized as follows. Section 2 introduces the filtering problem and notation. The design of the MLEnKF method is presented in Section 3. Section 4 studies the weak approximation of MFEnKF by MLEnKF, and shows that in this setting, MLEnKF inherits almost the same favorable asymptotic "cost-to-accuracy" performance as standard MLMC applied to weak approximations of stochastic spatio-temporal processes. Section 5 presents a detailed analysis and description of the implementation of MLEnKF for a family of stochastic reaction-diffusion models. Section 6 provides numerical studies of filtering problems with linear and semilinear stochastic reaction-diffusion models that corroborate our theoretical findings. Conclusions and future directions are presented in Section 7, and auxiliary theoretical results and technical proofs are provided in Appendices A, B and C.
2. Set-up and single-level algorithm

2.1. General set-up. Let (Ω, F, P) be a complete probability space, where P is a probability measure on the measurable space (Ω, F). Let V be a separable Hilbert space with inner product ⟨·, ·⟩_V and norm ‖·‖_V = √⟨·, ·⟩_V, and let 𝒱 denote a subspace of V which is closed in the topology induced by the norm ‖·‖_𝒱 = √⟨·, ·⟩_𝒱, which is assumed to be a stronger norm than ‖·‖_V. For an arbitrary separable Hilbert space (K, ‖·‖_K), we denote the associated L^p-Bochner space by L^p(Ω, K), with norm

‖u‖_{L^p(Ω,K)} = (E ‖u‖_K^p)^{1/p},    (3)

or the shorthand ‖u‖_p whenever confusion is not possible. For an arbitrary pair of Hilbert spaces K_1 and K_2, the space of bounded linear mappings from the former space into the latter is denoted by L(K_1, K_2). In finite dimensions, (R^m, ⟨·, ·⟩) represents the m-dimensional Euclidean vector space with norm |·| := √⟨·, ·⟩, and for matrices A ∈ L(R^{m_1}, R^{m_2}), |A| := ‖A‖_{L(R^{m_1}, R^{m_2})}.
Here, H ∈ L(V, R m ), the sequence {η n } consists of independent and identically N (0, Γ)−distributed random variables with Γ ∈ R m×m positive definite. In the sequel, the explicit dependence on ω will be suppressed where confusion is not possible. A general filtering objective is to track the signal u n given a fixed sequence of observations Y n := (y 1 , y 2 , . . . , y n ), i.e., to track the distribution of u n |Y n for n = 1, . . .. In this work, however, we restrict ourselves to considering the more specific objective of approximating E[ϕ(u n )|Y n ] for a given quantity of interest (QoI) ϕ : V → R. The index n will be referred to as time, whether the actual time between observations is 1 or not (in the examples in Section5 and beyond it will be called T ), but this will not cause confusion since time is relative.
2.1.2. The dynamics. We consider problems in which Ψ is the finite-time evolution of an SPDE, e.g. (35), and we will assume that Ψ cannot be evaluated exactly, but that there exists a sequence {Ψ_ℓ}_{ℓ=0}^∞ of approximations to the solution operator Ψ := Ψ_∞ satisfying the following uniform-in-ℓ stability properties.

Assumption 1. For every p ≥ 2, it holds that Ψ_ℓ : L^p(Ω, V) × Ω → L^p(Ω, 𝒱), and for all u, v ∈ L^p(Ω, V), the solution operators {Ψ_ℓ}_{ℓ=0}^∞ satisfy the following conditions: there exists a constant 0 < c_Ψ < ∞, depending on p, such that

(i) ‖Ψ_ℓ(u) − Ψ_ℓ(v)‖_{L^p(Ω,V)} ≤ c_Ψ ‖u − v‖_{L^p(Ω,V)},
(ii) ‖Ψ_ℓ(u)‖_{L^p(Ω,𝒱)} ≤ c_Ψ (1 + ‖u‖_{L^p(Ω,V)}).

For notational simplicity, we restrict ourselves to settings in which the map Ψ(·) does not depend on n, but the results in this work do of course extend easily to non-autonomous settings when the assumptions on {Ψ_n}_{n=1}^N are uniform with respect to n.

Remark 1. The two approximation spaces 𝒱 ⊂ V are introduced in order to obtain convergence rates for numerical simulation methods Ψ_ℓ that are discretized in physical or state space. See Assumption 2(i)-(ii) and inequality (41) for an example of how this may be obtained in practice.

2.1.3.
The Bayes filter. The pair of discrete-time stochastic processes (u_n, y_n) constitutes a hidden Markov model, and the exact (Bayes-filter) distribution of u_n|Y_n may in theory be determined iteratively through the system of prediction-update equations

P(du_n | Y_n) = (1/Z(Y_n)) L(u_n; y_n) P(du_n | Y_{n−1}),
P(du_n | Y_{n−1}) = ∫_{u_{n−1} ∈ V} P(du_n | u_{n−1}) P(du_{n−1} | Y_{n−1}),

where L denotes the likelihood and Z(Y_n) the normalization constant. When the state space is infinite-dimensional and the dynamics cannot be evaluated exactly, however, this is an extremely challenging problem. Consequently, we will here restrict ourselves to constructing weak approximation methods of the mean-field EnKF, cf. Section 2.4.

2.2.
Some details on Hilbert spaces, Hilbert-Schmidt operators, and Cameron-Martin spaces. For two arbitrary separable Hilbert spaces K_1 and K_2, the tensor product K_1 ⊗ K_2 is also a Hilbert space. For rank-1 tensors, its inner product is defined by

⟨u_1 ⊗ u_2, v_1 ⊗ v_2⟩_{K_1⊗K_2} := ⟨u_1, v_1⟩_{K_1} ⟨u_2, v_2⟩_{K_2},

which extends by linearity to any tensor of finite rank. The Hilbert space K_1 ⊗ K_2 is the completion of the set of finite-rank tensors with respect to the induced norm. Let {e_k} and {ê_k} be orthonormal bases for K_1 and K_2, respectively, and observe that finite sums of rank-1 tensors of the form X := Σ_{i,j} α_{ij} e_i ⊗ ê_j ∈ K_1 ⊗ K_2 can be identified with bounded linear mappings from K_2^* into K_1. For two bounded linear operators A, B : K_2^* → K_1 we recall the definition of the Hilbert-Schmidt inner product and norm,

⟨A, B⟩_{HS} := Σ_k ⟨A ê_k^*, B ê_k^*⟩_{K_1},   |A|_{HS} := √⟨A, A⟩_{HS},    (4)

where {ê_k^*} is the orthonormal basis of K_2^* satisfying ê_k^*(ê_j) = δ_{jk} for all j, k in the considered index set. A bounded linear operator A : K_2^* → K_1 is called a Hilbert-Schmidt operator if |A|_{HS} < ∞, and HS(K_2^*, K_1) is the space of all such operators. In view of (4) and by completion, the space K_1 ⊗ K_2 is isometrically isomorphic to HS(K_2^*, K_1) (and also to HS(K_2, K_1) by the Riesz representation theorem). For an element A ∈ K_1 ⊗ K_2 we identify the norms |A|_{K_1⊗K_2} = |A|_{HS}, and such elements will interchangeably be considered either as members of K_1 ⊗ K_2 or of HS(K_2^*, K_1). When viewed as A ∈ HS(K_2^*, K_1), the mapping A : K_2^* → K_1 is defined through the components A_{ij} := ⟨e_i, A ê_j^*⟩_{K_1}, and when viewed as A ∈ K_1 ⊗ K_2, we use the tensor-basis representation A = Σ_{i,j} A_{ij} e_i ⊗ ê_j. The covariance operator for a pair of random variables Z, X ∈ L^2(Ω, V) is denoted by

Cov[Z, X] := E[(Z − E[Z]) ⊗ (X − E[X])],

and whenever Z = X, we employ the shorthand Cov[Z] := Cov[Z, Z]. For completeness and later reference, let us prove that said covariance belongs to V ⊗ V.

2.3. Ensemble Kalman filtering. EnKF is an ensemble-based extension of Kalman filtering to nonlinear settings. Let {v̂_{n,i}}_{i=1}^M denote the ensemble-based approximation of the updated distribution u_n|Y_n (at n = 0 we employ the convention Y_0 = ∅, so that u_0 = u_0|Y_0). Given an updated ensemble {v̂_{n,i}}_{i=1}^M, the ensemble-based approximation of the prediction distribution u_{n+1}|Y_n is obtained through simulating each particle one observation time ahead:

v_{n+1,i} = Ψ(v̂_{n,i}), i = 1, 2, . . . , M.    (5)
We will refer to {v n+1,i } M i=1 as the prediction ensemble at time n + 1, and we also note that in many settings, the exact dynamics Ψ in (5) have to be approximated by a numerical solver.
Next, given {v_{n+1,i}}_{i=1}^M and a new observation y_{n+1}, the ensemble-based approximation of the updated distribution u_{n+1}|Y_{n+1} is obtained through updating each particle path:

v̂_{n+1,i} = v_{n+1,i} + K^{MC}_{n+1}(ỹ_{n+1,i} − H v_{n+1,i}),   ỹ_{n+1,i} := y_{n+1} + η_{n+1,i},    (6)

where {η_{n+1,i}}_{i=1}^M is an independent and identically N(0, Γ)-distributed sequence, the Kalman gain is

K^{MC}_{n+1} := C^{MC}_{n+1} H^* (H C^{MC}_{n+1} H^* + Γ)^{−1},

the adjoint observation operator H^* ∈ L(R^m, V^*) is defined by (H^* a)(w) = ⟨a, H w⟩ for all a ∈ R^m and w ∈ V, and the prediction covariance is the sample covariance

C^{MC}_{n+1} := Cov_M[v_{n+1}].    (7)

Due to the update formula (6), all ensemble particles are correlated to one another after the first update. Even in the linear Gaussian case, the ensemble will not remain Gaussian after the first update. Nonetheless, it has been shown that in the large-ensemble limit, EnKF converges in L^p(Ω) to the correct (Bayes-filter) Gaussian in the linear and finite-dimensional case [41,37], with the rate O(M^{−1/2}) for Lipschitz-functional QoI with polynomial growth at infinity. Furthermore, in the nonlinear cases admitted by Assumption 1, EnKF converges in the same sense and with the same rate to a mean-field limiting distribution described below.
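The predict-update cycle just described can be sketched in a finite-dimensional toy setting. The function name, the state and observation dimensions, and the toy numbers below are our own illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

def enkf_update(V, H, Gamma, y):
    """Perturbed-observation EnKF update, a minimal finite-dim sketch.

    V: (d, M) prediction ensemble; H: (m, d) observation operator;
    Gamma: (m, m) observation-noise covariance; y: (m,) observation.
    """
    M = V.shape[1]
    C = np.atleast_2d(np.cov(V))         # sample prediction covariance
    S = H @ C @ H.T + Gamma              # innovation covariance
    K = C @ H.T @ np.linalg.inv(S)       # Kalman gain K = C H^T S^-1
    eta = rng.multivariate_normal(np.zeros(len(y)), Gamma, size=M).T
    Y_tilde = y[:, None] + eta           # perturbed observations y + eta_i
    return V + K @ (Y_tilde - H @ V)     # updated (analysis) ensemble

# Toy example: 2-dim state, first component observed with small noise.
V = rng.standard_normal((2, 100)) + np.array([[5.0], [0.0]])
H = np.array([[1.0, 0.0]])
Gamma = np.array([[0.1]])
Vhat = enkf_update(V, H, Gamma, np.array([4.0]))
print(Vhat.mean(axis=1))  # first component pulled from ~5 toward y = 4
```

Each particle receives its own perturbed copy of the observation, which is what keeps the updated ensemble spread consistent with the posterior covariance, as discussed in Remark 2.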
Remark 2. The perturbed observations ỹ_{n,i} were originally introduced in [9] to correct the variance-deflation-type error that appears in their absence in implementations following the original formulation of EnKF [16]. This has become known as the perturbed-observation implementation.

2.4.
Mean-field Ensemble Kalman Filtering. In order to describe and study convergence properties of EnKF in the large-ensemble limit, we now introduce the mean-field EnKF (MFEnKF) [36]: Let v̂̄_0 ∼ P_{u_0} and, for n = 0, 1, . . .,

v̄_{n+1} = Ψ(v̂̄_n),   C̄_{n+1} := Cov[v̄_{n+1}],    (9)
v̂̄_{n+1} = v̄_{n+1} + K̄_{n+1}(y_{n+1} + η_{n+1} − H v̄_{n+1}),   K̄_{n+1} := C̄_{n+1} H^* (H C̄_{n+1} H^* + Γ)^{−1}.    (10)

Here η_n are i.i.d. draws from N(0, Γ). In the finite-dimensional state-space setting, it was shown in [37] and [36] that for nonlinear state-space models and nonlinear models with additive Gaussian noise, respectively, EnKF converges to MFEnKF with the L^p(Ω) convergence rate O(M^{−1/2}), as long as the models satisfy a Lipschitz criterion, similar to (but stronger than) Assumption 1. And in [23], we showed that MLEnKF converges toward MFEnKF with a higher rate than EnKF does in said finite-dimensional setting. The work [34] extended convergence results to infinite-dimensional state space for square-root filters. In this work, the aim is to prove convergence of MLEnKF for infinite-dimensional state space, with the same favorable asymptotic cost-to-accuracy performance as in [23].
The following L^p-boundedness properties ensure the existence of the MFEnKF process and its mean-field Kalman gain, and they will be needed when studying the properties of MLEnKF.

Proposition 2. Assume the initial data of the hidden Markov model (1) and (2) satisfies u_0 ∈ ∩_{p≥2} L^p(Ω, 𝒱). Then the MFEnKF process (9)-(10) satisfies v̄_n, v̂̄_n ∈ ∩_{p≥2} L^p(Ω, V) and ‖K̄_n‖_{L(R^m,V)} < ∞ for all n ∈ N.
Proof. Since v̂̄_0 = u_0, the property clearly holds for n = 0. Given v̂̄_n ∈ L^p(Ω, V), Assumption 1 guarantees v̄_{n+1} ∈ L^p(Ω, 𝒱). By Proposition 1, the same then holds for the updated state, and the result follows by recalling that 𝒱 ⊂ V and by the triangle inequality:

‖v̂̄_{n+1}‖_{L^p(Ω,V)} ≤ ‖v̄_{n+1}‖_{L^p(Ω,V)} + ‖K̄_{n+1}(y_{n+1} + η_{n+1} − H v̄_{n+1})‖_{L^p(Ω,V)} < ∞.

We conclude this section with some remarks on tensorized representations of the Kalman gain and related auxiliary operators that will be useful when developing MLEnKF algorithms in Section 3.3.
For the covariance matrix, it holds almost surely that C^{MC}_{n+1} ∈ 𝒱 ⊗ 𝒱 ⊂ V ⊗ V, so it may be represented in the tensor basis of Section 2.2. For the auxiliary operator, an analogous almost-sure representation holds.

Multilevel EnKF
3.1. Notation and assumptions. Recall that {φ_k}_{k=1}^∞ represents a complete orthonormal basis for V and consider the hierarchy of subspaces V_ℓ = span{φ_k}_{k=1}^{N_ℓ}, where {N_ℓ} is an exponentially increasing sequence of natural numbers further described below in Assumption 2. By construction, V_0 ⊂ V_1 ⊂ · · · ⊂ V. We define a sequence of orthogonal projection operators {P_ℓ : V → V_ℓ} by

P_ℓ v := Σ_{k=1}^{N_ℓ} ⟨φ_k, v⟩_V φ_k.

It trivially follows that V_ℓ is isometrically isomorphic to R^{N_ℓ}, so that any element v ∈ V_ℓ will, when convenient, be viewed as the unique corresponding element of R^{N_ℓ} whose k-th component is given by ⟨φ_k, v⟩_V for k ∈ {1, 2, . . . , N_ℓ}. For the practical construction of numerical methods, we also introduce a second sequence of projection operators {Π_ℓ : V → V_ℓ}, e.g., interpolation operators, which are assumed to be close to the corresponding orthogonal projectors and to satisfy the constraint Π_ℓ V = P_ℓ V = V_ℓ. This framework can accommodate spectral methods, for which typically Π_ℓ = P_ℓ, as well as finite-element-type approximations, for which Π_ℓ more commonly will be taken as an interpolation operator. In the latter case, the basis {φ_j} will be a hierarchical finite element basis, cf. [47,8].
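In the spectral case Π_ℓ = P_ℓ, the projection hierarchy reduces to coefficient truncation. A minimal sketch, where the basis-size rule N_ℓ = 2^(ℓ+2) is our own toy choice of the growth rate κ:

```python
import numpy as np

def spectral_projection(coeffs, N_level):
    """Orthogonal projection P_l onto V_l = span{phi_1, ..., phi_{N_l}}.

    In a spectral setting (Pi_l = P_l), a function is identified with its
    coefficient vector in the orthonormal basis {phi_k}; projecting onto
    V_l truncates the vector after the first N_l modes.
    """
    out = np.zeros_like(coeffs)
    out[:N_level] = coeffs[:N_level]
    return out

# Nested spaces V_0 < V_1 < ... with exponentially growing dimension;
# N_l = 2**(l+2) is an assumed toy growth rule (kappa = 2).
N = [2 ** (l + 2) for l in range(4)]
u = np.random.default_rng(2).standard_normal(N[-1])
u0 = spectral_projection(u, N[0])
u1 = spectral_projection(u, N[1])
# Nestedness: projecting the level-1 projection down to level 0
# agrees with projecting directly to level 0.
print(np.allclose(spectral_projection(u1, N[0]), u0))  # True
```

The nestedness check reflects the inclusion V_0 ⊂ V_1 ⊂ · · · ⊂ V of the subspace hierarchy.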
We now introduce two additional assumptions on the hierarchy of dynamics and two assumptions on the projection operators that will be needed in order to prove the convergence of MLEnKF and its superior efficiency compared to EnKF. For two non-negative sequences {f_ℓ} and {g_ℓ}, the notation f_ℓ ≲ g_ℓ means that there exists a constant c > 0 such that f_ℓ ≤ c g_ℓ holds for all ℓ ∈ N ∪ {0}, and the notation f_ℓ ≂ g_ℓ means that both f_ℓ ≲ g_ℓ and g_ℓ ≲ f_ℓ are true.
Assumption 2. Assume the initial data of the hidden Markov model (1) and (2) satisfies u_0 ∈ ∩_{p≥2} L^p(Ω, 𝒱). Consider a hierarchy of solution operators {Ψ_ℓ : L^p(Ω, V) × Ω → L^p(Ω, V_ℓ)} for which Assumption 1 holds, associated to a hierarchy of subspaces {V_ℓ} with resolution dimension N_ℓ ≂ κ^ℓ for some κ > 1. Let h_ℓ ≂ N_ℓ^{−1/d} and Δt_ℓ ≂ h_ℓ^{γ_t}, for some γ_t > 0, respectively denote the spatial and the temporal resolution parameter on level ℓ. For a given set of exponent rates β, γ_x, γ_t > 0, the following conditions are fulfilled:

(i) ‖Ψ(u) − Ψ_ℓ(u)‖_{L^p(Ω,V)} ≲ h_ℓ^{β/2} (1 + ‖u‖_{L^p(Ω,𝒱)}),
(ii) ‖(I − Π_ℓ)v‖_{L^p(Ω,V)} ≲ h_ℓ^{β/2} ‖v‖_{L^p(Ω,𝒱)},
(iii) the computational cost of applying Π_ℓ or P_ℓ to an element of V is proportional to N_ℓ, and that of applying Ψ_ℓ to any element of V_ℓ is

Cost(Ψ_ℓ) ≲ h_ℓ^{−(dγ_x + γ_t)},

where d denotes the dimension of the spatial domain of elements in V, and dγ_x + γ_t ≥ d.

3.2.
The MLEnKF method. MLEnKF computes particle paths on a hierarchy of finite-dimensional function spaces with accuracy levels determined by the solvers {Ψ_ℓ}_{ℓ=0}^L. Let v^ℓ_n and v̂^ℓ_n respectively represent the prediction and updated ensemble state at time n of a particle on resolution level ℓ, i.e., with dynamics governed by Ψ_ℓ. For ensemble sizes {M_ℓ}_{ℓ=0}^L, cf. (18), the initial setup for MLEnKF consists of a hierarchy of ensembles {v̂^ℓ_{0,i}}_{i=1}^{M_ℓ}, ℓ = 0, . . . , L. MLEnKF approximates the initial reference distribution P_{u_0|Y_0} (recalling the convention Y_0 = ∅, so that u_0|Y_0 = u_0) by the multilevel-Monte-Carlo-based, signed empirical measure μ̂^ML_0, cf. (16). Similar to EnKF, a prediction-update mapping represents the transition of the MLEnKF hierarchy of ensembles over one step, and μ̂^ML_n represents the empirical distribution of the updated MLEnKF at time n.
The MLEnKF prediction step consists of simulating all particle paths on all resolution levels one observation time forward:

v^{ℓ−1}_{n+1,i} = Ψ_{ℓ−1}(v̂^{ℓ−1}_{n,i}, ω^ℓ_{n,i}),   v^ℓ_{n+1,i} = Ψ_ℓ(v̂^ℓ_{n,i}, ω^ℓ_{n,i}),    (11)

for ℓ = 1, . . . , L and i = 1, 2, . . . , M_ℓ. Note here that the driving noise in the second argument of the dynamics Ψ_{ℓ−1} and Ψ_ℓ is pairwise coupled, and otherwise independent. For the update step, the MLEnKF prediction covariance matrix is given by the following multilevel sample-covariance estimator,

C^ML_{n+1} := Σ_{ℓ=0}^L (Cov_{M_ℓ}[v^ℓ_{n+1}] − Cov_{M_ℓ}[v^{ℓ−1}_{n+1}]),    (12)

and the multilevel Kalman gain is defined by

K^ML_{n+1} := C^ML_{n+1} H^* ((H C^ML_{n+1} H^*)_+ + Γ)^{−1},    (13)

where (H C^ML_{n+1} H^*)_+ := Σ_{j: λ_j > 0} λ_j q_j q_j^T, with (λ_j, q_j)_{j=1}^m denoting the eigenpairs of H C^ML_{n+1} H^* ∈ R^{m×m}. The new observation y_{n+1} is assimilated into the hierarchy of ensembles by the following multilevel extension of EnKF: at the zeroth level,

v̂^0_{n+1,i} = v^0_{n+1,i} + Π_0 K^ML_{n+1} (ỹ^0_{n+1,i} − H v^0_{n+1,i}),    (14)

and for each of the higher levels, ℓ = 1, . . . , L, with pairwise coupling of perturbed observations,

v̂^{ℓ−1}_{n+1,i} = v^{ℓ−1}_{n+1,i} + Π_{ℓ−1} K^ML_{n+1} (ỹ^ℓ_{n+1,i} − H v^{ℓ−1}_{n+1,i}),
v̂^ℓ_{n+1,i} = v^ℓ_{n+1,i} + Π_ℓ K^ML_{n+1} (ỹ^ℓ_{n+1,i} − H v^ℓ_{n+1,i}),    (15)

for i = 1, . . . , M_ℓ, with ỹ^ℓ_{n+1,i} := y_{n+1} + η^ℓ_{n+1,i} and the sequence {η^ℓ_{n+1,i}}_{i,ℓ} being independent and identically N(0, Γ)-distributed. It is precisely the multiplication with the Kalman gain in the update step that correlates all the MLEnKF particles. In comparison to standard MLMC, where all samples except the pairwise coupled ones are independent, this global correlation in MLEnKF substantially complicates the convergence analysis of the method.
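The multilevel sample-covariance estimator can be sketched schematically. The coupling model below (the coarse member sharing the low-mode noise of the fine member) is our own toy stand-in for the pairwise-coupled solver pairs, as are all dimensions and sample sizes:

```python
import numpy as np

def multilevel_covariance(ensembles):
    """Multilevel sample-covariance estimator (a schematic sketch).

    ensembles[l] = (V_coarse, V_fine): pairwise-coupled level-l ensembles,
    both stored as (d_max, M_l) coefficient arrays (coarser members are
    zero-padded to the finest dimension). Follows the telescoping sum
    C^ML = sum_l Cov[v^l] - Cov[v^{l-1}], with the convention v^{-1} = 0
    so that the coarse term is absent at l = 0.
    """
    d_max = ensembles[0][1].shape[0]
    C = np.zeros((d_max, d_max))
    for level, (V_coarse, V_fine) in enumerate(ensembles):
        C += np.cov(V_fine)
        if level > 0:
            C -= np.cov(V_coarse)
    return C

# Toy coupled pairs: the coarse member keeps fewer spectral modes of the
# same noise as the fine member (mode counts are our own toy choice).
rng = np.random.default_rng(3)
d_max, ensembles = 8, []
for level, M in enumerate([400, 100, 25]):
    fine = np.zeros((d_max, M))
    fine[: 2 ** (level + 1)] = rng.standard_normal((2 ** (level + 1), M))
    coarse = fine.copy()
    coarse[2 ** level :] = 0.0
    ensembles.append((coarse, fine))
C_ml = multilevel_covariance(ensembles)
print(C_ml.shape)  # (8, 8)
```

As Remark 3 notes, such a signed telescoping estimator need not be positive semi-definite; the positive-part correction in the Kalman gain formula addresses this.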
Remark 3. Although unlikely, the multilevel sample prediction covariance C^ML_{n+1} may have negative eigenvalues and, in the worst case, this could lead to S^ML_{n+1} = H C^ML_{n+1} H^* + Γ becoming a singular matrix. The impetus for replacing the matrix (H C^ML_{n+1} H^*) with its positive semidefinite "counterpart" (H C^ML_{n+1} H^*)_+ in the Kalman gain formula (13) is to ensure that S^ML_{n+1} := (H C^ML_{n+1} H^*)_+ + Γ is invertible and to obtain the bound |(S^ML_{n+1})^{−1}| ≤ |Γ^{−1}|.

The following notation denotes the (signed) empirical measure of the multilevel updated ensemble,

μ̂^ML_n := Σ_{ℓ=0}^L (1/M_ℓ) Σ_{i=1}^{M_ℓ} (δ_{v̂^ℓ_{n,i}} − δ_{v̂^{ℓ−1}_{n,i}}),    (16)

and for any QoI ϕ : V → R,

μ̂^ML_n[ϕ] := Σ_{ℓ=0}^L (1/M_ℓ) Σ_{i=1}^{M_ℓ} (ϕ(v̂^ℓ_{n,i}) − ϕ(v̂^{ℓ−1}_{n,i})).

We conclude this section with an estimate that relates to the computational cost of one MLEnKF update step.

Proposition 3. Given an MLEnKF hierarchy of prediction ensembles, the cost of computing the multilevel Kalman gain K^ML_{n+1} is O(m Σ_{ℓ=0}^L M_ℓ N_ℓ + m² N_L). And if Assumption 2(iii) holds, then the cost of updating the ℓ-th level ensemble is O(m N_ℓ M_ℓ).

Proof. Notice that it is not required to compute the full multilevel prediction covariance C^ML_{n+1} in order to build the MLEnKF Kalman gain, but rather only the auxiliary matrix

R^ML_{n+1} := C^ML_{n+1} H^* ∈ R^{N_L × m}.    (17)

(The advantage of storing R^ML_{n+1} ∈ R^{N_L × m} rather than C^ML_{n+1} ∈ R^{N_L × N_L} is the dimensional reduction obtained for large L, since then N_L ≫ m.) For the Kalman gain, the cost of computing Cov_{M_ℓ}[v^ℓ_{n+1}, H v^ℓ_{n+1}] ∈ R^{N_ℓ × m} is proportional to m N_ℓ M_ℓ. There are also the insignificant one-time costs of constructing and inverting S^ML_{n+1}, and the matrix multiplication R^ML_{n+1} (S^ML_{n+1})^{−1}. In total, these costs are proportional to N_L m².
The cost of updating the ℓ-th level ensemble by (15) contains the one-time cost of the matrix multiplication Π_ℓ K^ML_{n+1}, which by Assumption 2(iii) is proportional to m N_ℓ. For each particle, the cost of computing H v^ℓ_{n+1,i} is proportional to N_ℓ, since v^ℓ_{n+1,i} ∈ V_ℓ, and the costs of computing (Π_ℓ K^ML_{n+1})(H v^ℓ_{n+1,i}) and (Π_ℓ K^ML_{n+1}) ỹ^ℓ_{n+1,i} are both proportional to m N_ℓ.
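The cost saving described in the proof, building R^ML = C^ML H^* directly from centered samples rather than forming the full covariance, can be sketched as follows. The positive-part correction of Remark 3 is omitted for brevity, and all names and sizes are toy choices of ours:

```python
import numpy as np

def multilevel_gain(ensembles, H, Gamma):
    """Kalman gain via R^ML = C^ML H^T without forming C^ML (a sketch).

    Storing R^ML in R^{N_L x m} instead of C^ML in R^{N_L x N_L} is the
    dimension reduction noted in the proposition: for m << N_L, only the
    cross-covariance with the m observed directions is needed.
    ensembles[l] = (V_coarse, V_fine), each of shape (N_L, M_l), with
    coarser members zero-padded to the finest dimension.
    """
    N_L, m = H.shape[1], H.shape[0]
    R = np.zeros((N_L, m))
    for level, (Vc, Vf) in enumerate(ensembles):
        for V, sign in ((Vf, 1.0), (Vc, -1.0)):
            if level == 0 and sign < 0:
                continue                  # convention v^{-1} = 0
            Z = V - V.mean(axis=1, keepdims=True)
            # Cov_M[v, Hv]: cost O(m * N_L * M_l) per level, never N_L^2
            R += sign * (Z @ (H @ Z).T) / (V.shape[1] - 1)
    S = H @ R + Gamma                     # = H C^ML H^T + Gamma, (m x m)
    return R @ np.linalg.inv(S)           # K^ML, an (N_L x m) matrix

rng = np.random.default_rng(4)
Vf = rng.standard_normal((6, 50))
K = multilevel_gain([(np.zeros((6, 50)), Vf)], np.eye(2, 6), 0.5 * np.eye(2))
print(K.shape)  # (6, 2)
```

The inner products against H are taken before any N_L × N_L object could appear, which is the point of the proposition's cost estimate.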

3.3. MLEnKF algorithms.
A subtlety with computing (17) efficiently is that the summands will be elements of different-sized tensor spaces, since the level-ℓ contribution belongs to V_ℓ ⊗ R^m (identified with R^{N_ℓ × m}). The algorithm presented below efficiently computes (17) by performing all arithmetic operations in the tensor space of lowest possible dimension available at the current stage of computations. When Proposition 3 applies, the computational cost of the algorithm is O(m Σ_{ℓ=0}^L M_ℓ N_ℓ). For ease of exposition, we will in the sequel employ the convention v^{−1}_{n,i} = v̂^{−1}_{n,i} = 0 for all n, i.
Algorithm 1 Computing the auxiliary variable R^ML_n
Input: Observation operator H ∈ L(V, R^m) and the prediction ensemble hierarchy.
Initialize R^ML_n = 0 ∈ R^{N_L × m}.
for ℓ = 0, . . . , L − 1 do
  Update the submatrix R^ML_n(1 : N_ℓ, :) ∈ R^{N_ℓ × m}, consisting of the N_ℓ first rows and all columns of R^ML_n, with the level-ℓ sample-covariance increment.
end for
Lastly, add the finest-level sample covariance.
return R^ML_n
In Algorithm 2, we summarize the main steps for one predict-update iteration of the MLEnKF method.

Algorithm 2 MLEnKF predict-update iteration
Input: Hierarchy of projection operators {Π_ℓ}_{ℓ=0}^L and solvers {Ψ_ℓ}_{ℓ=0}^L, the updated ensemble hierarchy at time n, and the new observation y_{n+1}.
Predict: for ℓ = 0, . . . , L do
  for i = 1, . . . , M_ℓ do
    Simulate the coupled particle pair one observation time forward by (11).
  end for
end for
Update: Compute R^ML_{n+1} by Algorithm 1, and K^ML_{n+1} by (13).
for ℓ = 0, . . . , L do
  for i = 1, . . . , M_ℓ do
    Generate the perturbed observation ỹ^ℓ_{n+1,i} and update the particle pair by (14) and (15).
  end for
end for

Theoretical Results
In this section we derive theoretical results on the approximation error and computational cost of weakly approximating the MFEnKF filtering distribution by MLEnKF. We begin by stating the main theorem of this paper. It gives an upper bound for the computational cost of achieving a sought accuracy in L p (Ω)-norm when using the MLEnKF method to approximate the expectation of a QoI. The theorem may be considered an extension to spatially extended models of the earlier work [23].
Theorem 1 (MLEnKF accuracy vs. cost). Consider a Lipschitz continuous QoI ϕ : V → R and suppose Assumption 2 holds. For a given ε > 0, let L and {M_ℓ}_{ℓ=0}^L be defined under the constraints

L = 2d log_κ(ε^{−1})/β   and   M_ℓ ≂ ε^{−2} h_ℓ^{(β + dγ_x + γ_t)/2}.    (18)

Then, for any p ≥ 2 and n ∈ N,

‖μ̂^ML_n[ϕ] − μ̄_n[ϕ]‖_{L^p(Ω)} ≲ |log(ε)|^n ε,    (19)

where we recall that μ̂^ML_n denotes the MLEnKF empirical measure (16), and μ̄_n denotes the mean-field EnKF distribution at time n (meaning v̂̄_n ∼ μ̄_n).

The computational cost of the MLEnKF estimator
The proof of this theorem is presented at the end of this section, and it depends upon the intermediary results presented prior to the proof.
Remark 4. The constraint dγ_x + γ_t ≥ d in Assumption 2(iii) was imposed to ensure that the computational cost of the forward simulation, Cost(Ψ_ℓ) ≲ h_ℓ^{−(dγ_x+γ_t)}, is either linear or superlinear in N_ℓ. In view of Proposition 3, the share of the total cost of a single predict and update step assigned to level ℓ is proportional to M_ℓ Cost(Ψ_ℓ). This cost estimate is used as input in the standard-MLMC-constrained-optimization approach to determining M_ℓ, cf. (18). However, it is important to observe that in settings with high-dimensional observations, m ≥ N_0, the input in said optimization problem needs to be modified accordingly, as then the cost on the lower levels will be dominated by m rather than Cost(Ψ_ℓ).
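The constrained-optimization allocation referred to here can be sketched under assumed variance and cost rates. The decay/growth rates and all proportionality constants below are our illustrative assumptions, not the formulas of (18):

```python
import numpy as np

def mlmc_allocation(eps, beta, kappa, d, gamma_x, gamma_t):
    """Standard-MLMC sample-size optimization, sketched for this setting.

    Assumed model (not taken verbatim from the paper): the coupled
    level-l difference has variance ~ h_l**beta and per-sample cost
    ~ h_l**-(d*gamma_x + gamma_t); minimizing total cost at statistical
    error eps then gives the classic M_l ~ eps**-2 * sqrt(V_l / C_l).
    """
    L = int(np.ceil(2 * d * np.log(1 / eps) / (beta * np.log(kappa))))
    N = kappa ** np.arange(L + 1)         # resolution dimensions N_l
    h = N ** (-1.0 / d)                   # spatial resolution h_l
    V = h ** beta                         # coupled-difference variance
    C = h ** (-(d * gamma_x + gamma_t))   # per-sample cost
    M = np.ceil(eps ** -2 * np.sqrt(V / C) * np.sum(np.sqrt(V * C)))
    return L, M.astype(int)

L, M = mlmc_allocation(eps=0.05, beta=2.0, kappa=2.0, d=1,
                       gamma_x=1.0, gamma_t=1.0)
print(L, M)  # many samples on coarse levels, few on fine levels
```

In the high-dimensional-observation regime m ≥ N_0 noted in the remark, the per-sample cost model C_l on the lower levels would instead be dominated by m, changing the resulting allocation.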
The first result we present is a collection of direct consequences of Assumption 2.

Proposition 4. If Assumption 2 holds, then for all u, v ∈ ∩_{p≥2} L^p(Ω, 𝒱), all globally Lipschitz QoI ϕ : V → R, and all p ≥ 2, properties (i) and (ii) hold, and (iii) for all n ≥ 1, the MFEnKF prediction covariance (9) satisfies the corresponding moment bound.

Proof. Property (i) follows from Assumption 2(i) and the triangle inequality. Property (ii) follows from the Lipschitz continuity of ϕ followed by the triangle inequality, Assumption 1(i), and Assumption 2(i). For property (iii), Proposition 2, Jensen's inequality, definition (3), and Hölder's inequality imply the asserted bound.

Similar to the analysis in [23] and [37,36,41], we next introduce an auxiliary mean-field multilevel ensemble in which each coupled particle pair (v̄^{ℓ−1}_{n,i}, v̄^ℓ_{n,i}) evolves by the respective forward mappings Ψ_{ℓ−1} and Ψ_ℓ using the same driving noise realization as the corresponding MLEnKF particle pair (v^{ℓ−1}_{n,i}, v^ℓ_{n,i}). Note, however, that in the update of the mean-field multilevel ensemble, the limiting MFEnKF covariance C̄_n and Kalman gain K̄_n from equations (9) and (10) are used, rather than the corresponding ones based on sample moments of the multilevel ensemble itself. That is, the initial condition for each coupled particle pair is identical to that of MLEnKF, and one prediction-update iteration is given by the prediction step (21) and update step (22), for ℓ = 0, 1, . . . , L and i = 1, 2, . . . , M_ℓ (similar to before, we employ the convention v̄^{−1} = v̂̄^{−1} := 0). By similar reasoning as in Proposition 2, it can be shown that also v̄^ℓ_n, v̂̄^ℓ_n ∈ ∩_{p≥2} L^p(Ω, V) for any ℓ, n ∈ N ∪ {0}. One may think of the auxiliary mean-field multilevel ensemble as "shadowing" the MLEnKF ensemble.
Before bounding the difference between the multilevel and mean-field Kalman gains in the next two lemmas, let us recall that they are respectively given by (13) and (10).

Proof. Since (H C^ML_n H^*)_+ − H C̄_n H^* is self-adjoint, the bound is immediate when {j | λ_j < 0} = ∅. It remains to verify the lemma when {j | λ_j < 0} ≠ ∅. Let the normalized eigenvector associated to the eigenvalue min_{j: λ_j < 0} λ_j be denoted q_max. Then, since (H R^ML_n)_+ q_max = 0 and the mean-field covariance C̄_n is self-adjoint and positive semi-definite, the claim follows.

The next step is to bound the Kalman gain error in terms of the covariance error.

Lemma 2.
There exists a positive constant c̄_n < ∞, depending on ‖H‖_{L(V,R^m)}, |Γ^{−1}|, and ‖K̄_n‖_{L(R^m,V)}, such that the Kalman gain error is bounded by the covariance error.

Proof. The proof of this lemma is similar to that of [23, Lemma 3.4]. For completeness, we have included a proof in Appendix C.
The next lemma bounds the distance between the prediction covariance matrices of MLEnKF and MFEnKF. For that purpose, let us first recall the dynamics (21) and (22) of the mean-field multilevel ensemble, and introduce the auxiliary covariance matrix C̄^L_n := Cov[v̄^L_n].

Lemma 3. For any ε > 0, let L and {M_ℓ}_{ℓ=0}^L be defined as in Theorem 1. If Assumption 2 holds, then for any p ≥ 2 and n ∈ N, the covariance distance ‖C^ML_n − C̄_n‖_p satisfies the bound (25).

Proof. Introduce the auxiliary covariance matrix C̄^L_n := Cov[v̄^L_n] and use the triangle inequality.

Proof (of Lemma 4). Recall that the initial data of the limiting mean-field method is given by v̂̄_0 ∼ μ̄_0 and that v̂̄^ℓ_0 = Π_ℓ v̂̄_0, so that Assumption 2(ii) applies at n = 0. By Assumptions 1(i) and 2(i), inequality (26) consequently holds by induction, and thus also (27) by the triangle inequality; inequality (28) follows by similar steps. We complete the proof of Lemma 3 by deriving the following bound for ‖C^ML_n − C̄^L_n‖_p:

Lemma 5 (Multilevel i.i.d. sample covariance error). For any ε > 0, let L and {M_ℓ}_{ℓ=0}^L be defined as in Theorem 1. If Assumption 2 holds, then for any p ≥ 2 and n ∈ N, the stated bound holds, where we recall that C̄^L_n := Cov[v̄^L_n].

Proof. The sample covariances in (24) are unbiased. Next, introduce the linear centering operator Υ and use equation (24). By Lemmas 4 and 10 (the latter lemma is located in Appendix A), the claim follows.

We now turn to bounding the last term of the right-hand side of inequality (25).

Lemma 6. For any ε > 0, let L and {M_ℓ}_{ℓ=0}^L be defined as in Theorem 1. If Assumption 2 holds, then for any p ≥ 2 and n ∈ N, the stated bound holds.

Proof. From the definitions of the sample covariance (7) and the multilevel sample covariance (12), one obtains the bounds (30). The bilinearity of the sample covariance yields a decomposition into terms I_1 and I_2. For bounding I_1 we use Jensen's and Hölder's inequalities.
The second summand of inequality (30) is bounded similarly, and the I_2 term can also be bounded with similar steps as in the preceding argument. The proof is finished by summing the contributions of I_1 and I_2 over all levels.
The propagation of error in the update steps of MLEnKF is governed by the magnitude ‖C̄_n − C^ML_n‖_p, i.e., the distance between the MFEnKF prediction covariance and the MLEnKF prediction covariance. The next lemma makes use of Lemma 6 to bound the distance between the mean-field multilevel ensemble and the MLEnKF ensemble.

Lemma 7 (Distance between ensembles). For any ε > 0, let L and {M_ℓ}_{ℓ=0}^L be defined as in Theorem 1. If Assumption 2 holds, then for any p ≥ 2 and n ∈ N, the stated distance bound holds.

Proof. The proof is similar to that of [23, Lemma 3.10]. For completeness, a proof is given in Appendix C.
With the bound between MLEnKF and its multilevel MFEnKF shadow, that conveniently for analysis consists of independent particles, we are finally ready to prove the main result.
Proof of Theorem 1. By the triangle inequality, the error splits into three summands, where μ̃^ML_n denotes the empirical measure associated to the mean-field multilevel ensemble, and μ̄^L_n denotes the probability measure associated to v̂̄^L_n. The first two summands on the right-hand side relate to the statistical error, whereas the last relates to the bias.
By the Lipschitz continuity of the QoI ϕ, the triangle inequality, Lemma 7, and the conventions ϕ(v̂^{−1}_n) = 0 and ϕ(v̂̄^{−1}_n) = 0, the first term satisfies the asserted bound. For the second summand of (32), we employ the telescoping property of the multilevel empirical measure, together with Lemmas 4 and 9. Finally, the bias term in (32) is bounded by the Lipschitz continuity of the QoI and Lemma 4.
Remark 5. Theorem 1 shows the cost-to-accuracy performance of MLEnKF with a disconcerting logarithmic penalty factor in (19) that grows geometrically in n. The same penalty appears in the work [23], yet the numerical experiments there indicate a rate of convergence that is uniform in n. The discrepancy between theory and practice may be an artifact of the conservative bounds used in the proof of said theorem. By imposing further regularity constraints on the dynamics and the QoI, we were able to obtain an error bound without said logarithmic penalty factor for an alternative finite-dimensional-state-space MLEnKF method with local-level Kalman gains [24].
As an alternative to imposing further regularity constraints, we also suspect that ergodicity of the MFEnKF process may be used to avoid the geometrically growing logarithmic penalty factor. Recently, there has been much work on the stability of EnKF [12,13,46].
We conclude this section with a result on the cost-to-accuracy performance of EnKF. It shows that MLEnKF generally outperforms EnKF.
Theorem 2 (EnKF accuracy vs. cost). Consider a Lipschitz continuous QoI ϕ : V → R, and suppose Assumption 2 holds. For a given ε > 0, let L and M be defined under the respective constraints L = 2d log κ (ε −1 )/β and M ≂ ε −2 . Then, for any n ∈ N and p ≥ 2, where μ̂ MC n denotes the EnKF empirical measure, cf. equation (8), with particle evolution given by the EnKF predict and update formulae at resolution level L (i.e., using the numerical approximation Ψ L in the prediction and the projection operator Π L in the update).

The computational cost of the EnKF estimator
Sketch of proof. By the triangle inequality, where μ̂ MC n denotes the empirical measure associated to the EnKF ensemble {v L n,i } M i=1 and μ̂ L n denotes the empirical measure associated to v̂ L n . It follows by inequality (33) that I ≲ ε.
For the second term, the Lipschitz continuity of the QoI ϕ implies there exists a positive scalar c ϕ such that |ϕ(x)| ≤ c ϕ (1 + x V ). Since v̂ L n ∈ L p (Ω, V ) for any n ∈ N and p ≥ 2, it follows by Lemma 9 (on the Hilbert space R) that For the last term, let us first assume that for any p ≥ 2 and n ∈ N, for the single-particle dynamics v L n,1 and v̄ L n,1 respectively associated to the EnKF ensemble {v L n,i } M i=1 and the mean-field EnKF ensemble {v̄ L n,i } M i=1 . Then the Lipschitz continuity of ϕ, the fact that v L n,1 , v̄ L n,1 ∈ L p (Ω, V ) for any n ∈ N and p ≥ 2 (when assuming (34)), and the triangle inequality yield that All that remains is to verify (34); we omit this, as it can be done by steps similar to those in the proof of inequality (31).
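For concreteness, one perturbed-observation EnKF analysis step with the sample-covariance Kalman gain can be sketched as follows. This is a minimal numpy sketch; the names, array layout, and the perturbed-observation variant are our own illustrative assumptions.

```python
import numpy as np

def enkf_update(V, y, H, Gamma, rng):
    """Perturbed-observation EnKF analysis step (illustrative sketch).

    V:     (M, d) prediction ensemble
    y:     (m,) observation
    H:     (m, d) linear observation operator
    Gamma: (m, m) observation-noise covariance
    """
    M = V.shape[0]
    Vc = V - V.mean(axis=0)
    C = Vc.T @ Vc / (M - 1)            # sample prediction covariance
    S = H @ C @ H.T + Gamma            # innovation covariance
    K = np.linalg.solve(S, H @ C).T    # Kalman gain K = C H^T S^{-1}
    # each particle assimilates an independently perturbed observation
    Y = y + rng.multivariate_normal(np.zeros(len(y)), Gamma, size=M)
    return V + (Y - V @ H.T) @ K.T
```

In the large-ensemble limit the sample covariance C approaches the mean-field prediction covariance, which is the mechanism behind the convergence results cited above.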

MLEnKF-adapted numerical methods for a class of stochastic partial differential equations
In this section we develop an MLEnKF-adapted version of the exponential Euler method for solving a family of stochastic reaction-diffusion equations. For a relatively large class of problems, we derive an L p (Ω, V)-convergence rate β for pairwise coupled numerical solutions, cf. Assumption 2, which will be needed when implementing MLEnKF.

The stochastic reaction-diffusion equation.
We consider the following stochastic partial differential equation (SPDE) where T > 0, and the reaction f , the cylindrical Wiener process W and the linear smoothing operator B will be described below. Our base space is K = L 2 (0, 1); we denote by A : D(A) = H 2 (0, 1) ∩ H 1 0 (0, 1) → K the Laplace operator ∆ with zero-valued Dirichlet boundary conditions, and H k (0, 1) denotes the Sobolev space of order k ∈ N. A spectral decomposition of −A yields the sequence of eigenpairs {(λ j , φ j )} j∈N , where −Aφ j = λ j φ j with φ j := √ 2 sin(jπx) and λ j = π 2 j 2 . Since K = span{φ j } j∈N , the eigenpairs of the spectral decomposition give rise to the following family of Hilbert spaces parametrized over r ∈ R: with norm · Kr := (−A) r (·) K . Associated with the probability space (Ω, F, P) and the normal filtration {F t } t∈[0,T ] , the I K -cylindrical Wiener process is defined by where {W j : [0, T ] × Ω → R} j∈N is a sequence of independent F t /B(R)-adapted standard Wiener processes. The smoothing operator is defined by with smoothing parameter b ≥ 0. It may be shown that BK r = K r+b , which implies that B becomes progressively more smoothing the higher the value of b.
In the remaining part of this section, we assume that the following conditions on the nested Hilbert spaces V ⊂ V and regularity conditions on the initial data u 0 and the reaction term f hold:

Assumption 3. The Hilbert spaces V ⊂ V are of the form V = K r 1 and V = K r 2 for a pair of parameters r 1 , r 2 ∈ R satisfying max(0, b − 1/4) ≤ r 1 < r 2 < b + 1/4, the initial data u 0 is F 0 /B(V )-measurable with u 0 ∈ ∩ p≥2 L p (Ω, V ), and the reaction term satisfies

Under Assumption 3 there exists a unique (up to modifications) (Ω, F, P, {F t } t∈[0,T ] )-mild solution of (35), which in this setting corresponds to a mapping u : P-almost surely for all t ∈ [0, T ]. Moreover, for any p ≥ 2 and r ∈ [r 1 , r 2 ], it holds that u(T, ·) L p (Ω,Kr) ≤ C(1 + u 0 L p (Ω,Kr) ), (38) where C > 0 depends on p, r, and T , cf. [28].

Remark 6. The zero-valued boundary conditions of (35) only make pointwise sense provided u(t, ·) ∈ K 1/2+δ for some δ > 0 and all t ∈ (0, T ], P-almost surely. In lower-regularity settings, e.g., when u(t, ·) / ∈ C(0, 1), said boundary condition should be interpreted in a mild rather than pointwise sense.

5.2. The filtering problem. We consider a discrete-time filtering problem of the form (1) and (2) with the above SPDE as the underlying model, with Ψ(u n ) denoting the mild solution of (35) at time T > 0 given initial data u n ∈ ∩ p≥2 L p (Ω, V ). The underlying dynamics at observation times 0, T, 2T, . . . is thus described by u n+1 = Ψ(u n ), and the finite-dimensional partial observation of u n at time nT is given by where Hu = [H 1 (u), . . . , H m (u)] T ∈ R m with H i ∈ V * for i = 1, 2, . . . , m. Note further that Assumption 3 implies that V ⊂ K 0 = K, so we may represent the mild solution in the basis {φ j } at any observation time nT : To verify that this approximation method can be used in the MLEnKF framework, it remains to verify Assumptions 1 and 2(i)-(ii) and to determine the rate parameter β > 0. Equation (37), the regularity f ∈ Lip(K r 1 ), the inequality sup v∈Kr\{0} e At v Kr / v Kr ≤ 1 for all r ∈ R and t ≥ 0, and Jensen's inequality imply that for any p ≥ 2 there exists a C > 0 such that Hence, by Gronwall's inequality, which verifies Assumption 1(i). Assumption 1(ii) follows from (38). To verify that Assumption 2(i) holds with rate β = 4(r 2 − r 1 ), observe that for any p ≥ 2, where the last inequality follows from (38) and h ≂ N −1 . Assumption 2(ii) follows by a similar shift-space argument. (Relating to Assumption 2(iii), we leave the question of the computational cost of this method open for now; see Section 6.2 for its treatment in one example.)

Assumptions and convergence rates.
To show that the fully-discrete numerical method may be used in MLEnKF, it remains to verify that Assumptions 1 and 2(i)-(ii) hold. Assumption 1(i): Let U ℓ,k and Ū ℓ,k denote solutions at time t = k∆t ℓ of the scheme (42) with respective initial data U ℓ,0 = P ℓ u 0 and Ū ℓ,0 = P ℓ v 0 for some u 0 , v 0 ∈ L p (Ω, V). Then, by (40) and the properties (a) for all ℓ ≥ 0 and v ∈ V, and (b) f ∈ Lip(V, V), there exists a C > 0 such that Consequently, for every p ≥ 2 there exists a c Ψ > 0 such that holds for all ℓ ≥ 0 and u 0 , v 0 ∈ L p (Ω, V).
The triangle inequality and (41) yield that and by [28, Corollary 8.1.11-12 and Theorem 8.2.25] 1 , it respectively holds for any p ≥ 2 that This verifies Assumption 2(i), as it leads to the following bound: for any p ≥ 2, Assumption 2(ii) only depends on the projection operator, and thus follows from (41).

Linear forcing.
For the remaining part of this section, consider the linear case f (u) = u of the filtering problem in Section 5.2. We derive explicit values for the rate exponents β, γ x and γ t when applying MLEnKF with either the exact-in-time-truncated-in-space approximation method of Section 5.3 or the fully-discrete approximation method of Section 5.4. The exact solution of the j-th mode in this linear case is Although the underlying dynamics can be solved exactly, the filtering problem is still non-trivial, since correlations between the modes {u (j) n+1 } j will arise from the assimilation of observations (39), unless the observation operator is of the special form H(·) = [H 1 (·), . . . , H m (·)] T with all operator components of the form H i = φ * j for some j ∈ N.

1 In the notation of the lecture notes [28], the parameters γ, β and η, which describe different properties than in this paper, take the values γ = r 1 , β = b − 1/4 and η = 2(γ − β).
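For intuition, each mode can be sampled exactly over one observation interval. The sketch below assumes that, with f (u) = u and B diagonal in the sine basis, the j-th mode obeys the scalar SDE du (j) = (1 − λ j )u (j) dt + b j dW j ; the function name and the diagonal-noise assumption are our own illustration.

```python
import numpy as np

def exact_mode_step(u, lam_j, b_j, T, rng):
    """One observation-interval step of the j-th spectral mode, assuming
    the linear reaction f(u) = u:  du = (1 - lam_j) u dt + b_j dW_j.
    The transition density is Gaussian and can be sampled exactly."""
    a = 1.0 - lam_j                                  # modal drift coefficient
    mean = np.exp(a * T) * u                         # deterministic propagation
    # Var of the stochastic convolution: b^2 (e^{2aT} - 1) / (2a)
    var = b_j**2 * (np.exp(2.0 * a * T) - 1.0) / (2.0 * a)
    return mean + np.sqrt(var) * rng.standard_normal(np.shape(u))
```

Since λ j = π 2 j 2 > 1 for all j, the drift a = 1 − λ j is negative and the modal variance saturates at b j 2 /(2(λ j − 1)) for large T .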
Since the Galerkin and spatial approximation methods coincide in the linear setting, meaning Ψ = Ψ , it holds by (41) that for any p ≥ 2, Let us next show that the time-discretization convergence rate (45) is improved from r 1 − r 2 in the above nonlinear setting to 1 in the linear setting. We begin by studying the properties of the sequence {P ℓ Ψ m (u 0 )} ∞ m=ℓ for a fixed ℓ ∈ N.

Lemma 8. The j-th mode projected difference of coupled solutions for m > ℓ is given by where I m,j,1 , I m,j,2 , and I m,j,3 for every j = 1, 2, . . . form a triplet of mutually independent random variables, and I m,j,1 = I m,j,2 = I m,j,3 = 0 for all j > N m−1 . Furthermore, there exists a constant c > 0 depending on T > 0 and λ 1 > 1 such that for any m ∈ N and all j ≤ N m−1 , and I m,j,2 and I m,j,3 are mean-zero Gaussians with variance bounded by

Proof. See Appendix B.
By Lemma 8 and Assumption 3, there exists a C > 0 depending on p, T , λ 1 and b + 1/4 − r 1 such that for any m > ℓ and u 0 ∈ ∩ p≥2 L p (Ω, V ), Here, the sixth inequality follows from I m,j,2 and I m,j,3 being mean-zero Gaussians with variance bounded by (47), which implies that for any p ≥ 2 there exists a constant C > 0 depending on p such that max r∈{2,3} holds for all j ∈ N. And the last inequality follows from the assumption From inequality (48) we deduce that {P ℓ Ψ m (u 0 )} ∞ m=ℓ is L p (Ω, V)-Cauchy and that there exists a constant C > 0 depending on p, T , λ 1 and b + 1/4 − r 1 such that In view of the preceding inequality and (46), we obtain the following L p -strong convergence rate for the fully discrete scheme: where C depends on r 1 , r 2 and p, but not on ℓ.

Remark 7. To the best of our knowledge, the L p -strong time-discretization convergence rate (49) improves on the literature in two ways. First, for p = 2, it is slightly higher than O(log(∆t −1 )∆t), which is the best rate in the literature, cf. [29]. Second, this is the first proof of an order-1 L p -strong time-discretization convergence rate for any p ≥ 2.

5.5.1. Error equilibration. The temporal and spatial discretization errors of (50) are equilibrated by determining the base κ > 1 that induces a sequence {N ℓ = N 0 κ ℓ } such that N ℓ 2(r 1 −r 2 ) ≂ J ℓ −1 ≂ 2 −ℓ . The solution is κ = 2 1/(2(r 2 −r 1 )) , which yields the following L p -strong convergence rate in (50): In view of Assumption 2, MLEnKF with the fully-discrete approximation method yields the convergence rate β = 4(r 2 − r 1 ) and the computational cost rates γ x = 1 and γ t = 2(r 2 − r 1 ) in the considered linear setting.
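The equilibration above can be made concrete in code. The helper below is hypothetical (constants, rounding and the value of L are our own choices, up to the unspecified constants of Theorem 1 with d = 1); it computes the level parameters implied by the equilibration N ℓ 2(r 1 −r 2 ) ≂ J ℓ −1 ≂ 2 −ℓ .

```python
import math

def level_parameters(eps, r1, r2, N0=4, J0=4):
    """Hypothetical helper: level discretization parameters for the linear
    setting, with beta = 4*(r2 - r1), gamma_x = 1, gamma_t = 2*(r2 - r1).
    The base kappa = 2**(1/(2*(r2 - r1))) equilibrates the spatial error
    O(N_l**(2*(r1 - r2))) with the temporal error O(J_l**(-1)), so that
    both decay like 2**(-l) along the level hierarchy."""
    beta = 4.0 * (r2 - r1)
    kappa = 2.0 ** (1.0 / (2.0 * (r2 - r1)))
    # L = 2 * log_kappa(1/eps) / beta, as in Theorem 1 with d = 1
    L = math.ceil(2.0 * math.log(1.0 / eps) / (beta * math.log(kappa)))
    N = [int(round(N0 * kappa**l)) for l in range(L + 1)]   # spatial modes
    J = [J0 * 2**l for l in range(L + 1)]                   # time steps
    return L, N, J
```

For r 2 − r 1 = 1/2 this gives κ = 2, i.e., both N ℓ and J ℓ double per level, matching the choices used in the numerical examples below.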

Remark 8 (MLEnKF time).
Note that one could consider applying the SDE version of [23] to a fixed finite-dimensional approximation of the SPDE. However, in this case one would incur a fixed baseline cost associated to that discretization. In comparison to a single-level method there would be a gain in efficiency, as a result of using the multilevel identity with respect to the time discretization, but this would still be substantially less efficient than also accounting for the spatial approximation in the multilevel method, as we do in the method considered here.

Numerical examples
In this section we present numerical performance studies of EnKF and MLEnKF applied to two different filtering problems with underlying dynamics given by the SPDE (35). In the first example the reaction term of the SPDE is linear, and in the second example we consider a nonlinear, and thus more challenging, reaction term.

6.1. Discretization parameters and the relationship between computational cost and accuracy. If we neglect the logarithmic factor in (19), as is motivated by Remark 5, then Theorems 1 and 2 respectively imply the following relations between mean squared error (MSE) and computational cost: and Cost μ̂ MC In other words, and For all test problems, we use the observation-time interval T = 1/2, N = 40 observation times, N ℓ = 2 ℓ+2 and, when relevant (i.e., for the fully-discrete numerical method), J ℓ = 2 ℓ+2 . The approximation error, which we refer to as the mean squared error (MSE), is defined as the sum of the squared QoI error over the observation times, averaged over 100 realizations of the respective filtering methods. That is, In the examples below we numerically verify that the considered numerical methods respectively fulfill (51) and (52) when the above computational cost expressions are replaced/approximated by the wall-clock runtime of the computer implementations of the respective methods. More precisely, we numerically verify that the following approximate asymptotic inequalities hold: The first test problem uses observation noise parameter Γ = 0.5, and QoI and initial data for which we note that H, ϕ ∈ V * and u 0 ∈ V . Figure 1 illustrates one exact-in-time simulation of the SPDE.
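The MSE defined above can be sketched in code as follows. This is a hypothetical harness: `run_filter` and `reference` stand in for one realization of a filtering method and for the reference QoI sequence over the observation times.

```python
import numpy as np

def filtering_mse(run_filter, reference, num_runs=100, seed=0):
    """Estimate the MSE of a filtering method: the squared QoI error is
    summed over the observation times and averaged over independent
    realizations of the filter (hypothetical interface)."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(num_runs):
        est = np.asarray(run_filter(rng))     # QoI estimates per obs. time
        errors.append(np.sum((est - np.asarray(reference)) ** 2))
    return np.mean(errors)
```

In the studies below, the computational cost on the horizontal axis is approximated by the wall-clock runtime of each filter realization.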
By the approximation β = 4(r 2 − r 1 ) ≈ 2, application of the error equilibration in Section 5.5.1 yields (dγ x = 1, γ t = 1) for the fully-discrete method and (dγ x = 1, γ t = 0) for the spatially-discrete method. Figure 2 and the left subfigure of Figure 3 display the runtime-to-MSE performance for the spatially-discrete and fully-discrete methods, respectively.

Figure 2. Runtime-to-MSE comparison for the filtering problem in Section 6.2 using the spatially-discrete method. Figure 3. Runtime-to-MSE comparison for the filtering problem in Section 6.2 using the fully-discrete numerical method.

The reference-solution sequence {μ̄ n [ϕ]} N n=1 that is needed to estimate the MSE in the above figures is approximated by Kalman filtering on the subspace V 12 ⊂ V, which is an N 12 = 2 14 -dimensional subspace. This yields an accurate approximation since, when the underlying dynamics (35) is linear with Gaussian additive noise, the full-space Kalman filter distribution equals the reference MFEnKF distribution μ̄. Furthermore, the EnKF and MLEnKF solutions are computed with ensemble particles at no higher spatial resolution than V 9 in the cost-to-accuracy studies.
As in Section 5.1, we introduce the family of Hilbert spaces parametrized by r ∈ R with norm · Kr := (−A) r (·) K . As smoothing operator B, we consider (36) with parameter b = 1/4, where W denotes an I K -cylindrical Wiener process (both B and W are of course expanded in the currently considered basis (57)). We consider the approximation spaces where ν = 10 −4 , the QoI (55) and the observation operator The spectral representation of the initial data implies that u(0, ·) ∈ K (1−ν)/2 . Figure 4 illustrates one simulation of the SPDE by the numerical scheme described below.
By the Lipschitz continuity of the reaction term, it follows that Assumption 3 is fulfilled. Moreover, the well-posedness theory for the zero-valued boundary condition for the SPDE (35) extends to the current setting, and so does the theory for the fully-discrete exponential Euler method in Section 5.4, cf. [28].

6.3.1. Numerical scheme. We apply the coupled fully-discrete approximation method described in Section 5.4, but, due to the nonlinearity of the reaction term f (u) = sin(πu), the spectral representation in the coupled scheme (42) and (44) needs to be approximated. Namely, the approximation of [(f (U ℓ,k )) (1) , . . . , (f (U ℓ,k )) (N ℓ ) ] is obtained by application of the fast Fourier transform (FFT) as follows: 1. Given the spectral representation [U The spectral approximation of the coarse-level reaction term is obtained analogously. Due to the FFT approximation error in step 2 above, we cannot directly obtain the rate parameter β from the analysis in Section 5.4. To infer β, we instead perform numerical studies of the L p (Ω, V)-convergence of the coupled-level difference Ψ ℓ (u) − Ψ ℓ−1 (u) of the FFT-based fully-discrete method towards 0, where the expectation is estimated by the Monte Carlo method with M = 10 5 samples: (58) Recalling that h ℓ −1 ≂ N ℓ = J ℓ = 2 ℓ+2 for the numerical solver Ψ ℓ , we infer from the results of the numerical study (58), which are provided in Figure 5, that (59) holds with β = 2. Further numerical studies, which we do not include here, indicate that the right-hand side of (59) may be decomposed into O(N −1 + J −1 ). On the basis of these observations, the configuration of discretization parameters for this problem, N ℓ ≂ J ℓ , is in alignment with the efficiency-optimized error equilibration strategy in Section 5.5.1. Once again, the numerical observations are consistent with the theoretical asymptotic behavior predicted by (53) and (54).
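The coefficients-to-grid-and-back procedure can be sketched as follows. For clarity the sketch uses a dense sine matrix in place of a fast DST (a real implementation would use, e.g., a type-I discrete sine transform); the grid x k = k/(N + 1), the orthogonality constant, and the function name are our own illustrative choices.

```python
import numpy as np

def reaction_coefficients(u_coef, f=lambda u: np.sin(np.pi * u)):
    """Sketch of the FFT-based evaluation of a pointwise nonlinearity in
    the sine basis phi_j(x) = sqrt(2) sin(j pi x):
    1. transform coefficients to grid values at x_k = k/(N+1),
    2. apply f pointwise on the grid,
    3. transform back to sine coefficients."""
    N = len(u_coef)
    x = np.arange(1, N + 1) / (N + 1)                 # interior grid points
    # S[j-1, k-1] = phi_j(x_k); rows are basis functions sampled on the grid
    S = np.sqrt(2.0) * np.sin(np.pi * np.outer(np.arange(1, N + 1), x))
    vals = u_coef @ S                                  # u(x_k)
    fvals = f(vals)                                    # pointwise nonlinearity
    # discrete orthogonality: S @ S.T = (N + 1) * I, so S.T/(N+1) inverts S
    return (fvals @ S.T) / (N + 1)
```

Replacing the dense matrix by a fast sine transform reduces the cost per evaluation from O(N 2 ) to O(N log N ), which is what makes the FFT-based scheme practical at the finer levels.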
Remark 9 (MLEnKF versus multilevel particle filters). To the best of our knowledge, there does not exist a general multilevel particle filter for SPDE to date. When the effective dimension on level ℓ is N ℓ , the general requirement for particle filters is that the ensemble size on that level is bounded from below by c e N ℓ particles, for some constant c > 0. Effective dimension refers to the dimension of the space over which importance sampling needs to be performed [3,10,1]. For example, in the case of full observations, the effective dimension can be equal to the state-space dimension. For MLEnKF, on the other hand, the level-ℓ ensemble size is always bounded from above by O(L 2 N ℓ (β−dγ x −γ t )/(2d) ), even with full observations. The set of MLEnKF-tractable problems is therefore substantially larger than the set of problems tractable by particle filters.

Conclusion
We have presented the design and analysis of a multilevel EnKF method for infinite-dimensional spatio-temporal processes, based on a hierarchical decomposition of both the spatial and the temporal discretization parameters. We have proved theoretically, and provided numerical evidence, that under suitable assumptions MLEnKF achieves an asymptotic cost-to-accuracy performance similar to that of standard multilevel Monte Carlo methods. This result has potential for broad impact across application areas in which there has been a recent explosion of interest in EnKF, for example weather prediction and subsurface exploration.
Appendix A. Marcinkiewicz-Zygmund inequalities for separable Hilbert spaces

In order to prove Lemma 5, we will need the following two lemmas, which extend the Marcinkiewicz-Zygmund inequality from finite-dimensional state spaces to separable Hilbert spaces.

Lemma 9 ([34, Theorem 5.2]). Let 2 ≤ p < ∞ and let X i ∈ L p (Ω, V) be i.i.d. samples of X ∈ L p (Ω, V). Then where c p only depends on p.
For bounding these three terms, we need to estimate the difference between (g(λ j , ∆t m )) 2 and g(λ j , ∆t m−1 ). Note first that (g(λ j , ∆t m )) 2 = e −2λ j ∆t m + 2e −λ j ∆t m (1 − e −λ j ∆t m )/λ j + ((1 − e −λ j ∆t m )/λ j ) 2 , which to leading order equals e −λ j ∆t m−1 + (1 − e −λ j ∆t m−1 )/λ j = g(λ j , ∆t m−1 ).

Remark 10. Equations (61) and (62) show that, to leading order, the additive noise from two consecutive iterations of the fine scheme equals the additive noise from one corresponding iteration of the coupled coarse scheme. The strong coupling of the coarse and fine schemes is crucial for achieving the order-1 a priori time-discretization convergence rate.
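Assuming the one-step propagator of the exponential Euler scheme for f (u) = u is g(λ, ∆t) = e −λ∆t + (1 − e −λ∆t )/λ (consistent with the display above), the leading-order matching of two fine steps against one coarse step can be checked numerically:

```python
import numpy as np

def g(lam, dt):
    """One-step propagator of the exponential Euler scheme for the
    linear reaction f(u) = u:  U_{k+1} = g(lam, dt) * U_k + noise."""
    return np.exp(-lam * dt) + (1.0 - np.exp(-lam * dt)) / lam

lam = np.pi**2          # first eigenvalue lambda_1
for dt in [0.02, 0.01, 0.005]:
    diff = g(lam, dt) ** 2 - g(lam, 2.0 * dt)
    # diff shrinks like O(dt^2): it roughly quarters each time dt halves
    print(dt, diff)
```

A short Taylor expansion gives g(λ, ∆t) 2 − g(λ, 2∆t) = (1 − λ)∆t 2 + O(∆t 3 ), which is the leading-order agreement exploited in the proof.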
Proof of Lemma 2. Recalling the notation R ML n = C ML n H * and introducing the auxiliary operator R̄ n := C̄ n H * , we have

Proof of Lemma 7. We will use an induction argument to show that, for arbitrary fixed N ∈ N and p ≥ 2, it holds for all n ≤ N that Σ L ℓ=0 v ℓ n − v̄ ℓ n L q (Ω,V) ≲ | log(ε)| n ε for all q ≤ 4 N −n p.
Plugging (68) into the right-hand side of (29) and using Lemma 3, we obtain that for all q ≤ 4 N −n p.