Robust parameter estimation using the ensemble Kalman filter

Standard maximum likelihood or Bayesian approaches to parameter estimation for stochastic differential equations are not robust to perturbations in the continuous-in-time data. In this paper, we give a rather elementary explanation of this observation in the context of continuous-time parameter estimation using an ensemble Kalman filter. We employ the frequentist perspective to shed new light on three robust estimation techniques; namely subsampling the data, rough path corrections, and data filtering. We illustrate our findings through a simple numerical experiment.


Introduction
In this note, which is an extended version of [30], we consider the well-studied problem of parameter estimation for stochastic differential equations (SDEs) from continuous-time observations X†_t, t ∈ [0, T] [26]. It is well known that the corresponding maximum likelihood estimator does not depend continuously on the observations X†_t, t ∈ [0, T], which can result in a systematic estimation bias [28,15]. In other words, the maximum likelihood estimator is not robust with respect to perturbations in the observations. Here, we revisit this problem from the perspective of online (time-continuous) parameter estimation [7,12] using the popular ensemble Kalman filter (EnKF) and its continuous-time ensemble Kalman-Bucy filter (EnKBF) formulations [16,11,27]. As for the corresponding maximum likelihood approaches, the EnKBF does not depend continuously on the incoming observations X†_t, t ≥ 0, with respect to the uniform norm topology on the space of continuous functions. This fact was first investigated in [10] using rough path theory [17]. In particular, as already demonstrated for the related maximum likelihood estimator in [15], rough path theory allows one to specify an appropriately generalised topology which leads to a continuous dependence of the EnKBF estimators on the observations. Here we expand the analysis of [10] to a frequentist analysis of the EnKBF in the spirit of [31], where the primary focus is on the expected behaviour of the EnKBF estimators over all admissible observation paths. One recovers that the discontinuous dependence of the EnKBF estimators on the driving observations results in a systematic bias from a frequentist perspective. This is also a well-known fact for SDEs driven by multiplicative noise [24].
The proposed frequentist perspective naturally enables the study of known bias correction methods, such as subsampling the data [28], a recently proposed data filtering approach [2], as well as novel de-biasing approaches [10] in the context of the EnKBF.
In order to facilitate a rather elementary mathematical analysis, we consider only the much simplified problem of parameter estimation for linear SDEs. This restriction allows us to avoid certain technicalities from rough path theory and enables a rather straightforward application of the numerical rough path approach put forward in [14]. As a result, we are able to demonstrate that the popular approach of subsampling the data [3,28,6] can be well justified from a frequentist perspective. The frequentist perspective also suggests a rather natural approach to the estimation of the required correction term in the case where an EnKBF is implemented without subsampling.
We end this introductory section with a reference to [1], which includes a broad survey of alternative estimation techniques. We also point to [10] for an in-depth discussion of rough path theory in connection with filtering and parameter estimation.
The remainder of this paper is structured as follows. The problem setting and the EnKBF are introduced in Section 2. The frequentist perspective and its implications for specific implementations of an EnKBF in the context of low- and high-frequency data assimilation are laid out in Section 3. The importance of these considerations becomes transparent when applying the EnKBF to perturbed data in Section 4. Here again, we restrict attention to a rather simple model setting taken from [18] and also used in [10]. As a result, we establish a clear connection between subsampling and the necessity of a correction term when high-frequency data is assimilated directly. We also discuss the data filtering approach [2] in the context of our simple model system. A brief numerical demonstration is provided in Section 5, followed by a concluding remark in Section 6.

Ensemble Kalman parameter estimation
We consider the SDE parameter estimation problem subject to observations X†_t, t ∈ [0, T], which arise from the reference system, where the unknown drift function f†(x) typically satisfies f†(x) = f(x, θ†) and θ† denotes the true parameter value. Here we assume for simplicity that the unknown parameter is scalar-valued and that the state variable is d-dimensional with d ≥ 1. Furthermore, W_t and W†_t denote independent standard d-dimensional Brownian motions and γ > 0 is the (known) diffusion constant.
Following the Bayesian paradigm, we treat the unknown parameter as a random variable Θ. Furthermore, we apply a sequential approach and update Θ with the incoming data X†_t as a function of time. Hence we introduce the random variable Θ_t, which obeys the Bayesian posterior distribution given all observations X†_τ, τ ∈ [0, t], up to time t > 0. Furthermore, instead of exactly solving the time-continuous Bayesian inference problem as specified by the associated Kushner-Stratonovich equation [7,27], we define the time evolution of Θ_t by an application of the (deterministic) ensemble Kalman-Bucy filter (EnKBF) mean-field equations [11,27], where π_t denotes the probability density function (PDF) of Θ_t and π_t[g] the associated expectation value of a function g(θ). The column vector I_t, defined by (3b), is called the innovation, while the row vector premultiplying the innovation in (3a) is called the gain. Here the notation a ⊗ b = ab^T, where a, b can be any two column vectors, has been used. The initial condition Θ_0 ∼ π_0 is provided by the prior PDF of the unknown parameter. A Monte-Carlo implementation of the mean-field equations (3) leads to an interacting particle system in Θ^{(i)}_t, i = 1, ..., M, where expectations are now taken with respect to the empirical measure for a given function g(θ), and all Monte-Carlo samples are driven by the same (fixed) observations X†_t. The initial samples Θ^{(i)}_0, i = 1, ..., M, are drawn independently and identically from the prior distribution π_0.
We note in passing that there is also a stochastic variant of the innovation process [27], which leads to a corresponding Monte-Carlo approximation of the innovation in (5).
Remark 1. There is an intriguing connection to the stochastic gradient descent approach to the estimation of θ†, as proposed in [32], where α_t > 0 denotes the learning rate. We note that (9) shares with (3) the gain-times-innovation structure. However, while (3) approximates the Bayesian inference problem, formulation (9) treats the parameter estimation problem from an optimisation perspective. Both formulations share, however, the discontinuous dependence on the observation path X†_t, and the proposed frequentist analysis of the EnKBF (3) also applies in simplified form to (9). We also point out that (3) is affine invariant [19] and does not require the computation of partial derivatives.

We now state a numerical implementation with step-size ∆t > 0 and denote the resulting numerical approximations at t_n = n∆t by Θ_n ∼ π_n, n ≥ 1. While a standard Euler-Maruyama approximation could be applied, the following stable discrete-time mean-field formulation of the EnKBF, with its associated Kalman gain, is inspired by [4]. It is straightforward to combine this time discretisation with the Monte-Carlo approximation (5) in order to obtain a complete numerical implementation of the EnKBF.
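The displayed equations are not reproduced in this excerpt, but the structure of the linear-drift case can be sketched in a few lines. The following is a minimal, self-contained illustration (not the paper's experiment): scalar state d = 1, drift f(x, θ) = θAx with A = −1, γ = 1, θ† = 1, and a stabilised gain in the spirit of the discrete-time mean-field formulation; all step sizes are illustrative choices.

```python
import numpy as np

# Minimal EnKBF sketch for the linear case f(x, theta) = theta * A * x with
# scalar state (d = 1). A, gamma, theta_dagger and the step sizes are
# illustrative choices, not taken from the paper's experiments.
rng = np.random.default_rng(0)

A, gamma, theta_dagger = -1.0, 1.0, 1.0
dt, T = 0.01, 50.0
n_steps = int(T / dt)

# Reference trajectory from dX = theta_dagger * A * X dt + sqrt(gamma) dW.
x = np.zeros(n_steps + 1)
for n in range(n_steps):
    x[n + 1] = x[n] + theta_dagger * A * x[n] * dt \
        + np.sqrt(gamma * dt) * rng.standard_normal()

# Deterministic EnKBF recursions for the conditional mean mu and variance
# sigma: stabilised gain K, innovation dI = dx - mu * A * x dt.
mu, sigma = 0.0, 1.0          # prior mean and variance
for n in range(n_steps):
    ax = A * x[n]
    K = sigma * ax / (gamma + dt * sigma * ax**2)  # stabilised Kalman gain
    innovation = x[n + 1] - x[n] - mu * ax * dt
    mu += K * innovation
    # equivalent to dsigma = -sigma^2 (ax)^2 / gamma dt in stabilised form
    sigma *= gamma / (gamma + dt * sigma * ax**2)

print(mu, sigma)
```

With these choices the conditional mean µ_n settles near the true value θ† = 1, while the variance σ_n decays roughly like 1/t, in line with the long-time behaviour of σ_t discussed in the following section.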
Remark 2. The rough path analysis of the EnKBF presented in [10] is based on a Stratonovich reformulation of (3) and its appropriate time discretisation. Here we follow the Itô/Euler-Maruyama formulation of the data-driven term in (3), for any continuous function g(x, t) and ∆t = T/L, as it corresponds to the standard implementation of the EnKBF and is easier to analyse in the context of this paper.
The EnKBF provides only an approximate solution to the Bayesian inference problem for general nonlinear f(x, θ). However, it becomes exact in the mean-field limit for affine drift functions f(x, θ) = θAx + Bx + c.
Example 1. Consider the stochastic partial differential equation over a periodic spatial domain y ∈ [0, L), where W(t, y) denotes space-time white noise, and U ∈ R and ρ > 0 are given parameters. A standard finite-difference discretisation in space with d grid points and mesh-size ∆y leads to a linear system of SDEs of the form, where u_t ∈ R^d denotes the vector of grid approximations at time t, D ∈ R^{d×d} a finite-difference approximation of the spatial derivative ∂_y, and W_t a standard d-dimensional Brownian motion. We can now set X_t = u_t, γ = ∆y^{−1}, and identify either θ = U or θ = ρ as the unknown parameter in order to obtain an SDE of the form (1).
In this note, we further simplify the given inference problem to the case where A ∈ R^{d×d} is a normal matrix with eigenvalues in the left half-plane; that is, σ(A) ⊂ C^−. The reference parameter value is set to θ† = 1. Hence the SDE (2) possesses a Gaussian invariant measure with mean zero and covariance matrix C. We assume from now on that the observations X†_t are realisations of (2) with initial condition X†_0 ∼ N(0, C). Under these assumptions, the EnKBF (3) simplifies drastically, and we obtain the formulation (17), with variance σ_t.

Remark 3. For completeness, we state the corresponding formulation for the stochastic gradient descent approach (9). We find that the learning rate α_t takes the role of the variance σ_t in (17). However, we emphasise again that the same pathwise stochastic integrals arise from both formulations, and therefore the same robustness issue of the resulting estimators θ_t, t > 0, arises.
Similarly, the discrete-time mean-field EnKBF (10) reduces to (20) with Kalman gain (21). Furthermore, since X†_t ∼ N(0, C), for d ≫ 1 we may simplify the Kalman gain to (22). Here we have used the notation A : B = tr(A^T B) to denote the Frobenius inner product of two matrices A, B ∈ R^{d×d}. The approximation (22) becomes exact in the limit d → ∞, which we will frequently assume in the following section.
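The approximation (22) replaces the empirical outer product X†_t(X†_t)^T in the gain by the stationary covariance C. As a quick sanity check (our own illustration, not from the paper), one can verify along a simulated stationary trajectory that the time average of |AX†_t|² matches (A^T A) : C; the matrix A and all step sizes below are illustrative choices.

```python
import numpy as np

# Along a stationary trajectory of dX = A X dt + sqrt(gamma) dW, the time
# average of |A X_t|^2 approaches (A^T A) : C with C the stationary covariance.
# A, gamma and the step sizes are illustrative.
rng = np.random.default_rng(5)

A = np.array([[-1.0, 1.0], [-1.0, -1.0]])   # normal matrix, spectrum in C^-
gamma, dt, T = 1.0, 1e-3, 200.0
n = int(T / dt)

C = gamma / 2.0 * np.eye(2)     # solves A C + C A^T + gamma I = 0 for this A
x = np.sqrt(np.diag(C)) * rng.standard_normal(2)
avg = 0.0
for _ in range(n):
    avg += np.dot(A @ x, A @ x)
    x = x + A @ x * dt + np.sqrt(gamma * dt) * rng.standard_normal(2)
avg /= n

frobenius = np.trace(A.T @ A @ C)   # (A^T A) : C
print(avg, frobenius)
```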
Remark 4. The Stratonovich reformulation of (17) replaces (17a) by its Stratonovich counterpart; the innovation I_t remains as before. See Appendix B of [10] for more details. An appropriate time discretisation of the innovation-driven term replaces the Kalman gain (21) accordingly. Please note that a midpoint discretisation of the data-driven term in (25) justifies the additional drift term in (25). A precise meaning of the approximation in (29) will be given in Remark 5 below.
Alternatively, if one wishes to explicitly utilise the availability of continuous-time data X†_t, one can apply the following variant of (20) and, following the Itô/Euler-Maruyama approximation (12), discretise the integral with a small inner step-size ∆τ = ∆t/L, L ≫ 1; that is, with τ_l = t_n + l∆τ. This gives rise to the decomposition (32), which is at the heart of rough path analysis [14] and which we utilise in the following section.
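The inner Riemann-sum approximation of the iterated integral can be coded directly. The sketch below (our own illustration; the displayed identity (32) is not reproduced in this excerpt) computes the second-order iterated Itô-type integral of a d = 2 Brownian path over one outer step and verifies the exact algebraic relation between its symmetric part, the squared increment, and the accumulated quadratic variation, which is the kind of decomposition rough path analysis builds on.

```python
import numpy as np

# Second-order iterated (Ito-type Riemann sum) integral of a d = 2 path over
# one outer step [t_n, t_{n+1}], computed with L inner steps as in (31).
rng = np.random.default_rng(1)

d, L_inner, dtau = 2, 1000, 1e-4
dW = np.sqrt(dtau) * rng.standard_normal((L_inner, d))
X = np.vstack([np.zeros(d), np.cumsum(dW, axis=0)])   # path on the inner grid

# X2 approximates int (X_tau - X_{t_n}) (x) dX_tau over the outer interval.
incs = np.diff(X, axis=0)            # inner increments delta_l
rel = X[:-1] - X[0]                  # X_{tau_l} - X_{t_n}
X2 = rel.T @ incs                    # sum_l rel_l (x) delta_l  (d x d matrix)

# Exact algebraic identity behind the rough-path decomposition: the symmetric
# part of X2 equals (Delta Delta^T - sum_l delta_l delta_l^T) / 2, where
# Delta = X_{t_{n+1}} - X_{t_n} is the increment over the outer step.
Delta = X[-1] - X[0]
sym_identity = 0.5 * (np.outer(Delta, Delta) - incs.T @ incs)
print(np.abs(X2 + X2.T - 2 * sym_identity).max())  # zero up to round-off
```

Only the antisymmetric part of X2 (the Lévy area) carries information beyond the increment and the quadratic variation, which is precisely the quantity that fails to depend continuously on the path.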

Frequentist analysis
It is well known that the second-order contribution in (32) leads to a discontinuous dependence of the integral on the observed X†_t in the uniform norm topology on the space of continuous functions. Rough path theory fixes this problem by defining appropriately extended topologies and has been applied to the EnKBF in [10]. In this section, we complement the pathwise analysis from [10] by an analysis of the impact of the second-order contribution on the EnKBF (17) from a frequentist perspective, which considers the behaviour of the EnKBF over all possible observations X†_t subject to (2). In other words, one switches from a strong solution concept to a weak one. While we assume throughout this section that the observations satisfy (2), the impact of a perturbed observation process on the EnKBF is analysed in Section 4.
We first derive evolution equations for the conditional mean and variance under the assumption that Θ_0 is Gaussian distributed with given prior mean m_prior and variance σ_prior. It follows directly from (17) that the conditional mean µ_t = π_t[θ], that is, the mean of Θ_t, satisfies the SDE (34), which simplifies under the approximation (22). The initial condition is µ_0 = m_prior. The evolution equation for the conditional variance, that is, the variance of Θ_t, is given by (36) with initial condition σ_0 = σ_prior, and again reduces accordingly under the approximation (22). We now perform a frequentist analysis of the estimator µ_t defined by (34) and (36); that is, we perform a weak analysis of the SDE (34) in terms of the first two moments of µ_t [31]. In the first step, we take the expectation of (34) over all realisations X†_t of the SDE (2), which we denote by m_t, and state its associated evolution equation together with its simplification under the approximation (22). In the second step, we also consider the frequentist variance p_t, which we again simplify under the approximation (22). The initial conditions are m_0 = m_prior and p_0 = 0, respectively. We note that the differential equations (36) and (43) are explicitly solvable. For example, one finds that σ_t ∼ 1/((A^T A) : (A^T + A)^{−1} t) for t ≫ 1. It can also be shown that p_t ≤ σ_t for all t ≥ 0.
Furthermore, this analysis suggests that the learning rate in the stochastic gradient descent formulation (19) should decay accordingly, where ᾱ > 0 denotes an initial learning rate; for example, ᾱ = σ_0. We finally conduct a formal analysis of the ensemble Kalman filter time-stepping (20) and demonstrate that the method is first-order accurate with regard to the implied frequentist mean m_t. We recall (24) and conclude from (20) that the implied update of the variance σ_n provides a first-order approximation to (36). We next analyse the evolution equation (34) for the conditional mean µ_t and its numerical approximation arising from (20). Here we follow [14] in order to analyse the impact of the data X†_t on the estimator; an in-depth theoretical treatment can be found in [10]. Comparing (47) to (34) and utilising (24), we find that the key quantity of interest is the integral (48). Motivated by (32) and following standard rough path notation, we rewrite (48) in terms of the path increment and the second-order iterated Itô integral X†_{tn,tn+1}. The difference between the integral (48) and its corresponding approximation in (47) is provided by A^T : X†_{tn,tn+1} plus higher-order terms arising from (24). The iterated integral X†_{tn,tn+1} becomes a random variable from the frequentist perspective. Taking note of (2), we find that the drift, f(x) = Ax, contributes terms of order O(∆t²) to X†_{tn,tn+1}, and the expected value of X†_{tn,tn+1} therefore satisfies (52), since E†[W†_{tn,τ}] = 0 for τ > t_n, and where we have introduced the commutator (54). Hence we find that, while (47) is not a first-order (strong) approximation of the SDE (34), the approximation becomes first-order in m_t when averaged over realisations X†_t of the SDE (2). We note that the modified scheme (30) leads to the same time evolution of the variance σ_n, while the update of µ_n is changed accordingly. This modification results in a more accurate evolution of the conditional mean µ_n but, because of (52), it does
not impact, to leading order, the evolution of the underlying frequentist mean m_t. We summarise our findings in the following proposition.
Proposition 1. The discrete-time EnKBF implementations (20) and (30) both provide first-order approximations to the time evolution of the frequentist mean, m_t, and the frequentist variance, p_t. In other words, both methods converge weakly with order one.
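The weak first-order statement of Proposition 1 hinges on the expected iterated integral being O(∆t²), cf. (52). A Monte-Carlo check of this statement for a two-dimensional linear SDE is sketched below; the normal matrix A with σ(A) ⊂ C^−, the stationary initialisation, and all step sizes are our own illustrative choices.

```python
import numpy as np

# Monte-Carlo check that the expected second-order iterated integral of an SDE
# path over a window of length Dt is O(Dt^2), cf. (52). A (normal, stable) and
# all step sizes are illustrative.
rng = np.random.default_rng(2)

A = np.array([[-1.0, 1.0], [-1.0, -1.0]])   # normal matrix, spectrum in C^-
gamma, Dt, L_inner, n_mc = 1.0, 0.05, 50, 5000
dtau = Dt / L_inner

# stationary covariance is (gamma/2) I for this A, so initialise accordingly
x = np.sqrt(gamma / 2.0) * rng.standard_normal((n_mc, 2))
x0 = x.copy()
X2 = np.zeros((n_mc, 2, 2))
for _ in range(L_inner):
    dx = x @ A.T * dtau + np.sqrt(gamma * dtau) * rng.standard_normal((n_mc, 2))
    X2 += np.einsum('ni,nj->nij', x - x0, dx)   # (X_tau - X_tn) (x) dX_tau
    x = x + dx

mean_X2 = X2.mean(axis=0)
print(np.abs(mean_X2).max())   # O(Dt^2), i.e. much smaller than Dt
```

A single realisation of X2 has entries of size O(∆t); only after averaging over realisations does the O(∆t²) behaviour emerge, which is exactly the weak/frequentist cancellation exploited above.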
We also note that the frequentist uncertainty is essentially data-independent and depends only on the time window [0, T] over which the data is observed. Hence, for a fixed observation interval [0, T], it makes sense to choose the step-size ∆t such that the discretisation error (bias) remains of the same order of magnitude as p_T. Selecting a much smaller step-size would not significantly reduce the frequentist estimation error in the conditional estimator µ_T.
Remark 5. We can now give a precise reformulation of the approximation (29), which is at the heart of the Stratonovich formulation (25) of the EnKBF.

Multi-scale data
We now have all the material in place to study the dependence of the EnKBF estimator on a set of observations X^{(ε)}_t, ε > 0, which approach the theoretical X†_t with respect to the uniform norm topology on the space of continuous functions as ε → 0. Since the second-order contribution in (32), that is (51), does not depend continuously on such perturbations, we demonstrate in this section that a systematic bias arises in the EnKBF. Furthermore, we show how the bias can be eliminated either via subsampling the data, which effectively amounts to ignoring these second-order contributions, or via an appropriate correction term, which ensures a continuous dependence on the observations X^{(ε)}_t with respect to the uniform norm topology. More specifically, we investigate the impact of a possible discrepancy between the SDE model (1), for which we aim to estimate the parameter θ, and the data-generating SDE (2). We therefore replace (2) by the following two-scale SDE [18], where β = 2 and ε = 0.01. The dimension of the state space is d = 2 throughout this section. While we restrict ourselves here to the simple two-scale model (58), similar scenarios can arise from deterministic fast-slow systems [25,8].
The associated EnKBF mean-field equations in the parameter Θ_t, which we now denote by Θ^{(ε)}_t in order to explicitly record its dependence on the scale parameter ε ≪ 1, carry over accordingly, with variance σ^{(ε)}_t. The discrete-time mean-field EnKBF (20) turns into (62) with its associated Kalman gain. We also consider the appropriately modified scheme (64), which corresponds to (30). In order to understand the impact of the modified data-generating process on the two mean-field EnKBF formulations (62) and (64), respectively, we follow [18] and investigate the difference process P^{(ε)}_t between X^{(ε)}_t and X†_t. Since P^{(ε)}_t is stationary, it is Gaussian with mean zero, and hence P^{(ε)}_t → 0 as ε → 0, also in L² uniformly in t, provided σ(A) ⊂ C^− and X^{(ε)}_0 = X†_0. This is illustrated in Figure 1.
In order to investigate the problem further, we study the integral (68) and its relation to (48). As for (48), we can rewrite (68) in terms of increments and second-order iterated integrals. We now investigate the limit of the second-order iterated integral as ε → 0 [18]. Here [·, ·] denotes the commutator defined by (54).

Proof. The proof follows [18] and can be summarised as follows.

As discussed in detail in [10] already, Proposition 2 implies that the scheme (64) driven by the multi-scale data X^{(ε)}_t does not, in general, converge to the scheme (64) driven by X†_t as ε → 0, since the limiting second-order iterated integrals differ by a correction term. This observation suggests the following modification.

Numerical implementation
The numerical implementation of (74) requires an estimator for the generally unknown M in (73). This task is challenging as we only have access to X^{(ε)}_t without any explicit knowledge of the underlying generating process (58). While the estimator proposed in [10] is based on the idea of subsampling the data, the frequentist perspective taken in this note suggests the alternative estimator M_est, which follows from (72f) and (52); that is, E†[X†_{tn,tn+1}] = O(∆t²) for ∆t sufficiently small. Note that the second-order iterated integral X^{(ε)}_{tn,tn+1} satisfies (70) and is therefore easy to compute. In practice, the frequentist expectation value can be replaced by an approximation along a given single observation path X^{(ε)}_t, t ∈ [0, T], under the assumption of ergodicity.
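The frequentist estimator idea can be sketched as a time average of the per-window second-order iterated integrals of the observed path, exploiting E†[X†_{tn,tn+1}] = O(∆t²). The code below is a hypothetical, simplified variant (the precise scaling of M_est in (72f) involves γ and ∆t and is not reproduced in this excerpt); it is applied to driftless diffusion data, for which the true correction vanishes, so the estimate should be close to zero.

```python
import numpy as np

# Hypothetical simplified estimator for the correction term M: average the
# per-window second-order iterated integrals and divide by the window length.
# Driftless diffusion data are used, for which the true correction vanishes.
rng = np.random.default_rng(3)

gamma, Dt, dtau, T = 1.0, 0.01, 1e-4, 20.0
L_inner, n_win = int(round(Dt / dtau)), int(round(T / Dt))

dW = np.sqrt(gamma * dtau) * rng.standard_normal((n_win * L_inner, 2))
X = np.vstack([np.zeros(2), np.cumsum(dW, axis=0)])   # driftless path, M = 0

M_est = np.zeros((2, 2))
for n in range(n_win):
    seg = X[n * L_inner:(n + 1) * L_inner + 1]    # one window incl. endpoints
    incs = np.diff(seg, axis=0)
    M_est += (seg[:-1] - seg[0]).T @ incs         # iterated integral, window n
M_est /= n_win * Dt                               # time average, scaled by Dt

print(np.abs(M_est).max())   # close to zero for reference-model data
```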
An appropriate choice of the outer or subsampling step-size ∆t [28] constitutes an important aspect of the practical implementation of the EnKBF formulation (62) for finite values of ε > 0 [27]. Consistency of the second-order iterated integrals [14] dictates a sensible choice of ∆t; that is, the subsampled data X^{(ε)}_{tn} should behave to leading order like solution increments from the reference model (2) at scale ∆t, independently of the specific value of ε. Note that, on the other hand, this consistency fails for an inner step-size ∆τ ∼ ε. In other words, a suitable step-size ∆t > 0 can be defined by making (85) as small as possible while still guaranteeing an accurate numerical approximation in (62).
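The effect underlying this choice of ∆t can be illustrated with a stand-in for X^{(ε)}: a Brownian path smoothed at scale ε (note that the paper's two-scale model (58) is different; this is our own illustrative surrogate). The empirical quadratic variation per unit time recovers γ for lags ∆t well above ε but collapses for lags below ε, which is exactly why increments must be subsampled at the slow scale; all parameters below are illustrative.

```python
import numpy as np

# Stand-in for multi-scale data: a Brownian path smoothed at scale eps.
# Quadratic variation per unit time recovers gamma only for lags Dt >> eps.
rng = np.random.default_rng(4)

gamma, dtau, T, eps = 1.0, 1e-4, 50.0, 0.01
n = int(T / dtau)
W = np.concatenate([[0.0],
                    np.cumsum(np.sqrt(gamma * dtau) * rng.standard_normal(n))])

win = int(eps / dtau)                       # moving-average window of width eps
Xeps = np.convolve(W, np.ones(win) / win, mode="valid")

def qv_rate(path, lag, dtau):
    """Empirical quadratic variation per unit time at lag Dt = lag * dtau."""
    incs = np.diff(path[::lag])
    return np.sum(incs**2) / (len(incs) * lag * dtau)

r_coarse = qv_rate(Xeps, 1000, dtau)   # Dt = 0.1  >> eps: close to gamma
r_fine = qv_rate(Xeps, 10, dtau)       # Dt = 0.001 << eps: far below gamma
print(r_coarse, r_fine)
```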
Remark 8. The choice of the outer time step ∆t is less critical for the EnKBF formulation (74) since it does not rely on subsampling the data and is robust with regard to perturbations in the data, provided the appropriate M is explicitly available or has been estimated from the available data using (81). Furthermore, if A is symmetric, then it follows from (75) and the skew-symmetry of the commutator [·, ·] that the data-driven integral over [t_n, t_{n+1}] simplifies, which can be used in (74). The same simplification arises when M is symmetric. This insight is at the heart of the geometric rough path approach followed in [10], which starts from the Stratonovich formulation (25) of the EnKBF.
See also [29] on the convergence of Wong-Zakai approximations for stochastic differential equations. In all other cases, a more refined numerical approximation of the data-driven integral in (74) is necessary, such as, for example, (31). For that reason, we rely in this note on the Itô/Euler-Maruyama interpretation of (68), that is, the approximation (12).

Filtered data
We finally discuss a recently proposed [2] robust modification of the parameter estimation problem in the light of the mean-field EnKBF equations considered in this paper. The essential idea is to filter the observation paths X†_t, t ≥ 0, into a smoothed process Z†_t, where δ > 0 is a sufficiently small parameter and V†_t denotes an independent Brownian motion with δ_noise = 1 (noise added) or δ_noise = 0 (no noise added). Extending the methodology proposed in [2] to the mean-field EnKBF equations (17), we now consider

dΘ_t = (σ_t/γ) (AZ†_t)^T dI_t,   (88a)

with the variance σ_t defined as before.
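The filter equation itself is not reproduced in this excerpt; the sketch below implements a hypothetical forward-Euler discretisation of the δ_noise = 0 variant, dZ = (X − Z)/δ dt, i.e. an exponential moving average with time constant δ. Applied to a slow signal with a fast perturbation, the filtered path tracks the slow component while damping the fast one; the test signal and all parameters are our own illustrative choices.

```python
import numpy as np

# Hypothetical forward-Euler discretisation of the noise-free data filter
# dZ = (X - Z)/delta dt: an exponential moving average, time constant delta.
dt, delta, T = 1e-4, 0.05, 10.0
t = np.arange(0.0, T, dt)
X = np.sin(t) + 0.2 * np.sin(200.0 * t)   # slow signal + fast perturbation

Z = np.zeros_like(X)
for n in range(len(t) - 1):
    Z[n + 1] = Z[n] + dt / delta * (X[n] - Z[n])

# After a burn-in of a few multiples of delta, Z tracks the slow component
# while the fast perturbation is attenuated by roughly 1/(200 * delta).
err = np.abs(Z[len(t) // 2:] - np.sin(t[len(t) // 2:]))
print(err.max())
```

The smoothing makes the data-driven terms in (88a) ordinary Lebesgue integrals in the limit of fine time-stepping, which is the source of the robustness of this formulation.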
Let us first investigate the long-time behaviour of the extended data-generating system in some detail. Its stationary distribution is Gaussian with mean m_x^∞ = m_z^∞ = 0, and the stationary covariance matrices satisfy a corresponding set of algebraic relations. The approximation (22) for σ_t provides, however, a good upper bound for p_t (see Figure 2).
We now replace the data-generating SDE model (2) by the multi-scale formulation (58) with ε = 0.01 and β = 2. This parameter choice agrees with the one used in [10]. We again find that assimilating the data at the slow time-scale ∆t = 0.06 leads to results very similar to those obtained from an assimilation at the fast time-scale ∆τ = 10^{−4} with the EnKBF formulation (74), provided the correction term resulting from the second-order iterated integral (73) is included; see Figure 3. We also verified numerically that ∆t = 0.06 constitutes a nearly optimal step-size in the sense of making (85) sufficiently small while maintaining numerical accuracy. For example, reducing the outer step-size to ∆t = 0.02 leads to h(0.02) − h(0.06) ≈ 10 in (85).

Fig. 2. a)-b): frequentist mean m_t and variance p_t from the EnKBF implementation (20) with step-size ∆t = 0.06; c)-d): same results from the EnKBF implementation (30) with inner time-step ∆τ = ∆t/600. We also display the curves arising for σ_t and m_t from the standard Kalman theory using the approximation (22). Note that the posterior variance σ_t should provide an upper bound on the frequentist uncertainty p_t.

Fig. 3. Same experimental setting as in Figure 2, but with the data now generated from the multi-scale SDE (58). Again, subsampling the data in intervals of ∆t = 0.06 and high-frequency assimilation with step-size ∆τ = 10^{−4} lead to very similar results in terms of their frequentist means and variances.