
1 Introduction

In this note, we consider the well-studied problem of parameter estimation for stochastic differential equations (SDEs) from continuous-time observations \(X_t^\dagger \), t ∈ [0, T] [25]. It is well-known that the corresponding maximum likelihood estimator does not depend continuously on the observations \(X_t^\dagger \), t ∈ [0, T], which can result in a systematic estimation bias [27, 14]. In other words, the maximum likelihood estimator is not robust with respect to perturbations in the observations. Here, we revisit this problem from the perspective of online (time-continuous) parameter estimation [6, 11] using the popular ensemble Kalman filter (EnKF) and its continuous-time ensemble Kalman–Bucy filter (EnKBF) formulations [15, 10, 26]. As for the corresponding maximum likelihood approaches, the EnKBF does not depend continuously on the incoming observations \(X_t^\dagger \), t ≥ 0, with respect to the uniform norm topology on the space of continuous functions. This fact was first investigated in [9] using rough path theory [16]. In particular, as already demonstrated for the related maximum likelihood estimator in [14], rough path theory allows one to specify an appropriately generalised topology which leads to a continuous dependence of the EnKBF estimators on the observations. Here we expand the analysis of [9] to a frequentist analysis of the EnKBF in the spirit of [29], where the primary focus is on the expected behaviour of the EnKBF estimators over all admissible observation paths. One recovers that the discontinuous dependence of the EnKBF estimators on the driving observations results in a systematic bias from a frequentist perspective. This is also a well-known fact for SDEs driven by multiplicative noise [23].

The proposed frequentist perspective naturally enables the study of known bias correction methods, such as subsampling the data [27], as well as novel de-biasing approaches in the context of the EnKBF.

In order to facilitate a rather elementary mathematical analysis, we consider only the very much simplified problem of parameter estimation for linear SDEs. This restriction allows us to avoid certain technicalities from rough path theory and enables a rather straightforward application of the numerical rough path approach put forward in [13]. As a result, we are able to demonstrate that the popular approach of subsampling the data [2, 27, 5] can be well justified from a frequentist perspective. The frequentist perspective also suggests a rather natural approach to the estimation of the required correction term in the case where an EnKBF is implemented without subsampling.

We end this introductory paragraph with a reference to [1], which includes a broad survey on alternative estimation techniques. We also point to [9] for an in-depth discussion of rough path theory in connection to filtering and parameter estimation.

The remainder of this paper is structured as follows. The problem setting and the EnKBF are introduced in the subsequent Sect. 2. The frequentist perspective and its implications on the specific implementations of an EnKBF in the context of low and high frequency data assimilation are laid out in Sect. 3. The importance of these considerations becomes transparent when applying the EnKBF to perturbed data in Sect. 4. Here again, we restrict attention to a rather simple model setting taken from [17] and also used in [9]. As a result we build a clear connection between subsampling and the necessity for a correction term in the case high frequency data is assimilated directly. A brief numerical demonstration is provided in Sect. 5, which is followed by a concluding remark in Sect. 6.

2 Ensemble Kalman Parameter Estimation

We consider the SDE parameter estimation problem

$$\displaystyle \begin{aligned} \mathrm{d} X_t = f(X_t,\theta)\mathrm{d}t + \gamma^{1/2} \mathrm{d}W_t \end{aligned} $$
(1)

subject to observations \(X_t^\dagger \), t ∈ [0, T], which arise from the reference system

$$\displaystyle \begin{aligned} \mathrm{d} X_t^\dagger = f^\dagger (X_t^\dagger)\mathrm{d}t + \gamma^{1/2} \mathrm{d}W^\dagger_t, \end{aligned} $$
(2)

where the unknown drift function \(f^\dagger(x)\) typically satisfies \(f^\dagger(x) = f(x,\theta^\dagger)\) and \(\theta^\dagger\) denotes the true parameter value. Here we assume for simplicity that the unknown parameter is scalar-valued and that the state variable is d-dimensional with d ≥ 1. Furthermore, \(W_t\) and \(W_t^\dagger \) denote independent standard d-dimensional Brownian motions and γ > 0 is the (known) diffusion constant.

Following the Bayesian paradigm, we treat the unknown parameter as a random variable Θ. Furthermore, we apply a sequential approach and update Θ with the incoming data \(X^\dagger _t\) as a function of time. Hence we introduce the random variable Θ t which obeys the Bayesian posterior distribution given all observations \(X_\tau ^\dagger \), τ ∈ [0, t], up to time t > 0. Furthermore, instead of exactly solving the time-continuous Bayesian inference problem as specified by the associated Kushner–Stratonovich equation [6, 26], we define the time evolution of Θ t by an application of the (deterministic) ensemble Kalman–Bucy filter (EnKBF) mean-field equations [10, 26], which take the form

$$\displaystyle \begin{aligned} \mathrm{d} \varTheta_t &= \gamma^{-1} \pi_t \left[(\theta - \pi_t[\theta]) \otimes f(X_t^\dagger ,\theta) \right] \mathrm{d} I_t, \end{aligned} $$
(3a)
$$\displaystyle \begin{aligned} \mathrm{d} I_t &= \mathrm{d}X_t^\dagger - \frac{1}{2} \left( f(X_t^\dagger,\varTheta_t ) + \pi_t[f(X^\dagger_t,\theta)] \right) \mathrm{d}t , \end{aligned} $$
(3b)

where π t denotes the probability density function (PDF) of Θ t and π t[g] the associated expectation value of a function g(θ). The column vector I t, defined by (3b), is called the innovation, while the row vector

$$\displaystyle \begin{aligned} K_t(\pi_t) = \gamma^{-1} \pi_t \left[(\theta - \pi_t[\theta]) \otimes f(X_t^\dagger ,\theta) \right] , \end{aligned} $$
(4)

premultiplying the innovation in (3a) is called the gain. Here the notation a ⊗ b = ab T, where a, b can be any two column vectors, has been used. The initial condition Θ 0 ∼ π 0 is provided by the prior PDF of the unknown parameter.

A Monte-Carlo implementation of the mean-field equations (3) leads to the interacting particle system

$$\displaystyle \begin{aligned} \mathrm{d} \varTheta_t^{(i)} &= \gamma^{-1} \pi_t^M \left[(\theta - \pi_t^M[\theta]) \otimes f(X_t^\dagger ,\theta) \right] \mathrm{d} I_t^{(i)}, \end{aligned} $$
(5a)
$$\displaystyle \begin{aligned} \mathrm{d} I_t^{(i)} &= \mathrm{d}X_t^\dagger - \frac{1}{2} \left( f(X_t^\dagger,\varTheta_t^{(i)} ) + \pi_t^M [f(X^\dagger_t,\theta)] \right) \mathrm{d}t , \end{aligned} $$
(5b)

i = 1, …, M, where expectations are now taken with respect to the empirical measure. That is,

$$\displaystyle \begin{aligned} \pi_t^M [g] = \frac{1}{M} \sum_{i=1}^M g(\varTheta_t^{(i)}) \end{aligned} $$
(6)

for given function g(θ), and all Monte-Carlo samples are driven by the same (fixed) observations \(X_t^\dagger \). The initial samples \(\varTheta _0^{(i)}\), i = 1, …, M, are drawn identically and independently from the prior distribution π 0.

We note in passing that there is also a stochastic variant of the innovation process [26] defined by

$$\displaystyle \begin{aligned} \mathrm{d} I_t = \mathrm{d}X_t^\dagger - f(X_t^\dagger,\varTheta_t ) \mathrm{d}t - \gamma^{1/2}\mathrm{d}W_t , \end{aligned} $$
(7)

which leads to the Monte-Carlo approximation

$$\displaystyle \begin{aligned} \mathrm{d} I_t^{(i)} = \mathrm{d}X_t^\dagger - f(X_t^\dagger,\varTheta_t^{(i)} ) \mathrm{d}t - \gamma^{1/2} \mathrm{d}W_t^{(i)} \end{aligned} $$
(8)

of the innovation in (5).

Remark 1

There is an intriguing connection to the stochastic gradient descent approach to the estimation of \(\theta^\dagger\), as proposed in [30], which is written as

$$\displaystyle \begin{aligned} \mathrm{d}\theta_t &= \frac{\alpha_t}{\gamma} \nabla_\theta f(X_t^\dagger,\theta_t)\mathrm{d}\tilde I_t, \end{aligned} $$
(9a)
$$\displaystyle \begin{aligned} \mathrm{d}\tilde I_t &= \mathrm{d}X_t^\dagger - f(X_t^\dagger,\theta_t)\mathrm{d}t \end{aligned} $$
(9b)

in our notation, where α t > 0 denotes the learning rate. We note that (9) shares with (3) the gain times innovation structure. However, while (3) approximates the Bayesian inference problem, formulation (9) treats the parameter estimation problem from an optimisation perspective. Both formulations share, however, the discontinuous dependence on the observation path \(X_t^\dagger \), and the proposed frequentist analysis of the EnKBF (3) also applies in simplified form to (9). We also point out that (3) is affine invariant [18] and does not require the computation of partial derivatives.

We now state a numerical implementation with step-size Δt > 0 and denote the resulting numerical approximations at t n = nΔt by Θ n ∼ π n, n ≥ 1. While a standard Euler–Maruyama approximation could be applied, the following stable discrete-time mean-field formulation of the EnKBF

$$\displaystyle \begin{aligned} \varTheta_{n+1} = \varTheta_n + K_n \left\{ (X_{t_{n+1}}^\dagger - X_{t_n}^\dagger) - \frac{1}{2} \left( f(X_{t_n}^\dagger,\varTheta_n ) + \pi_n[f(X^\dagger_{t_n},\theta)] \right)\varDelta t \right\} \end{aligned} $$
(10)

is inspired by [3] with Kalman gain

$$\displaystyle \begin{aligned} K_n &= \pi_n \left[(\theta - \pi_n[\theta]) \otimes f(X_{t_n}^\dagger ,\theta)\right] \times \end{aligned} $$
(11a)
$$\displaystyle \begin{aligned} & \qquad \left( \gamma + \varDelta t \pi_n \left[ \left(f(X_{t_n}^\dagger ,\theta) - \pi_n [f(X_{t_n}^\dagger ,\theta)] \right) \otimes f(X_{t_n}^\dagger ,\theta)\right] \right)^{-1}. \end{aligned} $$
(11b)

It is straightforward to combine this time discretisation with the Monte-Carlo approximation (5) in order to obtain a complete numerical implementation of the EnKBF.
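For concreteness, the following Python sketch (function and variable names are ours and purely illustrative, not from any particular library) performs one step of the discrete-time EnKBF (10) with gain (11), applied to the ensemble formulation (5):

```python
import numpy as np

def enkbf_step(theta, x_n, x_np1, f, dt, gamma):
    """One step of (10)-(11) applied to the ensemble (5).

    theta : ensemble of parameter samples, shape (M,)
    x_n, x_np1 : observations X_{t_n}^dagger and X_{t_{n+1}}^dagger, shape (d,)
    f : drift function f(x, theta) returning an array of shape (d,)
    """
    d = x_n.shape[0]
    F = np.array([f(x_n, th) for th in theta])            # drift evaluations, shape (M, d)
    f_bar = F.mean(axis=0)                                # ensemble mean of f
    cov_theta_f = ((theta - theta.mean())[:, None] * F).mean(axis=0)   # gain numerator in (11a)
    cov_f_f = (F - f_bar).T @ (F - f_bar) / theta.shape[0]             # covariance term in (11b)
    K = cov_theta_f @ np.linalg.inv(gamma * np.eye(d) + dt * cov_f_f)  # gain (11)
    innovation = (x_np1 - x_n)[None, :] - 0.5 * (F + f_bar[None, :]) * dt
    return theta + innovation @ K                         # update (10) for each member
```

Iterating this step over the observation grid, with the ensemble initialised by independent samples from the prior π 0, yields the complete particle approximation.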

Remark 2

The rough path analysis of the EnKBF presented in [9] is based on a Stratonovich reformulation of (3) and its appropriate time discretisation. Here we follow the Itô/Euler–Maruyama formulation of the data-driven term in (3),

$$\displaystyle \begin{aligned} \int_0^T g(X_t^\dagger,t)\,\mathrm{d}X_t^\dagger = \lim_{\varDelta t\to 0} \sum_{n=0}^{L-1} g(X_{t_n}^\dagger,t_n)(X_{t_{n+1}}^\dagger - X_{t_n}^\dagger ) \end{aligned} $$
(12)

for any continuous function g(x, t) and Δt = T∕L, as it corresponds to the standard implementation of the EnKBF and is easier to analyse in the context of this paper.

The EnKBF provides only an approximate solution to the Bayesian inference problem for general nonlinear f(x, θ). However, it becomes exact in the mean-field limit for affine drift functions f(x, θ) = θAx + Bx + c.

Example 1

Consider the stochastic partial differential equation

$$\displaystyle \begin{aligned} \partial_t u = -U\partial_y u + \rho \partial_y^2 u + \dot{\mathcal{W}} \end{aligned} $$
(13)

over a periodic spatial domain y ∈ [0, L), where \(\mathcal {W}(t,y)\) denotes space-time white noise, \(U\in \mathbb {R}\), and ρ > 0 are given parameters. A standard finite-difference discretisation in space with d grid points and mesh-size Δy leads to a linear system of SDEs of the form

$$\displaystyle \begin{aligned} \mathrm{d}{\mathbf{u}}_t = -(U D + \rho D D^{\mathrm{T}}){\mathbf{u}}_t\mathrm{d}t + \varDelta y^{-1/2} \mathrm{d}W_t, \end{aligned} $$
(14)

where \({\mathbf {u}}_t \in \mathbb {R}^d \) denotes the vector of grid approximations at time t, \(D \in \mathbb {R}^{d\times d}\) a finite-difference approximation of the spatial derivative \(\partial_y\), and \(W_t\) a standard d-dimensional Brownian motion. We can now set X t = u t, \(\gamma = \varDelta y^{-1}\) and identify either θ = U or θ = ρ as the unknown parameter in order to obtain an SDE of the form (1).
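As an illustration, a minimal simulation of the discretised model (14) might be set up as follows (a sketch under the assumption of a periodic central-difference matrix D; all names are ours):

```python
import numpy as np

def simulate_advection_diffusion(U=1.0, rho=0.5, L_y=1.0, d=64,
                                 dt=1e-4, n_steps=2000, seed=0):
    """Euler-Maruyama simulation of the spatially discretised SPDE (14)."""
    rng = np.random.default_rng(seed)
    dy = L_y / d
    # periodic central-difference approximation of the spatial derivative;
    # D is skew-symmetric, so -(U D + rho D D^T) is a normal (circulant) matrix
    D = (np.roll(np.eye(d), -1, axis=1) - np.roll(np.eye(d), 1, axis=1)) / (2 * dy)
    drift = -(U * D + rho * D @ D.T)
    gamma = 1.0 / dy                                   # gamma = (Delta y)^{-1}
    u = np.zeros(d)
    for _ in range(n_steps):
        u = u + drift @ u * dt + np.sqrt(gamma * dt) * rng.standard_normal(d)
    return u
```

Setting X t = u t then puts this simulation into the form (1) with θ = U or θ = ρ as the unknown parameter.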

In this note, we further simplify our given inference problem to the case

$$\displaystyle \begin{aligned} f(x,\theta) = \theta Ax\,, \end{aligned} $$
(15)

where \(A \in \mathbb {R}^{d\times d}\) is a normal matrix with eigenvalues in the left half-plane. That is, \(\sigma (A) \subset \mathbb {C}_-\). The reference parameter value is set to \(\theta^\dagger = 1\). Hence the SDE (2) possesses a Gaussian invariant measure with mean zero and covariance matrix

$$\displaystyle \begin{aligned} C = -\gamma (A + A^{\mathrm{T}})^{-1}. \end{aligned} $$
(16)

We assume from now on that the observations \(X_t^\dagger \) are realisations of (2) with initial condition \(X_0^\dagger \sim \mathrm {N}(0,C)\).

Under these assumptions, the EnKBF (3) simplifies drastically, and we obtain

$$\displaystyle \begin{aligned} \mathrm{d} \varTheta_t &=\frac{\sigma_t}{\gamma} (A X^\dagger_t)^{\mathrm{T}} \mathrm{d}I_t, \end{aligned} $$
(17a)
$$\displaystyle \begin{aligned} \mathrm{d} I_t &= \mathrm{d}X_t^\dagger - \frac{1}{2} \left( \varTheta_t + \pi_t[\theta] \right) A X_t^\dagger \mathrm{d}t , \end{aligned} $$
(17b)

with variance

$$\displaystyle \begin{aligned} \sigma_t = \pi_t \left[(\theta - \pi_t[\theta])^2 \right] . \end{aligned} $$
(18)

Remark 3

For completeness, we state the corresponding formulation for the stochastic gradient descent approach (9):

$$\displaystyle \begin{aligned} \mathrm{d}\theta_t &= \frac{\alpha_t}{\gamma} (A X_t^\dagger)^{\mathrm{T}} \mathrm{d}\tilde I_t, \end{aligned} $$
(19a)
$$\displaystyle \begin{aligned} \mathrm{d}\tilde I_t &= \mathrm{d}X_t^\dagger - \theta_tA X_t^{\dagger}\mathrm{d}t. \end{aligned} $$
(19b)

We find that the learning rate α t takes the role of the variance σ t in (17). However, we emphasise again that the same pathwise stochastic integrals arise from both formulations, and therefore, the same robustness issue of the resulting estimators θ t, t > 0, arises.

Similarly, the discrete-time mean-field EnKBF (10) reduces to

$$\displaystyle \begin{aligned} \varTheta_{n+1} = \varTheta_n + K_n \left\{ (X_{t_{n+1}}^\dagger - X_{t_n}^\dagger) - \frac{1}{2} \left( \varTheta_n + \pi_n[\theta] \right) A X_{t_n}^\dagger \varDelta t \right\} \end{aligned} $$
(20)

with Kalman gain

$$\displaystyle \begin{aligned} K_n = \sigma_n (AX_{t_n}^\dagger)^{\mathrm{T}} \left( \gamma + \varDelta t \sigma_n (AX_{t_n}^\dagger)^{\mathrm{T}} AX_{t_n}^\dagger \right)^{-1}\,. \end{aligned} $$
(21)

Furthermore, since \(X_t^\dagger \sim \mathrm {N}(0,C)\),

$$\displaystyle \begin{aligned} (AX_t^\dagger)^{\mathrm{T}} A X_t^\dagger = (A^{\mathrm{T}}A) : (X_t^\dagger \otimes X_t^\dagger) \approx (A^{\mathrm{T}}A):C \end{aligned} $$
(22)

for d ≫ 1, and we may simplify the Kalman gain to

$$\displaystyle \begin{aligned} K_n = \sigma_n \,(AX_{t_n}^\dagger)^{\mathrm{T}} \left( \gamma + \varDelta t \sigma_n \,(A^{\mathrm{T}}A) : C\right)^{-1}. \end{aligned} $$
(23)

Here we have used the notation A : B = tr(A T B) to denote the Frobenius inner product of two matrices \(A,B\in \mathbb {R}^{d\times d}\). The approximation (22) becomes exact in the limit d →∞, which we will frequently assume in the following section. Please note that

$$\displaystyle \begin{aligned} K_n = \frac{\sigma_n}{\gamma} \,(AX_{t_n}^\dagger)^{\mathrm{T}} + \mathcal{O}(\varDelta t) \end{aligned} $$
(24)

under the stated assumptions.

Remark 4

The Stratonovich reformulation of (17) replaces (17a) by

$$\displaystyle \begin{aligned} \mathrm{d} \varTheta_t =\frac{\sigma_t}{\gamma} \left\{ (A X^\dagger_t)^{\mathrm{T}} \circ \mathrm{d}I_t - \frac{\gamma}{2} \mbox{tr} \,(A)\,\mathrm{d}t\right\}. \end{aligned} $$
(25)

The innovation I t remains as before. See Appendix B of [9] for more details. An appropriate time discretisation of the innovation-driven term replaces the Kalman gain (21) by

$$\displaystyle \begin{aligned} K_{n+1/2} = \sigma_n (AX_{t_{n+1/2}}^\dagger)^{\mathrm{T}} \left( \gamma + \varDelta t \sigma_n (AX_{t_{n+1/2}}^\dagger)^{\mathrm{T}} AX_{t_{n+1/2}}^\dagger \right)^{-1}, \end{aligned} $$
(26)

where

$$\displaystyle \begin{aligned} X_{t_{n+1/2}}^\dagger = \frac{1}{2} (X_{t_n}^\dagger + X_{t_{n+1}}^\dagger)\,. \end{aligned} $$
(27)

Please note that a midpoint discretisation of the data-driven term in (25) results in

$$\displaystyle \begin{aligned} (AX_{t_{n+1/2}}^\dagger )^{\mathrm{T}} (X_{t_{n+1}}^\dagger - X_{t_n}^\dagger) &= (AX_{t_n}^\dagger )^{\mathrm{T}} (X_{t_{n+1}}^\dagger - X_{t_n}^\dagger) \,\,+ \end{aligned} $$
(28a)
$$\displaystyle \begin{aligned} &\qquad \frac{1}{2} A^{\mathrm{T}} : (X_{t_{n+1}}^\dagger - X_{t_n}^\dagger) \otimes (X_{t_{n+1}}^\dagger - X_{t_n}^\dagger) \end{aligned} $$
(28b)

and that

$$\displaystyle \begin{aligned} \frac{1}{2} A^{\mathrm{T}} : (X_{t_{n+1}}^\dagger - X_{t_n}^\dagger) \otimes (X_{t_{n+1}}^\dagger - X_{t_n}^\dagger) \approx \frac{\varDelta t \,\gamma}{2} \mbox{tr}\,(A), \end{aligned} $$
(29)

which justifies the additional drift term in (25). A precise meaning of the approximation in (29) will be given in Remark 5 below.

Alternatively, if one wishes to explicitly utilise the availability of continuous-time data \(X^\dagger _t\), one could apply the following variant of (20):

$$\displaystyle \begin{aligned} \varTheta_{n+1} = \varTheta_n + \frac{\sigma_n}{\gamma} \int_{t_n}^{t_{n+1}} (AX_t^\dagger)^{\mathrm{T}} \mathrm{d} X_t^{\dagger} - \frac{1}{2} K_n A X_{t_n}^\dagger \left( \varTheta_n + \pi_n[\theta] \right) \varDelta t , \end{aligned} $$
(30)

and following the Itô/Euler–Maruyama approximation (12), discretise the integral with a small inner step-size Δτ = Δt∕L, L ≫ 1; that is,

$$\displaystyle \begin{aligned} \int_{t_n}^{t_{n+1}} (AX_t^\dagger)^{\mathrm{T}} \mathrm{d} X_t^{\dagger} \approx \sum_{l=0}^{L-1} (AX_{\tau_l}^\dagger)^{\mathrm{T}} (X_{\tau_{l+1}}^{\dagger}-X_{\tau_l}^\dagger) \end{aligned} $$
(31)

with τ l = t n + lΔτ. We note that

$$\displaystyle \begin{aligned} \sum_{l=0}^{L-1} (AX_{\tau_l}^\dagger)^{\mathrm{T}} (X_{\tau_{l+1}}^{\dagger}-X_{\tau_l}^\dagger) &= (AX_{t_n}^\dagger)^{\mathrm{T}} (X_{t_{n+1}}^{\dagger}-X_{t_n}^\dagger) \,\,+ \end{aligned} $$
(32a)
$$\displaystyle \begin{aligned} & \qquad \,\, A^{\mathrm{T}}: \left( \sum_{l=0}^{L-1} (X_{\tau_l}^\dagger - X_{t_n}^\dagger) \otimes (X_{\tau_{l+1}}^{\dagger}-X_{\tau_l}^\dagger)\right), \end{aligned} $$
(32b)

which is at the heart of rough path analysis [13] and which we utilise in the following section.
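The following numpy sketch (names ours) makes the left-point approximation (31) and the decomposition (32) concrete for a path X stored as an array of shape (L+1, d):

```python
import numpy as np

def ito_sum(A, X):
    """Left-point (Ito/Euler-Maruyama) sum (31) for int (A X_t)^T dX_t."""
    dX = np.diff(X, axis=0)                                # increments X_{tau_{l+1}} - X_{tau_l}
    return np.einsum('ij,lj,li->', A, X[:-1], dX)          # sum_l (A X_{tau_l})^T dX_l

def increment_and_iterated_part(A, X):
    """The two terms on the right-hand side of (32)."""
    dX = np.diff(X, axis=0)
    first = (A @ X[0]) @ (X[-1] - X[0])                    # (A X_{t_n})^T (X_{t_{n+1}} - X_{t_n})
    XX = np.einsum('lj,li->ji', X[:-1] - X[0], dX)         # second-order iterated (Ito) integral
    second = np.sum(A.T * XX)                              # Frobenius product A^T : XX
    return first, second                                   # first + second == ito_sum(A, X)
```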

3 Frequentist Analysis

It is well-known that the second-order contribution in (32) leads to a discontinuous dependence of the integral on the observed \(X_t^\dagger \) in the uniform norm topology on the space of continuous functions. Rough path theory fixes this problem by defining appropriately extended topologies and has been extended to the EnKBF in [9]. In this section, we complement the path-wise analysis from [9] by an analysis of the impact of the second-order contribution on the EnKBF (17) from a frequentist perspective, which analyses the behaviour of the EnKBF over all possible observations \(X_t^\dagger \) subject to (2). In other words, one switches from a strong solution concept to a weak one. While we assume throughout this section that the observations satisfy (2), we will analyse the impact of a perturbed observation process on the EnKBF in Sect. 4.

We first derive evolution equations for the conditional mean and variance under the assumption that Θ 0 is Gaussian distributed with given prior mean m prior and variance σ prior. It follows directly from (17) that the conditional mean μ t = π t[θ], that is the mean of Θ t, satisfies the SDE

$$\displaystyle \begin{aligned} \mathrm{d}\mu_t = \frac{\sigma_t}{\gamma} \left( (A X_t^\dagger)^{\mathrm{T}} \mathrm{d}X^\dagger_t - \mu_t \,(A^{\mathrm{T}}A) : (X_t^\dagger \otimes X_t^\dagger) \,\mathrm{d}t\right), \end{aligned} $$
(33)

which simplifies to

$$\displaystyle \begin{aligned} \mathrm{d}\mu_t = \frac{\sigma_t}{\gamma} \left( (A X_t^\dagger)^{\mathrm{T}} \mathrm{d}X^\dagger_t - \mu_t \,(A^{\mathrm{T}}A) : C \,\mathrm{d}t\right), \end{aligned} $$
(34)

under the approximation (22). The initial condition is μ 0 = m prior. The evolution equation for the conditional variance, that is the variance of Θ t, is given by

$$\displaystyle \begin{aligned} \frac{\mathrm{d}}{\mathrm{d}t} \sigma_t = - \frac{\sigma_t^2}{\gamma} \,(A^{\mathrm{T}}A):(X_t^\dagger \otimes X_t^\dagger) \end{aligned} $$
(35)

with initial condition σ 0 = σ prior and which again reduces to

$$\displaystyle \begin{aligned} \frac{\mathrm{d}}{\mathrm{d}t} \sigma_t = - \frac{\sigma_t^2}{\gamma} \,(A^{\mathrm{T}}A):C \end{aligned} $$
(36)

under the approximation (22).

We now perform a frequentist analysis of the estimator μ t defined by (34) and (36), that is, we perform a weak analysis of the SDE (34) in terms of the first two moments of μ t [29]. In the first step, we take the expectation of (34) over all realisations \(X_t^\dagger \) of the SDE (2), which we denote by

$$\displaystyle \begin{aligned} m_t :=\mathbb{E}^\dagger[\mu_t] . \end{aligned} $$
(37)

The associated evolution equation is given by

$$\displaystyle \begin{aligned} \frac{\mathrm{d}}{\mathrm{d}t} m_t = \frac{\sigma_t}{\gamma} \,(A^{\mathrm{T}} A): \mathbb{E}^\dagger \left[X_t^\dagger \otimes X_t^\dagger\right] -\frac{\sigma_t}{\gamma} \,(A^{\mathrm{T}} A) : C \, m_t , \end{aligned} $$
(38)

which reduces to

$$\displaystyle \begin{aligned} \frac{\mathrm{d}}{\mathrm{d}t} m_t = \frac{\sigma_t}{\gamma} \,(A^{\mathrm{T}}A):C \,(1- m_t) = -\sigma_t \,(A^{\mathrm{T}}A): (A + A^{\mathrm{T}})^{-1}\,(1-m_t) . \end{aligned} $$
(39)

In the second step, we also look at the frequentist variance

$$\displaystyle \begin{aligned} p_t := \mathbb{E}^\dagger [(\mu_t-m_t)^2] . \end{aligned} $$
(40)

Using

$$\displaystyle \begin{aligned} \mathrm{d}(\mu_t-m_t) &= \frac{\sigma_t}{\gamma} \left\{ (A^{\mathrm{T}}A): \left( X_t^\dagger \otimes X_t^\dagger - C \right)\mathrm{d}t + \gamma^{1/2} (AX_t^\dagger)^{\mathrm{T}} \mathrm{d}W^\dagger_t \right\} \,\,- \end{aligned} $$
(41a)
$$\displaystyle \begin{aligned} & \qquad \qquad \frac{\sigma_t}{\gamma} (A^{\mathrm{T}} A): C \,(\mu_t-m_t)\mathrm{d}t , \end{aligned} $$
(41b)

we obtain

$$\displaystyle \begin{aligned} \frac{\mathrm{d}}{\mathrm{d}t} p_t &= -\frac{\sigma_t}{\gamma} \,(A^{\mathrm{T}}A):C \left(2p_t-\sigma_t\right)\,\,+ \end{aligned} $$
(42a)
$$\displaystyle \begin{aligned} &\qquad \qquad \quad \frac{2\sigma_t}{\gamma} \,(A^{\mathrm{T}} A): \mathbb{E}^\dagger \left[(X_t^\dagger \otimes X_t^\dagger-C) \,(\mu_t-m_t)\right] , \end{aligned} $$
(42b)

which we simplify to

$$\displaystyle \begin{aligned} \frac{\mathrm{d}}{\mathrm{d}t} p_t = \frac{\sigma_t}{\gamma} \,(A^{\mathrm{T}}A):C \left(\sigma_t - 2p_t \right) = -\sigma_t \,(A^{\mathrm{T}}A): (A + A^{\mathrm{T}})^{-1} \left(\sigma_t - 2p_t\right) \end{aligned} $$
(43)

under the approximation (22). The initial conditions are m 0 = m prior and p 0 = 0, respectively. We note that the differential equations (36) and (43) are explicitly solvable. For example, it holds that

$$\displaystyle \begin{aligned} \sigma_t = \frac{\sigma_0}{1 - (A^{\mathrm{T}}A) : (A^{\mathrm{T}} + A)^{-1}\, \sigma_0 t} \end{aligned} $$
(44)

and one finds that \(\sigma_t \sim -1/((A^{\mathrm{T}}A):(A^{\mathrm{T}}+A)^{-1}\,t)\) for t ≫ 1. Note that \(-(A^{\mathrm{T}}A):(A^{\mathrm{T}}+A)^{-1} = \gamma^{-1}(A^{\mathrm{T}}A):C > 0\) by (16). It can also be shown that p t ≤ σ t for all t ≥ 0. Furthermore, this analysis suggests that the learning rate in the stochastic gradient descent formulation (19) should be chosen as

$$\displaystyle \begin{aligned} \alpha_t = \min \left\{ \bar \alpha, -\frac{1}{(A^{\mathrm{T}}A) :(A^{\mathrm{T}}+A)^{-1}\,t} \right\} , \end{aligned} $$
(45)

where \(\bar \alpha >0\) denotes an initial learning rate; for example \(\bar \alpha = \sigma _0\).
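A small Python sketch (ours, using the matrix (87) from Sect. 5 for concreteness) illustrates the closed-form variance (44), written in terms of the positive constant \(c = \gamma^{-1}(A^{\mathrm{T}}A):C\), and the learning-rate schedule (45):

```python
import numpy as np

A = -0.5 * np.array([[1.0, -1.0], [1.0, 1.0]])    # example matrix (87)
gamma = 1.0
C = -gamma * np.linalg.inv(A + A.T)               # stationary covariance (16)
c = np.sum((A.T @ A) * C) / gamma                 # (A^T A) : C / gamma (here c = 1)

def sigma_t(t, sigma0):
    """Closed-form solution (44) of the variance equation (36)."""
    return sigma0 / (1.0 + c * sigma0 * t)

def learning_rate(t, alpha_bar):
    """Learning-rate schedule (45) for the gradient descent formulation (19)."""
    return alpha_bar if t == 0 else min(alpha_bar, 1.0 / (c * t))

# consistency check: Euler discretisation of (36) versus the closed form (44)
sigma, dt = 4.0, 1e-3
for _ in range(6000):
    sigma -= dt * c * sigma**2
print(sigma, sigma_t(6.0, 4.0))                   # both close to 0.16
```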

We finally conduct a formal analysis of the ensemble Kalman filter time-stepping (20) and demonstrate that the method is first-order accurate with regard to the implied frequentist mean m t. We recall (24) and conclude from (20) that the implied update on the variance σ n satisfies

$$\displaystyle \begin{aligned} \sigma_{n+1} = \sigma_n - \frac{\sigma_n^2}{\gamma} \,(A^{\mathrm{T}} A):C \varDelta t + \mathcal{O}(\varDelta t^2) , \end{aligned} $$
(46)

which provides a first-order approximation to (36).

We next analyse the evolution equation (34) for the conditional mean μ t and its numerical approximation

$$\displaystyle \begin{aligned} \mu_{n+1} = \mu_n + K_n \left\{ (X_{t_{n+1}}^\dagger -X_{t_n}^\dagger) - \mu_n AX_{t_n}^\dagger \varDelta t\right\} \end{aligned} $$
(47)

arising from (20). Here we follow [13] in order to analyse the impact of the data \(X_t^\dagger \) on the estimator. An in-depth theoretical treatment can be found in [9].

Comparing (47) to (34) and utilising (24), we find that the key quantity of interest is

$$\displaystyle \begin{aligned} J^\dagger_{t_n,t_{n+1}} := \int_{t_n}^{t_{n+1}} (AX_t^\dagger)^{\mathrm{T}} \mathrm{d}X_t^\dagger , \end{aligned} $$
(48)

which we can rewrite as

$$\displaystyle \begin{aligned} J^\dagger_{t_n,t_{n+1}} = A^{\mathrm{T}}: (X^\dagger_{t_n} \otimes X^\dagger_{t_n,t_{n+1}}) + A^{\mathrm{T}} : \mathbb{X}_{t_n,t_{n+1}}^\dagger\,. \end{aligned} $$
(49)

Here, motivated by (32) and following standard rough path notation, we have used

$$\displaystyle \begin{aligned} X^\dagger_{t_n,t_{n+1}} := X_{t_{n+1}}^\dagger -X_{t_n}^\dagger \end{aligned} $$
(50)

and the second-order iterated Itô integral

$$\displaystyle \begin{aligned} \mathbb{X}_{t_n,t_{n+1}}^\dagger := \int_{t_n}^{t_{n+1}} (X^\dagger_t - X^\dagger_{t_n})\otimes \mathrm{d}X_t^\dagger . \end{aligned} $$
(51)

The difference between the integral (48) and its corresponding approximation in (47) is provided by \(A^{\mathrm {T}} : \mathbb {X}_{t_n,t_{n+1}}^\dagger \) plus higher-order terms arising from (24). The iterated integral \(\mathbb {X}^\dagger _{t_n,t_{n+1}}\) becomes a random variable from the frequentist perspective. Taking note of (2), we find that the drift, \(f^\dagger(x) = Ax\), contributes terms of order \(\mathcal {O}(\varDelta t^2)\) to \(\mathbb {X}^\dagger _{t_n,t_{n+1}}\) and the expected value of \(\mathbb {X}^\dagger _{t_n,t_{n+1}}\) therefore satisfies

$$\displaystyle \begin{aligned} \mathbb{E}^\dagger [\mathbb{X}^\dagger_{t_n,t_{n+1}}] = \mathcal{O}(\varDelta t^2) , \end{aligned} $$
(52)

since \(\mathbb {E}^\dagger [W^\dagger _{t_n,\tau }]= 0\) for τ > t n, and

$$\displaystyle \begin{aligned} \mathbb{E}^\dagger [\mathbb{W}_{t_n,t_{n+1}}^\dagger] = \frac{1}{2} \mathbb{E}^\dagger [ W^\dagger_{t_n,t_{n+1}} \otimes W^\dagger_{t_n,t_{n+1}} - [W^\dagger_{t_n},W^\dagger_{t_n,t_{n+1}}] ] - \frac{\varDelta t}{2}I = 0 , \end{aligned} $$
(53)

where we have introduced the commutator

$$\displaystyle \begin{aligned}{}[W_{t_n}^\dagger, W_{t_n,t_{n+1}}^\dagger] := W^\dagger_{t_n}\otimes W^\dagger_{t_n,t_{n+1}} - W^\dagger_{t_n,t_{n+1}}\otimes W_{t_n}^\dagger . \end{aligned} $$
(54)

Hence we find that, while (47) is not a first-order (strong) approximation of the SDE (34), the approximation becomes first-order in m t when averaged over realisations \(X_t^\dagger \) of the SDE (2). More precisely, one obtains

$$\displaystyle \begin{aligned} \mathbb{E}^\dagger [J^\dagger_{t_n,t_{n+1}}] = (A^{\mathrm{T}}A) : C \varDelta t + \mathcal{O}(\varDelta t^2) . \end{aligned} $$
(55)
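A quick Monte-Carlo sanity check of (52) and (55) can be run as follows (a sketch under stationary initial conditions; all names are ours):

```python
import numpy as np

def mean_of_J(A, C, gamma=1.0, dt=0.05, n_inner=200, n_samples=20000, seed=1):
    """Estimate E[J_{0,dt}] by averaging left-point sums (31) over sample paths of (2)."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    tau = dt / n_inner
    Lc = np.linalg.cholesky(C)
    X = (Lc @ rng.standard_normal((d, n_samples))).T       # X_0 ~ N(0, C), one row per path
    J = np.zeros(n_samples)
    for _ in range(n_inner):
        dW = np.sqrt(gamma * tau) * rng.standard_normal((n_samples, d))
        dX = X @ A.T * tau + dW                            # Euler-Maruyama step of (2)
        J += np.einsum('ij,nj,ni->n', A, X, dX)            # left-point contribution
        X += dX
    return J.mean(), np.sum(A * (A @ C)) * dt              # compare with (A^T A : C) dt, cf. (55)

A = -0.5 * np.array([[1.0, -1.0], [1.0, 1.0]])             # matrix (87), for which C = I
print(mean_of_J(A, np.eye(2)))                             # the two values should nearly agree
```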

We note that the modified scheme (30) leads to the same time evolution in the variance σ n while the update in μ n is changed to

$$\displaystyle \begin{aligned} \mu_{n+1} = \mu_n + \frac{\sigma_n}{\gamma} \int_{t_n}^{t_{n+1}} (AX_t^\dagger)^{\mathrm{T}} \mathrm{d} X_t^{\dagger} - K_n A X_{t_n}^\dagger \mu_n \varDelta t . \end{aligned} $$
(56)

This modification results in a more accurate evolution of the conditional mean μ n but, because of (52), it does not impact the evolution of the underlying frequentist mean, \(m_n = \mathbb {E}^\dagger [\mu _n]\), to leading order. We summarise our findings in the following proposition.

Proposition 1

The discrete-time EnKBF implementations (20) and (30) both provide first-order approximations to the time evolution of the frequentist mean, m t , and the frequentist variance, p t . In other words, both methods converge weakly with order one.

We also note that the frequentist uncertainty is essentially data-independent and depends only on the time window [0, T] over which the data is observed. Hence, for fixed observation interval [0, T], it makes sense to choose the step-size Δt such that the discretisation error (bias) remains of the same order of magnitude as \(p_T^{1/2} \approx \sigma _T^{1/2}\). Selecting a much smaller step-size would not significantly reduce the frequentist estimation error in the conditional estimator μ T.

Remark 5

We can now give a precise reformulation of the approximation (29):

$$\displaystyle \begin{aligned} \frac{1}{2} \mathbb{E}^\dagger \left[ A^{\mathrm{T}}: (X_{t_n,t_{n+1}}^\dagger \otimes X_{t_n,t_{n+1}}^\dagger )\right] = \frac{\varDelta t \,\gamma}{2} \mbox{tr}\,(A) + \mathcal{O}(\varDelta t^2), \end{aligned} $$
(57)

which is at the heart of the Stratonovich formulation (25) of the EnKBF [9].

4 Multi-Scale Data

We now have all the material in place to study the dependency of the EnKBF estimator on a set of observations \(X_t^{(\epsilon )}\), 𝜖 > 0, which approach the theoretical \(X_t^\dagger \) with respect to the uniform norm topology on the space of continuous functions as 𝜖 → 0. Since the second-order contribution in (32), that is (51), does not depend continuously on such perturbations, we demonstrate in this section that a systematic bias arises in the EnKBF. Furthermore, we show how the bias can be eliminated either via subsampling the data, which effectively amounts to ignoring these second-order contributions, or via an appropriate correction term, which ensures a continuous dependence on observations \(X_t^{(\epsilon )}\) with respect to the uniform norm topology. More specifically, we investigate the impact of a possible discrepancy between the SDE model (1), for which we aim to estimate the parameter θ, and the data generating SDE (2). We therefore replace (2) by the following two-scale SDE [17]:

$$\displaystyle \begin{aligned} \mathrm{d}X^{(\epsilon)}_t&= AX^{(\epsilon)}_t\,\mathrm{d}t + \frac{\gamma^{1/2}}{\epsilon} M P^{(\epsilon)}_t\,\mathrm{d}t, \end{aligned} $$
(58a)
$$\displaystyle \begin{aligned} \mathrm{d}P^{(\epsilon)}_t&= -\frac{1}{\epsilon} M P^{(\epsilon)}_t\,\mathrm{d}t + \mathrm{d}W^\dagger_t, \end{aligned} $$
(58b)

where

$$\displaystyle \begin{aligned} M = \left( \begin{array}{cc} 1 & \beta \\ -\beta & 1\end{array} \right), \end{aligned} $$
(59)

β = 2 and 𝜖 = 0.01. The dimension of the state space is d = 2 throughout this section. While we restrict attention here to the simple two-scale model (58), similar scenarios can arise from deterministic fast-slow systems [24, 7].
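A simple Euler–Maruyama simulation of the two-scale system (58)-(59) can be sketched as follows (names ours; the inner step has to resolve the O(𝜖) time scale of the fast process):

```python
import numpy as np

def simulate_two_scale(A, gamma=1.0, beta=2.0, eps=0.01, T=10.0, dt=1e-4, seed=0):
    """Euler-Maruyama simulation of (58)-(59) for d = 2; returns the slow path X."""
    assert A.shape == (2, 2)
    rng = np.random.default_rng(seed)
    M = np.array([[1.0, beta], [-beta, 1.0]])              # matrix (59)
    n = int(round(T / dt))
    X = np.zeros((n + 1, 2))
    P = np.zeros(2)
    for k in range(n):
        dW = np.sqrt(dt) * rng.standard_normal(2)
        X[k + 1] = X[k] + (A @ X[k] + np.sqrt(gamma) / eps * M @ P) * dt   # slow equation (58a)
        P = P - (M @ P) / eps * dt + dW                                    # fast equation (58b)
    return X
```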

The associated EnKBF mean-field equations in the parameter Θ t, which we now denote by \(\varTheta _t^{(\epsilon )}\) in order to explicitly record its dependence on the scale parameter 𝜖 ≪ 1, become

$$\displaystyle \begin{aligned} \mathrm{d} \varTheta_t^{(\epsilon)} &=\frac{\sigma_t^{(\epsilon)}}{\gamma} (A X^{(\epsilon)}_t)^{\mathrm{T}} \mathrm{d}I_t^{(\epsilon)}, \end{aligned} $$
(60a)
$$\displaystyle \begin{aligned} \mathrm{d} I_t^{(\epsilon)} &= \mathrm{d}X_t^{(\epsilon)} - \frac{1}{2} \left( \varTheta_t^{(\epsilon)} + \pi_t^{(\epsilon)} [\theta] \right) A X_t^{(\epsilon)} \mathrm{d}t , \end{aligned} $$
(60b)

with variance

$$\displaystyle \begin{aligned} \sigma_t^{(\epsilon)} = \pi_t^{(\epsilon)} \left[(\theta - \pi_t^{(\epsilon)} [\theta])^2 \right] \end{aligned} $$
(61)

and \(\varTheta ^\epsilon _t \sim \pi _t^{(\epsilon )}\). The discrete-time mean-field EnKBF (20) turns into

$$\displaystyle \begin{aligned} \varTheta_{n+1}^{(\epsilon)} = \varTheta_n^{(\epsilon)} + K_n^{(\epsilon)} \left\{ \left(X_{t_{n+1}}^{(\epsilon)} - X_{t_n}^{(\epsilon)} \right) - \frac{1}{2} \left( \varTheta_n^{(\epsilon)} + \pi_n^{(\epsilon)}[\theta] \right) A X_{t_n}^{(\epsilon)} \varDelta t \right\} \end{aligned} $$
(62)

with Kalman gain

$$\displaystyle \begin{aligned} K_n^{(\epsilon)} = \sigma_n^{(\epsilon)} (AX_{t_n}^{(\epsilon)})^{\mathrm{T}} \left( \gamma + \varDelta t \sigma_n^{(\epsilon)} (AX_{t_n}^{(\epsilon)})^{\mathrm{T}} AX_{t_n}^{(\epsilon)} \right)^{-1}\,. \end{aligned} $$
(63)

We also consider the appropriately modified scheme (30):

$$\displaystyle \begin{aligned} \varTheta_{n+1}^{(\epsilon)} = \varTheta_n^{(\epsilon)} + \frac{\sigma_n^{(\epsilon)}}{\gamma} \int_{t_n}^{t_{n+1}} (AX_t^{(\epsilon)})^{\mathrm{T}} \mathrm{d} X_t^{(\epsilon)} - \frac{1}{2} K_n^{(\epsilon)} A X_{t_n}^{(\epsilon)} \left( \varTheta_n^{(\epsilon)} + \pi_n^{(\epsilon)}[\theta] \right) \varDelta t. \end{aligned} $$
(64)

In order to understand the impact of the modified data generating process on the two mean-field EnKBF formulations (62) and (64), respectively, we follow [17] and investigate the difference between \(X^{(\epsilon )}_t\) and \(X^\dagger _t\):

$$\displaystyle \begin{aligned} \mathrm{d} (X^{(\epsilon)}_t - X^\dagger_t) &= A (X^{(\epsilon)}_t - X^\dagger_t)\mathrm{d}t + \frac{\gamma^{1/2}}{\epsilon}M P_t^{(\epsilon)} \mathrm{d}t -\gamma^{1/2} \mathrm{d}W_t^\dagger \end{aligned} $$
(65a)
$$\displaystyle \begin{aligned} &= A (X_t^{(\epsilon)} - X_t^\dagger)\mathrm{d}t - \gamma^{1/2}\mathrm{d}P_t^{(\epsilon)} . \end{aligned} $$
(65b)

When \(P^{(\epsilon )}_t\) is stationary, it is Gaussian with mean zero and covariance

$$\displaystyle \begin{aligned} \mathbb{E}_{\mathrm{stat}} \left[P_t^{(\epsilon)} \otimes P_t^{(\epsilon)}\right] =\epsilon \,(M + M^{\mathrm{T}})^{-1} = \frac{\epsilon}{2} I . \end{aligned} $$
(66)

Hence \(P^{(\epsilon )}_t \rightarrow 0\) as 𝜖 → 0 and also

$$\displaystyle \begin{aligned} X^{(\epsilon)}_t \rightarrow X^\dagger_t \end{aligned} $$
(67)

in \(L^2\) uniformly in t, provided \(\sigma (A)\subset \mathbb {C}_-\) and \(X^{(\epsilon )}_0 = X^{\dagger }_0\). This is illustrated in Fig. 1.

Fig. 1

SDE driven by mathematical vs. physical Brownian motion (𝜖 = 0.01). The top panel displays both \(X_t^\dagger \) (blue) and \(X_t^{(\epsilon )}\) (red) over the long time interval t ∈ [0, 10], while the lower panel provides a zoomed in perspective over the interval t ∈ [0, 1]

In order to investigate the problem further, we study the integral

$$\displaystyle \begin{aligned} J^{(\epsilon)}_{t_n,t_{n+1}} := \int_{t_n}^{t_{n+1}} (AX_t^{(\epsilon)})^{\mathrm{T}} \mathrm{d}X_t^{(\epsilon)} \end{aligned} $$
(68)

and its relation to (48). As for (48), we can rewrite (68) as

$$\displaystyle \begin{aligned} J^{(\epsilon)}_{t_n,t_{n+1}} = A^{\mathrm{T}}: (X^{(\epsilon)}_{t_n} \otimes X^{(\epsilon)}_{t_n,t_{n+1}}) + A^{\mathrm{T}} : \mathbb{X}_{t_n,t_{n+1}}^{(\epsilon)}. \end{aligned} $$
(69)

We now investigate the limit of the second-order iterated integral

$$\displaystyle \begin{aligned} \mathbb{X}^{(\epsilon)}_{t_n,t_{n+1}} &= \int_{t_n}^{t_{n+1}} X^{(\epsilon)}_{t_n,t}\otimes \mathrm{d}X_t^{(\epsilon)} \end{aligned} $$
(70a)
$$\displaystyle \begin{aligned} &= \frac{1}{2} X_{t_n,t_{n+1}}^{(\epsilon)} \otimes X_{t_n,t_{n+1}}^{(\epsilon)} - \frac{1}{2} \int_{t_n}^{t_{n+1}}[ X_{t_n,t}^{(\epsilon)},\mathrm{d}X_{t}^{(\epsilon)} ] \end{aligned} $$
(70b)

as 𝜖 → 0 [17]. Here [., .] denotes the commutator defined by (54).

Proposition 2

The second-order iterated integral \(\mathbb {X}^{(\epsilon )}_{t_n,t_{n+1}}\) satisfies

$$\displaystyle \begin{aligned} \lim_{\epsilon \to 0} \mathbb{X}^{(\epsilon)}_{t_n,t_{n+1}} = \mathbb{X}^\dagger_{t_n,t_{n+1}}+ \frac{\varDelta t\,\gamma}{2} M \end{aligned} $$
(71)

Proof

The proof follows [17] and can be summarised as follows:

$$\displaystyle \begin{aligned} \mathbb{X}^{(\epsilon)}_{t_n,t_{n+1}} &= \int_{t_n}^{t_{n+1}} X^{(\epsilon)}_{t_n,t}\otimes \mathrm{d}X^{(\epsilon)}_t \end{aligned} $$
(72a)
$$\displaystyle \begin{aligned} &\rightarrow \int_{t_n}^{t_{n+1}} X^\dagger_{t_n,t} \otimes \mathrm{d} X^\dagger_t- \gamma^{1/2} \int_{t_n}^{t_{n+1}} X^{(\epsilon)}_{t_n,t} \otimes \mathrm{d} P^{(\epsilon)}_t \end{aligned} $$
(72b)
$$\displaystyle \begin{aligned} &= \mathbb{X}^\dagger_{t_n,t_{n+1}} - \gamma^{1/2} X^{(\epsilon)}_{t_n,t_{n+1}} \otimes P^{(\epsilon)}_{t_{n+1}} + \gamma^{1/2} \int_{t_n}^{t_{n+1}} \mathrm{d}X^{(\epsilon)}_t \otimes P^{(\epsilon)}_t \end{aligned} $$
(72c)
$$\displaystyle \begin{aligned} & \rightarrow \mathbb{X}^\dagger_{t_n,t_{n+1}} + \gamma^{1/2} \int_{t_n}^{t_{n+1}} \left\{ A X_t^{(\epsilon)} + \frac{\gamma^{1/2}}{\epsilon} M P^{(\epsilon)}_t \right\} \otimes P^{(\epsilon)}_t \mathrm{d}t \end{aligned} $$
(72d)
$$\displaystyle \begin{aligned} & \rightarrow \mathbb{X}^\dagger_{t_n,t_{n+1}} + \frac{\varDelta t\,\gamma }{\epsilon}M \,\mathbb{E}_{\mathrm{stat}}\left[ P_{t_n}^{(\epsilon)} \otimes P_{t_n}^{(\epsilon)} \right] \end{aligned} $$
(72e)
$$\displaystyle \begin{aligned} &= \mathbb{X}^\dagger_{t_n,t_{n+1}}+ \frac{\varDelta t\,\gamma}{2} M. \end{aligned} $$
(72f)

As already discussed in detail in [9], Proposition 2 implies that the scheme (64) does not, in general, converge to the scheme (30) as 𝜖 → 0 since

$$\displaystyle \begin{aligned} J^\dagger_{t_n,t_{n+1}} = \lim_{\epsilon \to 0} J^{(\epsilon)}_{t_n,t_{n+1}} - \frac{\varDelta t\,\gamma}{2} A^{\mathrm{T}} :M\,. \end{aligned} $$
(73)

This observation suggests the following modification

$$\displaystyle \begin{aligned} \varTheta_{n+1}^{(\epsilon)} &= \varTheta_n^{(\epsilon)} + \frac{\sigma_n^{(\epsilon)}}{\gamma} \int_{t_n}^{t_{n+1}} (AX_t^{(\epsilon)})^{\mathrm{T}} \mathrm{d} X_t^{(\epsilon)} - \frac{\varDelta t}{2} \sigma_n^{(\epsilon)} \,A^{\mathrm{T}} : M\,\,- \end{aligned} $$
(74a)
$$\displaystyle \begin{aligned} & \quad \qquad \qquad \frac{1}{2} K_n^{(\epsilon)} A X_{t_n}^{(\epsilon)} \left( \varTheta_n^{(\epsilon)} + \pi_n^{(\epsilon)}[\theta] \right) \varDelta t \end{aligned} $$
(74b)

to (64). Please note that it follows from (70) that

$$\displaystyle \begin{aligned} \int_{t_n}^{t_{n+1}} (AX_t^{(\epsilon)})^{\mathrm{T}} \mathrm{d} X_t^{(\epsilon)} = A^{\mathrm{T}} : \left( X_{t_{n+1/2}}^{(\epsilon)} \otimes X_{t_n,t_{n+1}}^{(\epsilon)} - \frac{1}{2} \int_{t_n}^{t_{n+1}}[ X_{t_n,t}^{(\epsilon)},\mathrm{d}X_{t}^{(\epsilon)} ] \right). \end{aligned} $$
(75)

Proposition 3

The discrete-time EnKBF (62) converges to (20) for fixed Δt as 𝜖 → 0. Similarly, (74) converges to (30) under the same limit.

Proof

The first statement follows from \(\sigma _n^{(\epsilon )} = \sigma _n\), the limiting behaviour (67), and

$$\displaystyle \begin{aligned} \lim_{\epsilon \to 0} K_n^{(\epsilon)} = K_n. \end{aligned} $$
(76)

The second statement additionally requires (73) to be substituted into (74) when taking the limit 𝜖 → 0.

Remark 6

The analogous adaptation of (74) to the gradient descent formulation (19) with \(X_t^\dagger \) replaced by \(X_t^{(\epsilon )}\) becomes

$$\displaystyle \begin{aligned} \theta_{n+1}^{(\epsilon)} &= \theta_n^{(\epsilon)} + \frac{\alpha_{t_n}}{\gamma} \left( \int_{t_n}^{t_{n+1}} (AX_t^{(\epsilon)})^{\mathrm{T}} \mathrm{d}X_t^{(\epsilon)} - \frac{\gamma \varDelta t}{2} A^{\mathrm{T}} : M \,\,- \right. \end{aligned} $$
(77a)
$$\displaystyle \begin{aligned} & \qquad \qquad \left. \theta_n^{(\epsilon)} (AX_{t_n}^{(\epsilon)})^{\mathrm{T}} A X_{t_n}^{(\epsilon)} \varDelta t \right). \end{aligned} $$
(77b)

Alternatively, subsampling the data can be applied which leads to the simpler formulation

$$\displaystyle \begin{aligned} \theta_{n+1}^{(\epsilon)} = \theta_n^{(\epsilon)} + \frac{\alpha_{t_n}}{\gamma} (AX_{t_n}^{(\epsilon)})^{\mathrm{T}} \left( (X_{t_{n+1}}^{(\epsilon)}-X_{t_n}^{(\epsilon)}) - \theta_n^{(\epsilon)} A X_{t_n}^{(\epsilon)} \varDelta t \right). \end{aligned} $$
(78)

Remark 7

A two-scale SDE, closely related to (58), has been investigated in [8] in terms of the time integrated autocorrelation function of \(P_t^{(\epsilon )}\) and modified stochastic integrals. In our case, the modified quadrature rule, here denoted by ◇, has to satisfy

$$\displaystyle \begin{aligned} \int_{t_n}^{t_{n+1}} (AX_t^\dagger)^{\mathrm{T}} \diamond \mathrm{d}X_t^\dagger = \lim_{\epsilon \to 0} \int_{t_n}^{t_{n+1}} (AX_t^{(\epsilon)})^{\mathrm{T}} \mathrm{d}X_t^{(\epsilon)} , \end{aligned} $$
(79)

and it is therefore related to the standard Itô integral via

$$\displaystyle \begin{aligned} \int_{t_n}^{t_{n+1}} (AX_t^\dagger)^{\mathrm{T}} \diamond \mathrm{d}X_t^\dagger = \int_{t_n}^{t_{n+1}} (AX_t^\dagger)^{\mathrm{T}} \mathrm{d}X_t^\dagger + \frac{\varDelta t \gamma }{2} A^{\mathrm{T}} : M. \end{aligned} $$
(80)

Hence M plays the role of the integrated autocorrelation function of \(P_t^{(\epsilon )}\) in our approach. We note that the modified quadrature rule reduces to the standard Stratonovich integral if either β = 0 in (59) or A is symmetric. While the results from [8] could, therefore, also be used as a starting point for discussing the induced estimation bias, practical implementations would still require knowledge of the integrated autocorrelation function of \(P_t^{(\epsilon )}\) or, equivalently, the estimation of M in addition to observing \(X_t^{(\epsilon )}\). We address this aspect next.

The numerical implementation of (74) requires an estimator for the generally unknown M in (73). This task is challenging as we only have access to \(X_t^{(\epsilon )}\) without any explicit knowledge of the underlying generating process (58). While the estimator proposed in [9] is based on the idea of subsampling the data, the frequentist perspective taken in this note suggests the alternative estimator M est defined by

$$\displaystyle \begin{aligned} \frac{\varDelta t \,\gamma}{2} M_{\mathrm{est}} =\mathbb{E}^\dagger [ \mathbb{X}^{(\epsilon)}_{t_n,t_{n+1}}], \end{aligned} $$
(81)

which follows from (72f) and (52). That is, \(\mathbb {E}^\dagger [\mathbb {X}^\dagger _{t_n,t_{n+1}}] = \mathcal {O}(\varDelta t^2)\) for Δt sufficiently small. Note that the second-order iterated integral \(\mathbb{X}_{t_n,t_{n+1}}^{(\epsilon )}\) satisfies (70) and is therefore easy to compute. In practice, the frequentist expectation value can be replaced by an approximation along a given single observation path \(X^{(\epsilon )}_t\), t ∈ [0, T], under the assumption of ergodicity.
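Under this ergodicity assumption, the estimator (81) can be approximated from a single observed path as in the following sketch (names ours); applied to data from the two-scale model (58), one would expect the result to approach the matrix (59) for small 𝜖:

```python
import numpy as np

def estimate_M(X, dt_outer, dt_inner, gamma=1.0):
    """Time-averaged version of (81): average the iterated integral (70)
    over consecutive windows of length dt_outer along one path X sampled
    at the inner resolution dt_inner."""
    L = int(round(dt_outer / dt_inner))                    # inner steps per outer window
    d = X.shape[1]
    acc = np.zeros((d, d))
    n_windows = (X.shape[0] - 1) // L
    for n in range(n_windows):
        seg = X[n * L:(n + 1) * L + 1]
        dseg = np.diff(seg, axis=0)
        acc += np.einsum('lj,li->ji', seg[:-1] - seg[0], dseg)   # iterated integral over window
    return 2.0 * acc / (n_windows * dt_outer * gamma)      # solve (81) for M_est
```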

An appropriate choice of the outer or sub-sampling step-size Δt [27] constitutes an important aspect for the practical implementation of the EnKBF formulation (62) for finite values of 𝜖 > 0 [26]. Consistency of the second-order iterated integrals [13] implies

$$\displaystyle \begin{aligned} \mathbb{X}^{(\epsilon)}_{t_n,t_{n+2}} = \mathbb{X}^{(\epsilon)}_{t_n,t_{n+1}} + \mathbb{X}^{(\epsilon)}_{t_{n+1},t_{n+2}} + X^{(\epsilon)}_{t_n,t_{n+1}} \otimes X^{(\epsilon)}_{t_{n+1},t_{n+2}} . \end{aligned} $$
(82)

A sensible choice of Δt is dictated by

$$\displaystyle \begin{aligned} \mathbb{E}^\dagger \left[ X^{(\epsilon)}_{t_n,t_{n+1}} \otimes X^{(\epsilon)}_{t_{n+1},t_{n+2}}\right] = \mathcal{O}(\varDelta t^2) \,, \end{aligned} $$
(83)

that is, the sub-sampled data \(X_{t_n}^{(\epsilon )}\) behaves to leading order like solution increments from the reference model (2) at scale Δt independent of the specific value of 𝜖. Note that, on the other hand,

$$\displaystyle \begin{aligned} \mathbb{E}^\dagger \left[ X^{(\epsilon)}_{\tau_l,\tau_{l+1}} \otimes X^{(\epsilon)}_{\tau_{l+1},\tau_{l+2}}\right] = \mathcal{O}( \epsilon^{-1} \varDelta \tau^2) \end{aligned} $$
(84)

for an inner step-size Δτ ∼ 𝜖. In other words, a suitable step-size Δt > 0 can be defined by making

$$\displaystyle \begin{aligned} h(\varDelta t) := \varDelta t^{-2} \left\|\mathbb{E}^\dagger \left[ X^{(\epsilon)}_{t_n,t_{n+1}} \otimes X^{(\epsilon)}_{t_{n+1},t_{n+2}}\right] \right\| \end{aligned} $$
(85)

as small as possible while still guaranteeing an accurate numerical approximation in (62).
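The diagnostic (85) can likewise be estimated from a single path by replacing the frequentist expectation with a time average, as in the following sketch (names ours):

```python
import numpy as np

def h_of_dt(X, dt_inner, n_sub):
    """Estimate (85) for the outer step dt = n_sub * dt_inner along one path X."""
    dt = n_sub * dt_inner
    inc = np.diff(X[::n_sub], axis=0)                      # increments over the outer step
    # time average replacing the expectation of consecutive increment products
    prod = np.einsum('nj,ni->ji', inc[:-1], inc[1:]) / (inc.shape[0] - 1)
    return np.linalg.norm(prod) / dt**2
```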

Remark 8

The choice of the outer time step Δt is less critical for the EnKBF formulation (74) since it does not rely on sub-sampling the data and is robust with regard to perturbations in the data provided the appropriate M is explicitly available or has been estimated from the available data using (81). Furthermore, if A is symmetric, then it follows from (75) and the skew-symmetry of the commutator [., .] that

$$\displaystyle \begin{aligned} \int_{t_n}^{t_{n+1}} (AX_t^{(\epsilon)})^{\mathrm{T}} \mathrm{d} X_t^{(\epsilon)} = A: \left(X^{(\epsilon)}_{t_{n+1/2}}\otimes X^{(\epsilon)}_{t_n,t_{n+1}}\right), \end{aligned} $$
(86)

which can be used in (74). The same simplification arises when M is symmetric. This insight is at the heart of the geometric rough path approach followed in [9] and which starts from the Stratonovich formulation (25) of the EnKBF. See also [28] on the convergence of Wong–Zakai approximations for stochastic differential equations. In all other cases, a more refined numerical approximation of the data-driven integral in (74) is necessary; such as, for example, (31). For that reason, we rely on the Itô/Euler–Maruyama interpretation of (68) in this note instead, that is the approximation (12).

5 Numerical Example

We consider the linear SDE (2) with γ = 1 and

$$\displaystyle \begin{aligned} A = \frac{-1}{2} \left( \begin{array}{cc} 1 & -1 \\ 1 & 1 \end{array} \right). \end{aligned} $$
(87)

We find that C = I and \(A^{\mathrm{T}}A = \frac{1}{2}I\). Hence \((A^{\mathrm{T}}A):C = 1\), and the posterior variance simply satisfies \(\sigma_t = \sigma_0/(1+\sigma_0 t)\) according to (44). We set m prior = 0 and σ prior = 4 for the Gaussian prior distribution of Θ 0, and the observation interval is [0, T] with T = 6. We find that σ T = 0.16. Solving (39) for given σ t with initial condition m 0 = 0 yields

$$\displaystyle \begin{aligned} m_t = 1 - \frac{\sigma_t}{\sigma_0} \end{aligned} $$
(88)

and m T = 0.96. The corresponding curves are displayed in red in Fig. 2.

Fig. 2

(a, b) Frequentist mean, m t, and variance, p t, from EnKBF implementation (20) with step-size Δt = 0.06; (c, d) same results from EnKBF implementation (30) with inner time-step Δτ = Δt∕600. We also display the curves arising for σ t and m t from the standard Kalman theory using the approximation (22). Note that the posterior variance, σ t, should provide an upper bound on the frequentist uncertainty p t

We implement the EnKBF schemes (20) and (30) with t n = n Δt. The inner time-step is \(\varDelta\tau = 10^{-4}\) while Δt = 0.06, that is, L = 600. We repeat the experiment \(N = 10^4\) times and compare the outcome with the predicted mean value of m T = 0.96 and the posterior variance of σ T = 0.16 in Fig. 2. The differences in the computed time evolutions of m t and p t are rather minor and support the idea that it is not necessary to assimilate continuous-time data beyond Δt. We also find that the simple prediction (88), based on standard Kalman filter theory, is not very accurate for this low-dimensional problem (d = 2). The corresponding approximation for σ t provides, however, a good upper bound for p t.
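For reference, the experiment can be reproduced in outline with the following sketch (ours). For the linear model, the EnKBF update (20)-(21) propagates a Gaussian law exactly, so we evolve the conditional mean and variance directly instead of an ensemble, and we use a coarser inner step than in the figures:

```python
import numpy as np

def enkbf_linear_experiment(n_paths=2000, T=6.0, dt=0.06, dtau=1e-3, gamma=1.0,
                            m_prior=0.0, sigma_prior=4.0, seed=0):
    """Frequentist mean and variance of mu_T for the EnKBF (20)-(21), data from (2)."""
    rng = np.random.default_rng(seed)
    A = -0.5 * np.array([[1.0, -1.0], [1.0, 1.0]])         # matrix (87), so C = I
    L = int(round(dt / dtau))
    X = rng.standard_normal((n_paths, 2))                  # X_0 ~ N(0, C) = N(0, I)
    mu = np.full(n_paths, m_prior)
    sigma = np.full(n_paths, sigma_prior)
    for n in range(int(round(T / dt))):
        Xn = X.copy()
        for _ in range(L):                                 # fine Euler-Maruyama integration of (2)
            X = X + X @ A.T * dtau + np.sqrt(gamma * dtau) * rng.standard_normal((n_paths, 2))
        AX = Xn @ A.T                                      # A X_{t_n}, one row per path
        denom = gamma + dt * sigma * np.einsum('ni,ni->n', AX, AX)
        K = sigma[:, None] * AX / denom[:, None]           # gain (21)
        mu = mu + np.einsum('ni,ni->n', K, (X - Xn) - mu[:, None] * AX * dt)   # mean update (47)
        sigma = sigma * (1.0 - 0.5 * dt * np.einsum('ni,ni->n', K, AX)) ** 2   # variance implied by (20)
    return mu.mean(), mu.var()                             # compare with m_T = 0.96 and sigma_T = 0.16

print(enkbf_linear_experiment())
```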

We now replace the data generating SDE model (2) by the multi-scale formulation (58) with 𝜖 = 0.01 and β = 2. This parameter choice agrees with the one used in [9]. We again find that assimilating the data at the slow time-scale Δt = 0.06 leads to results very similar to those obtained from an assimilation at the fast time-scale \(\varDelta\tau = 10^{-4}\) with the EnKBF formulation (74), provided the correction term resulting from the second-order iterated integral (73) is included (see Fig. 3). We also verified numerically that Δt = 0.06 constitutes a nearly optimal step-size in the sense of making (85) sufficiently small while maintaining numerical accuracy. For example, reducing the outer step-size to Δt = 0.02 leads to h(0.02) − h(0.06) ≈ 10 in (85).

Fig. 3

Same experimental setting as in Fig. 2 but with the data now generated from the multi-scale SDE (58). Again, subsampling the data in intervals of Δt = 0.06 and high-frequency assimilation with step-size \(\varDelta\tau = 10^{-4}\) lead to very similar results in terms of their frequentist means and variances

6 Conclusions

In this follow-up note to [9], we have investigated the impact of subsampling and/or high-frequency data assimilation on the corresponding conditional mean estimators, μ t, both for data generated from the standard SDE model and for data generated from a modified multi-scale SDE. A frequentist analysis supports the basic finding that both approaches lead to comparable results provided that the systematic biases due to different second-order iterated integrals are properly accounted for. While the EnKBF is relatively easy to analyse and a full rough path approach can be avoided, extending these results to the nonlinear feedback particle filter [26, 9] will prove more challenging. Extensions to systems without a strong scale separation [4, 31] and applications to geophysical fluid dynamics [22, 12] are also of interest. In this context, the approximation quality of the proposed estimator (81) and the choice of the step-size Δt following (85) (and potentially Δτ) will be of particular interest. Finally, while we have investigated the univariate parameter estimation problem, a semi-parametric parametrisation of the drift term f in (1), such as via random feature maps [21], leads to high-dimensional parameter estimation problems and to questions about their statistics [19, 20]. This provides another fertile direction for future research.