Drift Estimation of Multiscale Diffusions Based on Filtered Data

We study the problem of drift estimation for two-scale continuous time series. We set ourselves in the framework of overdamped Langevin equations, for which a single-scale surrogate homogenized equation exists. In this setting, estimating the drift coefficient of the homogenized equation requires pre-processing of the data, often in the form of subsampling; this is because the two-scale equation and the homogenized single-scale equation are incompatible at small scales, generating mutually singular measures on the path space. We avoid subsampling and work instead with filtered data, found by application of an appropriate kernel function, and compute maximum likelihood estimators based on the filtered process. We show that the estimators we propose are asymptotically unbiased and demonstrate numerically the advantages of our method with respect to subsampling. Finally, we show how our filtered data methodology can be combined with Bayesian techniques and provide a full uncertainty quantification of the inference procedure.


Introduction
Efficient parameter estimation for stochastic models is essential in a wide range of applications in natural and social sciences. In several areas, the data originate from phenomena which vary continuously in time and which are endowed with a multiscale structure. This is the case, for example, in molecular dynamics, oceanography and atmosphere science or in econometrics. Frequently, it is desirable in these areas to infer from data a simpler model which captures effectively large-scale structures, or slow variations, disregarding small-scale fluctuations or treating them as a source of noise. The mismatch between the data and their desired slow-scale representation is a typical instance of a problem of model misspecification, which, if ignored or handled incorrectly, can lead to erroneous inference. Indeed, the data, coming from the full dynamics, are compatible with the coarse-grained model only at the time scales at which the effective dynamics is valid.
In this paper we consider a simple multiscale setting arising from models of molecular dynamics, with a complete separation between the fast and the slow scale. In particular, we consider diffusion processes for motion in a confining potential which has slow variations with rapid order-one oscillations superimposed. Given data in the form of a sample path from this simple class of model problems, we are interested in determining the drift coefficient of an equation of the overdamped Langevin type in which the fast-scale potential is eliminated. The theory of homogenization guarantees that such a single-scale equation can be uniquely determined, and our goal is therefore to obtain effective coarse-grained dynamics from data consistently with respect to the homogenization result.
Several methods to take into account model misspecification in multiscale frameworks as above exist. For diffusion processes, the proposed approaches rely in different measures to subsampling, which has proved itself to some extent effective in many applications, but which requires nevertheless precise knowledge of how separated the two characteristic time scales are. Robustness of this methodology is dubious, too, as inference results tend to be extremely sensitive to the subsampling rate.
In the rest of the introduction, we first give a brief overview of the existing literature on the topic of deterministic and stochastic multiscale inference problems, then introduce our novel methodology and its favourable properties and conclude with an outline of this paper.

Literature Review
For simple models in molecular dynamics, the effect of model misspecification was studied in a series of papers [7,8,16,17,26,28,29] under the assumption of scale separation. In particular, for Brownian particles moving in two-scale potentials it was shown that, when fitting data from the full dynamics to the homogenized equation, the maximum likelihood estimator (MLE) is asymptotically biased [29,Theorem 3.4]. To be more precise, in the large sample size limit, the data remains consistent with the multi-scale problem at small scale. Ostensibly this would seem related only to the estimation of the diffusion coefficient. However, because of detail balance, it also has the effect that the MLE, for the drift in a parameter fit of a single-scale model, incorrectly identifies the coefficient of the homogenized equation. The bias of the MLE can be eliminated by subsampling at an appropriate rate, which lies between the two characteristic time scales of the problem [29, Theorems 3.5 and 3.6].
Similar techniques can be employed in econometrics, in particular for the estimation of the integrated stochastic volatility in the presence of market microstructure noise. In this case, too, the data have to be subsampled at an appropriate rate [6,25]. The correct subsampling rate can, in some instances, be rather extreme with respect to the frequency of the data itself, resulting in ignoring as much as 99% of the time-series. As the intuition suggests, this increases significantly the variance of the estimator, which is usually taken care of with additional bias corrections and variance reduction procedures. The need of such methodology is accentuated by data being obtained at high-frequency [5,35].
The problem of extracting large-scale variations from multiscale data is studied in atmosphere and ocean science. In this field, too, subsampling the data is necessary to obtain an accurate coarse-grained model [12,34].
The necessity to subsample the data can be alleviated by using appropriate martingale estimators, as was done in [18,21]. This class of estimators can be applied to the case where the noise is multiplicative and also given by a deterministic chaotic system, as opposed to white noise. Estimators of this family have been applied to time series from paleoclimatic data and marine biology and augmented with appropriate model selection methodologies [22].
In case the data consist of discrete observations and not of continuous time series, it is possible to employ estimators based on a spectral decomposition of the generator of the stochastic process. Methodologies of this kind have been applied successfully to inference problems for single-scale problems [13,20], as well as more recently for multiscale diffusions [14].
Inference of diffusion processes can be naturally performed under a Bayesian perspective. If one focuses on the drift coefficient, the form of the likelihood function guarantees, under a Gaussian prior hypothesis, that the posterior distribution is itself a Gaussian. The versatility of the Bayesian approach in the infinite-dimensional case [15,33] gives the possibility to extend the study of inferring the drift of a diffusion process to the non-parametric case [31,32]. The issue of model misspecification in inverse problems with a multiscale structure has been treated in the context of partial differential equations, too. In particular, it has been shown that it is possible to infer a coarse-grained equation from data coming from the full model and to retrieve, in the large data limit, the correct result [24]. A series of papers [1][2][3] focuses on retrieving the full model when the multiscale coefficient is endowed with a specific parametrized structure. Since these problems are ill-posed, the latter is achieved via Tikhonov regularization [1,24], adopting a Bayesian approach [2,24] or exploiting techniques of Kalman filtering [3]. In [2,3], the authors highlight the need to account explicitly for the modelling error due to homogenization and apply statistical techniques taken from [10,11].

Our Contributions
In this paper, we bypass subsampling by designing a methodology based on filtered data. In particular, we smooth the time-series data from the multiscale model by application of an appropriate linear time-invariant filter, from the exponential family, and show that doing so allows us to accurately retrieve the drift coefficient of the homogenized model. The methodology we present is straightforward to implement, robust in practice and backed by theory. In particular, we show theoretically and demonstrate via numerical experiments that: (i) The smoothing width of the filter can be alternatively tuned to be proportional to the speed of the slow process or to smaller scales and provide in both cases unbiased results for maximum likelihood parameter estimation. Sharp estimates on the minimal width with respect to the multiscale parameter are provided. The unbiasedness results are given in Theorems 3.12 and 3.18 for filtered data in the homogenized and in the multiscale regimes, respectively.
(ii) We additionally propose in the multiscale regime an estimator of the effective diffusion coefficient based on filtered data, as shown by Theorem 3.20.
(iii) Estimations based on our technique are robust in practice with respect to the parameter of the filter. This is not the case for subsampling, which is strongly influenced by the subsampling frequency. The robustness of our technique is demonstrated via numerical experiments in Sections 5.1 and 5.3.
(iv) The entire stream of data is employed, which, in practice, enhances the quality of the filter-based MLE in terms of bias. Moreover, avoiding subsampling and thus discretising the data allows us to employ continuous-time theoretical tools.
(v) It is possible to employ the filtered data approach within a continuous-time Bayesian framework by a careful modification of the likelihood function. Under mild hypotheses on the filter parameters, we are able to show that the posterior distributions obtained with our methodology are asymptotically consistent with respect to the drift parameter of the homogenized equation. Our main theoretical result is given in Theorem 4.5, and a numerical experiment for the combination of the filtered data approach and of Bayesian techniques is presented in Section 5.4.

Outline
The rest of the paper is organised as follows. In Section 2 we introduce the problem and lay the basis of our analysis setting the main assumptions and notation. In Section 3 we present our filtered data methodology, with a particular focus on ergodic properties, on multiscale convergence and, naturally, on the properties of our estimators. In Section 4 we introduce the Bayesian framework and show how it can be enhanced employing filtered data. Finally, in Section 5 we demonstrate the effectiveness of our methodology via a series of numerical experiments.

Problem Setting
In this section, we introduce the class of diffusion processes which we treat in this paper and the classical methodology employed for the estimation of the drift. Let ε > 0 and let us consider the one-dimensional multiscale stochastic differential equation (SDE) where, given a positive integer N , we have that α ∈ R N and σ > 0 are the drift and diffusion coefficients respectively and W t is a standard one-dimensional Brownian motion. The functions V : R → R N and p : R → R define the slow-scale and the fast-scale confining potentials respectively. In particular, we assume for smooth functions V i : R → R, i = 1, . . . , N . Moreover, we assume p to be smooth and periodic of period L. The theory of homogenization [9, Chapter 3] guarantees the existence of an SDE of the form such that X ε t → X t for ε → 0 in law as random variables in C 0 ([0, T ]; R). In particular, we have A = Kα and Σ = Kσ, where the coefficient 0 < K < 1 is given by the formula and where the function Φ is the unique solution with zero-mean with respect to the measure µ of the two-point boundary value problem − p (y)Φ (y) + σΦ (y) = p (y), 0 ≤ y ≤ L, (2.5) endowed with periodic boundary conditions. Let us remark that in this one-dimensional setting it is possible to determine Φ explicitly, and the homogenization coefficient K is given by We now briefly present the classical methodology for estimating the drift coefficient. Let T > 0 and let X := (X t , 0 ≤ t ≤ T ) be a realization of the solution of (2.3) up to final time T . Girsanov's change of measure formula applied to (2.3) allows to write the likelihood of X given a drift coefficient A as Minimizing the functional I(X | A) with respect to A therefore gives the maximum likelihood estimator (MLE) of A, which can be formally computed in closed form as where M (X) ∈ R N ×N and h(X) ∈ R N are defined as and where ⊗ denotes the outer product in R N . Let us now state the assumptions which will be employed throughout the rest of our work. In particular, we consider the same dissipative setting as [29, Assumption 3.1].
Assumption 2.1. The potentials p and V satisfy (i) p ∈ C ∞ (R) and is L-periodic for some L > 0; (ii) V i ∈ C ∞ (R) for all i = 1, . . . , N is polynomially bounded from above and bounded from below, and there exist a, b > 0 such that (iii) V is Lipschitz continuous, i.e. there exists a constant C > 0 such that and the components V i are polynomially bounded for all i = 1, . . . , N ; (iv) for all T > 0, the symmetric matrix M (X) is positive definite and there existsλ > 0 such that λ min (M (X)) ≥λ Remark 2.2. In the following, in particular in the proof of Lemma 3.3, we will employ Assumption 2.1(ii) for the whole drift of the SDE (2.1), i.e., the function Since p ∈ C ∞ (R) and is periodic, all derivatives of p are in L ∞ (R). Therefore, the assumption above is sufficient for V ε to satisfy Assumption 2.1(ii) with different values for a and b. In particular, assume Assumption 2.1(ii) holds for V . Then, we have for all γ > 0 by Young's inequality Hence, Assumption 2.1(ii) holds for V ε with a coefficient b which is arbitrarily close to the coefficient for V , alone.
Under these assumptions, the MLE given in (2.7) is indeed the unique minimizer of the likelihood function, as shown in [32,Theorem 2.4].
Let us consider the modified estimator of the drift coefficient obtained replacing X with X ε := (X ε t , 0 ≤ t ≤ T ) solution of (2.1), i.e., A(X ε , T ) := arg min where I(X ε | A), the matrix M (X ε ) and the vector h(X ε ) are obtained replacing each occurrence of X with X ε . In the following, we assume that Assumption 2.1(iv) holds as well for the matrix M (X ε ), and simply denote by M := M (X ε ) and h := h(X ε ) in case of no ambiguity. Given the convergence of X ε → X in the space of continuous stochastic processes, one would expect that the MLE (2.8) would be asymptotically unbiased for the drift coefficient A of the homogenized equation (2.3). Instead, it is possible to prove that in the asymptotic limit for T → ∞ and ε → 0, the MLE tends to the drift coefficient α of the unhomogenized equation (2.1). We report here this result, whose proof can be found for the case N = 1 in [29,Theorem 3.4]. We remark that the proof for N > 1 follows directly from the one-dimensional case. Theorem 2.3. Let Assumption 2.1 hold and let X ε 0 be distributed according to the invariant measure of the process X ε solution of (2.1). Then where α is the drift coefficient of equation (2.1).
As anticipated in the introduction, the main existing tool for obtaining unbiased estimators in the literature is subsampling the data. In particular, let the dimension of the parameter N = 1, let δ > 0 and let T = nδ with n a positive integer. Then, a subsampled estimator for A is given by which is a discretized version of A(X ε , T ). It is possible to show [29, Theorem 3.5] that choosing δ = ε ζ with ζ ∈ (0, 1), then A δ (X ε , T ) is an asymptotically unbiased estimator of A in the limit for ε → 0, in probability. Despite being widely employed in practice, estimators based on subsampling present some drawbacks, such as having a high variance, as mentioned in the introduction. In the following, we will introduce and analyse a novel approach for the drift estimation.
Estimating the effective diffusion coefficient Σ of the homogenized SDE (2.3) is as well a relevant problem. Indeed, knowing Σ besides the drift coefficient A gives a complete estimation of the effective model (2.3), which is effective for the multiscale data generated by (2.1) in the sense of homogenization theory. The standard approach for estimating the diffusion coefficient is to compute the quadratic variation of the path. In [29,Theorem 3.4], the authors show that this approach fails in case the data is not pre-processed, meaning that the quadratic variation of X ε equals the diffusion coefficient σ of (2.1), even in the limit for ε → 0. They propose therefore the estimator Σ δ based on subsampling that tends to the effective diffusion coefficient Σ [29, Theorem 3.6]. Despite the focus of this work being mainly the effective drift coefficient, we propose in the following an unbiased estimator for the effective diffusion coefficient which fits our novel approach.
Remark 2.4. We note that our framework may be viewed in the semi-parametric setting as the one of [21]. In particular the functions V i , i = 1, . . . , N can be seen as the known basis functions of an expansion (e.g. a Taylor expansion) for the unknown confining potential V α : R → R given by A numerical example highlighting the potential of our method in such a setting is given in Section 5.3.
Remark 2.5. Let us remark that for enhancing the clarity of the exposition, in this article we chose to focus on the case of a multi-dimensional parameter in the setting of one-dimensional diffusion processes. In fact, all the theory we present in the following could be generalized to the case of the d-dimensional version of the SDE (2.1), which can be written as where W t is a standard d-dimensional Brownian motion. Slight modifications of the proof demonstrate that analogous results to ours may be obtained in the d-dimensional case.

The Filtered Data Approach
In this section, we introduce and analyse a novel approach based on filtered data to address the issue that the MLE estimator, when confronted with multiscale data, is biased. Let β, δ > 0 and let us consider a family of exponential kernel functions k : R + → R defined as where C β is the normalizing constant given by and where Γ(·) is the gamma function. We consider the process The process Z ε can be interpreted as a smoothed version of the original trajectory X ε . In fact, in the field of signal processing the kernel (3.1) belongs to the class of low-pass linear time-invariant filters, which cut the high frequencies in a signal to highlight its slowest components. In the following, rigorous analysis is conducted only when β = 1. Nonetheless, numerical experiments show that for higher values of β the performances of estimators computed employing the filter are more robust and qualitatively better.
Remark 3.1. Given a trajectory X ε , it is relatively inexpensive to compute Z ε from a computational standpoint. In particular, the process Z ε is the truncated convolution of the kernel with the process X ε . Hence, computational tools based on the Fast Fourier Transform (FFT) exist and allow to compute Z ε fast component-wise. Moreover, the process Z ε can be computed, in case β = 1, in a recursive manner and therefore "online".
Given a trajectory X ε and the filtered data Z ε , the estimator of the drift coefficient we propose is given by where we employ the subscript k for reference to the filter's kernel in (3.1), and where For economy of notation we drop explicit reference to the dependence of M and h on X ε . Let us remark that the formula above is obtained from (2.8) by replacing only one instance of X ε t with Z ε t in both M and h. In particular, it is fundamental for proving unbiasedness to keep in the definition of h the differential of the original process dX ε t (see Remark 3.7). Let us furthermore remark that A k (X ε , T ) need not be the minimizer of some likelihood function based on filtered data. In fact, if one were to replace Z ε t directly in (2.6), the symmetric part of the matrix M would appear and A k (X ε , T ) would not be the minimizer. Therefore, the estimator A k (X ε , T ) has to be thought of as a perturbation of A(X ε , T ), directly at the level of estimators and after the maximization procedure. The only theoretical guarantee which is still needed for the well-posedness of A k (X ε , T ) is for M to be invertible, which we assume to be true and which we observed to hold in practice.
We now consider the diffusion coefficient, and propose the estimator for Σ in (2.3) given by where again we employ the subscript k for reference to the kernel (3.1) of the filter. As we will show in the following, and in particular in Theorem 3.20, the estimator Σ is unbiased for the effective diffusion coefficient Σ in case β = 1 and when we filter data at the multiscale regime, i.e., when δ is a vanishing function of ε. Let us from now on consider β = 1. For this value of β, the parameter δ appearing in (3.1) regulates the width of the filtering window. In practice, larger values of δ will lead to trajectories which are smoother and for which fast-scale oscillations are practically canceled. Let us remark that the filtering width resembles the subsampling step employed for the estimator A δ (X ε , T ) introduced and analyzed in [29]. For subsampling, the choice guaranteeing asymptotically unbiased results is δ = ε ζ with ζ ∈ (0, 1), and a similar analysis is due for our technique. For visualization purposes, we depict in Figure 1 the filtered trajectory Z ε for three different values of δ, namely δ = {1, √ ε, ε}. With δ = 1, all oscillations at the fast scale are canceled and the filtered trajectory Z ε presents only slow-scale variations. Reducing the value of δ, fast-scale oscillations are progressively taken into account.
In the following, we first focus on the ergodic properties of the process Z ε when it is coupled with the process X ε . This analysis is practically independent of the choice of δ, and is therefore presented on its own. Then, we focus on two different cases which depend on the choice of the width δ of the filter. First, in Section 3.2, we consider δ to be independent of ε, and therefore we filter at the speed of the homogenized process. In this case, we are able to prove that our estimator of the drift coefficient of the homogenized equation is asymptotically unbiased almost surely. This result will be presented in Theorem 3.12. We then move on in Section 3.3 to the case δ ∝ ε ζ , which corresponds to filtering the data at the speed of the multiscale process. In this case, we show that under some conditions on the exponent ζ, we can still obtain estimators which are asymptotically unbiased in probability. This result is proved in Theorem 3.18. For this second case, we widely employ techniques and estimates which come from [29].

Ergodic Properties
Let us consider the filtering kernel (3.1) with β = 1, i.e., In this case, Leibniz integral rule yields the equality which can be interpreted as an ordinary differential equation for Z ε t driven by the stochastic signal X ε . Considering the processes X ε and Z ε together, we obtain the system of two one-dimensional SDEs (3.5) The first ingredient for verifying the ergodic properties of the two-dimensional process ( is verifying that the measure induced by the stochastic process admits a smooth density with respect to the Lebesgue measure. Since noise is present only on the first component, this is a consequence of the theory of hypo-ellipticity, as summarized in the following Lemma, whose proof is given in Appendix A.

Lemma 3.2.
Let (X ε , Z ε ) be the solution of (3.5) and let µ ε t be the measure induced by the joint process at time t. Then, the measure µ ε t admits a smooth density ρ ε t with respect to the Lebesgue measure.
Once it is established that the law of the process admits a smooth density for all times t > 0, which satisfies a time-dependent Fokker-Planck equation, we are interested in the limiting properties of this law. In particular, we know that the process X ε alone is geometrically ergodic [23,Theorem 4.4], and we wish the couple (X ε , Z ε ) to inherit the same property. The following Lemma guarantees that the couple is indeed geometrically ergodic, and its proof is given in Appendix A. Lemma 3.3. Let Assumption 2.1 hold and let b > 0 be given in Assumption 2.1(ii). Then, if δ > 1/(4b), the process (X ε , Z ε ) solution of (3.5) is geometrically ergodic, i.e., there exists C, λ > 0 such that for all measurable f : , where E denotes expectation with respect to the Wiener measure, and ρ ε is the solution to the stationary Fokker-Planck equation Remark 3.4. The condition δ > 1/(4b) is not very restrictive. Let the parameter dimension N = 1 and let V (x) ∝ x 2r for an integer r > 1. Then, Assumption 2.1(ii) holds for an arbitrarily large b > 0. Therefore, the parameter of the filter δ can be chosen along the entire positive real axis. A similar argument can be employed for higher dimensions N > 1.
In a general case, it is not possible to find an explicit solution to (3.6). Nevertheless, it is possible to show some relevant properties of the solution itself, which are summarized in the following Lemma, whose proof is given in Appendix A.

Lemma 3.5.
Under the assumptions of Lemma 3.3, let ρ ε be the solution of (3.6) and let us write where ϕ ε and ψ ε are the marginal densities of X ε and Z ε respectively, i.e., Then, it holds Moreover, it holds Remark 3.6. Lemma 3.5, and in particular the equality (3.9), plays a fundamental role in the proof of unbiasedness of the estimator based on filtered data. In particular, this equality allows to bypass the explicit knowledge of the function R(x, z), which governs the correlation between the processes X ε and Z ε at stationarity, for which a closed-form expression is not available in the general case.
Remark 3.7. Let us return to the definition of A k and replace the differential dX ε t with dZ ε t in h. In this case, it holds where the last equality is obtained as in the proof of Lemma 3.5, with the choice f (x, z) = V (z) at the last line. Therefore, we stress again that it is indeed necessary to employ the original differential dX ε t in the vector h in the definition (3.2) of A ε k . Remark 3.8. Let us consider the kernel (3.1) with β > 1. In this case, the steps leading to the system (3.5) do not yield a system of Itô SDEs, but of stochastic delay differential equations. The analysis of the estimator in case β > 1 is therefore based on different arguments than the one we present in this work.

Filtered Data in the Homogenized Regime
In this section, we analyze the behavior of the estimator A k (X ε , T ) based on filtered data given in (3.2) when the filtering width δ is independent of ε. The analysis in this case is based on the convergence of the couple (X ε , Z ε ) with respect to the multiscale parameter ε → 0. In particular, it is known that the invariant measure of X ε converges weakly to the invariant measure of X, the solution of the homogenized equation (2.3). The following result guarantees the same kind of convergence for the couple (X ε , Z ε ) . Lemma 3.9. Under Assumption 2.1, let µ ε be the invariant measure of the couple (X ε , Z ε ) . If δ is independent of ε, then the measure µ ε converges weakly to the measure µ 0 (dx, dz) = ρ 0 (x, z) dx dz, whose density ρ 0 is the unique solution of the Fokker-Planck equation where A and Σ are the coefficients of the homogenized equation (2.3).
The arguments of Section 3.1 can be repeated to conclude that the invariant measure of (X, Z) admits a smooth density ρ 0 which satisfies (3.10). Moreover, standard homogenization theory (see e.g. [9, Chapter 3, Theorem 6.4] or [30, Theorem 18 We remark that traditionally it is assumed that the initial conditions satisfy (X ε 0 , Z ε 0 ) = (X 0 , Z 0 ) for the homogenization result to hold, but notice that the proof of e.g. [30,Theorem 18.1] can be shown to hold with a minor modification in case both the multiscale and the homogenized processes are at stationarity. Denoting E = C 0 ([0, T ], R 2 ), this means that the measure induced by (X ε , Z ε ) on (E, B(E)) converges weakly to the measure induced by (X, Z) on the same measurable space (see e.g. [30,Definition 3.24]). Hence, the measure µ ε converges weakly to µ 0 for ε → 0. where This is the density of a multivariate normal distribution N (0, Γ), where the covariance matrix is given by 1 .
Let us remark that this distribution can be obtained from direct computations involving Gaussian processes. In particular, we have that X is in this case an Ornstein-Uhlenbeck process and it is therefore known that X ∼ GP(m t , C(t, s)), where at stationarity m t = 0 and The basic properties of Gaussian processes imply that Z is a Gaussian process, and that the couple (X, Z) is a Gaussian process, too, whose mean and covariance are computable explicitly.
We now present an analogous result to Lemma 3.5 for the limit distribution.
Corollary 3.11. Let ρ 0 be the solution of (3.10) and let us write where ϕ 0 and ψ 0 are the marginal densities, i.e.,

Moreover, it holds
Proof. The proof is directly obtained from Lemma 3.5 setting p(y) = 0 and replacing α, σ by A, Σ respectively. Let us introduce a notation which will be used throughout the rest of the paper. We denote i.e., M ε is obtained in the limit for T → ∞ applying the ergodic theorem elementwise to the matrix M , and M 0 is the limit for ε → 0 of the matrix M ε due to Lemma 3.9. For completeness, we introduce here the symmetric matrices M ε and M 0 which are defined as and which will be employed in the following. We can now introduce the main result, namely the convergence of the estimator based on filtered data of the drift coefficient of the homogenized equation.

Theorem 3.12. Let the assumptions of Lemma 3.3 and Lemma 3.9 hold, and let
Proof. Replacing the expression of dX ε t into (3.3), we get for h Therefore, we have (3.14) We study the terms I ε 1 (T ) and I ε 2 (T ) separately. First, the ergodic theorem applied to I ε 1 (T ) yields Replacing the decomposition (3.7), the expression (3.8) of ϕ ε and integrating by parts, we have Replacing the equality above into (3.15), we obtain Due to Lemma 3.5, we therefore have Since δ is independent of ε, we can pass to the limit as ε goes to zero and Lemma 3.9 yields and moreover, an integration by parts yields We can therefore conclude that We now consider the second term I ε 2 (T ), and rewrite it as The ergodic theorem yields where R ε is bounded uniformly in ε due to the theory of homogenization, Assumption 2.1(iii)-(iv) and Lemma C. Remark 3.13. Let us remark that the assumption that δ is independent of ε is necessary to pass from (3.16) to (3.17) but is not needed before (3.16). Moreover, the term I ε 2 (t) in the proof vanishes a.s. independently of ε. Therefore, in the analysis of the case δ = O(ε ζ ) it will be sufficient for unbiasedness to show that which is a non-trivial limit since δ → 0 for ε → 0.

Filtered Data in the Multiscale Regime
We now consider the case of the filtering width δ = O(ε ζ ), where ζ > 0 will be specified in the following. In this case, the filtered process resembles more the original process X ε , as can be noted in Figure 1. Moreover, the techniques employed for proving Theorem 3.12 can only be partly exploited, as highlighted by Remark 3.13. In fact, in order to prove unbiasedness it is necessary to characterize precisely the difference between the processes Z ε and X ε . A first characterization is given by the following Proposition, whose proof is found in Appendix B.

19)
where Φ is the solution of the cell problem (2.5), W s is the Brownian motion appearing in (2.1) and Y ε t = X ε t /ε. Moreover, B ε t and the remainder R(ε, δ) satisfy for every p ≥ 1 the estimates

21)
where C is independent of ε, δ and t, and ϕ ε is the density of the invariant measure of X ε .
It is clear from the Proposition above that understanding the properties of the process B ε t is key to understanding the behavior of the difference between X ε and Z ε . In particular, we can write the dynamics of B ε t with an application of the Itô formula and due to the properties of the kernel k(t) as This equation can be coupled with the dynamics of the processes X ε t , Y ε t and Z ε t , thus describing the evolution of the quadruple (X ε , Y ε , Z ε , B ε ) together. In particular, it is possible to show that the results of Section 3.1 hold for the quadruple, and the properties of the invariant measure of the quadruple can be exploited to prove the unbiasedness of the estimator in the case δ = O(ε ζ ) in the same way as in the case δ independent of ε. In this context, a further assumption on the potential V is necessary. for all x, y ∈ R.
In light of Remark 3.13, it is fundamental to understand the behavior of the quantity as well as its limit for t → ∞ and for ε → 0. Let us remark that due to Proposition 3.14 we have and therefore studying the right hand side of the approximate equality above is the goal of the upcoming discussion. The following result, whose proof is in Appendix C, gives a first characterization.

Lemma 3.16. Under Assumptions 2.1 and 3.15, let η ε be the invariant measure of the quadruple
where the remainder R(ε, δ) satisfies Let us remark that the quantity appearing above hints towards the theory of homogenization. In fact, we recall that the homogenization coefficient K is given by where µ is the marginal measure of the process Y ε when coupled with X ε . Therefore, the next step is the homogenization limit, i.e., the limit of vanishing ε, which is considered in the following Lemma, and whose proof is given in Appendix C.
where Σ is the diffusion coefficient of the homogenized equation (2.3).
Provided with the results presented above, we can prove the following Theorem, stating that the estimator A k (X ε , T ) is asymptotically unbiased even in the case of the filtering width δ vanishing with respect to the multiscale parameter ε.

Theorem 3.18. Let the assumptions of Lemma 3.3 and Lemma 3.17 hold. Let
Proof. Let us introduce the notation where M ε is defined in (3.12). Then following the proof of Theorem 3.12 and in light of Remark 3.13, we only need to show that if δ = ε ζ with ζ ∈ (0, 2) we have lim ε→0 A ε (δ) = A, in probability.
Using Proposition 3.14 and geometric ergodicity for taking the limit for t → ∞ (Lemma 3.3), we have the following equality where R(ε, δ) is given in Proposition 3.14, E denotes the expectation with respect to the Wiener measure and J ε Let us consider the three terms separately. First, by geometric ergodicity and applying Lemma 3.16 and Lemma 3.17 we get Let us now consider J ε 2 (t). Considering Hölder conjugates p, q, r the Hölder inequality yields We consider now J ε 3 (t). The Hölder inequality yields for conjugates p and q which, similarly as above, yields for t sufficiently large Therefore, since δ = O(ε ζ ) for ζ ∈ (0, 2), the terms J ε 2 (t) and J ε 3 (t) vanish in the limit for t → ∞ and ε → 0. Furthermore, by Lemma C.4 and by weak convergence of the invariant measure µ ε to µ 0 , we have lim where M 0 is defined in (3.13). Therefore and, finally, employing (3.11) and (3.13) and integrating by parts yields which implies the desired result.
We conclude the analysis concerning the estimator A k for the effective drift coefficient with a negative convergence result, i.e., that if δ = ε ζ with ζ > 2, the estimator based on filtered data converges to the coefficient α of the unhomogenized equation. This result is relevant for two reasons. First, it shows the sharpness of the bound on ζ in the assumptions of Theorem 3.18. Second, it shows an interesting switch between two completely different regimes at ζ = 2, which happens arbitrarily fast in the limit ε → 0.

Theorem 3.19. Let the assumptions of Lemma 3.3 and Assmuption 3.15 hold. Let
where α is the drift coefficient of the multiscale equation (2.1).
The proof is given in Appendix C.
We conclude this section by proving a result of asymptotic unbiasedness for the estimator Σ k of the effective diffusion coefficient Σ defined in (3.4). The proof is given in Appendix D.

The Bayesian Setting
In this section we present a Bayesian reinterpretation of the inference procedure, which, given the structure of the problem, allows full uncertainty quantification with little more computational effort than required for the MLE.
Let us fix a Gaussian prior µ 0 = N (A 0 , C 0 ) on A, where A 0 ∈ R N and C 0 ∈ R N ×N is symmetric positive definite. Then, given a final time T > 0, the posterior distribution µ T,ε admits a density p(A | X ε ) with respect to the Lebesgue measure which satisfies where Z ε is the normalization constant, p 0 is the density of µ 0 , and where the likelihood p(X ε | A) is given in (2.6). The log-posterior density is therefore given by where M and h are defined in (2.8). Since the log-posterior density is quadratic in A, the posterior is Gaussian, and it is therefore sufficient to determine its mean and covariance to fully characterize it. We denote by m T,ε and C T,ε the mean and covariance matrix, respectively. Completing the squares in the log-posterior density, we formally obtain Under Assumption 2.1, one can show that the posterior at time T > 0 is well defined and given by µ T,ε (· | X ε ) = N (m T,ε , C T,ε ). Let us remark that in order to compute the posterior covariance C T,ε the value of the diffusion coefficient Σ of the homogenized equation is needed. Although the exact value is in general unknown, it can be estimated employing the subsampling technique presented in [29] or with the estimator Σ k given in (3.4) based on filtered data. In fact, we verified in practice that the estimator of the diffusion coefficient based on subsampling is more robust with respect to the subsampling step than the estimator for the drift coefficient. In the following theorem, we show that the posterior distribution obtained with no pre-processing of the data contracts asymptotically to the drift coefficient of the unhomogenized equation. We characterize the contraction by verifying that the posterior measure concentrates in arbitrarily small balls. Let us finally remark that the measure µ T,ε is a random measure, and therefore contraction has to be considered averaged with respect to the Wiener measure. The choice of the contraction measure and some parts of the proof are taken from [31,Theorem 5.2].
where E denotes expectation with respect to the Wiener measure and α is the drift coefficient of the unhomogenized equation (2.1).
Remark 4.2. The result above has the same consequences in the Bayesian setting as Theorem 2.3 has for the MLE. In particular, it shows that the posterior distribution obtained when data is not pre-processed concentrates asymptotically on the drift coefficient of the unhomogenized equation (2.1). Moreover, a partial result which can be deduced from the proof is that in the limit for T → ∞ and for a positive value ε > 0 the Bayesian and the MLE approaches are equivalent. In particular, we have for all ε > 0 lim i.e., the weak limit of the posterior µ T,ε for T → ∞ is the Dirac delta concentrated on the limit of A(X ε , T ) for T → ∞.
Proof of Theorem 4.1. The proof of [31,Theorem 5.2] guarantees that if the trace of C T,ε tends to zero and if the mean m T,ε tends to α, then the desired result holds. Indeed, the triangle inequality yields If the mean converges in probability, then the second term vanishes. For the first term, Markov's inequality yields and a change of variable simply gives This proves that we just have to verify that the covariance matrix vanishes and that the mean tends to the coefficient α. Let us first consider the covariance matrix. An algebraic identity yields Let us first remark that due to the hypothesis on M (Assumption 2.1(iv)) and the ergodic theorem it holds for all T > 0 whereλ is given in Assumption2.1(iv). We now have that for generic symmetric positive definite matrices R and S it holds (R + S) −1 2 ≤ S −1 2 . Applying this inequality to Q −1 , we obtain and due to the triangle inequality lim We proved that in the limit for T → ∞ the covariance shrinks to zero independently of ε. We now consider the mean. First, we remark that the triangle inequality yields For the second term, Theorem 2.3 implies Let us now consider the first term. Replacing the expression of the maximum likelihood estimator (2.8) and due to the Cauchy-Schwarz and triangle inequalities, we obtain Moreover, the ergodic theorem and the strong law of large numbers for martingales guarantee that h 2 is bounded a.s. for T → ∞. Therefore

The Filtered Data Approach
In this section, we present how to correct the asymptotic biasedness of the posterior highlighted by Theorem 4.1 employing filtered data. In the Bayesian setting, we consider the modified likelihood function where Since M is symmetric positive definite, the function p(X ε | A) is indeed a valid Gaussian likelihood function. We then obtain the modified posterior µ T,ε = N ( m T,ε , C T,ε ), whose parameters are given by Let us remark that the posterior µ T,ε has the same covariance as µ T,ε given in (4.1) and that therefore it is indeed a valid Gaussian posterior distribution. Nevertheless, in order to employ the tool of convergence introduced in Theorem 4.1, we need to study the properties of the MLE based on the likelihood p(X ε | A), i.e., the quantity The following theorem guarantees the unbiasedness of this estimator under a condition on the parameter δ of the filter.

Theorem 4.3. Let the assumptions of Theorem 3.18 hold. Then
Proof. We first consider the difference between the two estimators A k (X ε , T ) and A k (X ε , T ). In particular, the ergodic theorem and an algebraic equality imply almost surely, where M ε and M ε are defined in (3.13) and (3.12), respectively. Therefore, due to Assumption 2.1 which allows controlling the norm of M −1 ε and due to Lemma C.4 we have for a constant C > 0 lim where we remark that A k (X ε , T ) has a bounded norm for ε sufficiently small due to Theorem 3.18. Now, the triangle inequality yields Therefore, due to Theorem 3.18, the inequality (4.4) and since δ = ε ζ , the desired result holds.
Remark 4.4. One could argue that we could have carried on the whole analysis for the estimator A k (X ε , T ) instead of the estimator A k (X ε , T ). Nevertheless, the latter guarantees the strong result of almost sure convergence in case δ is independent of ε, which is false for the former. Conversely, analysing the properties of the estimator A k (X ε , T ) is fundamental for the Bayesian setting, in which the matrix M cannot be employed as its symmetric part is not positive definite in general.
In light of the proof of Theorem 4.1, Theorem 4.3 guarantees that the mean of the posterior distribution µ T,ε converges to the drift coefficient of the homogenized equation. Since the covariance matrix is the same for µ T,ε and µ T,ε , it is possible to prove a positive convergence result for µ T,ε , which is given by the following Theorem.

Theorem 4.5. Let the Assumptions of Theorem 4.3 hold. Then, the modified posterior measure
where E denotes expectation with respect to the Wiener measure and A is the drift coefficient of the homogenized equation (2.3).
Proof. The proof follows from the proof of Theorem 4.1 and from Theorem 4.3.

Numerical Experiments
In this section we show numerical experiments confirming our theoretical findings and showcasing the potential of the filtered data approach to overcome model misspecification arising when multiscale data is used to fit homogenized models.
Remark 5.1. In practice, we consider for numerical experiment the data to be in the form of a high-frequency discrete time series from the solution X ε of (2.1). Let τ > 0 be the time step at which data is observed, and let X ε := (X ε 0 , X ε τ , X ε 2τ , . . .). We then compute the estimator A k as We take in all experiments τ ε 2 , so that the discretization of the data has negligible effects and does not compromise the validity of our theoretical results.

Parameters of the Filter
For the first preliminary experiments, we consider N = 1 and the quadratic potential V (x) = x 2 /2. In this case, the solution of the homogenized equation is an Ornstein-Uhlenbeck process. Moreover, we set the the fast potential in the multiscale equation (2.1) as p(y) = cos(y). In all experiments, data is generated employing the Euler-Maruyama method with a fine time step.

Verification of Theoretical Results
We first demonstrate numerically the validity of Theorem 3.12, Theorem 3.18 and Theorem 3.19, i.e., the unbiasedness of A k (X ε , T ) for δ = ε ζ with ζ ∈ [0, 2) and biasedness for ζ > 2. Let us recall that for ζ = 0 the analysis and the theoretical result are fundamentally different than for ζ ∈ (0, 2). We consider ε ∈ {0.1, 0.05, 0.025}, the diffusion coefficient σ = 1 and generate data X ε t for 0 ≤ t ≤ T with T = 10 3 . Then we filter the data by choosing δ = ε ζ , and ζ = 0, 0.1, 0.2, . . . , 3, and compute A k (X ε , T ). Results are displayed in Figure 2, and show that for ζ > 2, i.e., δ = o(ε 2 ), the estimator tends to the drift coefficient α of the unhomogenized equation. Conversely, as predicted by the theory, for ζ ∈ [0, 2) the estimator tends to A, the drift coefficient of the homogenized equation. Therefore, the point δ = ε 2 acts asymptotically as a switch between two completely different regimes, which is theoretically sharp in the limit for T → ∞ and ε → 0. Let us remark that the results displayed in Figure 2.(a) demonstrate that the transition occurs more rapidly for the smallest values of ε. Moreover, in Figure 2.(b), one can see how with bigger final times T the estimator is closer both to A when ζ ∈ [0, 2] and to α when ζ > 2. Still, we observe that in finite computations the switch between A and α is smoother than what we expect from the theory, which suggests to fix, if possible, δ = 1.

Comparison with Subsampling
We now compare the results given by the filtered data technique with the results given by subsampling the data, i.e., the difference between the estimators A k (X ε , T ) and A δ (X ε , T ). We fix the multiscale parameter ε = 0.1 and generate data for 0 ≤ t ≤ T with T = 10 3 . We choose δ = ε ζ and vary ζ ∈ [0, 1], where δ is the filtering and the subsampling width, respectively. Moreover, for the filtered data approach we consider both β = 1 and β = 5. We report in Figure 3 the experimental results. Let us remark that: (i) for σ = 0.5 the results given by subsampling and by the filter with β = 1 are similar, while for higher values of σ the filtered data approach seems better than subsampling; (ii) in general, choosing a higher value of β seems beneficial for the quality of the estimator; (iii) the dependence on δ of numerical results given by the filter seems relevant only in case β = 1 and for small values of σ. For β = 1 and higher values of σ, the estimator is stable with respect to this parameter. This can be observed for a higher value of β but we have no theoretical guarantee in this case.

The Influence of β
We finally test the variability of the estimator with respect to β in (3.1). We consider δ = ε, which corresponds to ζ = 1 and seems to be the worst-case scenario for the filter, at least for β = 1. We consider again σ = 0.5, 0.7, 1 and vary β = 1, 2, . . . , 10. Results, given in Figure 4, show empirically that the estimator stabilizes fast with respect to β. Nevertheless, there is no theoretical guarantee supporting this empirical observation.

Variance of the Estimators
We now compare the estimators A k based on filtered data and A δ based on subsampling in terms of variance. We consider for this experiment the SDE (2.1) with N = 1, the bistable potential V (x) = x 4 /4 − x 2 /2, the multiscale drift coefficient α = 1, the diffusion coefficient σ = 1 and with ε = 0.1. We then let X ε = (X t , 0 ≤ t ≤ T ) be the solution of (2.1) and generate N s = 500 i.i.d. samples of X ε . We then compute the estimators A k and A δ on each of the realizations of X ε , thus obtaining N s replicas { A For the estimator A k , we consider the kernel (3.1)  with β = {1, 5} and with δ = 1. For the estimator A δ , we employ the subsampling width δ = ε 2/3 , which is heuristically optimal following [29]. It could be argued that another estimator based on subsampling and shifting could be employed to reduce the variance. In particular, we let τ > 0 be the time step at which the data is observed. Indeed, in practice we work with high-frequency discrete data, and observe X ε := (X ε 0 , X ε τ , . . . , X ε nτ ), with nτ = T . We assume for simplicity that the subsampling width δ is a multiple of τ and compute for all k = 0, 1, . . . , δ/τ − 1 i.e. the subsampling estimator obtained by shifting the origin by kτ . We then average over the index k and obtain the new estimator We include this estimator in the numerical study for completeness, and compute N s replicas of A avg δ on all the realizations of X ε . Results, given in Figure 5 for the final times T = {500, 1000}, show that our novel approach does not outperform subsampling in terms of variance, but clearly does in terms of bias. Moreover, we notice numerically that the shifted-averaged estimator A avg δ does not reduce sensibly the variance in this case with respect to A δ . In fact, this is only partly surprising, since the estimators A δ,k of (5.1) are highly correlated. Finally, we notice that the filtering estimator A k with β = 5 has a lower variance with respect to the same estimator with β = 1. This confirms that choosing a higher value of β improves the estimation of the effective drift coefficient.

Multidimensional Drift Coefficient
Let us consider the Chebyshev polynomials of the first kind, i.e., the polynomials T i : R → R, i = 0, 1, . . ., defined by the recurrence relation We consider the potential function V (x) as in (2.2) with thus considering the semi-parametric framework of Remark 2.4. This potential function satisfies Assumption 2.1 whenever N is even and if the leading coefficient α N is positive. We set N = 4 and the drift coefficient α = (−1, −1/2, 1/2, 1). With this drift coefficient, the potential function is of the bistable kind. Moreover, we set ε = 0.05, the diffusion coefficient σ = 1, the fast potential p(y) = cos(y) and simulate a trajectory of X ε for 0 ≤ t ≤ T with T = 10 3 employing the Euler-Maruyama method with time step ∆t = ε 3 . We estimate the drift coefficient A ∈ R 4 with the estimators: (i) A(X ε , T ) based on the data X ε itself; (ii) A δ (X ε , T ) based on subsampled data with subsampling parameter δ = ε 2/3 ; (iii) A k (X ε , T ) based on filtered data Z ε computed with β = 1 and δ = 1.
In particular, we pick this specific value of δ for the subsampling following the optimality criterion given in [29]. Results, given in Figure 6, show that the filter-based estimation captures well the homogenized potential as well as the coefficient A. Moreover, it is possible to remark the negative result given by Theorem 2.3 holds in practice, i.e., with no pre-processing the estimator A(X ε , T ) tends to the drift coefficient α of the unhomogenized equation. Finally, we can observe that the subsampling-based estimator fails to capture the homogenized coefficients. Indeed, the estimator strongly depends on the sampling rate and on the diffusion coefficient, as shown in the numerical experiments of [29]. Even though the authors suggest the choice of δ = ε 2/3 , this is just an heuristic and is not guaranteed to be the optimal value in all cases. In the asymptotic limit of ε → 0 and T → ∞, any valid choice of the subsampling rate is guaranteed theoretically to work, but not in the pre-asymptotic regime. Our estimator, conversely, seems to perform better with no particular tuning of the parameters even in this multi-dimensional case, which demonstrates the robustness of our novel approach.

The Bayesian Approach: Bistable Potential
In this numerical experiment we consider N = 2 and the bistable potential, i.e., the function V defined as with coefficients α 1 = 1 and α 2 = 2. We then consider the multiscale equation with σ = 0.7, the fast potential p(y) = cos(y) and ε = 0.05, thus simulating a trajectory X ε . We adopt here a Bayesian approach and compute the posterior distribution µ T,ε obtained with the filtered data approach introduced in Section 4.1. The parameters of the filter are set to β = 1 and δ = ε in (3.1). Moreover, we choose the non-informative prior µ 0 = N (0, I). Let us remark that in order to compute the posterior covariance the diffusion coefficient Σ of the homogenized equation has to be known. In this case, we pre-compute the value of Σ via the coefficient K and the theory of homogenization, but notice that Σ could be estimated either employing the subsampling technique of [29] or using the estimator Σ k based on filtered data defined in (3.4). In particular, in this case Σ ≈ 0.2807, and we compute numerically Σ k (X ε , 100) = 0.2901, Σ k (X ε , 200) = 0.2835, Σ k (X ε , 400) = 0.2813, so that employing the estimator Σ k instead of the true value would have negligible effects on the computation of the posterior over the effective drift coefficient. We stop computations at times T = {100, 200, 400} in order to observe the shrinkage of the Gaussian posterior towards the MLE A k (X ε , T ) with respect to time. In Figure 7, we observe that the posterior does indeed shrink towards the MLE, which in turn gets progressively closer to the true value of the drift coefficient A of the homogenized equation.

Conclusion
In this work we considered a novel methodology to confront the problem of model misspecification when homogenized models are fit to multiscale data. Our approach is based on using filtered data for the estimation of the drift of the homogenized diffusion process. We proved asymptotic unbiasedness of estimators drawn from our methodology. Moreover, we found a modified Bayesian approach which guarantees robust uncertainty quantification and posterior contraction, based on the same filtered data approach. Numerical experiments demonstrate how the estimator based on filtered data requires less knowledge of the characteristic time-scales of the multiscale equation with respect to subsampling, and how it can be employed as a black-box tool for parameter estimation on a range of academic examples. We note that in many applications one can only obtain discrete measurements of the diffusion process. Recently, using the filtering approach developed in this paper and martingale estimating functions a new estimator for learning homogenised SDEs from noisy discrete data has been introduced [4]. We believe this work gives way to several further developments. In particular, we believe it would be relevant to (i) analyse the filtered data approach for β > 1 in (3.1), which seems to give more robust results in practice, (ii) extend the analysis to the non-parametric framework most likely by means of Bayesian regularization techniques, thus allowing to recover effective drift functions for which a parametric representation does not exist, (iii) consider multiscale models for which the homogenized equation presents multiplicative noise, (iv) test the filtered data methodology against real-world data, (v) apply similar methodologies to correct faulty behaviour of other methods. Consequently, which spans the tangent space of R 2 at (x, z), denoted T x,z R 2 . The desired result then follows from Hörmander's theorem (see e.g. [27,Chapter 6]).
Proof of Lemma 3.5. Integrating equation (3.6) with respect to z we obtain the stationary Fokker-Planck equation for the process X ε , i.e. σ(ϕ ε ) (x) + d dx α · V (x) + 1 ε p x ε ϕ ε (x) = 0, (A.1) whose solution is given by and which proves (3.8). In view of (3.7) and (A.1), equation (3.6) can be rewritten as We now multiply the equation above by a continuous differentiable function f : R 2 → R N , f = f (x, z), and integrate with respect to x and z. Then an integration by parts yields which implies the following identity in R N Finally, choosing f (x, z) = (x − z)V (z) + V (z), we obtain the desired result.
We now consider ∆ 2 (ε). Integrating equation (C.2) with respect to z and b we obtain the Fokker-Planck equation for the stationary marginal distribution λ : R × [0, L], λ = λ(x, y), of the couple whose solution is given by Therefore, since Σ = Kσ and by equations (2.4) and (3.11) we have which shows that ∆ 2 (ε) = 0 and completes the proof.