Optimal control for estimation in partially observed elliptic and hypoelliptic linear stochastic differential equations

Multi-dimensional stochastic differential equations (SDEs) are a powerful tool to describe dynamics of phenomena that change over time. We focus on the parametric estimation of such SDEs based on partial observations when only a one-dimensional component of the system is observable. We consider two families of SDE, the elliptic family with a full-rank diffusion coefficient and the hypoelliptic family with a degenerate diffusion coefficient. Estimation for the second class is much more difficult and only few estimation methods have been proposed. Here, we adopt the framework of the optimal control theory to derive a contrast (or cost function) based on the best control sequence mimicking the (unobserved) Brownian motion. We propose a full data-driven approach to estimate the drift and diffusion coefficient parameters. Numerical simulations made on different examples (Harmonic Oscillator, FitzHugh–Nagumo, Lotka–Volterra) reveal our method produces good pointwise estimate for an acceptable computational price with, interestingly, no performance drop for hypoelliptic systems.


Introduction
We focus on statistical inference for d-dimensional stochastic dynamical systems modeled by a stochastic differential equation (SDE). We are interested in partial observations: only the first one-dimensional coordinate, denoted V t , of the system is observed, the other (d − 1)dimensional coordinates, denoted U t , are unobserved. The system is written: dV t = a 1 (V t , U t , t; θ)dt + σ 1 dW 1t dU t = a 2 (V t , U t , t; θ)dt + σ 2 dW 2t (1) where a 1 and a 2 are the two drift functions that depend both on V t and U t , θ are the drift parameters, (W 1t ) t and (W 2t ) t are two independent Brownian processes and σ 1 , σ 2 are the two diffusion coefficients. Here σ 1 is scalar and σ 2 is a d − 1 dimensional vector and the multiplication σ 2 dW 2t has to be understood componentwise in the last equation.
Let us denote B σ = σ 1 0 1,d−1 0 d−1,1 Diag(σ 2 ) the diffusion coefficient. We consider two classes of models (1). The first class, called elliptic, corresponds to an SDE with a full non degenerate diffusion coefficient i.e. the matrix B σ B T σ is full rank, X T being the transposed matrix of X . The second class, called hypoelliptic, corresponds to an SDE with degenerate stochastic noise: the diffusion coefficient B σ B T σ is not invertible, for example when σ 1 = 0, but the noise is still propagated to V t through the influence of U t . Assumptions on the drift and diffusion coefficient structures to ensure hypoellipticity are stated in the next section.
These two specificities, partial observations and hypoelliptic properties, are of increasing importance in many applications, some examples are given below. Let us first remark that these two specificities are not of the same nature. The first is linked to the type of observations. In many examples, the system is complex and is modeled by a multi-dimensional system, while the experimentalists are only able to measure, often at discrete times, a onedimensional signal. This increases the difficulty of estimating parameters of model (1). The second specificity is not linked to any experimental constraint, but is a mathematical way of describing the intrinsic noise of process (V t , U t ). It might be difficult for the modeler to know in advance if the system is elliptic rather than hypoelliptic. Unfortunately, estimation methods are often strongly different depending on the nature of the noise (see more details below), and may fail down when applied to the 'wrong' class of models. This is an advantage of our method which is the same in both cases.
Let us now give some examples of applications. Partially observed SDEs have been used in pharmacokinetics for modeling the concentration of a drug in the body, either in a elliptic or a hypoelliptic version (Ditlevsen et al. 2005;Cuenod et al. 2011). In systems biology, the stochastic version of the famous deterministic Lotka-Volterra model (see Lotka 1925) which can be found in Mao et al. (2002), Meeds and Welling (2015), Graham and Storkey (2017) describes the interaction between two species, predator and prey, through a two-dimensional elliptic system. It is often possible to observe only one of the two species, leading to partial observations. In neurosciences, several stochastic systems have been proposed to model the dynamic of one single neuron. The first equation V t corresponds to the dynamics of the membrane potential of the neuron and U t to a recovery variable, or a synaptic conductance, that can not be measured. We can cite the synaptic-conductance based models (Paninski et al. 2010(Paninski et al. , 2012Ditlevsen and Greenwood 2013) or the FitzHughNagumo model (Gerstner and Kistler 2002). These models have been proposed with stochastic noise on the synaptic conductance dynamic (U t ) only, leading to hypoelliptic SDEs (see e.g. Paninski et al. 2012, and references therein), or on both coordinates leading to elliptic SDE (Ditlevsen and Greenwood 2013). A last class of models is the stochastic hypoelliptic Damping Hamiltonian system where the first coordinate represents the position of a particle and the second its velocity (d = 2). It is natural that the (Brownian) noise appears only in the velocity coordinate, the position being defined as the deterministic infinitesimal integral of the speed. The position can be measured with precision, but the speed is not directly available. In these models, the first equation of the system reduces to dV t = U t dt.
Let us now review the estimation methods. Estimation of elliptic SDE has been widely studied. In the complete observations cases (both V t and U t observed), we can cite among others (Bibby and Sorensen 1995;Pedersen 1995;Kessler 1997;Aït-Sahalia 2008;Durham and Gallant 2002;Sørensen 2004;Beskos et al. 2006;Jensen et al. 2012;Van der Meulen and Schauer 2016). Partial observations have been considered with several approaches. The unobserved coordinates are treated as missing data and are imputed, see for examples Elerian et al. (2001), Eraker (2001, Golightly and Wilkinson (2006), Golightly and Wilkinson (2008) and Ditlevsen and Samson (2014). Some methods propose to approximate the transition density by the Euler-Maruyama scheme and consider a Monte-Carlo approximation to impute and filter the unobserved coordinates. Therefore, they are computationally intensive. We will show that the methodology we develop is less demanding in terms of computation time than theses previous methods, even if its computational cost is not negligible.
Let us now explain why the estimation of hypoelliptic systems is more difficult. Imagine that the complete observations of (V t , U t ) are available in continuous time. Estimating θ is naturally performed through the Girsanov formula, that gives directly the likelihood (Lipster and Shiryaev 2001), the diffusion coefficient being estimating by the quadratic variation for example. Girsanov formula requires the existence of a solution w.r.t. u(z) to the system of algebraic equations B σ u(z) = a(z) − α(z) where α(z) is the drift of the dominating measure (see section 7.6.4. in Lipster and Shiryaev 2001). This solution does not exist for hypoelliptic systems. The same problem occurs for most of estimation methods developed for elliptic systems that would typically require B σ B T σ to be invertible. They can thus not be applied to hypoelliptic systems.
There are only few references for hypoelliptic SDEs. The stochastic Damping Hamiltonian system (with dV t = U t dt) has been the most studied. In the parametric framework, Gloter (2006), Samson and Thieullen (2012) and Leon et al. (2019) propose Euler contrasts with a correction of the bias due to partial observations. Pokern et al. (2009) propose a Gibbs sampling in a bayesian approach and emphasize the bias induced by the hypoellipticity of the system when using the Euler pseudo-likelihood in the calculation of the posterior distribution. Bierkens et al. (2018) propos a new simulation method fo hypoelliptic conditional diffusions, which could be used in a Bayesian framework. Cattiaux et al. (2014a, b) and Comte et al. (2017) deal with the non-parametric framework. For hypoelliptic SDEs that are more general than the stochastic Damping Hamiltonian system, we are only aware of the works of Ditlevsen and Samson (2017), Melnykova (2019). Their approach is based on a discretization scheme of order 1.5, a particle filter to approximate the unobserved coordinate and a stochastic maximization of a statistical contrast. The main drawbacks are the computational time induced by the particle filter and the fastidious mathematical calculations to exhibit the sufficient statistics of the likelihood for the considered SDE model, that are then stochastically approximated. There is thus a need to develop new approaches, and especially estimation methods that can be applied to both elliptic and hypoelliptic systems.
In this paper, we propose an alternative which is computationnaly demanding but can be applied to a larger class of models, including elliptic and hypoelliptic systems and that does not require additional mathematical calculations. This strategy is based on optimal control theory. Let us recall the main question the optimal control theory aims to address. For a given dynamical system in a given initial state, which control do we have to apply to steer it to a desired behavior in an optimal way? This problem is formulated as a constrained optimization problem where the optimality is defined through the introduction of a cost function and the proposed dynamical model belongs to the contraints. Under this theory a large variety of theoretical and numerical tools have been developed to find the so-called optimal control which minimizes the proposed cost function. Recently, this theory has been advantageously used for statistical purpose. We can cite the pioneer work of Martin et al. (2001) for nonparametric estimation of B-splines. Parametric approaches have been proposed more recently by Brunel and Clairon (2015), Clairon and Brunel (2018a, b) and Iolov et al. (2017). One way of using the optimal control theory is to write an estimation criterion, typically a likelihood, as a tracking problem. Tracking problems are specific optimal control problems: the aim is to find the optimal control such that one coordinate of the system is the closest possible to a target trajectory, here the observations, on a given observation interval. The optimal control problem can be solved by the Pontryagin maximum principle (Pontryagin et al. 1962;Trelat 2005;Sontag 1998).
This idea has already been successfully developed by Brunel and Clairon (2015) to estimate the parameters of ordinary differential equations. It proves to be numerically efficient and stable, especially when the problem is ill-conditioned. Moreover, the consistency and the rate of convergence of the corresponding estimator have been proven (Clairon and Brunel 2018a, b). However, their method, based on the deterministic theory of optimal control, is restricted to the case of deterministic systems. In this paper, we extend this idea to SDE systems. To do so, we resort to the framework of the discrete optimal control theory. Theoretical and numerical results have been fully developed for linear models, this is the discrete linear-quadratic (LQ) theory, a particular case of the Pontryagin maximum principle. Indeed, this theory ensures the existence, uniqueness and gives the closed form of the solution of the control problem defining our estimation criterion. The main advantage of this theory is that it applies without any hypothesis on the diffusion coefficient B σ . Especially, it applies also to the hypoelliptic case. This is the approach that we use for linear SDEs. When estimating parameters of a nonlinear SDE, we propose to rewrite the estimation criterion to enter this theory as well. Unfortunately, the consistency and convergence results presented for ODE models in Brunel and Clairon (2015) do not apply here and we were not able to prove a theoretical result for our estimator. Nevertheless, our method gives good estimates on the different tested models during the numerical simulations and encounter no additional difficulties for hypoelliptic models comparing to the elliptic ones. The computational time of the method is reasonable though not small.
Our estimation procedure follows three steps. First we define a criterion to estimate the drift parameter θ alone, σ being considered known. This criterion is minimized through the linear-quadratic theory. This method introduces a weighting parameter w which needs to be selected. The second step consists in constructing an external criterion based on moments of the Brownian process allowing to data-select the weight w. The third step is the estimation of σ by profiling the functional used in step 2. The final estimation method is fully data-driven and there is no tuning parameter. It can be viewed as a plug-and-play methodology that can be applied to both elliptic and hypoelliptic systems.
The paper is organised as follows. Section 2 presents the models and the objectives of the estimation problem. Section 3 introduces the optimal control theory approach. Section 4 an adaptive selection of the weight w and Sect. 5 gives the complete procedure. Section 6 illustrates the approach on simulated examples. The paper ends with some discussions (Sect. 7).

Models and objectives
The estimation procedure that we propose, based on the discrete optimal control theory, is fully developed for linear models. We thus introduce a linear SDE. In Sect. 3, we nonethless present an adaptation of the estimation method to a certain kind of non-linear SDEs.

Elliptic and hypoelliptic stochastic differential equations
We consider a d-dimensional state variable Z t ∈ R d , d ≥ 2, defined for t in a time interval [0, T ]. We distinguish in the following the first observed state variable denoted V t ∈ R from the last d − 1 other unobserved variables denoted U t ∈ R d−1 . The dynamic of (Z t = (V t , U t )) t≥0 is described by the following stochastic dynamical system: with initial conditions Z 0 = (V 0 , U 0 ) possibly unknown and where W t is a m dimensional Brownian motion. The drift is assumed linear with respect to depends on the unknown parameter vector θ and may be a function of time t.
The d ×m-matrix B σ is called the diffusion coefficient and depends on an unknown parameter vector σ . We consider two cases: -Hypoelliptic SDE m < d and the matrix B σ is singular: A noticeable feature in the hypoelliptic SDE is that the equation ruling V t does not contain a stochastic part. The matrix b σ models the way the stochastic disturbance acts on the unobserved variables U t of the system and indirectly on V t through U t . In the hypoelliptic case, we consider the following assumption: (H1) Let A 1, j (t) denote the jth element of the first column of A θ (t) and B , j the jth column of B σ . For any t, there exists at least one j = 2, . . . , d such that Under assumption (H1), the noise is propagated also to the first coordinate V t . This property that the noise generates the entire space R d is a characteristic of the hypoelliptic property see e.g. Mattingly et al. (2002). Note that the standard assumptions to ensure the hypoellipticity are different from (H1) see e.g. Samson and Thieullen (2012), but it reduces to (H1) for a linear system.

Objectives and issues
The state variables Z t = (V t , U t ) are split in two because we observe only the first onedimensional state variable V t . We denote t −→ Y (t), the realization of V t . We assume that Y is discretely observed on the interval [0, T ] at times 0 = t 0 < . . . < t n = T without measurement error and denote Y = (Y 0 , . . . , Y n ) these observations. We thus have The aim of the paper is to estimate the unknown parameters θ and the initial condition Z 0 (if unknown) of model (2) in the elliptic and hypoelliptic cases using the discrete observations (Y 0 , . . . , Y n ). We will distinguish two cases: 1/ σ is known and only θ is estimated; 2/ both θ and σ are estimated.
Before introducing our approach, let us explain why the estimation problem is difficult. To estimate θ and σ of an elliptic, partially observed SDE, one can start with the discretization of the diffusion (2). The Euler-Maruyama discretization scheme at time t 1, , . . . t n is defined as follows for i = 0, . . . , n − 1 and When the system is elliptic, that is when B σ B T σ is invertible, minus twice the log-likelihood of this discretized process, assuming both V t and U t discretely observed, is then For partial observations, the log-likelihood (6) has to be integrated with respect to (U 0 , . . . , U n ). The estimator is thus defined as arg min This integral can be viewed as a filtering problem: the unobserved trajectory Kalman filter can be applied to linear SDEs. For non-linear SDEs, the extended Kalman filter can be used, or a particle albeit at the price of increased computing time. For hypoelliptic SDE, criterion (6) can not even be computed, B σ B T σ being not invertible. Therefore, there is a need of alternatives. In this paper, we take advantage of the optimal control theory to filter the unobserved coordinate or more precisely to find a surrogate value for the (unobserved) realization of the Brownian motion, under the form of a control sequence that drives the trajectory. This sequence allows to define an estimation criterion that applies also when the stochastic noise is degenerate. The estimation procedure is presented in Sects. 3, 4 and 5.

Estimation of Â via optimal control theory
In this section, we assume σ known. First we write the statistical problem into an optimal control one. The procedure is the same whether B σ is singular or not. The optimal control problem is then solved for linear models by using the linear quadratic theory (Sect. 3.2). We end this section by extending the estimation procedure to a particular but meaningful subset of nonlinear SDEs (Sect. 3.3).

Optimal control problem and definition ofÂ
The control problem introduces a control sequence such that data (Y 0 , . . . , Y n ) are close to a solution of the discretized model (5), close meaning of the order of the Euler scheme error (Bally and Talay 1996). In the context of SDEs, the natural control sequence is the increment of the Brownian motion. Let us introduce u i = W i+1 − W i , i = 0, . . . , n − 1, which will play the role of the control value at time t i . Note that u i ∈ R m , i = 0, . . . , n − 1 with a dimension m that can be different from d, thus allowing the noise to be degenerate.
The discretized model (5) can be reformulated under the form of a discrete controlled system: We denote u the vector of discrete values taken by the control: θ,σ,u , U i,θ,σ,u , the solution of (7) corresponding to the given θ , σ and u. When model (2) is true, there exists one realization of the Brownian motion such that data (Y 0 , . . . , Y n ) are a sample of model (2). Control sequence u can be viewed as this specific realization of the Brownian motion. The objective is to infer sequence u by extracting knowledge from data Y . We define a cost function C(u, Y ; θ, σ ) such that the optimum in u corresponds to this realization of the Brownian motion.
A natural cost function is the conditional distribution (also called posterior distribution) P(u|Y ; θ, σ ) of u knowing data Y . We might select filtering methods that consist in computing this conditional distribution but that are computationally demanding. An alternative is to compute the maximum of the conditional distribution P(u|Y ; θ, σ ), that can also be called Maximum A Posteriori (MAP) by analogy with the Bayesian settings. Maximizing P(u|Y ; θ, σ ) could be numerically difficult, the function being likely not concave. The optimization problem can be stabilized by regularizing it using a weight.
Let us introduce the MAP and then the weighted cost function. For a fixed value of the parameters (θ, σ ), the MAP would be defined as follows: where P(Y |u; θ, σ ) is the density of the data given the u, P(u; θ, σ ) is the density of the Brownian motion and the likelihood P(Y ; θ, σ ) does not depend on u. Let us now detail the two first terms. Conditioning on u means that V i,θ,σ,u is known. Thus conditional density P(Y |u; θ, σ ) is the distribution of the difference between Y i and V i,θ,σ,u , which is the distribution of the error of discretization of the SDE. This error has a variance of order the time step (Bally and Talay 1996). Second term P(u; θ, σ ) is the density of a discretized Brownian motion, which has a variance . Therefore, the MAP reduces tô Optimisation of the MAP (8) reveals to be intractable in practice. By replacing the scaling 1/ i of the first term by a weight w > 0, we obtain a tractable optimisation problem which falls into the linear quadratic theory framework, where the new cost function is: In the last equation, we exhibit the last observation Y n which plays a specific role in control problems. The weight w has to be chosen by the user. In Sect. 4 a procedure is proposed to select w adaptively from data Y . For a given weight w, the best control is the sequence u that minimizes C w (u, Y ; θ, σ, Z 0 ) under the constraint of model (7). It is defined as the solution of the following optimal control problem: This is called the optimal control and denoted u θ,σ . For a fixed value σ and given the discrete observations (Y 0 , . . . , Y n ), we then define the estimator of θ as: is the profiled cost C w over the set of possible sequences u, or if the initial conditions Z 0 is unknown: the profiled cost C w over u then over the possible initial condition values. The computations of S w and u θ,σ require solving optimal control problem (9). We have to prove the existence and uniqueness of the solution of problem (9), and that this unique solution is numerically computable. These two results are given by the linear quadratic theory (Trelat 2005;Sontag 1998) which is exposed below. Interestingly, this theory also ensures profiling on Z 0 is not computationally demanding.
Remark 1 If, instead of solving problem (9), we set u = 0, we end up with an estimator somewhat similar to the Trajectory Fitting Estimator (TFE) (Kutoyants 1991;Dietz 2001). However, the addition of the penalization term 1 i u T i u i in C w and its optimization over u allows us to be able to deal with the partially observed framework for both elliptic and hypoelliptic models which is not the case for TFE.

Linear quadratic theory
Linear quadratic theory is derived from the Pontryagin maximum principle (Pontryagin et al. 1962) for linear models. For a given (θ, σ ), it ensures the existence and uniqueness of u θ,σ , the solution of (9), and that u θ,σ and S w (Y ; θ, σ ) can be computed via a backward finite difference equation, called the Riccati equation.
Let us be more precise. For a given (θ, σ ), and a given weight w, let us denote the following matrices, for i = 0, . . . , n − 1: Set R n,θ,σ = Q n and let R i,θ,σ be the sequence of positive symmetric matrices, solution of the discrete backward Riccati equation, for i = n − 1, . . . , 0: We can establish the following theorem.
The fundamental theorem used to derive these expressions for S w is recalled in "Appendix".
Theorem 1 provides a closed form expression for S w that depends on matrix R 0,θ,σ obtained by solving Riccati equation (11).
Let us now highlight a noticeable advantage of our approach. Theorem 1 holds for elliptic (m = d) SDEs as well as for hypoelliptic (m < d) SDEs. Indeed C ω is strictly convex with respect to the sequence u due to the term u T i u i , no matter the dimension of the noise. Applications of linear quadratic theory does not require the matrix B σ B T σ to be full-conditioned. It can thus be applied to both elliptic and hypoelliptic cases without any additional calculations.
Note that the cost function can be easily profiled on initial condition Z 0 . This is due to the quadratic nature of min u∈R m×(n−1) C w (u, Y ; θ, σ, Z 0 ) with respect to Z 0 . Thus, the construction of S w for unknown initial conditions is similar to the deterministic Kalman Filter state estimator derivation (see Sontag 1998).
The computation of the cost function C ω is in the framework of the discrete LQ theory. This ensures the well posedness nature of the optimization problem (uniqueness, existence and continuity of the solution w.r.t problem parameters) defining the cost S w as well as an efficient method to compute it no matter the elliptic nature of the SDE model. Fom equation (11), we can easily establish by induction that for every i, R i,θ,σ has the same degree of regularity with respect to model parameters (θ, σ ) than A θ and B σ . This in turn applies as well to S w . So, if we assume θ belongs to a bounded subset Θ and θ −→ A θ is continuous on Θ, θ is defined as the minimizer of a continuous functional S w on a bounded subset which ensures its existence. This also allows the use of classic optimization methods during the numerical analysis in Sect. 6.

Extension to some nonlinear models
To extend the method, we resort to the pseudo-linear formalism presented for example in Cimen and Banks (2004a, b). A nonlinear SDE can always be written through a pseudo-linear representation: even if this representation is non unique. Here, we focus on the case where a pseudo-linear matrix A θ which only depends on the observed coordinate V t can be found In this case the Euler-Maruyama discretization scheme becomes and we are able to use Theorem 1 to compute the profiled cost S w . The next step is the selection of the weight w involved in the definitions of the optimal control problem (9) and the estimator θ w (σ ). This is explained in the next section.

Adaptive w selection
The choice of the weight w is critical to ensure the numerical stability of the procedure. We propose to select it adaptively by minimizing an external functional criterion G(Y ; σ, w, θ). We propose two possible choices for G: 1. G (1) , a criteria quantifying the consistency between the estimated optimal control u θ,σ and a moment of the quadratic growth of Brownian motion increments, 2. G (2) , a contrast function based on a state variable estimator obtained as a by-product of u θ,σ estimation.
For the two functionals G (1) and G (2) , the selected weight w and the final estimator of θ are computed through the following steps.
3. Define the final estimator of θ as The two functionals are presented in the next sections.

w selection via a quadratic growth moment condition
We present the first functional G (1) derived as a constraint of the distribution of the control. The control u θ,σ mimics the increments of the Brownian motion. Thus, it is natural to impose that the values u i,θ,σ √ i are independent and distributed with N (0, I m ). We can then derive constraints related to this distribution. Let us denote u i, j,θ,σ the j−th component of the m dimensional Brownian motion at time t i , for j ∈ 1, m . The law of large numbers implies that, almost surely, We thus propose the following functional to select w: Here, the dependency of G (1) (·) with respect to w is made through the optimal control sequence u θ,σ . This functional gives importance to the control sequence property.

w selection via a data fidelity criterion
The second functional is driven by the idea of giving importance to the data fidelity term, that is to minimize the distance between model solution Z i+1,θ,σ,u and data Y i . We can show by recurrence that for i = 0, . . . , n − 1 and any integer 0 ≤ l ≤ i − 1: = 1 when k = 0. Thus, by multiplying by C, we can explicitly link the observations Y and the control: To ease the reading, let us introduce the m-vector and the sequence of scalars, for i = 0, . . . , n − 1: We can derive the law followed by X i,θ,σ,u = l k=0 Γ θ,σ (k, Let us choose l such that γ 2 l,θ,σ (t i ) is non-zero. Let us explain the intuition why choosing a sufficiently large index l yields γ 2 l,θ,σ (t i ) > 0. The idea is to give time to the stochastic elements u i to diffuse. In other words, after "enough time" (characterized by l), the elements u i are able to perturb the observations Y i+l+1 and σ becomes univocally identified.

Estimation of 2
We now consider the estimation of σ 2 . The quadratic variation of the stochastic process divided by T provides an estimator of σ 2 . It can not be applied with partial observations because only the first coordinate is observed. We therefore propose to minimize an estimation criterion. It turns out that the two functionals G (1) and G (2) presented in Sect. 4 can be used to estimate σ 2 . It might seem counter-intuitive that the same criterion could be used to estimate a weight parameter and a diffusion coefficient. However, let us recall that the two functionals have been constructed as constraints of the optimal control problem. These constraints are not specifically linked to the weight parameter but to the model itself. They can thus be judiciously used to estimate also the diffusion coefficient B σ .
Our proposal is a nested procedure that provides the final estimation of θ , σ 2 and the data-driven selection of the weight w.
For the sake of comparison, we also estimate parameters by maximum likelihood approximation via particle filter methods. For elliptic models, we rely on the iterative filtering algorithm proposed by Ionides et al. (2006) implemented in the R package 'pomp' (King et al. 2016). This method embeds a parameter perturbation step. It is though to be computationally faster than the direct approach of plugging a likelihood estimate from a particle filter directly into a stochastic optimization algorithm (Ionides et al. 2011). The algorithm is tuned as suggested in their paper. The corresponding estimators are denoted with the upper symbol I . For the hypoelliptic case, we are only aware of the method developed by Ditlevsen and Samson (2017) based on the 1.5 order discretization scheme of the SDE and which uses a particle filter coupled to a stochastic approximation EM algorithm. The corresponding estimators are denoted with the upper symbol DS.
For each model, 100 trajectories are simulated with a discretization scheme with n s = 10 5 equidistant points from which we subsample n = 10 3 equispaced points for parameter estimation purpose. For the elliptic case, we use an Euler-Maruyama scheme and for hypoelliptic models, a 1.5 order scheme to propagate the noise into the observed state variable (Kloeden and Platen 1992).
The mean and standard error (SE) of the estimators are reported as well as the mean computionnal time. Our method was coded with Matlab, the required time was computed by using the function 'cputime'. Optimization required for inner and outer criteria minimization were both conducted using the function 'fminsearch' (a derivative free simplex search method). The filtering based methods was implemented in R. The simulations have been run on a personal Dell machine using one core Intel(R) Core (TM) i7-4790S CPU @ 3.20Ghz.
Three models are used: the Lotka-Volterra model, the FitzHugh-Nagumo model and the Harmonic Oscillator (HO). The two firsts are non linear and HO is linear. For all cases, we profile on initial conditions.

Simulation study for elliptic systems
Two models are studied in their elliptic version, the Lotka-Volterra system and the FitzHugh Nagumo model. The Iterated Filtering requires the setting of four hyperparameters, the number of iterations M, here M = 200, the number of particles P required by the embedded sequential Monte-Carlo algorithm estimating the likelihood, here P = 1500, the cooling factor a, here a = 0.8 and the perturbation scales σ p which is the standard deviation of the perturbed random walk, here we set σ p = 0.2. In contrast, we recall our method only involves one weighting parameter w for which an adaptive selection method has been proposed. For each model, we precise an interval [w min , w max ] where the value w will be searched by using a simple dichotomy method.

Elliptic Lotka-Volterra process
Stochastic versions of Lotka-Volterra model (see Lotka 1925) have been proposed (Meeds and Welling 2015;Graham and Storkey 2017;Mao et al. 2002). It is a predator-prey model that describes the dynamics of two interacting populations. The preys are assumed to have an unlimited food supply. Let V t and U t denote the number of predators and preys at time t, respectively. The dynamics of V t and U t are described as: where θ 1 is the death rate of the predator, θ 2 is the growth rate of the predator population, θ 3 is the exponential growth of the prey, θ 4 is the rate of predation upon the prey which reflects the interaction between the two species, (W 1,t ) t≥0 and (W 2,t ) t≥0 are two independent Brownian motions, and σ is the diffusion coefficient assumed to be the same for both coordinates. The system is thus elliptic.
The objective is to estimate these parameters from discrete observations of the predator population only. For the sake of parametric identifiability, θ 2 and θ 3 are considered known. The system (23) is non linear due to the interaction term between predators and preys. Thus, we need to use the nonlinear extension approach exposed in Sect. 3.3. Matrix is used in Eq. (13) to proceed to parametric estimation as in the linear case. We have B σ = σ 0 0 σ . Note that to ensure that G (2) is well defined, it is sufficient to take l = 0, as γ 2 0,θ,σ (t i ) = i σ 2 > 0. We set [w min , w max ] = 10, 10 3 . Results are presented in Table 1. The two weight selection procedures with σ known give similar results. Interestingly, they give less biaised estimators than the one proposed in Ionides   (2006) and at a very reasonable computational cost. When σ is also estimated, G (1) and G (2) minimization give different estimators for the drift and different computational time. However, they both give two accurate estimators of σ comparing toσ I . In particular, θ (2) and σ (2) are less biased estimators than θ I andσ I and are obtained much more quickly. One can argue that other hyperameter values would have leadσ I to be more accurate. To investigate this, we plot in Fig. 2 the graphs of mean square error with respect to varying cooling factor a and perturbation scales σ p , the other hyperparameters being set to the value given in the previous subsection. The chosen value for a does not have a huge impact on estimation and the selected σ p gives accurate estimation comparing to the other values. Overall the selected hyperparameters (a, σ p ) do not undermine the accuracy of the method.

Elliptic FitzHugh-Nagumo
The FitzHugh-Nagumo model (FHN) describes the dynamic of an excitable neuron, modeling the characteristic spikes of the neuron (FitzHugh 1961;Nagumo et al. 1962). It is defined by a system of two differential equations which model the membrane potential of the neuron and a recovery channel mimicking the opening/closing of ion channels. More formally, let V t and U t denote the membrane potential at time t and the value of the recovery channel at time t, respectively. The stochastic FHN model is defined as: where is a time scale parameter, γ and β are kinetic parameters, σ 1 and σ 2 are the two diffusion coefficients, (W 1,t ) t and (W 2,t ) t are two independent Brownian motions. Only the membrane potential V t can be measured experimentally.
Trajectories are simulated on the interval [0,20]. Parameters are set to = 0.1, γ = 1.5, β = 0.8, σ 1 = 0.1, and σ 2 = 0.3. An example of trajectory for the model (25) is presented on Fig. 3. The objective is to estimate θ = ( , γ , β) and σ = (σ 1 , σ 2 ) from the discrete observations of the first coordinate (Y 1 , . . . , Y n ). The values are Means (and standard errors) from 100 simulations. Mean individual CPU times are given for each estimation procedure The FHN model (25) includes a non-linear term V 3 in the first equation and a constant term β in the second equation. To apply our estimation procedure, we consider a pseudo-linear representation of FHN by introducing a new variable R t assumed to be constant and equal to 1. This can be done by increasing the number of coordinates by one and stating d R t = 0 and R 0 = 1: As explained in Sect. 3.3, the pseudo linear representation is not necessarily unique. Another choice for A θ (V t , t) could have been: In this case, we would have token Y 3 to replace V 3 instead of replacing V − V 3 by (1−Y 2 )V in the original SDE (25). But this reveals to be numerically unstable due to the dramatic propagation of the discretization error committed by using Y 3 instead of V 3 . The estimation procedures are then applied with matrix (26) with d = 3, C = ( 1 0 0 ) and B σ = To estimateσ (2) defined by (19), note that again the index l = 0 is sufficient as γ 2 0,θ,σ (t i ) = i σ 2 1 > 0. Here, we set [w min , w max ] = [50, 100]. Results are presented in Table 2. When σ is known the situation is similar to the Lotka-Volterra example. This is not the case when σ is also estimated, G (2) minimization gives a more accurate estimator of σ 2 than G (1) minimization. It has to be noted σ 2 is the diffusion of the unobversed state variable. Interestingly, the estimator given by Ionides et al. (2006) encounters difficulties for estimating the diffusion σ , in particular σ 2 . Only our approach gives an accurate estimator of the diffusion at a reasonable computational price.

Simulation study for hypoelliptic systems
We now compare the estimation procedures on two hypoelliptic systems, the Harmonic Oscillator and a hypoelliptic FitzHugh-Nagumo neuronal model. The approach developed in Ditlevsen and Samson (2017) is based on a Stochastic approximation of the EM algorithm which requires to set a number of iterations, a number of particles and a decreasing sequence of positive numbers. For both tested models, we choose the value given by Ditlevsen and Samson (2017) for these hyperparameters. Moreover, this approach requires the formal computation of model specific sufficient statistics which can be mathematically cumbersome. In contrast, the method we propose here does not require any modification comparing to the elliptic case and we select w as before.

Harmonic oscillator
The Harmonic Oscillator is a mechanistic model describing oscillations governed by a white noise. It is described by a system of two equations, denoted V t and U t , the noise entering only in the second equation (Pokern et al. 2009). The model is defined as follows: with D, δ, σ > 0 and (W t ) t a Brownian motion. Trajectories are simulated on the interval [0, 20]. Parameters are set to D = 4, δ = 0.5 and σ = 0.5. The objective is to estimate θ = (D, δ) and σ from the discrete observations of the first coordinate (Y 1 , . . . , Y n ). We set [w min , w max ] = 10 10 , 10 15 . Again, let us give some details on the procedure for obtainingσ (2) . We have B σ = 0 σ and A θ = 0 1 −D −δ .
So Γ θ,σ (0, t i ) = 0, Γ θ,σ (1, t i ) = 3 2 i σ . We can deduce that γ 2 1,θ (t i ) = 3 i σ 2 = 0. This order of variance for V t is the true one, as explained in Samson and Thieullen (2012). Thus our procedure automatically propagates the noise from the second coordinate to the first one, yielding to an invertible covariance matrix of the process. In that sense, it is close to what is proposed by Ditlevsen and Samson (2017).
Results are presented in Table 3. For σ known, both weight selection methods give accurate estimation at a reasonable computational cost. When σ is estimated, our approach gives good results and is faster than the Ditlevsen and Samson (2017) procedure.

Hypoelliptic FitzHugh-Nagumo
We consider the FHN model without noise on the first coordinate, as studied by Ditlevsen and Samson (2017). The model is thus defined by: and we refer to Sect. 6.1.2 for the description of the variables and parameters.
Trajectories are simulated on the interval [0,20]. Parameters are set to = 0.1, γ = 1.5, β = 0.8, and σ = 0.3. As in the elliptic case, the objective is to estimate θ = ( , γ , β) and σ from the discrete observations of the first coordinate. We set [w min , w max ] = [50, 500]. The values are Means (and standard errors) from 100 simulations. Mean individual CPU times are given for each estimation procedure The values are Means (and standard errors) from 100 simulations. Mean individual CPU times are given for each estimation procedure Model (28) is non-linear and time-inhomogenous. As done in the elliptic case, we choose the pseudo linear matrix (26).
Forσ i σ . We can deduce from that γ 2 1,θ (t i ) = 3 i σ 2 ε 2 = 0. This corresponds to the first term of the exact variance of V t , as proved by Ditlevsen and Samson (2017).
Results are presented in Table 4. Our method shows a larger bias than Ditlevsen and Samson (2017) for parameters γ and β but very good results for which is known to be difficult to estimate.
The most striking and somewhat counter-intuitive result with our method is that the computational time in the hypoelliptic case, especially when σ is estimated is smaller than for the elliptic case. Indeed, for a fixed σ , the complexity of the optimization problem linked to the drift parameter estimation θ does not depend on the singularity of B σ . It only relies on the dimension of θ and the dimension of the control space. So the smaller the dimension of σ , the quicker the minimizer of S w is found. Having the dimension of σ smaller in the hypoelliptic example than in the elliptic one explains the observed difference in the computational time.

Conclusion and discussion
In this work, we propose a new method based on control theory to estimate parameters in SDEs. Its main feature is to propose a unified framework for both elliptic and hypoelliptic models by using a criterion focusing on estimating the Brownian motion realization given the observation rather than solely on the observation. By doing so, we manage to construct a criterion S w well defined no matter the structure of B σ with only one hyperparameter to be set. Another advantage of the method is the reasonable computational time. The use of the discrete LQ theory allows us to avoid the use of MCMC and/or other computationally costful stochastic approximation.
However, because of the nested structure of our estimation procedure, we can see that the computational time drastically increases when σ has to be estimated. In particular it is very sensitive to the dimension of σ . This can be a limitation for high dimensional systems. Interestingly, in constrast to classic statistical approaches, the hypoelliptic nature of a system is an advantage for us because the dimension of σ is smaller than the one present in its elliptic counterpart. A perspective could be to investigate a criterion which allows us to simultaneously estimate θ and σ instead of using a nested procedure. However, so far we have not manage to find a criterion which fits in the framework of the discrete LQ theory. This leads us to our second limitation, our method is currently restricted to linear models or non-linear ones with a specific structure (i.e models only nonlinear w.r.t the observed state variable). A main challenge would be to extend our method to general nonlinear SDEs.
where u = {u 0 , . . . u N −1 } and the state variable sequence x = {x 0 , . . . , x N } are linked by the finite difference equation: The derivation of the next theorem by the optimality principle can be found in Bertsekas (2005).
Theorem 2 Let us assume that S is positive semi-definite, for all i ∈ 0, N −1 , Q i is positive semi-definite and R i is positive definite. Then the cost (29) reaches its global minimum for the control sequence u * given by: and the minimal cost value is equal to: where P k is given by the discrete time Riccati difference equation: