1 Introduction and history

1.1 Information geometry as the differential geometric approach to statistics

Information geometry is an informal term for the differential geometric approach to statistics or, more precisely, for the study of the differential geometric properties of sets of probability distributions, on which a manifold structure is usually built, leading to so-called statistical manifolds. A list of references describing the evolution of the discipline, far from being comprehensive, includes Rao [48], who interpreted the Fisher information matrix of a parametric family of distributions as a Riemannian metric on the given finite-dimensional statistical manifold, the dimension usually being related to the number of parameters. It is important for our paper to point out that the Fisher information matrix is naturally related to the Hellinger distance on more general infinite-dimensional spaces of probability measures, a distance based on the \(L^2\) structure of sets of square roots of probability densities. Aggarwal [1], Amari [3], Barndorff-Nielsen [13] and Pistone and Sempi [47] are other important references that contributed to the development of information geometry and are related to this article.

1.2 Information geometry and filtering dynamics

Our work concerns the application of information geometry to approximation of dynamics of probability distributions, in most cases stemming from the stochastic filtering problem.

To state it in basic terms, in stochastic filtering one observes a random signal perturbed by random noise. The unperturbed random signal cannot be observed but needs to be estimated. For example, the perturbed signal could be the radar reading of the position of a spacecraft, which would not provide the exact position of the spacecraft due to several disturbances (“noise”) in the radar observations. It would then be necessary to estimate the real position of the spacecraft from the noisy radar readings. This is a filtering problem. A filtering algorithm was used in the Apollo 11 mission (Cipra [24]), the first human landing on the moon. Filtering also has applications in areas such as water level estimation and prediction, submarine navigation, econometrics, target tracking and many others. A good historical book on filtering with an eye to applications is Jazwinski [36], see also Maybeck [43], while the mathematical aspects are considered fully in Liptser and Shiryayev [41]. More recent monographs on filtering are Ahmed [2] and Bain and Crisan [12].

The general solution of the filtering problem at a given time is given by the probability density of the unperturbed state of the system at that time, conditional on the perturbed observations up to the given time. When the unobserved signal and the observed signal evolve in continuous time, the filter density follows a stochastic partial differential equation (SPDE). It has been shown that this probability density, the solution of the SPDE, does not evolve in a finite dimensional statistical manifold, except in very special cases. For example, if the dynamics of the unobserved system is linear, the observations are linear, the noises are Gaussian and the initial condition of the unperturbed signal is also Gaussian (or deterministic), then the filter is Gaussian and its density can be characterized by a finite dimensional set of parameters, namely the mean vector and variance-covariance matrix of the resulting Gaussian distribution. This leads to the celebrated Kalman filter. However, this does not usually happen in the non-linear case, where the filter is in general infinite dimensional, as shown for the cubic sensor example by Hazewinkel et al. [34].

1.3 Classic projection filters: Stratonovich–Hellinger projection

Enter information geometry. Can information geometry provide us with a method to approximate the infinite-dimensional filter with a finite-dimensional one that is close to the original filter? The idea of applying the Fisher metric and Hellinger distance to this problem was first sketched in an article by Hanzon [32] while he was working at the Technical University of Delft. Hanzon suggested projecting the SPDE for the evolution of the filter density, written in Stratonovich form, onto a finite dimensional statistical manifold, using the Fisher metric/Hellinger distance. We call this “Stratonovich projection”; it consists in projecting separately the vector fields of the SPDE corresponding to the drift and diffusion parts of the Stratonovich version. The projected equation would describe a finite dimensional density evolution, called the projection filter, approximating the full evolution of the optimal filter. The paper was presented at a conference in Lancaster whose proceedings were edited by Christopher T. J. Dodson, in a volume with the almost prophetic title “Geometrization of Statistical Theory”. The following year, on August 22, 1988, Hanzon presented the idea at a seminar at Tokyo University called “The Projection Filter” while visiting Shun’ichi Amari. A few years later, in 1991, Hanzon and a PhD student, Ruud Hut, also from the Technical University of Delft, wrote the paper Hanzon and Hut [31] with new results on the projection filter on Gaussian densities, showing that for the Gaussian family the projection filter coincides with a heuristic-based family of finite dimensional filters, the assumed density filters, previously studied by Kushner [38], see also [43].

The projection filter idea was formulated precisely, extended and made fully rigorous in subsequent works, during the PhD studies of Damiano Brigo with Bernard Hanzon at the Free University of Amsterdam and with Francois LeGland at IRISA/INRIA, in Rennes, France, in 1993–1996 [16]. In these studies it was shown, among other things, that exponential families played a very particular role in the projection filter, allowing the correction step of the filtering algorithm to be exact, and also fully generalizing the equivalence to the assumed density filters. The filters were tested numerically on some examples. During his PhD, Brigo also authored other papers on small observation noise for the Gaussian projection filter [15, 17] and on approximations of the Fokker–Planck–Kolmogorov equation, as well as formulations of the filter in discrete time using the Kullback–Leibler information, with application to volatility modeling in finance [18]. The main results on the projection filters were published later in Brigo et al. [19, 20].

One of the key issues, from the start, was making sure that the given approximated equation for the filter density would stay on the chosen statistical manifold. The Stratonovich projection ensured this, but scholars had been studying the behaviour of stochastic differential equations on manifolds independently of the filtering application above. Among those, we refer to Elworthy [27], Emery [29], and more recently Hsu [35]. We also notice that Elworthy et al. would discuss geometric aspects of filtering theory in [28], although their book does not deal with projection filters.

1.4 Classic projection filters: Stratonovich-direct \(L^2\) projection

Brigo returned to the filtering problem as a side project in 2011, after he had moved, earlier in 2010, from a managing director position in the financial industry to a full academic position as Gilbart Chair at the Department of Mathematics of King’s College London. There, in 2011, he met a new colleague, John Armstrong, a differential geometry PhD from Oxford who had worked on almost Kähler geometry, had also spent several years in the financial industry and was now turning to a full-time academic career. Brigo explained the filtering problem to Armstrong, who immediately grasped the essential ideas and the mathematics. Brigo had already written a preprint on his new idea of applying the direct \(L^2\) structure without square roots to obtain a new type of projection filter, showing equivalence with Galerkin-based filters when using mixtures of distributions. Armstrong refined the idea and implemented the filter numerically, studying the cubic sensor problem. This led to a second wave of projection filters based on the direct \(L^2\) metric as opposed to the Hellinger distance. It turned out that, as anticipated in the preprint, while the original Hellinger-based filters worked well with exponential families, being equivalent to assumed density filters, the direct \(L^2\) filters worked best with mixture families, being equivalent to Galerkin-based filters. This research went on in 2011–2013 and was published in Armstrong and Brigo [6]. By 2012 Brigo had moved to Imperial College. During the review of [6], one of the reviewers asked in which sense, or according to which criterion, the projection filter was providing an optimal approximation of the true filter.

1.5 Is the classic projection filter an optimal approximation?

The essence of the problem of optimality of the approximation lay in the way the filtering equation was projected in the projection filter works published until then, mainly [6, 19, 20]. There are two stochastic calculi, Ito and Stratonovich. The two calculi are suited to different applications, but from a probabilistic point of view the Ito calculus has a clearer interpretation of the stochastic equation coefficients in terms of local mean and local standard deviation, linked to the martingale property. Moreover, it has been argued that even when one works with Stratonovich calculus, underneath the formalism it is still the Ito calculus that “does all the work” [49, Chapter V.30, p. 184]. The problem with Ito calculus is that it violates the chain rule for change of variables. When changing variables, one has to use Ito’s formula, which involves a second order term in the transformation.
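The difference between the two calculi can be seen numerically on the simplest example, the integral of \(W\) against \(dW\). The following is a minimal sketch, not taken from the works cited: the function name, the step count and the seed are illustrative assumptions. The Stratonovich sum (integrand averaged over the two endpoints of each step) obeys the ordinary chain rule and returns \(W_T^2/2\), while the Ito sum (integrand at the left endpoint) carries the second order correction and returns approximately \((W_T^2 - T)/2\).

```python
import numpy as np

def ito_and_stratonovich_integrals(w):
    """Approximate the integral of W dW along a sampled Brownian path w.

    The Ito sum evaluates the integrand at the left endpoint of each step,
    the Stratonovich sum at the average of the two endpoints.
    """
    dw = np.diff(w)
    ito = np.sum(w[:-1] * dw)
    strat = np.sum(0.5 * (w[:-1] + w[1:]) * dw)
    return ito, strat

# Illustrative sampled Brownian path on [0, T] (seed and step count are arbitrary).
rng = np.random.default_rng(0)
T, n = 1.0, 100_000
increments = rng.normal(0.0, np.sqrt(T / n), size=n)
w = np.concatenate(([0.0], np.cumsum(increments)))

ito, strat = ito_and_stratonovich_integrals(w)
# Discrepancy from the chain-rule value W_T^2 / 2 (Stratonovich).
print("Stratonovich minus W_T^2/2:      ", strat - 0.5 * w[-1] ** 2)
# Discrepancy from the Ito value (W_T^2 - T) / 2, small but not zero for finite n.
print("Ito minus (W_T^2 - T)/2:         ", ito - 0.5 * (w[-1] ** 2 - T))
```

The gap between the two sums is exactly half the sum of the squared increments, which is the discrete counterpart of the second order term in Ito’s formula.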

The true, infinite dimensional filter equation (taking the form of a stochastic partial differential equation, or SPDE) had always been written in Stratonovich form in the previous projection filter works, because in a Stratonovich stochastic equation the two parts describing the drift term and the diffusion coefficient term obey the chain rule under change of variables. This means that they can be interpreted as vector fields and be projected without problems on the tangent space of a submanifold, obtaining vector fields in the submanifolds that would form the approximating finite dimensional stochastic differential equation.

Projecting the Ito equation directly does not work, because the change of variables includes second order terms that do not transform as vector fields do. Projection then becomes impossible to perform directly in Ito form. One could re-write the true filter equation from Ito to Stratonovich form, project it, obtain a finite dimensional approximate filter, and transform this approximate filter equation back from Stratonovich to Ito form. But in what sense is this approximation optimal? What criterion does it minimize?

In more detail, the projection of a vector field always provides the best approximation of the original vector field. But a stochastic equation is given by two terms, the drift and the diffusion part, and if one puts the equation in Stratonovich form, the drift and the diffusion coefficients are described by two vector fields and as such can be projected. As the two vector fields are projected, each projected vector field will be the best approximation of the original vector field, but what does this mean for the solution of the stochastic equation as a whole? The stochastic equation is not just the pair of vector fields. In fact, when the equation is in Ito form, the drift and the diffusion coefficients interact when changing variables or coordinates, through the second order terms in the transformation. The fact that Stratonovich is “less good” probabilistically means that putting together two optimal projections of the coefficients to form a single Stratonovich equation does not provide a solution that is optimal in a probabilistic sense, for example in mean square.

1.6 Finding optimal projection filters

Armstrong had previously noticed that an Ito equation behaved exactly as a geometric object he was familiar with, called a 2-jet. Brigo, while helping Armstrong in developing the 2-jet interpretation of stochastic differential equations, started looking at the Schwartz Morphism as studied in Emery [29] and found it to be very close to the 2-jet approach. The 2-jet interpretation was published in Armstrong and Brigo [7, 8], which led next to Armstrong and Brigo investigating how one could project a stochastic differential equation on a sub-manifold in an optimal way. Based on Ito Taylor expansions, two different projections satisfying two different types of optimality were found, the Ito-vector and the Ito-jet projections. The Ito-jet projection is superior in terms of optimality, in that it has a higher order of optimality in a precise sense. These results were presented at ICMS in Edinburgh by Armstrong and Brigo [5], at a conference co-organized in 2015 again by Dodson, this time with Frank Critchley and Frank Nielsen. The two projections were studied further and some technical problems concerning tubular neighborhoods were solved with the help of the then PhD student Emilio Rossi Ferrucci, leading to the publication Armstrong et al. [10], see also Armstrong et al. [9], where Rossi Ferrucci helped re-derive the optimal projections through constrained optimizations as opposed to Ito Taylor expansions, and where ambient coordinates are used.

In this last paper [10], information geometry comes back as an application of the now optimal projections both in Hellinger distance and direct \(L^2\) metric, comparing them in a numerical case with the traditional Stratonovich projection of previous works. It turns out that Stratonovich is also optimal for a particular criterion that is, however, not a particularly interesting or natural one, so that the Ito-jet projection filter should be preferred in general.

In this paper we will first present a literature review of projection filtering as done by other authors, following the original papers [19, 20], and then we will explain the basic ideas of the optimal projection filters as compared to the Stratonovich ones. We will finally sketch future problems where information geometry might give a contribution. For the reader’s convenience, we summarize the different projection filtering approaches in Table 1.

Table 1 A simplified classification of projection filters (PFs)

2 Other works based on the classic projection filters

Our original work on projection filters was further studied and applied to several fields by subsequent authors. Here we mention only a few examples to illustrate the breadth of the possible use of information geometry and dynamics in applications.

Jones and Soatto [37] briefly mention the projection filter as one of the possible algorithms for on-line estimation in the context of visual-inertial navigation, mapping and localization. Lermusiaux [40] mentions the projection filter as a possible tool for estimation of uncertainties for ocean dynamics. Kutschireiter et al. [39] apply the projection filter to continuous time circular filtering. Projection filters have been applied to quantum systems, for example in van Handel and Mabuchi [52] and in Gao et al. [30]. Ma et al. [42] apply projection filters to hazard position estimation. Vellekoop and Clark [53] extend the projection filter theory to deal with changepoint detection. Tronarp and Särkkä [51] present a projection filter for systems with discrete time measurement having arbitrary likelihoods. Surace and Pfister [50] apply the Gaussian projection filter to estimate the parameters of a partially observed diffusion. Harel et al. [33] apply the assumed density filters, equivalent to the projection filters, to the filtering of optimal point processes with applications to neural encoding. Azimi-Sadjadi and Krishnaprasad [11] apply projection filter algorithms to navigation. Bröcker and Parlitz [23] apply projection filter techniques to address noise reduction in chaotic time series. Zhang et al. [54] apply the Gaussian projection filter as part of their estimation technique to deal with measurements of fiber diameters in melt-blown nonwovens. The projection filter further attracted the attention of the Swedish Defense Research Agency, which summarized and studied it in 2003 in the report [14].

3 Optimal projection filters for non-linear filtering via information geometry

We studied the application of the new projections to nonlinear filtering via information geometry in Armstrong et al. [10]. Here we summarize the results of that paper, showing how our new projection methods work for stochastic filtering. As explained in the introduction, this enhances optimality of the approximations compared to our previous works in [6, 19, 20].

Let us first summarize the filtering problem for diffusions. One has a signal X that evolves according to a SDE, and observes a process Y which is a function of this signal plus noise.

The filtering problem consists in estimating the signal X given the present and past observations Y. If t is the current time, the solution of the filtering problem is the probability density of the state \(X_t\) conditional on the observations from time 0 to time t, call this density \(p_t\). The density \(p_t\) follows the Kushner–Stratonovich (or alternatively the Zakai) stochastic partial differential equation (SPDE) that, under some technical assumptions, can be seen as a stochastic differential equation in the infinite dimensional \(L^2\) space of square roots of densities (Hellinger metric) or of densities themselves (direct \(L^2\) metric).

The process we wish to approximate on a low dimensional manifold is \(p_t\), evolving in the \(L^2\) infinite dimensional space, while the submanifold M where we seek approximation is a finite dimensional family of probability densities parametrized by \(\theta \), acting as coordinates: \(M=\{p(\cdot ,\theta ), \ \theta \in \Theta \subset {\mathbb {R}}^n\}\). We aim at finding a SDE for \(\theta \) such that \(p(\cdot ,{\theta _t})\) approximates the optimal filter \(p_t(\cdot )\) in an optimal way.
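To make the two geometries on M concrete, the following minimal sketch computes, for the Gaussian family with \(\theta = (\mu , \sigma )\), the metric induced by the \(L^2\) inner product of square roots of densities (scaled by 4, which gives the Fisher information matrix) and the metric induced by the \(L^2\) inner product of the densities themselves. The function names, the grid and the finite difference step are illustrative assumptions, not part of the works cited.

```python
import numpy as np

def gaussian(x, theta):
    """Gaussian density with parameters theta = (mu, sigma) on the grid x."""
    mu, sigma = theta
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def param_derivatives(fun, x, theta, h=1e-5):
    """Central finite differences of fun(x, theta) in each parameter."""
    theta = np.asarray(theta, dtype=float)
    derivs = []
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        derivs.append((fun(x, theta + e) - fun(x, theta - e)) / (2 * h))
    return np.array(derivs)  # shape (n_params, n_grid)

def hellinger_metric(x, theta):
    """4 * <d_i sqrt(p), d_j sqrt(p)>_{L2}: the Fisher information matrix."""
    dx = x[1] - x[0]
    d = param_derivatives(lambda x, th: np.sqrt(gaussian(x, th)), x, theta)
    return 4.0 * d @ d.T * dx

def direct_metric(x, theta):
    """<d_i p, d_j p>_{L2}: the metric induced by the direct L2 structure."""
    dx = x[1] - x[0]
    d = param_derivatives(gaussian, x, theta)
    return d @ d.T * dx

x = np.linspace(-15.0, 15.0, 6001)
theta = (0.3, 1.5)
# For the Gaussian family the Fisher matrix is diag(1/sigma^2, 2/sigma^2).
print(hellinger_metric(x, theta))
print(direct_metric(x, theta))
```

The two matrices differ, which is one way to see why projections in the Hellinger and in the direct \(L^2\) metric lead to different finite dimensional filters.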

3.1 The Kushner–Stratonovich equation

We suppose that the state \(X_t \in {\mathbb {R}}^m\) of a system evolves according to the Itô stochastic differential equation:

$$\begin{aligned} \textrm{d}X_t = f(X_t,t) \, \textrm{d}t + \sigma (X_t,t) \, \textrm{d}W_t \end{aligned}$$

where f and \(\sigma \) are smooth \({\mathbb {R}}^m\) valued functions and \(W_t\) is a Brownian motion. One typically adds growth conditions to ensure a global existence and uniqueness result for the signal equation, see for example [6] and references therein for the details.

We suppose that an associated process, the observation process, \(Y_t \in {\mathbb {R}}^d\) evolves according to the equation:

$$\begin{aligned} \textrm{d}Y_t = b(X_t,t) \, \textrm{d}t + \textrm{d}V_t \end{aligned}$$

where b is a smooth \({\mathbb {R}}^d\) valued function and \(V_t\) is a Brownian motion independent of \(W_t\). Note that the filtering problem is often formulated with an additional constant in terms of the observation noise. For simplicity we have assumed that the system is scaled so that this can be omitted.
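For readers who wish to experiment, the pair of equations above can be simulated with an Euler–Maruyama scheme. The sketch below is a minimal scalar version (\(m = d = 1\)); the function name, the argument names and the coefficients used in the example call are illustrative assumptions.

```python
import numpy as np

def simulate_signal_and_observation(f, sigma, b, x0, T, n_steps, rng):
    """Euler-Maruyama paths of dX = f dt + sigma dW and dY = b(X) dt + dV.

    Scalar state and observation; W and V are independent Brownian motions.
    Returns arrays x, y of length n_steps + 1 with y[0] = 0.
    """
    dt = T / n_steps
    x = np.empty(n_steps + 1)
    y = np.empty(n_steps + 1)
    x[0], y[0] = x0, 0.0
    for k in range(n_steps):
        t = k * dt
        dW = rng.normal(0.0, np.sqrt(dt))
        dV = rng.normal(0.0, np.sqrt(dt))
        # The observation increment uses the signal at the start of the step.
        y[k + 1] = y[k] + b(x[k], t) * dt + dV
        x[k + 1] = x[k] + f(x[k], t) * dt + sigma(x[k], t) * dW
    return x, y

# Illustrative coefficients: mean-reverting signal observed linearly.
rng = np.random.default_rng(0)
x, y = simulate_signal_and_observation(
    f=lambda x, t: -x,
    sigma=lambda x, t: 1.0,
    b=lambda x, t: x,
    x0=0.5, T=1.0, n_steps=1000, rng=rng,
)
print(x[-1], y[-1])
```

Only the path y would be available to a filter; the path x is kept here so that filter estimates can be compared with the true signal.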

The filtering problem is to compute the conditional distribution of \(X_t\) given a prior distribution for \(X_0\) and the values of Y for all times up to and including t.

Subject to various bounds on the growth of the coefficients of this equation, the assumption that the distribution has a density \(p_t\) and suitable bounds on the growth of \(p_t\) one can show that \(p_t\) satisfies the Kushner–Stratonovich SPDE:

$$\begin{aligned} \textrm{d}p_t = {{\mathcal {L}}}^*_t p_t \ \textrm{d}t + p_t[b(\cdot ,t) - E_{p_t}(b(\cdot ,t))]^T [ \textrm{d}Y_t - E_{p_t}(b(\cdot ,t)) \textrm{d}t] \end{aligned}$$
(1)

where \(E_p\) denotes the expectation with respect to the density p,

$$\begin{aligned} E_p[\psi ] = \int \psi (x) p(x) dx, \ \ E_p[\phi (\cdot ,t)] = \int \phi (x,t) p(x) dx, \end{aligned}$$

and the forward diffusion operator \({{\mathcal {L}}}^*_t\) is defined by:

$$\begin{aligned} {{\mathcal {L}}}_t^* \phi = - \frac{\partial }{\partial x^i} [ f_i(x,t) \phi ] + \frac{1}{2} \frac{\partial ^2}{\partial x^i \partial x^j} [a_{ij}(x,t)\phi ] \end{aligned}$$
(2)

where \(a=\sigma \sigma ^T\). Note that we are using the Einstein summation convention in this expression.

In the event that the coefficient functions f and b are all linear and \(\sigma \) is a deterministic function of time one can show that so long as the prior distribution for X is Gaussian, or deterministic, the density p will be Gaussian at all subsequent times. This allows one to reduce the infinite dimensional equation (1) to a finite dimensional stochastic differential equation for the mean and covariance matrix of this normal distribution. This finite dimensional problem solution is known as the Kalman filter.
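A minimal sketch of this finite dimensional reduction in the scalar case is given below, assuming the linear model \(dX = \alpha X\,dt + s\,dW\), \(dY = c X\,dt + dV\) and a plain Euler discretization. The symbols \(\alpha \), s, c and the function name are illustrative assumptions. The mean m and variance P then follow the standard Kalman–Bucy equations \(dm = \alpha m\,dt + cP\,(dY - cm\,dt)\) and \(dP = (2\alpha P + s^2 - c^2P^2)\,dt\).

```python
import numpy as np

def kalman_bucy(dY, dt, alpha, s, c, m0, P0):
    """Euler discretization of the scalar Kalman-Bucy filter.

    dY: array of observation increments. Returns the filter means and
    variances at the len(dY) + 1 grid times.
    """
    m, P = m0, P0
    means, variances = [m], [P]
    for dy in dY:
        # Both updates use the values of m and P from the start of the step.
        m, P = (
            m + alpha * m * dt + c * P * (dy - c * m * dt),
            P + (2 * alpha * P + s**2 - c**2 * P**2) * dt,
        )
        means.append(m)
        variances.append(P)
    return np.array(means), np.array(variances)

# Illustrative run on simulated data from the same linear model.
rng = np.random.default_rng(0)
dt, n = 0.01, 1000
alpha, s, c = -0.5, 1.0, 1.0
x, dY = 1.0, np.empty(n)
for k in range(n):
    dY[k] = c * x * dt + rng.normal(0.0, np.sqrt(dt))
    x = x + alpha * x * dt + s * rng.normal(0.0, np.sqrt(dt))
means, variances = kalman_bucy(dY, dt, alpha, s, c, m0=0.0, P0=1.0)
print(x, means[-1], variances[-1])
```

Note that the variance equation is deterministic: it does not depend on the observations, a feature that is lost in the non-linear case.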

For more general coefficient functions, however, Eq. (1) cannot be reduced to a finite dimensional problem [34]. Instead one might seek approximate solutions of (1) that belong to some given statistical family of densities. This is a very general setup and includes, for example, approximating the density using piecewise linear functions to derive a finite difference approximation or approximating the density with Hermite polynomials to derive a spectral method. Other examples include exponential families (considered in [19, 20]) and mixture families (considered in [4, 6]).

Our projection theory tells us how one can find good approximations on a given statistical family with respect to a given metric on the space of distributions. We illustrate this by writing down the Itô-vector and Itô-jet projection of (1) for the \(L^2\) and Hellinger metrics onto a general manifold.

A good part of the classic filtering literature focuses on the very specific case of seeking approximate solutions using Gaussian distributions. The idea of approximating the solution to the filtering problem using a Gaussian distribution has been considered by numerous authors, who have variously derived the extended Kalman filter [46], assumed density filters [38] and Stratonovich projection filters [19]. Some of these are related: for example, the assumed density filters and Stratonovich projection filters in the Hellinger metric coincide for Gaussian (and more generally exponential) families [20]. Using our projection methods, we have been able to derive projection filters which outperform all these other filters also in the specific Gaussian case (assuming performance is measured over small time intervals using the appropriate Hilbert space metric).

We will be using \(L^2\) geometry here. More generally, for the geometry of approximations to the infinite dimensional filtering problems based on \(L^2\) or Orlicz charts we refer for example to [6, 10, 19–22, 44, 45].

3.2 Stratonovich projections

The Stratonovich projection filters have been abundantly studied in [19, 20] in Hellinger metric, and in [6] in direct metric, see also references in Sect. 2 for the Hellinger case. Here we briefly summarize them. To shorten notation, we will omit time dependence when obvious from the context, so \(p=p_t\), \(b=b(\cdot ,t)\), and so on. For this method, the optimal filter SPDE is given by putting the optimal filter equation (1) in Stratonovich form, obtaining

$$\begin{aligned} dp = {{\mathcal {L}}}^*\, p\,dt - \frac{1}{2}\, p\, [\vert b \vert ^2 - E_{p}\{\vert b \vert ^2\}] \,dt + p\, [b-E_{p}\{b\}]^T \circ dY. \end{aligned}$$
(3)

For convenience, let us rewrite this as

$$\begin{aligned} dp = A \,dt + B \circ dY. \end{aligned}$$
(4)

This is a Stratonovich SPDE. Because it is a partial differential equation, and given its particular structure, the solution p will not, in general, belong to any finite dimensional family of densities \(M=\{p(\cdot , \theta ), \ \theta \in \Theta \subset {\mathbb {R}}^n\}\). As every computer implementation is inherently finite dimensional, we need a way to approximate \(p_t\) by a finite dimensional density \(p(\cdot ,\theta _t)\) for all times t.
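As a check on (3), the following formal computation recovers the Stratonovich drift A from the Itô form (1). It is a sketch under simplifying assumptions and is not taken from the works cited: we assume a scalar observation (\(d=1\)), ignore all regularity questions, write C for the Itô drift of (1), and write \(DB(p)[q]\) for the directional derivative of the map \(p \mapsto B(p)\) along q, a notation introduced here only for illustration. Since Y has quadratic variation dt, the two integrals are related by

$$\begin{aligned} B \circ dY = B \, dY + \tfrac{1}{2}\, DB(p)[B] \, dt, \qquad C = {{\mathcal {L}}}^* p - p\,[b-E_{p}\{b\}]\,E_{p}\{b\}, \qquad B = p\,[b-E_{p}\{b\}], \end{aligned}$$

so that \(A = C - \tfrac{1}{2}\, DB(p)[B]\). Differentiating B and using \(\int b\, p\, [b-E_{p}\{b\}]\, dx = E_{p}\{b^2\} - E_{p}\{b\}^2\) gives

$$\begin{aligned} DB(p)[q]&= q\,[b-E_{p}\{b\}] - p \int b(x)\, q(x)\, dx, \\ DB(p)[B]&= p\,[b-E_{p}\{b\}]^2 - p\,\big (E_{p}\{b^2\} - E_{p}\{b\}^2\big ), \\ A&= {{\mathcal {L}}}^* p - p\,[b-E_{p}\{b\}]\,E_{p}\{b\} - \tfrac{1}{2}\, p\,[b-E_{p}\{b\}]^2 + \tfrac{1}{2}\, p\,\big (E_{p}\{b^2\} - E_{p}\{b\}^2\big ) \\&= {{\mathcal {L}}}^* p - \tfrac{1}{2}\, p\,[b^2 - E_{p}\{b^2\}], \end{aligned}$$

which agrees with the dt term of (3) when \(d=1\).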

Now, with the Stratonovich filter SPDE equation above, one can do something very simple. Since the Stratonovich SPDE satisfies the chain rule, A and B behave like two vector fields in a suitable function space. So the equation is characterized by a “dt” vector field A and a “\(dY_t\)” vector field B. These are two separate vector fields and for the time being we are content with dealing with them separately, but as we will discuss later this is not a choice without consequences. Dealing with A and B separately, one can project them on the tangent space of \(M=\{p(\cdot , \theta ), \ \theta \in \Theta \subset {\mathbb {R}}^n\}\) (direct metric) or of their square roots (Hellinger metric) obtaining, in the direct metric case for example,

$$\begin{aligned} dp(\cdot ,\theta _t) = \Pi _{p(\cdot ,\theta _t)}[A] \,dt + \Pi _{p(\cdot ,\theta _t)}[B] \circ dY \end{aligned}$$
(5)

where \(\Pi \) is the tangent space projection at the denoted point for the manifold M. Applying the chain rule gives immediately a finite dimensional SDE for \(d \theta _t\) from the above equation, where the coefficients are known and where the SDE can be implemented easily in a finite dimensional setting, giving a finite dimensional filter.

This is, in a nutshell, the \(L^2\) direct metric or Hellinger projection filter. It has been studied and implemented in [6, 19, 20] and by a number of subsequent authors, as summarized in Sect. 2.
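The tangent space projection used in (5) can be sketched numerically in the direct metric. The minimal example below assumes the Gaussian family with \(\theta = (\mu , \sigma )\) and, for simplicity, projects only the \({{\mathcal {L}}}^* p\) part of A for the trivial signal \(dX = dW\), that is \(\tfrac{1}{2}\, \partial ^2 p/\partial x^2\). All function names are illustrative assumptions. The projection coefficients are obtained by solving the linear system given by the Gram matrix of the tangent vectors \(\partial p/\partial \theta _i\).

```python
import numpy as np

def gaussian_tangent_basis(x, mu, sigma):
    """Density p and tangent vectors dp/dmu, dp/dsigma of the Gaussian family."""
    p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    dp_dmu = p * (x - mu) / sigma**2
    dp_dsigma = p * ((x - mu) ** 2 / sigma**3 - 1.0 / sigma)
    return p, np.array([dp_dmu, dp_dsigma])

def tangent_projection_coefficients(v, basis, dx):
    """Coefficients of the L2-orthogonal projection of v onto span(basis)."""
    gram = basis @ basis.T * dx   # Gram matrix of the tangent vectors
    rhs = basis @ v * dx          # inner products of v with the tangent vectors
    return np.linalg.solve(gram, rhs)

def project_heat_drift(x, mu, sigma):
    """Project 0.5 * p'' (the L* p term for dX = dW) onto the tangent space."""
    dx = x[1] - x[0]
    p, basis = gaussian_tangent_basis(x, mu, sigma)
    half_p_xx = 0.5 * p * ((x - mu) ** 2 / sigma**4 - 1.0 / sigma**2)
    return tangent_projection_coefficients(half_p_xx, basis, dx)

x = np.linspace(-12.0, 12.0, 4801)
# The coefficients are the dt parts of d(mu) and d(sigma) contributed by L* p.
print(project_heat_drift(x, mu=0.4, sigma=1.3))
```

In this special case the projected vector already lies in the tangent space, so the projection is exact and returns \(d\mu = 0\), \(d\sigma = dt/(2\sigma )\), that is \(d(\sigma ^2) = dt\). For a general A or B the same linear solve gives the best tangent approximation in the chosen metric.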

We conclude the summary of the Stratonovich projections by saying that they do satisfy an optimality criterion, although it is a criterion that is somewhat unnatural and not helpful. It requires running an artificial filter in negative time and including it in the criterion to be minimized. This is summarized in Table 2.

3.3 Itô-vector projections

Let us go back to our exact filter equation in Stratonovich form (4):

$$\begin{aligned} dp = A \,dt + B \circ dY. \end{aligned}$$

Now in the Stratonovich projection filters we projected separately the vector fields A and B obtaining a projected equation. By nature, the projection is the best (optimal) approximation for A and B separately on the chosen manifold tangent space. However, does this translate into an optimality of the solution \(p(\cdot ,\theta _{t+\delta t})\) as an approximation of the exact \(p_{t+\delta t}\) for say small \(\delta t\), given that we had the optimal filter up to t and now we wish to approximate the next step \(\delta t\) (in most cases \(t=0\) as we will seek an optimal approximation from time 0)? In other terms, is there a norm \(\Vert \cdot \Vert \) for which we can say that in some sense

$$\begin{aligned} \theta _{t+\delta t} \approx \text{ argmin}_\theta \ \Vert p_{t+\delta t}- p(\cdot ,\theta _{t+\delta t}) \Vert \end{aligned}$$
(6)

so long as \(\delta t\) is small? This is a very legitimate question, and it comes from the fact that the two vector fields of a SDE or SPDE (A and B in our example) interact in a very specific way in determining the solution. If we agree it is the Ito solution we are considering primarily [49, Chapter V.30, p. 184], note that to transform a Stratonovich SDE into an Ito one with the same solution, the drift A is modified by terms involving partial derivatives of the term B. In the Ito form, therefore, there is no neat separation into two vector fields. Not just that, but the behaviour of the solution of the SDE or SPDE as a whole is more than the behaviour of the two separate vector fields A and B. This is why the optimality of the separate projections of A and B does not guarantee any optimality of the type sought in (6). Consider then (1) and write it as

$$\begin{aligned} dp = C dt + B dY_t. \end{aligned}$$

Again, this is a SPDE, this time in Ito form, and has an infinite dimensional solution in general.

The Ito-vector projection sets out to approach the problem starting from a criterion like (6). It does not resort to a Stratonovich version of the Kushner–Stratonovich equation but keeps the original Ito version.

Let us choose a norm for the space of densities, \(\Vert \cdot \Vert \) which might be the direct metric or the Hellinger metric.

We first choose the diffusion term of the approximating equation so as to minimize (but not zero) the \(\delta t\) term of the expansion of the mean square difference \(E_t[\Vert p_{t+\delta t}-p(\cdot ,\theta _{t+\delta t})\Vert ^2]\). Holding this diffusion term fixed, we then find the drift term that minimizes the \((\delta t)^2\) term of the same expansion. Note that the \(\delta t\) order term is minimized, not zeroed, so that we do not attain \((\delta t)^2\) convergence.

As a bonus, we also minimize the order 1 Taylor expansion (in t) of the norm of the expectation of the difference between the optimal filter \(p_{t+\delta t}\) and \(p(\cdot ,\theta _{t+\delta t})\), namely \(\Vert E[p_{t+\delta t}-p(\cdot ,\theta _{t+\delta t})]\Vert \).

To achieve \((\delta t)^2\) convergence, rather than \(\delta t\) convergence, we will need the Ito-jet projection.

Finally, the expectation \(E_t\) is necessary because one should not forget that p and \(p(\cdot , \theta )\) are random objects. The randomness of p, in particular, comes from Y and the random \(\theta \) is supposed to capture it optimally.

3.4 Itô-jet projection

The Ito jet projection uses the notion of metric projection.

The metric projection of a general density p in \(L^2\) onto the manifold M is the closest point on M to p and is denoted by \(\pi (p)\). This is not a vector projection; it is a projection of a point onto the submanifold M. Given that the metric projection is, according to the chosen metric, the best we can ever do in approximating p on M, as it is the closest point on M to p, we can try to find a projection filter that gets as close as possible to the metric projection. In other words, our criterion has changed to

$$\begin{aligned} \theta _{t+\delta t} \approx \text{ argmin}_\theta \ \Vert \pi (p_{t+\delta t})- p(\cdot ,\theta _{t+\delta t}) \Vert . \end{aligned}$$
(7)

The Ito jet projection satisfies the following optimality criterion: it zeroes the \(\delta t\) term and minimizes the \((\delta t)^2\) term of the Taylor expansion of the mean square of the distance in \(L^2\) or M between \(\pi (p_{t+\delta t})\) and \(p(\cdot ,\theta _{t+\delta t})\). This is the projection with the strongest optimality property among those we derived, and it converges with order \((\delta t)^2\), as opposed to the \((\delta t)^1\) of the Ito vector projection.

Again, in real applications we will not have the optimal filter at time t, so we will start our approximation directly at time \(t=0\). This is reflected in the summary Table 2, where \(t=0\) and we rename \(\delta t\) as t, assuming it is small.
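The notion of metric projection can be illustrated with a brute-force search. The minimal sketch below finds the Gaussian density closest, in the direct \(L^2\) metric, to a given density sampled on a grid. The function names, the parameter grids and the bimodal target are illustrative assumptions, and a practical implementation would use a proper optimizer rather than a grid search.

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def metric_projection(target, x, mus, sigmas):
    """Brute-force closest Gaussian to `target` in the direct L2 metric.

    Returns (distance, mu, sigma) for the best point of the parameter grid.
    """
    dx = x[1] - x[0]
    best = (np.inf, None, None)
    for mu in mus:
        for sigma in sigmas:
            dist = np.sqrt(np.sum((target - gaussian(x, mu, sigma)) ** 2) * dx)
            if dist < best[0]:
                best = (dist, mu, sigma)
    return best

x = np.linspace(-10.0, 10.0, 2001)
mus = np.linspace(-2.0, 2.0, 41)
sigmas = np.linspace(0.5, 3.0, 26)
# Illustrative bimodal target, which does not belong to the Gaussian family.
target = 0.5 * gaussian(x, -1.0, 0.5) + 0.5 * gaussian(x, 1.0, 0.5)
dist, mu, sigma = metric_projection(target, x, mus, sigmas)
print(dist, mu, sigma)
```

The returned distance is the smallest error any point of the (discretized) manifold can achieve, which is why criterion (7) measures a projection filter against \(\pi (p)\) rather than against p itself.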

3.5 Comparison of filters

In [10] we compare the different projection filters with each other in the case of a cubic sensor perturbing a linear system (where, without the perturbation, the Kalman filter would work well). In other words, the state equation is trivial, \(dX=dW\), while the observation function is \(b(x) = x + \varepsilon x^3\). For small \(\varepsilon \), this is close to a linear system, and the extended Kalman filter and other Gaussian filters are expected to perform well. The comparison includes the extended Kalman filter and the Ito assumed density filter (ADF) with assumed Gaussian density. We refer to the paper for the full details.

In [10] we first compare the direct \(L^2\) residuals for the various methods. The Itô-vector projection in the direct \(L^2\) metric results in the lowest residuals over short time horizons. The Stratonovich projection comes a close second. Over medium time horizons, the Itô-jet projection outperforms the Itô-vector projection. The projection methods outperformed all other methods, such as the extended Kalman filter and assumed density filters.

Second, in [10] we compare the Hellinger residuals for the different filters, where the projection filters are taken with respect to the Hellinger metric. This second analysis indicates that the Itô ADF and the Itô-jet projection are almost indistinguishable in their performance, and we explain why in [10]. Over the short term, the Itô-vector projection gives the best results. Over the medium term, the Itô-jet projection and the Itô ADF give the best results.
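As an illustration of what such residuals measure, the sketch below computes the direct \(L^2\) distance and the Hellinger distance between two densities sampled on a grid. It is a hedged example and not the procedure of [10]: the two Gaussian densities stand in for an optimal filter density and its approximation, the Hellinger distance is taken as the \(L^2\) norm of the difference of square roots without any normalizing factor (conventions differ), and all function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def trapezoid(values, dx):
    """Trapezoidal rule on a uniform grid with spacing dx."""
    return dx * (sum(values) - 0.5 * (values[0] + values[-1]))


def l2_residual(p, q, dx):
    """Direct L^2 distance: norm of p - q."""
    return math.sqrt(trapezoid([(a - b) ** 2 for a, b in zip(p, q)], dx))


def hellinger_residual(p, q, dx):
    """Hellinger-type distance: L^2 norm of sqrt(p) - sqrt(q), unnormalized."""
    diffs = [(math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)]
    return math.sqrt(trapezoid(diffs, dx))


dx = 0.01
grid = [-10.0 + dx * i for i in range(2001)]
p = [gaussian_pdf(x, 0.0, 1.0) for x in grid]  # stand-in for the optimal filter density
q = [gaussian_pdf(x, 0.5, 1.2) for x in grid]  # stand-in for an approximate filter density
print(l2_residual(p, q, dx), hellinger_residual(p, q, dx))
```

In a filtering experiment the same two functions would be evaluated at each time step, giving a residual curve per method that can then be compared over short and medium horizons.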

We also note that previous works such as [6, 19, 20], where we only studied the Stratonovich projection filter, considered filtering problems for systems like the cubic and quadratic sensors. For such systems, the optimal filter density would often turn out to be bimodal, and a projection filter based on a manifold consisting of mixtures of two Gaussians or of exponential families with fourth order polynomial exponents would track the optimal filter well. By contrast, approximate filters such as the extended Kalman filter, Gaussian assumed density filters and even particle filters with the same number of parameters as the projection filters would fare worse than the projection filters in terms of \(L^2\), Hellinger or Lévy–Prokhorov norms of errors.
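The reason a fourth order exponential family can follow a bimodal density, while a Gaussian cannot, can be seen in a small numerical sketch. The parameter values, grid and function names below are illustrative assumptions and are not taken from [6, 19, 20]; the density is \(p(x,\theta ) \propto \exp (\theta _1 x + \theta _2 x^2 + \theta _3 x^3 + \theta _4 x^4)\), normalized numerically on a truncated grid.

```python
import math


def exp_family_density(theta, grid):
    """Density proportional to exp(theta_1 x + ... + theta_4 x^4), normalized on the grid."""
    dx = grid[1] - grid[0]
    unnorm = [math.exp(sum(t * x ** (k + 1) for k, t in enumerate(theta))) for x in grid]
    # Trapezoidal approximation of the normalizing constant.
    z = dx * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))
    return [u / z for u in unnorm]


def count_modes(density):
    """Number of strict interior local maxima of a sampled density."""
    return sum(
        1
        for i in range(1, len(density) - 1)
        if density[i - 1] < density[i] > density[i + 1]
    )


grid = [-4.0 + 0.01 * i for i in range(801)]
# Exponent 2 x^2 - x^4 has maxima at x = -1 and x = 1: a bimodal density.
bimodal = exp_family_density((0.0, 2.0, 0.0, -1.0), grid)
# Exponent -x^2 / 2 is the Gaussian member of the family: a single mode.
gaussian = exp_family_density((0.0, -0.5, 0.0, 0.0), grid)
print(count_modes(bimodal), count_modes(gaussian))
```

A Gaussian approximation has only the quadratic exponent available, so it cannot reproduce the two separated peaks that the quartic term allows.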

We may conclude that filters based on information geometry have contributed in a relevant way to finite dimensional approximations of the optimal filter.

4 Conclusions and further work

The notion of projecting a vector field onto a manifold is unambiguous. By contrast, there are multiple distinct generalizations of this notion to SDEs, as summarized in Table 2.

The two Itô projections we recalled in this review can both be derived from minimization arguments. However, the Itô-jet projection has some clear advantages.

  • The Itô-jet projection is the best approximation to the metric projection of the true solution and leads to a mean-squared error of order \(O(t^2)\). By contrast, the Itô-vector projection only tracks the true solution with an accuracy of \(O(t)\) for the mean-square error.

  • The Itô-jet projection gives a more intuitive answer than the Itô-vector projection for the low dimensional example of the cross-diffusion considered in [10].

  • The Itô-jet projection gives better numerical results in the longer term than the Itô-vector projection in our application to filtering.

  • The Itô-jet projection has an elegant definition when written in terms of 2-jets, which is described in [10].

The Stratonovich projection satisfies an ad hoc minimization that is less appealing than the ones of the Itô projections, since it requires a deterministic anchor point at time 0 and negative time copies of the processes. The Itô-jet and Itô-vector projection arguments allow one to derive new Gaussian approximations to non-linear filters, and new exponential and mixture filters more generally, although the more general cases have not been explored in [10]. Some of the possibilities with different projections, metrics and manifolds are shown in Table 1. Further work could investigate these cases to complete the table. In the Gaussian case, which we do explore in [10] by applying the methods summarized in this review, the projection approximations are derived by fully explicit minimization arguments rather than heuristic arguments, unlike previous Gaussian approximations to non-linear filters. Thus, the notion of projecting an SDE onto a manifold, coupled with information geometry, is able to give new results even for this well-worn topic of approximate Gaussian nonlinear filtering.

A further important line of investigation could be the derivation of approximations based on approximating bases that are not made of densities or their square roots. Working with densities has the advantage of allowing information geometry to act clearly, but at the same time puts strong constraints on the approximating bases. As a simple example, one might wish to use “mixtures” of Hermite polynomials, which are not densities, as a basis for the approximation. One might wish to investigate to what extent it is possible to use non-density bases while retaining an information geometric approach.

The above development might potentially ease another fundamental problem that remains to this day: controlling the long term error of the projection filter compared to the optimal filter. This is a very difficult problem in general. Again in an information geometry setting, when the unobserved signal process X is a finite-state Markov chain, Cohen and Fausti [25] derive results on a well-controlled error, based on ergodic theory and symplectic structures. This result builds on their previous work [26]. The theory needs to be extended to the diffusion setting we have been using here, but this is a promising result in controlling the long term error between the optimal filter and the projection filter.

Table 2 Projections and the associated optimality criteria