Families of distributions
As discussed in the introduction, the idea of a projection filter is to approximate solutions to the Kushner–Stratonovich Eq. (2) using a finite-dimensional family of distributions.
Example 3.1
A normal mixture family contains distributions given by:
$$\begin{aligned} p = \sum _{i=1}^m \lambda _i \frac{1}{\sigma _i \sqrt{ 2 \pi }} \exp \left( \frac{-(x-\mu _i)^2}{2 \sigma _i^2}\right) \end{aligned}$$
with \(\lambda _i >0 \) and \(\sum _i \lambda _i = 1\). It is a \((3m-1)\)-dimensional family of distributions.
Example 3.2
A polynomial exponential family contains distributions given by:
$$\begin{aligned} p = \exp \left( \sum _{i=0}^m a_i x^i \right) \end{aligned}$$
where \(a_0\) is chosen to ensure that the integral of p is equal to 1. To ensure the convergence of the integral we must have that m is even and \(a_m < 0\). This is an m-dimensional family of distributions. Polynomial exponential families are a special case of the more general notion of an exponential family, see for example [5].
A key motivation for considering these families is that they can reproduce many of the qualitative features of distributions that arise in practice. For example, consider the qualitative specification: the distribution should be bimodal with peaks near \(-1\) and 1, with the peak at \(-1\) twice as high and twice as wide as the peak near 1. One can easily write down a distribution of approximately this form using a normal mixture. To find a similar distribution in a polynomial exponential family, one seeks a polynomial with local maxima at \(-1\) and 1, with the maximum values at these points differing by \(\log (2)\), and with second derivative at 1 equal to twice that at \(-1\). These conditions give linear equations in the polynomial coefficients. Using degree 6 polynomials it is simple to find solutions meeting all these requirements. A specific numerical example of a polynomial meeting these requirements is given in the exponent of the exponential distribution plotted in Fig. 1; see [6] for more details. We see that normal mixtures and polynomial exponential families have a broadly similar power to describe the qualitative shape of a distribution using only a small number of parameters. Our hope is that by approximating the probability distributions that occur in the Kushner–Stratonovich equation by elements of one of these families we will be able to derive a low-dimensional approximation to the full infinite-dimensional stochastic partial differential equation.
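To make this construction concrete, the following sympy sketch sets up and solves the linear conditions just described for a degree 6 exponent. It is an illustrative sketch only: the two extra constraints used to pin down a unique solution (the value of the leading coefficient and the value of the curvature at \(-1\)) are assumptions of ours, not the specific choices behind Fig. 1.

```python
import sympy as sp

x = sp.symbols('x')
a = sp.symbols('a1:7')                           # a1..a6; a0 is fixed by normalisation
q = sum(a[i - 1] * x**i for i in range(1, 7))    # exponent of the density p = exp(q + a0)

conditions = [
    sp.Eq(sp.diff(q, x).subs(x, -1), 0),                 # stationary point at -1
    sp.Eq(sp.diff(q, x).subs(x, 1), 0),                  # stationary point at  1
    sp.Eq(q.subs(x, -1) - q.subs(x, 1), sp.log(2)),      # peak at -1 twice as high
    sp.Eq(sp.diff(q, x, 2).subs(x, 1),
          2 * sp.diff(q, x, 2).subs(x, -1)),             # curvature (width) condition
    # Two illustrative choices that pin down a unique solution: a negative leading
    # coefficient (needed for integrability) and a negative curvature at -1
    # (so that the stationary points at -1 and 1 are genuine maxima).
    sp.Eq(a[5], -1),
    sp.Eq(sp.diff(q, x, 2).subs(x, -1), -10),
]
solution = sp.solve(conditions, a, dict=True)[0]
print(solution)

# Check: the exponent's real stationary points are only near -1, 0.35 and 1,
# so the resulting density is bimodal with its peaks at +-1.
print(sp.Poly(sp.diff(q.subs(solution), x).evalf(), x).nroots())
```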
Mixture families have a long tradition in filtering. Alspach and Sorenson [4] already highlight that Gaussian sums work well in nonlinear filtering. Ahmed [2] points out that Gaussian densities are dense in \(L^2\), suggesting that with a sufficiently large number of components a mixture of Gaussian densities may be able to reproduce most features of square-integrable densities.
Two Hilbert spaces of probability distributions
We have given direct parameterisations of our families of probability distributions and thus have implicitly represented them as finite-dimensional manifolds. In this section, we will see how families of probability distributions can be thought of as being embedded in a Hilbert space and hence inherit a manifold structure and metric from this Hilbert space. There are two obvious ways of embedding a probability density function on \({\mathbb {R}}^n\) in a Hilbert space. The first is to simply assume that the probability density function is square integrable and hence lies directly in \(L^2({\mathbb {R}}^n)\). The second is to use the fact that a probability density function lies in \(L^1({\mathbb {R}}^n)\) and is non-negative almost everywhere, so that \(\sqrt{p}\) lies in \(L^2({\mathbb {R}}^n)\).

For clarity we will write \(L^2_D({\mathbb {R}}^n)\) when we think of \(L^2({\mathbb {R}}^n)\) as containing densities directly; the D stands for direct. We write \({\mathcal {D}} \subset L^2_D({\mathbb {R}}^n)\) where \({\mathcal {D}}\) is the set of square integrable probability densities (functions with integral 1 which are positive almost everywhere). Similarly we will write \(L^2_H({\mathbb {R}}^n)\) when we think of \(L^2({\mathbb {R}}^n)\) as being a space of square roots of densities; the H stands for Hellinger (for reasons we will explain shortly). We will write \({\mathcal {H}}\) for the subset of \(L^2_H\) consisting of square roots of probability densities.

We now have two possible ways of formalizing the notion of a family of probability distributions. In the next section, we will define a smooth family of distributions to be either a smooth submanifold of \(L^2_D\) which also lies in \({\mathcal {D}}\) or a smooth submanifold of \(L^2_H\) which also lies in \({\mathcal {H}}\). Either way the families we discussed earlier will give us finite-dimensional families in this more formal sense. The Hilbert space structures of \(L^2_D\) and \(L^2_H\) allow us to define two notions of distance between probability distributions, which we will denote \(d_D\) and \(d_H\). In either case we have an injection \(\iota \) of the densities into \(L^2\), and the distance between two distributions \(p_1\) and \(p_2\) is defined to be the norm of \(\iota (p_1) - \iota (p_2)\). So given two probability densities \(p_1\) and \(p_2\) on \({\mathbb {R}}^n\) we can define:
$$\begin{aligned} d_H(p_1,p_2) = \left( \int (\sqrt{p_1} - \sqrt{p_2})^2 {\mathrm {d}}\mu \right) ^{\frac{1}{2}}, \ \ d_D(p_1,p_2) = \left( \int (p_1 - p_2)^2 {\mathrm {d}}\mu \right) ^{\frac{1}{2}}. \end{aligned}$$
Here, \({\mathrm {d}}\mu \) is the Lebesgue measure. \(d_H\) is the Hellinger distance between the two distributions, which explains our use of H as a subscript. We will write \(\langle \cdot , \cdot \rangle _H\) for the inner product associated with \(d_H\) and \(\langle \cdot , \cdot \rangle _D\), or simply \(\langle \cdot , \cdot \rangle \), for the inner product associated with \(d_D\).

In this paper, we will consider the projection of the conditional density of the true state of the system given the observations (which is assumed to lie in \({\mathcal {D}}\) or \({\mathcal {H}}\)) onto a submanifold. The notion of projection only makes sense with respect to a particular inner product structure. Thus, we can consider projection using \(d_H\) or projection using \(d_D\). Each has advantages and disadvantages. The most notable advantage of the Hellinger metric is that \(d_H\) can be defined independently of the Lebesgue measure and its definition can be extended to define the distance between measures without density functions (see Jacod and Shiryaev [28]). In particular, the Hellinger distance is independent of the choice of parameterization for \({\mathbb {R}}^n\). This is a very attractive feature in terms of the differential geometry of our setup.

Despite the significant theoretical advantages of the \(d_H\) metric, the \(d_D\) metric has an obvious advantage when studying mixture families: it comes from an inner product on \(L^2_D\) and so is compatible with addition in \(L^2_D\). It should therefore be relatively easy to calculate with the \(d_D\) metric when adding distributions, as happens in mixture families. As we shall see, when one performs concrete calculations the \(d_H\) metric works well for exponential families and the \(d_D\) metric works well for mixture families.
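As a purely numerical illustration (not part of the development in the paper), the two distances can be approximated by quadrature on a grid; the helper function and the particular pair of Gaussian densities below are arbitrary choices of ours.

```python
import numpy as np

x = np.linspace(-15.0, 15.0, 30001)
dx = x[1] - x[0]

def gaussian(x, mu, v):
    # Density of N(mu, v) evaluated on the grid.
    return np.exp(-(x - mu)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

p1, p2 = gaussian(x, 0.0, 1.0), gaussian(x, 1.0, 2.0)
d_H = np.sqrt(np.sum((np.sqrt(p1) - np.sqrt(p2))**2) * dx)   # Hellinger distance
d_D = np.sqrt(np.sum((p1 - p2)**2) * dx)                     # direct L^2 distance
print(d_H, d_D)
```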
The tangent space of a family of distributions
To make our notion of smooth families precise, we need to explain what we mean by a smooth map into an infinite-dimensional space. Let U and V be Hilbert spaces and let \(f:U \rightarrow V \) be a continuous map (f need only be defined on some open subset of U). We say that f is Fréchet differentiable at x if there exists a bounded linear map \(A:U \rightarrow V\) satisfying:
$$\begin{aligned} \lim _{h \rightarrow 0} \frac{\Vert f(x+h) - f(x) - A h \Vert _V}{\Vert h\Vert _U} = 0 \end{aligned}$$
If A exists it is unique and we denote it by \({\mathrm {D}}f(x)\); it is called the Fréchet derivative of f at x. It gives the best linear approximation to f near x. This allows us to define a smooth map \(f:U \rightarrow V\) defined on an open subset of U to be an infinitely Fréchet differentiable map. We define an immersion of an open subset of \({\mathbb {R}}^n\) into V to be a smooth map f such that \({\mathrm {D}}f(x)\) is injective at every point x where f is defined. This condition ensures that the best linear approximation to f is a genuinely n-dimensional map. Given an immersion f defined on a neighbourhood of x, we can think of the vector subspace of V given by the image of \({\mathrm {D}}f(x)\) as representing the tangent space at x. To make these ideas more concrete, let us suppose that \(p(\theta )\) is a probability distribution depending smoothly on some parameter \(\theta = (\theta _1,\theta _2,\ldots ,\theta _m) \in U\) where U is some open subset of \({\mathbb {R}}^m\). The map \(\theta \mapsto p(\theta )\) defines a map \(i:U \rightarrow {\mathcal {D}}\). At a given point \(\theta \in U\) and for a vector \(h=(h_1,h_2,\ldots ,h_m) \in {\mathbb {R}}^m,\) we can compute the Fréchet derivative to obtain:
$$\begin{aligned} {\mathrm {D}} i (\theta ) h = \sum _{i=1}^m \frac{\partial p}{\partial \theta _i} h_i \end{aligned}$$
So we can identify the tangent space at \(\theta \) with the following subspace of \(L^2_D\):
$$\begin{aligned} \mathrm{span}\left\{ \frac{\partial p}{\partial \theta _1}, \frac{\partial p}{\partial \theta _2}, \ldots , \frac{\partial p}{\partial \theta _m} \right\} \end{aligned}$$
(6)
We can formally define a smooth m-dimensional family of probability distributions in \(L^2_D\) to be an immersion of an open subset of \({\mathbb {R}}^m\) into \({\mathcal {D}}\). Equivalently, it is a smoothly parameterized probability distribution p such that the above vectors in \(L^2\) are linearly independent. We can define a smooth m-dimensional family of probability distributions in \(L^2_H\) in the same way. This time let \(q(\theta )\) be a square root of a probability distribution depending smoothly on \(\theta \). The tangent vectors in this case will be the partial derivatives of q with respect to \(\theta \). Since one normally prefers to work in terms of probability distributions rather than their square roots, we use the chain rule to write the tangent space as:
$$\begin{aligned} \mathrm{span}\left\{ \frac{1}{2 \sqrt{p}} \frac{\partial p}{\partial \theta _1}, \frac{1}{2 \sqrt{p}} \frac{\partial p}{\partial \theta _2}, \ldots , \frac{1}{2 \sqrt{p}} \frac{\partial p}{\partial \theta _m} \right\} \end{aligned}$$
(7)
We have defined a family of distributions in terms of a single immersion f into a Hilbert space V. In other words, we have defined a family of distributions in terms of a specific parameterization of the image of f. It is tempting to try to phrase the theory in terms of the image of f. To this end, one defines an embedded submanifold of V to be a subset of V which is covered by immersions \(f_i\) from open subsets of \({\mathbb {R}}^n\) where each \(f_i\) is a homeomorphism onto its image. With this definition, we can state that the tangent space of an embedded submanifold is independent of the choice of parameterization.
One might be tempted to talk about submanifolds of the space of probability distributions, but one should be careful. The spaces \({\mathcal {H}}\) and \({\mathcal {D}}\) are not open subsets of \(L^2_H\) and \(L^2_D\) and so do not have any obvious Hilbert-manifold structure. To see why, note that one may perturb a positive density by a negative spike with an arbitrarily small area and obtain a function arbitrarily close to the density in the \(L^2\) norm but not positive almost everywhere; see [6] for a graphic illustration.
The Fisher information metric
Given two tangent vectors at a point to a family of probability distributions, we can form their inner product using \(\langle \cdot , \cdot \rangle _H\). This defines a so-called Riemannian metric on the family. With respect to a particular parameterization \(\theta ,\) we can compute the inner product of the \(i\mathrm{th}\) and \(j\mathrm{th}\) basis vectors given in Eq. (7). We call this quantity \(\frac{1}{4} g_{ij}\).
$$\begin{aligned} \frac{1}{4} g_{ij}( \theta ) := \left\langle \frac{1}{2 \sqrt{p}} \frac{ \partial p}{\partial \theta _i}, \frac{1}{2 \sqrt{p}} \frac{ \partial p}{\partial \theta _j} \right\rangle _H &= \frac{1}{4} \int \frac{1}{p} \frac{ \partial p}{ \partial \theta _i} \frac{ \partial p}{ \partial \theta _j} {\mathrm {d}}\mu \\ &= \frac{1}{4} \int \frac{ \partial \log p}{ \partial \theta _i} \frac{ \partial \log p}{ \partial \theta _j}\, p \,{\mathrm {d}}\mu = \frac{1}{4} E_p \left( \frac{ \partial \log p}{\partial \theta _i} \frac{ \partial \log p}{\partial \theta _j}\right) \end{aligned}$$
Up to the factor of \(\frac{1}{4}\), this last formula is the standard definition of the Fisher information matrix, so our \(g_{ij}\) is the Fisher information matrix. We can now interpret this matrix as a Riemannian metric, the Fisher information metric, and observe that, up to the constant factor, it is the metric induced by the Hellinger distance. See [5] and [1] for a more in-depth study of this differential geometric approach to statistics.
Example 3.3
The Gaussian family of densities can be parameterized using the mean \(\mu \) and variance v as parameters. With this parameterization the Fisher metric is given by:
$$\begin{aligned} g( \mu , v) = \frac{1}{v} \left[ \begin{array}{cc} 1 &{} 0 \\ 0 &{} 1/(2v) \end{array} \right] \end{aligned}$$
The Gaussian family may also be considered with different coordinates, namely as a particular exponential family, see [6] for the metric g under these coordinates.
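The metric in Example 3.3 can be checked symbolically. The following sympy sketch (an illustration of ours, in the \((\mu , v)\) parameterization above) evaluates the defining integrals for \(g_{ij}\):

```python
import sympy as sp

x, mu = sp.symbols('x mu', real=True)
v = sp.symbols('v', positive=True)
p = sp.exp(-(x - mu)**2 / (2 * v)) / sp.sqrt(2 * sp.pi * v)   # density of N(mu, v)
theta = (mu, v)

# Fisher metric: g_ij = E_p[ (d log p / d theta_i) (d log p / d theta_j) ].
g = sp.Matrix(2, 2, lambda i, j: sp.simplify(
    sp.integrate(sp.diff(sp.log(p), theta[i]) * sp.diff(sp.log(p), theta[j]) * p,
                 (x, -sp.oo, sp.oo))))
print(g)   # expected: Matrix([[1/v, 0], [0, 1/(2*v**2)]]), i.e. (1/v) diag(1, 1/(2v))
```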
The particular importance of the metric structure for this paper is that it allows us to define orthogonal projection of \(L^2_H\) onto the tangent space. Suppose that one has m linearly independent vectors \(w_i\) spanning some subspace W of a Hilbert space V. By linearity, one can write the orthogonal projection onto W as:
$$\begin{aligned} \Pi (v) = \sum _{i=1}^m \left[ \sum _{j=1}^m A^{ij} \langle v, w_j \rangle \right] w_i \end{aligned}$$
for some appropriately chosen constants \(A^{ij}\). Since \(\Pi \) acts as the identity on \(w_i,\) we see that \(A^{ij}\) must be the inverse of the matrix \(A_{ij}=\langle w_i, w_j \rangle \). We can apply this to the basis given in Eq. (7). Defining \(g^{ij}\) to be the inverse of the matrix \(g_{ij}\) we obtain the following formula for projection, using the Hellinger metric, onto the tangent space of a family of distributions:
$$\begin{aligned} \Pi _H(v) = \sum _{i=1}^m \left[ \sum _{j=1}^m 4 g^{ij} \left\langle v, \frac{1}{2 \sqrt{p}} \frac{ \partial p}{ \partial \theta _j} \right\rangle _H \right] \frac{1}{2 \sqrt{p}} \frac{ \partial p}{ \partial \theta _i}. \end{aligned}$$
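The reasoning above is easy to check numerically in a finite-dimensional Hilbert space. The sketch below (an illustration of ours, in \({\mathbb {R}}^5\) with the standard inner product) implements the generic Gram-matrix formula for \(\Pi \); the Hellinger projection \(\Pi _H\) is the special case \(w_i = \tfrac{1}{2\sqrt{p}}\,\partial p/\partial \theta _i\), for which the Gram matrix is \(\tfrac{1}{4}g_{ij}\).

```python
import numpy as np

def project(v, W):
    """Orthogonal projection of v onto the span of the columns w_i of W,
    via Pi(v) = sum_i [sum_j A^{ij} <v, w_j>] w_i with A_ij = <w_i, w_j>."""
    A = W.T @ W                           # Gram matrix A_ij
    coeffs = np.linalg.solve(A, W.T @ v)  # the numbers sum_j A^{ij} <v, w_j>
    return W @ coeffs

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))           # two linearly independent vectors in R^5
v = rng.standard_normal(5)
Pv = project(v, W)
print(np.allclose(W.T @ (v - Pv), 0.0),           # residual is orthogonal to the subspace
      np.allclose(project(W[:, 0], W), W[:, 0]))  # Pi acts as the identity on w_1
```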
The direct \(L^2\) metric
The ideas from the previous section can also be applied to the direct \(L^2\) metric. This gives a different Riemannian metric on the manifold. We will write \(h=h_{ij}\) to denote the \(L^2\) metric when written with respect to a particular parameterization.
Example 3.4
In the same coordinates \(\mu \), v as in Example 3.3, the direct \(L^2\) metric on the Gaussian family is:
$$\begin{aligned} h( \mu , v) = \frac{1}{4 v \sqrt{ v \pi }} \left[ \begin{array}{cc} 1 &{} 0 \\ 0 &{} \frac{3}{8 v} \end{array} \right] \end{aligned}$$
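As with the Fisher metric, this can be verified symbolically; the sympy sketch below (again an illustration of ours) computes \(h_{ij}\) directly from its definition for the Gaussian family:

```python
import sympy as sp

x, mu = sp.symbols('x mu', real=True)
v = sp.symbols('v', positive=True)
p = sp.exp(-(x - mu)**2 / (2 * v)) / sp.sqrt(2 * sp.pi * v)   # density of N(mu, v)
theta = (mu, v)

# Direct L^2 metric: h_ij = < dp/dtheta_i , dp/dtheta_j >_D.
h = sp.Matrix(2, 2, lambda i, j: sp.simplify(
    sp.integrate(sp.diff(p, theta[i]) * sp.diff(p, theta[j]), (x, -sp.oo, sp.oo))))

expected = sp.Matrix([[1, 0], [0, sp.Rational(3, 8) / v]]) / (4 * v * sp.sqrt(sp.pi * v))
print((h - expected).applyfunc(sp.simplify) == sp.zeros(2, 2))   # True: matches Example 3.4
```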
We can obtain a formula for projection in \(L^2_D\) using the direct \(L^2\) metric and the basis given in Eq. (6). We write \(h^{ij}\) for the matrix inverse of \(h_{ij}\):
$$\begin{aligned} \Pi _D(v) = \sum _{i=1}^m \left[ \sum _{j=1}^m h^{ij} \left\langle v, \frac{ \partial p}{ \partial \theta _j} \right\rangle _D \right] \frac{ \partial p}{ \partial \theta _i}. \end{aligned}$$
(8)
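A grid-based numerical sketch of formula (8) for the Gaussian family is given below; the quadrature rule, the grid and the vector being projected are all illustrative choices of ours.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
mu, var = 0.0, 1.0

p = np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)   # density of N(mu, var)
# Tangent basis (6): partial derivatives of p with respect to mu and to the variance.
dp = np.vstack([p * (x - mu) / var,
                p * ((x - mu)**2 / (2 * var**2) - 1 / (2 * var))])

h = dp @ dp.T * dx                              # h_ij = < dp/dtheta_i , dp/dtheta_j >_D
vec = np.exp(-(x - 0.3)**2 / 2.4) / np.sqrt(2.4 * np.pi) - p   # an arbitrary vector in L^2_D
coeffs = np.linalg.solve(h, dp @ vec * dx)      # the numbers sum_j h^{ij} <vec, dp/dtheta_j>_D
Pi_vec = coeffs @ dp                            # Pi_D(vec), an element of the tangent space
print(dp @ (vec - Pi_vec) * dx)                 # residual is L^2-orthogonal to the tangent space
```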