Journal of Signal Processing Systems, Volume 61, Issue 1, pp 51–59

A Comparison of Variational and Markov Chain Monte Carlo Methods for Inference in Partially Observed Stochastic Dynamic Systems

Authors

  • Y. Shen
    • Neural Computing Research Group, Aston University
  • Cedric Archambeau
    • Department of Computer Science, University College London
  • Dan Cornford
    • Neural Computing Research Group, Aston University
  • Manfred Opper
    • Artificial Intelligence Group, Technical University Berlin
  • John Shawe-Taylor
    • Department of Computer Science, University College London
  • Remi Barillec
    • Neural Computing Research Group, Aston University

DOI: 10.1007/s11265-008-0299-y

Cite this article as:
Shen, Y., Archambeau, C., Cornford, D. et al. J Sign Process Syst (2010) 61: 51. doi:10.1007/s11265-008-0299-y

Abstract

In recent work we have developed a novel variational inference method for partially observed systems governed by stochastic differential equations. In this paper we compare the Variational Gaussian Process Smoother with an exact solution computed using a Hybrid Monte Carlo approach to path sampling, applied to a stochastic double-well potential model. It is demonstrated that the variational smoother provides a very accurate estimate of the mean path, while the conditional variance is slightly underestimated. We conclude with some remarks on the advantages and disadvantages of the variational smoother.

Keywords

Data assimilation, Signal processing, Nonlinear smoothing, Variational approximation, Bayesian computation

1 Introduction

Stochastic dynamical systems [1] have been used to model real-life systems in areas ranging from physics [1] to systems biology [2] to environmental science [3]. Such systems are often only partially observed, which makes statistical inference in them difficult. The inference problem for stochastic dynamical systems usually includes both state and parameter estimation. In this paper, we focus on state estimation and assume that the system equation and its parameters are both known a priori. In statistical signal processing, this is known as the filtering and/or smoothing problem [4]. It is known that the Kushner-Stratonovich-Pardoux (KSP) equations are the optimal solution to a general filtering/smoothing problem [5–7]. For linear systems, the filtering part of the KSP equations reduces to the well-known Kalman-Bucy filter [8], which is computationally very efficient. For non-linear dynamics in general, however, filtering/smoothing is still a challenging problem because a numerical solution to the KSP equations is not feasible for high-dimensional systems. Recently, a variational smoothing algorithm was proposed in [10]. This paper illustrates the performance of that computationally efficient algorithm by comparing it with a Markov Chain Monte Carlo (MCMC) smoother.

Mathematically, a stochastic dynamical system is often represented by a stochastic differential equation (SDE) [11]:
$$d{\bf X}(t) = {\bf f}({\bf X}, t)dt +(2{\bf D})^{1/2}(t) d{\bf W}(t),$$
(1)
where \({\bf X}(t) \in {\cal R}^d\) is the state vector, \({\bf D} \in {\cal R}^{d \times d}\) is the so-called diffusion matrix, and \({\bf f}\) represents a deterministic dynamical process. The driving noise process is represented by a Wiener process \({\bf W}(t)\). Equation 1 is also referred to as a diffusion. Note that the diffusion matrix \({\bf D}\) is assumed to be state-independent. The state is observed via some measurement function \({\bf H}(\cdot)\) at discrete times, say \(\{t_k\}_{k = 1, ..., M}\). The observations are contaminated by i.i.d. Gaussian noise:
$${\bf y}_k = {\bf H}\big({\bf X}(t_k)\big) + {\bf R}^{\frac{1}{2}} \cdot {\boldsymbol{\xi}}_k$$
(2)
where \({\bf y}_k \in {\cal R}^{d'}\) is the k-th observation, \({\bf R} \in {\cal R}^{d' \times d'}\) is the covariance matrix of measurement errors, and \({\boldsymbol{\xi}}_k\) represents multivariate white noise. A Bayesian approach to filtering/smoothing is typically adopted, in which the posterior distributions \(p({\bf X}(t)|\{{\bf y}_1, ..., {\bf y}_k, t_k < t\})\) and \(p({\bf X}(t)|\{{\bf y}_1, ..., {\bf y}_M\})\), respectively, are to be formulated and estimated. Theoretically, the optimal estimate of \(p(\cdot)\) is the solution to the corresponding KSP equations. Computational approaches are either based on a variety of approximation schemes or achieved by MCMC sampling methods.

Using Markov Chain Monte Carlo [12], one is able to sample from a posterior distribution exactly. At each step of an MCMC simulation, a new state is proposed and is accepted or rejected in a probabilistic way. In applications to stochastic dynamical systems, this is also referred to as path sampling. A path sampling approach to discrete-time state-space models has been addressed in [9] and many references therein. In those works, a Gibbs sampler with single-site updates was used. To achieve better mixing, several algorithms using multiple-site updates are explored in [13]. Recently, a Hybrid Monte Carlo (HMC) algorithm for path sampling was proposed in [14]. The HMC method updates the entire sample path at each step of path sampling while keeping the acceptance rate of new paths high. In this work, we first scrutinise the use of HMC for non-linear smoothing and then assess the performance of the variational smoother proposed in [10] by comparing its results with those of HMC.

In contrast to MCMC, all other approaches to non-linear filtering/smoothing, including the one proposed in [10], are based on a particular approximation scheme. The extended Kalman filter is the first attempt to tackle the non-linearity by linearising the dynamics around the currently available state estimate [15]. However, unstable error growth is observed in such linearisation methods [16]. To alleviate this difficulty, the Ensemble Kalman Filter (EnKF) was introduced in [17]. An ensemble of states is integrated forward in time, so that the Kalman gain can be estimated from error covariances which are not propagated but calculated from the ensemble of states at each time step. Note that this method drops the linear approximation of the non-linear dynamics while keeping the Gaussian assumption on the error statistics. The particle filter (PF) proposed in [18] represents a different direction of approximation strategies. Essentially, the posterior density of the filtering variables in PF is approximated by a discrete distribution with random support. Each point in the discrete support is called a particle, and its probability mass is regarded as its weight. It will be seen that the approximation strategy implemented in [10] is distinct from those in the above methods.

In essence, the variational smoother in [10] makes a global linear approximation of non-linear dynamics. This implies a Gaussian approximation of the posterior process
$$p\left({\bf X}(t)|\left\{ {\bf y}_1, ..., {\bf y}_M \right\}\right).$$
The quality of the approximation is measured by the Kullback-Leibler (KL) divergence [19] between the true and approximate posterior, and the optimal approximate posterior is obtained by minimising the KL divergence. Following this, any statistical inference in the true system is based on the approximate posterior. This method falls within the framework of variational approximation for Bayesian inference, which is computationally very efficient and popular in the machine learning community [20].

The structure of this paper is as follows: First, we present a Bayesian treatment of non-linear smoothing in Section 2. In Section 3, the MCMC method is described in detail, while we give a summary of the variational smoother in Section 4. For detailed proofs, we refer to [10]. After that, we compare both methods in Section 5 by numerical experiments with a double-well potential system. The paper concludes with a discussion.

2 Bayesian Approach to Non-linear Smoothing

Both for the MCMC method in [14] and for the variational smoother in [10], stochastic differential equations are discretized by using an explicit Euler-Maruyama scheme [11]. The discretized version of Eq. 1 is given by
$${\bf x}_{k+1} = {\bf x}_k + {\bf f}({\bf x}_k, t_k)\delta t +(2{\bf D})^{1/2}(t_k) \sqrt{\delta t} \cdot \xi_k,$$
(3)
with \(t_k = k \cdot \delta t\), \(k = 0, 1, ..., N\), and a smoothing window from \(t = 0\) to \(T = N \cdot \delta t\). Note that the \(\xi_k\) are white noise. An initial state \({\bf x}_0\) needs to be set. There are \(M\) observations within the smoothing window, and they are denoted by
$$(t_{k_j}, {\bf y}_j)_{j = 1, ..., M} \quad \mbox{with} \quad \big\{ t_{k_1}, ..., t_{k_M}\big\} \subseteq \big\{ t_0, ..., t_N\big\}.$$
In the following, we formulate the posterior distribution step by step.
The prior of a diffusion process can be written down as
$$p\big({\bf x}_0, ...., {\bf x}_N\big) = p({\bf x}_0) \cdot p\big({\bf x}_1|{\bf x}_0\big) \cdot .... \cdot p\big({\bf x}_N|{\bf x}_{N-1}\big),$$
where \(p({\bf x}_0)\) is the prior on the initial state and \(p({\bf x}_{k+1}|{\bf x}_k)\) with \(k = 0, ..., N - 1\) are the transition densities of the diffusion process. For small enough \(\delta t\), those transition densities can be well approximated by a Gaussian density [21]. Accordingly,
$$p\left({\bf x}_{k+1}|{\bf x}_k\right) = {\cal N}\left({\bf x}_{k+1} | {\bf x}_k + {\bf f}({\bf x}_k, t_k) \delta t, 2{\bf D} \delta t\right).$$
Therefore, the prior is given by
$$p\left({\bf x}_0, ...., {\bf x}_N\right) \propto p ({\bf x}_0) \cdot \exp(-{\cal H}_{{\rm dynamics}}),$$
where
$$\begin{array}{lll}{\cal H}_{{\rm dynamics}} &=& \sum\limits_{k = 0}^{N - 1} \frac{\delta t}{4} \left[ \frac{{\bf x}_{k+1} - {\bf x}_k}{\delta t} - {\bf f}({\bf x}_k, t_k)\right]^{\top} \nonumber \\ &&\times {\bf D}^{-1} \left[ \frac{{\bf x}_{k+1} - {\bf x}_k}{\delta t} - {\bf f}({\bf x}_k, t_k)\right]. \end{array}$$
As we assume that measurement noises are i.i.d. Gaussian random variables, the likelihood is simply given by
$$p\big({\bf y}_1, ..., {\bf y}_M|{\bf x}_0, ..., {\bf x}_N\big) \propto \exp\left(-{\cal H}_{{\rm obs}}\right),$$
where
$${\cal H}_{obs} = \frac{1}{2} \sum\limits_{j = 1}^M \left[ {\bf H}\left({\bf x}_{k_j}\right) - {\bf y}_j \right]^{\top} {\bf R}^{-1} \left[ {\bf H}\left({\bf x}_{k_j}\right) - {\bf y}_j \right].$$
(4)
In summary, we have the posterior
$$p({\bf x}_0, ..., {\bf x}_N|\{ {\bf y}_1, ..., {\bf y}_M \}) \propto p({\bf x}_0) \cdot \exp\left(-\left( {\cal H}_{{\rm dynamics}} + {\cal H}_{{\rm obs}}\right)\right).$$
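To make the two energy terms concrete, the following is a minimal sketch of how the negative log-posterior of a discretised path could be evaluated for a scalar state. It borrows the double-well drift of Section 5 and assumes direct observations (H is the identity) and a Gaussian prior on the initial state; the function names and the defaults `m0` and `v0` are illustrative choices, not part of the original method description.

```python
import numpy as np

def f(x):
    """Drift of the double-well system of Section 5, f(x) = 4x(1 - x^2)."""
    return 4.0 * x * (1.0 - x**2)

def h_dynamics(path, dt, D):
    """Discretised dynamics energy of a scalar path x_0, ..., x_N."""
    velocity = np.diff(path) / dt        # (x_{k+1} - x_k) / dt
    residual = velocity - f(path[:-1])   # mismatch with the drift at x_k
    return 0.25 * dt * np.sum(residual**2) / D

def h_obs(path, obs_idx, y, R):
    """Observation energy (Eq. 4) for direct observations of the state."""
    misfit = path[obs_idx] - y
    return 0.5 * np.sum(misfit**2) / R

def neg_log_posterior(path, dt, D, obs_idx, y, R, m0=1.0, v0=0.05):
    """-log p(x_0) + H_dynamics + H_obs, up to an additive constant.

    The Gaussian prior N(m0, v0) on x_0 is an assumption of this sketch.
    """
    neg_log_prior = 0.5 * (path[0] - m0)**2 / v0
    return neg_log_prior + h_dynamics(path, dt, D) + h_obs(path, obs_idx, y, R)
```

A path that rests at x = 1 and matches an observation of 1 exactly has zero energy in this sketch; any deviation from the drift or from the data increases it.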

3 Markov Chain Monte Carlo (MCMC) Smoother

In Hybrid Monte Carlo, the molecular dynamics simulation algorithm is applied to make proposals in a Metropolis-Hastings algorithm, for example,
$${\boldsymbol{\cal{X}}}^k = \left\{{\bf x}^k_0, ..., {\bf x}^k_N\right\} \longrightarrow {\boldsymbol{\cal{X}}}^{k+1} = \left\{{\bf x}^{k+1}_0, ..., {\bf x}^{k+1}_N\right\},$$
at step k. To make a proposal of \({\boldsymbol{\cal{X}}}^{k+1}\), one simulates a fictitious deterministic system as follows
$$\begin{array}{lll}\frac{d{\boldsymbol{\cal{X}}}}{d\tau} &=& {\bf P} \nonumber \\ \frac{d{\bf P}}{d\tau} &=& -\nabla_{{\boldsymbol{\cal{X}}}} \hat {\cal H}({\boldsymbol{\cal{X}}}, {\bf P})\end{array}$$
where \({\bf P} = ({\bf p}_0, ..., {\bf p}_N)\) represents momentum and \(\hat {\cal H}\) is a fictitious Hamiltonian which is the sum of the potential energy \({\cal H}^{pot}\) and the kinetic energy \({\cal H}^{kin} = \frac{1}{2} \sum_{k = 0}^N {\bf p}_k^2\). For the posterior distribution of non-linear smoothing in Section 2, the potential energy is given by
$${\cal H}^{pot} = -\log p({\bf x}_0) + {\cal H}_{{\rm dynamics}} + {\cal H}_{{\rm obs}}.$$
The above system is initialised by setting \({\boldsymbol{\cal{X}}}(\tau = 0) = {\boldsymbol{\cal{X}}}^k\) and sampling a random number from \({\cal N}(0, 1)\) for each component of \({\bf P}(\tau = 0)\). After that, one integrates the system equations forward in time with time increment \(\delta\tau\) by using the leapfrog scheme as follows:
$$\begin{array}{lll}{\boldsymbol{\cal{X}}}' &=& {\boldsymbol{\cal{X}}} + \delta \tau A{\bf P} + \frac{\delta \tau^2}{2}AA^{\top}\left(-\nabla_{{\boldsymbol{\cal{X}}}} \hat {\cal H}\right) \nonumber \\ {\bf P}' &=& {\bf P} + \frac{\delta \tau}{2}A^{\top} \left(-\nabla_{{\boldsymbol{\cal{X}}}} \hat {\cal H} -\nabla_{{\boldsymbol{\cal{X}}}'} \hat {\cal H}\right) \end{array}$$
where A denotes a so-called preconditioning matrix, which accelerates the convergence of matrix iterations. Further, A is a circulant matrix which is constructed from the vector
$$\left\{1, \exp(-\alpha), ..., \exp(-\alpha \cdot T)\right\}$$
where α is a tuning parameter. After J iterations, the state \({\boldsymbol{\cal{X}}}(\tau = J\delta \tau)\) is proposed as \({\boldsymbol{\cal{X}}}^{k+1}\) which will be accepted with probability
$$\mbox{min}\left\{1, \exp\left( -\hat {\cal H}^{k+1} + \hat {\cal H}^k \right)\right\}.$$
A reasonably high acceptance rate can be achieved by tuning the parameters δτ, J and α. If δτ is too large, then the leapfrog algorithm gives a poor approximation to the true dynamics of the fictitious system. If J is too large, small discretisation errors can accumulate so that the simulated trajectory drifts away from the true one. Both lead to a low acceptance rate. If δτ and J are too small, however, the change of the sample path at each step is too small to improve mixing significantly.
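The proposal and acceptance steps described in this section can be sketched as follows for a scalar double-well system with direct observations. This is a simplified illustration only: it takes the preconditioning matrix to be the identity, assumes a Gaussian prior N(m0, v0) on the initial state, and uses hypothetical function and argument names; it is not the implementation of [14].

```python
import numpy as np

def drift(x):
    return 4.0 * x * (1.0 - x**2)

def drift_prime(x):
    return 4.0 - 12.0 * x**2

def potential(x, dt, D, obs_idx, y, R, m0, v0):
    """-log p(x_0) + H_dynamics + H_obs for a scalar path (up to a constant)."""
    r = np.diff(x) / dt - drift(x[:-1])
    return (0.5 * (x[0] - m0)**2 / v0
            + 0.25 * dt * np.sum(r**2) / D
            + 0.5 * np.sum((x[obs_idx] - y)**2) / R)

def grad_potential(x, dt, D, obs_idx, y, R, m0, v0):
    """Gradient of the potential with respect to every point of the path."""
    r = np.diff(x) / dt - drift(x[:-1])
    g = np.zeros_like(x)
    g[1:] += 0.5 * r / D                                        # r_{k-1} depends on x_k
    g[:-1] += 0.5 * dt * r * (-1.0 / dt - drift_prime(x[:-1])) / D  # r_k depends on x_k
    g[0] += (x[0] - m0) / v0                                    # prior on x_0
    g[obs_idx] += (x[obs_idx] - y) / R                          # observed time points
    return g

def hmc_step(x, rng, step, n_leapfrog, *args):
    """One HMC update of the whole path (no preconditioning, i.e. A = I)."""
    p = rng.standard_normal(x.shape)
    h_old = potential(x, *args) + 0.5 * np.sum(p**2)
    x_new, p_new = x.copy(), p.copy()
    g = grad_potential(x_new, *args)
    for _ in range(n_leapfrog):
        p_new -= 0.5 * step * g           # half step in momentum
        x_new += step * p_new             # full step in position
        g = grad_potential(x_new, *args)
        p_new -= 0.5 * step * g           # second half step in momentum
    h_new = potential(x_new, *args) + 0.5 * np.sum(p_new**2)
    # Metropolis test: accept with probability min{1, exp(h_old - h_new)}
    accept = (h_new <= h_old) or (rng.random() < np.exp(h_old - h_new))
    return (x_new if accept else x), accept
```

With the identity in place of the preconditioner, each leapfrog iteration coincides with the update rule given above; the trade-off between `step` and `n_leapfrog` discussed in the text applies unchanged.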

4 Variational Gaussian Process Approximation (VGPA) Smoother

The starting point of the variational Gaussian Process approximation method is to approximate Eq. 1 by a linear SDE:
$$d{\bf X}(t) = {\bf f}_L({\bf X}, t)dt +(2{\bf D})^{1/2}(t) d{\bf W}(t),$$
(5)
where
$${\bf f}_L({\bf X}, t) = -{\bf A}(t) {\bf X}(t) + {\bf b}(t).$$
(6)
Note that D must not be state-dependent so that X(t) of the approximate SDE is a Gaussian process. The matrix \({\bf A}(t) \in {\cal R}^{d \times d}\) and the vector \({\bf b}(t) \in {\cal R}^d\) are two variational parameters to be optimised.
The approximation made by Eq. 6 implies that the true posterior process \(p({\bf X}(t)|{\bf y}_1, ..., {\bf y}_M)\), denoted by \(p(t)\), is approximated by a Gaussian Markov process, denoted by \(q(t)\). If we discretise the linear SDE in the same way as the true SDE, the approximate posterior can be written down as
$$q({\bf x}_0, ..., {\bf x}_N) = q({\bf x}_0) \cdot \prod\limits_{k = 0}^{N - 1} {\cal N}\big({\bf x}_{k+1} | {\bf x}_k + {\bf f}_L({\bf x}_k, t_k) \delta t, 2{\bf D} \delta t\big).$$
In [10], \(q({\bf x}_0)\) is fixed to \({\cal N}({\bf x}_0|{\bf m}_0, {\bf S}_0)\), and the prior on the initial state \(p({\bf x}_0)\) is a uniform distribution.
The optimal A(t) and b(t) are obtained by minimising the KL divergence between q(t) and p(t), which is given by
$$\mbox{KL}[q\|p] = \int dq \ln \frac{dq}{dp} = \int_0^T E(t) dt + \mbox{const}.$$
(7)
with \(E(t) = E_{sde}(t) + E_{obs}(t)\), \(E_{obs}(t) = \left< {\cal H}_{{\rm obs}} \right>_{q_t}\) and
$$E_{sde}(t) = \frac{1}{4} \left< \big({\bf f}({\bf X}) - {\bf f}_L({\bf X})\big)^{\top}{\bf D}^{-1}\big({\bf f}({\bf X}) - {\bf f}_L({\bf X})\big) \right>_{q_t}$$
where \({\cal H}_{{\rm obs}}\) is defined in Eq. 4 and \(q_t\) denotes the marginal distribution of the approximate posterior process \(q(t)\) at time \(t\).
To compute the KL divergence, we introduce two auxiliary variational parameters m(t) and S(t), which are the mean and covariance matrix of the marginal distribution \(q_t\). However, the pair (m(t), S(t)) is not independent of (A(t), b(t)). There exist two constraints between them:
$$\frac{d {\bf m}(t)}{dt} = -{\bf A}(t){\bf m}(t) + {\bf b}(t),$$
(8)
and
$$\frac{d {\bf S}(t)}{dt} = -{\bf A}(t){\bf S}(t) - {\bf S}(t){\bf A}^{\top}(t) + 2{\bf D}.$$
(9)
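As an illustration of the constraints in Eqs. 8 and 9, the sketch below propagates the marginal mean and variance of a scalar state forward in time with an explicit Euler step. The function name, the piecewise-constant treatment of A(t) and b(t), and the choice of Euler integration are assumptions of this example.

```python
import numpy as np

def propagate_moments(A, b, D, m0, S0, dt):
    """Euler integration of dm/dt = -A m + b and dS/dt = -2 A S + 2 D (scalar case).

    A and b are arrays holding the variational parameters at t_0, ..., t_{N-1}.
    """
    n = len(A)
    m = np.empty(n + 1)
    S = np.empty(n + 1)
    m[0], S[0] = m0, S0
    for k in range(n):
        m[k + 1] = m[k] + dt * (-A[k] * m[k] + b[k])
        S[k + 1] = S[k] + dt * (-2.0 * A[k] * S[k] + 2.0 * D)
    return m, S
```

For constant A > 0 and b, the moments relax to the stationary values m = b/A and S = D/A, which gives a quick sanity check of the integration.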
Accordingly, we find the optimal (A(t), b(t)) and (m(t), S(t)) by looking for the stationary points of the following Lagrangian
$${\cal L} = \int \left\{E - tr\left\{ {\boldsymbol{\Psi}} \left(\frac{d{\bf S}}{dt} + {\bf A}{\bf S} + {\bf S}{\bf A}^{\top} - 2{\bf D}\right) \right\} - {\boldsymbol{\lambda}}^{\top}\left(\frac{d{\bf m}}{dt} + {\bf A}{\bf m} - {\bf b}\right) \right\}dt$$
where \({\boldsymbol{\Psi}}(t) \in {\cal R}^{d \times d}\) and \({\boldsymbol{\lambda}}(t) \in {\cal R}^{d}\) are Lagrange multipliers. By definition, \({\boldsymbol{\Psi}}(T) = 0\) and \({\boldsymbol{\lambda}}(T) = 0\).
By taking the derivatives of \({\cal L}\) with respect to m, S, A and b, we obtain the following Euler-Lagrange equations:
$$\frac{\partial E}{\partial {\bf A}} - 2{\boldsymbol{\Psi}}{\bf S} - {\boldsymbol{\lambda}} {\bf m}^{\top} = 0$$
(10)
$$\frac{\partial E}{\partial {\bf b}} + {\boldsymbol{\lambda}} = 0$$
(11)
$$\frac{\partial E}{\partial {\bf m}} - {\bf A}^{\top}{\boldsymbol{\lambda}} + \frac{d{\boldsymbol{\lambda}}}{dt} = 0$$
(12)
$$\frac{\partial E}{\partial {\bf S}} - 2{\bf \boldsymbol{\Psi} A} + \frac{d{\boldsymbol{\Psi}}}{dt} = 0$$
(13)
Note that the optimal m, S, A, b, Ψ and λ should fulfil the above equations and Eqs. 8–9 as well. Hence, the non-linear smoothing problem is reduced to solving a system of first-order differential equations.
The equation system above is solved iteratively. We start with an initial guess of m, S, A, b, Ψ and λ. First, we compute m(t) and S(t) by performing standard Gaussian Process regression [22]. Then, we set \({\boldsymbol{\Psi}}(t) = 0\) and \({\boldsymbol{\lambda}}(t) = 0\) for all t. Finally, A and b are initialised by
$${\bf A}(t) = -\left< \frac{\partial {\bf f}}{\partial {\bf X}} \right>_{q_t} + 4{\bf D} {\boldsymbol{\Psi}}(t)$$
(14)
$${\bf b}(t) = <{\bf f}({\bf X})>_{q_t} + {\bf A}(t){\bf m}(t) - 2{\bf D}{\boldsymbol{\lambda}}(t).$$
(15)
Note that Eqs. 14–15 are derived from Eqs. 10–11.
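For the double-well drift of Section 5, the Gaussian expectations of the drift and of its derivative, which enter the initialisation of A and b, are available in closed form because f is a cubic polynomial. The sketch below evaluates them for a scalar marginal \(q_t = {\cal N}(m, S)\); the function names are illustrative.

```python
import numpy as np

def expected_drift(m, S):
    """<f(X)> for f(x) = 4x(1 - x^2) and X ~ N(m, S).

    Uses the Gaussian moment <X^3> = m^3 + 3 m S.
    """
    return 4.0 * m - 4.0 * (m**3 + 3.0 * m * S)

def expected_drift_gradient(m, S):
    """<df/dX> for df/dx = 4 - 12 x^2, using <X^2> = m^2 + S."""
    return 4.0 - 12.0 * (m**2 + S)
```

For drifts without closed-form Gaussian moments, the same expectations would have to be approximated numerically, for example by quadrature.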
At iteration \(i\), we first update m and S by solving Eqs. 8–9 forward in time, where \({\bf A}^i\) and \({\bf b}^i\) are used. Next, Ψ and λ are updated by solving Eqs. 12–13 backward in time with final condition \({\boldsymbol{\Psi}}(T) = 0\) and \({\boldsymbol{\lambda}}(T) = 0\), where \({\bf m}^{i+1}\) and \({\bf S}^{i+1}\) are used. Note that the data are assimilated at this step. To clarify this, the update can be split into two steps:
  1. Between two successive observations, we update Ψ and λ by solving
    $$\frac{d{\boldsymbol{\Psi}}(t)}{dt} = 2 {\boldsymbol{\Psi}}(t) {\bf A}(t) - \frac{\partial E_{sde}}{\partial {\bf S}}$$
    (16)
    $$\frac{d{\boldsymbol{\lambda}}(t)}{dt} = {\bf A}^{\top}(t){\boldsymbol{\lambda}}(t) - \frac{\partial E_{sde}}{\partial {\bf m}}$$
    (17)
  2. When there is an observation at \(t_{k_j}\), j = 1, ..., M, the following jump conditions apply
    $${\boldsymbol{\Psi}}\left(t^+_{k_j}\right) = {\boldsymbol{\Psi}}\left(t^-_{k_j}\right) - \frac{1}{2} {\bf H}^{\top}{\bf R}^{-1}{\bf H}$$
    (18)
    $${\boldsymbol{\lambda}}\left(t^+_{k_j}\right) = {\boldsymbol{\lambda}}\left(t^-_{k_j}\right) + {\bf H}^{\top}{\bf R}^{-1}\left({\bf y}_j - {\bf H}{\bf m}\left(t_{k_j}\right)\right).$$
    (19)
Finally, we compute
$${\bf A}\left(t; {\bf m}^{i+1}, {\bf S}^{i+1}, {\boldsymbol{\Psi}}^{i+1}, {\boldsymbol{\lambda}}^{i+1}\right)$$
and
$${\bf b}\left(t; {\bf m}^{i+1}, {\bf S}^{i+1}, {\boldsymbol{\Psi}}^{i+1}, {\boldsymbol{\lambda}}^{i+1}\right)$$
by using Eqs. 14–15. To keep the algorithm stable, the update of A(t) and b(t) is done by
$$\begin{array}{lll}{\bf A}^{i+1}(t) &=& {\bf A}^i(t) - \omega\left\{ {\bf A}^i(t) - {\bf A}\left(t; {\bf m}^{i+1}, {\bf S}^{i+1}, {\boldsymbol{\Psi}}^{i+1}, {\boldsymbol{\lambda}}^{i+1}\right)\right\} \nonumber \\ {\bf b}^{i+1}(t) &=& {\bf b}^i(t) - \omega\left\{ {\bf b}^i(t) - {\bf b}\left(t; {\bf m}^{i+1}, {\bf S}^{i+1}, {\boldsymbol{\Psi}}^{i+1}, {\boldsymbol{\lambda}}^{i+1}\right)\right\}\end{array}$$
where 0 < ω < 1. The iteration stops when \({\cal L}\) has converged.
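The damped update above can be written in a few lines. The sketch below applies it to an array of parameter values on the time grid; the function name is illustrative, and the default ω = 0.15 is the value reported for the experiments in Section 5.

```python
import numpy as np

def relaxed_update(current, target, omega=0.15):
    """Move `current` a fraction `omega` of the way towards `target`."""
    return current - omega * (current - target)
```

If the target were held fixed, the distance to it would shrink by a factor (1 − ω) per iteration, so a smaller ω trades speed of convergence for stability.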

5 Numerical Experiments

The MCMC and variational algorithms are compared on a double-well potential system which is given by
$$\dot{x}(t) = f(x(t)) + \kappa \cdot \xi(t),$$
(20)
where
$$f(x) = 4x\left(1-x^2\right)$$
and ξ(t) is white noise [1]. The parameter κ corresponds to \((2{\bf D})^{\frac{1}{2}}\) in Eq. 1 and determines the strength of random fluctuations within the system. This system has two stable states, namely x = +1 and x = −1. However, random fluctuations can cause a transition of the system from one stable state to the other. The average time needed for such an event to occur is called the exit time [1]. In this study, we set κ = 0.5, and the corresponding exit time is about 4000 time units [23]. This provides us with some prior knowledge of the initial state.
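A sample path of this system can be generated with the Euler-Maruyama scheme of Eq. 3. The following is a minimal sketch for the scalar double-well model; the function name, the seed, and the step size used in the usage lines are illustrative choices.

```python
import numpy as np

def simulate_double_well(x0, kappa, dt, n_steps, rng):
    """Euler-Maruyama simulation of dx = 4x(1 - x^2) dt + kappa dW."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    noise = rng.standard_normal(n_steps)
    for k in range(n_steps):
        drift = 4.0 * x[k] * (1.0 - x[k]**2)
        x[k + 1] = x[k] + drift * dt + kappa * np.sqrt(dt) * noise[k]
    return x

# Example: a path over a window of length 12 with kappa = 0.5, started in the right well.
rng = np.random.default_rng(0)
path = simulate_double_well(x0=1.0, kappa=0.5, dt=0.01, n_steps=1200, rng=rng)
```

With κ = 0 the scheme reduces to a deterministic relaxation towards the nearest stable state, which is a convenient check of the drift term.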

In the numerical experiments, we consider a smoothing window ranging from t = 0 to t = 12.0. Further, we assume that the state x can be observed directly, which makes H the identity function. Within the smoothing window, we generate three data sets, say A, B, and C, from a sample path which was considered in [23] and [25]. The variances R of the measurement errors are 0.04, 0.09, and 0.36, respectively. Each data set consists of seven data points which are “measured” at times \(t_{k_1} = 1.0\), ..., \(t_{k_M} = 7.0\). Although multiple data sets are generated and analysed for each of those R-values, the results for data sets A, B, and C are representative and are chosen for illustration.

For the MCMC method, Eq. 20 is discretized with time increment δt = 0.1. The prior on the initial state is set to a Gaussian density with mean at x = +1 and variance equal to 0.05. This choice is strongly based on our prior knowledge of the system. The tuning parameters of Hybrid Monte Carlo are chosen as follows: J = 2, δτ = 0.005 and α = 0.02. The use of the preconditioning matrix A keeps the necessary J small, which makes the simulation computationally more efficient. However, multiplying the matrix A with various vectors costs extra computational time. Because A is circulant, this part of the computational burden can be reduced [24].
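One standard way to exploit the circulant structure is to carry out the matrix-vector product with the fast Fourier transform, which costs O(n log n) instead of O(n²) operations. The sketch below illustrates this under the assumption that the generating vector forms the first column of the matrix; whether this matches the exact construction used in [14] and [24] is not stated here, and the function names are illustrative.

```python
import numpy as np

def circulant_from_first_column(c):
    """Dense circulant matrix C with C[i, j] = c[(i - j) mod n]."""
    n = len(c)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

def circulant_matvec(c, v):
    """Product C @ v computed as a circular convolution via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(v)))

# Generating vector with decay rate alpha = 0.02 on a grid of 121 points.
alpha, n = 0.02, 121
c = np.exp(-alpha * np.arange(n))
```

The dense matrix is only needed for checking; in a sampler one would call `circulant_matvec` directly and never form the matrix.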

For each of the three data sets, we run a Markov chain of length 5,000,000 and subsample from this chain with a sampling interval of 1,000. The first 1,000 samples are discarded as a burn-in period. It turns out that it is insufficient to determine burn-in only by monitoring a summary statistic such as the energy \(\hat {\cal H}\). Instead, one has to monitor the traces of the state x at different time points. In particular, those time points must be chosen from different phases of the smoothing window, for example, transition phases, stationary phases, and the phases before the first and after the last observation.
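The thinning and burn-in arithmetic can be made explicit with a short sketch. It assumes that the burn-in count refers to the subsampled draws and that thinning keeps the first draw of every block; both are readings of the text rather than stated facts, and the function name is illustrative.

```python
import numpy as np

def thin_and_discard(chain, interval, n_burn):
    """Keep every `interval`-th draw, then drop the first `n_burn` kept draws."""
    return chain[::interval][n_burn:]

chain_length, interval, n_burn = 5_000_000, 1_000, 1_000
n_kept = chain_length // interval - n_burn   # 5000 thinned draws minus 1000 burn-in
```

Under these assumptions, 4,000 sample paths per data set remain for estimating the mean path and the conditional variance.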

For the variational GP approximation method, Eq. 20 is discretized with time increment δt = 0.01. The only tuning parameter ω is set to 0.15 as in [10]. The number of iterations required for the convergence of VGPA may increase when we extend the smoothing window or add more measurement noise. This is because of the poor initial states estimated by standard GP regression.

In Figs. 1, 2 and 3, the estimates of both the mean path and the conditional variance are displayed for data sets A, B, and C, respectively. In each figure, the results of VGPA are compared with the MCMC results.
https://static-content.springer.com/image/art%3A10.1007%2Fs11265-008-0299-y/MediaObjects/11265_2008_299_Fig1_HTML.gif
Figure 1

Comparison of mean-path and conditional-variance estimates between the MCMC- (dashed) and variational (solid) method with a double-well potential system. Filled circles represent 7 observations from data set A, with measurement noise variance equal to 0.04. The mean paths are displayed by thick lines, while each pair of thin lines indicates an envelope of mean path with 2 × standard deviation.

https://static-content.springer.com/image/art%3A10.1007%2Fs11265-008-0299-y/MediaObjects/11265_2008_299_Fig2_HTML.gif
Figure 2

The same as in Fig. 1 but with data set B (measurement noise variance = 0.09).

https://static-content.springer.com/image/art%3A10.1007%2Fs11265-008-0299-y/MediaObjects/11265_2008_299_Fig3_HTML.gif
Figure 3

The same as in Fig. 1 but with data set C (measurement noise variance = 0.36).

For the data sets with relatively small measurement noise, the mean paths estimated by both methods agree very well, whereas the conditional variance estimated by VGPA is consistently, though only slightly, smaller than that of MCMC. It is also seen that the estimated mean path is slightly biased towards zero during both stationary phases. This can be explained by the fact that, although the posterior of x has a distinct mode at x = +1.0 before the transition or at x = −1.0 after the transition, the mode at the other stable state does not vanish. Moreover, we see that the mean path stays in the left well after the last observation. This is in accordance with the large exit time of the system we consider.

A small dip of the mean path is evident in the results of the VGPA smoother when we look into the initial period of the smoothing window. This is accompanied by a large conditional variance S(0). To explain this observation, we run MCMC simulations with increasingly larger prior variance of the initial state. As expected, the posterior variance of \(x_0\) increases with its prior variance. Further, it turns out that a similar dip of the mean path appears when the prior variance becomes sufficiently large. It can be understood as follows: Without any data, a double-well system shows a bimodal probabilistic structure. With a posterior mean of \(x_0\) close to +1 and a large value of its posterior variance, the mean path can be further biased towards zero in the initial period, where the first observation has little influence. Note that the approximate posterior variance S(0) is not optimised, but held fixed.

Finally, we turn our attention to data set C, which has very large measurement noise. Note that it is difficult to identify where the transition starts by visual inspection of the data themselves. In contrast, this is possible with data sets A and B. From Fig. 3, we can see that there is a significant difference both in the mean path and in the conditional variance between the MCMC and VGPA smoothers, particularly in the period before t = 5.0. Due to the ambiguity shown by the data between t = 2.0 and t = 4.0, the MCMC sampler seems to be exploring the bimodal structure of the posterior distribution. In contrast, the approximate posterior of the VGPA method is fixed to one particular mode at any time. This may explain the difference in the mean path between the two methods and the significant underestimation of the conditional variance by the VGPA smoother.

6 Discussions

By comparing it with Markov Chain Monte Carlo, we scrutinise a variational method for non-linear smoothing which was recently proposed in [10]. Both methods are tested on a double-well potential system. Three data sets with different measurement noise are used to find out the strengths and weaknesses of the novel smoother.

Our investigation is based on the fact that MCMC methods provide an exact inference tool for comparison. For data sets with small or moderate measurement noise, it turns out that the VGPA method produces a very accurate estimate of the mean path, while the conditional variance is slightly underestimated. As expected, the variational method is computationally more efficient than MCMC. Regarding other approximation-based smoothers, it has been reported that the ensemble Kalman smoother fails to reconstruct the transition of a double-well system accurately from a sparse data set [25]. As stated in [25], the failure is due to the fact that in KF and EnKF the propagated states are corrected by a linear interpolation scheme when new data are assimilated.

However, the weakness of the VGPA method is also evident when the ambiguity of the data becomes significant. Like many other variational approximation methods, the novel smoother is not good at exploring the multi-modal structure of some probability measures. In this paper, the role of the prior on the initial state is also investigated. It turns out that the estimates of the mean path can be biased in the initial phase of the smoothing window, where the first observation has little influence, unless the prior on the initial state is incorporated by the variational smoother.

In this paper, the focus of the comparison is on the accuracy of the variational smoother when compared to MCMC. Future work will focus on a comprehensive assessment of its relative performance when compared with other approximation-based algorithms. As applications of the so-called “statistical linearisation” strategy, the ensemble Kalman smoother [17] and the unscented Kalman smoother [26] are of most interest. For multimodal systems, the Gaussian sum smoother proposed in [27] is particularly promising, as it propagates a Gaussian sum approximation of the true marginal posterior [28].

Many MCMC algorithms suffer from poor mixing when high-dimensional stochastic complex systems are concerned. Development of efficient MCMC algorithms is always a challenging task. A combination of variational approximation methods and sampling methods would offer a new promising direction to improve the efficiency of MCMC algorithms.

Copyright information

© Springer Science+Business Media, LLC 2008