# A Comparison of Variational and Markov Chain Monte Carlo Methods for Inference in Partially Observed Stochastic Dynamic Systems


DOI: 10.1007/s11265-008-0299-y

- Cite this article as: Shen, Y., Archambeau, C., Cornford, D., et al. J Sign Process Syst (2010) 61: 51.


## Abstract

In recent work we have developed a novel variational inference method for partially observed systems governed by stochastic differential equations. In this paper we compare the Variational Gaussian Process Smoother with an exact solution computed using a Hybrid Monte Carlo approach to path sampling, applied to a stochastic double-well potential model. We demonstrate that the variational smoother provides a very accurate estimate of the mean path, while the conditional variance is slightly underestimated. We conclude with some remarks on the advantages and disadvantages of the variational smoother.

### Keywords

Data assimilation · Signal processing · Nonlinear smoothing · Variational approximation · Bayesian computation

## 1 Introduction

Stochastic dynamical systems [1] have been used to model real-world systems in various areas ranging from physics [1] to systems biology [2] to environmental science [3]. Such systems are often only partially observed, which makes statistical inference in those systems difficult. The inference problem for stochastic dynamical systems usually includes both state and parameter estimation. In this paper, we focus on state estimation and assume that the system equation and its parameters are both known a priori. These are known as the filtering and smoothing problems in statistical signal processing [4]. It is known that the Kushner-Stratonovich-Pardoux (KSP) equations provide the optimal solution to a general filtering/smoothing problem [5–7]. For linear systems, the filtering part of the KSP equations reduces to the well-known Kalman-Bucy filter [8], which is computationally very efficient. For non-linear dynamics in general, however, filtering/smoothing is still a challenging problem because a numerical solution to the KSP equations is not feasible for high-dimensional systems. Recently, a variational smoothing algorithm has been proposed in [10]. This paper illustrates the performance of that computationally efficient algorithm by comparing it with a Markov Chain Monte Carlo (MCMC) smoother.

The state of the system, \({\bf X}(t) \in {\cal R}^d\), is assumed to evolve according to the stochastic differential equation

$$d{\bf X}(t) = {\bf f}({\bf X}(t))\,dt + (2{\bf D})^{\frac{1}{2}}\,d{\bf W}(t),$$(1)

where **f** represents a deterministic dynamical process and the driving noise process is represented by a Wiener process **W**(*t*). Equation 1 is also referred to as a diffusion. Note that the diffusion matrix **D** is assumed to be state-independent. The state is observed via some measurement function **H**(·) at discrete times, say \(\{t_k\}_{k = 1, \ldots, M}\). The observations are contaminated by i.i.d. Gaussian noise:

$${\bf y}_k = {\bf H}({\bf X}(t_k)) + {\bf R}^{\frac{1}{2}}\,{\boldsymbol{\xi}}_k,$$

where \({\bf y}_k\) denotes the *k*-th observation, \({\bf R} \in {\cal R}^{d' \times d'}\) is the covariance matrix of measurement errors, and \({\boldsymbol{\xi}}_k\) represents multivariate white noise. A Bayesian approach to filtering/smoothing is typically adopted, in which the posterior distributions \(p({\bf X}(t)\,|\,\{{\bf y}_1, \ldots, {\bf y}_k,\; t_k < t\})\) (filtering) and \(p({\bf X}(t)\,|\,\{{\bf y}_1, \ldots, {\bf y}_M\})\) (smoothing), respectively, are to be formulated and estimated. Theoretically, an optimal estimate of *p*(·) is the solution to the corresponding KSP equations. Computational approaches are either based on a variety of approximation schemes or achieved by MCMC sampling methods.
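To make the setting concrete, the generative model above (a diffusion observed at discrete times through **H** with Gaussian noise of covariance **R**) can be sketched in a few lines of Python. The linear drift f(x) = −x below is purely illustrative and not a model from this paper; the function names are ours:

```python
import numpy as np

def simulate_diffusion(f, noise_amp, x0, dt, n_steps, rng):
    """Euler-Maruyama discretisation of dX = f(X) dt + noise_amp dW (scalar state)."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        x[k + 1] = x[k] + f(x[k]) * dt + noise_amp * np.sqrt(dt) * rng.standard_normal()
    return x

rng = np.random.default_rng(0)
f = lambda x: -x                     # illustrative linear (OU) drift, an assumption
path = simulate_diffusion(f, 0.5, 1.0, 0.01, 1000, rng)

obs_idx = np.arange(100, 800, 100)   # 7 discrete observation times
R = 0.04                             # observation noise variance
y = path[obs_idx] + np.sqrt(R) * rng.standard_normal(len(obs_idx))  # y_k = H x(t_k) + noise, H = identity
```

The smoothing problem is then to recover the whole latent path `path` from the seven noisy values `y`.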

Using Markov Chain Monte Carlo [12], one is able to sample from a posterior distribution exactly. At each step of an MCMC simulation, a new state is proposed and accepted or rejected probabilistically. For applications to stochastic dynamical systems, this is also referred to as path sampling. A path sampling approach to discrete-time state-space models has been addressed in [9] and the many references therein. In those works, a Gibbs sampler with single-site updates was used. To achieve better mixing, several algorithms using multiple-site updates are explored in [13]. Recently, a Hybrid Monte Carlo (HMC) algorithm for path sampling was proposed in [14]. The HMC method updates the entire sample path at each step of path sampling while keeping the acceptance rate of new paths high. In this work, we first scrutinise the use of HMC for non-linear smoothing and then assess the performance of the variational smoother proposed in [10] by comparing its results with those of HMC.

In contrast to MCMC, all other approaches to non-linear filtering/smoothing, including the one proposed in [10], are based on a particular approximation scheme. The extended Kalman filter was the first attempt to tackle non-linearity by linearising the dynamics around the current state estimate [15]. However, unstable error growth has been observed in such linearisation methods [16]. To alleviate this difficulty, the Ensemble Kalman Filter (EnKF) was introduced in [17]: an ensemble of states is integrated forward in time, so that the Kalman gain can be estimated from error covariances that are not propagated but calculated from the ensemble of states at each time step. Note that this method drops the linear approximation of the non-linear dynamics while keeping the Gaussian assumption on the error statistics. The particle filter (PF) proposed in [18] represents a different approximation strategy: the posterior density of the filtering variables is approximated by a discrete distribution with random support. Each point in the discrete support is called a particle, and its probability mass is its weight. It will be seen that the approximation strategy implemented in [10] is distinct from those of the above methods.

The structure of this paper is as follows. First, we present a Bayesian treatment of non-linear smoothing. In Section 3, the MCMC method is described in detail, while a summary of the variational smoother is given in Section 4; for detailed proofs, we refer to [10]. After that, we compare both methods in Section 5 by numerical experiments with a double-well potential system. The paper concludes with a discussion.

## 2 Bayesian Approach to Non-linear Smoothing

We discretise Eq. 1 with discrete times \(t_k = k \cdot \delta t\), \(k = 0, 1, \ldots, N\), and a smoothing window from *t* = 0 to \(T = N \cdot \delta t\), giving the recursion

$${\bf x}_{k+1} = {\bf x}_k + {\bf f}({\bf x}_k)\,\delta t + (2{\bf D}\,\delta t)^{\frac{1}{2}}\,{\boldsymbol{\xi}}_k.$$

Note that the \({\boldsymbol{\xi}}_k\) are white noise. An initial state \({\bf x}_0\) needs to be set. There are *M* observations within the smoothing window, denoted by \({\bf y}_1, \ldots, {\bf y}_M\). The posterior distribution over the discretised path then factorises as

$$p({\bf x}_0, \ldots, {\bf x}_N \,|\, {\bf y}_1, \ldots, {\bf y}_M) \propto p({\bf x}_0) \prod_{k=0}^{N-1} p({\bf x}_{k+1}\,|\,{\bf x}_k) \prod_{j=1}^{M} p\left({\bf y}_j\,|\,{\bf x}_{k_j}\right),$$

where \(p({\bf x}_0)\) is the prior of the initial state and the \(p({\bf x}_{k+1}\,|\,{\bf x}_k)\) with \(k = 0, \ldots, N-1\) are transition densities of the diffusion process. For small enough \(\delta t\), those transition densities can be well approximated by a Gaussian density [21]. Accordingly, \(p({\bf x}_{k+1}\,|\,{\bf x}_k) \approx {\cal N}\left({\bf x}_{k+1}\,|\,{\bf x}_k + {\bf f}({\bf x}_k)\,\delta t,\; 2{\bf D}\,\delta t\right)\).
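Under this small-δt Gaussian approximation, the log posterior of a discretised path is a sum of quadratic transition and observation terms. A minimal sketch for a scalar state with an identity observation operator (function and parameter names are ours, not from the paper):

```python
import numpy as np

def log_posterior(x, f, two_D, dt, y, obs_idx, H, R, log_prior_x0):
    """Log posterior (up to a constant) of a discretised path x = (x_0, ..., x_N).

    Transition densities p(x_{k+1} | x_k) are approximated as Gaussian with
    mean x_k + f(x_k) dt and variance 2D dt, valid for small dt.
    """
    lp = log_prior_x0(x[0])
    incr = x[1:] - x[:-1] - f(x[:-1]) * dt
    lp += -0.5 * np.sum(incr ** 2) / (two_D * dt)   # transition terms
    resid = y - H * x[obs_idx]
    lp += -0.5 * np.sum(resid ** 2) / R             # observation terms
    return lp
```

A path passing through the observations receives a higher log posterior than one that misses them, all else being equal.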

## 3 Markov Chain Monte Carlo (MCMC) Smoother

Let \({\boldsymbol{\cal{X}}}^{k}\) denote the sample path after step *k* of the Markov chain. To make a proposal of \({\boldsymbol{\cal{X}}}^{k+1}\), one simulates a fictitious deterministic system in which \({\bf P} = ({\bf p}_0, \ldots, {\bf p}_N)\) represents momentum and \(\hat {\cal H}\) is a fictitious Hamiltonian, given as the sum of a potential energy \({\cal H}^{pot}\) and a kinetic energy \({\cal H}^{kin} = \frac{1}{2} \sum_{k = 1}^N {\bf p}_k^2\). For the posterior distribution of non-linear smoothing in Section 2, the potential energy is given by the negative logarithm of the (unnormalised) posterior density.

At each sampling step, one first draws a random initial momentum \({\bf P}(\tau = 0)\). After that, one integrates the system equations forward in fictitious time with time increment \(\delta\tau\) using the leapfrog scheme. Here *A* denotes the so-called preconditioning matrix, which accelerates the convergence of the matrix iterations; *A* is a circulant matrix constructed from a vector in which *α* is a tuning parameter. After *J* leapfrog iterations, the state \({\boldsymbol{\cal{X}}}(\tau = J\delta \tau)\) is proposed as \({\boldsymbol{\cal{X}}}^{k+1}\), which is accepted with probability \(\min\left(1, \exp(-\Delta \hat{\cal H})\right)\), where \(\Delta \hat{\cal H}\) is the change of the fictitious Hamiltonian along the simulated trajectory.

The performance of the sampler depends on the tuning parameters \(\delta\tau\), *J* and *α*. If \(\delta\tau\) is too large, the leapfrog algorithm gives a poor approximation to the true dynamics of the fictitious system. If *J* is too large, small discretisation errors can accumulate so that the simulated trajectory drifts away from the true one. Both lead to a low acceptance rate. With too small \(\delta\tau\) and *J*, however, the change of the sample path at each step is too small to improve mixing significantly.

## 4 Variational Gaussian Process Approximation (VGPA) Smoother

The true posterior process is approximated by the solution of a linear SDE with drift \(-{\bf A}(t){\bf X}(t) + {\bf b}(t)\) and the same diffusion matrix **D** as in Eq. 1. The diffusion matrix **D** must not be state-dependent, so that the solution **X**(*t*) of the approximate SDE is a Gaussian process. The matrix \({\bf A}(t) \in {\cal R}^{d \times d}\) and the vector \({\bf b}(t) \in {\cal R}^d\) are the two variational parameters to be optimised.

In other words, the posterior process \(p({\bf X}(t)\,|\,{\bf y}_1, \ldots, {\bf y}_M)\), say \(p(t)\), is approximated by a Gaussian Markov process, say \(q(t)\). If we discretise the linear SDE in the same way as the true SDE, the approximate posterior can be written down as a product of Gaussian transition densities. The marginal \(q({\bf x}_0)\) is fixed to \({\cal N}({\bf x}_0|{\bf m}_0, {\bf S}_0)\), and the prior on initial states \(p({\bf x}_0)\) is a uniform distribution.

The optimal **A**(*t*) and **b**(*t*) are obtained by minimising the Kullback-Leibler (KL) divergence between \(q(t)\) and \(p(t)\), which can be expressed in terms of an energy \(E(t) = E_{sde}(t) + E_{obs}(t)\), where \(E_{obs}(t) = \left< {\cal H}^{obs} \right>_{q_t}\) and \(q_t\) denotes the marginal distribution of the approximate posterior process \(q(t)\) at time *t*.

It is convenient to introduce \({\bf m}(t)\) and \({\bf S}(t)\), the mean and covariance matrix of the marginal distribution \(q_t\). However, the pair (\({\bf m}(t)\), \({\bf S}(t)\)) is not independent of (\({\bf A}(t)\), \({\bf b}(t)\)); there exist two constraints between them, given by Eqs. 8–9, which govern the evolution of the marginal mean and covariance of the linear SDE. The constrained optimisation over (\({\bf A}(t)\), \({\bf b}(t)\)) and (\({\bf m}(t)\), \({\bf S}(t)\)) is carried out by looking for the stationary points of a Lagrangian in which the constraints of Eqs. 8–9 are enforced by Lagrange multipliers \({\boldsymbol{\Psi}}(t)\) and \({\boldsymbol{\lambda}}(t)\). Setting to zero the variations of this Lagrangian with respect to **m**, **S**, **A** and **b**, we obtain a set of Euler-Lagrange equations. At a stationary point, **m**, **S**, **A**, **b**, \({\boldsymbol{\Psi}}\) and \({\boldsymbol{\lambda}}\) fulfil these equations and Eqs. 8–9 as well. Hence, the non-linear smoothing problem is reduced to solving a system of first-order differential equations.

This system is solved by a fixed-point iteration over **m**, **S**, **A**, **b**, \({\boldsymbol{\Psi}}\) and \({\boldsymbol{\lambda}}\). First, we compute initial estimates of \({\bf m}(t)\) and \({\bf S}(t)\) by performing standard Gaussian process regression [22]. Then, we set \({\boldsymbol{\Psi}}(t) = 0\) and \({\boldsymbol{\lambda}}(t) = 0\) for all *t*, and **A** and **b** are initialised from these quantities. At iteration *i*, we first update **m** and **S** by solving Eqs. 8–9 forward in time, using \({\bf A}^i\) and \({\bf b}^i\). Next, \({\boldsymbol{\Psi}}\) and \({\boldsymbol{\lambda}}\) are updated by solving Eqs. 12–13 backward in time with the final conditions \({\boldsymbol{\Psi}}(T) = 0\) and \({\boldsymbol{\lambda}}(T) = 0\), using \({\bf m}^{i+1}\) and \({\bf S}^{i+1}\). Note that the data are assimilated at this step. To clarify, this backward sweep can be split into two steps:

- 1. Between two successive observations, we update **Ψ** and **λ** by solving
$$\frac{d{\boldsymbol{\Psi}}(t)}{dt} = 2 {\boldsymbol{\Psi}}(t) {\bf A}(t) - \frac{\partial E_{sde}}{\partial {\bf S}}$$(16)
$$\frac{d{\boldsymbol{\lambda}}(t)}{dt} = {\bf A}^{\top}(t){\boldsymbol{\lambda}}(t) - \frac{\partial E_{sde}}{\partial {\bf m}}$$(17)
- 2. When there is an observation at \(t_{k_j}\), *j* = 1, ..., *M*, the following jump conditions apply:
$${\boldsymbol{\Psi}}\left(t^+_{k_j}\right) = {\boldsymbol{\Psi}}\left(t^-_{k_j}\right) - \frac{1}{2} {\bf H}^{\top}{\bf R}^{-1}{\bf H}$$(18)
$${\boldsymbol{\lambda}}\left(t^+_{k_j}\right) = {\boldsymbol{\lambda}}\left(t^-_{k_j}\right) + {\bf H}^{\top}{\bf R}^{-1}\left({\bf y}_j - {\bf H}{\bf m}\left(t_{k_j}\right)\right).$$(19)

Finally, the update of \({\bf A}(t)\) and \({\bf b}(t)\) is performed as a relaxed (smoothed) step towards the values prescribed by the remaining Euler-Lagrange equations, with step size *ω* < 1. The iteration stops when \({\cal L}\) has converged.
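The exact form of the **A**, **b** update is given in [10] and is not reproduced in this excerpt; the relaxation with step size ω < 1, which damps oscillations of the fixed-point iteration, can be sketched as follows (`A_target`, `b_target` denote the newly computed values from the Euler-Lagrange equations; all names are ours):

```python
import numpy as np

def relaxed_update(A_old, b_old, A_target, b_target, omega=0.15):
    """Smoothed update of the variational parameters with step size omega < 1.

    Moving only a fraction omega towards the newly computed values damps
    oscillations of the fixed-point iteration (a sketch; the exact update
    rule is given in [10]).
    """
    A_new = (1.0 - omega) * A_old + omega * A_target
    b_new = (1.0 - omega) * b_old + omega * b_target
    return A_new, b_new

def converged(L_old, L_new, tol=1e-6):
    """Stop the iteration once the Lagrangian L has stopped changing."""
    return abs(L_new - L_old) < tol
```

With ω = 0.15 (the value used in the experiments below), each sweep moves the parameters 15% of the way towards the Euler-Lagrange solution.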

## 5 Numerical Experiments

We consider the one-dimensional double-well system given by Eq. 20, where \(\xi(t)\) is white noise [1]. The parameter *κ* corresponds to \((2{\bf D})^{\frac{1}{2}}\) in Eq. 1 and determines the strength of the random fluctuations within the system. This system has two stable states, namely *x* = +1 and *x* = −1. However, random fluctuations can cause a transition of the system from one stable state to the other. The average time needed for such an event to occur is called the exit time [1]. In this study, we set *κ* = 0.5; the corresponding exit time is about 4000 time units [23]. This provides us with some prior knowledge on the initial state.
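This metastability is easy to reproduce numerically. The drift 4x(1 − x²) below is the standard double-well form and is our assumption, since Eq. 20 itself is not reproduced in this excerpt; with κ = 0.5 and a 12-unit window, a path started at x = +1 should almost always remain in that well:

```python
import numpy as np

def double_well_path(kappa=0.5, x0=1.0, dt=0.01, n_steps=1200, seed=0):
    """Euler-Maruyama simulation of a double-well diffusion (illustrative drift)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        drift = 4.0 * x[k] * (1.0 - x[k] ** 2)   # assumed double-well drift
        x[k + 1] = x[k] + drift * dt + kappa * np.sqrt(dt) * rng.standard_normal()
    return x

path = double_well_path()
# With an exit time of ~4000 time units, a transition within the 12-unit
# window is very unlikely, so the path fluctuates around x = +1.
```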

In the numerical experiments, we consider a smoothing window ranging from *t* = 0 to *t* = 12.0. Further, we assume that the state *x* is observed directly, which makes **H** the identity. Within the smoothing window, we generate three data sets, say *A*, *B*, and *C*, from a sample path previously considered in [23] and [25]. The variance *R* of the measurement errors is 0.04, 0.09, and 0.36, respectively. Each data set consists of seven data points, "measured" at times \(t_{k_1} = 1.0\), ...., \(t_{k_M} = 7.0\). Although multiple data sets were generated and analysed for each of those *R*-values, the results for data sets *A*, *B*, and *C* are representative and are chosen for illustration.

For the MCMC method, Eq. 20 is discretised with time increment *δt* = 0.1. The prior on the initial state is set to a Gaussian density with mean at *x* = +1 and variance equal to 0.05. This choice is based strongly on our prior knowledge of the system. The tuning parameters of Hybrid Monte Carlo are chosen as follows: *J* = 2, *δτ* = 0.005 and *α* = 0.02. The use of the preconditioning matrix *A* keeps the necessary *J* small, which makes the simulation computationally more efficient. However, multiplying the matrix *A* with various vectors costs extra computational time. Because of the circulant property of *A*, this part of the computational burden can be reduced [24].

For each of the three data sets, we run a Markov chain of length 5,000,000 and subsample from this chain with a sampling interval of 1,000. The first 1,000 subsamples are discarded as the burn-in period. It turns out to be insufficient to determine burn-in only by monitoring a summary statistic such as the energy \(\hat {\cal H}\); instead, one has to monitor the traces of the state *x* at different time points. In particular, those time points must be chosen from different phases of the smoothing window, for example transition phases, stationary phases, and the phases before the first and after the last observation.

For the variational GP approximation method, Eq. 20 is discretised with time increment *δt* = 0.01. The only tuning parameter, *ω*, is set to 0.15 as in [10]. The number of iterations required for convergence of the VGPA may increase when we extend the smoothing window or add more measurement noise, because of the poor initial estimates produced by standard GP regression.

Figures 1, 2 and 3 show the smoothing results for data sets *A*, *B*, and *C*, respectively. In each figure, the results of VGPA are compared with the MCMC results.

For the data sets with relatively small measurement noise, the estimated mean paths of both methods agree with each other very well, whereas the estimated conditional covariance of VGPA is overall slightly smaller than that of MCMC. It is also seen that the estimated mean path is slightly biased towards zero during both stationary phases. This can be explained by the fact that, although the posterior of *x* has a distinct mode at *x* = +1.0 before the transition and at *x* = −1.0 after it, the mode at the other stable state does not vanish. Moreover, we see that the mean path stays in the left well after the last observation. This is also in accordance with the large exit time of the system we consider.

A small dip in the mean paths is evident in the results of the VGPA smoother when we look at the initial period of the smoothing window. This is accompanied by a large conditional variance **S**(0). To explain this observation, we ran MCMC simulations with increasingly large prior variance on the initial state. As expected, the posterior variance of **x**_{0} increases with its prior variance. Further, it turns out that a similar dip in the mean paths appears when the prior variance becomes sufficiently large. This can be understood as follows: without any data, a double-well system does show a bimodal probabilistic structure. With a posterior mean of **x**_{0} close to +1 and a large posterior variance, the mean path can be biased further towards zero in the initial period, where the first observation has little influence. Note that the approximate posterior variance **S**(0) is not optimised, but held fixed.

Finally, we turn our attention to data set *C*, with very large measurement noise. Note that it is difficult to identify where the transition starts by visual inspection of the data alone; in contrast, this is possible with data sets *A* and *B*. From Fig. 3, we can see that there are significant differences in both the mean path and the conditional variance between the MCMC and VGPA smoothers, particularly in the period before *t* = 5.0. Due to the ambiguity in the data between *t* = 2.0 and *t* = 4.0, the MCMC sampler appears to be exploring the bimodal structure of the posterior distribution. In contrast, the approximate posterior of the VGPA method is confined to one particular mode at any time. This may explain the difference in mean path between the two methods and the significant underestimation of conditional variance by the VGPA smoother.

## 6 Discussion

By comparison with Markov Chain Monte Carlo, we have scrutinised the variational method for non-linear smoothing recently proposed in [10]. Both methods were tested on a double-well potential system. Three data sets with different levels of measurement noise were used to identify the strengths and weaknesses of the novel smoother.

Our investigation is based on the fact that MCMC methods provide an exact inference tool for comparison. For data sets with small or moderate measurement noise, it turns out that the VGPA method does produce a very accurate estimate of the mean path, while the conditional variance is slightly underestimated. As expected, the variational method is computationally more efficient than MCMC. Regarding other approximation-based smoothers, it has been reported that the Ensemble Kalman smoother fails to accurately reconstruct the transition of a double-well system from a sparse data set [25]. As stated in [25], the failure is due to the fact that in the KF and EnKF the propagated states are corrected by a linear interpolation scheme when new data are assimilated.

However, the weakness of the VGPA method is also evident when the ambiguity of the data becomes significant. Like many other variational approximation methods, the novel smoother is not good at exploring the multi-modal structure of some probability measures. In this paper, the role of the prior on the initial state was also investigated. It turns out that the estimate of the mean path can be biased in the initial phase of the smoothing window, where the first observation has little influence, unless the prior on the initial state is incorporated into the variational smoother.

In this paper, the focus of the comparison is on the accuracy of the variational smoother relative to MCMC. Future work will focus on a comprehensive assessment of its performance relative to other approximation-based algorithms. As applications of the so-called "statistical linearisation" strategy, the ensemble Kalman smoother [17] and the unscented Kalman smoother [26] are of most interest. For multimodal systems, the Gaussian sum smoother proposed in [27] is particularly promising, as it propagates a Gaussian sum approximation of the true marginal posterior [28].

Many MCMC algorithms suffer from poor mixing when high-dimensional, complex stochastic systems are concerned. The development of efficient MCMC algorithms remains a challenging task. A combination of variational approximation methods and sampling methods offers a promising new direction for improving the efficiency of MCMC algorithms.