# Sequential Monte Carlo with transformations


## Abstract

This paper examines methodology for performing Bayesian inference sequentially on a sequence of posteriors on spaces of different dimensions. For this, we use sequential Monte Carlo samplers, introducing the innovation of using deterministic transformations to move particles effectively between target distributions with different dimensions. This approach, combined with adaptive methods, yields an extremely flexible and general algorithm for Bayesian model comparison that is suitable for use in applications where the acceptance rate in reversible jump Markov chain Monte Carlo is low. We use this approach on model comparison for mixture models, and for inferring coalescent trees sequentially, as data arrives.

## Keywords

Bayesian model comparison · Coalescent · Trans-dimensional Monte Carlo

## 1 Introduction

### 1.1 Sequential inference

Much of the methodology for Bayesian computation is designed with the aim of approximating a posterior \(\pi \). The most prominent approach is to use Markov chain Monte Carlo (MCMC), in which a Markov chain that has \(\pi \) as its limiting distribution is simulated. It is well known that this process may be computationally expensive; that it is not straightforward to tune the method automatically; and that it can be challenging to determine how long to run the chain for. Therefore, designing and running an MCMC algorithm to sample from a particular target \(\pi \) may require much human input and computer time. This creates particular problems if a user is in fact interested in a number of target distributions \(\left( \pi _{t}\right) _{t=1}^{T}\) defined possibly on different spaces: using MCMC on each target requires additional computer time to run the separate algorithms and each may require human input to design the algorithm, determine the burn in, etc. This paper has as its subject the task of using a Monte Carlo method to simulate from each of the targets \(\pi _{t}\) that avoids these disadvantages.

*particles* that give an empirical approximation to \(\pi _{0}\) then to, for \(t=0,...,T-1\), update the set of particles approximating \(\pi _{t}\) such that they, after changing their positions using a kernel \(K_{t+1}\) and updating their weights, approximate \(\pi _{t+1}\). This approach is particularly useful where neighbouring target distributions in the sequence are similar to each other, and in this case has the following advantages over running *T* separate MCMC algorithms.

The similarity of neighbouring targets can be exploited since particles approximating \(\pi _{t}\) may not need much adjustment to provide a good approximation to \(\pi _{t+1}\). We have the desirable property that we find approximations to each of the targets in the sequence. Further, we also may gain when compared to running a single MCMC algorithm to target \(\pi _{T}\), since it may be complicated to set up an MCMC that simulates well from \(\pi _{T}\) without using a sequence of simpler distributions to guide particles into the appropriate regions of the space.

When the targets \(\left( \pi _{t}\right) _{t=1}^{T}\) are only known up to a constant of proportionality, SMC samplers also provide unbiased estimates of the corresponding normalising constants. In a Bayesian context, the normalising constant of \(\pi _{t}\) is the *marginal likelihood* or *evidence*, a key quantity in Bayesian model comparison. For much of the paper, and in abuse of notation, we use the same letters for denoting distributions and corresponding densities. In addition, we use tildes to denote unnormalised densities; e.g., let \(\theta \sim \pi _t(\cdot )\) then its density is given by \(\pi _t\left( \theta \right) ={\tilde{\pi }}_t\left( \theta \right) /Z_t\), where \(Z_t\) denotes the normalising constant.
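As a minimal sketch of how such normalising-constant estimates arise, the ratio \(Z_{t+1}/Z_{t}\) can be estimated by the weighted average of the incremental weights \({\tilde{\pi }}_{t+1}/{\tilde{\pi }}_{t}\) over particles that approximate \(\pi _{t}\). The function name and the toy Gaussian targets below are illustrative assumptions, chosen only because the true ratio is known in closed form.

```python
import numpy as np

def log_ratio_estimate(log_w_norm, log_incr):
    """Estimate log(Z_{t+1}/Z_t) as log sum_p w_t^(p) * (incremental weight)^(p).

    log_w_norm: log of the normalised weights at iteration t.
    log_incr:   log of pi~_{t+1}(theta) / pi~_t(theta) for each particle.
    """
    a = log_w_norm + log_incr
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

rng = np.random.default_rng(0)
P = 200_000
theta = rng.normal(0.0, 1.0, size=P)      # exact draws from pi_0 = N(0, 1)
log_w = np.full(P, -np.log(P))            # equal normalised weights

# Toy unnormalised densities (assumed for illustration):
#   pi~_0(x) = exp(-x^2 / 2), Z_0 = sqrt(2 pi)
#   pi~_1(x) = exp(-x^2),     Z_1 = sqrt(pi)
log_incr = -theta**2 - (-0.5 * theta**2)

est = log_ratio_estimate(log_w, log_incr)
true = 0.5 * np.log(np.pi) - 0.5 * np.log(2.0 * np.pi)   # equals -0.5 * log(2)
```

Multiplying such ratio estimates over successive iterations gives an estimate of \(Z_{T}/Z_{0}\); working on the log scale, as here, avoids numerical underflow.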

### 1.2 Outline of paper

In this paper, we consider the case where each \(\pi _{t}\) is defined on a space of different dimension, often of increasing dimension with *t*. We provide a general framework for implementing an SMC algorithm in the aforementioned setting. A particle filter is designed to be used in a special case of this situation: the case where \(\pi _{t}\) is the path distribution in a state space model, \(\pi _{t}\left( \theta _{1:t}{\mid } y_{1:t}\right) \). A particle filter exploits the Markov property in order to update a particle approximation of \(\pi _{t}\left( \theta _{1:t}{\mid } y_{1:t}\right) \) to an approximation of \(\pi _{t+1}\left( \theta _{1:t+1}{\mid } y_{1:t+1}\right) \). In this paper, we consider targets in which there is not such a straightforward relationship between \(\pi _{t}\) and \(\pi _{t+1}\). In addition, the approach we present is useful in Bayesian model comparison that results from constructing an SMC sampler where each \(\pi _{t}\) corresponds to a different model and there are *T* models that can be ordered, usually in order of their complexity. Deterministic transformations are used to move points between one distribution and the next, potentially yielding efficient samplers by reducing the distance between successive distributions. We also show how the same framework can be used for sequential inference under the coalescent model (Kingman 1982).

The use of deterministic transformations to improve SMC has been considered previously in a number of papers (e.g., Chorin and Tu 2009; Vaikuntanathan and Jarzynski 2011; Reich 2013; Heng et al. 2015; South et al. 2019). Several of these papers are focussed on how to construct useful transformations in a generic way including, for example: methods that map high density regions of the proposal to high density regions of the target (Chorin and Tu 2009) and methods that approximate the solution of ordinary differential equations that mimic the SMC dynamics (Heng et al. 2015). This paper is different in that it focuses on the particular case of a sequence of distributions on spaces of different dimensions, and uses transformations and proposals that are designed for the applications we study.

Section 2 describes the methodology introduced in the paper, considering both practical and theoretical aspects, and provides comparison to existing methods. We provide an example of the use of the methodology for Bayesian model comparison in Sect. 3, on the Gaussian mixture model. In Sect. 4, we use our methodology for online inference under the coalescent, using the flexibility of our proposed approach to describe a method for moving between coalescent trees. In Sect. 5, we present a final discussion and outline possible extensions.

## 2 SMC samplers with transformations

### 2.1 SMC samplers with increasing dimension

The use of SMC samplers on a sequence of targets of increasing dimension has been described previously (e.g., Naesseth et al. 2014; Everitt et al. 2017; Dinh et al. 2018). These papers introduce an additional proposal distribution for the variables that are introduced at each step. In this section, we straightforwardly see that this is a particular case of the SMC sampler in Del Moral et al. (2007).

#### 2.1.1 SMC samplers with MCMC moves

*T* iterations. Let \(\pi _{t}\) be our target distribution of interest at iteration *t*, this being the distribution of the random vector \(\theta _{t}\) on space *E*. Throughout the paper, the values taken by particles in the SMC sampler have a \(^{(p)}\) superscript to distinguish them from random vectors; so for example \(\theta _{t}^{(p)}\) is the value taken by the *p*th particle. We define \(\pi _{0}\) to be a distribution from which we can simulate directly, simulate each particle \(\theta _{0}^{(p)}\sim \pi _{0}\) and set its normalised weight \(w_{0}^{(p)}=1/P\). Then for \(0\le t<T\) at the \(\left( t+1\right) \)th iteration of the SMC sampler, the following steps are performed.

1. **Reweight** Calculate the updated (unnormalised) weight \({\tilde{w}}_{t+1}^{(p)}\) of the *p*th particle

   $${\tilde{w}}_{t+1}^{(p)} = w_{t}^{(p)}\frac{{\tilde{\pi }}_{t+1}\left( \theta _{t}^{(p)}\right) }{{\tilde{\pi }}_{t}\left( \theta _{t}^{(p)}\right) }. \tag{1}$$

2. **Resample** Normalise the weights to obtain normalised weights \(w_{t+1}^{(p)}\) and calculate the *effective sample size* (ESS) (Kong et al. 1994). If the ESS falls below some threshold, e.g., \(\alpha P\) where \(0<\alpha <1\), then resample.
3. **Move** For each particle use an MCMC move with target \(\pi _{t+1}\) to move \(\theta _{t}^{(p)}\) to \(\theta _{t+1}^{(p)}\).
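The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the paper: the targets are a toy sequence of one-dimensional Gaussians whose mean moves from 0 to 3, resampling is multinomial for brevity, and the move is a single random-walk Metropolis step. The names `smc_sampler`, `log_targets` and `sample_pi0` are hypothetical.

```python
import numpy as np

def log_sum_exp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def smc_sampler(log_targets, sample_pi0, P=2000, alpha=0.5, step=1.0, seed=0):
    """SMC sampler with MCMC moves on a fixed-dimensional space.

    log_targets: list of functions returning log unnormalised densities
                 log pi~_0, ..., log pi~_T evaluated on an array of particles.
    sample_pi0:  function (rng, P) -> P exact draws from pi_0.
    """
    rng = np.random.default_rng(seed)
    theta = sample_pi0(rng, P)
    log_w = np.full(P, -np.log(P))            # normalised weights 1/P
    for t in range(len(log_targets) - 1):
        # 1. Reweight, as in Eq. (1), then normalise on the log scale.
        log_w = log_w + log_targets[t + 1](theta) - log_targets[t](theta)
        log_w = log_w - log_sum_exp(log_w)
        w = np.exp(log_w)
        # 2. Resample (multinomial, for brevity) if ESS < alpha * P.
        if 1.0 / np.sum(w ** 2) < alpha * P:
            theta = theta[rng.choice(P, size=P, p=w)]
            log_w = np.full(P, -np.log(P))
        # 3. Move: one random-walk Metropolis step targeting pi_{t+1}.
        prop = theta + step * rng.normal(size=P)
        log_acc = log_targets[t + 1](prop) - log_targets[t + 1](theta)
        accept = np.log(rng.uniform(size=P)) < log_acc
        theta = np.where(accept, prop, theta)
    return theta, np.exp(log_w)

# Toy sequence of targets (assumed): N(m, 1) with m moving from 0 to 3.
means = np.linspace(0.0, 3.0, 11)
log_targets = [lambda x, m=m: -0.5 * (x - m) ** 2 for m in means]
theta, w = smc_sampler(log_targets, lambda rng, P: rng.normal(size=P))
posterior_mean = np.sum(w * theta)            # should be close to 3
```

Carrying the weights on the log scale is a design choice that guards against underflow when neighbouring targets differ substantially.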

#### 2.1.2 Increasing dimension

*t*, we use: \(\theta _{t}\) to denote the random vector of interest; \(u_{t}\) to denote a random vector that contains the additional dimensions added to the parameter space at iteration \(t+1\), and \(v_{t}\) to denote the remainder of the dimensions that will be required at future iterations. Our SMC sampler is constructed on a sequence of distributions \(\varphi _{t}\) of the random vector \(\vartheta _{t}=\left( \theta _{t},u_{t},v_{t}\right) \) in space \(E=\left( \varTheta _t, U_t, V_t \right) \), with

*t*, and \(\psi _{t}\) and \(\phi _{t}\) are (normalised) distributions on the additional variables so that \(\pi _{t}\) and \(\varphi _{t}\) have the same normalising constant. The weight update in this SMC sampler is

### 2.2 Motivating example: Gaussian mixture models

#### 2.2.1 RJMCMC for Gaussian mixture models

*t* components, to be estimated from data *y*, consisting of *N* observed data points. For simplicity, we describe a “without completion” model, where we do not introduce a label *z* that assigns data points to components. Let the *s*th component have a mean \(\mu _{s}\), precision \(\tau _{s}\) and weight \(\nu _{s}\), with the weights summing to one over the components. Let \(p_{\mu }\) and \(p_{\tau }\) be the respective priors on these parameters, which are the same for every component, and let \(p_{\nu }\) be the joint prior over all of the weights. The likelihood under *t* components is

*t* is chosen to be a random variable and assigned a prior \(p_{t}\), which here we choose to be uniform over the values 1 to *T*. Let

*t*. RJMCMC simulates from the joint space of \(\left( t,\theta _{t}\right) \) in which a mixture of moves is used, some fixed-dimensional (*t* fixed) and some trans-dimensional (to mix over *t*). The simplest type of trans-dimensional move in this case is that of a birth move for moving from *t* to \(t+1\) components or a death move for moving from \(t+1\) to *t* (Richardson and Green 1997). We consider a birth move, a uniform prior probability over *t* and equal probability of proposing birth or death. For the purposes of exposition, we assume that the weights of the components are chosen to be fixed in each model. (This assumption will be relaxed later in Sect. 3.) Let \(u_{t}=\left( \mu _{t+1},\tau _{t+1}\right) \) be the mean and precision of the new component and let \(\psi _{t}\left( u_{t}{\mid }\theta _{t}\right) =p_{\mu }\left( \mu _{t+1}\right) p_{\tau }\left( \tau _{t+1}\right) \). A birth move simulates \(u_{t}\sim \psi _{t}\) and has acceptance probability

#### 2.2.2 Comparing RJMCMC and SMC samplers

*t*th distribution is the mixture of Gaussians with *t* components. By choosing \(u_t\) and \(\psi _t\) as above, together with

IS performs better if the proposal distribution is close to the target, whilst ensuring that the proposal has heavier tails than the target. The original RJMCMC algorithm allows the possibility to construct such proposals by allowing for the use of transformations to move from the parameters of one model to the parameters of another. Richardson and Green (1997) provide a famous example of this in the Gaussian mixture case in the form of split-merge moves. Focusing on the split move, the idea is to propose splitting an existing component, using a moment matching technique to ensure that the new components have appropriate means, variances and weights.
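To make the moment matching concrete, the following sketch implements a split of one component in the form commonly attributed to Richardson and Green (1997), written in terms of variances \(\sigma ^{2}=1/\tau \) rather than precisions. The function name and the Beta distributions used to draw the auxiliary variables \(u_{1},u_{2},u_{3}\) are assumptions made for illustration; the point is that the zeroth, first and second moments of the split pair match those of the original component.

```python
import numpy as np

def split_component(nu, mu, var, u1, u2, u3):
    """Split (weight nu, mean mu, variance var) into two components.

    u1, u2, u3 are auxiliary variables in (0, 1). The construction preserves
    the total weight, the weighted mean and the weighted second moment.
    """
    nu1, nu2 = nu * u1, nu * (1.0 - u1)
    sd = np.sqrt(var)
    mu1 = mu - u2 * sd * np.sqrt(nu2 / nu1)
    mu2 = mu + u2 * sd * np.sqrt(nu1 / nu2)
    var1 = u3 * (1.0 - u2 ** 2) * var * nu / nu1
    var2 = (1.0 - u3) * (1.0 - u2 ** 2) * var * nu / nu2
    return (nu1, mu1, var1), (nu2, mu2, var2)

rng = np.random.default_rng(0)
nu, mu, var = 0.6, 1.5, 2.0
# Auxiliary draws; these Beta choices are an illustrative assumption.
u1, u2, u3 = rng.beta(2, 2), rng.beta(2, 2), rng.beta(1, 1)
(nu1, mu1, var1), (nu2, mu2, var2) = split_component(nu, mu, var, u1, u2, u3)

weight_sum = nu1 + nu2
first_moment = nu1 * mu1 + nu2 * mu2
second_moment = nu1 * (mu1 ** 2 + var1) + nu2 * (mu2 ** 2 + var2)
```

Because the map from \(\left( \nu ,\mu ,\sigma ^{2},u_{1},u_{2},u_{3}\right) \) to the two new components is deterministic and invertible, it is the kind of transformation that can play the role of a proposal between models.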

Annealed importance sampling (AIS) (Neal 2001) yields a lower variance than IS. The idea is to use intermediate distributions to form a path between the IS proposal and target, using MCMC moves to move points along this path. This approach was shown to be beneficial in some cases by Karagiannis and Andrieu (2013).

The estimator in Eq. (8) uses only a single importance point. It would be improved by using multiple points. However, using such an estimator directly within RJMCMC leads to a “noisy” algorithm that does not have the correct target distribution for the same reasons as those given for the noisy exchange algorithm in Alquier et al. (2016). We note that recent work (Andrieu et al. 2018) suggests a correction to provide an exact approach based on the same principle.

*P* particles.

### 2.3 Using transformations in SMC samplers

*transformation SMC* (TSMC). We again use the approach of performing SMC on a sequence of targets \(\varphi _{t}\), with each of these targets being on a space of fixed dimension, constructed such that they have the desired target \(\pi _{t}\) as a marginal. In this section, the dimension of the space on which \(\pi _{t}\) is defined again varies with *t*, but is not necessarily increasing with *t*. Let \(\theta _{t}\) be the random vector of interest at SMC iteration *t*: we wish to approximate the distributions \(\pi _{t}\) of \(\theta _{t}\) in the space \(\varTheta _{t}\). Let \(\left( {\tilde{\varphi }}_{t} \right) _{t=1}^T\) be a sequence of unnormalised targets whose normalised versions are \(\left( \varphi _{t} \right) _{t=1}^T\), with \(\varphi _{t}\) being the distribution of the random vector \(\vartheta _{t}=\left( \theta _{t},u_{t}\right) \) in the space \(E_{t}=\left( \varTheta _{t},U_{t}\right) \) where

*t*, but the dimension of \(E_{t}\) must be constant in *t*. We introduce a transformation \(G_{t\rightarrow t+1}:\varTheta _{t}\times U_{t}\rightarrow \varTheta _{t+1}\times U_{t+1}\) and define

*X*, and let the distribution of \(\vartheta _{t+1\rightarrow t}\) be \(\varphi _{t+1\rightarrow t}\). These distributions may be derived using standard results about the distributions of transforms of random variables: e.g., where the \(E_{t}\) are continuous spaces and where \(G_{t\rightarrow t+1}\) is a diffeomorphism, having Jacobian determinant \(J_{t\rightarrow t+1}\), with inverse \(G_{t+1\rightarrow t}\) having Jacobian determinant \(J_{t+1\rightarrow t}\). In this case we have

1. **Transform** For the *p*th particle, apply \(\vartheta _{t\rightarrow t+1}^{(p)}=G_{t\rightarrow t+1}\left( \vartheta _{t}^{(p)}\right) \).
2. **Reweight and resample** Calculate the updated (unnormalised) weight

   $${\tilde{w}}_{t+1}^{(p)} = w_{t}^{(p)}\frac{{\tilde{\varphi }}_{t+1}\left( \vartheta _{t\rightarrow t+1}^{(p)}\right) }{{\tilde{\varphi }}_{t\rightarrow t+1}\left( \vartheta _{t\rightarrow t+1}^{(p)}\right) }. \tag{9}$$

   Where \(G_{t\rightarrow t+1}\) is a diffeomorphism we have

   $${\tilde{w}}_{t+1}^{(p)}=w_{t}^{(p)}\frac{{\tilde{\pi }}_{t+1}\left( \theta _{t\rightarrow t+1}^{(p)}\right) \psi _{t+1}\left( u_{t\rightarrow t+1}^{(p)}{\mid }\theta _{t\rightarrow t+1}^{(p)}\right) }{{\tilde{\pi }}_{t}\left( \theta _{t}^{(p)}\right) \psi _{t}\left( u_{t}^{(p)}{\mid }\theta _{t}^{(p)}\right) \left| J_{t+1\rightarrow t}\right| }. \tag{10}$$

   It is possible, depending on the transformation used, that this weight update involves none of the dimensions above \(\max \left\{ \dim \left( \theta _{t}\right) ,\dim \left( \theta _{t+1}\right) \right\} \) as happened in (4). Then resample if the ESS falls below some threshold, as described previously.
3. **Move** For each *p*, let \(\vartheta _{t+1}^{(p)}\) be the result of an MCMC move with target \(\varphi _{t+1}\), starting from \(\vartheta _{t\rightarrow t+1}^{(p)}\). We need not simulate *u* variables that are not used at the next iteration.
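The transform and reweight steps can be illustrated with a toy example in which a one-dimensional \(\theta _{t}\) and a one-dimensional auxiliary variable \(u_{t}\) are mapped to a two-dimensional \(\theta _{t+1}\), so that \(U_{t+1}\) is empty and the \(\psi _{t+1}\) term in Eq. (10) drops out. Everything specific here is an assumption made for illustration: the linear map `G`, the scale `S`, and the Gaussian choices for \({\tilde{\pi }}_{t}\), \(\psi _{t}\) and \({\tilde{\pi }}_{t+1}\), which were picked so that the true ratio \(Z_{t+1}/Z_{t}=\sqrt{2\pi }\) is known. The resample and move steps are omitted.

```python
import numpy as np

S = 1.0 / np.sqrt(2.0)   # scale of the toy transformation (assumption)

def G(vartheta):
    """Toy diffeomorphism (theta, u) -> (theta + S*u, theta - S*u)."""
    theta, u = vartheta[:, 0], vartheta[:, 1]
    return np.column_stack([theta + S * u, theta - S * u])

def log_abs_jac_inverse(moved):
    """log |J_{t+1 -> t}|. The forward map has |det| = 2S, so the inverse has 1/(2S)."""
    return np.full(moved.shape[0], -np.log(2.0 * S))

def log_phi_t(vartheta):
    """log of pi~_t(theta) * psi_t(u): unnormalised N(0,1) times normalised N(0,1)."""
    theta, u = vartheta[:, 0], vartheta[:, 1]
    return -0.5 * theta ** 2 + (-0.5 * u ** 2 - 0.5 * np.log(2.0 * np.pi))

def log_pi_next(x):
    """log pi~_{t+1}: unnormalised standard bivariate Gaussian, Z_{t+1} = 2 pi."""
    return -0.5 * np.sum(x ** 2, axis=1)

def transform_and_reweight(log_w, vartheta):
    moved = G(vartheta)                                   # Transform step
    log_incr = (log_pi_next(moved)                        # numerator of Eq. (10)
                - log_phi_t(vartheta)                     # pi~_t * psi_t
                - log_abs_jac_inverse(moved))             # |J_{t+1 -> t}|
    return moved, log_w + log_incr

rng = np.random.default_rng(1)
P = 100_000
vartheta = rng.normal(size=(P, 2))        # theta ~ pi_t = N(0,1), u ~ psi_t = N(0,1)
log_w = np.full(P, -np.log(P))
moved, new_log_w = transform_and_reweight(log_w, vartheta)
ratio_estimate = np.sum(np.exp(new_log_w))   # estimates Z_{t+1}/Z_t = sqrt(2 pi)
```

Forgetting the Jacobian term is a common source of error in such schemes; in this toy case it would bias the ratio estimate by a factor of \(1/\sqrt{2}\).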

*t*, together with new dimensions simulated using a birth move, to explore model \(t+1\). The sampler in this section allows us to use a similar idea using more sophisticated proposals, such as split moves. The efficiency of the sampler depends on the choice of \(\psi _{t}\) and \(G_{t\rightarrow t+1}\). As previously, a good choice for these quantities should result in a small distance between \(\varphi _{t\rightarrow t+1}\) and \(\varphi _{t+1}\), whilst ensuring that \(\varphi _{t\rightarrow t+1}\) has heavier tails than \(\varphi _{t+1}\). As in the design of RJMCMC algorithms, usually these choices will be made using application-specific insight.

### 2.4 Design of SMC samplers

#### 2.4.1 Using intermediate distributions

*k*th being \(\varphi _{t,k}\), so that \(\varphi _{t,0}=\varphi _{t}\) and \(\varphi _{t,K}=\varphi _{t+1}\) and therefore \(\varphi _{t,K}=\varphi _{t+1,0}\). We use *geometric annealing*, i.e.,

*t* index when \(k=K\), then setting \(k=0\) and finally using a transform move \(\vartheta _{t\rightarrow t+1,0}^{(p)}=G_{t\rightarrow t+1}\left( \vartheta _{t,K}^{(p)}\right) \) for each \(p\in \left\{ 1,\dots ,P \right\} \). The weight update becomes

#### 2.4.2 Adaptive SMC

*t*, \(\left( k+1\right) \) this approach uses the conditional ESS (CESS)

#### 2.4.3 Auxiliary variables in proposals

For the Gaussian mixture example, for two or more components, when using a split move we must choose the component that is to be split. We may think of the choice of splitting different components as offering multiple “routes” through a space of distributions, with the same start and end points. Another alternative route would be given by using a birth move rather than a split move. In this section, we generalise TSMC to allow multiple routes. We restrict our attention to the case where the choice of multiple routes is possible at the beginning of a transition from \(\varphi _{t}\) to \(\varphi _{t+1}\), when \(k=0\) (more general schemes are possible). A route corresponds to a particular choice for the transformation \(G_{t\rightarrow t+1}\); thus, we consider a set of \(M_{t}\) possible transformations indexed by the discrete random variable \(l_{t}\), using the notation \(G_{t\rightarrow t+1}^{\left( l_{t}\right) }\) (also using this superscript on distributions that depend on this choice of *G*). We now augment the target distribution with variables \(l_{0},...,l_{T-1}\) and, for each *t*, alter the distribution \(\psi _{t}\) such that it becomes a joint distribution on \(u_{t}\) and \(l_{t}\). Our sampler will draw the *l* variables at the point at which they are introduced, so that different particles use different routes, but will not perform any MCMC moves on the variable after it is introduced. This leads to the sampler being degenerate in most of the *l* variables, but this does not affect the desired target distribution.

### 2.5 Discussion

One of the most obvious applications of TSMC is Bayesian model comparison. SMC samplers are a generalisation of several other techniques, such as IS, AIS and the “stepping stone” algorithm from Xie et al. (2011) (which is essentially equivalent to AIS where more than one MCMC move is used per target distribution); thus, we expect a well-designed SMC to outperform these techniques in most cases. Zhou et al. (2015) review existing techniques that use SMC for model comparison and conclude that “the SMC2 algorithm (moving from prior to posterior) with adaptive strategies is the most promising among the SMC strategies.” In Sect. 3, we provide a detailed comparison of TSMC with SMC2 and find that TSMC can have significant advantages.

Section 2.2.2 compared TSMC with RJMCMC, noting that RJMCMC explores the model space by using a high variance estimator of a Bayes factor at each MCMC iteration, whereas TSMC is designed to construct a single lower variance estimator of each Bayes factor. The high variance estimators within RJMCMC are the cause of its most well-known drawback: that the acceptance rate of trans-dimensional moves can be very small. The design of TSMC, in which each model is visited in turn, completely avoids this issue. One might envisage that despite avoiding poor mixing, TSMC might instead yield high variance Bayes factor estimators for challenging problems. However, TSMC has the advantage that adaptive methods may be used in order to reduce the possibility that the estimators have high variance by, for example, automatically using more intermediate distributions. The possibility to adaptively choose intermediate distributions also provides an advantage over the approach of Karagiannis and Andrieu (2013), where a sequence of intermediate distributions for estimating each Bayes factor must be specified in advance.

Since, by construction, TSMC is a particular instance of SMC as described in Del Moral et al. (2006), all of the theoretical properties of a standard SMC algorithm apply. Of particular interest are the properties of the method as the dimension of the parameter spaces grows. TSMC is constructed on a sequence of extended spaces \(E_t\), each of which has dimension \(d_{T}\), thus in the worst case, the results for an SMC sampler on a space of dimension \(d_{T}\) apply. In this respect, the authors in Beskos et al. (2014) have analysed the stability of SMC samplers as the dimension of the state space increases when the number of particles *P* is fixed. Their work provides justification, to some extent, for the use of intermediate distributions \(\left( \varphi _{t,k}\right) _{k=1}^{K}\). Under fairly strong assumptions, it has been shown that when the number of intermediate distributions \(K={\mathcal {O}}\left( d_{T}\right) \), and as \(d_{T}\rightarrow \infty \), the effective sample size \(\text{ ESS }_{t+1}^{P}\) is stable in the sense that it converges to a non-trivial random variable taking values in \(\left( 1,P\right) \). The total computational cost for bridging \(\varphi _{t}\) and \(\varphi _{t+1}\), assuming a product form of \(d_{T}\) components, is \({\mathcal {O}}\left( Pd_{T}^{2}\right) \). However, in practice, due to the cancellation of “fill in” variables, and using sensible transformations between consecutive distributions, one could expect a much lower effective dimension of the problem; an example of this situation is presented in the next section. Some theoretical properties of the method are explored further in the Supplementary Information.

## 3 Bayesian model comparison for mixtures of Gaussians

In this section, we examine the use of TSMC on the mixture of Gaussians application in Sect. 2.2: i.e., we wish to perform Bayesian inference of the number of components *t*, and their parameters \(\theta _{t}\), from data *y*. For simplicity, we study the “without completion” model, where component labels for each measurement are not included in the model. In the next sections, we outline the design of the algorithms used, then in Sect. 3.2 we describe the results of using these approaches on previously studied data, highlighting features of the approach. Further results are given in the Supplementary Information.

### 3.1 Description of algorithms

Let *t* be the unknown number of mixture components, and \(\left( \mu _{1:t},\tau _{1:t},\nu _{1:t}\right) \) (means, precisions and weights respectively) be the parameters of the *t* components. Our likelihood is the same as in Eq. (5); we use priors \(\tau \sim \text {Gamma}\left( 2,2S^{2}/100\right) \) and \(\nu _{1:t}\sim \text {Dir}\left( 1,...,1\right) \) for the precisions and weights, respectively, and for the means we choose an unconstrained prior of \(\mu \sim {\mathcal {N}}\left( m,S^{2}\right) \), where *m* is the mean and *S* is the range of the observed data. We impose an ordering constraint on the means, as described in Jasra et al. (2005), which simplifies the problem by eliminating many posterior modes with the added benefit of improving the interpretability of our results. For simplicity, we have also not included the commonly used “random beta” hierarchical prior structure on \(\tau \) (Richardson and Green 1997), which from a statistical perspective is suboptimal but which simplifies our presentation of the behaviour of TSMC.
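For orientation, the following sketch evaluates the log-likelihood of a “without completion” Gaussian mixture parameterised by means, precisions and weights, i.e., the sum over data points of the log of a weighted sum of Gaussian densities. This is the standard form of such a likelihood and is assumed here to correspond to Eq. (5); the function name and the toy data are illustrative, and the priors are not included.

```python
import numpy as np

def mixture_loglik(y, mu, tau, nu):
    """Log-likelihood of data y under a Gaussian mixture.

    mu, tau, nu: component means, precisions and weights (same length).
    Computed with a log-sum-exp over components for numerical stability.
    """
    y = np.asarray(y, dtype=float)[:, None]                  # shape (N, 1)
    mu, tau, nu = (np.asarray(a, dtype=float)[None, :] for a in (mu, tau, nu))
    log_comp = (np.log(nu) + 0.5 * np.log(tau) - 0.5 * np.log(2.0 * np.pi)
                - 0.5 * tau * (y - mu) ** 2)                 # shape (N, t)
    m = log_comp.max(axis=1, keepdims=True)
    per_point = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return float(np.sum(per_point))

y = np.array([0.1, 0.4, 0.9, 1.3, 2.2])                      # toy data (assumed)
ll_one = mixture_loglik(y, [0.5], [2.0], [1.0])
# Two identical components must give the same value as one component.
ll_two = mixture_loglik(y, [0.5, 0.5], [2.0, 2.0], [0.3, 0.7])
```

Adding the log prior densities to this quantity would give the log of the unnormalised posterior \({\tilde{\pi }}_{t}\) needed in the weight updates.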

We use different variants of TSMC (as described in Sect. 2.3), using a sequence of distributions \(\left( \varphi _{t}\right) _{t=1}^{T}\) where \(\varphi _{t}\left( \vartheta _{t}\right) =\pi _{t}\left( \theta _{t}\right) \psi _{t}\left( u_{t}\right) \). \(\pi _{t}\) is here the posterior on *t* components given by Eq. (6), and \(\psi _{t}\) is different depending on the transformation that is chosen. We use intermediate distributions (as described in Sect. 2.4.1), using geometric annealing, in all of our algorithms, making use of the adaptive method from Sect. 2.4.2 to choose how to place these distributions. The results in this section focus particularly on illustrating the advantages afforded by making an intelligent choice of the transformation in TSMC. Full details of the transformations, weight updates and MCMC moves are given in the Supplementary Information. In summary, we use the birth and split moves referred to in Sect. 2.2, together with a move that orders the components. For both moves, we present results using the weight updates in Eqs. (14) (referred to henceforth as the conditional approach) and (15) (referred to as the marginal approach).

### 3.2 Results

We ran SMC2 and the TSMC approaches on the enzyme data from Richardson and Green (1997). We ran the algorithms 50 times, up to a maximum of \(T=8\) components, with \(P=500\) particles. We used an adaptive sequence of intermediate distributions, choosing the next intermediate distribution to be the one that yields a CESS (Eq. 13) of \(\beta P\), where \(\beta =0.99\). We resampled using stratified resampling when the ESS falls below \(\alpha P\), where \(\alpha =0.5\). Figure 1 compares the birth and split TSMC algorithms when moving from one to two components. We observe that the split transformation has the effect of moving the parameters to initial values that are more appropriate for exploring the posterior on two components. For this dataset, the birth move is a poor choice for the existing parameters in the model: Fig. 1e shows that no particles drawn from the proposal (i.e., the posterior for the single component model) overlap with the posterior for the first component in the two component model. Despite the poor proposal, the intermediate distributions (of which there are many more than used for the split move) enable a good representation of the posterior distribution, although below we see that the poor proposal results in very poor estimates of the marginal likelihood.
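The adaptive choice of the next intermediate distribution can be sketched as a one-dimensional search over the annealing parameter. The sketch assumes geometric annealing, so that the incremental log weight for moving from temperature \(\gamma _{k}\) to \(\gamma \) is \(\left( \gamma -\gamma _{k}\right) \) times a per-particle log ratio, and assumes the commonly used definition of the CESS, \(P\left( \sum _{p}w^{(p)}\omega ^{(p)}\right) ^{2}/\sum _{p}w^{(p)}\left( \omega ^{(p)}\right) ^{2}\), for normalised weights \(w\) and incremental weights \(\omega \). The function names, the bisection tolerance and the synthetic log ratios are illustrative.

```python
import numpy as np

def cess(w, log_incr):
    """Conditional ESS for normalised weights w and incremental log weights."""
    omega = np.exp(log_incr - np.max(log_incr))   # CESS is invariant to rescaling
    return len(w) * np.sum(w * omega) ** 2 / np.sum(w * omega ** 2)

def next_temperature(w, log_ratio, gamma_k, beta=0.99, tol=1e-10):
    """Find gamma in (gamma_k, 1] whose CESS is approximately beta * P.

    log_ratio: per-particle log of (end target / start target) for the
               current transition, so that the incremental log weight is
               (gamma - gamma_k) * log_ratio under geometric annealing.
    """
    P = len(w)
    if cess(w, (1.0 - gamma_k) * log_ratio) >= beta * P:
        return 1.0                                # can jump straight to the end
    lo, hi = gamma_k, 1.0
    while hi - lo > tol:                          # bisection: CESS decreases in gamma
        mid = 0.5 * (lo + hi)
        if cess(w, (mid - gamma_k) * log_ratio) >= beta * P:
            lo = mid
        else:
            hi = mid
    return hi                                     # strictly above gamma_k

rng = np.random.default_rng(2)
P = 1000
w = np.full(P, 1.0 / P)
log_ratio = 5.0 * rng.normal(size=P)              # synthetic log ratios (assumed)
gamma_next = next_temperature(w, log_ratio, gamma_k=0.0, beta=0.99)
achieved = cess(w, gamma_next * log_ratio) / P
```

With a value of \(\beta \) close to one, as used above, each step is small, so transitions between poorly matched distributions automatically receive more intermediate distributions.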

Figure 2a shows log marginal likelihood estimates from the different approaches (note that a poor quality SMC usually results in an underestimate of the log marginal likelihood), and the cumulative number of intermediate distributions used in estimating all of the marginal likelihoods up to model *t* for each \(t\in \{ 1, \dots ,T \}\). We observe that the performance of SMC2 degrades as the dimension increases due to the increasing distance of the prior from the posterior: we see that the adaptive scheme using the CESS results in the number of intermediate distributions across all dimensions being approximately constant which, as suggested by Beskos et al. (2014), is insufficient to control the variance as the dimension grows. As discussed above, both birth TSMC methods yield inaccurate Bayes factor estimates, with split TSMC exhibiting substantially better performance. However, we see that neither conditional approach yields very accurate results when using the weight update given in Eq. (14); instead the marginalised weight update is required to provide good estimates. The marginal version of split TSMC significantly outperforms the other approaches, although we note that this is achieved at a higher computational cost due to the sum in the denominator of the weight updates; this can be observed in Fig. 2c, which shows the cumulative number of Gaussian evaluations for computing the weights in each case. For all TSMC approaches, we see that the number of intermediate distributions (Fig. 2b) decreases as we increase dimension. This result can be attributed to the relatively small change that results from only adding a single component to the model at a time in TSMC. If the method has a good representation of the target at model *t* and there is minimal change in the posterior on the existing *t* components when moving to model \(t+1\), then the SMC is effectively only exploring the posterior on the additional component and thus has higher ESS.

## 4 Sequential Bayesian inference under the coalescent

### 4.1 Introduction

In this section, we describe the use of TSMC for online inference under the coalescent model in population genetics (Kingman 1982); we consider the case in which we wish to infer the *clonal ancestry* (or *ancestral tree*) of a bacterial population from DNA sequence data. Current approaches in this area use MCMC (Drummond and Rambaut 2007), which is a limitation in situations where DNA sequence data does not arrive as a batch, such as may happen when studying the spread of an infectious disease as the outbreak is progressing (Didelot et al. 2014). We instead introduce an SMC approach to online inference, inferring posterior distributions as sequences become available (this approach is similar to that of Dinh et al. (2018), which was devised simultaneously to ours). We further envisage that TSMC will be useful in cases in which data is available as a single batch, through exploiting the well-known property that a tree estimated from \(t+1\) sequences is usually similar to a tree estimated from *t* sequences. Exploring the space of trees for a large number of sequences appears challenging due to the large number of possible trees: through adding leaves one by one the SMC approach follows a path through tree space in which transitions from distribution \(\pi _{t}\) to \(\pi _{t+1}\) are not challenging. Further, our approach yields more stable estimates of the marginal likelihood of models than current approaches used routinely in population genetics, such as the infinite variance harmonic mean estimator (Drummond and Rambaut 2007) and the stepping stone algorithm (Drummond and Rambaut 2007; Xie et al. 2011).

#### 4.1.1 Previous work

The idea of updating a tree by adding leaves dates back to at least Felsenstein (1981), in which he describes, for maximum likelihood estimation, that an effective search strategy in tree space is to add species one by one. More recent work also makes use of the idea of adding sequences one at a time: ARGWeaver (Rasmussen et al. 2014) uses this approach to initialise MCMC (in this case, on a space of graphs) on \(t+1\) sequences using the output of MCMC on *t* sequences, and TreeMix (Pickrell and Pritchard 2012) uses a similar idea in a greedy algorithm. In work conducted simultaneously to our own, Dinh et al. (2018) also propose a sequential Monte Carlo approach to inferring phylogenies in which the sequence of distributions is given by introducing sequences one by one. However, their approach: uses different proposal distributions for new sequences; does not infer the mutation rate simultaneously with the tree; does not exploit intermediate distributions to reduce the variance; and does not use adaptive MCMC moves. Further investigation of their approach can be found in Fourment et al. (2018), where different guided proposal distributions are explored, but which still has the aforementioned limitations.

#### 4.1.2 Data and model

We consider the analysis of *T* aligned genome sequences \(y=y_{1:T}\), each of length *N*. Sites that differ across sequences are known as single nucleotide polymorphisms (SNPs). The data (which is freely available from http://pubmlst.org/saureus/) used in our examples consists of seven “multi-locus sequence type” (MLST) genes of 25 *Staphylococcus aureus* sequences, which have been chosen to provide a sample representing the worldwide diversity of this species (Everitt et al. 2014). We make the assumption that the population has had a constant size over time, that it evolves clonally and that SNPs are the result of mutation. Our task is to infer the clonal ancestry of the individuals in the study, i.e., the tree describing how the individuals in the sample evolved from their common ancestors, and [additional to Dinh et al. (2018)] the rate of mutation in the population. We describe a TSMC algorithm for addressing this problem in Sect. 4.2, before presenting results in Sect. 4.3. In the remainder of this section, we introduce a little notation.

*t* individuals and let \(\theta /2\) be the expected number of mutations in a generation. We are interested in the sequence of distributions

Let \(l_{t}^{(a)}\) denote the length of time for which *a* branches exist in the tree, for \(2\le a\le t\). The heights of the coalescent events are given by \(h_{t}^{(a)}=\sum _{\iota =a}^{t}l_{t}^{(\iota )}\), with \(h_{t}^{(a)}\) being the \(\left( t-a+1\right) \)th coalescence time when indexing from the leaves of the tree. We let \({\mathcal {T}}_{t}\) be a random vector \(\left( {\mathcal {B}}_{t},h_{t}^{(2)},\ldots ,h_{t}^{(t)}\right) \) where \({\mathcal {B}}_{t}\) is itself a vector of discrete variables representing the branching order. The lineage of a leaf node is the sequence of branches from this leaf node to the root of the tree.
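The relationship between interval lengths and coalescence heights can be illustrated with a short sketch. It assumes the interval lengths are supplied as a mapping from the number of branches *a* to the time during which *a* branches exist; the function name `coalescent_heights` and this input layout are hypothetical, chosen only for illustration.

```python
def coalescent_heights(interval_lengths):
    """Return {a: h^(a)} where h^(a) is the sum of l^(iota) for iota = a, ..., t.

    interval_lengths: dict mapping a (number of branches, 2 <= a <= t)
    to the length of time for which a branches exist (assumed layout).
    """
    t = max(interval_lengths)
    heights = {}
    running = 0.0
    # Walk up from the leaves: a = t, t-1, ..., 2.
    for a in range(t, 1, -1):
        running += interval_lengths[a]
        heights[a] = running
    return heights


# Toy tree on t = 4 leaves (made-up interval lengths).
lengths = {2: 0.5, 3: 0.25, 4: 0.125}
heights = coalescent_heights(lengths)
# heights[4] is the first coalescence time, heights[2] the height of the root.
```

With these toy values, the first coalescence occurs at height 0.125 and the root sits at height 0.875, the sum of all three intervals.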

### 4.2 TSMC for the coalescent

*s* using

### 4.3 Results

We used \(P=250\) particles, with an adaptive sequence of intermediate distributions, choosing the next intermediate distribution to be the one that yields a CESS (Eq. 13) of \(\beta P\), where \(\beta =0.95\). Resampling is performed whenever the ESS falls below \(\alpha P\), where \(\alpha =0.5\). At each iteration we used the current population of particles to tune the proposal variances, as detailed in the Supplementary Information, section 3.3.
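The adaptive scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the results: it assumes a geometric (annealed) path between neighbouring distributions, so that the incremental log weight of a particle is the step size times a per-particle log density ratio, and it uses the CESS formula of Zhou et al. (2015), which we take to correspond to Eq. 13. The function names and the bisection tolerance are hypothetical.

```python
import math


def ess(log_weights):
    """Effective sample size from unnormalised log weights."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(x * x for x in w)


def cess(prev_norm_weights, log_incr):
    """Conditional ESS: P * (sum_i W_i w_i)^2 / sum_i W_i w_i^2."""
    m = max(log_incr)
    w = [math.exp(l - m) for l in log_incr]  # rescaling cancels in the ratio
    num = sum(W * x for W, x in zip(prev_norm_weights, w)) ** 2
    den = sum(W * x * x for W, x in zip(prev_norm_weights, w))
    return len(w) * num / den


def next_temperature(prev_norm_weights, log_ratio, gamma_prev, beta=0.95, tol=1e-8):
    """Bisect for a temperature in (gamma_prev, 1] at which CESS is about beta * P.

    log_ratio[i] is the log density ratio between the end points of the
    annealing path at particle i (assumed geometric path).
    """
    target = beta * len(log_ratio)

    def cess_at(gamma):
        step = gamma - gamma_prev
        return cess(prev_norm_weights, [step * r for r in log_ratio])

    if cess_at(1.0) >= target:
        return 1.0  # can jump straight to the next target
    lo, hi = gamma_prev, 1.0  # invariant: cess_at(lo) >= target > cess_at(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cess_at(mid) >= target:
            lo = mid
        else:
            hi = mid
    return lo


def needs_resampling(log_weights, alpha=0.5):
    """Resample when the ESS falls below alpha * P."""
    return ess(log_weights) < alpha * len(log_weights)
```

A smaller \(\beta \) permits larger steps and hence fewer intermediate distributions, at the cost of more variable weights; the bisection relies on the CESS crossing \(\beta P\) somewhere between the current temperature and 1.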

Log marginal likelihood estimates and total number of distributions for TSMC applied to the coalescent (5 s.f.), for the “Furthest” (first line) and “Nearest” (second line) orderings. Each entry is given as estimate/number of distributions.

Ordering | Default | No topology moves | \(\chi _{t}^{(h)}=\text {Exp}(1)\) | \(\left( \chi _{t}^{(h)}\right) ^{0}\) | \(\left( \chi _{t}^{(h)}\right) ^{2}\) | \(\left( \chi _{t}^{(h)}\right) ^{4}\) |
---|---|---|---|---|---|---|
Furthest | \(-6333.9\)/267 | \(-6338.8\)/257 | \(-6335.1\)/408 | \(-6336.9\)/330 | \(-6333.1\)/247 | \(-6334.3\)/238 |
Nearest | \(-6335.8\)/323 | \(-6354.6\)/293 | \(-6337.8\)/501 | \(-6341.0\)/384 | \(-6339.0\)/300 | \(-6342.0\)/255 |

As also suggested by Fig. 3, we see that the “furthest” ordering provides consistently better results than the “nearest” ordering. “Furthest” provides an ordering in which new sequences are often added above the root of the current tree, since the existing sequences are all more closely related to each other than to the new sequence, whereas “nearest” frequently results in adding a leaf close to the existing leaves of the tree. In the latter strategy, the proposal relating to the new sequence is often good, but adding a new sequence can have a large effect on the posterior of existing variables. We see this by comparing Fig. 3c, d, observing that the “furthest” ordering results in a topology that is close to the truth. The topology from the “nearest” ordering is not as close to the truth, and is thus more reliant on topology-changing MCMC moves to give an accurate sample from the posterior.

As expected, using no MCMC topology moves results in very poor estimates, highlighting the important role of MCMC in generating diversity not introduced in the SMC proposals. This poor quality is not accounted for by the adaptive scheme based on the CESS introducing more intermediate distributions, since the CESS is only based on the weights of the particles and cannot account for a lack of diversity.
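A toy sketch makes the last point concrete. The particle population below is entirely made up: every particle carries the same tree topology (written as a Newick-style string purely for illustration), yet any weight-based diagnostic reports a perfectly healthy sample.

```python
def ess_from_weights(weights):
    """Effective sample size computed from (possibly unnormalised) weights."""
    total = sum(weights)
    return total ** 2 / sum(w * w for w in weights)


def n_unique(particles):
    """Number of distinct particle values, a crude measure of diversity."""
    return len(set(particles))


# Hypothetical degenerate population: 250 copies of a single topology.
particles = ["((A,B),C)"] * 250
weights = [1.0 / 250] * 250

weight_based = ess_from_weights(weights)  # close to 250: weights look ideal
diversity = n_unique(particles)           # 1: no diversity in topology
```

Because the weights are equal, the ESS (and likewise the CESS) is at its maximum, so the adaptive scheme has no reason to insert further intermediate distributions even though only one topology is represented.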

Using less directed proposals, on both the lineage and the height, increases the distance between the proposal and target, and results in lower quality estimates.

Using more directed proposals on the lineage may in some cases slightly improve the method, but appears to make the method less robust to the order in which the individuals are added (so may not be suitable in applications where the order of the individuals cannot be chosen).

## 5 Conclusions

This paper introduces a sequential technique for Bayesian model comparison and parameter estimation, and an approach to online parameter and marginal likelihood estimation for the coalescent, underpinned by the same methodological development: TSMC. We show that whilst TSMC performs inference on a sequence of posterior distributions with increasing dimension, it is a special case of the standard SMC sampler framework of Del Moral et al. (2007). In this section, we outline several points that are not described elsewhere.

One innovation introduced in the paper is the use of transformations within SMC for creating proposal distributions when moving between dimensions. The effectiveness of TSMC is governed by the distance between neighbouring distributions; thus, to design TSMC algorithms suitable for any given application, we require the design of a suitable transformation that minimises the distance between neighbouring distributions. This is essentially the same challenge as is faced in designing effective RJMCMC algorithms, and we may make use of many of the methods devised in the RJMCMC literature (Hastie and Green 2012). The ideal case is to use a transformation such that every distribution \(\varphi _{t\rightarrow T}\) becomes identical, in which case one may simulate from \(\pi _{T}\) simply by simulating from \(\pi _{0}\) then applying the transformation. Approximating such a “transport map” for a sequence of continuous distributions is described in Heng et al. (2015). As discussed in Sect. 1.2, Heng et al. (2015) is one of a number of papers that seek to automatically construct useful transformations, and we anticipate these techniques being of use in the case of changing dimension that is addressed in this paper. In the RJMCMC literature, Brooks et al. (2003) describe methods for automatically constructing the “fill in” distributions \(\psi _{t}\) for a given transformation: the literature on transport maps could be used to automatically construct the transformation in advance of this step.

In Fig. 2 of Sect. 3, we see a characteristic of this approach that will be common to many applications, in that the estimated marginal likelihood rises as the model is improved, then falls as the effect of the model complexity penalisation becomes more influential than improvements to the likelihood. We note that by using estimates of the variance of the marginal likelihood estimate (Lee and Whiteley 2015), we may construct a formal diagnostic that decides to terminate the algorithm at a particular model, on observing that the estimated marginal likelihood declines from an estimated maximum value.
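One possible form of such a diagnostic is sketched below. It is an illustration only, not a rule proposed or tested in this paper: it assumes that a standard error is available for each log marginal likelihood estimate, treats the estimates at different models as independent (a simplification, since successive SMC estimates share particles), and stops once an estimate falls below the running maximum by more than `z` combined standard errors. The function name, the threshold `z` and the example numbers are all hypothetical.

```python
import math


def stopping_index(log_ml, std_err, z=2.0):
    """Return the index of the model at the estimated maximum once a later
    estimate has declined from it by more than z combined standard errors.

    Returns None if no such decline has been observed yet.
    """
    best = 0
    for k in range(1, len(log_ml)):
        if log_ml[k] > log_ml[best]:
            best = k  # new running maximum, keep going
            continue
        gap = log_ml[best] - log_ml[k]
        noise = z * math.sqrt(std_err[best] ** 2 + std_err[k] ** 2)
        if gap > noise:
            return best
    return None


# Made-up estimates that rise and then fall.
chosen = stopping_index([-110.0, -100.0, -99.5, -104.0], [0.5, 0.5, 0.5, 0.5])
```

In this made-up example the estimates peak at the third model (index 2) and the subsequent drop of 4.5 exceeds the noise threshold, so the sketch would terminate there.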

Although the examples in this paper both involve posterior distributions of increasing dimension, we also see a use for our approach in some cases that involve distributions of decreasing dimension. For example, in population genetics, it is common to perform a large number of different analyses using different overlapping sets of sequences. For this reason, many practitioners would value an inference technique that allows for the removal, as well as the addition, of sequences. Further, many genetics applications now involve the analysis of whole genome sequences. Our approach is applicable in this setting, and for this purpose a BEAST2 package is currently under development.

## Notes

### Acknowledgements

Thanks to Christophe Andrieu, Adam Johansen and Changqiong Wang for useful discussions; Xavier Didelot and Dan Lawson for establishing the novelty of the approach; and Christian Robert for the suggestion to use Rao-Blackwellisation in the mixture example. First and third authors were supported by BBSRC grant BB/N00874X/1. Second author was supported by the University of Reading, and the Modernising Medical Microbiology group, NDM Experimental Medicine, University of Oxford. Fourth author is a Sir Henry Dale Fellow, jointly funded by the Wellcome Trust and the Royal Society (Grant 101237/Z/13/Z).


## References

- Alquier, P., Friel, N., Everitt, R.G., Boland, A.: Noisy Monte Carlo: convergence of Markov chains with approximate transition kernels. Stat. Comput. **26**(1), 29–47 (2016)
- Andrieu, C., Roberts, G.O.: The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Stat. **37**(2), 697–725 (2009)
- Andrieu, C., Doucet, A., Yıldırım, S., Chopin, N.: On the utility of Metropolis-Hastings with asymmetric acceptance ratio. ArXiv e-prints arXiv:1803.09527 (2018)
- Beskos, A., Crisan, D., Jasra, A.: On the stability of sequential Monte Carlo methods in high dimensions. Ann. Appl. Probab. **24**(4), 1396–1445 (2014)
- Brooks, S.P., Giudici, P., Roberts, G.O.: Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions. J. R. Stat. Soc. Ser. B (Stat. Methodol.) **65**(1), 3–39 (2003)
- Carlin, B.P., Chib, S.: Bayesian model choice via Markov chain Monte Carlo methods. J. R. Stat. Soc. Ser. B **57**(3), 473–484 (1995)
- Chorin, A.J., Tu, X.: Implicit sampling for particle filters. Proc. Natl. Acad. Sci. **106**(41), 17249–17254 (2009)
- Del Moral, P., Doucet, A., Jasra, A.: Sequential Monte Carlo samplers. J. R. Stat. Soc. Ser. B **68**(3), 411–436 (2006)
- Del Moral, P., Doucet, A., Jasra, A.: Sequential Monte Carlo for Bayesian Computation. Bayesian Stat. **8**, 1–34 (2007)
- Del Moral, P., Doucet, A., Jasra, A.: An adaptive sequential Monte Carlo method for approximate Bayesian computation. Stat. Comput. **22**(5), 1009–1020 (2012)
- Didelot, X., Gardy, J., Colijn, C.: Bayesian inference of infectious disease transmission from whole genome sequence data. Mol. Biol. Evol. **31**, 1869–1879 (2014)
- Dinh, V., Darling, A.E., Matsen IV, F.A.: Online Bayesian phylogenetic inference: theoretical foundations via sequential Monte Carlo. Syst. Biol. **67**(3), 503–517 (2018)
- Douc, R., Guillin, A., Marin, J.M., Robert, C.P.: Convergence of adaptive mixtures of importance sampling schemes. Ann. Stat. **35**(1), 420–448 (2007)
- Drummond, A.J., Rambaut, A.: BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. **7**, 214 (2007)
- Everitt, R.G., Didelot, X., Batty, E.M., Miller, R.R., Knox, K., Young, B.C., Bowden, R., Auton, A., Votintseva, A., Larner-Svensson, H., Charlesworth, J., Golubchik, T., Ip, C.L.C., Godwin, H., Fung, R., Peto, T.E.A., Walker, A.S., Crook, D.W., Wilson, D.J.: Mobile elements drive recombination hotspots in the core genome of Staphylococcus aureus. Nat. Commun. **5**, 3956 (2014)
- Everitt, R.G., Johansen, A.M., Rowing, E., Evdemon-Hogan, M.: Bayesian model comparison with un-normalised likelihoods. Stat. Comput. **27**(2), 403–422 (2017)
- Felsenstein, J.: Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evolut. **17**(6), 368–376 (1981)
- Fourment, M., Claywell, B.C., Dinh, V., McCoy, C., Matsen IV, F.A., Darling, A.E.: Effective online Bayesian phylogenetics via sequential Monte Carlo with guided proposals. Syst. Biol. **67**(3), 490–502 (2018)
- Gordon, N.J., Salmond, D.J., Smith, A.F.M.: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F Radar Signal Process. IET **140**, 107–113 (1993)
- Hastie, D.I., Green, P.J.: Model choice using reversible jump MCMC. Stat. Neerl. **66**(3), 309–338 (2012)
- Heng, J., Doucet, A., Pokern, Y.: Gibbs flow for approximate transport with applications to Bayesian computation. ArXiv e-prints arXiv:1509.08787 (2015)
- Jasra, A., Holmes, C.C., Stephens, D.A.: Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modelling. Stat. Sci. **20**(1), 50–67 (2005)
- Jasra, A., Stephens, D.A., Doucet, A., Tsagaris, T.: Inference for Lévy-driven stochastic volatility models via adaptive sequential Monte Carlo. Scand. J. Stat. **38**(1), 1–22 (2011)
- Jukes, T.H., Cantor, C.R.: Evolution of Protein Molecules. Academic Press, New York (1969)
- Karagiannis, G., Andrieu, C.: Annealed importance sampling reversible jump MCMC algorithms. J. Comput. Graph. Stat. **22**(3), 623–648 (2013)
- Kingman, J.F.C.: The coalescent. Stoch. Process. Their Appl. **13**, 235–248 (1982)
- Kong, A., Liu, J.S., Wong, W.H.: Sequential imputations and Bayesian missing data problems. J. Am. Stat. Assoc. **89**(425), 278–288 (1994)
- Lee, A., Whiteley, N.: Variance estimation in the particle filter. ArXiv e-prints arXiv:1509.00394 (2015)
- Li, N., Stephens, M.: Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics **165**, 2213–2233 (2003)
- Naesseth, C.A., Lindsten, F., Schön, T.B.: Sequential Monte Carlo for graphical models. In: NIPS Proceedings, pp 1–14 (2014)
- Neal, R.: Annealed importance sampling. Stat. Comput. **11**(2), 125–139 (2001)
- Pickrell, J.K., Pritchard, J.K.: Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. **8**(11), e1002967 (2012)
- Rasmussen, M.D., Hall, W., Hubisz, M.J., Gronau, I., Siepel, A.: Genome-wide inference of ancestral recombination graphs. PLoS Genet. **10**(5), e1004342 (2014)
- Reich, S.: A guided sequential Monte Carlo method for the assimilation of data into stochastic dynamical systems. In: Johann, A., Kruse, H.P., Rupp, F., Schmitz, S. (eds) Recent Trends in Dynamical Systems. Springer Proceedings in Mathematics & Statistics, vol. 35. Springer, Basel (2013)
- Reis, M., Yang, Z.: Approximate likelihood calculation on a phylogeny for Bayesian estimation of divergence times. Mol. Biol. Evol. **28**(1969), 2161–2172 (2011)
- Richardson, S., Green, P.J.: On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. R. Stat. Soc. Ser. B (Stat. Methodol.) **59**(4), 731–792 (1997)
- South, L.F., Pettitt, A.N., Drovandi, C.C.: Sequential Monte Carlo samplers with independent Markov chain Monte Carlo proposals. Bayesian Anal. **14**(3), 753–776 (2019)
- Stephens, M., Donnelly, P.: Inference in molecular population genetics. J. R. Stat. Soc. Ser. B **62**(4), 605–655 (2000)
- Vaikuntanathan, S., Jarzynski, C.: Escorted free energy simulations: improving convergence by reducing dissipation. J. Chem. Phys. **134**(5), 054107 (2011)
- Xie, W., Lewis, P.O., Fan, Y., Kuo, L., Chen, M.H.: Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Syst. Biol. **60**(2), 150–160 (2011)
- Zhou, Y., Johansen, A.M., Aston, J.A.D.: Towards automatic model comparison: an adaptive sequential Monte Carlo approach. J. Comput. Graph. Stat. **25**, 701–726 (2015)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.