Sequential Monte Carlo with transformations

This paper examines methodology for performing Bayesian inference sequentially on a sequence of posteriors on spaces of different dimensions. For this, we use sequential Monte Carlo samplers, introducing the innovation of using deterministic transformations to move particles effectively between target distributions with different dimensions. This approach, combined with adaptive methods, yields an extremely flexible and general algorithm for Bayesian model comparison that is suitable for use in applications where the acceptance rate in reversible jump Markov chain Monte Carlo is low. We use this approach for model comparison in mixture models, and for inferring coalescent trees sequentially, as data arrives.


Sequential inference
Much of the methodology for Bayesian computation is designed with the aim of approximating a posterior $\pi(\theta)$. The most prominent approach is to use Markov chain Monte Carlo (MCMC), in which a Markov chain that has $\pi$ as its limiting distribution is simulated. It is well known: that this process may be computationally expensive; that it is not straightforward to tune the method automatically; and that it can be challenging to determine how long to run the chain for. Therefore, designing and running an MCMC algorithm to sample from a particular target $\pi$ may require much human input and computer time. This creates particular problems if a user is in fact interested in a number of target distributions $(\pi_t)_{t=1}^{T}$: using MCMC on each target requires additional computer time to run the separate algorithms, and each may require human input to design the algorithm, determine the burn-in, etc. This paper has as its subject the task of using a Monte Carlo method to simulate from each of the targets $\pi_t$ in a way that avoids these disadvantages.
Particle filtering (Gordon et al., 1993) and its generalisation, the SMC sampler (Del Moral et al., 2006), are designed to tackle exactly this problem. Roughly speaking, the idea of these approaches is to begin by using importance sampling (IS) to find a set of weighted particles that give an empirical approximation to $\pi_0$, then, for $t = 0, \ldots, T-1$, to update the set of particles approximating $\pi_t$ such that, after changing their positions using a kernel $K_{t+1}$ and updating their weights, they approximate $\pi_{t+1}$. This approach is particularly useful where neighbouring target distributions in the sequence are similar to each other, and in this case has the following advantages over running $T$ separate MCMC algorithms.
• The similarity of neighbouring targets can be exploited, since particles approximating $\pi_t$ may not need much adjustment to provide a good approximation to $\pi_{t+1}$. We have the desirable property that we find approximations to each of the targets in the sequence. Further, we may also gain when compared to running a single MCMC algorithm to target $\pi_T$, since it may be complicated to set up an MCMC algorithm that simulates well from $\pi_T$ without using a sequence of simpler distributions to guide particles into the appropriate regions of the space.
• If the $\pi_t$ are unnormalised, SMC samplers also provide unbiased estimates of the normalising constants of the $\pi_t$. When each $\pi_t$ is an unnormalised Bayesian posterior distribution, the normalising constant is the marginal likelihood or evidence, a key quantity in Bayesian model comparison. For much of the paper we use $\pi_t$ to represent an unnormalised target and, in an abuse of terminology, sometimes refer to $\pi_t$ as a "distribution".

Outline of paper
In this paper we consider the case where each $\pi_t$ is defined on a space of different dimension, often of increasing dimension with $t$. A particle filter is designed to be used in a special case of this situation: the case where $\pi_t$ is the path distribution in a state space model, $\pi_t(\theta_{1:t} \mid y_{1:t})$. A particle filter exploits the Markov property in order to update a particle approximation of $\pi_t(\theta_{1:t} \mid y_{1:t})$ to an approximation of $\pi_{t+1}(\theta_{1:t+1} \mid y_{1:t+1})$. In this paper we consider targets in which there is not such a straightforward relationship between $\pi_t$ and $\pi_{t+1}$. We introduce a new approach to Bayesian model comparison that results from constructing an SMC sampler where each $\pi_t$ is a different model and there are $T$ models that can be ordered, usually in order of their complexity. Deterministic transformations are used to move points between one distribution and the next, yielding efficient samplers by reducing the distance between successive distributions. We also show how the same framework can be used for sequential inference under the coalescent model. The use of deterministic transformations to improve SMC has been considered previously in a number of papers (e.g. Chorin and Tu (2009); Vaikuntanathan and Jarzynski (2011); Reich (2013); Heng et al. (2015)). Several of these papers are focussed on how to construct useful transformations in a generic way including, for example: methods that map high density regions of the proposal to high density regions of the target (Chorin and Tu, 2009); and methods that approximate the solution of ordinary differential equations that mimic the SMC dynamics (Heng et al., 2015). This paper is different in that it focuses on the particular case of a sequence of distributions on spaces of different dimensions, and uses transformations and proposals that are designed for the applications we study.
Section 2 describes the methodology introduced in the paper, considering both practical and theoretical aspects, and provides a comparison to existing methods. We provide an example of the use of the methodology for Bayesian model comparison in section 3, on the Gaussian mixture model. In section 4 we use our methodology for online inference under the coalescent, using the flexibility of our proposed approach to describe a method for moving between trees. In section 5 we review the approach and outline possible extensions.

SMC samplers with increasing dimension
The use of SMC samplers on a sequence of targets of increasing dimension has been described previously (e.g. Naesseth et al. (2014); Everitt et al. (2017); Dinh et al. (2016)). These papers introduce an additional proposal distribution for the variables that are introduced at each step. In this section we see straightforwardly that this is a particular case of the SMC sampler in Del Moral et al. (2007).

SMC samplers with MCMC moves
To introduce notation, we first consider the standard case in which the dimension is fixed across all iterations of the SMC. For simplicity we consider only SMC samplers with MCMC moves, and we consider an SMC sampler that has $T$ iterations. Let $\pi_t$ be our target distribution of interest at iteration $t$, this being the distribution of the random vector $\theta_t$ on the space $E$. Throughout the paper the values taken by particles in the SMC sampler have a $(p)$ superscript to distinguish them from random vectors; so, for example, $\theta_t^{(p)}$ is the value taken by the $p$th particle. We define $\pi_0$ to be a distribution from which we can simulate directly, simulate each particle $\theta_0^{(p)} \sim \pi_0$ and set its normalised weight $w_0^{(p)} = 1/P$. Then for $0 \leq t < T$, at the $(t+1)$th iteration of the SMC sampler, the following steps are performed.
1. Reweight: for each particle, calculate the updated (unnormalised) weight $\widetilde{w}_{t+1}^{(p)} = w_t^{(p)}\, \pi_{t+1}\big(\theta_t^{(p)}\big) / \pi_t\big(\theta_t^{(p)}\big)$, then normalise the weights. 2. Resample: if the effective sample size (ESS) of the weights falls below some threshold, resample the particles and set each weight to $1/P$. 3. Move: for each particle use an MCMC move with target $\pi_{t+1}$ to move $\theta_t^{(p)}$, yielding $\theta_{t+1}^{(p)}$. This algorithm yields an empirical approximation $\sum_{p=1}^{P} w_t^{(p)} \delta_{\theta_t^{(p)}}$ of $\pi_t$, where $\delta_\theta$ is a Dirac mass at $\theta$, and an estimate of its normalising constant.
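To make the recursion concrete, the following is a minimal sketch of an SMC sampler with MCMC moves on a toy sequence of one-dimensional tempered Gaussian targets; the targets, particle number and random-walk kernel are illustrative assumptions rather than settings used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
P, T = 1000, 10

# Toy sequence of unnormalised log targets: N(0,1) tempered towards N(3, 0.5^2).
def log_pi(t, x):
    lam = t / T
    return (1 - lam) * (-0.5 * x**2) + lam * (-0.5 * ((x - 3.0) / 0.5) ** 2)

theta = rng.normal(size=P)          # theta_0^(p) ~ pi_0 = N(0,1)
logw = np.full(P, -np.log(P))       # normalised weights w_0^(p) = 1/P

for t in range(T):
    # 1. Reweight: incremental weight pi_{t+1}(theta) / pi_t(theta).
    logw += log_pi(t + 1, theta) - log_pi(t, theta)
    w = np.exp(logw - logw.max()); w /= w.sum()
    # 2. Resample (multinomial here, for brevity) if the ESS drops below P/2.
    if 1.0 / np.sum(w**2) < P / 2:
        theta = theta[rng.choice(P, size=P, p=w)]
        w = np.full(P, 1.0 / P)
    logw = np.log(w)
    # 3. Move: one Metropolis random-walk step targeting pi_{t+1}.
    prop = theta + 0.5 * rng.normal(size=P)
    accept = np.log(rng.uniform(size=P)) < log_pi(t + 1, prop) - log_pi(t + 1, theta)
    theta = np.where(accept, prop, theta)

print("posterior mean estimate:", np.sum(w * theta))
```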

Increasing dimension
We now describe a case where the parameter $\theta$ increases in dimension with the number of SMC iterations. Our approach is to set up an SMC sampler on an extended space that has the same dimension as the maximum dimension of $\theta$ that we will consider (similarly to Carlin and Chib (1995)). At SMC iteration $t$, we use: $\theta_t$ to denote the random vector of interest; $u_t$ to denote a random vector that contains the additional dimensions added to the parameter space at iteration $t+1$; and $v_t$ to denote the remainder of the dimensions that will be required at future iterations.
Our SMC sampler is constructed on a sequence of distributions $\varphi_t$ of the random vector $\vartheta_t = (\theta_t, u_t, v_t)$,
$$\varphi_t(\vartheta_t) = \pi_t(\theta_t)\, \psi_t(u_t \mid \theta_t)\, \phi_t(v_t \mid \theta_t, u_t),$$
where $\pi_t$ is the distribution of interest at iteration $t$, and $\psi_t$ and $\phi_t$ are (normalised) distributions on the additional variables, so that $\pi_t$ and $\varphi_t$ have the same normalising constant. The weight update in this SMC sampler is
$$\widetilde{w}_{t+1}^{(p)} = w_t^{(p)}\, \frac{\pi_{t+1}\big(\theta_t^{(p)}, u_t^{(p)}\big)}{\pi_t\big(\theta_t^{(p)}\big)\, \psi_t\big(u_t^{(p)} \mid \theta_t^{(p)}\big)}. \quad (4)$$
Here, as in particle filtering, by construction, the $\phi_t$ terms in the numerator and denominator have cancelled, so that none of the dimensions added after iteration $t+1$ are involved; a characteristic shared by the MCMC move with target $\varphi_{t+1}$, which need only update $\theta_t$, $u_t$ and the subvector of $v_t$ that corresponds to $u_{t+1}$.
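As a concrete illustration of the weight update (4), the snippet below extends each particle with a new coordinate drawn from a proposal $\psi_t$ and computes the incremental weight; the Gaussian targets and the overdispersed proposal are assumptions made purely for illustration. Note that the $\phi_t$ terms on the yet-unused dimensions cancel and never need to be evaluated.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative unnormalised targets: pi_t is a product of standard normals on R^t.
def log_pi(theta):
    return np.sum(norm.logpdf(theta, 0.0, 1.0), axis=-1)

P = 500
theta_t = rng.normal(size=(P, 2))            # particles approximating pi_t (dim 2)
u_t = rng.normal(0.0, 2.0, size=(P, 1))      # u_t ~ psi_t, an overdispersed proposal
theta_t1 = np.hstack([theta_t, u_t])         # theta_{t+1} = (theta_t, u_t)

# Incremental weight (equation (4)): the phi_t terms on v_t have cancelled.
log_incr = log_pi(theta_t1) - log_pi(theta_t) - norm.logpdf(u_t[:, 0], 0.0, 2.0)
print("log Z_{t+1}/Z_t estimate:", np.log(np.mean(np.exp(log_incr))))
```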

RJMCMC for Gaussian mixture models
The following sections make use of transformations and other ideas in order to improve the efficiency of the sampler.
To motivate this, we consider the case of Bayesian model comparison, in which the $\pi_t$ are different models ordered by their complexity. In section 3 we present an application to Gaussian mixture models, and we use this as our motivating example here. Consider mixture models with $t$ components, to be estimated from data $y$ consisting of $N$ observed data points. For simplicity we describe a "without completion" model, where we do not introduce a label $z$ that assigns data points to components. Let the $s$th component have a mean $\mu_s$, precision $\tau_s$ and weight $\nu_s$, with the weights summing to one over the components; let $p_\mu$ and $p_\tau$ be the respective priors on these parameters, which are the same for every component; and let $p_\nu$ be the joint prior over all of the weights. The likelihood under $k$ components is
$$f(y \mid \theta_k) = \prod_{i=1}^{N} \sum_{s=1}^{k} \nu_s\, \mathcal{N}\big(y_i \mid \mu_s, \tau_s^{-1}\big). \quad (5)$$
An established approach for estimating mixture models is that of RJMCMC. Here, $t$ is chosen to be a random variable and assigned a prior $p_t$, which here we choose to be uniform over the values $1$ to $T$. Let
$$\pi_t(\theta_t) = p(\theta_t \mid y, t) \propto f(y \mid \theta_t)\, p(\theta_t \mid t) \quad (6)$$
be the joint posterior distribution over the parameters $\theta_t$ conditional on $t$. RJMCMC simulates from the joint space of $(t, \theta_t)$ in which a mixture of moves is used, some fixed-dimensional ($t$ fixed) and some trans-dimensional (to mix over $t$). The simplest type of trans-dimensional move in this case is that of a birth move for moving from $t$ to $t+1$ components, or a death move for moving from $t+1$ to $t$ (Richardson and Green, 1997). We consider a birth move, and for the purposes of exposition we assume that the weights of the components are chosen to be fixed in each model (this assumption will be relaxed later), a uniform prior probability over $t$, and equal probability of proposing birth or death. Let $u_t = (\mu_{t+1}, \tau_{t+1})$ be the mean and precision of the new component and let $\psi_t(u_t \mid \theta_t) = p_\mu(\mu_{t+1})\, p_\tau(\tau_{t+1})$. A birth move simulates $u_t \sim \psi_t$ and has acceptance probability
$$\min\left\{1,\ \frac{\pi_{t+1}(\theta_t, u_t)}{\pi_t(\theta_t)\, \psi_t(u_t \mid \theta_t)}\right\}. \quad (7)$$

Comparing RJMCMC and SMC samplers
Consider the use of an SMC sampler for inference where the sequence of target distributions is $(\pi_t)_{t=1}^{T}$, i.e. the $t$th distribution is the mixture of Gaussians with $t$ components. By choosing $u_t$ and $\psi_t$ as above, together with $v_t = \big(\mu_{(t+2):T}, \tau_{(t+2):T}\big)$ and $\phi_t(v_t \mid \theta_t, u_t) = \prod_{s=t+2}^{T} p_\mu(\mu_s)\, p_\tau(\tau_s)$, we may use the SMC sampler described in section 2.1.2. Note that the ratio in the acceptance probability in equation (7) is the same as the incremental SMC weight in equation (4). The reason for this is that both algorithms make use of an IS estimator of the Bayes factor $Z_{t+1}/Z_t$: using a proposed point $\theta_t \sim \pi_t$, $u_t \sim \psi_t$ and $\theta_{t+1} = (\theta_t, u_t)$, this estimator is given by
$$\widehat{\frac{Z_{t+1}}{Z_t}} = \frac{\pi_{t+1}(\theta_{t+1})}{\pi_t(\theta_t)\, \psi_t(u_t \mid \theta_t)}. \quad (8)$$
We may see RJMCMC as using an IS estimator of the ratio of the posterior model probabilities within its acceptance ratio; this view on RJMCMC (Karagiannis and Andrieu, 2013) links it to pseudo-marginal approaches (Andrieu and Roberts, 2009) in which IS estimators of target distributions are employed. As in pseudo-marginal MCMC, the efficiency of the chain depends on the variance of the estimator that is used. We observe that the IS estimator in equation (8) is likely to have high variance: this is one way of explaining the poor acceptance rate of dimension-changing moves in RJMCMC. In particular, we note that this estimator suffers a curse of dimensionality in the dimension of $\theta_{t+1}$, meaning that RJMCMC is in practice seldom effective when the parameter space is of high dimension. This view suggests a number of potential improvements to RJMCMC with a birth move, each of which has been previously investigated.
• IS performs better if the proposal distribution is close to the target, whilst ensuring that the proposal has heavier tails than the target. The original RJMCMC algorithm allows the possibility to construct such proposals by allowing for the use of transformations to move from the parameters of one model to the parameters of another. Richardson and Green (1997) provide a famous example of this in the Gaussian mixture case in the form of split-merge moves. Focussing on the split move, the idea is to propose splitting an existing component, using a moment matching technique to ensure that the new components have appropriate means, variances and weights.
• Annealed importance sampling (AIS) (Neal, 2001) yields a lower variance than IS. The idea is to use intermediate distributions to form a path between the IS proposal and target, using MCMC moves to move points along this path. This approach was shown to be beneficial in some cases by Karagiannis and Andrieu (2013).
• The estimator in equation (8) uses only a single importance point. It would be improved by using multiple points. However, using such an estimator directly within RJMCMC leads to a "noisy" algorithm that does not have the correct target distribution, for the same reasons as those given for the noisy exchange algorithm in Alquier et al. (2016). We note that recent work (Andrieu et al., 2018) suggests a correction to provide an exact approach based on the same principle.
The approach we take in this paper is to investigate variations on these ideas within the SMC sampler context, rather than RJMCMC. We begin by examining the use of transformations in section 2.3, then describe the use of intermediate distributions and other refinements in section 2.4. The final idea is automatically used in the SMC context, due to the use of $P$ particles.
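The curse of dimensionality of the estimator in equation (8) is easy to exhibit numerically. The toy experiment below (our illustration, with arbitrary Gaussian densities) computes single-point IS weights between two normalised densities whose ratio of normalising constants is exactly 1, and shows the relative standard deviation of the weights exploding with dimension.

```python
import numpy as np

rng = np.random.default_rng(2)

# Single-point IS estimator of Z_{t+1}/Z_t where both targets are normalised
# (true ratio 1): proposal N(0, I_d), target N(0.5, I_d) in each coordinate.
for d in [1, 5, 10, 20, 40]:
    x = rng.normal(size=(100_000, d))                      # theta ~ pi_t
    logw = np.sum(-0.5 * (x - 0.5) ** 2 + 0.5 * x ** 2, axis=1)
    w = np.exp(logw)
    print(f"d={d:3d}  mean={w.mean():.3f}  relative std={w.std() / w.mean():.1f}")
```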

Using transformations in SMC samplers
We now show (in a generalisation of section 2.1.2) how to use transformations within SMC, whilst simultaneously changing the dimension of the target at each iteration; an approach we will refer to as transformation SMC (TSMC). We again use the approach of performing SMC on a sequence of targets $\varphi_t$, with each of these targets being on a space of fixed dimension, constructed such that they have the desired target $\pi_t$ as a marginal. In this section the dimension of the space on which $\pi_t$ is defined again varies with $t$, but is not necessarily increasing with $t$. Let $\theta_t$ be the random vector of interest at SMC iteration $t$: we wish to approximate the distributions $\pi_t$ of $\theta_t$ in the space $\Theta_t$. Let $\varphi_t$ be a sequence of unnormalised targets (whose normalised versions are $\bar{\varphi}_t$), being the distribution of the random vector $\vartheta_t = (\theta_t, u_t)$ in the space $E_t = (\Theta_t, U_t)$, and use
$$\varphi_t(\vartheta_t) = \pi_t(\theta_t)\, \psi_t(u_t \mid \theta_t). \quad (9)$$
The dimension of $\Theta_t$ can change with $t$, but the dimension of $E_t$ must be constant in $t$. We introduce a transformation $G_{t \to t+1} : \Theta_t \times U_t \to \Theta_{t+1} \times U_{t+1}$ and define $\vartheta_{t \to t+1} = G_{t \to t+1}(\vartheta_t)$. In many cases we will choose $G_{t \to t+1}$ to be bijective. In this case we denote its inverse by $G_{t+1 \to t} = G_{t \to t+1}^{-1}$, with $\vartheta_{t+1 \to t} = G_{t+1 \to t}(\vartheta_{t+1})$. Let the distribution of the transformed random variable $\vartheta_{t \to t+1}$ be $\varphi_{t \to t+1}$ (i.e. $\varphi_{t \to t+1} = \mathcal{L}\big(G_{t \to t+1}(\vartheta_t)\big)$, where $\mathcal{L}(X)$ denotes the law of a random variable $X$) and let the distribution of $\vartheta_{t+1 \to t}$ be $\varphi_{t+1 \to t}$. These distributions may be derived using standard results about the distributions of transformations of random variables: e.g. where the $E_t$ are continuous spaces and where $G_{t \to t+1}$ is a diffeomorphism, having Jacobian determinant $J_{t \to t+1}$, with inverse $G_{t+1 \to t}$ having Jacobian determinant $J_{t+1 \to t}$. In this case we have
$$\varphi_{t \to t+1}(\vartheta_{t \to t+1}) = \varphi_t\big(G_{t+1 \to t}(\vartheta_{t \to t+1})\big)\, |J_{t+1 \to t}|.$$
We may then use an SMC sampler on the sequence of targets $\varphi_t$, with the following steps at its $(t+1)$th iteration.
1. Transform: for the $p$th particle, apply $\vartheta_{t \to t+1}^{(p)} = G_{t \to t+1}\big(\vartheta_t^{(p)}\big)$. 2. Reweight and resample: calculate the updated (unnormalised) weight
$$\widetilde{w}_{t+1}^{(p)} = w_t^{(p)}\, \frac{\varphi_{t+1}\big(\vartheta_{t \to t+1}^{(p)}\big)}{\varphi_{t \to t+1}\big(\vartheta_{t \to t+1}^{(p)}\big)}. \quad (10)$$
Where $G_{t \to t+1}$ is a diffeomorphism we have
$$\widetilde{w}_{t+1}^{(p)} = w_t^{(p)}\, \frac{\varphi_{t+1}\big(\vartheta_{t \to t+1}^{(p)}\big)\, |J_{t \to t+1}|}{\varphi_t\big(\vartheta_t^{(p)}\big)}.$$
It is possible, depending on the transformation used, that this weight update involves none of the dimensions above $\max\{\dim(\theta_t), \dim(\theta_{t+1})\}$. Then resample if the ESS falls below some threshold, as described previously.
3. Move: for each $p$, let $\vartheta_{t+1}^{(p)}$ be the result of an MCMC move with target $\varphi_{t+1}$, starting from $\vartheta_{t \to t+1}^{(p)}$. We need not simulate $u$ variables that are not used at the next iteration.
To illustrate the additional flexibility this framework allows, over and above the sampler described in section 2.1.2, we consider the Gaussian mixture example in section 2.2. The sampler from section 2.1.2 provides an alternative to RJMCMC in which a set of particles is used to sample from each model in turn, using the particles from model $t$, together with new dimensions simulated using a birth move, to explore model $t+1$. The sampler in this section allows us to use a similar idea using more sophisticated proposals, such as split moves. The efficiency of the sampler depends on the choice of $\psi_t$ and $G_{t \to t+1}$. As previously, a good choice for these quantities should result in a small distance between $\varphi_{t \to t+1}$ and $\varphi_{t+1}$, whilst ensuring that $\varphi_{t \to t+1}$ has heavier tails than $\varphi_{t+1}$. As in the design of RJMCMC algorithms, usually these choices will be made using application-specific insight.
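The sketch below illustrates a single TSMC transform-and-reweight step (equation (10)) with an assumed affine transformation; the distributions and the map $G_{t \to t+1}(x) = 3 + 2x$ are illustrative choices, picked so that the pushforward $\varphi_{t \to t+1}$ is close to, and slightly heavier-tailed than, $\varphi_{t+1}$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
P = 1000

# Particles approximating phi_t = N(0,1).
theta = rng.normal(size=P)
logw = np.full(P, -np.log(P))

# Transformation G_{t->t+1}(x) = 3 + 2x, Jacobian determinant J = 2, so the
# pushforward phi_{t->t+1} = N(3, 2^2) slightly overdisperses phi_{t+1} = N(3, 1.8^2).
theta_new = 3.0 + 2.0 * theta

# Weight update (equation (10)), diffeomorphism form:
#   w * phi_{t+1}(G(theta)) * |J_{t->t+1}| / phi_t(theta)
logw += (norm.logpdf(theta_new, 3.0, 1.8) + np.log(2.0)
         - norm.logpdf(theta, 0.0, 1.0))
w = np.exp(logw - logw.max()); w /= w.sum()
print("ESS after transform move:", 1.0 / np.sum(w ** 2))  # close to P
```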

Using intermediate distributions
The Monte Carlo variance of an SMC sampler depends on the distance between successive target distributions, thus a well designed sampler will use a sequence of distributions in which the distance between successive distributions is small. We ensure this by introducing intermediate distributions in between successive targets (Neal, 2001): in between targets $\varphi_t$ and $\varphi_{t+1}$ we use $K-1$ intermediate distributions, the $k$th being $\varphi_{t,k}$, so that $\varphi_{t,0} = \varphi_t$ and $\varphi_{t,K} = \varphi_{t+1}$, and therefore $\varphi_{t,K} = \varphi_{t+1,0}$. We use geometric annealing, i.e.
$$\varphi_{t \to t+1, k} = \varphi_{t \to t+1}^{1 - \gamma_k}\, \varphi_{t+1}^{\gamma_k}, \quad (11)$$
where $0 = \gamma_0 < \ldots < \gamma_K = 1$. This idea results in only small alterations to the TSMC presented above. We now use a sequence of targets $\varphi_{t,k}$: when incrementing the $t$ index we use a transform move, and the MCMC moves now have target $\varphi_{t \to t+1, k+1}$, starting from $\vartheta_{t \to t+1, k}$ and storing the result in $\vartheta_{t \to t+1, k+1}$, with weight update
$$\widetilde{w}_{t+1,k+1}^{(p)} = w_{t+1,k}^{(p)}\, \frac{\varphi_{t \to t+1, k+1}\big(\vartheta_{t \to t+1, k}^{(p)}\big)}{\varphi_{t \to t+1, k}\big(\vartheta_{t \to t+1, k}^{(p)}\big)}. \quad (12)$$
The use of intermediate distributions makes this version of TSMC more robust than the previous one; the MCMC moves used at the intermediate distributions provide a means for the algorithm to recover if the initial transformation is not enough to ensure that $\varphi_{t \to t+1}$ is similar to $\varphi_{t+1}$.
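A minimal sketch of the geometric bridge in (11), with illustrative log densities; in practice $\log \varphi_{t \to t+1}$ and $\log \varphi_{t+1}$ would be the quantities defined above.

```python
import numpy as np

# Geometric bridge between log phi_{t->t+1} and log phi_{t+1} (equation (11));
# the endpoint densities and the annealing schedule here are illustrative.
def log_phi_bridge(log_phi_start, log_phi_end, gamma):
    """Return the log density of phi_start^{1-gamma} * phi_end^{gamma}."""
    return lambda x: (1.0 - gamma) * log_phi_start(x) + gamma * log_phi_end(x)

log_start = lambda x: -0.5 * x**2                  # e.g. phi_{t->t+1} = N(0,1)
log_end = lambda x: -0.5 * ((x - 2.0) / 0.7)**2    # e.g. phi_{t+1}
gammas = np.linspace(0.0, 1.0, 6)                  # 0 = gamma_0 < ... < gamma_K = 1
bridges = [log_phi_bridge(log_start, log_end, g) for g in gammas]
print([f"{b(1.0):.2f}" for b in bridges])          # bridge log densities at x = 1
```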

Adaptive SMC
Section 2.4.1 describes the use of intermediate distributions with the aim of ensuring that the distance between neighbouring targets is not too great, but this aim cannot be achieved without also considering where to place these intermediate distributions. In this paper we follow the adaptive strategy first used in Del Moral et al. (2012) and refined in Zhou et al. (2015) for the case where resampling is not performed at every iteration. At iteration $(t+1), (k+1)$ this approach uses the conditional ESS (CESS)
$$\mathrm{CESS} = \frac{P\left(\sum_{p=1}^{P} w_{t+1,k}^{(p)}\, \omega^{(p)}\right)^2}{\sum_{p=1}^{P} w_{t+1,k}^{(p)}\, \big(\omega^{(p)}\big)^2} \quad (13)$$
to monitor the discrepancy between neighbouring distributions, where $\omega^{(p)}$ is the incremental weight (i.e. the term multiplied by $w_{t+1,k}^{(p)}$ in equation (12)). Before the reweighting step is performed, the next intermediate distribution is chosen to be the distribution under which the CESS is found to be $\beta P$, for some $0 < \beta < 1$. In the case of the geometric annealing scheme, this corresponds to a particular choice for $\gamma_k$. We may also adapt the MCMC kernels used for the move step, based on the current particle set.
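A common implementation of this adaptive placement (a hedged sketch following the strategy described above, not the authors' code) solves for the next $\gamma$ by bisection so that the CESS of equation (13) equals $\beta P$.

```python
import numpy as np

def cess(normalised_w, log_incr):
    """Conditional ESS of equation (13), with incremental weights omega = exp(log_incr)."""
    omega = np.exp(log_incr - log_incr.max())   # stabilised; the CESS is scale-invariant
    P = len(normalised_w)
    return P * np.sum(normalised_w * omega) ** 2 / np.sum(normalised_w * omega ** 2)

def next_gamma(normalised_w, log_num, log_den, gamma_prev, beta, tol=1e-6):
    """Bisect for the largest gamma with CESS >= beta * P under the geometric
    bridge; log_num/log_den are log phi_{t+1} and log phi_{t->t+1} at the particles."""
    P = len(normalised_w)
    log_ratio = log_num - log_den
    if cess(normalised_w, (1.0 - gamma_prev) * log_ratio) >= beta * P:
        return 1.0                               # can jump straight to the target
    lo, hi = gamma_prev, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cess(normalised_w, (mid - gamma_prev) * log_ratio) >= beta * P:
            lo = mid
        else:
            hi = mid
    return lo

rng = np.random.default_rng(4)
P = 1000
w = np.full(P, 1.0 / P)
log_ratio = rng.normal(size=P)                  # stand-in for log(phi_{t+1}/phi_{t->t+1})
print("next gamma:", next_gamma(w, log_ratio, np.zeros(P), 0.0, beta=0.99))
```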

Auxiliary variables in proposals
For the Gaussian mixture example, for two or more components, when using a split move we must choose the component that is to be split. We may think of the choice of splitting different components as offering multiple "routes" through a space of distributions, with the same start and end points. An alternative route would be given by using a birth move rather than a split move. In this section we generalise TSMC to allow multiple routes. We restrict our attention to the case where the choice of multiple routes is possible at the beginning of a transition from $\varphi_t$ to $\varphi_{t+1}$, when $k = 0$ (more general schemes are possible). A route corresponds to a particular choice for the transformation $G_{t \to t+1}$, thus we consider a set of $M_t$ possible transformations indexed by the discrete random variable $l_t$, using the notation $G_{t \to t+1}^{(l_t)}$ (also using this superscript on distributions that depend on this choice of $G$). We now augment the target distribution with variables $l_0, \ldots, l_{T-1}$ and, for each $t$, alter the distribution $\psi_t$ such that it becomes a joint distribution on $u_t$ and $l_t$. Our sampler will draw the $l$ variables at the point at which they are introduced, so that different particles use different routes, but will not perform any MCMC moves on each variable after it is introduced. This leads to the sampler being degenerate in most of the $l$ variables, but this does not affect the desired target distribution.
A revised form of TSMC is then, when $k = 0$, to first simulate routes $l_t^{(p)} \sim \rho_t$ for each particle, then to use a different transformation $\vartheta_{t \to t+1}^{(p)} = G_{t \to t+1}^{(l_t^{(p)})}\big(\vartheta_t^{(p)}\big)$ dependent on the route variable. The weight update is then given by
$$\widetilde{w}_{t+1}^{(p)} = w_t^{(p)}\, \frac{\varphi_{t+1}\big(\vartheta_{t \to t+1}^{(p)}\big)\, M_t^{-1}}{\rho_t\big(l_t^{(p)}\big)\, \varphi_{t \to t+1}^{(l_t^{(p)})}\big(\vartheta_{t \to t+1}^{(p)}\big)}, \quad (14)$$
where for simplicity we have omitted the dependence of $u_t$ on $l_t$. This weight update is very similar to one found in Del Moral et al. (2006), for the case where a discrete auxiliary variable is used to index a choice of MCMC kernels used in the move step. Analogous to Del Moral et al. (2006), the variance of (14) is always greater than or equal to that of (10); we show in section 3 that this additional variance can result in large errors in marginal likelihood estimates. Therefore, instead we employ the Rao-Blackwellisation procedure found in population Monte Carlo (Douc et al., 2007), and marginalise the proposal over the auxiliary variable $l_t$. This results in a weight update of
$$\widetilde{w}_{t+1}^{(p)} = w_t^{(p)}\, \frac{\varphi_{t+1}\big(\vartheta_{t \to t+1}^{(p)}\big)}{\sum_{l=1}^{M_t} \rho_t(l)\, \varphi_{t \to t+1}^{(l)}\big(\vartheta_{t \to t+1}^{(p)}\big)}. \quad (15)$$
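The difference between the conditional update (14) and the marginal update (15) can be seen in a toy two-route example, an illustration under assumed Gaussian distributions; with uniform route probabilities the $M_t^{-1}$ and $\rho_t(l_t)$ terms in (14) cancel.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
P = 2000

# Two illustrative "routes": G^(1)(x) = x + 2 and G^(2)(x) = -x, each chosen
# with probability rho = 0.5, from particles x ~ N(0,1), both with |Jacobian| = 1.
x = rng.normal(size=P)
l = rng.integers(1, 3, size=P)
y = np.where(l == 1, x + 2.0, -x)

log_target = norm.logpdf(y, 1.0, 1.5)           # phi_{t+1}, illustrative
log_route1 = norm.logpdf(y - 2.0)               # pushforward of route 1: N(2,1)
log_route2 = norm.logpdf(-y)                    # pushforward of route 2: N(0,1)

# Conditional weights (14) vs marginal weights (15); constants cancel here.
log_cond = log_target - np.where(l == 1, log_route1, log_route2)
log_marg = log_target - np.logaddexp(np.log(0.5) + log_route1,
                                     np.log(0.5) + log_route2)
for name, lw in [("conditional", log_cond), ("marginal", log_marg)]:
    w = np.exp(lw - lw.max()); w /= w.sum()
    print(f"{name}: ESS = {1.0 / np.sum(w**2):.0f} of {P}")
```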

Discussion
One of the most obvious applications of TSMC is Bayesian model comparison. SMC samplers are a generalisation of several other techniques, such as IS, AIS and the "stepping stones" algorithm (Xie et al., 2011) (which is essentially equivalent to AIS where more than one MCMC move is used per target distribution), thus we expect a well-designed SMC sampler to outperform these techniques in most cases. Zhou et al. (2015) review existing techniques that use SMC for model comparison, and conclude that "the SMC2 algorithm (moving from prior to posterior) with adaptive strategies is the most promising among the SMC strategies". In section 3 we provide a detailed comparison of TSMC with SMC2, and find that TSMC can have significant advantages. Section 2.2.2 compared TSMC with RJMCMC, noting that RJMCMC explores the model space by using a high variance estimator of a Bayes factor at each MCMC iteration, whereas TSMC is designed to construct a single lower variance estimator of each Bayes factor. The high variance estimators within RJMCMC are the cause of its most well known drawback: that the acceptance rate of trans-dimensional moves can be very small. The design of TSMC, in which each model is visited in turn, completely avoids this issue. One might envisage that, despite avoiding poor mixing, TSMC might instead yield high variance Bayes factor estimators for challenging problems. However, TSMC has the advantage that adaptive methods may be used to reduce the possibility that the estimators have high variance by, for example, automatically using more intermediate distributions. The possibility to adaptively choose intermediate distributions also provides an advantage over the approach of Karagiannis and Andrieu (2013), where a sequence of intermediate distributions for estimating each Bayes factor must be specified in advance.
Since, by construction, TSMC is a particular instance of SMC as described in Del Moral et al. (2006), all of the theoretical properties of a standard SMC algorithm apply. Of particular interest are the properties of the method as the dimension of the parameter spaces grows. TSMC is constructed on a sequence of extended spaces $E_t$, each of which has dimension $d_T$, thus in the worst case the results for an SMC sampler on a space of dimension $d_T$ apply. In this respect, the authors in Beskos et al. (2014) have analysed the stability of SMC samplers as the dimension of the state space increases when the number of particles $P$ is fixed. Their work provides justification, to some extent, for the use of intermediate distributions $\{\bar{\varphi}_{t,k}\}_{k=1}^{K}$. Under some assumptions, it has been shown that when the number of intermediate distributions is $K = O(d_T)$, and as $d_T \to \infty$, the effective sample size $\mathrm{ESS}_{t+1}^{P}$ is stable in the sense that it converges to a non-trivial random variable taking values in $(1, P)$. The total computational cost for bridging $\bar{\varphi}_t$ and $\bar{\varphi}_{t+1}$, assuming a product form of $d_T$ components, is $O(P d_T^2)$. However, in practice, due to the cancellation of "fill in" variables, and with an appropriate transformation of existing variables, we anticipate the effective dimension of the problem to be potentially substantially less than encountered in this worst case. The theoretical properties of the method are explored further in the Supplementary Information.

Bayesian model comparison for mixtures of Gaussians
In this section we examine the use of TSMC on the mixture of Gaussians application in section 2.2: i.e. we wish to perform Bayesian inference on the number of components $t$, and their parameters $\theta_t$, from data $y$. For simplicity, we study the "without completion" model, where component labels for each measurement are not included in the model. In the next section we outline the design of the algorithms used, then in section 3.2 we describe the results of using these approaches on previously studied data, highlighting features of the approach. Further results are given in the Supplementary Information.

Description of algorithms
Let $k$ be the unknown number of mixture components, and $(\mu_{1:k}, \tau_{1:k}, \nu_{1:k})$ (means, precisions and weights respectively) be the parameters of the $k$ components. Our likelihood is the same as in equation (5); we use priors $\tau_s \sim \text{Gamma}\big(2, 2S^2/100\big)$ and $\nu_{1:k} \sim \text{Dir}(1, \ldots, 1)$ for the precisions and weights respectively, and for the means we choose an unconstrained prior of $\mu_s \sim \mathcal{N}(m, S^2)$, where $m$ is the mean and $S$ is the range of the observed data. We impose an ordering constraint on the means, as described in Jasra et al. (2005), solely for the purposes of improving the interpretability of our results. For simplicity we have also not included the commonly used "random beta" hierarchical prior structure on $\tau$ (Richardson and Green, 1997). From a statistical perspective these choices are suboptimal: we emphasise that the only reason they are made here is to simplify our presentation of the behaviour of TSMC.
We use different variants of TSMC (as described in section 2.3), using a sequence of distributions $(\varphi_t)_{t=1}^{T}$ where $\varphi_t(\vartheta_t) = \pi_t(\theta_t)\, \psi_t(u_t)$. Here $\pi_t$ is the posterior on $t$ components given by equation (6), and $\psi_t$ differs depending on the transformation that is chosen. We use intermediate distributions (as described in section 2.4.1), using geometric annealing, in all of our algorithms, making use of the adaptive method from section 2.4.2 to choose where to place these distributions. The results in this section focus particularly on illustrating the advantages afforded by making an intelligent choice of the transformation in TSMC. Full details of the transformations, weight updates and MCMC moves are given in the Supplementary Information. In summary, we use the birth and split moves referred to in section 2.2, together with a move that orders the components. For both moves we use the weight updates in equations (14) (referred to henceforth as the conditional approach) and (15) (referred to as the marginal approach).
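For reference, the likelihood of equation (5) can be evaluated stably on the log scale as follows (a standalone sketch; the parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def log_likelihood(y, mu, tau, nu):
    """Log of equation (5): prod_i sum_s nu_s N(y_i | mu_s, 1/tau_s),
    computed on the log scale via logsumexp for numerical stability."""
    # comp has shape (N, k): log nu_s + log N(y_i | mu_s, tau_s^{-1/2})
    comp = np.log(nu) + norm.logpdf(y[:, None], mu[None, :],
                                    (1.0 / np.sqrt(tau))[None, :])
    return np.sum(logsumexp(comp, axis=1))

y = np.array([0.1, 0.3, 2.8, 3.1])
print(log_likelihood(y, mu=np.array([0.0, 3.0]),
                     tau=np.array([4.0, 4.0]), nu=np.array([0.5, 0.5])))
```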

Results
We ran SMC2 and the TSMC approaches on the enzyme data from Richardson and Green (1997). We ran the algorithms 50 times, up to a maximum of 8 components, with 500 particles. We used an adaptive sequence of intermediate distributions, choosing the next intermediate distribution to be the one that yields a CESS (equation (13)) of $\beta P$, where $\beta = 0.99$. We resampled using stratified resampling when the ESS fell below $\alpha P$, where $\alpha = 0.5$. Figure 1 compares the birth and split TSMC algorithms when moving from one to two components. We observe that the split transformation has the effect of moving the parameters to initial values that are more appropriate for exploring the posterior on two components. The birth move is a poor choice for the existing parameters in the model: figure 1e shows that no particles drawn from the proposal (i.e. the posterior for the single component model) overlap with the posterior for the first component in the two component model. Despite the poor proposal, the intermediate distributions (of which there are many more than used for the split move) enable a good representation of the posterior distribution, although below we see that the poor proposal results in very poor estimates of the marginal likelihood.
Figure 2 shows log marginal likelihood estimates from the different approaches (note that a poor quality SMC usually results in an underestimate of the log marginal likelihood), and the cumulative number of intermediate distributions used in estimating all of the marginal likelihoods up to model $k$, for each $k$. We observe that the performance of SMC2 degrades as the dimension increases, due to the increasing distance of the prior from the posterior: we see that the adaptive scheme using the CESS results in the number of intermediate distributions across all dimensions being approximately constant which, as suggested by Beskos et al. (2014), is insufficient to control the variance as the dimension grows. As discussed above, both birth TSMC methods yield inaccurate Bayes factor estimates, with split TSMC exhibiting substantially better performance. However, we see that neither approach yields very accurate results when using the conditional weight update given in equation (14); instead the marginalised weight update of equation (15) is required to provide good performance. The marginal version of split TSMC significantly outperforms the other approaches, although we note that this is achieved at a higher computational cost due to the sum in the denominator of the weight updates. For all TSMC approaches, we see that the number of intermediate distributions decreases as we increase the dimension. This result can be attributed to the relatively small change that results from only adding a single component to the model at a time in TSMC. If the method has a good representation of the target at model $t$, and there is minimal change in the posterior on the existing $t$ components when moving to model $t+1$, then the SMC is effectively only exploring the posterior on the additional component and thus has a higher ESS.

Introduction
In this section we describe the use of TSMC for online inference under the coalescent model in population genetics (Kingman, 1982); we consider the case in which we wish to infer the clonal ancestry (or ancestral tree) of a bacterial population from DNA sequence data. Current approaches in this area use MCMC (Drummond and Rambaut, 2007), which is a limitation in situations where DNA sequence data does not arrive as a batch, such as may happen when studying the spread of an infectious disease as the outbreak is progressing (Didelot et al., 2014). We instead introduce an SMC approach to online inference, inferring posterior distributions as sequences become available (this approach is similar to that of Dinh et al. (2016), which was devised simultaneously to ours). We further envisage that our method will be useful in cases in which data is available as a single batch, through exploiting the well known property that a tree estimated from $t+1$ sequences is usually similar to a tree estimated from $t$ sequences.
Exploring the space of trees for a large number of sequences appears challenging due to the large number of possible trees: by adding leaves one by one, the SMC approach follows a path through tree space in which transitions from distribution $\pi_t$ to $\pi_{t+1}$ are not challenging. Further, our approach yields more stable estimates of the marginal likelihood of models than current approaches used routinely in population genetics, such as the infinite variance harmonic mean estimator (Drummond and Rambaut, 2007) and the stepping stone algorithm (Drummond and Rambaut, 2007; Xie et al., 2011).

Previous work
The idea of updating a tree by adding leaves dates back to at least Felsenstein (1981), in which he describes, for maximum likelihood estimation, that an effective search strategy in tree space is to add species one by one. More recent work also makes use of the idea of adding sequences one at a time: ARGWeaver (Rasmussen et al., 2014) uses this approach to initialise MCMC on $t+1$ sequences (in this case, on a space of graphs) using the output of MCMC on $t$ sequences, and TreeMix (Pickrell and Pritchard, 2012) uses a similar idea in a greedy algorithm. In work conducted simultaneously to our own, Dinh et al. (2016) also propose a sequential Monte Carlo approach to inferring phylogenies in which the sequence of distributions is given by introducing sequences one by one. However, this approach: uses different proposal distributions for new sequences; does not infer the mutation rate simultaneously with the tree; does not exploit intermediate distributions to reduce the variance; and does not use adaptive MCMC moves.

Data and model
We consider the analysis of $T$ aligned genome sequences $y = y_{1:T}$, each of length $N$. Sites that differ across sequences are known as single nucleotide polymorphisms (SNPs). The data (freely available from http://pubmlst.org/saureus/) used in our examples consist of seven "multi-locus sequence type" (MLST) genes of 25 Staphylococcus aureus sequences, which have been chosen to provide a sample representing the worldwide diversity of this species (Everitt et al., 2014). We make the assumptions that the population has had a constant size over time, that it evolves clonally, and that SNPs are the result of mutation. Our task is to infer the clonal ancestry of the individuals in the study, i.e. the tree describing how the individuals in the sample evolved from their common ancestors, and (additional to Dinh et al. (2016)) the rate of mutation in the population. We describe a TSMC algorithm for addressing this problem in section 4.2, before presenting results in section 4.3. In the remainder of this section we introduce a little notation. Let $\mathcal{T}_t$ represent the clonal ancestry of $t$ individuals and let $\theta/2$ be the expected number of mutations in a generation. We are interested in the sequence of distributions
$$\pi_t(\mathcal{T}_t, \theta) \propto f(y_{1:t} \mid \mathcal{T}_t, \theta)\, p(\mathcal{T}_t)\, p(\theta)$$
for $t = 1 : T$. Here we use the coalescent prior (Kingman, 1982) $p(\mathcal{T}_t)$ for the ancestral tree, the Jukes-Cantor substitution model (Jukes and Cantor, 1969) for $f(y_{1:t} \mid \mathcal{T}_t, \theta)$, and choose $p(\theta)$ to be a gamma distribution with shape 1 and rate 5 (which has its mass on biologically plausible values of $\theta$). Let $l_t^{(a)}$ denote the length of time for which $a$ branches exist in the tree, for $2 \leq a \leq t$. The heights of the coalescent events are given by $h_t^{(a)} = \sum_{\iota=a}^{t} l_t^{(\iota)}$, with $h_t^{(a)}$ being the $(t-a+1)$th coalescence time when indexing from the leaves of the tree. We let $\mathcal{T}_t$ be a random vector $\big(B_t, h_t^{(2:t)}\big)$, where $B_t$ is itself a vector of discrete variables representing the branching order. When we refer to a lineage of a leaf node, this refers to the sequence of branches from this leaf node to the root of the tree.

TSMC for the coalescent
In this section we describe an approach to adding a new leaf to an existing tree, described as a transformation as in section 2.3. The basic idea is to first propose a lineage to which to add the new branch (from distribution $\chi_t^{(g)}$), followed by a height $h_t^{(\text{new})}$, conditional on this lineage (from distribution $\chi_t^{(h)}$), at which the branch connected to the new leaf will join the tree. The resultant weight update takes a form analogous to equation (15), with the denominator summing over $\Lambda$, the set that contains the leaves of the lineages that, if proposed, could have resulted in the new branch. Note the relationship with equation (15): we achieve a lower variance through summing over the possible lineages, rather than using an SMC sampler on the joint space that includes the lineage variable.
To choose the lineage, we make use of an approximation to the probability that the new sequence is $M_s$ mutations from each of the existing leaves, via approximating the pairwise likelihood of the new sequence and each existing leaf. Following Stephens and Donnelly (2000) (see also Li and Stephens (2003)), we set the probability of choosing the lineage with leaf $s$ to be proportional to this approximate pairwise likelihood (equation (23)). For $\chi_t^{(h)}$ we propose to approximate the pairwise likelihood $f_{t+1,s}\big(y_s, y_{t+1} \mid \theta, h_t^{(\text{new})}, g_t = s\big)$, where $y_s$ is the sequence at the leaf of the chosen lineage. Since only two sequences are involved in this likelihood, it is likely to have heavier tails than the posterior. We use a Laplace approximation on a transformed space, following Reis and Yang (2011): further details are given in the Supplementary Information, section 3.2.
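As a rough illustration of the lineage proposal (not the exact form of equation (23), whose details are in the Supplementary Information), the sketch below scores each existing leaf by a Poisson approximation to the pairwise likelihood of the observed SNP distance, and normalises these scores into $\chi_t^{(g)}$; the Poisson pairwise model, the rate parameter and the distances are all placeholder assumptions.

```python
import numpy as np
from scipy.stats import poisson

def lineage_proposal(snp_dist, expected_mutations):
    """chi^(g)_t sketch: probability of attaching to the lineage of each leaf,
    proportional to an approximate pairwise likelihood of the SNP distance.
    The Poisson(expected_mutations) pairwise model is an assumption here."""
    logp = poisson.logpmf(snp_dist, expected_mutations)
    p = np.exp(logp - logp.max())
    return p / p.sum()

snp_dist = np.array([2, 11, 37, 4])     # placeholder: new sequence vs existing leaves
print(lineage_proposal(snp_dist, expected_mutations=5.0))
```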

SMC and MCMC details
We use the following moves on each parameter in $\Theta_{t+1}$: for $\theta$ we use a multiplicative random walk, i.e. an additive normal random walk in log-space, with proposal variance $\sigma_\theta^2$ in log-space; for each height $h^{(a)}$ ($2 < a < T+1$) we use a truncated normal proposal with mean the current value of $h^{(a)}$ and variance $\sigma_{h^{(a)}}^2$. For the branching order we use 20 subtree pruning and regrafting (SPR) moves in each sweep of the MCMC: pilot runs suggested that this many proposed moves results in approximately 0.5 moves being accepted at each sweep of the MCMC; adaptive methods may also be used to make such a choice automatically (South et al., 2016).
The SMC uses $P = 250$ particles, with an adaptive sequence of intermediate distributions, choosing the next intermediate distribution to be the one that yields a CESS (equation (13)) of $\beta P$, where $\beta = 0.95$. We used stratified resampling when the ESS fell below $\alpha P$, where $\alpha = 0.5$. At each iteration we used the current population of particles to tune the proposal variances, as detailed in the Supplementary Information.

Table 1: Log marginal likelihood estimates and total number of distributions for TSMC applied to the coalescent (5 s.f.), for the "Nearest" (first line) and "Furthest" (second line) orderings.

Results
We used six different configurations of our approach, for two different orderings of the 25 sequences. The two orderings were chosen as follows: the "nearest"/"furthest" ordering was chosen by starting with the two sequences with the smallest/largest pairwise SNP difference, then adding sequences in the order of minimum/maximum SNP difference to an existing sequence. The six configurations of the method were: the default configuration; using no tree topology changing MCMC moves; taking $\chi_t^{(h)}$ to be an Exp(1) distribution (less concentrated than the Laplace-based proposal); raising the lineage proposal $\chi_t^{(g)}$ in equation (23) to the power 0, to give a uniform lineage proposal; raising $\chi_t^{(g)}$ to the power 2; and raising $\chi_t^{(g)}$ to the power 4. These latter two approaches use a lineage proposal where the probability is more concentrated on a smaller number of lineages.
Figure 3 shows majority-rule consensus trees from the final TSMC iteration. Figure 3a is generated by the default configuration (for the "nearest" ordering, although results from the "furthest" ordering are nearly identical), and is close to the ground truth (as determined by a long MCMC run). Figures 3b and 3c used no topology changing MCMC moves, thus illustrating the contribution of the SMC proposal in determining the topology. Table 1 shows estimates of the log marginal likelihood from each configuration of the algorithm for both orderings (longer runs of our method suggest the true value is ≈ −6333), along with the total number of intermediate distributions used. Recall that a poorer quality SMC usually results in an underestimate of the log marginal likelihood, and that the number of intermediate distributions offers an indication of the distance between the target and the proposal where the proposal has heavier tails than the target. We draw the following conclusions:
• As also suggested by figure 3, we see that the "nearest" ordering provides consistently better results than the "furthest" ordering. "Nearest" provides an ordering in which new sequences are often added above the root of the current tree, since the existing sequences are all more closely related than the new sequence, whereas "furthest" frequently results in adding a leaf close to the existing leaves of the tree. In the latter strategy, the proposal relating to the new sequence is often good, but adding a new sequence can have a large effect on the posterior of the existing variables. We see this by comparing figures 3b and 3c, observing that the "nearest" ordering results in a topology that is close to the truth. The topology from the "furthest" ordering is not as close to the truth, and thus is more reliant on topology changing MCMC moves to give an accurate sample from the posterior.
• As expected, using no MCMC topology moves results in very poor estimates, highlighting the important role of MCMC in generating diversity not introduced in the SMC proposals. This poor quality is not accounted for by the adaptive scheme based on the CESS introducing more intermediate distributions, since the CESS is only based on the weights of the particles and cannot account for a lack of diversity.
• Using less directed proposals, on both the lineage and the height, increases the distance between the proposal and target, and results in lower quality estimates.
• Using more directed proposals on the lineage may in some cases slightly improve the method, but appears to make the method less robust to the order in which the individuals are added (so may not be suitable in applications where the order of the individuals cannot be chosen).
A video showing the evolution of the majority rule consensus tree (and the marginal likelihood estimate) through all iterations of the SMC, using the default configuration, is included in the Supplementary Information.

Conclusions
This paper introduces a new technique for Bayesian model comparison and parameter estimation, and an approach to online parameter and marginal likelihood estimation for the coalescent, underpinned by the same methodological development: TSMC. We show that whilst TSMC performs inference on a sequence of posterior distributions of different dimensions, it may be viewed as a particular case of the SMC sampler of Del Moral et al. (2007). In this section we outline several points that are not described elsewhere.
One innovation introduced in the paper is the use of transformations within SMC for creating proposal distributions when moving between dimensions. The effectiveness of TSMC is governed by the distance between neighbouring distributions, thus to design TSMC algorithms suitable for any given application, we require the design of a suitable transformation that minimises the distance between neighbouring distributions. This is essentially the same challenge as is faced in designing effective RJMCMC algorithms, and we may make use of many of the methods devised in the RJMCMC literature (Hastie and Green, 2012). The ideal case is to use a transformation such that every distribution $\varphi_{t \to T}$ becomes identical, in which case one may simulate from $\pi_T$ simply by simulating from $\pi_0$ and then applying the transformation. Approximating such a "transport map" for a sequence of continuous distributions is described in Heng et al. (2015). As discussed in section 1.2, Heng et al. (2015) is one of a number of papers that seek to automatically construct useful transformations, and we anticipate these techniques being of use in the case of changing dimension that is addressed in this paper. In the RJMCMC literature, Brooks et al. (2003) describe methods for automatically constructing the "fill in" distributions $\psi_t$ for a given transformation: the literature on transport maps could be used to automatically construct the transformation in advance of this step.
In figure 2 of section 3 we see a characteristic of this approach that will be common to many applications: the estimated marginal likelihood rises as the model is improved, then falls as the effect of the model complexity penalisation becomes more influential than improvements to the likelihood. We note that by using estimates of the variance of the marginal likelihood estimate (Lee and Whiteley, 2016), we may construct a formal diagnostic that decides to terminate the algorithm at a particular model, on observing that the estimated marginal likelihood declines from an estimated maximum value.
Although the examples in this paper both involve posterior distributions of increasing dimension, we also see a use for our approach in some cases that involve distributions of decreasing dimension. For example, in population genetics, it is common to perform a large number of different analyses using different overlapping sets of sequences. For this reason many practitioners would value an inference technique that allows for the removal, as well as the addition, of sequences. Further, many genetics applications now involve the analysis of whole genome sequences. Our approach is applicable in this setting, and for this purpose a BEAST package is currently under development.
Daniel Wilson is a Sir Henry Dale Fellow, jointly funded by the Wellcome Trust and the Royal Society (grant 101237/Z/13/Z).

A Theoretical aspects
We define a sequence of (unnormalised) targets $\{\varphi_{t \to T}\}_{t=0}^{T}$ on the same measurable space $(E_T, \mathcal{E}_T)$ as follows. Consider $\vartheta_t \sim \varphi_t(\cdot)$ and define
$$\varphi_{t \to T} = \mathcal{L}\big(G_{t \to T}(\vartheta_t)\big),$$
where $\mathcal{L}(X)$ denotes the law of a random variable $X$, and where $G_{t \to T}$ is defined recursively for $0 \leq t < T$ by
$$G_{t \to T} = G_{t+1 \to T} \circ G_{t \to t+1},$$
with $G_{T \to T}$ as the identity function. Notice that the normalising constant associated to $\varphi_{t \to T}$ is equal to $Z_t$. Hence, the TSMC algorithm described in the main paper can be seen as an SMC sampler for the targets $\{\varphi_{t \to T}\}_t$, propagating particles $\big\{\vartheta_{t \to T}^{(p)}\big\}_{t,p}$ using a sequence of MCMC kernels $\{K_{t \to T} : E_T \times \mathcal{E}_T \to [0,1]\}_t$, where $K_{t \to T}$ admits $\varphi_{t \to T}$ as invariant. This will also be the case for the modified TSMC algorithm with intermediate distributions; however, the details are omitted for simplicity. Therefore, after the $(t+1)$th iteration the target $\varphi_{t+1 \to T}$ can be approximated using the weighted particles $\big\{w_{t+1}^{(p)}, \vartheta_{t+1 \to T}^{(p)}\big\}_{p=1}^{P}$, for all $0 \leq t < T$ and any $p \in \{1, \ldots, P\}$; consequently, expectations of the form $\bar{\varphi}_{t+1}(h)$ (for a function $h : E_t \to \mathbb{R}$) can be approximated using $\sum_{p=1}^{P} w_{t+1}^{(p)}\, h\big(\vartheta_{t+1 \to T}^{(p)}\big)$. The following theorem, the proof of which may be found in Del Moral et al. (2006, Proposition 2), follows from well-known standard SMC convergence results.
Theorem A.1. Under weak integrability conditions (see Chopin (2004, Theorem 1) or Del Moral (2004, pp. 300-306)) and for any bounded $h : E_t \to \mathbb{R}$, the particle approximation of $\bar{\varphi}_{t+1}(h)$ converges as $P \to \infty$. In addition, standard results show that the normalising constant estimates are unbiased (see e.g. Del Moral (2004, Proposition 7.4.1)), with relative variance increasing at most linearly in $t$ (Cérou et al., 2011, Theorem 5.1). Such results are summarised in the following theorem.
Theorem A.2. For fixed $E_T$, and when resampling is not done adaptively, the estimates $\widehat{Z}_t^P$ are unbiased. Furthermore, under strong mixing assumptions there exists a constant $C_T(t)$, which is linear in $t$, bounding the relative variance of $\widehat{Z}_t^P$. However, as $T$ increases the dimension of $E_T$ (denoted hereafter by $d_T$) may increase, and we will usually require an exponential growth in the number of particles $P$ in order to obtain meaningful results; see e.g. Bickel et al. (2008). For instance, without the resampling step the ESS at time $t+1$ is closely related to the quantity
$$\rho_{t+1}(d_T) = \mathbb{E}_{\bar{\varphi}_{t \to T}}\left[\left(\frac{\bar{\varphi}_{t+1 \to T}}{\bar{\varphi}_{t \to T}}\right)^2\right]$$
(see e.g. Agapiou et al. (2015)), which serves as a measure of the dissimilarity between proposals and targets, and which quite often increases exponentially in $d_T$. This quantity provides information about the limiting proportion of the effective number of particles, since
$$\frac{\mathrm{ESS}_{t+1}^P}{P} \to \frac{1}{\rho_{t+1}(d_T)} \quad \text{as } P \to \infty.$$
The above equation implies that $P = O(\rho_{t+1}(d_T))$ if we want to maintain an acceptable level for the ESS. In our context, even though the targets $\{\bar{\varphi}_{s \to T}\}_s$ are $d_T$-dimensional, the ratios $\{\varphi_{s \to T}/\varphi_{s-1 \to T}\}_s$ will involve cancellations of "fill in" variables, as discussed in the paper. This potentially leads to a much lower effective dimension of the problem than $d_T$.
For the SMC method presented in Dinh et al. (2016) in the context of phylogenetic trees, the authors have shown that $\rho_T$ grows at most linearly in $T$ under some strong conditions, somewhat comparable to the strong mixing conditions required in Theorem A.2. Imposing an extra condition on the average branch length of the tree, $\rho_T$ can be bounded uniformly in $T$. However, their method performs MH moves after resampling for improving the diversity of the particles, which could result in a sub-optimal algorithm. In contrast, TSMC uses MH moves for bridging $\bar{\varphi}_t$ and $\bar{\varphi}_{t+1}$ via the sequence of intermediate distributions $\{\bar{\varphi}_{t,k}\}_{k=1}^{K}$. Heuristically, the introduction of these intermediate distributions together with sensible transformations $\{G_{t \to t+1}\}$ should alleviate problems due to the dissimilarity of targets, thus providing control over $\rho_T$.
In this respect, the authors in Beskos et al. (2014) have analysed the stability of SMC samplers as the dimension of the state space increases when the number of particles $P$ is fixed. Their work provides justification, to some extent, for the use of intermediate distributions $\{\bar{\varphi}_{t,k}\}_{k=1}^{K}$. Under some assumptions, it has been shown that when the number of intermediate distributions is $K = O(d_T)$, and as $d_T \to \infty$, the effective sample size $\mathrm{ESS}_{t+1}^{P}$ is stable in the sense that it converges to a non-trivial random variable taking values in $(1, P)$. The total computational cost for bridging $\bar{\varphi}_t$ and $\bar{\varphi}_{t+1}$, assuming a product form of $d_T$ components, is $O(P d_T^2)$. Using this reasoning, we suspect TSMC will work well in similar and more complex scenarios, e.g. when the targets do not follow a product form or when strong mixing assumptions do not hold. This idea is supported by the results described in the paper.

B.1 Split move
Suppose that at time $t$ the transformation $G_{t \to t+1}^{(l_t)} : \Theta_t \times U_t \to \Theta_{t+1} \times U_{t+1}$ is chosen from a set of $M_t$ possibilities. The label of the chosen transformation, denoted by $l_t$, is jointly drawn with $u_t$ from the distribution $\tilde{\psi}_t(\cdot \mid \theta_t)$. Therefore, after sampling $(u_t, l_t) \sim \tilde{\psi}_t(\cdot \mid \theta_t)$, the incremental weight at time $t$ in the TSMC algorithm is given by
$$\frac{\varphi_{t+1}\big(\vartheta_{t \to t+1}\big)}{M_t\, \rho_t(l_t)\, \varphi_{t \to t+1}^{(l_t)}\big(\vartheta_{t \to t+1}\big)}, \quad (18)$$
where $\varphi_{t \to t+1}^{(l_t)}\big(\vartheta_{t \to t+1}\big) = \varphi_t(\vartheta_t)\, \big|J_{t+1 \to t}^{(l_t)}\big|$ with $\vartheta_t = G_{t+1 \to t}^{(l_t)}\big(\vartheta_{t \to t+1}\big)$. Notice that the denominator contains the term $M_t$ since we have introduced the extra variable $L_t$ in the proposal; thus, in order to obtain the correct ratio of normalising constants we need to extend the target using a "dummy" distribution for $L_t$; in this case this distribution is uniform on the set $\{1, \ldots, M_t\}$.
The split move from Richardson and Green (1997) provides an example: in this case $\psi_t^{(l_t)}(\cdot \mid \theta_t)$ is the distribution on the auxiliary variables $U_t$ required for implementing the split move. An improvement on this idea would be to use a mixture representation of the proposal, as done in population Monte Carlo (Douc et al., 2007), i.e. the denominator of (18) would become a mixture over the $M_t$ possible routes (19); however, we do not follow this approach here. Instead, we try to alleviate a possible complication when implementing the split move. After selecting and splitting the $k$th component $(w_k, \mu_k, \tau_k)$, two new weights (say $w_{k^-}$ and $w_{k^+}$), two new means (say $\mu_{k^-}$ and $\mu_{k^+}$) and two new precisions (say $\tau_{k^-}$ and $\tau_{k^+}$) are obtained. However, if either of the new means breaks the ordering with respect to the means of the other components, then the incremental weight will be zero, since the support of the target $\pi_{t+1}$ has been restricted to ordered means. We solve this by reordering all the components with respect to their means and correcting the incremental weight with an extra factor. The correct incremental weight (20) can be expressed as (18) with an additional factor of $\binom{t+1}{2}$ in the denominator, where the function $o_{t+1} : \Theta_{t+1} \to \Theta_{t+1}$ simply combines the two newly created components, $(w_{k^-}, \mu_{k^-}, \tau_{k^-})$ and $(w_{k^+}, \mu_{k^+}, \tau_{k^+})$, with the set of already ordered $t-1$ components (those that were not split).
To see why (20) is correct, we follow a reasoning similar to that used for deriving (18). In order to obtain the correct ratio of normalising constants, we need to introduce a "dummy" distribution in the target. When inverting the split move with rearrangement, two artificial variables are created, denoting the labels of the newly created components. Since $\mu_{k^-} < \mu_{k^+}$, a simple choice for the "dummy" distribution is a uniform over the set $S_{t+1} = \{(h, k) \mid h, k \in \{1, \ldots, t+1\} \text{ and } h < k\}$, for which $|S_{t+1}| = \binom{t+1}{2}$, as included in (20).

B.2 Birth move
The birth move can also benefit from a reordering of components. The correct incremental weight is much simpler than in the split case, since the auxiliary variable $U_t \sim \psi_t^{(b)}(\cdot \mid \theta_t)$ already represents the new component $(w_*, \mu_*, \tau_*)$. Using the same logic as before, when inverting the birth move with rearrangement, an artificial variable is created which denotes the place of the most recently generated component. Since this label can take values in $S_{t+1} = \{1, \ldots, t+1\}$, the simplest choice for the "dummy" distribution is a uniform over $S_{t+1}$; therefore, the incremental weight in this case (21) takes the form of (18) with an additional factor of $t+1$ in the denominator.

B.3 Marginalisation of moves
The previous descriptions of the birth and split moves are based on the idea of extending the target using an auxiliary distribution for the labels created due to the reordering process. We saw that a simple choice for this auxiliary distribution is a discrete uniform over the set of possible values for the labels, which is the reason why the weights in (20) and (21) contain the denominator terms $\binom{t+1}{2}$ and $t+1$, respectively. However, as discussed later in the examples of Section B.5, the corresponding estimators of the normalising constant may suffer from a very high variance, making them useless from a practical point of view. A way around this problem is to marginalise the proposal over the artificial label created by the reordering process; this marginalisation is similar to (19) and is now described. The ordering function $o_{t+1} : \Theta_{t+1} \to \Theta_{t+1}$, introduced previously, simply reorders the newly generated component (or components) from the birth (split) move. In order to be able to compute the inverse transformation of this reordering, an artificial variable $\tilde{l}_{t+1} \in \tilde{S}_{t+1}$ is created which simply denotes the place (or places) of the new component(s). To be more precise, there are two transformations applied to $\vartheta_t$ that allow us to obtain the final $\vartheta_{t \to t+1}$ together with the label $\tilde{l}_{t+1}$. Let
$$\bar{G}_{t \to t+1}(\vartheta_t) := \bar{o}_{t+1} \circ G_{t \to t+1}(\vartheta_t) = \bar{o}_{t+1}\big(\theta_{t \to t+1}(\vartheta_t),\, u_{t \to t+1}(\vartheta_t)\big) = \big(o_{t+1}(\theta_{t \to t+1}(\vartheta_t)),\, u_{t \to t+1}(\vartheta_t),\, \tilde{l}_{t+1}\big) = \big(\vartheta_{t \to t+1},\, \tilde{l}_{t+1}\big),$$
where $\bar{o}_{t+1} : \Theta_{t+1} \times U_{t+1} \to \Theta_{t+1} \times U_{t+1} \times \tilde{S}_{t+1}$ is an extension of $o_{t+1}$ that reorders $\theta_{t \to t+1}(\vartheta_t)$, leaves $u_{t \to t+1}(\vartheta_t)$ unchanged, and creates $\tilde{l}_{t+1}$. In the previous sections there was no need to introduce $\tilde{l}_{t+1}$, since the denominator in (20) and (21) is obtained simply by transforming $\vartheta_{t \to t+1}$ back into $\vartheta_t$; observe that for such cases
$$\varphi_{t \to t+1}\big(\vartheta_{t \to t+1}\big) = \varphi_t\big(\bar{G}_{t+1 \to t}\big(\vartheta_{t \to t+1}, \tilde{l}_{t+1}\big)\big)\, |J_{t+1 \to t}| = \varphi_t(\vartheta_t)\, |J_{t+1 \to t}|.$$
The marginalisation step becomes clear by integrating out the variable $\tilde{l}_{t+1}$; in this case the denominators in (20) and (21) are replaced by sums over the possible values of $\tilde{l}_{t+1}$, recalling that the variables $\theta_t$, $u_t$ and $l_t$ depend on $\big(\vartheta_{t \to t+1}, \tilde{l}_{t+1}\big)$ via the inverse transformation $\bar{G}_{t+1 \to t}$. In Section B.5, we look at the performance of the marginalised versions of the birth and split moves against those described in Sections B.1 and B.2, which we term the conditional versions. It is clear that marginalising should be a sensible approach for reducing the variance of the estimates of the normalising constants; however, in certain cases obtaining such a marginal could become expensive or impractical if the required sum contains a large number of elements.

We used stratified resampling when the ESS fell below $\alpha P$, where $\alpha = 0.5$. At each iteration we used the current population of particles to tune the proposal variances $\sigma_\theta^2$ and $\sigma_{h^{(a)}}^2$ for each $a$. Each variance was decomposed into two terms as $\sigma^2 = s \tilde{\sigma}^2$, with $\tilde{\sigma}^2$ being an empirical variance and $s$ being a scaling factor (different for each proposal variance). $\tilde{\sigma}_\theta^2$ was taken to be the empirical variance of the weighted particles for $\theta$; $\tilde{\sigma}_{h^{(a)}}^2$ was taken to be the empirical variance of the residuals after using a linear regression of $h^{(a)}$ on $\theta$ (used due to the strong dependence of $h^{(a)}$ on $\theta$). Each scaling $s$ was initialised to 1, and was doubled at each iteration where the acceptance rate was estimated as greater than 0.6, and halved where the rate was estimated at less than 0.15.
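The scaling rule just described is straightforward to implement; the sketch below (our illustration of the stated rule, with placeholder data) tunes the scalings and recomputes the proposal variances.

```python
import numpy as np

def tune_scalings(s, acc_rate, hi=0.6, lo=0.15):
    """Double each scaling where the estimated acceptance rate exceeds hi,
    halve it where the rate falls below lo (the rule described above)."""
    s = np.where(acc_rate > hi, 2.0 * s, s)
    s = np.where(acc_rate < lo, 0.5 * s, s)
    return s

def proposal_variances(s_theta, theta, w, s_h, h):
    """sigma^2 = s * empirical variance; for h^(a), the variance of the
    residuals of a weighted linear regression of h^(a) on theta (a sketch of
    the scheme above; the weighted regression is an illustrative choice)."""
    var_theta = np.average((theta - np.average(theta, weights=w)) ** 2, weights=w)
    X = np.column_stack([np.ones_like(theta), theta])
    beta, *_ = np.linalg.lstsq(X * np.sqrt(w)[:, None], h * np.sqrt(w), rcond=None)
    resid = h - X @ beta
    var_h = np.average(resid ** 2, weights=w)
    return s_theta * var_theta, s_h * var_h

rng = np.random.default_rng(6)
theta = rng.normal(size=200)
w = np.full(200, 1.0 / 200)
h = 1.5 * theta + 0.1 * rng.normal(size=200)    # h strongly depends on theta
print(proposal_variances(1.0, theta, w, 1.0, h))
print(tune_scalings(np.array([1.0, 1.0]), np.array([0.7, 0.1])))
```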

Figure 1: The evolution of the estimated posterior for birth and split moves on the enzyme data.

Figure 2: The relative performance of the different SMC schemes on the mixture example. (a) Box plots of the log marginal likelihood estimates from each algorithm. (b) The number of intermediate distributions used in each algorithm.
Figure 3: Majority-rule consensus trees found by TSMC, with differences to the result obtained by the default configuration highlighted. (One panel shows the result using no topology moves with the "furthest" ordering.)
Figure 4: The relative performance of the different SMC schemes on the enzyme data. (a) Box plots of the log marginal likelihood estimates from each algorithm. (b) The number of intermediate distributions used in each algorithm.
Figure 5: The relative performance of the different SMC schemes on the acidity data. (a) Box plots of the log marginal likelihood estimates from each algorithm. (b) The number of intermediate distributions used in each algorithm.