Automatically adapting the number of state particles in SMC$^2$



Introduction
We are interested in exact Bayesian parameter inference for state-space models (SSMs) where the likelihood function of the model parameters is intractable. SSMs are ubiquitous in engineering, econometrics and the natural sciences; see Cappé et al. (2005) and references therein for an overview. They are used when the process of interest is observed indirectly over time or space, i.e. they consist of a hidden or latent process $\{X_t\}_{t \ge 1}$ and an observed process $\{Y_t\}_{t \ge 1}$. Particle marginal Metropolis-Hastings (PMMH) replaces the intractable likelihood with an unbiased particle filter estimate within a Metropolis-Hastings algorithm. Similarly, particle Gibbs uses a conditional particle filter to draw the latent states from their full conditional distribution, then updates the model parameters conditional on the latent states. Both PMMH and particle Gibbs are simulation consistent under mild conditions (Andrieu et al., 2010). Chopin et al. (2012) and Duan and Fulop (2014) apply a similar approach to sequential Monte Carlo (SMC) samplers. SMC methods for static models (Chopin, 2002; Del Moral et al., 2006) recursively sample through a sequence of distributions using a combination of reweighting, resampling and mutation steps. In the Bayesian setting, this sequence often starts at the prior and ends at the posterior distribution. For intractable-likelihood SSMs, Chopin et al. (2012) and Duan and Fulop (2014) replace the likelihood within the sequence of distributions being traversed with its unbiased estimator. Practically, this means that each parameter particle is augmented with $N_x$ state particles. Due to this nesting of SMC algorithms, and following Chopin et al. (2012), we refer to these methods as SMC$^2$. As with particle MCMC, for any fixed number of state particles ($N_x$), SMC$^2$ targets the exact posterior distribution (Duan and Fulop, 2014).
While other, similar methods are available for Bayesian parameter inference of intractable-likelihood SSMs, e.g. nested particle filters (Crisan and Míguez, 2017, 2018) and ensemble MCMC (Drovandi et al., 2022), the resulting inference is approximate and so is not considered in this paper.
The sampling efficiency of particle MCMC and SMC$^2$ greatly depends on the number of state particles used within the particle filter. In particle MCMC, $N_x$ is generally tuned manually, which can be time intensive. A significant advantage of SMC$^2$ over particle MCMC is that $N_x$ can be adapted automatically. Strategies to do this are proposed by Chopin et al. (2012, 2015) and Duan and Fulop (2014); however, these methods automate the adaptation of $N_x$ at the expense of other model-specific tuning parameters, which must then be tuned manually. Furthermore, the value of $N_x$ can be difficult to choose in practice, and it has a significant effect on both the Monte Carlo error of the SMC approximation to the target distribution and the computation time. Current methods require a moderate starting value of $N_x$ to avoid poor values in subsequent iterations, i.e. values that are too low and negatively impact the accuracy of the samples, or unnecessarily high values that increase the computation time.
Our article introduces a novel and principled strategy to automatically tune $N_x$, while aiming to keep an optimal balance between statistical and computational efficiency. Compared to current methods, our approach has fewer tuning parameters that require manual calibration. We find that using the expected squared jumping distance of the mutation step to adapt the number of state particles generally gives the most efficient and reliable results. To further improve the overall efficiency of the adaptation, we also modify the exchange importance sampling method of Chopin et al. (2012) to update the set of state particles once $N_x$ is adapted. This modified version introduces no extra variability in the parameter particle weights, and outperforms the current methods.
The rest of the paper is organized as follows. Section 2 gives the necessary background on state-space models and SMC methods, including particle filters, SMC for static models and SMC$^2$. Section 3 describes the current methods for adapting the number of state particles in SMC$^2$. Section 4 describes our novel tuning methodology. Section 5 shows the performance of our methods on a Brownian motion model, a stochastic volatility model, a noisy theta-logistic model and a noisy Ricker model. Section 6 concludes.

The integral in (1) gives the likelihood function $p(y_{1:t} \mid \theta)$. This integral is often analytically intractable or prohibitively expensive to compute, which means that the likelihood is also intractable. If the value of $\theta$ is fixed, a particle filter targeting $p(x_{1:t} \mid y_{1:t}, \theta)$ gives an unbiased estimate of the likelihood as a by-product, as described in Section 2.2.1. Similarly, a conditional particle filter (Andrieu et al., 2010), i.e. a particle filter that is conditioned on a single state trajectory $x_{1:t}^k$, can be used to unbiasedly simulate latent state trajectories from $p(\cdot \mid x_{1:t}^k, y_{1:t}, \theta)$. Particle filters are SMC methods applied to dynamic models.

Sequential Monte Carlo
SMC methods recursively sample from a sequence of distributions, $\pi_d(z_d) \propto \gamma_d(z_d)$, $d = 0, \ldots, D$, where $\pi_0(z_0)$ can generally be sampled from directly and $\pi_D(z_D)$ is the target distribution (Del Moral et al., 2006). These distributions are traversed using a combination of resample, mutation and reweight steps. Initially, $N_z$ samples are drawn from $\pi_0(z_0)$ and given equal weights, $\{z_0^n, W_0^n = 1/N_z\}_{n=1}^{N_z}$. For each subsequent distribution, the particles are resampled according to their weights, thus removing particles with negligible weights and duplicating high-weight particles. The resampled particles are then mutated using $R$ applications of the mutation kernel $K(z_{d-1}^n, z_d^n)$, and reweighted as
$$w_d^n = \frac{\gamma_d(z_d^n)\, L(z_d^n, z_{d-1}^n)}{\gamma_{d-1}(z_{d-1}^n)\, K(z_{d-1}^n, z_d^n)},$$
where $L(z_d^n, z_{d-1}^n)$ is the artificial backward kernel of Del Moral et al. (2006). Note that if the weights at iteration $d$ are independent of the mutated particles $z_d^n$, the reweighting step should be completed prior to the resample and mutation steps. At each iteration $d$, the weighted particles $\{z_d^n, W_d^n\}_{n=1}^{N_z}$ form an approximation of $\pi_d(z_d)$. See Del Moral et al. (2006) for more details.
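The reweight-resample-mutate recursion above can be sketched in a few lines. The following is a minimal, self-contained Python illustration (the paper's own implementation is in MATLAB) for a toy one-dimensional tempered target; the prior, likelihood, particle count, temperature schedule and random-walk step size are all illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting: prior N(0, 3^2) and likelihood N(2, 1) as functions of the
# parameter, so the final tempered target (g = 1) is Gaussian with known
# posterior mean 2 / (1 + 1/9) = 1.8.
def log_prior(x):
    return -0.5 * (x / 3.0) ** 2

def log_like(x):
    return -0.5 * (x - 2.0) ** 2

N, R, step = 2000, 5, 1.0
temps = np.linspace(0.0, 1.0, 11)      # 0 = g_0 <= ... <= g_D = 1
x = 3.0 * rng.standard_normal(N)       # equally weighted draws from pi_0

for g_prev, g in zip(temps[:-1], temps[1:]):
    # Reweight: the incremental weight depends only on the current
    # particles, so it is computed before resampling and mutation.
    logw = (g - g_prev) * log_like(x)
    W = np.exp(logw - logw.max())
    W /= W.sum()
    # Resample (multinomial), duplicating high-weight particles.
    x = x[rng.choice(N, size=N, p=W)]
    # Mutate: R random-walk Metropolis steps invariant for pi_g.
    for _ in range(R):
        prop = x + step * rng.standard_normal(N)
        log_a = (log_prior(prop) + g * log_like(prop)
                 - log_prior(x) - g * log_like(x))
        x = np.where(np.log(rng.random(N)) < log_a, prop, x)

print(x.mean())   # close to the exact posterior mean 1.8
```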
An advantage of SMC methods is that an unbiased estimate of the normalizing constant of the target distribution can be obtained as a by-product (Del Moral et al., 2006). This feature is exploited in the SMC$^2$ methods described in Section 2.3.

Particle Filters
SMC methods for dynamic models are known as particle filters. For fixed $\theta$, the sequence of filtering distributions is $\pi_d(x_{1:d}) = p(x_{1:d} \mid y_{1:d}, \theta)$ for $d = 1, \ldots, T$. The bootstrap particle filter of Gordon et al. (1993) uses the transition density as the mutation kernel, $K(x_{d-1}, x_d) = f(x_d \mid x_{d-1}, \theta)$, and selects $L(x_d, x_{d-1}) = 1$ as the backward kernel. The weights are then given by the observation density, $w_d^m = p(y_d \mid x_d^m, \theta)$, for $m = 1, \ldots, N_x$. Algorithm 1 shows pseudo-code for the bootstrap particle filter (Gordon et al., 1993).
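As a concrete illustration of the bootstrap particle filter, the Python sketch below (hedged; the paper's code is MATLAB) runs it on a linear-Gaussian AR(1) state-space model, for which the exact likelihood is available from the Kalman filter for comparison. The model parameters and particle count are illustrative choices.

```python
import numpy as np

def bootstrap_pf(y, phi, sx, sy, Nx, rng):
    """Bootstrap particle filter for x_t = phi*x_{t-1} + sx*e_t,
    y_t = x_t + sy*v_t. Returns the log of an unbiased likelihood estimate."""
    x = sx / np.sqrt(1 - phi**2) * rng.standard_normal(Nx)  # stationary x_0
    loglik = 0.0
    for yt in y:
        x = phi * x + sx * rng.standard_normal(Nx)          # mutate via f(x_t | x_{t-1})
        logw = -0.5 * ((yt - x) / sy) ** 2 - np.log(sy * np.sqrt(2 * np.pi))
        m = logw.max()
        loglik += m + np.log(np.mean(np.exp(logw - m)))     # running estimate
        W = np.exp(logw - m)
        x = x[rng.choice(Nx, size=Nx, p=W / W.sum())]       # multinomial resampling
    return loglik

def kalman_loglik(y, phi, sx, sy):
    """Exact log-likelihood of the same model, for comparison."""
    m, P, ll = 0.0, sx**2 / (1 - phi**2), 0.0
    for yt in y:
        m, P = phi * m, phi**2 * P + sx**2                  # predict
        S = P + sy**2                                       # innovation variance
        ll += -0.5 * ((yt - m)**2 / S + np.log(2 * np.pi * S))
        K = P / S
        m, P = m + K * (yt - m), (1 - K) * P                # update
    return ll

rng = np.random.default_rng(0)
phi, sx, sy, T = 0.9, 0.5, 0.5, 50
x, ys = 0.0, []
for _ in range(T):                                          # simulate toy data
    x = phi * x + sx * rng.standard_normal()
    ys.append(x + sy * rng.standard_normal())
ys = np.asarray(ys)

est = bootstrap_pf(ys, phi, sx, sy, Nx=500, rng=rng)
print(est, kalman_loglik(ys, phi, sx, sy))                  # the two should be close
```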
Let $\psi(x_{1:d}^{1:N_x})$ be the joint distribution of all the random variables drawn during the course of the particle filter (Andrieu et al., 2010). The likelihood estimate in (4) is unbiased in the sense that $\mathbb{E}_{\psi(x_{1:d}^{1:N_x})}\big[\widehat{p}_{N_x}(y_{1:d} \mid \theta, x_{1:d}^{1:N_x})\big] = p(y_{1:d} \mid \theta)$ (Section 7.4.2 of Del Moral, 2004; see also Pitt et al., 2012).

The notation
is used interchangeably throughout the paper.
Algorithm 1 The bootstrap particle filter of Gordon et al. (1993). The index $(m)$ means 'for all $m \in \{1, \ldots, N_x\}$'.

SMC for Static Models
For static models, where inference on $\theta$ is of interest, the sequence of distributions traversed by the SMC algorithm is $\pi_d(\theta) \propto \gamma_d(\theta)$, $d = 0, \ldots, D$, where $\pi_0(\theta) = p(\theta)$ is the prior and $\pi_D(\theta) = p(\theta \mid y_{1:T})$ is the posterior distribution. Assuming that the likelihood function is tractable, there are at least two general ways to construct this sequence:
1. likelihood tempering, which gives $\pi_d(\theta) \propto p(y_{1:T} \mid \theta)^{g_d}\, p(\theta)$ for $d = 0, \ldots, D$, where $0 = g_0 \le \cdots \le g_D = 1$; and
2. data annealing (Chopin, 2002), which gives $\pi_d(\theta) \propto p(y_{1:d} \mid \theta)\, p(\theta)$ for $d = 0, \ldots, T$, where $T$ is the number of observations and $D = T$.
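The two constructions agree at the endpoints of the sequence, which a small Python sketch makes explicit. The vector of per-observation log-likelihood contributions and the log prior value below are hypothetical stand-ins, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
loglik_t = -0.5 - rng.random(20)   # hypothetical log p(y_d | y_{1:d-1}, theta), d = 1..20
log_prior = -1.3                   # hypothetical log p(theta)

def log_tempering(g):
    """Likelihood tempering: log of p(y_{1:T} | theta)^g p(theta), unnormalised."""
    return g * loglik_t.sum() + log_prior

def log_annealing(d):
    """Data annealing: log of p(y_{1:d} | theta) p(theta), unnormalised."""
    return loglik_t[:d].sum() + log_prior

# Both sequences start at the prior (g = 0, d = 0) ...
assert log_tempering(0.0) == log_annealing(0) == log_prior
# ... and end at the same posterior (g = 1, d = T).
assert np.isclose(log_tempering(1.0), log_annealing(20))
```

The intermediate distributions differ, however: tempering flattens the full likelihood, while annealing introduces one observation at a time.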
Typically, SMC for static models uses a mutation kernel which ensures that the current target $\pi_d(\theta)$ remains invariant. A common choice is to use $R$ applications of an MCMC mutation kernel along with the backward kernel $L(\theta_d, \theta_{d-1}) = \gamma_d(\theta_{d-1}) K(\theta_{d-1}, \theta_d) / \gamma_d(\theta_d)$ (Chopin, 2002; Del Moral et al., 2006). The weights then become
$$w_d^n = \frac{\gamma_d(\theta_{d-1}^n)}{\gamma_{d-1}(\theta_{d-1}^n)}. \qquad (5)$$
Since the weights are independent of the mutated particles $\theta_d$, the reweighting step is completed prior to the resample and mutation steps.

SMC$^2$
Standard SMC methods for static models cannot be applied directly to state-space models with unknown parameters $\theta$, except when the integral in (1) is analytically tractable. When the likelihood is intractable, SMC$^2$ replaces it in the sequence of distributions being traversed with a particle filter estimator. Essentially, each parameter particle is augmented with a set of weighted state particles.
Since the likelihood is replaced with a particle filter estimator, the parameter particles in SMC$^2$ are mutated using $R$ applications of a particle MCMC mutation kernel $K(\cdot, \cdot)$. Section 2.4 describes the particle marginal Metropolis-Hastings (PMMH) algorithm. As with SMC for static models, the parameter particle weights are given by (5).
Two general ways to construct the sequence of targets for SMC$^2$ are the density tempered marginalised SMC algorithm of Duan and Fulop (2014) and the data annealing SMC$^2$ method of Chopin et al. (2012), which we refer to as density tempering SMC$^2$ (DT-SMC$^2$) and data annealing SMC$^2$ (DA-SMC$^2$) respectively. These are described in Sections 2.3.1 and 2.3.2.
Algorithm 2 shows pseudo-code which applies to both DT-SMC$^2$ and DA-SMC$^2$. The main difference between the two methods is how the sequence of targets is defined. Sections 2.3.1 and 2.3.2 describe the sequence of targets and the reweighting formulas for DT-SMC$^2$ and DA-SMC$^2$ respectively. For conciseness, we denote the set of weighted state particles associated with parameter particle $n$, $n = 1, \ldots, N_\theta$, at iteration $d$ as $\{x_d^{1:N_x,n}, S_d^{1:N_x,n}\}$, where $S_d^{1:N_x,n}$ is the set of normalised state particle weights. The $n$th parameter particle with its attached set of weighted state particles is denoted as $\vartheta_d^n$.

Density Tempering SMC$^2$
The sequence of distributions for DT-SMC$^2$ is
$$\pi_d(\theta, x_{1:T}^{1:N_x}) \propto p(\theta)\, \big[\widehat{p}_{N_x}(y_{1:T} \mid \theta, x_{1:T}^{1:N_x})\big]^{g_d}\, \psi(x_{1:T}^{1:N_x}), \quad 0 = g_0 \le \cdots \le g_D = 1,$$
which gives the weights from (5) as
$$w_d^n = \big[\widehat{p}_{N_x}(y_{1:T} \mid \theta^n, x_{1:T}^{1:N_x,n})\big]^{g_d - g_{d-1}}. \qquad (6)$$
Due to the tempering parameter $g_d$, DT-SMC$^2$ is only exact at the first and final temperatures, i.e. $p(\theta)\, p(y_{1:T} \mid \theta)^{g_d} / \int p(\theta)\, p(y_{1:T} \mid \theta)^{g_d}\, d\theta$ is a marginal distribution of $\pi_d$ only at $g_d = 0$ and $g_d = 1$.

Algorithm 2 The SMC$^2$ algorithm. The index $(n)$ means 'for all $n \in \{1, \ldots, N_\theta\}$'. Input: data $y_{1:T}$, number of parameter particles $N_\theta$, number of state particles $N_x$, number of MCMC iterations $R$. Output: set of weighted particles $\{\vartheta_D^n, W_D^n\}_{n=1}^{N_\theta}$. At each iteration $d$, the particles are re-weighted from $\pi_{d-1}(\cdot)$ to $\pi_d(\cdot)$ using (6) or (7).

Particle MCMC mutations
The simplest mutation of the parameter particles in SMC$^2$ is a sequence of Markov move steps using the PMMH algorithm; see Gunawan et al. (2021) for alternatives. The PMMH method is a standard Metropolis-Hastings algorithm where the intractable likelihood is replaced by the particle filter estimate in (4). Algorithm 3 shows a single PMMH iteration.
While a PMMH mutation leaves the current target invariant, its acceptance rate is sensitive to the variance of the likelihood estimator (Andrieu et al., 2010). In practice, this means that if the variance is too high, then some particles may not be mutated during the mutation step, even with a large number of MCMC iterations.
In the context of particle MCMC samplers, Andrieu et al. (2010) show that $N_x$ must be chosen as $O(T)$ to achieve reasonable acceptance rates, i.e. a reasonable variance of the likelihood estimator. Pitt et al. (2012), Doucet et al. (2015) and Sherlock et al. (2015) recommend choosing $N_x$ such that the variance of the log-likelihood estimator is between 1 and 3 when evaluated at, e.g., the posterior mean. This generally requires a (potentially time-consuming) tuning process for $N_x$ before running the algorithm.
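This pre-tuning loop can be mimicked with a toy unbiased estimator whose log has variance roughly proportional to $1/N_x$. The Python sketch below is hedged: the number of 'observations', the noise scale, the pilot value of $N_x$ and the estimator itself are illustrative constructions, not the paper's models. Starting from the pilot value, $N_x$ is doubled until the estimated variance of the log-likelihood estimator drops to the recommended range.

```python
import numpy as np

rng = np.random.default_rng(7)
T, sigma = 100, 1.0   # illustrative: 100 'observations', per-term noise scale

def loglik_hat(N):
    """Toy unbiased likelihood estimator: each observation contributes the log
    of an N-sample average of positive weights with mean 1, so the exact
    log-likelihood is 0 and var(log-likelihood) shrinks roughly as 1/N."""
    w = np.exp(sigma * rng.standard_normal((T, N)) - 0.5 * sigma**2)
    return np.log(w.mean(axis=1)).sum()

# Manual pre-tuning: double N_x until the sample variance of k replicate
# log-likelihood estimates is at most 3.
Nx, k = 25, 100
var = np.var([loglik_hat(Nx) for _ in range(k)], ddof=1)
while var > 3.0:
    Nx *= 2
    var = np.var([loglik_hat(Nx) for _ in range(k)], ddof=1)
print(Nx, var)
```

The loop illustrates why this process can be time-consuming: every candidate $N_x$ requires $k$ full particle filter runs.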
Algorithm 3 A single iteration of the particle marginal Metropolis-Hastings algorithm.

For SMC$^2$, fewer particles may be required to achieve reasonable acceptance rates in the early stages of the algorithm. In DA-SMC$^2$, $N_x = O(t)$, where $t = d$, suggests starting with a small $N_x$ and increasing it with each added observation. Likewise, in DT-SMC$^2$, a small $g_d$ will reduce the impact of a highly variable log-likelihood estimator. In addition, unlike particle MCMC methods, it is possible to automatically adapt $N_x$ within SMC$^2$. The next section describes the tuning strategies proposed by Chopin et al. (2012, 2015) and Duan and Fulop (2014).
3 Existing methods to calibrate $N_x$

Stage 1. Triggering the adaptation
It may be necessary to adapt $N_x$ when the mutation step no longer achieves sufficient particle diversity. Chopin et al. (2012, 2015) and Duan and Fulop (2014) fix the number of MCMC iterations ($R$) and change $N_x$ whenever the acceptance rate of a single MCMC iteration falls below some target value. This approach has two main drawbacks. First, the acceptance rate does not take the jumping distances of the particles into account, and it can be made artificially high by making very local proposals. Second, both $R$ and the target acceptance rate must be tuned; even if the exact likelihood is used, the acceptance rate may naturally be low, depending on the form of the posterior and the proposal function used within the mutation kernel. Ideally, $N_x$ and $R$ should be jointly adapted.
Stage 2. Choosing the new number of particles $\widetilde{N}_x$

A new number of state particles, denoted $\widetilde{N}_x$, is determined in the second stage. Chopin et al. (2012) set $\widetilde{N}_x = 2 N_x$ (double), while Duan and Fulop (2014) set $\widetilde{N}_x = \widehat{\sigma}^2_{N_x} \cdot N_x$ (rescale-var), where $\widehat{\sigma}^2_{N_x}$ is the estimated variance of the log-likelihood estimator using $N_x$ state particles. The variance is estimated from $k$ independent estimates of the log-likelihood (for the current SMC target), evaluated at the sample mean of the parameter particles. This choice is motivated by the results of Pitt et al. (2012), Doucet et al. (2015) and Sherlock et al. (2015), who show that $\sigma^2_{N_x} \propto 1/N_x$ for any number of state particles $N_x$. Setting $\sigma^2_{N_x} = \alpha/N_x$ and rearranging gives both $\alpha = \sigma^2_{N_x} \cdot N_x$ and $N_x = \alpha/\sigma^2_{N_x}$. Given $N_x$ and $\widehat{\sigma}^2_{N_x}$, these expressions can be used to find a new number of state particles targeting a variance of 1, i.e. $\widetilde{N}_x = \widehat{\sigma}^2_{N_x} \cdot N_x$. We find that if the initial $N_x$ is too small, then the double scheme of Chopin et al. (2012) can take a significant number of iterations to set $N_x$ to a reasonable value. It can also increase $N_x$ to an unnecessarily high value if the adaptation is triggered when the number of state particles is already large.
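The two update rules are one-liners. A Python sketch under the idealised, noise-free assumption $\sigma^2_{N_x} = \alpha/N_x$ (the value of $\alpha$ is hypothetical) shows that rescale-var jumps straight to $\alpha$, the $N_x$ giving a variance of 1, from any starting point, while double only moves geometrically:

```python
import numpy as np

def double(Nx):                     # Chopin et al. (2012)
    return 2 * Nx

def rescale_var(Nx, var_hat):       # Duan and Fulop (2014): targets variance 1
    return max(1, int(np.ceil(var_hat * Nx)))

alpha = 400.0                       # hypothetical constant in sigma^2 = alpha / Nx
for Nx in (10, 50, 200):
    var_hat = alpha / Nx            # idealised, noise-free variance estimate
    print(Nx, "->", double(Nx), rescale_var(Nx, var_hat))
```

With a real, noisy variance estimate of order $10^2$ at a small $N_x$, the same multiplication is what makes rescale-var overshoot badly, as discussed below.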
While the rescale-var method of Duan and Fulop (2014) is more principled, as it takes the variance of the log-likelihood estimator into account, we find that it is also sensitive to the initial number of particles. For a poorly chosen initial $N_x$, the variance of the log-likelihood estimator can be of order $10^2$ or higher. In this case, scaling the current number of particles by $\widehat{\sigma}^2_{N_x}$ may give an extremely high value for $\widetilde{N}_x$. Chopin et al. (2015) propose a third method; they set $\widetilde{N}_x = \tau/\sigma^2_{N_x}$, where $\tau$ is a model-specific tuning parameter and $\sigma^2_{N_x}$ is the variance of the log-likelihood estimator with $N_x$ state particles. This choice is motivated by the results from Doucet et al. (2012) (an earlier version of Doucet et al. (2015)). See Chopin et al. (2015) for further details. Since the parameter $\tau$ must be tuned manually, this approach is not included in our numerical experiments in Section 5.

Stage 3. Replacing the state particle set
The final stage replaces the current set of state particles $x_{1:d}^{1:N_x}$ with a new set $x_{1:d}^{1:\widetilde{N}_x}$. The reweight method of Chopin et al. (2012) runs a particle filter with $\widetilde{N}_x$ state particles and reweights the parameter particles using a backward kernel $L_d(\cdot, x_d^{1:N_x})$. They use an approximation to the optimal backward kernel (see Proposition 1 of Del Moral et al. (2006)) in which the exact likelihood is replaced by the current likelihood estimate, leading to incremental weights given by the ratio of the new and old likelihood estimates; for density tempering, this ratio is raised to the power $g_d$. The new parameter particle weights are then proportional to the old weights multiplied by these incremental weights. While this method is relatively fast, it can significantly increase the variance of the parameter particle weights (Duan and Fulop, 2014).
As an alternative to reweight, Chopin et al. (2012) propose a conditional particle filter (CPF) step to replace $x_{1:d}^{1:N_x}$. Here, the state particles and the likelihood estimates are updated by running a particle filter conditional on a single trajectory from the current set of state particles. The incremental weight function of this step is 1, which means that the parameter particle weights are left unchanged. The drawback of this approach is that all the state particles must be stored, which can significantly increase the RAM required by the algorithm. Chopin et al. (2015) propose two extensions of the CPF approach which reduce the memory requirements of the algorithm at the expense of increased computation time. Their first proposal is to only store the state particles with descendants at the final time-point, i.e. using a path storage algorithm within the particle filter (Jacob et al., 2015). Their second method is to store the random seed of the pseudo-random number generator in such a way that the latent states and their associated ancestral indices can be re-generated at any point. Both variants still have a higher RAM requirement and run time compared to the reweight method.

Duan and Fulop (2014) propose a reinitialisation scheme to extend the particles (reinit). Whenever $N_x$ is increased, they fit a mixture model $Q(\cdot)$ informed by the current set of particles, then reinitialise the SMC algorithm with $\widetilde{N}_x$ state particles and $Q(\cdot)$ as the initial distribution; the sequence of distributions for DT-SMC$^2$ is modified accordingly. The reinit method aims to minimize the variance of the weights, but we find it can be very slow, as the algorithm may reinitialise numerous times before completion, each time with a larger number of particles. This approach also assumes that the distribution of the set of parameter particles when reinit is triggered is more informative than the prior, which is not necessarily the case if the adaptation is triggered early.

4 Methods
This section describes our proposed approach for each of the three stages involved in adapting the number of state particles.

Triggering the adaptation
Instead of using the acceptance rate to measure particle diversity, we use the expected squared jumping distance (ESJD), which accounts for both the acceptance rate (the probability that the particles will move) and the jumping distance (how far they will move). See Pasarica and Gelman (2010), Fearnhead and Taylor (2013), Salomone et al. (2018) and Bon et al. (2021) for examples of this idea outside the SMC$^2$ context. The ESJD at iteration $d$ is defined in terms of $\|\theta_d' - \theta_d\|^2$, the squared Mahalanobis distance between the current value of the parameters ($\theta_d$) and the proposed value ($\theta_d'$). The ESJD of the $r$th MCMC iteration of the mutation step at iteration $d$ (steps 5-7 of Algorithm 2) can be estimated as
$$\widehat{\mathrm{ESJD}}_{d,r} = \sum_{n=1}^{N_\theta} W_d^n\, \alpha(\theta_d^n, \theta_d^{n,\prime})\, (\theta_d^{n,\prime} - \theta_d^n)^\top \widehat{\Sigma}^{-1} (\theta_d^{n,\prime} - \theta_d^n),$$
where $\widehat{\Sigma}$ is the covariance matrix of the current parameter particle set, and $\alpha(\theta_d^n, \theta_d^{n,\prime})$ is the acceptance probability in (8). The total estimated ESJD for iteration $d$, $\widehat{\mathrm{ESJD}}_d$, is the sum over the $R$ MCMC iterations of the mutation step.

Algorithm 4 outlines how $N_x$ and $R$ are adapted. To summarise, the adaptation is triggered in iteration $d$ if $\widehat{\mathrm{ESJD}}_{d-1}$ is below some target value (stage 1). Once triggered, the number of particles is adapted (stage 2) and the particle set is updated (stage 3). A single MCMC iteration is then run with the new number of particles, and the results from this step are used to determine how many MCMC iterations are required to reach the target ESJD, i.e. $R$ is given by dividing the target ESJD by the estimated ESJD of the single MCMC iteration and rounding up. Once the adaptation is complete, the remaining MCMC iterations are completed. This approach gives a general framework which can be implemented with any of the stage 2 and stage 3 methods described in Section 3, as well as our novel methods in Sections 4.2 and 4.3.
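The ESJD estimate and the resulting choice of $R$ can be sketched as follows. This is a hedged Python illustration: the particle values, proposal perturbations and acceptance probabilities are simulated stand-ins, not the output of a real PMMH kernel, and the target ESJD of 6 mirrors the value used later in the paper.

```python
import numpy as np

def esjd_hat(theta, prop, alpha, Sigma):
    """Estimated ESJD of one MCMC iteration: average over the (equally
    weighted) particles of acceptance probability times the squared
    Mahalanobis distance between current and proposed values."""
    P = np.linalg.inv(Sigma)
    d = prop - theta
    m2 = np.einsum('ni,ij,nj->n', d, P, d)   # squared Mahalanobis distances
    return float(np.mean(alpha * m2))

rng = np.random.default_rng(3)
N, p = 500, 2
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # covariance of the particle set
theta = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
prop = theta + rng.multivariate_normal(np.zeros(p), 0.1 * Sigma, size=N)
alpha = rng.uniform(0.2, 0.8, size=N)        # stand-in acceptance probabilities

e1 = esjd_hat(theta, prop, alpha, Sigma)     # ESJD of a single iteration
R = int(np.ceil(6.0 / e1))                   # iterations to reach ESJD_target = 6
print(e1, R)
```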

Choosing the new number of particles $\widetilde{N}_x$

To set the new number of state particles $\widetilde{N}_x$, we build on the rescale-var method of Duan and Fulop (2014), which adapts the number of state particles as follows.
1. Calculate $\bar{\theta}_d$, the mean of the current set of parameter samples $\theta_d^{1:N_\theta}$.
2. Run the particle filter with $N_x$ state particles $k$ times to get $k$ estimates of the log-likelihood evaluated at $\bar{\theta}_d$, and calculate their sample variance $\widehat{\sigma}^2_{N_x}$.
3. Set $\widetilde{N}_x = \widehat{\sigma}^2_{N_x} \cdot N_x$.

In practice, we find that rescale-var changes $N_x$ too drastically from one iteration to the next for two reasons. First, the sample variance may itself be highly variable, especially when $N_x$ is small. Second, the sample mean of the parameter particles changes throughout the iterations, meaning that the number of state particles needed to reach a variance of 1 also changes throughout the iterations. The sample mean may also be a poor value at which to estimate the likelihood if the current target is multimodal or if the current set of parameter particles offers a poor Monte Carlo approximation to the current target distribution. The latter may occur if the number of parameter particles $N_\theta$ is too low.
Our first attempt to overcome some of these problems is to scale the number of state particles by the standard deviation instead of the variance, i.e. we set $\widetilde{N}_x = \widehat{\sigma}_{N_x} \cdot N_x$ and call this method rescale-std. A variance of 1 is still the overall target; however, more moderate values of $\widetilde{N}_x$ are proposed when $\widehat{\sigma}^2_{N_x} \neq 1$. At any given iteration, the new target variance is the current standard deviation, i.e. $\widetilde{N}_x$ is chosen such that $\widehat{\sigma}^2_{\widetilde{N}_x} = \widehat{\sigma}_{N_x}$. The main drawback of rescale-std is that the variance at the final iteration may be too high, depending on the initial value of $N_x$ and the variability of the sample variance between iterations, i.e. it may approach a variance of 1 too slowly. In our numerical experiments in Section 5, however, we find that the final variance of the rescale-std method is generally between 1 and $1.2^2$, which is fairly conservative. In their numerical experiments, Doucet et al. (2015) found that the optimal $N_x$ generally gives a variance that is between $1.2^2 = 1.44$ and $1.5^2 = 2.25$.
Our second method (which we refer to as novel-var) aims to improve upon rescale-var by estimating the variance at different values of $N_x$. To obtain our set of candidate values, $N_{x,1:M}$, we scale $N_x$ by different fractional powers of $s = \widehat{\sigma}^2_{N_x} / \sigma^2_{\mathrm{target}}$, where $\sigma^2_{\mathrm{target}}$ is the target variance. Note that the candidate values $N_{x,1:M}$ will be close to the current $N_x$ when $\widehat{\sigma}^2_{N_x}$ is close to $\sigma^2_{\mathrm{target}}$. To avoid unnecessary computation, the current $N_x$ is left unchanged if $\widehat{\sigma}^2_{N_x}$ falls within some range $\sigma^2_{\min} < \sigma^2_{\mathrm{target}} < \sigma^2_{\max}$. We also round the candidate numbers of state particles up to the nearest 10, which ensures that there is at least a difference of 10 between each $N_{x,m} \in N_{x,1:M}$. Once $N_{x,1:M}$ has been obtained, the variance is estimated for each $N_{x,m} \in N_{x,1:M}$, and the new number of state particles is set to the $N_{x,m}$ that has the highest variance less than or equal to $\sigma^2_{\max}$. In our numerical experiments in Section 5, we use powers ranging from 0.5 to 1, which gives candidate values ranging from rescale-std ($s^{0.5} \cdot N_x$) to rescale-var ($s^{1} \cdot N_x$). The target, minimum and maximum variances are $\sigma^2_{\mathrm{target}} = G \cdot 1$, $\sigma^2_{\min} = G \cdot 0.95^2$ and $\sigma^2_{\max} = G \cdot 1.05^2$ respectively, where $G = 1$ for DA-SMC$^2$ and $G = 1/\max(0.6^2, g_d^2)$ for DT-SMC$^2$. These values are fairly conservative and aim to keep the final variance between $0.95^2 \approx 0.9$ and $1.05^2 \approx 1.1$.
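The candidate-generation step of novel-var can be sketched in Python as follows; the particular set of fractional powers is an illustrative choice consistent with the stated range from 0.5 to 1, not necessarily the exact set used in the paper.

```python
import numpy as np

def candidates(Nx, var_hat, var_target,
               powers=(0.5, 0.625, 0.75, 0.875, 1.0)):
    """Candidate state-particle numbers for novel-var: scale Nx by
    fractional powers of s = var_hat / var_target, rounding up to the
    nearest 10. Power 0.5 recovers rescale-std and 1.0 rescale-var
    (for a target variance of 1). The powers here are illustrative."""
    s = var_hat / var_target
    cand = {int(np.ceil(s**p * Nx / 10.0)) * 10 for p in powers}
    return sorted(cand)

# With var_hat = 9 and target 1, candidates range from 3*Nx to 9*Nx.
print(candidates(Nx=100, var_hat=9.0, var_target=1.0))
# -> [300, 400, 520, 690, 900]
```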
The parameter $G$ is used to take advantage of the effect of the tempering parameter on the variance, i.e. $\mathrm{var}(\log \widehat{p}_{N_x}(y \mid \theta)^{g_d}) = g_d^2 \cdot \mathrm{var}(\log \widehat{p}_{N_x}(y \mid \theta))$. Capping the value of $G$ is necessary in practice, since aiming for an excessive variance is difficult due to the variability of the variance estimate when $N_x$ is low. By setting $G = 1/\max(0.6^2, g_d^2)$, the highest variance targeted is $1/0.36 \approx 2.8$. In general, we recommend not aiming for a variance that is greater than 3 (Sherlock et al., 2015). Note that including the tempering parameter in this way is infeasible for rescale-var or rescale-std. For the former, changing the target variance only exacerbates the problem of too-drastic changes of $N_x$ between iterations. This is largely due to the increased variability of the sample variance when $g_d < 1$. While the variability of $\widehat{\sigma}^2_{N_x}$ is less of a problem for rescale-std, this method struggles to keep up with the increasing variance target.
Compared to rescale-var, we find that both rescale-std and novel-var are significantly less sensitive to the initial number of state particles, to sudden changes in the variance arising from changes in the sample mean of the parameter particles, and to variability in the estimated variance of the log-likelihood estimator. The novel-var method is also more predictable than rescale-std in terms of the variance targeted at each iteration.
Our final method (novel-esjd) also compares different values of $N_x$, but using the ESJD instead of the variance of the log-likelihood estimator. As before, the choice of candidate values $N_{x,1:M}$ is flexible; in the numerical experiments in Section 5, we use the same form of candidate set as for novel-var, where $G = 1$ for DA-SMC$^2$ and $G = 1/\max(0.6^2, g_d^2)$ for DT-SMC$^2$. Again, each $N_{x,m} \in N_{x,1:M}$ is rounded up to the nearest 10. A score is calculated for a particular $N_{x,m} \in N_{x,1:M}$ by first doing a mutation step with $N_{x,m}$ state particles, then calculating the number of MCMC iterations ($R_m$) needed to reach the ESJD target; the score for $N_{x,m}$ is $(N_{x,m} \cdot R_m)^{-1}$. Algorithm 5 describes the adaptive mutation step when using novel-esjd. Since the candidate $N_x$ values are tested in ascending order (see step 2 of Algorithm 5), it is unnecessary to continue testing the values once the score starts to decrease (steps 8-17 of Algorithm 5). This method does not target a particular variance, but instead aims to select the $N_x$ with the cheapest mutation that still achieves the ESJD target. Compared to double and the variance-based methods, we find that novel-esjd is consistent between independent runs, in terms of both the run time and the adaptation of $N_x$. It is also relatively insensitive to the initial number of state particles, as well as to variability in the variance of the likelihood estimator.
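The scoring-and-early-stopping logic can be sketched as follows. The candidate values and single-iteration ESJD estimates below are hypothetical trial results, not output of a real mutation step; the target ESJD of 6 mirrors the value used in the experiments.

```python
import numpy as np

# Hypothetical trial results: ascending candidate N_x values and the
# estimated single-iteration ESJD of a trial mutation at each.
ESJD_target = 6.0
cand = [100, 200, 400, 800]
esjd1 = [0.05, 0.40, 0.60, 0.70]

best_Nx, best_score = None, -np.inf
for Nx, e in zip(cand, esjd1):
    R = int(np.ceil(ESJD_target / e))   # MCMC iterations needed for the target
    score = 1.0 / (Nx * R)              # cheaper total mutation cost scores higher
    if score <= best_score:
        break                           # scores started decreasing: stop testing
    best_Nx, best_score = Nx, score
print(best_Nx)                          # -> 200
```

Here the smallest candidate needs so many iterations that it is the most expensive overall, while the larger candidates buy too little extra ESJD for their cost; the middle value wins.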
Ideally, the adaptation algorithm (Algorithm 4 or Algorithm 5) will only be triggered if $N_x$ or $R$ are too low (or too high, as mentioned in Section 5). In practice, the ESJD is variable, so the adaptation may be triggered more often than necessary. Allowing the number of state particles to decrease helps to keep the value of $N_x$ reasonable. Also, if the estimated variance is close to the target variance, one of the candidate $N_x$ values will be close in value to the current $N_x$. See Table 1.

Algorithm 5 The adaptive mutation step with novel-esjd. Key steps: calculate the set of candidate values $N_{x,1:M}$ (e.g. using (10)) and sort in ascending order; for each candidate $N_{x,m}$, replace the current set of state particles with $x_d^{1:N_{x,m}}$ using the method described in Section 4.3 and calculate the score $z_m = (N_{x,m} \cdot R_m)^{-1}$; then complete the PMMH mutation $\vartheta_{d,r}^{1:N_\theta} \sim K(\vartheta_{d,r-1}^{1:N_\theta}, \cdot)$.

Replacing the state particle set
Our final contribution (denoted replace) is a variation of the reweight scheme of Chopin et al. (2012). Both reweight and replace consist of three steps. First, a particle filter (Algorithm 1) is run with the new number of state particles to obtain $\widehat{p}_{\widetilde{N}_x}(y_{1:d} \mid \theta_d, \tilde{x}_{1:d}^{1:\widetilde{N}_x})$ and $\tilde{x}_{1:d}^{1:\widetilde{N}_x}$. Second, the parameter particle weights are updated as $W_d^n \propto \mathrm{IW}_d^n$, where $\mathrm{IW}_d^n$ is the incremental weight for parameter particle $n$, $n = 1, \ldots, N_\theta$, at iteration $d$. Finally, the previous likelihood estimate and set of state particles are discarded. Note that prior to this reweighting step, the parameter particles are evenly weighted, as the adaptation of $N_x$ is performed after the resampling step, i.e. $W_d^n = 1/N_\theta$ for $n = 1, \ldots, N_\theta$.
With the reweight method, the incremental weights for DA-SMC$^2$ are obtained by replacing $p(y_{1:d} \mid \theta_d)$ with $\widehat{p}_{N_x}(y_{1:d} \mid \theta_d, x_{1:d}^{1:N_x})$ to approximate the optimal backward kernel. This gives
$$\mathrm{IW}_d^n = \frac{\widehat{p}_{\widetilde{N}_x}(y_{1:d} \mid \theta_d^n, \tilde{x}_{1:d}^{1:\widetilde{N}_x,n})}{\widehat{p}_{N_x}(y_{1:d} \mid \theta_d^n, x_{1:d}^{1:N_x,n})};$$
see Section 3 for details. For DT-SMC$^2$, the incremental weights are the same ratio with both estimates raised to the power $g_d$. The replace method uses a different approximation to the optimal backward kernel. For DA-SMC$^2$, instead of using $p(y_{1:d} \mid \theta_d) \approx \widehat{p}_{N_x}(y_{1:d} \mid \theta_d, x_{1:d}^{1:N_x})$, we use $p(y_{1:d} \mid \theta_d) \approx \widehat{p}_{\widetilde{N}_x}(y_{1:d} \mid \theta_d, \tilde{x}_{1:d}^{1:\widetilde{N}_x})$. Using the resulting backward kernel, the incremental weights are $\mathrm{IW}_d^n = 1$. Similarly for DT-SMC$^2$, the approximation based on the new likelihood estimate leads to incremental weights of 1. Since the incremental weights reduce to 1, the replace approach introduces no extra variability in the parameter particle weights. As a result, replace leads to less variability in the mutation step compared to the reweight method of Chopin et al. (2012), i.e. the parameter particles remain evenly weighted throughout the mutation step. We also find that it is generally faster than the reinit method of Duan and Fulop (2014).
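The practical effect of the unit incremental weights can be illustrated numerically. In the Python sketch below, the log-likelihood estimates are simulated as noisy values around a hypothetical true log-likelihood (all numbers illustrative): reweight's ratio of two noisy estimates lowers the effective sample size of the parameter particles, while replace leaves it at $N_\theta$.

```python
import numpy as np

rng = np.random.default_rng(5)
N_theta = 1000
true_ll = -120.0                     # hypothetical log p(y_{1:d} | theta^n)

def noisy_ll(sd):
    """Stand-in log-likelihood estimates with the given standard deviation."""
    return true_ll + sd * rng.standard_normal(N_theta) - 0.5 * sd**2

old_ll = noisy_ll(1.0)               # estimates with the old N_x
new_ll = noisy_ll(0.5)               # estimates with the new N_x

def ess(log_iw):
    """Effective sample size of particles with the given log weights."""
    W = np.exp(log_iw - log_iw.max())
    W /= W.sum()
    return 1.0 / np.sum(W**2)

ess_reweight = ess(new_ll - old_ll)  # incremental weight = new / old estimate
ess_replace = ess(np.zeros(N_theta)) # replace: incremental weight = 1
print(round(ess_reweight), round(ess_replace))
```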

Practical Considerations
The framework introduced in this section has a number of advantages over the existing methods. Most notably, the adaptation of $R$ is automated, the stage 2 options (rescale-std, novel-var and novel-esjd) are less sensitive to variability in the estimated variance of the log-likelihood estimator, and the parameter particle weights are unchanged by the adaptation of $N_x$.
Two tuning parameters remain to be specified for this method: the target ESJD ($\mathrm{ESJD}_{\mathrm{target}}$) and the number of samples used to estimate the variance of the log-likelihood estimator ($k$). In our numerical experiments in Section 5, we use $\mathrm{ESJD}_{\mathrm{target}} = 6$ and $k = 100$, which both give reasonable empirical results. The target ESJD has little effect on the value of $N_x$, due to the structure of the updates described in Section 4.2, but it directly controls $R$. Likewise, $k$ controls the variability of $\widehat{\sigma}^2_{N_x}$. Recall that $\widehat{\sigma}^2_{N_x}$ is the estimated variance of the log-likelihood estimator with $N_x$ state particles, evaluated at the mean of the current set of parameter particles ($\bar{\theta}_d$). Ideally, the value of $k$ should change with $N_x$ and $\bar{\theta}_d$; however, it is not obvious how to do this. In general, we find that if $\sigma^2_{N_x} \approx \widehat{\sigma}^2_{N_x}$ is high, then the variance of $\widehat{\sigma}^2_{N_x}$ also tends to be high.
Determining optimal values of $\mathrm{ESJD}_{\mathrm{target}}$ and $k$ is beyond the scope of this paper, but a general recommendation is to follow Salomone et al. (2018) and set $\mathrm{ESJD}_{\mathrm{target}}$ to the weighted average of the Mahalanobis distance between the parameter particles immediately before the resampling step. We also recommend choosing $k$ such that the variance of $\widehat{\sigma}^2_{N_x}$ is low ($< 0.1$) when $\widehat{\sigma}^2_{N_x} \approx 1$, i.e. the estimate of $\widehat{\sigma}^2_{N_x}$ should have low variance when it is around the target value. This value of $k$ may be difficult to obtain, but again, we find that $k = 100$ gives reasonable performance across all the examples in Section 5. To mitigate the effect of a highly variable $\widehat{\sigma}^2_{N_x}$, it is also helpful to set a lower bound on the value of $N_x$, as well as an upper bound if a sensible one is known. An upper bound is also useful to restrict the amount of computational resources used by the algorithm.

Implementation
The methods are evaluated on a simple Brownian motion model, the one-factor stochastic volatility (SV) model in Chopin et al. (2012), and two ecological models: the theta-logistic model (Peters et al., 2010;Drovandi et al., 2022) and the noisy Ricker model (Fasiolo et al., 2016).
The code is implemented in MATLAB and is available at https://github.com/imkebotha/adaptiveexact-approximate-smc. The likelihood estimates are obtained using the bootstrap particle filter (Algorithm 1) with adaptive multinomial resampling, i.e. resampling is done whenever the effective sample size (ESS) drops below $N_x/2$. The results for all models except the Ricker model are calculated from 50 independent runs, each with $N_\theta = 1000$ parameter samples. Due to time and computational constraints, the Ricker model results are based on 20 independent runs, each with $N_\theta = 400$ parameter samples.
For DT-SMC², the temperatures are set adaptively using the bisection method (Jasra et al., 2010) to aim for an ESS of 0.6·N_θ. Similarly, the resample-move step of DA-SMC² is run whenever the ESS falls below 0.6·N_θ. As discussed in Section 4.4, a target ESJD of 6 is used, and the sample variance σ̂²_{N_x} for rescale-var, rescale-std, novel-var and novel-esjd is calculated from k = 100 log-likelihood estimates. For all methods except reinit and double, we also trigger the adaptation whenever ESJD_{t-1} > 2·ESJD_target; this allows the algorithm to recover if the values of N_x and/or R are set too high at any given iteration, which may occur, e.g., with DA-SMC² if there are outliers in the data. When the reinit method is used, a mixture of three Gaussians is fitted to the current sample when reinitialising the algorithm.
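The bisection search for the next temperature can be sketched as follows. This is a minimal Python sketch under the usual adaptive-tempering setup: the next exponent g is chosen on [g_prev, 1] so that the ESS of the incremental weights exp((g − g_prev)·loglik_i) hits the target fraction of N_θ.

```python
import numpy as np

def next_temperature(loglik, g_prev, ess_frac=0.6, tol=1e-8):
    """Find the next tempering exponent g in [g_prev, 1] such that the ESS
    of the incremental weights w_i ∝ exp((g - g_prev) * loglik_i) equals
    ess_frac * N_theta, using bisection (ESS decreases as g grows)."""
    N_theta = len(loglik)
    target = ess_frac * N_theta

    def ess(g):
        logw = (g - g_prev) * loglik
        logw = logw - logw.max()        # stabilise before exponentiating
        w = np.exp(logw)
        w /= w.sum()
        return 1.0 / np.sum(w**2)

    if ess(1.0) >= target:              # can jump straight to the posterior
        return 1.0
    lo, hi = g_prev, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess(mid) < target:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

When ess(1.0) already exceeds the target, the sampler moves directly to the final distribution, so the number of temperatures adapts to the difficulty of the problem.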
The methods are compared based on the mean squared error (MSE) of the posterior mean averaged over the parameters, where the ground truth is taken as the posterior mean from a PMMH chain of length 1 million. As the gold standard (GS), DT-SMC² and DA-SMC² are also run for each model with a fixed number of state particles, while still adapting R. For each of these runs, the number of state particles is tuned such that σ̂²_{N_x} ≈ 1 for the full dataset, and this extra tuning time is not included in the results.
We use the MSE and the total number of log-likelihood evaluations (TLL) of a given method as measures of its accuracy and computational cost, respectively. Each time the particle filter is run for a particular parameter particle, TLL is incremented by N_x × t, where t is the current number of observations. The product of the MSE and TLL of a method measures its overall (inverse) efficiency. Scores for the accuracy, computational cost and overall efficiency of a given method relative to the gold standard are calculated as

Z_method,MSE := MSE_GS / MSE_method,   Z_method,TLL := TLL_GS / TLL_method,   Z_method := Z_method,MSE × Z_method,TLL.
Higher values are better.
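In code, these relative scores amount to simple ratios (a minimal Python sketch):

```python
def relative_scores(mse_gs, tll_gs, mse_method, tll_method):
    """Relative accuracy (Z_MSE), cost (Z_TLL) and overall efficiency (Z)
    of a method against the gold standard; higher values are better."""
    z_mse = mse_gs / mse_method
    z_tll = tll_gs / tll_method
    return z_mse, z_tll, z_mse * z_tll
```

For example, a method with twice the MSE but half the TLL of the gold standard scores Z = 0.5 × 2 = 1, i.e. equal overall efficiency.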
The adaptive mutation step in Algorithm 4 is used for all methods except novel-esjd, which uses the adaptive mutation step in Algorithm 5. The options for stage 2 are double, rescale-var, rescale-std, novel-var and novel-esjd. Likewise, the options for stage 3 are reweight, reinit and our novel method replace. Since the aim of the novel-var method is to regularly increase the number of state particles throughout the iterations, the combination of novel-var with reinit is not tested. Similarly, due to the number of times N_x is updated when using novel-esjd, only the combination of novel-esjd with replace is tested. For all combinations (excluding double and reinit), we allow the number of state particles to decrease. Due to computational constraints, we also cap the number of state particles at 5 times the number used for the gold standard method. Note that the double method cannot decrease N_x, and reinit assumes that N_x increases throughout the iterations, as the entire algorithm is reinitialised whenever N_x is updated.
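For illustration, the simpler stage 2 update rules can be sketched as follows. The double and rescale-var rules follow the text (N_x → 2·N_x, and N_x → σ̂²_{N_x}·N_x so that the new variance is approximately 1); the rescale-std rule and the placement of the tempering factor G are our assumptions and should be read as hypothetical.

```python
import math

def stage2_candidate(method, N_x, var_hat, G=1.0):
    """Candidate new number of state particles for simple stage 2 options.
    var_hat is the estimated variance of the log-likelihood estimator;
    G accounts for the tempering parameter in DT-SMC2 (G = 1 for DA-SMC2).
    The rescale-std rule below is an assumed, std-based variant."""
    if method == "double":
        new = 2 * N_x                          # always doubles, never decreases
    elif method == "rescale-var":
        new = G * var_hat * N_x                # aim for unit variance
    elif method == "rescale-std":
        new = G * math.sqrt(var_hat) * N_x     # assumption: rescale by the std
    else:
        raise ValueError(f"unknown stage 2 method: {method}")
    return max(1, int(round(new)))
```

The novel-var and novel-esjd options are more involved, as they score a set of candidate values rather than applying a single closed-form update.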
To compare the different stage 2 methods, we also plot the evolution of N_x for each example. Recall that N_x = O(t) for DA-SMC², and that var(log p̂_{N_x}(y | θ)^g) = g²·var(log p̂_{N_x}(y | θ)) for DT-SMC². Based on these two results, a roughly linear increase in N_x is desired: linear in time for DA-SMC² and linear in g² for DT-SMC². Section A of the Appendix shows marginal posterior density plots. Section B of the Appendix gives extra results for the stochastic volatility model with N_θ = 100 and N_θ = 500, to test the methods with fewer parameter particles.
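The g² scaling is simply the identity var(g·X) = g²·var(X) applied to the log of the tempered likelihood; a quick numerical check on simulated log-likelihood values:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy stand-in for repeated log-likelihood estimates at a fixed parameter value
loglik = rng.normal(-500.0, 1.2, size=10_000)

for g in (0.25, 0.5, 1.0):
    # Tempering the likelihood by g scales its log by g, so the variance of
    # the log-likelihood estimator scales by g^2 -- hence the desired growth
    # of N_x is linear in g^2 for DT-SMC2.
    assert np.isclose(np.var(g * loglik), g**2 * np.var(loglik))
```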
Brownian Motion Model

Results for all stage 2 and stage 3 combinations are obtained for initial N_x values of 10 and 100. The variance of the log-likelihood estimator is around 95 for N_x = 10 and around 2.7 for N_x = 100. The gold standard method is run with 240 state particles.
Table 2 shows the scores for the three stage 3 options (reweight, reinit and replace), averaged over the two initial values of N_x. Note that these scores are relative to reweight rather than the gold standard. Apart from DT-SMC² with double, where reinit is faster than replace, replace consistently outperforms reweight and reinit in terms of statistical and computational efficiency. Interestingly, reinit generally outperforms reweight with rescale-std and rescale-var, but not with double. The performance of reinit depends greatly on the number of times the algorithm is reinitialised and on the final number of state particles, and this is generally reflected in the computation time.
Tables 3 and 4 show the scores relative to the gold standard for all the replace combinations. novel-esjd has the best overall score, followed by novel-var for DT-SMC² and rescale-var for DA-SMC². double performs well on DT-SMC² but poorly on DA-SMC²: it has good statistical efficiency, but is much slower than the other methods. Interestingly, the computational efficiency of the adaptive methods is generally higher than that of the gold standard, but their accuracy for DA-SMC² is generally lower. This may be due to high variability in the variance of the log-likelihood estimator and in the mean of the parameter particles during the initial iterations of DA-SMC². Since fewer observations are used to estimate the likelihood in these early iterations (t < T), the mean of the parameter particles can change drastically from one iteration to the next, leading to similarly drastic changes in the sample variance of the log-likelihood estimator.
Figure 1 shows the evolution of N_x for replace and an initial N_x of 10. Based on these plots, double, novel-var and novel-esjd have the most efficient adaptation for DT-SMC², and novel-esjd has the most efficient adaptation for DA-SMC², which corresponds with the results for Z_TLL and Z in Tables 3 and 4.

Table 3: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for DT-SMC² for the Brownian motion model using the replace method (higher values are preferred). The gold standard refers to SMC² with a fixed number of state particles.

Table 4: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for DA-SMC² for the Brownian motion model using the replace method (higher values are preferred). The gold standard refers to SMC² with a fixed number of state particles.

Stochastic Volatility Model
Our second example is the one-factor stochastic volatility model used in Chopin et al. (2012). The transition density of this model cannot be evaluated point-wise, but it can be simulated from.
Results for all stage 2 and stage 3 combinations are obtained for initial N_x values of 300 and 600. The variance of the log-likelihood estimator is around 7 for 300 state particles and around 3 for 600 state particles. The gold standard method is run with 1650 state particles.
Table 5 shows the scores for the three stage 3 options, relative to reweight and averaged over the two initial N_x values. replace consistently outperforms reweight and reinit in terms of overall efficiency.
Tables 6 and 7 show the scores for all the replace combinations. All methods perform similarly for this model. In terms of accuracy (measured by the MSE), the optimal variance of the log-likelihood estimator seems to be smaller for this model than for the others. However, the efficiency of a smaller variance coupled with the increased computation time is fairly similar to the efficiency of a larger variance with cheaper computation. In this example, novel-esjd has the highest MSE, but the lowest computation time.
Figure 2 shows the evolution of N_x for replace and an initial N_x of 300. Based on these plots, double and novel-esjd have the most efficient adaptation for DT-SMC², and all methods except double give good results for DA-SMC². These methods correspond to those with the quickest run time (lowest TLL), but not to the ones with the best overall efficiency.

Table 7: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for DA-SMC² for the stochastic volatility model using the replace method (higher values are preferred). The gold standard refers to SMC² with a fixed number of state particles.
Theta-Logistic Model

Scores for the accuracy, computational cost and overall efficiency are obtained for initial N_x values of 700 and 2400. The variance of the log-likelihood estimator is around 40 for 700 state particles and around 3 for 2400 state particles. The gold standard method is run with 4600 state particles. Due to time constraints, results for the double method with reweight and an initial N_x = 700 are not available for DA-SMC².
Table 8 shows the scores for the three stage 3 options, averaged over the initial N_x values and relative to reweight. Except for double with DA-SMC², both reinit and replace outperform reweight, but the results for reinit and replace are mixed. The performance of reinit depends greatly on the number of times the adaptation is triggered. On average, the algorithm is reinitialised fewer times for rescale-std in this example than in the others.
Tables 9 and 10 show the scores for all the replace combinations relative to the gold standard. In this example, novel-esjd outperforms all other methods, followed by novel-var and rescale-var. Unlike in the previous examples, double and rescale-std perform poorly here. The gold standard and double have the best MSE for this example, but the worst computation time. The remaining methods have a poor MSE, which is mostly due to the parameter σ, as Figure 7 in Section A of the Appendix shows. The gold standard is the only method that achieves a good result for σ.
Figure 3 shows the evolution of N_x for replace and an initial N_x of 700. novel-esjd seems to have the least variable evolution for both DT-SMC² and DA-SMC² compared to the other methods. Again, this is reflected in the values of Z_TLL, particularly in Tables 9 and 10.

Table 10: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for DA-SMC² for the theta-logistic model using the replace method (higher values are preferred). The gold standard refers to SMC² with a fixed number of state particles.

Noisy Ricker Model
Our final example is the noisy Ricker population model (Fasiolo et al., 2016),

x_{t+1} = r · x_t · exp(−x_t + z_{t+1}),   z_t ~ N(0, σ²).
The transition density of the Ricker model cannot be evaluated point-wise; however, it is straightforward to generate x_t from it, conditional on x_{t−1}. This model and its variants are typically used to represent highly non-linear or near-chaotic ecological systems, e.g. the population dynamics of sheep blowflies (Fasiolo et al., 2016). Fasiolo et al. (2016) show that the likelihood function of the noisy Ricker model exhibits extreme multimodality when the process noise is low, making the model difficult to estimate.
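Forward simulation of the latent process is indeed straightforward; the following is a minimal Python sketch of the transition given above, with default parameter values matching our simulation study (r = 44.7, σ = 0.6) and an assumed initial state x0.

```python
import numpy as np

def simulate_ricker(T, r=44.7, sigma=0.6, x0=1.0, rng=None):
    """Simulate the latent process of the noisy Ricker model:
    x_{t+1} = r * x_t * exp(-x_t + z_{t+1}),  z_t ~ N(0, sigma^2).
    The transition density is intractable, but forward simulation is easy,
    which is all the bootstrap particle filter requires."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(T)
    x[0] = x0  # assumed initial state for illustration
    for t in range(1, T):
        z = rng.normal(0.0, sigma)
        x[t] = r * x[t - 1] * np.exp(-x[t - 1] + z)
    return x
```

The boom-bust trajectories this recursion produces (large x_t forces a near-collapse at t + 1, followed by rapid regrowth) are what make the likelihood surface so irregular.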
We draw 700 observations using θ := (log φ, log r, log σ) = (log 10, log 44.7, log 0.6). Following Fasiolo et al. (2016), we assign uniform priors to the log-parameters: U(log φ | 1.61, 3), U(log r | 2, 5) and U(log σ | −1.8, 1). Scores for the accuracy, computational cost and overall efficiency are obtained for initial N_x values of 1000 and 20000. The variance of the log-likelihood estimator is around 13 for 1000 state particles and around 2.3 for 20000 state particles. The gold standard method is run with 90000 state particles. Due to time constraints, the ground truth for the posterior mean is based on a PMMH chain of length 200000.
An experiment was stopped if its run time exceeded 9 days. As a result, a full comparison of the stage 3 options cannot be made. Of the experiments that finished, replace had the best results in terms of overall efficiency; on average, it outperformed reinit and reweight by at least a factor of 2. In a number of cases, the gold standard and replace were the only methods to finish within the time frame. Tables 11 and 12 show the scores for the replace combinations. novel-var and novel-esjd have the best overall results across both DT-SMC² and DA-SMC² for this example, while rescale-std and rescale-var perform similarly.
Figure 4 shows the evolution of N_x for replace and an initial N_x of 1000. All methods show a fairly smooth increase in N_x over the iterations.

Table 11: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for DT-SMC² for the noisy Ricker model using the replace method (higher values are preferred). The gold standard refers to SMC² with a fixed number of state particles.

Discussion
We introduce a fully automatic SMC² algorithm for parameter inference in intractable likelihood state-space models. Of the methods used to select the new number of state particles, novel-esjd gives the most consistent results across all models, choices of initial N_x, and between DT-SMC² and DA-SMC². This method uses the ESJD to determine which N_x, from a set of candidate values, is predicted to give the cheapest mutation; this value is selected as the new number of state particles. novel-esjd generally outperforms the other methods in terms of computational and overall efficiency. A significant advantage of novel-esjd is that the adaptation of N_x is consistent across independent runs of the algorithm (i.e. when starting from different random seeds), substantially more so than for the other methods.
Similarly, the replace method typically shows a great improvement over reweight and reinit. replace modifies the approximation to the optimal backward kernel used by reweight; this modification means that, unlike reweight, replace leaves the parameter particle weights unchanged. We also find that replace is generally more reliable than reinit.
Our novel SMC² algorithm has three tuning parameters that must be set: the target ESJD for the mutation step, the number of log-likelihood evaluations for the variance estimation (k) and the initial number of state particles. Determining optimal values of the target ESJD and k is beyond the scope of this paper, but tuning strategies are discussed in Section 4.4. While any initial number of state particles can be used, a small value yields the most efficient results. Compared to the currently available methods, the new approach requires minimal tuning, gives consistent results and is straightforward to use with both data annealing and density tempering SMC². We also find that the adaptive methods generally outperform the gold standard, despite the latter being pre-tuned.
An interesting extension of the current work would be to assess the effect of the target ESJD, the target ESS and the target variance of the log-likelihood estimator when SMC² is used for model selection. Another area of future work is extending the method to mixed effects models (Botha et al., 2021); for these models, it may be possible to obtain significant gains in efficiency by allowing the number of state particles to (adaptively) vary between subjects. The new method can also be used as the proposal function within importance sampling squared (Tran et al., 2020).
One area of future work is to incorporate more advanced particle filters into our framework, e.g. the adaptive particle filters of Bhadra and Ionides (2016), Crisan and Míguez (2018) and Lee and Whiteley (2018). Another is to adapt the number of parameter particles (N_θ) for a specific purpose, e.g. estimation of a particular parameter or subset of parameters. This may reduce the computational resources needed, and applies to SMC methods in general.

A Marginal Posterior Plots
In this section, we show the marginal posterior density plots for the examples in Sections 5.2-5.5. Figures 5-8 show the marginal posterior density plots for each example and method. Note that the results shown are for replace using the combined samples from the independent runs, i.e. the marginal posteriors are based on 50 × 1000 samples for the Brownian motion, stochastic volatility and theta-logistic models, and 20 × 400 samples for the Ricker model. The results shown are for a low initial N_x. It is clear from the plots that the marginal posterior densities are similar across the adaptive methods; the biggest differences are between DT-SMC² and DA-SMC², not between the adaptive methods. Figures 5, 6 and 8 show marginal posteriors from SMC² that are very similar to those from MCMC. Figure 7 shows similar marginal posteriors for the theta-logistic model from SMC² and MCMC for all parameters except log σ. This parameter corresponds to the log of the measurement error in the nutria population data (see Section 5.4 of the main paper). Here, the adaptive SMC² methods struggle to accurately capture the lower values of log σ with posterior support. SMC² with a higher, fixed number of state particles (the gold standard method) does not have the same issue, suggesting that the number of state particles is not adapted high enough by any of the methods for this example.

B Extra Results for the Stochastic Volatility Model
This section shows extra results for the stochastic volatility model. Tables 13 and 14 show the scores for all the replace combinations for N_θ = 100, and Tables 15 and 16 show the same results for N_θ = 500. There is some variation in the efficiency scores across N_θ = 100, 500 and 1000, but the results are relatively similar.

Table 16: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for DA-SMC² for the stochastic volatility model with N_θ = 500 using the replace method (higher values are preferred). The gold standard refers to SMC² with a fixed number of state particles.
There are three main stages to adapting N_x: (1) triggering the adaptation, (2) choosing the new number of particles N_x', and (3) replacing the current set of state particles x^{1:N_x,1:N_θ}_d with a new set x^{1:N_x',1:N_θ}_d. To simplify notation, we write x^{1:N_x,1:N_θ}_d as x^{1:N_x}_d.

Chopin et al. (2012) propose a reweighting step for the parameter particles (reweight), using the generalised importance sampling method of Del Moral et al. (2006) to swap the current set of state particles x^{1:N_x}_d for the new set. The incremental weight function for this step (for DA-SMC²) is the ratio of the likelihood estimate under the new set of state particles to that under the current set.

    /* If more than one value has been tested */
 8: if m > 1 then
        /* If the current score is worse than the previous one */
 9:     if z_m / z_{m−1} < 1 then ...
        /* If the current score is equal to the previous one */
13:     else if z_m / z_{m−1} = 1 then ...
    /* Set N_x and R */
19: Set N_x = N_x,m* and R = R_m*

Figure 1: Evolution of N_x for replace and a low initial N_x for the Brownian motion model. Each coloured line represents an independent run of the given method.

Figure 2: Evolution of N_x for replace and a low initial N_x for the stochastic volatility model. Each coloured line represents an independent run.

Figure 3: Evolution of N_x for replace and a low initial N_x for the theta-logistic model. Each coloured line represents an independent run.

Figure 4: Evolution of N_x for replace and a low initial N_x for the Ricker model. Each coloured line represents an independent run.

Figure 5: Marginal posterior density plots for the Brownian motion model. Dashed lines are the DA-SMC² results and dotted lines of the same colour are the corresponding DT-SMC² results.

Figure 7: Marginal posterior density plots for the theta-logistic model. Dashed lines are the DA-SMC² results and dotted lines of the same colour are the corresponding DT-SMC² results.
Algorithm 1 Bootstrap particle filter.
Input: data y_{1:d}, number of state particles N_x and the static parameters θ.
Output: likelihood estimate p̂_{N_x}(y_{1:d} | θ), set of weighted state particles {x^{1:N_x}_d, W^{1:N_x}_d}.
 1: Re-weight the particles from π_{t−1}(·) to π_t(·)
    /* Initialise likelihood estimate */
 2: Initialise the likelihood estimate p̂_{N_x}(y_1 | θ) = Σ_{m=1}^{N_x} w^m_1
 3: for t = 2 to d do
    /* Resample */
 4:   Resample N_x particles from x^{1:N_x}_{t−1} with probability W^{1:N_x}_{t−1}
    /* Simulate forward */
 5:   Simulate the particles forward, x^{(m)}_t ~ f(· | x^{(m)}_{t−1}, θ)
      ...
 7:   Update the likelihood estimate p̂_{N_x}(y_{1:t} | θ) = p̂_{N_x}(y_{1:t−1} | θ) · Σ_{m=1}^{N_x} w^m_t
 8: end for

Input: data y, proposal distribution q(·), current parameter value θ_d, current likelihood estimate p̂_{N_x}(y | θ_d). Note that y := y_{1:T} for DT-SMC² and y := y_{1:d} for DA-SMC². Optional: current set of weighted state particles x^{1:N_x}_d.
Novel method to adapt the number of state particles and mutate the parameter particles for SMC²:
2. Set the new N_x and update the particle set using any combination of the stage 2 and stage 3 methods described in Sections 3, 4.2 and 4.3.
3. Calculate σ̂²_{N_x}, the sample variance of the k log-likelihood estimates.
4. Set the new number of state particles to N_x = σ̂²_{N_x} · N_x.

Algorithm 4
Input: the estimated ESJD from the previous iteration (ESJD_{d−1}), the target ESJD for each iteration (ESJD_target) and the current set of particles ϑ^{1:N_θ}_d.
Output: new number of state particles N_x, estimated ESJD.
...
 4: end if
    /* Initial mutation step with updated N_x (if applicable) */
 5: PMMH mutation ϑ^{1:N_θ}_{d,1} ~ K(ϑ^{1:N_θ}_d, ·), calculate ESJD_{d,1}
 6: if adapt then
    ...
    /* Remaining mutation steps */
 9: for r = 2 to R do
10: ...

Table 1: Possible values of the number of state particles N_x if N_x is currently 100 and G = 1, where G accounts for the tempering parameter in DT-SMC². Note that we allow the number of particles to decrease with rescale-var. The new N_x will be one of the possible values listed, e.g. if σ̂²_{N_x} = 1, novel-esjd will set N_x to 100 or 200, depending on which value is predicted to give the cheapest mutation. If there is only one possible value, then that is the new number of state particles.

Algorithm 5: Novel method to adapt the number of state particles and mutate the parameter particles for SMC² when using novel-esjd.

Table 2: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for the stage 3 options for the Brownian motion model (higher values are preferred). The results are averaged over the two starting values of N_x and are relative to the reweight method.

Table 5: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for the stage 3 options for the stochastic volatility model (higher values are preferred). The results are averaged over the two starting values of N_x and are relative to the reweight method.

Table 6: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for DT-SMC² for the stochastic volatility model using the replace method (higher values are preferred). The gold standard refers to SMC² with a fixed number of state particles.

Table 8: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for the stage 3 options for the theta-logistic model (higher values are preferred). The results are averaged over the two starting values of N_x and are relative to the reweight method.

Table 9: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for DT-SMC² for the theta-logistic model using the replace method (higher values are preferred). The gold standard refers to SMC² with a fixed number of state particles.

Table 12: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for DA-SMC² for the noisy Ricker model using the replace method (higher values are preferred). The gold standard refers to SMC² with a fixed number of state particles.

Table 13: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for DT-SMC² for the stochastic volatility model with N_θ = 100 using the replace method (higher values are preferred). The gold standard refers to SMC² with a fixed number of state particles.

Table 14: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for DA-SMC² for the stochastic volatility model with N_θ = 100 using the replace method (higher values are preferred). The gold standard refers to SMC² with a fixed number of state particles.

Table 15: Scores for the accuracy (Z_MSE), computational cost (Z_TLL) and overall efficiency (Z) for DT-SMC² for the stochastic volatility model with N_θ = 500 using the replace method (higher values are preferred). The gold standard refers to SMC² with a fixed number of state particles.