1 Introduction

State-space models (SSMs) describe time series observations as dependent on an unobserved system process (Durbin and Koopman 2012). The unobserved system process of an SSM consists of a sequence of continuously-valued latent states that evolve through time but are not observed directly. Instead, observations are modeled as a function of these unobserved latent states via the observation process. SSMs have been applied to various problems, including the modeling of inflation rates (Koopman and Bos 2004), neuron responses (Smith and Brown 2003; Lin et al. 2019), animal movement (Patterson et al. 2008; McClintock et al. 2012; Auger-Méthé et al. 2016), and population dynamics (Buckland et al. 2004; King et al. 2009; King 2012; Newman et al. 2014).

Fitting an SSM usually refers to inference on the model parameters given the observations. However, outside of the discrete-state hidden Markov model (HMM) case (Rabiner 1989), or special continuous cases, for example, linear Gaussian SSMs (Kalman 1960; Durbin and Koopman 2012), a closed-form expression for the likelihood does not exist, leading to model fitting challenges. The various approaches to approximating the likelihood include the use of linear Gaussian approximations (Julier and Uhlmann 1997; Wan and Van Der Merwe 2000), numerical methods using Laplace approximations or the HMM approximations of point mass filters (Bucy and Senne 1971; Koyama et al. 2010; Langrock and King 2013; Thygesen et al. 2017; Herliansyah et al. 2022), the use of sequential Monte Carlo to obtain an unbiased approximation to the likelihood (Beaumont 2003; Andrieu and Roberts 2009; Kantas et al. 2015; Deligiannidis et al. 2018), and Bayesian data augmentation approaches (Frühwirth-Schnatter 2004; Fearnhead 2011; Borowska and King 2023). A review of these methods, and others, is given in Newman et al. (2022).

Here, we focus on Bayesian data augmentation approaches, which can be applied to general SSMs by specifying the latent states as additional auxiliary variables that need to be estimated. In doing so, the data augmentation approach forms the joint posterior distribution of the model parameters and additional latent states (Tanner and Wong 1987; Hobert 2011). The resulting joint likelihood of the parameters and states is typically available in closed-form, permitting the application of standard MCMC algorithms, such as the Gibbs sampler, Metropolis–Hastings algorithms, or Metropolis-within-Gibbs algorithms. An estimate of the marginal posterior distribution of the model parameters can then be obtained by simply retaining the sampled parameter values and disregarding the latent state samples to effectively marginalize over the states. It is, however, typically challenging to sample the latent state vector efficiently within a Metropolis–Hastings algorithm, since the states are numerous and highly correlated, leading to poor mixing (Frühwirth-Schnatter 2004). Updating the states one at a time typically performs very poorly due to the high correlation between states (Fearnhead 2011), whilst designing good block-update proposal distributions can be challenging (Shephard and Pitt 1997).

Historically, data augmentation has been combined with numerical approximation, sequential Monte Carlo, and/or MCMC to fit the parameters of an SSM. When an SSM is well-approximated by Gaussian distributions, efficient proposal distributions for the latent states in a data augmentation approach include those derived from Laplace approximation (Kristensen et al. 2016) and extensions to the Kalman filter (Giordani and Kohn 2010). In the general case, more computationally intensive methods are usually required, such as the particle Gibbs (PG) sampler (Lindsten et al. 2012; Chopin and Singh 2015; Fearnhead and Meligkotsidou 2016) or particle Gibbs with ancestor sampling (PGAS) (Lindsten et al. 2014). PG and PGAS are classes of particle MCMC method (Andrieu et al. 2010) which use sequential Monte Carlo to update all states simultaneously conditional on the parameters. In their use of sequential Monte Carlo, PG and PGAS methods rely on few assumptions about the underlying state and observation distributions. However, these methods can be computationally expensive, particularly if many samples are required in the sequential Monte Carlo steps due to sample impoverishment (Rainforth et al. 2016; Wang et al. 2017). One approach to overcoming sample impoverishment is the introduction of deterministic resampling, which uses a deterministic grid over the state space to retain samples that may otherwise be discarded in the sequential Monte Carlo steps (Li et al. 2012). Although we do not develop a particle MCMC approach here, the use of a deterministic grid to efficiently sample the latent states is related to our approach.

In this paper, we propose a novel and efficient approach to Metropolis-within-Gibbs data augmentation. In particular, we focus on the challenging aspect of efficiently updating the latent states conditional on the model parameters and present the point mass proposal Metropolis–Hastings (PMPMH) algorithm, which proposes candidate values for (blocks of) latent states in two steps. The first step uses a discretization of the state space into a set of pre-defined intervals (a grid) and deterministic approximations to reduce the SSM to an HMM. We then sample a sequence of intervals from the HMM-approximated distribution of the latent states given the model parameters, employing the forward-filtering backward-sampling (FFBS) algorithm (Rabiner 1989). In the second step of the algorithm, we sample values from within the intervals using a continuous proposal distribution. Once the candidate values have been proposed using the two-step procedure, we correct for the discretization error imposed by the approximate HMM using a Metropolis–Hastings acceptance step. As with standard data augmentation algorithms, the model parameters are then updated conditionally on the latent states using standard Gibbs or Metropolis-within-Gibbs updates. Further to describing a general method for combining HMM approximations and MCMC steps, we discuss, in detail, several tuning parameters that can be used to improve the performance of the algorithm. Such tuning parameters include the block size for the latent states (i.e. the number of latent states to update simultaneously), the number of intervals used in the HMM approximation, and the distribution of the intervals in the state space.

The first step of the proposed algorithm is related to embedded HMMs and point mass filtering, which both use the tractable discrete case of an SSM, an HMM, to sample latent states. Embedded HMMs (Neal 2003; Shestopaloff and Neal 2013; Finke et al. 2016; Shestopaloff and Neal 2018) construct an HMM in the indices of stochastically-generated ‘pool’ states, and sample states such that they target the correct posterior distribution. Here, we use the embedding of an HMM via deterministic grid cells to ensure latent states are proposed with reasonable posterior mass. Point mass filters, similarly to the proposed approach, use a deterministic grid to reduce the SSM to an HMM (Bucy and Senne 1971; Kitagawa 1987; Langrock et al. 2012; Langrock and King 2013). The HMM can be used to approximate the likelihood or posterior distribution of the model parameters directly. As such, point mass filters can be (relatively) computationally inexpensive to implement, but generally produce biased estimators. Further, the bias and variance of estimators often depend on the grid (for example, the number and location of the grid cells and their coverage of the state space), and the integration method used to approximate the HMM (de Valpine and Hastings 2002; Matousek et al. 2019). Instead, we use the grid cells simply to inform a Metropolis–Hastings proposal distribution and are thus able to define the HMM and grid cells coarsely (and inexpensively), while being able to achieve convergence to the correct posterior distribution, correcting for the bias introduced by the discretization.

The rest of the paper is structured as follows. We start by defining SSMs and HMMs and model fitting procedures for SSMs via Bayesian data augmentation in Sect. 2. We describe the proposed data augmentation scheme in Sect. 3, the PMPMH algorithm, in which we use a grid-based HMM approximation within a proposal distribution for the latent states. In Sect. 4, we provide three particular cases of the algorithm, each given by a different method for defining the grid cells, and describe when each can be usefully applied. Finally, we illustrate our approach with two case studies in Sect. 5, including a near-chaotic system, where our proposed approach demonstrates a substantial improvement in performance compared to alternative approaches considered. We conclude in Sect. 6 with a discussion of our proposed algorithm and further possible avenues of research.

2 State-space model inference

Here we define SSMs, the notation used throughout the paper, and the special case of discrete-valued system processes: HMMs. We then discuss the challenges of fitting SSMs to data and computational approaches in Sect. 2.1, motivating our proposed data augmentation algorithm.

We consider time series data observed at discrete (regular) time points up to time T, \(y_{1:T}=(y_1,\dots ,y_T)\). An SSM models these data via (i) a system process of unobserved (latent) states, \(x_{1:T}=(x_1,\dots ,x_T)\); and (ii) an observation process linking the latent process with the observed \(y_{1:T}\). We allow the states to be continuous, with \(x_t \in \chi \) for each t, and describe their evolution using initial state and transition distributions. Similarly, the observation process is described using an observed state distribution. As standard, we assume a first-order Markov process for the system process and that the observed data are conditionally independent given the underlying latent states, that is, the observation process is only a function of the latent state at time t.

Given the set of model parameters, collectively denoted \(\pmb {\theta }\), the SSM can be described mathematically by

Initial state distribution:

$$\begin{aligned} p(x_1 \vert \pmb {\theta }), \end{aligned}$$

State transition distribution:

$$\begin{aligned} p(x_t \vert x_{t-1}, \pmb {\theta }), t=2, \dots , T, \end{aligned}$$

Observed state distribution:

$$\begin{aligned} p(y_t \vert x_t, \pmb {\theta }), t=1, \dots , T. \end{aligned}$$

A particular case of an SSM is an HMM, in which the state space is discrete and finite. In other words, the states at time t, \(x_t\), take values in the set \(\{1, \dots , N\}\). Therefore, when the SSM considered is an HMM, we simply re-define the initial state and transition probabilities for \(k, n=1, \dots , N\) as

Initial state probabilities:

$$\begin{aligned} P(x_1=n \vert \pmb {\theta }), \end{aligned}$$

State transition probabilities:

$$\begin{aligned} P(x_t=n \vert x_{t-1}=k, \pmb {\theta }), t=2, \dots , T. \end{aligned}$$

2.1 Model fitting

We assume that we are primarily interested in inference on the model parameters, \(\pmb {\theta }\), although this may also extend to the latent states, \(x_{1:T}\), depending on the application. In general, inference on the posterior distribution for \(\pmb {\theta }\) requires a closed-form expression for the observed data (or marginal) likelihood, \(p(y_{1:T} \vert \pmb {\theta })\). The joint likelihood of the model parameters and latent states is given by

$$\begin{aligned} p(x_{1:T}, y_{1:T} \vert \pmb {\theta })&= p(x_1 \vert \pmb {\theta }) \prod _{t=2}^T p(x_t \vert x_{t-1}, \pmb {\theta })\prod _{t=1}^T p(y_t \vert x_t, \pmb {\theta }). \end{aligned}$$
(1)
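As an illustration of Equation (1), a minimal sketch evaluating the log joint likelihood for a generic SSM is given below; `log_init`, `log_trans`, and `log_obs` are placeholders for the model-specific log-densities \(\log p(x_1 \vert \pmb {\theta })\), \(\log p(x_t \vert x_{t-1}, \pmb {\theta })\), and \(\log p(y_t \vert x_t, \pmb {\theta })\).

```python
def log_joint_likelihood(x, y, theta, log_init, log_trans, log_obs):
    """Evaluate the log of the joint likelihood in Equation (1).

    x, y      : latent states and observations, each of length T
    theta     : model parameters (whatever structure the densities accept)
    log_init  : callable returning log p(x_1 | theta)
    log_trans : callable returning log p(x_t | x_{t-1}, theta)
    log_obs   : callable returning log p(y_t | x_t, theta)
    """
    T = len(y)
    ll = log_init(x[0], theta) + log_obs(y[0], x[0], theta)
    for t in range(1, T):
        ll += log_trans(x[t], x[t - 1], theta) + log_obs(y[t], x[t], theta)
    return ll
```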

A closed-form expression for the marginal likelihood typically only exists in the linear Gaussian case, where \(p(y_{1:T} \vert \pmb {\theta })\) can be calculated using the Kalman filter (Kalman 1960; Durbin and Koopman 2012), or when the state space is discrete and finite (an HMM), where the marginalization of Equation (1) amounts to summation over the state space (Rabiner 1989). In general, however, the marginal likelihood is intractable for continuous SSMs, and standard MCMC implementations cannot be used. One solution is to use pseudo-marginal likelihood methods (Beaumont 2003; Andrieu and Roberts 2009), replacing the marginal likelihood with an unbiased estimate to formulate a valid Metropolis–Hastings approach, but we do not focus on these here.

Instead, we consider a Bayesian data augmentation approach. In this case, we form the joint posterior distribution over the model parameters and latent states (Tanner and Wong 1987; Hobert 2011), \(p(x_{1:T}, \pmb {\theta } \vert y_{1:T}) \propto p(\pmb {\theta }) p(x_{1:T}, y_{1:T} \vert \pmb {\theta })\), thus using the joint likelihood in Equation (1) directly. Since, in general, a closed-form expression exists for the joint likelihood term up to proportionality, the proportional likelihood can be evaluated, and standard MCMC approaches can be used targeting \(p(x_{1:T}, \pmb {\theta } \vert y_{1:T})\). Samples from the (marginal) posterior distribution of the parameters, \(p(\pmb {\theta } \vert y_{1:T})\), are then obtained by disregarding the state samples and retaining the parameter samples.

To sample from \(p(x_{1:T}, \pmb {\theta } \vert y_{1:T})\), a common approach is to sample the latent states and parameters in turn via their conditional distributions, simplifying the parameter updates (Carter and Kohn 1994; Frühwirth-Schnatter 2004; Durbin and Koopman 2012). A Metropolis-within-Gibbs algorithm can be applied to target the joint distribution: in each iteration, we update the latent states conditional on the model parameters, and the model parameters conditional on the latent states (for which we now have a closed-form conditional likelihood expression). However, designing an efficient proposal distribution for the underlying state vector can be challenging. Ideally, we would design a proposal distribution that updates all states simultaneously and minimizes the correlation between consecutive samples. In general, this state distribution is intractable, and designing an efficient approximation to it is difficult since it is often high-dimensional in the number of temporal states. One approach updates all states simultaneously using a particle Gibbs (PG) algorithm (Chopin and Singh 2015; Murphy and Godsill 2016; Lindsten et al. 2014): values for the entire state vector are proposed at each iteration from a conditional sequential Monte Carlo algorithm. In this approach, proposed values for the entire state vector are always accepted, and samples from the correct posterior distribution can be obtained. However, PG methods can suffer from sample impoverishment, in which degenerating particles are discarded upon resampling, impairing mixing and convergence (Wang et al. 2017; Rainforth et al. 2016). Instead of updating all states simultaneously, for example via PG updates, an alternative is a single-update algorithm, updating each latent state individually to reduce the required dimension of the proposal distribution; however, this approach typically leads to very poor mixing due to the correlation in the state process (Fearnhead 2011; King 2011).

A compromise between single and global updates of the latent states is to use block updates, simultaneously updating \(\ell \) consecutive states (Shephard and Pitt 1997). The posterior conditional distribution of the states in a given block is, generally, of non-standard form, thus it can be challenging to define an efficient proposal distribution that accounts for the correlation between the states. We propose the use of the discrete analogy to SSMs, HMMs, to define an informative and efficient proposal distribution for blocks of latent states.

3 Point mass proposal Metropolis–Hastings

In this section, we propose the point mass proposal Metropolis–Hastings (PMPMH) algorithm. The setting is a Metropolis-within-Gibbs algorithm for data augmentation. The model parameters, \(\pmb {\theta }\), are updated conditionally on \(x_{1:T}\) using a standard Metropolis–Hastings or Gibbs update, and we propose the PMPMH algorithm to sample Metropolis–Hastings proposals for the latent states given the model parameters.

Fig. 1 Example partition of a state space \(\chi ^T\) for each time t. Each grid cell is labeled by the interval it covers

Step 1 of the algorithm initially converts the SSM into an HMM by discretizing the state space at each time point, forming a deterministic grid, and the HMM is approximated deterministically. Grid cells are then sampled from the associated (discretized and approximate) posterior distribution, conditional on the model parameters, using the standard forward-filtering backward-sampling (FFBS) algorithm for HMMs (Rabiner 1989) at a computational cost of \(\mathcal {O}(NT)\) for N grid cells at each time point. Given the sampled grid cells, Step 2 proposes values for \(x_{1:T}\) from within the cells using some specified (bounded) continuous proposal distribution. We then correct for the approximate sampling distribution using a Metropolis–Hastings step targeting the correct posterior distribution.

For pedagogical purposes, we make a few assumptions in our initial description of the algorithm, for example, we derive the algorithm for one-dimensional state spaces. Extensions to higher-dimensional state spaces are possible, and we demonstrate the algorithm in the two-dimensional case with an example in Sect. 5, and discuss scalability further in Sect. 6. However, we note here that there may be practically increased computational demands for higher-dimensional state spaces. For notational simplicity, we present the algorithm for SSMs where the state space and the number of grid cells are the same for each time point, that is, the state space at all time points is denoted by \(\chi \). However, the generalization to state spaces that vary over time is immediate by adapting the notation. Finally, we initially describe the algorithm in terms of updating the full set of latent states, with a version of the algorithm for block updates of the states described later.

3.1 Step 1: sampling a grid cell trajectory

In this initial step, we formulate a discrete HMM representation of the SSM, approximate the HMM, and sample a trajectory from the discrete approximation.

3.1.1 Discretization to an HMM

First, we propose a partition of the whole state space into a grid. The state space at each time t, \(\chi \), is partitioned into intervals, forming grid cells. The N grid cells at time t are denoted \(I_t(n),\) \(n=1, \dots , N\) and span \(\chi \) with no overlap. The respective outer grid cells have infinite length if the state space is unbounded. An example of such a partition is shown in Fig. 1.

The grid cells discretize the state space. Thus, the grid cell indices are the discrete states of an HMM. For the grid cells given by \(\{I_t(1), \dots , I_t(N)\}\), the indices are simply \(\{1, \dots , N\}\), with dynamics given by the SSM density of each grid cell. Mathematically, let \(B_t\) denote the random variable of the grid cell indices under the specified grid cell boundaries so that \(B_t \in \{1,\dots ,N\}\) for \(t=1,\dots ,T\). Then the HMM probabilities are given for states \(k, n \in \{1, \dots , N\}\) as follows:

Initial state probabilities:

$$\begin{aligned} P(B_1=n \vert \pmb {\theta }), \end{aligned}$$

State transition probabilities:

$$\begin{aligned} P(B_t = n \vert B_{t-1}=k, \pmb {\theta }), t=2, \dots , T, \end{aligned}$$

Observed state distribution:

$$\begin{aligned} p(y_t \vert B_t=n, \pmb {\theta }), t=1, \dots , T. \end{aligned}$$

These HMM probabilities under the original SSM are therefore defined by integrals of the state space equations over the given intervals:

$$\begin{aligned}&P(B_1=n \vert \pmb {\theta }) = \int \limits _{I_1(n)} p(x_1 \vert \pmb {\theta }) dx_1,\nonumber \\&P(B_t = n \vert B_{t-1}=k, \pmb {\theta }) = \nonumber \\&\qquad \int \limits _{I_t(n)} \int \limits _{I_{t-1}(k)} p(x_t \vert x_{t-1}, \pmb {\theta }) dx_{t-1} dx_t, t=2, \dots , T, \nonumber \\&p(y_t \vert B_t=n, \pmb {\theta }) = \int \limits _{I_t(n)} p(y_t \vert x_t, \pmb {\theta }) dx_t, t=1, \dots , T. \end{aligned}$$
(2)

3.1.2 Approximate HMM

In general, the probabilities associated with the HMM, given in Equation (2), are not available in closed form, thus we approximate these integrals. To formulate an approximation to the HMM under the considered SSM, we use simple (fast) deterministic methods, for example, the midpoint rule or other Riemann sum methods. Mathematically, we define \(L_t(n)\) as the length of the interval \(I_t(n)\), with the length of any infinite grid cells set at some arbitrary (finite) length (discussed further in Appendix B). Within each grid cell, we choose a set of node points, \(\{\xi _t^s(n)\}_{s=1}^S\). The set of nodes is defined a priori and is used to approximate the density functions across each grid cell by a polynomial. In the simple case where \(S=1\) (that is, one node in each grid cell such as the mid-point), we have that \(\{\xi _t^s(n)\}_{s=1}^S=\xi _t(n)\). The approximate HMM, for grid cell indices \(k, n=1, \dots , N\), can thus be given by the mid-point approximations:

$$\begin{aligned}&\hat{P}(B_1 = n \vert \pmb {\theta }) \propto L_1(n)\, p(\xi _1(n) \vert \pmb {\theta }),\\&\hat{P}(B_t=n \vert B_{t-1}=k, \pmb {\theta }) \propto L_t(n) L_{t-1}(k)\, p(\xi _t(n) \vert \xi _{t-1}(k), \pmb {\theta }), \quad t=2, \dots , T,\\&\hat{P}(y_t \vert B_t=n, \pmb {\theta }) \propto L_t(n)\, p(y_t \vert \xi _t(n), \pmb {\theta }), \quad t=1, \dots , T. \end{aligned}$$

Each set of approximate probabilities is normalized to sum to one, giving valid probability mass functions. Once the probability mass function approximations have been obtained, grid cell indices are proposed from

$$\begin{aligned} \hat{P}(B_{1:T} \vert y_{1:T}, \pmb {\theta })&\propto \hat{P}(B_1 \vert \pmb {\theta })\hat{P}(y_1 \vert B_1, \pmb {\theta }) \prod _{t=2}^T \hat{P}(B_t \vert B_{t-1}, \pmb {\theta })\hat{P}(y_t \vert B_t, \pmb {\theta }). \end{aligned}$$

We sample indices from this distribution, denoted \(b'_{1:T}\), using the forward-filtering backward-sampling (FFBS) algorithm in Rabiner (1989).
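As an illustrative sketch (not the implementation provided in the Supplementary Information), the following code builds the mid-point approximation for a one-dimensional state space with a single node per cell (\(S=1\)) and then runs FFBS over the resulting HMM; `init_pdf`, `trans_pdf`, and `obs_pdf` are placeholders for the model densities, and infinite outer cells are assumed to have been truncated to a finite pseudo-length beforehand. The FFBS routine also returns the log-probability of the sampled path under \(\hat{P}(B_{1:T} \vert y_{1:T}, \pmb {\theta })\), which is required later for the Metropolis–Hastings correction.

```python
import numpy as np

def approximate_hmm(grid, y, theta, init_pdf, trans_pdf, obs_pdf):
    """Mid-point approximation to the HMM probabilities (one node per cell).

    grid : cell boundaries, shape (T, N + 1); infinite outer cells are assumed
           to have been replaced by a finite pseudo-length already.
    Returns initial probabilities (N,), transition matrices (T - 1, N, N) and
    unnormalized observation weights (T, N); constant factors in the weights
    cancel when the filtering distribution is normalized.
    """
    T, N = grid.shape[0], grid.shape[1] - 1
    lengths = np.diff(grid, axis=1)                  # L_t(n)
    mids = 0.5 * (grid[:, :-1] + grid[:, 1:])        # mid-points xi_t(n)

    p_init = lengths[0] * init_pdf(mids[0], theta)
    p_init /= p_init.sum()

    p_trans = np.empty((T - 1, N, N))
    for t in range(1, T):
        # rows k index the previous cell, columns n the current cell
        dens = trans_pdf(mids[t][None, :], mids[t - 1][:, None], theta)
        mat = lengths[t - 1][:, None] * lengths[t][None, :] * dens
        p_trans[t - 1] = mat / mat.sum(axis=1, keepdims=True)

    w_obs = lengths * np.stack([obs_pdf(y[t], mids[t], theta) for t in range(T)])
    return p_init, p_trans, w_obs


def ffbs_sample(p_init, p_trans, w_obs, rng):
    """Forward-filtering backward-sampling for the approximate HMM (Step 1)."""
    T, N = w_obs.shape
    alpha = np.empty((T, N))                         # filtering probabilities
    alpha[0] = p_init * w_obs[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ p_trans[t - 1]) * w_obs[t]
        alpha[t] /= alpha[t].sum()

    b = np.empty(T, dtype=int)
    log_prob = 0.0                                   # log prob of the sampled path
    w = alpha[T - 1]
    b[T - 1] = rng.choice(N, p=w)
    log_prob += np.log(w[b[T - 1]])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * p_trans[t][:, b[t + 1]]
        w /= w.sum()
        b[t] = rng.choice(N, p=w)
        log_prob += np.log(w[b[t]])
    return b, log_prob
```

The grid itself can be constructed using any of the approaches discussed in Sect. 4.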

3.2 Step 2: sampling a point trajectory

Step 2 proposes continuous values for the states which will be subsequently corrected in Metropolis–Hastings steps. Given the discrete grid cell indices sampled previously, we propose values for \(x_{1:T}\) conditional on the corresponding intervals \(\{I_1(b'_1),\dots ,I_T(b'_T)\}\). This amounts to sampling \(x_{1:T}\) from the domain \(I_1(b'_1)\times I_2(b'_2)\times \cdots \times I_T(b'_T)\). For simplicity, we consider proposal distributions for each \(x_t\) independently and independently of \(\pmb {\theta }\), giving proposal distributions of the form \(q(x_t \vert x_t \in I_t(b'_t))=q(x_t \vert B_t=b'_t)\), for \(t=1,\dots ,T\). For example, a natural choice is to propose each \(x_t\) from a uniform proposal distribution over \(I_t(b'_t)\) for finite \(I_t(b'_t)\), or a truncated Gaussian distribution if \(I_t(b'_t)\) has infinite length. However, any distribution with domain \(I_t(b'_t)\) is applicable.
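For concreteness, a minimal sketch of this step for a single cell is given below, using a uniform proposal on finite cells and a truncated Gaussian, anchored at the finite boundary, on infinite outer cells; the truncated Gaussian scale `sd` is an illustrative tuning choice rather than part of the algorithm.

```python
import numpy as np
from scipy import stats

def propose_within_cell(lower, upper, rng, sd=1.0):
    """Sample x_t and log q(x_t | B_t = b'_t) for one grid cell I_t(b'_t)."""
    if np.isfinite(lower) and np.isfinite(upper):
        x = rng.uniform(lower, upper)                # uniform on a finite cell
        logq = -np.log(upper - lower)
    else:
        # truncated Gaussian anchored at the finite boundary of an outer cell
        centre = lower if np.isfinite(lower) else upper
        a, b = (lower - centre) / sd, (upper - centre) / sd
        x = stats.truncnorm.rvs(a, b, loc=centre, scale=sd, random_state=rng)
        logq = stats.truncnorm.logpdf(x, a, b, loc=centre, scale=sd)
    return x, logq
```

Calling this routine for each t and summing the returned log-densities, together with the FFBS path log-probability, gives the log of the proposal density defined in the next subsection.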

3.3 The PMPMH proposal distribution

The process of sampling a grid cell trajectory and sampling a value for \(x_{1:T}\) defines an independent proposal distribution on \(\chi ^T\). The density function of the proposed trajectories is therefore the probability of selecting the grid cell indices from a specified set of grid cell boundaries, combined with the density of the (continuous) values within those grid cells. Let \(B_{1:T}\) denote the random variables of the grid cell indices under the proposed grid (which may change across iterations), with intervals \(I_t(n)\), \(n=1, \dots , N\), \(t=1, \dots , T\). Then the density of the subsequently proposed state values, \(x'_{1:T}\), is given by

$$\begin{aligned} q(x'_{1:T} \vert y_{1:T}, \pmb {\theta })&= \hat{P}(B_{1:T}=b'_{1:T} \vert y_{1:T}, \pmb {\theta }) \nonumber \\&\quad \prod _{t=1}^T q(x'_t \vert B_t=b'_t). \end{aligned}$$
(3)

The resulting proposed trajectory, \(x'_{1:T}\), is retained according to the Metropolis–Hastings acceptance probability, given by

$$\begin{aligned} \alpha (x_{1:T}, x'_{1:T}) = \min \left( 1, \frac{ p(x'_{1:T} \vert y_{1:T}, \pmb {\theta }) q(x_{1:T} \vert y_{1:T}, \pmb {\theta }) }{p(x_{1:T} \vert y_{1:T}, \pmb {\theta })q(x'_{1:T} \vert y_{1:T}, \pmb {\theta })} \right) , \end{aligned}$$
(4)

where \(p(x_{1:T} \vert y_{1:T}, \pmb {\theta })\) denotes the posterior conditional distribution of \(x_{1:T}\) given \(y_{1:T}\) and \(\pmb {\theta }\), \(q(x'_{1:T} \vert y_{1:T}, \pmb {\theta })\) is as in Eq. (3), and \(q(x_{1:T} \vert y_{1:T}, \pmb {\theta })\) uses the same definition for the grid cells, that is, their indices, \(b_{1:T}\), are such that \(x_t \in I_t(b_t)\) for all t.
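A hedged sketch of this correction on the log scale is given below; `log_target` stands for \(\log p(x_{1:T} \vert y_{1:T}, \pmb {\theta })\) up to a constant, and `logq_cur` and `logq_prop` are the log proposal densities of Eq. (3) (each the FFBS path log-probability plus the sum of within-cell log-densities).

```python
import numpy as np

def pmpmh_accept(x_cur, x_prop, log_target, logq_cur, logq_prop, rng):
    """Metropolis-Hastings correction of Equation (4), computed on the log scale."""
    log_alpha = (log_target(x_prop) - log_target(x_cur)) + (logq_cur - logq_prop)
    if np.log(rng.uniform()) < log_alpha:
        return x_prop, True       # accept the proposed trajectory
    return x_cur, False           # retain the current trajectory
```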

The PMPMH method for obtaining a sample of \(x_{1:T}\) conditional on \(\pmb {\theta }\) is summarized in Algorithm 1. For completeness, we also include the updates of \(\pmb {\theta }\) conditional on \(x_{1:T}\), which can be updated using standard Metropolis–Hastings or Gibbs updates conditional on the current chain value of the states.

3.4 Block updates of the latent states

Updating all of the states simultaneously may be inefficient if there are many latent states. We can, instead, update the states in smaller blocks: for a block of \(\ell \) states starting at time t, we sample \(x_{t:t+\ell -1}\) from a proposal distribution of the form \(q(x_{t:t+\ell -1} \vert x_{t-1}, x_{t+\ell }, y_{t:t+\ell -1}, \pmb {\theta })\).

A block PMPMH proposal distribution is given by a simple adaptation of the case where all states are updated simultaneously, conditioning on the current states at either side of the block, \(x_{t-1}\) and \(x_{t+\ell }\). Therefore, in Step 1, we condition on the current grid cell indices, \(b_{t-1}\) and \(b_{t+\ell }\), such that \(x_{t-1} \in I_{t-1}(b_{t-1})\) and \(x_{t+\ell } \in I_{t+\ell }(b_{t+\ell })\) under the current grid definition. We then use FFBS to sample proposed indices from the grid cells in the block. In Step 2, we simply define the proposal distribution, as before, for each latent state in the given block, \(q(x_s \vert B_s=b'_s)\), for \(s=t, \dots , t+\ell -1\). The PMPMH algorithm for block updates is summarized in Appendix C.
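A minimal sketch of one sweep over the latent states in overlapping blocks is given below; `pmpmh_update_block` is a hypothetical helper applying Steps 1 and 2 and the Metropolis–Hastings correction to a block conditional on the neighbouring states (where they exist) and \(\pmb {\theta }\).

```python
def sweep_blocks(x, y, theta, block_size, pmpmh_update_block, rng):
    """One sweep of PMPMH block updates; consecutive blocks overlap by one state.

    x is the current array of latent states; pmpmh_update_block is a hypothetical
    helper returning updated values for x[t:end] given x[t-1], x[end] and theta.
    """
    T = len(x)
    step = max(block_size - 1, 1)        # overlap consecutive blocks by one state
    for t in range(0, T, step):
        end = min(t + block_size, T)
        x[t:end] = pmpmh_update_block(x, y, theta, t, end, rng)
    return x
```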

We note that it is possible to parallelize the updates of the latent states to improve computational efficiency. For example, blocks of states can be updated in parallel provided that the states to be updated and those being conditioned on do not overlap. For the sake of comparison with other methods, we do not parallelize our computation but simply note its possibility. See also King et al. (2023) for further discussion on parallelization within a state-space model fitting context.

3.5 Conditions for convergence

A benefit of the proposed method is that estimators of integrable functions are consistent under some mild conditions ensuring the validity of the proposal distribution in the Metropolis–Hastings steps.

Theorem

The PMPMH algorithm provides consistent ergodic average estimators of integrable functions with respect to \(p(x_{1:T}, \pmb {\theta } \vert y_{1:T})\) if,

  1. for all \(b_t \in \{1, \dots , N\}\), \(t \in \{1, \dots , T\}\), the HMM probabilities are defined such that \(\hat{P}(B_{1:T} = b_{1:T} \vert y_{1:T}, \pmb {\theta })>0\) and the proposal distributions within each grid cell are defined such that \(q(x_t\vert B_t=b_t)>0\) for all \(x_t \in I_t(b_t)\), and

  2. either

     (a) the grid cells are defined and fixed at the start of the algorithm, or

     (b) the grid cells are defined as a function of the current MCMC iteration parameter or state values.

Proof

We assume that the transition kernel on the parameter space is constructed such that it admits \(p(\pmb {\theta }\vert x_{1:T}, y_{1:T})\) as its limiting distribution. By standard results for Metropolis-within-Gibbs algorithms (Latuszyński et al. 2013; Gamerman and Lopes 2006), proving that the PMPMH algorithm provides consistent estimators of integrable functions reduces to ensuring that the transition kernel on the blocks of \(\ell \) states converges to the correct conditional posterior distribution. We therefore show that irreducibility and detailed balance are satisfied by the proposal distributions with respect to the correct conditional posterior distribution for each block of \(\ell \) states.

By construction, we assume that \(\cup _{n=1}^N I_t(n) = \chi \) for all \(t=1, \dots , T\). Condition 1 implies that \(q(x_{1:T} \vert y_{1:T}, \pmb {\theta })>0\) for all \(x_{1:T} \in \chi ^T\), thus irreducibility is satisfied since any value in the state space can be proposed at each iteration of the algorithm. Note that Condition 1 can be ensured by arbitrarily thresholding the HMM probabilities above zero and attributing non-zero density within each grid cell using many standard distributions.

It is immediate that detailed balance holds under Condition 2(a) that the grid cells are defined and fixed at the start of the Metropolis–Hastings algorithm. It is also straightforward to show that detailed balance holds under Condition 2(b) that the grid cells are defined as a function of the current MCMC iteration parameter or state values.

Thus, Condition 1 and Condition 2(a) or 2(b) ensure that irreducibility and detailed balance are satisfied by the PMPMH proposal distribution, and the algorithm provides consistent ergodic average estimators of integrable functions with respect to \(p(x_{1:T}, \pmb {\theta } \vert y_{1:T})\).

4 Defining the grid cells

We have, so far, given a broad framework for using an approximate HMM as a Metropolis–Hastings proposal distribution. The method fundamentally relies on the use of a deterministic grid to discretize the state space. There are, however, many different ways to define the grid cells (their size and location), possibly resulting in substantially different proposal distributions and computational costs. We see in Sect. 5 that the efficiency of the algorithm is highly dependent on the choice of grid cell definition and resulting proposal distribution. Since the optimal choice depends on the application, we provide the reader with three approaches for defining the grid cells and describe when each can be usefully (and efficiently) applied. We discuss the other practical decisions in Appendix B.

4.1 Approach 1: equal grid cells

The first approach sets all finite cells to be equally sized and identical across all time points, that is, \(I_t(n) = I_s(n)\) for all \(t,s = 1, \dots , T\) and \(n=2, \dots , N-1\) (assuming an infinite lower and upper bound on the latent states). We assume that these equally sized grid cells are centered around the mean of the data, \(\frac{1}{T} \sum _{t=1}^T y_t\), spanning a range of the state space denoted by \(\mathcal {S}\), though this approach could be adapted by centering the grid cells around another function of the data.
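A sketch of this construction for a one-dimensional, unbounded state space is given below; `span` corresponds to \(\mathcal {S}\).

```python
import numpy as np

def equal_grid(y, N, span):
    """Approach 1 sketch: N cells per time point, identical across time.

    The N - 2 finite cells are equally sized, centred on the mean of the data
    and jointly cover `span` units; the two outer cells are infinite.
    Returns boundaries of shape (T, N + 1).
    """
    centre = np.mean(y)
    inner = np.linspace(centre - span / 2.0, centre + span / 2.0, N - 1)
    boundaries = np.concatenate(([-np.inf], inner, [np.inf]))
    return np.tile(boundaries, (len(y), 1))
```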

If the model parameters do not vary with time, only one transition matrix needs to be calculated per MCMC iteration, or whenever the model parameters are updated. For a fixed number of grid cells, this approach has the lowest computational cost of the approaches considered. However, since the grid cells are the same for each time point, but the regions of high posterior density may change at each time point, the grid cells should cover a large range of the state space to ensure coverage of these high-density regions and good mixing. This may require many grid cells, and a greater computational cost, if the high posterior regions of the SSM are highly non-uniform over time; this approach is therefore most efficient for SSMs whose high posterior regions are relatively uniform over time.

4.2 Approach 2: data-driven quantiles

If the high posterior density regions vary over time, a time-inhomogeneous approximation to the HMM transition matrix may be more efficient than the previous approach. Hence, we set grid cells in this approach using the observed data at each time point. In particular, in the implementations of this paper, we set the grid cell boundaries at the quantiles of one-dimensional Gaussian distributions centered around each observed data point. For each time point, t, we set the boundaries at the quantiles of X where \(X \sim N(y_t, {\sigma _y}^2)\). The variance of the Gaussian distribution, \({\sigma _y}^2\), is set at a scalar value, or as a function of the current observation process variance within the MCMC iterations and can be used to control the range that the majority of the grid cells cover via pilot tuning. The resulting quantiles are simply rounded if integer values are required. Note that simple extensions to this approach may include using a different distribution to set the quantiles, provided that its domain is in the state space.
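The following sketch implements this choice for a one-dimensional state space, placing the \(N-1\) finite boundaries at equally spaced quantile levels of \(N(y_t, {\sigma _y}^2)\); the equally spaced levels are an illustrative choice.

```python
import numpy as np
from scipy import stats

def data_quantile_grid(y, N, sigma_y):
    """Approach 2 sketch: boundaries at Gaussian quantiles centred on each y_t.

    Returns boundaries of shape (T, N + 1); the outer cells are infinite.
    """
    probs = np.linspace(0.0, 1.0, N + 1)[1:-1]      # interior quantile levels
    y = np.asarray(y, dtype=float)
    grid = np.empty((len(y), N + 1))
    grid[:, 0], grid[:, -1] = -np.inf, np.inf
    grid[:, 1:-1] = stats.norm.ppf(probs)[None, :] * sigma_y + y[:, None]
    return grid
```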

When the model parameters are updated, a transition matrix needs to be recalculated for each time point. It is, however, possible to reduce the computational cost of this approach by using the same transition matrix across several time points, centering, for example, on a single data point or the mean of the corresponding sub-series. By using the observed data at each time point, we aim to approximately concentrate the grid cells in areas that are likely to have high posterior mass at each time point. This method may be most efficient if the observed data used in the centering is a good proxy for the underlying state process and its dependencies.

4.3 Approach 3: latent state quantiles

The previous approaches define the grid cells independently of the state process, which may result in slow mixing if the latent states of the SSM are highly correlated over time. The accuracy of the proposal distribution and mixing of the latent states can be improved through the definition of the grid cells. Matousek et al. (2019) show empirically that the accuracy of deterministic grid cell approximations to the posterior distribution can be improved by placing the grid cells according to rough approximations to the posterior distribution. Thus, here we set the grid cells using the current latent states in the MCMC iterations at each time t. The underlying intuition is that, after the MCMC algorithm has converged, the current states reflect a sample from the posterior distribution. Thus, over several iterations, grid cells centered around the current state lead to proposed values that are distributed similarly to the (conditional) posterior distribution (Haario et al. 1999), improving the accuracy of the proposal distribution and mixing of the MCMC steps.

To formulate grid cells using the current state, we let \({x_t}^{(m-1)}\) denote the current state in the MCMC iterations at time t; then, similarly to the previous approach, we define the grid cell boundaries at the quantiles of a Gaussian distribution X for each t, where \(X \sim N({x_t}^{(m-1)}, {\sigma _x}^2)\). As with the previous approach, the variance of the Gaussian distribution used to define the grid cells, \({\sigma _x}^2\), controls the span of the finite cells. Again, this variance could be set, for example, as a fixed value (chosen via pilot-tuning), as a function of the current estimate of the system process variance, or as a function of the current state within the MCMC iterations. We note that setting the majority of the grid cells to cover a small range relative to the high posterior density ranges worked well in practice (see Sect. 5).
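The construction mirrors the previous approach, centering on the current state values rather than the data; a sketch allowing \({\sigma _x}\) to be a scalar or to vary with t (for example, proportional to the current state, as used in Sect. 5.2) is given below.

```python
import numpy as np
from scipy import stats

def state_quantile_grid(x_cur, N, sigma_x):
    """Approach 3 sketch: boundaries at Gaussian quantiles centred on the current states.

    x_cur   : current MCMC values of x_{1:T}
    sigma_x : scalar, or array of length T
    Returns boundaries of shape (T, N + 1); the outer cells are infinite.
    """
    x_cur = np.asarray(x_cur, dtype=float)
    sd = np.broadcast_to(np.asarray(sigma_x, dtype=float), x_cur.shape)
    probs = np.linspace(0.0, 1.0, N + 1)[1:-1]
    grid = np.empty((len(x_cur), N + 1))
    grid[:, 0], grid[:, -1] = -np.inf, np.inf
    grid[:, 1:-1] = stats.norm.ppf(probs)[None, :] * sd[:, None] + x_cur[:, None]
    return grid
```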

In this approach, the transition matrices need to be recalculated at each time point for every change in the model parameters but will typically require fewer grid cells than the other approaches to achieve good mixing.

In general, there is a trade-off between the extent to which the grid cells are well-placed and the associated computational expense. For example, equally-sized grid cells are computationally fast but may give lower acceptance probabilities if they give coarse approximations in high-density regions. Where equally-sized grid cells give poor acceptance probabilities, for example, due to non-uniformity in high posterior regions over time, acceptance probabilities may be improved using one of the other approaches at a greater computational expense.

5 Numerical illustrations

We demonstrate the proposed PMPMH algorithm via two case studies. The first is an SSM with a simple one-dimensional Gaussian mixture state process, demonstrating how the algorithm can be implemented, and the properties of the algorithm when different practical decisions are made. We then show how a similar PMPMH implementation can be used to efficiently sample the latent states of a more challenging 2-dimensional population growth model that can display near-chaotic behavior.

In each case study, we compare the performance of the algorithm to two particle Gibbs algorithms: the particle Gibbs (PG) sampler (Chopin and Singh 2015; Murphy and Godsill 2016) and the particle Gibbs with ancestor sampling (PGAS) algorithm (Lindsten et al. 2014), both with multinomial resampling. The two methods are particle MCMC algorithms and, like the proposed algorithm, sample from the joint posterior distribution of the states and parameters by updating each in turn from their full conditional distributions. The PGAS algorithm, in particular, is a state-of-the-art method that often improves upon the mixing properties of the PG algorithm by reducing sample impoverishment (Nonejad 2015; Meent et al. 2015; Kantas et al. 2015), though at an increased computational cost. The code used to implement the PG, PGAS, and PMPMH algorithms in each numerical illustration is given in the Supplementary Information.

5.1 Gaussian mixture state-space model

We consider a simple one-dimensional Gaussian mixture state process:

$$\begin{aligned}&x_1 \sim w_1 N(1, \sigma ^2_{\eta _1}) + (1-w_1)N(1, \sigma ^2_{\eta _2}),\\&x_t \vert x_{t-1} \sim w_t N(x_{t-1}, \sigma ^2_{\eta _1}) + \\&\qquad \qquad \quad (1-w_t) N(x_{t-1}, \sigma ^2_{\eta _2}), t= 2, \dots , T,\\&w_t \sim Bernoulli(p), t= 1, \dots , T, \end{aligned}$$

where \(p \in [0,1]\) denotes the probability of selecting the first mixture component of the state distribution (the Gaussian distribution with variance \(\sigma ^2_{\eta _1}\)), with the second component (variance \(\sigma ^2_{\eta _2}\)) selected with probability \(1-p\). Data, \(y_{1:T}\), are observed according to \(y_t \vert x_t \sim N(x_t, \sigma ^2_{\epsilon })\) and the model parameters are given by \(\pmb {\theta }=(p, \sigma ^2_{\eta _1}, \sigma ^2_{\eta _2}, \sigma ^2_{\epsilon })\). We simulate two data sets from this model, \(y_{1:600}^{(1)}\) using \(\pmb {\theta }=(0.9, 1, 700, 1)\) and \(T=600\) (Model 1), and \(y_{1:1000}^{(2)}\) using \(\pmb {\theta }=(0.99, 1, 10{,}000, 10)\) and \(T=1000\) (Model 2), shown in Fig. 2.
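For reference, a short sketch simulating states and data from this model (for example, Model 1 with \(\pmb {\theta }=(0.9, 1, 700, 1)\) and \(T=600\)) is given below.

```python
import numpy as np

def simulate_mixture_ssm(T, p, var1, var2, var_eps, rng):
    """Simulate states and data from the Gaussian mixture SSM of this section.

    e.g. Model 1: simulate_mixture_ssm(600, 0.9, 1.0, 700.0, 1.0, np.random.default_rng(1))
    """
    w = rng.binomial(1, p, size=T)                        # mixture indicators w_t
    sd = np.where(w == 1, np.sqrt(var1), np.sqrt(var2))   # component standard deviations
    x = np.empty(T)
    x[0] = 1.0 + sd[0] * rng.standard_normal()            # x_1 centred at 1
    for t in range(1, T):
        x[t] = x[t - 1] + sd[t] * rng.standard_normal()   # random-walk transition
    y = x + np.sqrt(var_eps) * rng.standard_normal(T)     # y_t | x_t ~ N(x_t, var_eps)
    return x, y
```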

Fig. 2 Simulated data from the Gaussian mixture SSM: (a) \(y^{(1)}_{1:600}\) using \(\pmb {\theta }=(0.9, 1, 700, 1),\) \(T=600\), and (b) \(y^{(2)}_{1:1000}\) using \(\pmb {\theta }=(0.99, 1, 10{,}000, 10),\) \(T=1000\)

Table 1 Average effective sample size (ESS) and effective sample size per second (ESS/s) for Approaches 2 and 3 of the PMPMH algorithm with temporal blocks of size \(\ell =4\) (Model 1)

Both data sets are simulated with a high-variance second mixture component, selected with a low probability, resulting in infrequent but large jumps in the state process. The parameter values for Model 2 result in more occasional and larger jumps in the state process than the first. We compare the properties of the algorithm in both cases.

5.1.1 PMPMH implementation

We sample from the joint posterior distribution of the states and parameters, updating \(x_{1:T}\) and \(\pmb {\theta }\) from their conditional distributions. The sampling schemes and priors for updating \(\pmb {\theta }\) are given in Appendix A. Here, we apply the two stages of the PMPMH algorithm to update \(x_{1:T}\) conditional on \(\pmb {\theta }\). The PMPMH framework presented can be adapted in several ways, for example, by changing the deterministic integration method used to approximate the HMM probabilities, the number of grid cells, and the distribution used to propose values from within grid cells. However, we show in this section that we achieve stable and efficient performance by fixing a number of these decisions and opting for the simple choices given in Appendix B.

Instead, we focus on the practical decisions that are relevant to the efficiency of the algorithm: (a) the choice among Approaches 1–3 of Sect. 4 (including the range of the state space covered by the finite grid cells, \(\mathcal {S}\), which we set using scalar variances), (b) the number of grid cells, N, and (c) the temporal block sizes in which the states are updated, \(\ell \). We test the efficiency of the algorithm under various combinations of these tuning parameters:

  (a) the span of the finite grid cells: for Approach 1, \(\mathcal {S}=150, 250, 350, 450, 550\) units for both data sets (compared to the ranges of the observations, \(y_{1:600}^{(1)}\) and \(y_{1:1000}^{(2)}\), of 334 and 478 units respectively); for Approaches 2 and 3, \(\mathcal {S}=1, 3, 5, 7, 9, 11, 13, 15\) units for both models,

  (b) the number of grid cells, \(N=5, 10, 20\),

  (c) the temporal block size, \(\ell =1, 4, 10\), overlapping blocks by one state to improve the mixing of the states at the “boundaries” of each block, as suggested by Fearnhead (2011).

5.1.2 Results

For each combination of the tuning parameters listed above, the results are based on 10 separate runs of 10,000 iterations each, taking \(1.3-7.5\) hours on a 1.6 GHz CPU for Model 1 (depending on the number of grid cells and the approach chosen) and \(2.2-11\) hours for Model 2. We present the results for each model in Tables 1 and 2 respectively.

Table 2 Average effective sample size (ESS) and effective sample size per second (ESS/s) for Approaches 2 and 3 of the PMPMH algorithm with temporal blocks of size \(\ell =4\) (Model 2)

Increasing the number of grid cells, N, results in a more accurate approximation to \(p(x_{1:T} \vert y_{1:T}, \pmb {\theta })\) in the region covered by the finite grid cells, improving mixing but at a higher computational cost. The equally-spaced grid cell Approach 1 performed poorly on both simulated models compared to the quantile-based Approaches 2 and 3: because the high posterior regions cover a large range of values, it required a large span for both models (350 and 550 units respectively), and thus a large number of grid cells and a high computational cost, to provide a reasonable HMM approximation.

Under Approaches 2 and 3, using blocks of size \(\ell =10\) or a span for the finite cells greater than 11 units required more than \(N=20\) grid cells for convergence and resulted in much more costly and less efficient implementations than the other sets of tuning parameters. Further, using single-site updates (\(\ell =1\)) gave poor mixing when compared to blocks of size \(\ell =4\) due to the correlation between consecutive states. The results when using \(\ell =1, 10\), a finite span greater than 11 units and \(N>20\) are thus excluded from Tables 1 and 2 for brevity.

The results for Approaches 2 and 3 in Table 1 use blocks of size \(\ell =4\), and where convergent, converge within \(1000-5000\) iterations according to the Brooks–Gelman–Rubin (BGR) diagnostics of Gelman and Rubin (1992) and Brooks and Gelman (1998). In the implementations relating to Model 1, both approaches for defining the grid cell boundaries yield similar levels of efficiency when considering the effective sample size (ESS) per second, likely since the small observation error means that both approaches focus grid cells in roughly the same region of the state space around the current state. This is potentially also the reason that both methods required a small span for the finite cells and a small number of grid cells for convergence relative to the range of the data (\(1-9\) vs. 344 units, around \(0.3-2.6\%\) of the range of the data): as is true of proposal distributions that make local moves, grid cells can be focused over a smaller range of the data compared to global-move approaches and still achieve good acceptance probabilities and mixing. The smaller range also means that as few as 5 grid cells can be used to achieve good convergence properties (compared to 50 in Approach 1) via a good HMM approximation over the region, reducing the computational cost of the approaches.

For Model 2, in contrast, where the observation process variance is larger, Approach 2 departs from a local-move proposal distribution and generally exhibits poor mixing and convergence compared to Approach 3. Conversely, the local moves of Approach 3 are effective at achieving convergence and exhibit relatively stable performance even for \(N=5\). The efficient ranges for the finite grid cells are similar to those in the implementations of the first model as a percentage of the range of the data (here, \(1-2.3\%\) of the range of the data).

We also fitted both of the models using the PG and PGAS algorithms with 5–1000 particles, and various combinations of resampling thresholds based on the standard percentage ESS criterion (Cappé et al. 2005). The PG sampler did not converge for either model using as many as 1000 particles, resulting in computational times of around 14 h for Model 1 and 28 h for Model 2 on a 1.6 GHz CPU. Conversely, the PGAS sampler converged using as few as 5 particles for both models. On the whole, the PGAS sampler gave greater levels of efficiency than the PMPMH algorithm for both of the models, achieving an average ESS per second of around 4.45 for Model 1 and 1.49 for Model 2. However, this efficiency was not uniform across all states. The average ESS per second of the states simulated according to the second mixture component was 0.25–0.41 for Model 1 and 0.004–0.008 for Model 2. In contrast, where convergent, the PMPMH algorithm was more robust to the mixture component associated with each state, with the ESS per second of states in the second mixture at least \(99\%\) of those quoted in Tables 1 and 2 (the average ESS per second of the second mixture-distributed states ranging from 0.01 to 0.24 for Model 1 and 0.01–0.06 for Model 2). We now investigate the comparative performance of the algorithm on a model that can display near-chaotic behavior.

5.2 Nicholson’s blowfly model

We consider Nicholson’s Blowfly model for chaotic population growth in Wood (2010). The population counts over time, denoted by \(N_{1:T}\), arise from two correlated survival and birth processes, denoted \(S_{1:T}\) and \(R_{\tau +1:T}\), \(\tau >0\), respectively. Following Wood (2010), we let \(\exp (-\delta \epsilon _t)\) denote the daily survival probability with associated environmental error term \(\epsilon _t\), such that \(\epsilon _t \sim \Gamma (\beta _{\epsilon }, \beta _{\epsilon })\), \(\beta _{\epsilon }>0\). The survival component of the system process is given for \(t=1, \dots , T\) by

$$\begin{aligned} S_t \sim Binom(N_{t-1}, \exp (-\delta \epsilon _t)). \end{aligned}$$

Letting \(e_t\) denote an environmental noise term in the reproductive process such that \(e_t \sim \Gamma (\beta _e, \beta _e)\), \(\beta _e>0\), the reproductive component of the system process is given for \(t=\tau +1, \dots , T,\) by

$$\begin{aligned} R_t \sim Po\left( PN_{t-\tau -1} \exp (-\frac{N_{t-\tau -1}}{N_0}) e_t\right) , \end{aligned}$$
Fig. 3 Simulated blowfly population count data using \(T=300\) and \(\pmb {\theta }=(0.7, 50, 1, 0.1, 1)\)

where \(N_t = S_t + R_t\) for \(t=\tau +1, \dots , T\) and \(N_t=S_t\) for \(t=1, \dots , \tau \). We let \(\tau =5\) be the known time lag between population count and subsequent birth count, and \(N_0=50\) be the known initial population count. Further, \(\epsilon _{1:T}\) and \(e_{\tau +1:T}\) are known. The survival and birth processes, \(S_{1:T}\) and \(R_{\tau +1:T}\), are unknown latent states, with observed population counts \(y_t \sim Po(\phi N_t)\), for all t. Finally, the model parameters \(\pmb {\theta }=(\delta , P, \beta _\epsilon , \beta _e, \phi )\) are unknown and positively valued. Figure 3 shows the simulated data used in this case study, \(y_{1:T}\), using \(\pmb {\theta }=(0.7, 50, 1, 0.1, 1)\) and \(T=300\).
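For reference, a sketch simulating from this model is given below, assuming the Gamma terms are parameterized by shape and rate so that \(\epsilon _t\) and \(e_t\) have mean one; the argument order follows \(\pmb {\theta }=(\delta , P, \beta _\epsilon , \beta _e, \phi )\).

```python
import numpy as np

def simulate_blowflies(T, delta, P, beta_eps, beta_e, phi, tau=5, N0=50, rng=None):
    """Simulate Nicholson's blowfly model.

    e.g. the data in Fig. 3: simulate_blowflies(300, 0.7, 50, 1, 0.1, 1)
    """
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.gamma(beta_eps, 1.0 / beta_eps, size=T)      # survival noise epsilon_t, mean one
    e = rng.gamma(beta_e, 1.0 / beta_e, size=T)            # reproductive noise e_t, mean one
    N = np.empty(T + 1, dtype=np.int64)
    S = np.empty(T, dtype=np.int64)
    R = np.zeros(T, dtype=np.int64)
    N[0] = N0
    for t in range(T):                                     # index t corresponds to time t + 1
        S[t] = rng.binomial(N[t], np.exp(-delta * eps[t]))
        if t >= tau:                                       # births only after the time lag
            lam = P * N[t - tau] * np.exp(-N[t - tau] / N0) * e[t]
            R[t] = rng.poisson(lam)
        N[t + 1] = S[t] + R[t]
    y = rng.poisson(phi * N[1:])                           # observed counts y_t ~ Po(phi * N_t)
    return N[1:], S, R, y
```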

5.2.1 PMPMH implementation

We devise a Metropolis–Hastings scheme targeting the joint posterior distribution of the unknown latent states and model parameters, \(p(S_{1:T}, R_{\tau +1:T}, \pmb {\theta } \vert y_{1:T})\). The model parameters are updated from their full conditional distribution, \(p(\pmb {\theta } \vert S_{1:T}, R_{\tau +1:T}, y_{1:T})\), using the priors and sampling schemes given in Appendix A. In this section, we describe how the PMPMH algorithm can efficiently update the latent states, \(S_{1:T}\) and \(R_{\tau +1:T}\), conditional on \(\pmb {\theta }\).

First, the state process is now two-dimensional at each time point. To simplify the design of the grid used within the PMPMH algorithm, we build separate PMPMH proposal distributions targeting the full conditional distributions in each state dimension, \(p(S_{1:T} \vert R_{\tau +1:T}, y_{1:T}, \pmb {\theta })\) and \(p(R_{\tau +1:T} \vert S_{1:T}, y_{1:T}, \pmb {\theta })\), respectively. To sample from each of the full conditional distributions in turn, we keep a number of the practical decisions the same as in the previous example of Sect. 5.1. We first fix several of the practical decisions to those in Appendix B and use blocks of size \(\ell =4\), overlapping blocks by one state to reduce the correlation between states at the block boundaries. We test the performance of the algorithm for \(N \in \{10, 20, 50\}\) grid cells.

There are two main differences in this implementation compared to the previous example. The first accounts for the near-chaotic state processes and the large range of the data (\(1.59 \times 10^{5}\) units). We apply the current state-centered Approach 3 of Sect. 4, which requires fewer grid cells for good mixing properties due to the lower span for the finite cells required to achieve efficient (more local) moves. Within this approach, we permit large variability around large values of the state process by adjusting the variance of the Gaussian distributions used to set the boundaries of the grid cells: we set the variance proportional to the current state at each time point. We try factors of proportionality of 0.1, 0.25, 0.5, 1, with finite grid cells between the q and \(1-q\) quantiles for \(q=0.01, 0.1, 0.2\), resulting in (average) spans for the finite cells ranging from \(\mathcal {S}=52\) to \(400\) units. The second difference is that we ensure the grid cells are over a discrete space bounded at zero by rounding the quantiles used to determine the grid cells, setting the lower boundary to zero if needed.
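As an illustration of this grid construction, the cell boundaries for one count-valued state dimension might be computed as in the sketch below; the floor on the variance is an illustrative safeguard, and a full implementation would merge any duplicate boundaries created by the rounding.

```python
import numpy as np
from scipy import stats

def count_state_grid(x_cur, N, c, q):
    """Sketch of the Sect. 5.2 grid for one count-valued state dimension.

    Boundaries are Gaussian quantiles between q and 1 - q, centred on the
    current state with variance c * x_cur, rounded to integers and bounded
    below by zero.
    """
    x_cur = np.asarray(x_cur, dtype=float)
    probs = np.linspace(q, 1.0 - q, N - 1)                 # N - 1 finite boundary levels
    sd = np.sqrt(np.maximum(c * x_cur, 1.0))               # floor of 1 avoids zero variance (illustrative)
    inner = np.rint(stats.norm.ppf(probs)[None, :] * sd[:, None] + x_cur[:, None])
    inner = np.maximum(inner, 0.0)
    grid = np.empty((len(x_cur), N + 1))
    grid[:, 0], grid[:, -1] = 0.0, np.inf                  # state space bounded below at zero
    grid[:, 1:-1] = inner
    return grid
```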

5.2.2 Results

We assessed the performance of the PMPMH algorithm using 10 independent MCMC simulations of 50,000 iterations and present the performance metrics in Table 3. Where the implementations of the PMPMH algorithm converged, convergence was achieved within \(4000-18{,}000\) iterations, taking \(11-27\) hours on a 1.6 GHz CPU. Fairly consistent and efficient performance (in terms of ESS per second, Table 3) is achieved using \(20-50\) grid cells with the finite cells covering a region of the state space \(0.05-0.25\%\) of the range of the data (\(\mathcal {S} \in [82, 400]\)).

Table 3 Average ESS and ESS/s across the state and parameter samples for various sets of tuning parameters under Approach 3 with blocks of size \(\ell =4\)

Approach 3 mixes effectively by producing a reasonable HMM approximation in a relatively small region around the current state and is also computationally cheap to implement. The efficiency of the algorithm decreases at the upper limit of 50 grid cells, indicating that the improved mixing properties are not justified by the extra cost compared with using fewer grid cells. Conversely, 10 grid cells, although lower in computational cost, do not produce a sufficiently accurate approximation to the posterior in the region of the current state. We also note that, similarly to the previous case study, implementations relying on extremely small moves around the state space (finite cells \(<0.05\%\) of the range of the data) gave poor mixing and convergence properties.

The PGAS sampler, when applied to the model using both joint updating of the latent states and updating each latent state process in turn from its full conditional distribution, did not converge in 50,000 iterations with 1000 particles, remaining in a range of the state space unrepresentative of the posterior due to sample impoverishment and taking 160 hours on a 1.6 GHz CPU.

6 Discussion

We provide a novel and efficient approach for fitting general SSMs to observed data. The approach uses tractable HMM approximations to efficiently update the unobserved latent states in a Metropolis–Hastings algorithm. We demonstrate the generality of the proposed approach by its application to two problems, including a challenging near-chaotic problem. The proposed PMPMH algorithm is demonstrated to provide reliable posterior estimates within reasonable computational time frames, especially when compared to a state-of-the-art method, which did not converge within a reasonable time frame for the near-chaotic problem. The flexibility of the PMPMH approach via its tuning parameters, including the location, size, and number of grid cells, and the temporal block size, provides an adaptable and efficient algorithm. Using simple methods for placing the grid cells, such as equally-sized grid cells, can be efficient since they are computationally cheap. However, such simple methods may require a large number of grid cells to achieve good mixing if the range of the data is large. A current state-centered approach for approximating the HMM is useful, especially when the state space covers a large range, whereas a data-centered approach is useful provided the range of the data is small.

The PMPMH approach motivates several interesting points for future research. One avenue for research is to consider the computational efficiency of parallelization when applied to different components of the algorithm. The computational efficiency of algorithms using block-updating strategies can be improved via parallelization (King et al. 2023). Within the framework of the PMPMH algorithm specifically, serially independent blocks of states can be updated in parallel within each Metropolis–Hastings iteration. Further, for the state-centered grid cell approach presented here, computing the approximate HMM probabilities in each block in parallel can lead to an improvement in the execution time of the approximate HMM computations. However, as with parallelization schemes for particle MCMC methods, such as in Henriksen et al. (2012), the user must balance memory limitations and the computational cost of re-synchronization against the computational gains from parallel implementation. Within the sequential computing context of this paper, one way to reduce the overall computational cost of the algorithm is to reduce the number of times the HMM approximation is calculated across iterations. This is valid for approximation methods that do not depend on the current state or model parameters. For deterministic grids that adapt to previous samples (for example, Approaches 2 and 3), however, the resulting posterior estimate may become biased since the chain is no longer Markovian (Haario et al. 1999). Haario et al. (1999) also show that the bias introduced is negligible in some applications and that unbiased samples can be obtained by introducing some pre-determined stopping criterion for updating the grid, that is, specifying a number of iterations beyond which the grid is no longer updated. The computational gains from updating the HMM approximation less frequently should, however, be balanced against potentially reduced mixing properties. For example, local-move grids using a low span for finite cells are more sensitive to the frequency of the HMM approximation since they rely on proposed moves being made in the region of the current state. Updating the grid less frequently may nevertheless give an efficient approach when applied to more global-move samplers.

One may also attempt to find efficient ways to adapt the HMM approximations. With any such adaptations, there is a trade-off between the improved mixing properties and the computational cost arising from the complexity of the method. For example, in Step 1 of the algorithm, more complex numerical integration methods could be used to approximate the SSM density in each grid cell. Such methods include increasing the number of points in each grid cell used for approximation calculations, simple linear functions joining evaluations at midpoints, or other fast evaluations that include the gradient in each grid cell. There may also be ways to improve the efficiency of the within-cell proposal distributions in Step 2, including approaches that approximate the posterior density in each grid cell where the conditional posteriors are irregular or vary in an unsystematic way depending on the value of the parameter. These approaches could be valuable to explore for more complex SSMs, in which the improved mixing may justify the added computational cost. Further, these approaches to improving mixing in each step may permit the use of fewer grid cells, decreasing the overall computational cost, which may also permit greater scalability of the HMM to higher dimensional latent spaces. Although we demonstrated the scalability of the proposed algorithm to higher dimensional spaces by updating each state dimension conditionally on the remaining state dimensions, higher dimensional spaces are a particular challenge if individual state dimensions are highly correlated, resulting in poor mixing. In this case, if low-dimensional sets of state dimensions are independent, for example, Factorial SSMs (Ghahramani and Jordan 1995; Rimella and Whiteley 2022), or if dimensions of the state space are partially integrable (Borowska and King 2023), the poor mixing from highly-correlated state dimensions could be improved by performing joint updates using lower-dimensional transition matrix approximations. However, in general, the issue of the scalability of grid-based methods to high-dimensional SSMs is a challenge to the proposed algorithm and an active area of research.