As discussed in Sects. 2 and 3, inference without sampling from (9), (10) or (12) is, theoretically, possible. Indeed, since the sun densities of the filtering, predictive and smoothing distributions can be obtained from Theorems 1–2, the main functionals of interest can be computed via closed-form expressions (Arellano-Valle and Azzalini 2006; Azzalini and Bacchieri 2010; Gupta et al. 2013; Azzalini and Capitanio 2014) or by relying on numerical integration. However, these strategies require evaluations of multivariate Gaussian cumulative distribution functions, which tend to be impractical as t grows or when the focus is on complex functionals.
In these situations, Monte Carlo integration provides an accurate procedure to evaluate the generic functionals \({\mathbb {E}}[g({\varvec{\theta }}_{t}) \mid \mathbf{y}_{1:t}]\), \({\mathbb {E}}[g({\varvec{\theta }}_{t}) \mid \mathbf{y}_{1:t-1}]\) and \({\mathbb {E}}[g({\varvec{\theta }}_{1:n}) \mid \mathbf{y}_{1:n}]\) of the filtering, predictive and smoothing distributions, respectively, via
$$\begin{aligned} \frac{1}{R}\sum _{r=1}^Rg({\varvec{\theta }}^{(r)}_{t\mid t}), \quad \frac{1}{R}\sum _{r=1}^Rg({\varvec{\theta }}^{(r)}_{t\mid t-1}), \quad \frac{1}{R}\sum _{r=1}^Rg({\varvec{\theta }}^{(r)}_{1:n\mid n}), \end{aligned}$$
with \({\varvec{\theta }}^{(r)}_{t\mid t}\), \({\varvec{\theta }}^{(r)}_{t\mid t-1}\) and \({\varvec{\theta }}^{(r)}_{1:n\mid n}\) sampled from \(p({\varvec{\theta }}_{t} \mid \mathbf{y}_{1:t})\), \(p({\varvec{\theta }}_{t} \mid \mathbf{y}_{1:t-1})\) and \(p({\varvec{\theta }}_{1:n} \mid \mathbf{y}_{1:n})\), respectively. For example, if the evaluation of (11) is demanding, the predictive density of the observations can be easily computed as \(\sum _{r=1}^R \varPhi _m(\mathbf{B}_{t}\mathbf{F}_{t}{\varvec{\theta }}^{(r)}_{t \mid t-1}; \mathbf{B}_{t}\mathbf{V}_{t}\mathbf{B}_{t})/R\).
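For concreteness, a minimal R sketch of this Monte Carlo step is reported below. It only assumes that i.i.d. samples are available as the columns of a matrix, with the functional \(g\) supplied by the user; the function and object names are illustrative, not part of the routines developed in this article.

```r
## Monte Carlo estimate of E[g(theta) | y] from i.i.d. samples stored as
## columns of `theta_samples` (p x R); `g` maps a state vector to a
## scalar or a vector.
mc_functional <- function(theta_samples, g) {
  vals <- apply(theta_samples, 2, g)
  if (is.null(dim(vals))) mean(vals) else rowMeans(vals)
}

## e.g., posterior mean and P(theta[1] > 0 | y):
## mc_functional(theta_filt, identity)
## mc_functional(theta_filt, function(th) as.numeric(th[1] > 0))
```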
To be implemented, the above approach requires an efficient strategy to sample from (9), (10) and (12). Exploiting the sun properties and recent results in Botev (2017), an algorithm to draw independent and identically distributed samples from the exact sun distributions in (9), (10) and (12) is developed in Sect. 4.1. As illustrated in Sect. 5, such a technique is more accurate than state-of-the-art methods and can be efficiently implemented in small-to-moderate dimensional time series. In Sect. 4.2 we develop, instead, novel sequential Monte Carlo schemes that allow scalable online learning in high-dimensional settings and have optimality properties (Doucet et al. 2000) which shed new light also on existing strategies (e.g., Andrieu and Doucet 2002).
Independent identically distributed sampling
As discussed in Sect. 1, mcmc and sequential Monte Carlo methods to sample from \(p({\varvec{\theta }}_{t} \mid \mathbf{y}_{1:t})\), \(p({\varvec{\theta }}_{t} \mid \mathbf{y}_{1:t-1})\) and \(p({\varvec{\theta }}_{1:n} \mid \mathbf{y}_{1:n})\) are available. However, the commonly recommended practice, when feasible, is to rely on independent and identically distributed (i.i.d.) samples. Here, we derive a Monte Carlo algorithm to address this goal with a main focus on the smoothing distribution, and discuss direct modifications to allow sampling also in the filtering and predictive cases. Indeed, Monte Carlo inference is particularly suitable for batch settings, although, as discussed later, the proposed routine is practically useful also when the focus is on filtering and predictive distributions, since i.i.d. samples can be simulated rapidly, for each t, in small-to-moderate dimensions.
Exploiting the closed-form expression of the smoothing distribution in Theorem 2, and the additive representation (7) of the sun, i.i.d. samples for \({\varvec{\theta }}_{1:n\mid n}\) from the smoothing distribution (12) can be obtained via a linear combination of independent samples from (pn)-variate Gaussians and (mn)-variate truncated normals. Algorithm 1 provides the detailed pseudo-code for this novel strategy, whose outputs are i.i.d. samples from the joint smoothing density \(p({\varvec{\theta }}_{1:n} \mid \mathbf{y}_{1:n})\). Here, the most computationally intensive step is the sampling from \(\textsc {tn}_{m n}(\mathbf{0},{\varvec{\varGamma }}_{1:n\mid n}; {\mathbb {A}}_{{\varvec{\gamma }}_{1:n\mid n}})\), which denotes the multivariate Gaussian distribution \(\text{ N}_{m n}(\mathbf{0},{\varvec{\varGamma }}_{1:n\mid n})\) truncated to the region \({\mathbb {A}}_{{\varvec{\gamma }}_{1:n\mid n}}=\{\mathbf{u}_1 \in {\mathbb {R}}^{m n}: \mathbf{u}_1+{\varvec{\gamma }}_{1:n\mid n}>0\}\). In fact, although efficient Hamiltonian Monte Carlo solutions are available (Pakman and Paninski 2014), these strategies do not provide independent samples. More recently, an accept-reject method based on minimax tilting has been proposed by Botev (2017) to improve the acceptance rate of classical rejection sampling, while avoiding the mixing issues of mcmc. This routine is available in the R library TruncatedNormal and allows efficient sampling from multivariate truncated normals with dimension up to a few hundred, thereby providing effective Monte Carlo inference via Algorithm 1 in small-to-moderate dimensional time series where mn is of the order of a few hundred.
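To fix ideas, a hedged R sketch of the core of Algorithm 1 is given below for a generic \(\textsc {sun}_{p,m}({\varvec{\xi }}, {\varvec{\varOmega }}, {\varvec{\varDelta }}, {\varvec{\gamma }}, {\varvec{\varGamma }})\); in the smoothing case one would set the dimensions to pn and mn and plug in the parameters of (12) from Theorem 2, whose computation is assumed done upstream. The function and argument names are illustrative; the truncated component is drawn with the minimax-tilting sampler mvrandn of the TruncatedNormal library.

```r
library(mvtnorm)          # rmvnorm
library(TruncatedNormal)  # mvrandn (Botev 2017)

## i.i.d. draws from sun_{p,m}(xi, Omega, Delta, gamma, Gamma) via the
## additive representation (7): theta = xi + omega (U0 + Delta Gamma^{-1} U1),
## with U0 Gaussian and U1 multivariate truncated normal.
rsun <- function(R, xi, Omega, Delta, gamma, Gamma) {
  p <- length(xi); m <- length(gamma)
  omega     <- diag(sqrt(diag(Omega)), p)   # diagonal scale matrix
  Omega_bar <- cov2cor(Omega)               # correlation matrix of Omega
  DGi       <- Delta %*% solve(Gamma)
  ## Gaussian part: U0 ~ N_p(0, Omega_bar - Delta Gamma^{-1} Delta')
  U0 <- t(rmvnorm(R, sigma = Omega_bar - DGi %*% t(Delta)))
  ## Truncated part: U1 ~ N_m(0, Gamma) restricted to {u : u + gamma > 0}
  U1 <- mvrandn(l = -gamma, u = rep(Inf, m), Sig = Gamma, n = R)
  xi + omega %*% (U0 + DGi %*% U1)          # p x R matrix of i.i.d. draws
}
```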
Clearly, the availability of an i.i.d. sampling scheme for the smoothing distribution obviates the need for mcmc methods and particle smoothers. The first set of strategies usually faces mixing or time-inefficiency issues, especially in imbalanced binary settings (Johndrow et al. 2019), whereas the second class of routines tends to be computationally intensive and subject to particle degeneracy (Doucet and Johansen 2009).
When the interest is in Monte Carlo inference for the marginal smoothing density \(p({\varvec{\theta }}_t \mid \mathbf{y}_{1:n})\) at a specific time t, Algorithm 1 requires only minor adaptations, relying again on the additive representation of the sun in (13), under arguments similar to those considered for the joint smoothing setting. This latter routine can also be used to sample from the filtering distribution in (10), by applying such a scheme with \(n=t\) to obtain i.i.d. samples for \({\varvec{\theta }}_{t\mid t}\) from \(p({\varvec{\theta }}_{t} \mid \mathbf{y}_{1:t})\). Leveraging realizations from the filtering distribution at time \(t-1\), i.i.d. samples for \({\varvec{\theta }}_{t \mid t-1}\) from the predictive density \(p({\varvec{\theta }}_{t} \mid \mathbf{y}_{1:t-1})\) can be simply obtained via the direct application of (2), which provides samples for \({\varvec{\theta }}_{t\mid t-1}\) from \(\text{ N}_p(\mathbf{G}_{t}{\varvec{\theta }}_{t-1\mid t-1}, \mathbf{W}_{t})\). As a result, efficient Monte Carlo inference in small-to-moderate dimensional dynamic probit models is possible also for filtering and predictive distributions.
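As a one-line illustration of this predictive step, filtering draws at time \(t-1\), stored as the columns of a hypothetical p x R matrix theta_filt, can be propagated through the state equation (2):

```r
## draws from p(theta_t | y_{1:t-1}) given filtering draws at t-1;
## G_t and W_t are the transition matrix and state noise covariance in (2)
theta_pred <- G_t %*% theta_filt +
  t(mvtnorm::rmvnorm(ncol(theta_filt), sigma = W_t))
```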
Sequential Monte Carlo sampling
When the dimension of the dynamic probit model (1)–(2) grows, sampling from multivariate truncated Gaussians in Algorithm 1 might yield computational bottlenecks (Botev 2017). This is particularly likely to occur in series monitored on a fine time grid. Indeed, in several applications, the number of time series m is typically small, whereas the length of the time window can be large. To address this issue and allow scalable online filtering and prediction also in large t settings, we first derive in Sect. 4.2.1 a particle filter which exploits the sun results to obtain optimality properties, in the sense of Doucet et al. (2000). Despite covering a gap in the literature on dynamic probit models, as clarified in Sects. 4.2.1 and 4.2.2, such a strategy is amenable to further improvements since it induces unnecessary autocorrelation in the Gaussian part of the sun generative representation. Motivated by this consideration and by the additive structure of the sun filtering distribution, we further develop in Sect. 4.2.2 a partially collapsed sequential Monte Carlo procedure which recursively samples via lookahead methods (Lin et al. 2013) only the multivariate truncated normal term in the sun additive representation, while keeping the Gaussian component exact. As outlined in Sect. 4.2.2, such a broad class of partially collapsed lookahead particle filters comprises, as a special case, the Rao–Blackwellized particle filter developed by Andrieu and Doucet (2002). This provides novel theoretical support to the notable performance of such a strategy, which was originally motivated, in the context of dynamic probit models, also by the lack of a closed-form optimal particle filter for the states.
“Optimal” particle filter
The first proposed strategy belongs to the class of sequential importance sampling-resampling (sisr) algorithms, which provide default strategies in particle filtering (e.g., Doucet et al. 2000, 2001; Durbin and Koopman 2012). For each time t, these routines sample R trajectories for \({\varvec{\theta }}_{1:t|t}\) from \(p({\varvec{\theta }}_{1:t} \mid \mathbf{y}_{1:t})\), known as particles, conditioned on those produced at time \(t-1\), by iterating over time between the two steps summarized below.
1. Sampling. Let \({\varvec{\theta }}_{1:t-1\mid t-1}^{(1)},\ldots ,{\varvec{\theta }}_{1:t-1\mid t-1}^{(R)}\) be the trajectories of the particles at time \(t-1\), and denote by \(\pi ({\varvec{\theta }}_{t} \mid {\varvec{\theta }}_{1:t-1}, \mathbf{y}_{1:t})\) the proposal density. Then, for \(r=1, \ldots , R\):
[1.a] Sample \({\bar{{\varvec{\theta }}}}_{t \mid t}^{(r)}\) from \(\pi ({{\varvec{\theta }}}_{t} \mid {\varvec{\theta }}_{1:t-1 \mid t-1}^{(r)}, \mathbf{y}_{1:t} )\) and set
$$\begin{aligned} {\bar{{\varvec{\theta }}}}_{1:t \mid t}^{(r)} = ({\varvec{\theta }}_{1:t-1 \mid t-1}^{(r)\intercal }, {\bar{{\varvec{\theta }}}}_{t \mid t}^{(r) \intercal } )^{\intercal }. \end{aligned}$$
[1.b] Set \(w^{(r)}_t= w_t({\bar{{\varvec{\theta }}}}_{1:t \mid t}^{(r)})\), with
$$\begin{aligned} w_t({\bar{{\varvec{\theta }}}}_{1:t \mid t}^{(r)}) \propto \frac{p(\mathbf{y}_t \mid {\bar{{\varvec{\theta }}}}_{t \mid t}^{(r)})p({\bar{{\varvec{\theta }}}}_{t \mid t}^{(r)} \mid {\varvec{\theta }}_{t-1 \mid t-1}^{(r)})}{\pi ( {\bar{{\varvec{\theta }}}}^{(r)}_{t \mid t} \mid {\varvec{\theta }}_{1:t-1 \mid t-1}^{(r)}, \mathbf{y}_{1:t} )}, \end{aligned}$$
and normalize the weights, so that their sum is 1.
2. Resampling. For \(r=1, \ldots , R\), sample updated particles’ trajectories \({\varvec{\theta }}_{1:t\mid t}^{(1)},\ldots ,{\varvec{\theta }}_{1:t\mid t}^{(R)}\) from \(\sum _{r=1}^{R} w_t^{(r)}\delta _{{\bar{{\varvec{\theta }}}}_{1:t \mid t}^{(r)}}\).
From these particles, functionals of the filtering density \(p({\varvec{\theta }}_t \mid \mathbf{y}_{1:t})\) can be computed using the terminal values \({\varvec{\theta }}_{t|t}\) of each particle’s trajectory for \({\varvec{\theta }}_{1:t|t}\). Note that in point [1.a] we have presented the general formulation of sisr, where the importance density \(\pi ({\varvec{\theta }}_{t} \mid {\varvec{\theta }}_{1:t-1}, \mathbf{y}_{1:t})\) can, in principle, depend on the whole trajectory \({\varvec{\theta }}_{1:t-1}\) (Durbin and Koopman 2012, Sect. 12.3).
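A compact R sketch of one such iteration is given below, simplified to track only the terminal states, which, as noted above, suffice for filtering functionals. The four user-supplied functions are illustrative placeholders for the proposal sampler and the three log-densities entering the weights in [1.b].

```r
## One sisr iteration: `particles` is p x R (states at t-1); rprop
## samples from the proposal, and dprop, dtrans, dlik evaluate the
## log-proposal, log-transition and log-likelihood column-wise.
sisr_step <- function(particles, y_t, rprop, dprop, dtrans, dlik) {
  R <- ncol(particles)
  theta_new <- rprop(particles, y_t)                    # step [1.a]
  logw <- dlik(y_t, theta_new) +                        # step [1.b]
          dtrans(theta_new, particles) -
          dprop(theta_new, particles, y_t)
  w <- exp(logw - max(logw)); w <- w / sum(w)           # normalize
  idx <- sample.int(R, R, replace = TRUE, prob = w)     # step 2: resample
  list(particles = theta_new[, idx, drop = FALSE], weights = w)
}
```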
As is clear from the above routine, the performance of sisr depends on the choice of \(\pi ({\varvec{\theta }}_{t} \mid {\varvec{\theta }}_{1:t-1}, \mathbf{y}_{1:t})\). Such a density should allow tractable sampling along with efficient evaluation of the importance weights, and should also be carefully specified to propose effective candidate samples. Recalling Doucet et al. (2000), the optimal proposal is \(\pi ({\varvec{\theta }}_{t} \mid {\varvec{\theta }}_{1:t-1}, \mathbf{y}_{1:t})=p( {\varvec{\theta }}_{t} \mid {\varvec{\theta }}_{t-1}, \mathbf{y}_{t} )\), with importance weights \(w_t \propto p(\mathbf{y}_t \mid {\varvec{\theta }}_{t-1} )\). Indeed, conditioned on \({\varvec{\theta }}_{1:t-1|t-1}\) and \(\mathbf{y}_{1:t}\), this choice minimizes the variance of the weights, thus limiting degeneracy issues and improving mixing. Unfortunately, in several dynamic models, tractable sampling from \(p( {\varvec{\theta }}_{t} \mid {\varvec{\theta }}_{t-1}, \mathbf{y}_{t} )\) and the direct evaluation of \(p(\mathbf{y}_t \mid {\varvec{\theta }}_{t-1} )\) are not possible (Doucet et al. 2000). As outlined in Corollary 4, this is not the case for dynamic probit models. In particular, by leveraging the proof of Theorem 1 and the closure properties of the sun, sampling from \(p( {\varvec{\theta }}_{t} \mid {\varvec{\theta }}_{t-1}, \mathbf{y}_{t} )\) is straightforward and \(p(\mathbf{y}_t \mid {\varvec{\theta }}_{t-1} )\) has a simple form.
Corollary 4
For every time \(t=1,\ldots ,n\), the optimal importance distribution under model (1)–(2) is
$$\begin{aligned}&( {\varvec{\theta }}_{t} \mid {\varvec{\theta }}_{t-1}, \mathbf{y}_{t} ) \\&\quad \sim {\textsc {sun}}_{p,m}({\varvec{\xi }}_{t \mid t,t-1}, {\varvec{\varOmega }}_{t\mid t,t-1}, {\varvec{\varDelta }}_{t\mid t, t-1},{\varvec{\gamma }}_{t\mid t, t-1}, {\varvec{\varGamma }}_{t\mid t, t-1}), \nonumber \end{aligned}$$
(14)
whereas the importance weights are
$$\begin{aligned} p(\mathbf{y}_{t} \mid {\varvec{\theta }}_{t-1}) = \varPhi _m({\varvec{\gamma }}_{t\mid t, t-1}; {\varvec{\varGamma }}_{t\mid t, t-1}), \end{aligned}$$
(15)
with parameters defined by the recursive equations
$$\begin{aligned} {\varvec{\xi }}_{t \mid t,t-1}&=\mathbf{G}_t {\varvec{\theta }}_{t-1}, \quad {\varvec{\varOmega }}_{t\mid t,t-1}=\mathbf{W}_t,\\ {\varvec{\varDelta }}_{t\mid t, t-1}&={\bar{{\varvec{\varOmega }}}}_{t\mid t,t-1}{\varvec{\omega }}_{t\mid t,t-1}\mathbf{F}_t^{\intercal }\mathbf{B}_t\mathbf{c}_t^{-1},\\ {\varvec{\gamma }}_{t\mid t, t-1}&=\mathbf{c}_t^{-1}\mathbf{B}_t\mathbf{F}_t {\varvec{\xi }}_{t \mid t,t-1}, \\ {\varvec{\varGamma }}_{t\mid t, t-1}&=\mathbf{c}_t^{-1}\mathbf{B}_t\left( \mathbf{F}_t {\varvec{\varOmega }}_{t\mid t,t-1}\mathbf{F}_t^{\intercal }{+}\mathbf{V}_t\right) \mathbf{B}_t \mathbf{c}_t^{-1}, \end{aligned}$$
where \(\mathbf{c}_t = \left[ (\mathbf{F}_t {\varvec{\varOmega }}_{t\mid t,t-1}\mathbf{F}_t^{\intercal }+\mathbf{V}_t)\odot \mathbf{I}_m\right] ^{1/2}\).
As clarified in Corollary 4, the weights \(p(\mathbf{y}_{t} \mid {\varvec{\theta }}_{t-1})\) for the generated trajectories are available analytically in (15) and do not depend on the sampled values of the particles at time t. This allows the implementation of the more efficient auxiliary particle filter (auf) (Pitt and Shephard 1999) by simply reversing the order of the sampling and resampling steps, thereby obtaining a performance gain (Andrieu and Doucet 2002). Algorithm 2 illustrates the pseudo-code of the proposed “optimal” auxiliary filter, which exploits the additive representation of the sun and Corollary 4. Note that, unlike Algorithm 1, such a sequential sampling strategy requires sampling at each step from a truncated normal whose dimension does not depend on t, thus facilitating scalable sequential inference in large t studies. Samples from the predictive distribution can be obtained from those of the filtering as discussed in Sect. 4.1.
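As an illustration, the following hedged R sketch evaluates the auxiliary weight (15) for a single particle via the parameter recursions of Corollary 4, using mvtnorm::pmvnorm for the multivariate Gaussian cumulative distribution function. All names are illustrative, with \(\mathbf{B}_t\) the sign matrix diag\((2\mathbf{y}_t-\varvec{1})\) and the model matrices as in (1)–(2); this is a sketch of the weight computation, not the authors’ full Algorithm 2.

```r
## Unnormalized optimal weight p(y_t | theta_{t-1}) of Eq. (15) for one
## particle theta_prev, following the recursions of Corollary 4.
opt_weight <- function(theta_prev, B_t, G_t, W_t, F_t, V_t) {
  xi  <- G_t %*% theta_prev                    # xi_{t|t,t-1}
  S   <- F_t %*% W_t %*% t(F_t) + V_t          # F_t Omega_{t|t,t-1} F_t' + V_t
  c_t <- diag(sqrt(diag(S)), nrow(S))          # c_t of Corollary 4
  gamma_t <- solve(c_t, B_t %*% F_t %*% xi)    # gamma_{t|t,t-1}
  Gamma_t <- solve(c_t, B_t %*% S %*% B_t %*% solve(c_t))  # Gamma_{t|t,t-1}
  Gamma_t <- (Gamma_t + t(Gamma_t)) / 2        # enforce symmetry numerically
  mvtnorm::pmvnorm(upper = as.vector(gamma_t), sigma = Gamma_t)[1]
}
```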
Despite its optimality properties, a close inspection of Algorithm 2 shows that the states’ particles at \(t-1\) affect both the Gaussian component, via \({\varvec{\xi }}_{t \mid t,t-1}\), and the truncated normal term, via \({\varvec{\gamma }}_{t \mid t,t-1}\), in the sun additive representation of \(({\varvec{\theta }}_t \mid \mathbf{y}_{1:t})\). Although the autocorrelation in the multivariate truncated normal samples is justified by the computational intractability of this variable in high dimensions, inducing serial dependence also in the Gaussian terms seems unnecessary, as these quantities are tractable and their dimension does not depend on t; see Theorem 1. This suggests that a strategy which sequentially updates only the truncated normal term, while keeping the Gaussian part exact, could further improve the performance of Algorithm 2. This new particle filter is derived in Sect. 4.2.2, also inheriting lookahead ideas (Lin et al. 2013).
Partially collapsed lookahead particle filter
As anticipated in Sect. 4.2, the most computationally intensive step to draw i.i.d. samples from the filtering distribution is sampling from the multivariate truncated normal \(\mathbf{U}_{1 \ 1:t\mid t}\sim \textsc {tn}_{m t}({\varvec{0}},{\varvec{\varGamma }}_{1:t\mid t}; {\mathbb {A}}_{{\varvec{\gamma }}_{1:t\mid t}})\) in Algorithm 1. Here, we present a class of procedures to sequentially generate these samples, which are then combined with realizations from the exact Gaussian component in the sun additive representation, thus producing samples from the filtering distribution. With this goal in mind, define the region \({\mathbb {A}}_{\mathbf{y}_{s:t}}= \{\mathbf{z}\in {\mathbb {R}}^{m(t-s+1)}: (2\mathbf{y}_{s:t}-{\varvec{1}})\odot \mathbf{z}>{\varvec{0}}\}\) for every \(s=1, \ldots , t\), and let \(\mathbf{V}_{1:t}\) be the \((mt) \times (mt)\) block-diagonal matrix having blocks \(\mathbf{V}_{[ss]}=\mathbf{V}_s\), for \(s=1, \ldots , t\). Moreover, denote by \(\mathbf{B}_{s:t}\) and \(\mathbf{F}_{s:t}\) two block-diagonal matrices of dimension \([m(t-s+1)]\times [m (t-s+1)]\) and \([m (t-s+1)]\times [p (t-s+1)]\), respectively, with diagonal blocks \(\mathbf{B}_{s:t [ll]} = \mathbf{B}_{s+l-1}\) and \(\mathbf{F}_{s:t [ll]}=\mathbf{F}_{s+l-1}\) for \(l=1,\ldots ,t-s+1\). Exploiting this notation and adapting the results in Sect. 3.2 to the case \(n=t\), it follows from standard properties of multivariate truncated normals (Horrace 2005) that
$$\begin{aligned} \mathbf{U}_{1 \ 1:t\mid t} {\mathop {=}\limits ^{\mathrm{d}}} -{\varvec{\gamma }}_{1:t\mid t} + \mathbf{s}_{1:t\mid t}^{-1}\mathbf{B}_{1:t}{\mathbf{z}}_{1:t\mid t}, \end{aligned}$$
(16)
with \({\mathbf{z}}_{1:t\mid t} \sim \textsc {tn}_{m t}(\mathbf{F}_{1:t}{\varvec{\xi }}_{1:t\mid t}, \mathbf{F}_{1:t}{\varvec{\varOmega }}_{1:t\mid t}\mathbf{F}_{1:t}^{\intercal }+\mathbf{V}_{1:t}; {\mathbb {A}}_{\mathbf{y}_{1:t}})\) and \(\mathbf{s}_{1:t\mid t} = [(\mathbf{D}{\varvec{\varOmega }}_{1:t\mid t}\mathbf{D}^{\intercal }+{\varvec{\varLambda }}) \odot \text{ I}_{m t}]^{1/2}\), where \(\mathbf{D}\) and \({\varvec{\varLambda }}\) are defined as in Sect. 3.2, setting \(n=t\). Note that the multivariate truncated normal distribution for \({\mathbf{z}}_{1:t\mid t}\) actually coincides with the conditional distribution of \(\mathbf{z}_{1:t}\) given \(\mathbf{y}_{1:t}\) under model (3)–(5). Indeed, by marginalizing out \({\varvec{\theta }}_{1:t}\) in \(p(\mathbf{z}_{1:t} \mid {\varvec{\theta }}_{1:t})=\prod _{s=1}^t\phi _m(\mathbf{z}_{s}-\mathbf{F}_s{\varvec{\theta }}_{s};\mathbf{V}_s)=\phi _{m t}(\mathbf{z}_{1:t}-\mathbf{F}_{1:t}{\varvec{\theta }}_{1:t}; \mathbf{V}_{1:t})\) with respect to its multivariate normal distribution derived in the proof of Theorem 2, we have \(p(\mathbf{z}_{1:t}) = \phi _{m t}(\mathbf{z}_{1:t}-\mathbf{F}_{1:t}{\varvec{\xi }}_{1:t\mid t}; \mathbf{F}_{1:t}{\varvec{\varOmega }}_{1:t\mid t}\mathbf{F}_{1:t}^\intercal +\mathbf{V}_{1:t})\) and, as a direct consequence, we obtain
$$\begin{aligned} p(\mathbf{z}_{1:t}\mid \mathbf{y}_{1:t})&\propto p(\mathbf{z}_{1:t})p(\mathbf{y}_{1:t} \mid \mathbf{z}_{1:t}), \\&\propto p(\mathbf{z}_{1:t})\mathbb {1}[ (2\mathbf{y}_{1:t}-{\varvec{1}})\odot \mathbf{z}_{1:t}>{\varvec{0}}], \end{aligned}$$
which is the kernel of the \(\textsc {tn}_{m t}(\mathbf{F}_{1:t}{\varvec{\xi }}_{1:t\mid t}, \mathbf{F}_{1:t}{\varvec{\varOmega }}_{1:t\mid t}\mathbf{F}_{1:t}^{\intercal }+\mathbf{V}_{1:t}; {\mathbb {A}}_{\mathbf{y}_{1:t}})\) density.
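In R, a draw of \(\mathbf{U}_{1 \ 1:t\mid t}\) is then recovered from a draw of \(\mathbf{z}_{1:t\mid t}\) by a cheap elementwise map, since both \(\mathbf{s}_{1:t\mid t}\) and \(\mathbf{B}_{1:t}\) are diagonal. A minimal sketch, with all inputs assumed precomputed as in Sect. 3.2:

```r
## Eq. (16): U_1 = -gamma + s^{-1} B z, with s and B diagonal.
## `z` is a draw of z_{1:t|t} (length mt), `gamma` is gamma_{1:t|t},
## `s_diag` holds the diagonal of s_{1:t|t}, and `y` stacks y_{1:t}.
u1_from_z <- function(z, gamma, s_diag, y) {
  signs <- 2 * as.vector(y) - 1        # diagonal of B_{1:t}
  -gamma + (signs * z) / s_diag
}
```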
The above analytical derivations clarify that, in order to sample recursively from \(\mathbf{U}_{1 \ 1:t\mid t}\), it is sufficient to apply Eq. (16) to sequential realizations of \(\mathbf{z}_{1:t \mid t}\) from the joint conditional density \(p(\mathbf{z}_{1:t}\mid \mathbf{y}_{1:t})\), induced by model (3)–(5), after collapsing out \({\varvec{\theta }}_{1:t}\). While basic sisr algorithms for \(p(\mathbf{z}_{1:t}\mid \mathbf{y}_{1:t})\), combined with exact sampling from the Gaussian component \(\mathbf{U}_{0 \ t\mid t}\), are expected to yield an improved performance relative to the particle filter developed in Sect. 4.2.1, here we adapt an even broader class of lookahead particle filters (Lin et al. 2013), which includes basic sisr as a special case. To introduce the general lookahead idea, note that \(p(\mathbf{z}_{1:t}\mid \mathbf{y}_{1:t})=p(\mathbf{z}_{t-k+1:t} \mid \mathbf{z}_{1:t-k}, \mathbf{y}_{1:t})p(\mathbf{z}_{1:t-k} \mid \mathbf{y}_{1:t})\), where k is a pre-specified delay offset. Moreover, as a direct consequence of the dependence structure displayed in Fig. 2, we also have that \(p(\mathbf{z}_{t-k+1:t} \mid \mathbf{z}_{1:t-k}, \mathbf{y}_{1:t})=p(\mathbf{z}_{t-k+1:t} \mid \mathbf{z}_{1:t-k}, \mathbf{y}_{t-k+1:t})\) for any k. Hence, to sequentially generate realizations of \(\mathbf{z}_{1:t \mid t}\) from \(p(\mathbf{z}_{1:t}\mid \mathbf{y}_{1:t})\), we can first sample \(\mathbf{z}_{1:t-k \mid t}\) from \(p(\mathbf{z}_{1:t-k} \mid \mathbf{y}_{1:t})\) by extending, via sisr, the trajectory \(\mathbf{z}_{1:t-k-1 \mid t-1}\) with optimal proposal \(p(\mathbf{z}_{t-k}\mid \mathbf{z}_{1:t-k-1}=\mathbf{z}_{1:t-k-1 \mid t-1}, \mathbf{y}_{t-k:t})\), and then draw the last k terms in \(\mathbf{z}_{1:t \mid t}\) from \(p(\mathbf{z}_{t-k+1:t} \mid \mathbf{z}_{1:t-k}=\mathbf{z}_{1:t-k \mid t}, \mathbf{y}_{t-k+1:t})\). Note that when \(k=0\) this final operation is not necessary, and the particles’ updating in the first step reduces to basic sisr. Values of k in \(\{1, \ldots , n-1\}\) induce, instead, a lookahead structure in which, at the current time t, the optimal proposal for the delayed particles leverages information from response data \( \mathbf{y}_{t-k:t}\) that are not only contemporaneous to \(\mathbf{z}_{t-k}\), i.e., \(\mathbf{y}_{t-k}\), but also future, namely \(\mathbf{y}_{t-k+1}, \ldots , \mathbf{y}_{t}\). In this way, the samples from the sub-trajectory \(\mathbf{z}_{1:t-k \mid t}\) of \(\mathbf{z}_{1:t \mid t}\) at time t are more compatible with the target density \(p(\mathbf{z}_{1:t}\mid \mathbf{y}_{1:t})\) and hence, when completed with the last k terms drawn from \(p(\mathbf{z}_{t-k+1:t} \mid \mathbf{z}_{1:t-k}=\mathbf{z}_{1:t-k \mid t}, \mathbf{y}_{t-k+1:t})\), produce a sequential sampling scheme for \(p(\mathbf{z}_{1:t}\mid \mathbf{y}_{1:t})\) with improved mixing and reduced degeneracy relative to basic sisr. Although the magnitude of such gains clearly grows with k, as illustrated in Sect. 5, setting \(k=1\) already provides empirical evidence of improved performance relative to basic sisr, without major computational costs.
To implement the aforementioned strategy it is first necessary to ensure that the lookahead proposal belongs to a class of random variables which allow efficient sampling, while having a tractable closed-form expression for the associated importance weights. Proposition 1 shows that this is the case under model (3)–(5).
Proposition 1
Under the augmented model in (3)–(5), the lookahead proposal mentioned above has the form
$$\begin{aligned} \begin{aligned}&p(\mathbf{z}_{t-k}\mid \mathbf{z}_{1:t-k-1},\mathbf{y}_{t-k:t})\\&\quad = \int p(\mathbf{z}_{t-k:t}\mid \mathbf{z}_{1:t-k-1},\mathbf{y}_{t-k:t}) d\mathbf{z}_{t-k+1:t}, \end{aligned} \end{aligned}$$
(17)
where \(p(\mathbf{z}_{t-k:t}\mid \mathbf{z}_{1:t-k-1},\mathbf{y}_{t-k:t})\) is the density of the truncated normal \(\textsc {tn}_{m(k+1)}(\mathbf{r}_{t-k:t\mid t-k-1},\mathbf{S}_{t-k:t\mid t-k-1}; {\mathbb {A}}_{\mathbf{y}_{t-k:t}})\) with parameters \(\mathbf{r}_{t-k:t\mid t-k-1}= {\mathbb {E}}(\mathbf{z}_{t-k:t}\mid \mathbf{z}_{1:t-k-1})\) and \(\mathbf{S}_{t-k:t\mid t-k-1}= \text{ var }(\mathbf{z}_{t-k:t}\mid \mathbf{z}_{1:t-k-1})\). The importance weights \(w_t=w(\mathbf{z}_{1:t-k})\) are, instead, proportional to
$$\begin{aligned} \dfrac{p(\mathbf{y}_{t-k:t}\mid \mathbf{z}_{1:t-k-1})}{p(\mathbf{y}_{t-k:t-1}\mid \mathbf{z}_{1:t-k-1})}= \dfrac{\varPhi _{m (k+1)}({\varvec{\mu }}_t;{\varvec{\varSigma }}_t) }{\varPhi _{m k}({\bar{{\varvec{\mu }}}}_t;{\bar{{\varvec{\varSigma }}}}_t) }, \end{aligned}$$
(18)
where the mean vectors are \({\varvec{\mu }}_t = \mathbf{B}_{t-k:t} \mathbf{r}_{t-k:t\mid t-k-1}\) and \({\bar{{\varvec{\mu }}}}_t = \mathbf{B}_{t-k:t-1}\mathbf{r}_{t-k:t-1\mid t-k-1}\), whereas the covariance matrices are defined as \({\varvec{\varSigma }}_t = \mathbf{B}_{t-k:t} \mathbf{S}_{t-k:t\mid t-k-1} \mathbf{B}_{t-k:t}\) and \({\bar{{\varvec{\varSigma }}}}_t = \mathbf{B}_{t-k:t-1} \mathbf{S}_{t-k:t-1\mid t-k-1} \mathbf{B}_{t-k:t-1}\).
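Since (18) is just a ratio of two multivariate Gaussian cumulative distribution functions, it can be evaluated in R with mvtnorm::pmvnorm; a minimal sketch, with the moment inputs assumed computed as in Sect. 4.2.2:

```r
## Lookahead weight (18): Phi_{m(k+1)}(mu; Sigma) / Phi_{mk}(mu_bar; Sigma_bar)
lookahead_weight <- function(mu, Sigma, mu_bar, Sigma_bar) {
  num <- mvtnorm::pmvnorm(upper = as.vector(mu),     sigma = Sigma)[1]
  den <- mvtnorm::pmvnorm(upper = as.vector(mu_bar), sigma = Sigma_bar)[1]
  num / den
}
```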
To complete the procedure for sampling from \(p(\mathbf{z}_{1:t}\mid \mathbf{y}_{1:t})\) we further require \(p(\mathbf{z}_{t-k+1:t}\mid \mathbf{z}_{1:t-k}, \mathbf{y}_{t-k+1:t})\). As clarified in Proposition 2, this quantity is also the density of a multivariate truncated normal.
Proposition 2
Under model (3)–(5), it holds
$$\begin{aligned}&(\mathbf{z}_{t-k+1:t}\mid \mathbf{z}_{1:t-k}, \mathbf{y}_{t-k+1:t}) \\&\quad \sim \textsc {tn}_{m k}(\mathbf{r}_{t-k+1:t\mid t-k}, \mathbf{S}_{t-k+1:t\mid t-k};{\mathbb {A}}_{\mathbf{y}_{t-k+1:t}}), \nonumber \end{aligned}$$
(19)
with parameters \(\mathbf{r}_{t-k+1:t\mid t-k}= {\mathbb {E}}(\mathbf{z}_{t-k+1:t}\mid \mathbf{z}_{1:t-k})\) and \(\mathbf{S}_{t-k+1:t\mid t-k}= \text{ var }(\mathbf{z}_{t-k+1:t}\mid \mathbf{z}_{1:t-k})\).
Note that the expression of the importance weights in Eq. (18) does not depend on \(\mathbf{z}_{t-k}\), and hence also in this case the resampling step can be performed before sampling from (17), thus leading to an auf routine. Besides improving efficiency, such a strategy allows one to combine the particle generation in (17) and the completion of the last k terms of \(\mathbf{z}_{1:t \mid t}\) in (19) within a single sampling step from the joint \([m(k+1)]\)-variate truncated normal distribution for \((\mathbf{z}_{t-k:t}\mid \mathbf{z}_{1:t-k-1},\mathbf{y}_{t-k:t})\) reported in Proposition 1. The first m-dimensional component of this vector yields the new delayed particle for \(\mathbf{z}_{t-k \mid t}\) from (17), whereas the whole sub-trajectory provides the desired sample from \(p(\mathbf{z}_{t-k:t}\mid \mathbf{z}_{1:t-k-1},\mathbf{y}_{t-k:t})\), which is joined to the previously resampled particles for \(\mathbf{z}_{1:t-k-1 \mid t}\) to form a realization of \(\mathbf{z}_{1:t\mid t}\) from \(p(\mathbf{z}_{1:t} \mid \mathbf{y}_{1:t})\). Once this sample is available, one can obtain a draw of \({\varvec{\theta }}_{t \mid t}\) from the filtering density \(p({\varvec{\theta }}_t\mid \mathbf{y}_{1:t})\) of interest by exploiting the additive representation of the sun and the analogy between \(\mathbf{U}_{1 \ 1:t\mid t}\) and \(\mathbf{z}_{1:t|t}\) in (16). In practice, as clarified in Algorithm 3, the updating of \(\mathbf{U}_{1 \ 1:t\mid t}\) via lookahead recursion on \(\mathbf{z}_{1:t|t}\) and the exact sampling from the Gaussian component of the sun filtering distribution for \({\varvec{\theta }}_t\) can be effectively combined in a single online routine based on Kalman filter steps.
To clarify Algorithm 3, note that \(p({\varvec{\theta }}_t\mid \mathbf{z}_{1:t})\) is the filtering density of the Gaussian dynamic linear model defined in (4)–(5), for which the Kalman filter can be directly implemented once the trajectory \(\mathbf{z}_{1:t \mid t}\) has been generated from \(p(\mathbf{z}_{1:t}\mid \mathbf{y}_{1:t})\) via the lookahead filter. Let \(\mathbf{a}_{t-k-1\mid t-k-1} = {\mathbb {E}}({\varvec{\theta }}_{t-k-1}\mid \mathbf{z}_{1:t-k-1})\), \(\mathbf{P}_{t-k-1\mid t-k-1} = \text {var}({\varvec{\theta }}_{t-k-1}\mid \mathbf{z}_{1:t-k-1})\) and \(\mathbf{a}_{t-k\mid t-k-1} = {\mathbb {E}}({\varvec{\theta }}_{t-k}\mid \mathbf{z}_{1:t-k-1})\), \(\mathbf{P}_{t-k\mid t-k-1} = \text {var}({\varvec{\theta }}_{t-k}\mid \mathbf{z}_{1:t-k-1})\) be the mean vectors and covariance matrices of the Gaussian filtering and predictive distributions produced by the standard Kalman filter recursions at time \(t-k-1\) under model (4)–(5). Besides being necessary to draw values from the states’ filtering and predictive distributions, conditioned on the trajectories of \(\mathbf{z}_{1:t \mid t}\) sampled from \(p(\mathbf{z}_{1:t}\mid \mathbf{y}_{1:t})\), such quantities are also sufficient to update online the lookahead parameters \(\mathbf{r}_{t-k:t\mid t-k-1}\) and \(\mathbf{S}_{t-k:t\mid t-k-1}\) that are required to compute the importance weights in Proposition 1, and to sample from the \([m(k+1)]\)-variate truncated normal density \(p(\mathbf{z}_{t-k:t}\mid \mathbf{z}_{1:t-k-1},\mathbf{y}_{t-k:t})\) under the auxiliary filter. In particular, the formulation of the dynamic model in (4)–(5) implies that \(\mathbf{r}_{t-k:t\mid t-k-1}= {\mathbb {E}}(\mathbf{z}_{t-k:t}\mid \mathbf{z}_{1:t-k-1})={\mathbb {E}}(\mathbf{F}_{t-k:t}{\varvec{\theta }}_{t-k:t}\mid \mathbf{z}_{1:t-k-1})\), and, therefore, \(\mathbf{r}_{t-k:t\mid t-k-1}\) can be expressed as a function of \(\mathbf{a}_{t-k\mid t-k-1}\) via the direct application of the law of iterated expectations, by stacking the m-dimensional vectors \(\mathbf{F}_{t-k}\mathbf{a}_{t-k\mid t-k-1}\), \(\mathbf{F}_{t-k+1}\mathbf{G}_{t-k+1}\mathbf{a}_{t-k\mid t-k-1}, \ldots ,\) \(\mathbf{F}_{t}\mathbf{G}_{t-k+1}^{t} \mathbf{a}_{t-k\mid t-k-1}\), with \(\mathbf{G}_{l}^s\) defined as in Sect. 3.2.
A similar reasoning can be applied to write the covariance matrix \(\mathbf{S}_{t-k:t\mid t-k-1}=\text{ var }(\mathbf{z}_{t-k:t}\mid \mathbf{z}_{1:t-k-1})\) as a function of \(\mathbf{P}_{t-k\mid t-k-1}\). In particular, letting \(l_{-}=l-1\), the \(m \times m\) diagonal blocks of \(\mathbf{S}_{t-k:t\mid t-k-1}\) can be obtained sequentially after noticing that
$$\begin{aligned} \mathbf{S}_{t-k:t\mid t-k-1[ll]}&=\text{ var }(\mathbf{z}_{t-k+l_{-}}\mid \mathbf{z}_{1:t-k-1})\\&=\mathbf{F}_{t-k+l_{-}}\mathbf{P}_{t-k+l_{-}\mid t-k-1}\mathbf{F}^{\intercal }_{t-k+l_{-}}+\mathbf{V}_{t-k+l_{-}}, \end{aligned}$$
for each index \(l=1, \ldots , k+1\), where the states’ covariance matrix \(\mathbf{P}_{t-k+l_{-}\mid t-k-1}\) at time \(t-k+l_{-}\) can be expressed as a function of \(\mathbf{P}_{t-k\mid t-k-1}\) via the recursive equations \(\mathbf{P}_{t-k+l_{-}\mid t-k-1}=\mathbf{G}_{t-k+l_{-}}\mathbf{P}_{t-k+l_{-}-1\mid t-k-1}\mathbf{G}^{\intercal }_{t-k+l_{-}}+\mathbf{W}_{t-k+l_{-}}\), for every \(l=2, \ldots , k+1\). Moreover, letting \(l_{-}=l-1\) and \(s_{-}=s-1\), the off-diagonal blocks can also be obtained in a related manner, after noticing that the generic block of \(\mathbf{S}_{t-k:t\mid t-k-1}\) is defined as
$$\begin{aligned}&\mathbf{S}_{t-k:t\mid t-k-1[sl]}\\&\quad =\mathbf{S}^{\intercal }_{t-k:t\mid t-k-1[ls]}\\&\quad =\text{ cov }(\mathbf{F}_{t-k+s_{-}}{\varvec{\theta }}_{t-k+s_{-}},\mathbf{F}_{t-k+l_{-}} {\varvec{\theta }}_{t-k+l_{-}}\mid \mathbf{z}_{1:t-k-1})\\&\quad =\mathbf{F}_{t-k+s_{-}}\mathbf{G}_{t-k+l}^{t-k+s_{-}}\mathbf{P}_{t-k+l_{-}\mid t-k-1}\mathbf{F}_{t-k+l_{-}}^\intercal , \end{aligned}$$
for every \(s=2, \ldots , k+1\) and \(l=1, \ldots , s-1\), where the matrix \(\mathbf{P}_{t-k+l_{-}\mid t-k-1}\) can be expressed as a function of \(\mathbf{P}_{t-k\mid t-k-1}\) via the recursive equations reported above.
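The following hedged R sketch assembles \(\mathbf{r}_{t-k:t\mid t-k-1}\) and \(\mathbf{S}_{t-k:t\mid t-k-1}\) from the Kalman predictive moments via exactly these recursions. For brevity the model matrices are taken as time-invariant, so that \(\mathbf{G}_{l}^s\) reduces to a matrix power (the time-varying case only changes the indexing); all names are illustrative.

```r
## Lookahead moments r_{t-k:t|t-k-1} (length m(k+1)) and
## S_{t-k:t|t-k-1} (m(k+1) x m(k+1)) from the Kalman predictive moments
## a = a_{t-k|t-k-1} (p-vector) and P = P_{t-k|t-k-1} (p x p), with
## time-invariant F (m x p), G (p x p), V (m x m), W (p x p).
lookahead_moments <- function(a, P, F, G, V, W, k) {
  m <- nrow(F); p <- ncol(F)
  r <- numeric(m * (k + 1))
  S <- matrix(0, m * (k + 1), m * (k + 1))
  P_lags <- vector("list", k + 1)          # P_{t-k+l_-|t-k-1}, l = 1,...,k+1
  a_l <- a; P_l <- P
  for (l in 1:(k + 1)) {
    P_lags[[l]] <- P_l
    idx <- ((l - 1) * m + 1):(l * m)
    r[idx] <- F %*% a_l                    # block of r_{t-k:t|t-k-1}
    S[idx, idx] <- F %*% P_l %*% t(F) + V  # diagonal block S_[ll]
    if (l <= k) {                          # propagate moments one step ahead
      a_l <- G %*% a_l
      P_l <- G %*% P_l %*% t(G) + W
    }
  }
  if (k >= 1) {                            # off-diagonal blocks S_[sl], s > l
    for (s in 2:(k + 1)) for (l in 1:(s - 1)) {
      Gpow <- diag(p)
      for (j in 1:(s - l)) Gpow <- G %*% Gpow   # G^{s-l}
      idx_s <- ((s - 1) * m + 1):(s * m)
      idx_l <- ((l - 1) * m + 1):(l * m)
      S[idx_s, idx_l] <- F %*% Gpow %*% P_lags[[l]] %*% t(F)
      S[idx_l, idx_s] <- t(S[idx_s, idx_l])
    }
  }
  list(r = r, S = S)
}
```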
According to these results, the partially collapsed lookahead particle filter for sampling recursively from \(p({\varvec{\theta }}_t \mid \mathbf{y}_{1:t})\) simply requires storing and updating, for each particle trajectory, the sufficient statistics \(\mathbf{a}_{t-k\mid t-k-1}\) and \(\mathbf{P}_{t-k\mid t-k-1}\) via Kalman filter recursions applied to model (4)–(5), with every \(\mathbf{z}_t\) replaced by the particles generated under the lookahead routine. As previously discussed, this updating also requires only the moments \(\mathbf{a}_{t-k\mid t-k-1}\) and \(\mathbf{P}_{t-k\mid t-k-1}\), computed recursively as a function of the delayed particles’ trajectories. This yields a computational complexity per iteration that is constant in time, as it does not require computing quantities whose dimension grows with t. In addition, as discussed in Remark 1, such a dual interpretation, combined with our sun closed-form results, provides novel theoretical support for the Rao–Blackwellized particle filter introduced by Andrieu and Doucet (2002).
Remark 1
The Rao–Blackwellized particle filter by Andrieu and Doucet (2002) for \(p({\varvec{\theta }}_t\mid \mathbf{y}_{1:t})\) can be directly obtained as a special case of Algorithm 3, setting \(k=0\).
Consistent with Remark 1, the Rao–Blackwellized idea (Andrieu and Doucet 2002) coincides with a partially collapsed filter which only updates, without lookahead strategies, the truncated normal component in the sun additive representation of the states’ filtering distribution, while keeping the Gaussian term exact. Hence, although this method was originally motivated, in the context of dynamic probit models, also by the apparent lack of an “optimal” closed-form sisr for the states’ filtering distribution, our results show that such a strategy is expected to yield improved performance relative to the “optimal” particle filter for sampling directly from \(p({\varvec{\theta }}_t \mid \mathbf{y}_{1:t})\). In fact, unlike this filter, which is actually available according to Sect. 4.2.1, the Rao–Blackwellized idea avoids the unnecessary autocorrelation in the Gaussian component of the sun representation, and relies on an optimal particle filter for the multivariate truncated normal part. In addition, Remark 1 and the derivation of the whole class of partially collapsed lookahead filters suggest that setting \(k>0\) is expected to yield further gains relative to the Rao–Blackwellized particle filter; see Sect. 5 for quantitative evidence supporting these results.