A closed-form filter for binary time series

Non-Gaussian state-space models arise in several applications, and within this framework the binary time series setting provides a relevant example. However, unlike for Gaussian state-space models, where filtering, predictive and smoothing distributions are available in closed form, binary state-space models require approximations or sequential Monte Carlo strategies for inference and prediction. This is due to the apparent absence of conjugacy between the Gaussian states and the likelihood induced by the observation equation for the binary data. In this article we prove that the filtering, predictive and smoothing distributions in dynamic probit models with Gaussian state variables are, in fact, available and belong to a class of unified skew-normals (SUN) whose parameters can be updated recursively in time via analytical expressions. The key functionals of these distributions are also available in principle, but their calculation requires the evaluation of multivariate Gaussian cumulative distribution functions. Leveraging SUN properties, we address this issue via novel Monte Carlo methods based on independent samples from the smoothing distribution, which can easily be adapted to the filtering and predictive case, thus improving state-of-the-art approximate and sequential Monte Carlo inference in small-to-moderate dimensional studies. Novel sequential Monte Carlo procedures that exploit the SUN properties are also developed to deal with online inference in high dimensions. Performance gains over competitors are outlined in a financial application.

Model (1)-(2) generalizes univariate dynamic probit models to multivariate settings, as we will clarify in equations (3)-(5). The quantities F t , V t , G t , W t , a 0 and P 0 denote, instead, known matrices controlling the location, scale and dependence structure in the state-space model (1)-(2). Estimation and inference for these matrices are themselves relevant problems which can be addressed from both frequentist and Bayesian perspectives. Yet our focus is on providing exact results for inference on state variables and prediction of future binary events under (1)-(2). Therefore, consistent with the classical Kalman filter (Kalman, 1960), we rely on known system matrices F t , V t , G t , W t , a 0 and P 0 . Nonetheless, novel results on marginal likelihoods, which can be used in parameter estimation, are provided in Sect. 3.2.
Model (1)-(2) provides a general representation encompassing a variety of formulations. For example, setting V t = I m in (1) for each t yields a set of standard dynamic probit regressions, which include the classical univariate dynamic probit model when m = 1. These representations have appeared in several applications, especially within the econometrics literature, due to a direct connection between (1)-(2) and dynamic discrete choice models (Keane and Wolpin, 2009). This is because representation (1)-(2) can be alternatively obtained via the dichotomization of an underlying state-space model for the m-variate Gaussian time series z t = (z 1t , . . . , z mt ) ∈ R m , t = 1, . . . , n, which is regarded, in econometric applications, as a set of time-varying utilities. Indeed, adapting classical results from static probit regression (Albert and Chib, 1993; Chib and Greenberg, 1998), model (1)-(2) is equivalent to y t = (y 1t , . . . , y mt ) = 1(z t > 0) = [1(z 1t > 0), . . . , 1(z mt > 0)] , t = 1, . . . , n, with z 1 , . . . , z n evolving in time according to the Gaussian state-space model having θ 0 ∼ N p (a 0 , P 0 ) and dependence structure as defined by the directed acyclic graph displayed in Fig. 2. In (4), φ m (z t − F t θ t ; V t ) denotes the density function of the Gaussian N m (F t θ t , V t ) evaluated at z t ∈ R m . To clarify the connection between (1)-(2) and (3)-(5), note that if z̃ t is a generic Gaussian random variable with density (4), then it holds p(y t | θ t ) = pr(B t z̃ t > 0) = Φ m (B t F t θ t ; B t V t B t ), given that −B t (z̃ t − F t θ t ) ∼ N m (0, B t V t B t ) under (4).
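The dichotomized construction just described is straightforward to simulate. The sketch below generates binary series from the augmented representation by thresholding latent Gaussian utilities at zero; the specific time-invariant matrices F, V, G, W, a 0 and P 0 are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions: m binary series, p latent states, n time points (illustrative).
m, p, n = 3, 2, 50

# Hypothetical time-invariant system matrices F_t = F, V_t = V, G_t = G, W_t = W.
F = rng.standard_normal((m, p))
V = np.eye(m)
G = 0.9 * np.eye(p)
W = 0.1 * np.eye(p)
a0, P0 = np.zeros(p), np.eye(p)

theta = rng.multivariate_normal(a0, P0)          # theta_0 ~ N_p(a0, P0)
y = np.empty((n, m), dtype=int)
for t in range(n):
    theta = G @ theta + rng.multivariate_normal(np.zeros(p), W)  # state evolution
    z = F @ theta + rng.multivariate_normal(np.zeros(m), V)      # latent utilities
    y[t] = (z > 0).astype(int)                   # dichotomization y_t = 1(z_t > 0)
```

Thresholding the latent Gaussian utilities at zero reproduces the probit observation mechanism for the binary data.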
As is clear from model (4)-(5), if z 1:t = (z 1 , . . . , z t ) were observed, dynamic inference on the states θ t , for t = 1, . . . , n, would be possible via direct application of the Kalman filter (Kalman, 1960). Indeed, exploiting Gaussian-Gaussian conjugacy and the conditional independence properties that are represented in Fig. 2, the filtering p(θ t | z 1:t ) and predictive p(θ t | z 1:t−1 ) densities are also Gaussian and have parameters which can be computed recursively via simple expressions relying on the previous updates. Moreover, the smoothing density p(θ 1:n | z 1:n ) and its marginals p(θ t | z 1:n ), t ≤ n, can also be obtained in closed form leveraging Gaussian-Gaussian conjugacy.
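Were z 1:t observed, the recursion just described is the classical Kalman filter. A minimal, self-contained sketch of the filtering moments for the linear-Gaussian part, with time-invariant matrices passed in for brevity:

```python
import numpy as np

def kalman_filter(z, F, V, G, W, a0, P0):
    """Filtering moments of p(theta_t | z_{1:t}) for a linear-Gaussian model."""
    p = a0.shape[0]
    a, P = a0.copy(), P0.copy()
    means, covs = [], []
    for zt in z:
        # Predict: p(theta_t | z_{1:t-1}) = N(a_pred, P_pred).
        a_pred = G @ a
        P_pred = G @ P @ G.T + W
        # Update with the Gaussian observation z_t = F theta_t + v_t.
        S = F @ P_pred @ F.T + V              # innovation covariance
        K = P_pred @ F.T @ np.linalg.inv(S)   # Kalman gain
        a = a_pred + K @ (zt - F @ a_pred)
        P = (np.eye(p) - K @ F) @ P_pred
        means.append(a)
        covs.append(P)
    return means, covs
```

As a sanity check, with a static scalar state (G = 1, W = 0) the filtering variance after t unit-variance observations is P 0 /(1 + tP 0 ), matching the standard conjugate Gaussian update.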
However, in (3)-(5) only a dichotomized version y t of z t is available. Therefore, the filtering, predictive and smoothing densities of interest are p(θ t | y 1:t ), p(θ t | y 1:t−1 ) and p(θ 1:n | y 1:n ), respectively. Recalling model (1)-(2) and Bayes' rule, the calculation of these quantities proceeds by updating the Gaussian distribution for the states in (2) with the probit likelihood in (1), thereby providing conditional distributions which do not have an obvious closed form (Albert and Chib, 1993; Chib and Greenberg, 1998).
When the focus is on online inference for filtering and prediction, one solution to the above issue is to rely on approximations of model (1)-(2) which allow the implementation of standard Kalman filter updates, thus leading to approximate dynamic inference on the states via extended (Uhlmann, 1992) or unscented (Julier and Uhlmann, 1997) Kalman filters, among others. However, these approximations may lead to unreliable inference in various settings (Andrieu and Doucet, 2002). Markov chain Monte Carlo (mcmc) strategies (e.g., Carlin et al., 1992;Shephard, 1994;Soyer and Sung, 2013) address this problem but, unlike the Kalman filter, these methods are only suitable for batch learning of smoothing distributions, and tend to face mixing or scalability issues in binary settings (Johndrow et al., 2019).
Sequential Monte Carlo methods (e.g., Doucet et al., 2001) partially solve mcmc issues, and are specifically developed for online inference via particle-based representations of the states' conditional distributions, which are then propagated in time for dynamic filtering and prediction (Gordon et al., 1993; Kitagawa, 1996; Liu and Chen, 1998; Pitt and Shephard, 1999; Doucet et al., 2000; Andrieu and Doucet, 2002). These strategies provide state-of-the-art solutions in non-Gaussian state-space models, and can also be adapted to perform batch learning of the smoothing distribution; see Doucet and Johansen (2009) for a discussion of particle degeneracy issues that may arise in such a setting. Nonetheless, sequential Monte Carlo is clearly still sub-optimal compared to the case in which p(θ t | y 1:t ), p(θ t | y 1:t−1 ) and p(θ 1:n | y 1:n ) are available in closed form and belong to a tractable class of known densities whose parameters can be sequentially updated via analytical expressions.
In Sect. 3, we prove that, for the dynamic multivariate probit model in (1)-(2), the quantities p(θ t | y 1:t ), p(θ t | y 1:t−1 ) and p(θ 1:n | y 1:n ) are unified skew-normal (sun) densities (Arellano-Valle and Azzalini, 2006) having tractable expressions for the recursive computation of the corresponding parameters. To the best of our knowledge, such a result provides the first closed-form filter and smoother for binary time series, and facilitates improvements both in online and batch inference. As we will highlight in Sect. 2, the sun distribution has several closure properties (Arellano-Valle and Azzalini, 2006; Azzalini and Capitanio, 2014) in addition to explicit formulas, involving the cumulative distribution function of multivariate Gaussians, for the moments (Azzalini and Bacchieri, 2010; Gupta et al., 2013) and the normalizing constant (Arellano-Valle and Azzalini, 2006). In Sect. 3, we exploit these properties to derive closed-form expressions for functionals of p(θ t | y 1:t ), p(θ t | y 1:t−1 ) and p(θ 1:n | y 1:n ), including, in particular, the observations' predictive density p(y t | y 1:t−1 ) and the marginal likelihood p(y 1:n ). In Sect. 4.1, we also derive an exact Monte Carlo scheme to compute generic functionals of the smoothing distribution. This routine relies on a generative representation of the sun via linear combinations of multivariate Gaussians and truncated normals (Arellano-Valle and Azzalini, 2006), and can also be applied effectively to evaluate the functionals of filtering and predictive densities in small-to-moderate dimensions where mt is of the order of a few hundred, a common situation in routine applications.
As clarified in Sect. 4.2, the above strategies face computational bottlenecks in higher dimensions (Botev, 2017). To overcome this issue, we develop a class of partially collapsed sequential Monte Carlo methods which perform particle filtering only on the multivariate truncated normal term of the sun additive representation, while keeping the Gaussian part exact. As outlined in an illustrative financial application in Sect. 5, this class improves approximation accuracy relative to competing methods, and includes, as a special case, the Rao-Blackwellized particle filter of Andrieu and Doucet (2002).
Concluding remarks can be found in Sect. 6.

The unified skew-normal distribution
Before deriving filtering, predictive and smoothing distributions under model (1)-(2), let us first briefly review the sun family. Arellano-Valle and Azzalini (2006) proposed this broad class to unify different extensions (e.g., Arnold and Beaver, 2000; Arnold et al., 2002; Gupta et al., 2004; González-Farías et al., 2004) of the original multivariate skew-normal (Azzalini and Dalla Valle, 1996), whose density is obtained as the product between a multivariate Gaussian density and the cumulative distribution function of a standard normal evaluated at a value which depends on a skewness-inducing vector of parameters. Motivated by the success of this formulation and of its generalizations (Azzalini and Capitanio, 1999), Arellano-Valle and Azzalini (2006) developed a unifying representation, namely the sun distribution. A random vector θ ∈ R q has unified skew-normal distribution, θ ∼ sun q,h (ξ, Ω, ∆, γ, Γ), if its density function p(θ) can be expressed as

p(θ) = φ q (θ − ξ; Ω) Φ h (γ + ∆'Ω̄ −1 ω −1 (θ − ξ); Γ − ∆'Ω̄ −1 ∆)/Φ h (γ; Γ),   (6)

where the covariance matrix Ω of the Gaussian density φ q (θ − ξ; Ω) can be decomposed as Ω = ωΩ̄ω, that is by re-scaling the q × q correlation matrix Ω̄ via the positive diagonal scale matrix ω = (Ω ∘ I q ) 1/2 , with ∘ denoting the element-wise Hadamard product. In (6), the skewness-inducing mechanism is driven by the cumulative distribution function Φ h (·) of a multivariate Gaussian, whereas Φ h (γ; Γ) denotes the normalizing constant obtained by evaluating the cumulative distribution function of a N h (0, Γ) at γ. Following Arellano-Valle and Azzalini (2006), the matrix with diagonal blocks Γ and Ω̄, and off-diagonal block ∆, is assumed to be a full-rank correlation matrix. Note that in (6) the quantities q and h define the dimensions of the Gaussian density and cumulative distribution function, respectively.
As clarified by our closed-form sun results in Sect. 3, q defines the dimension of the states' vector, and coincides with p in the sun filtering and predictive distributions, while it is equal to pn in the sun smoothing distribution.
On the other hand, h increases linearly with time in all the distributions of interest.
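For concreteness, the sun density can be evaluated directly from its definition in Arellano-Valle and Azzalini (2006), namely p(θ) = φ q (θ − ξ; Ω) Φ h (γ + ∆'Ω̄ −1 ω −1 (θ − ξ); Γ − ∆'Ω̄ −1 ∆)/Φ h (γ; Γ). The sketch below is an illustration rather than an efficient routine, since each call evaluates h-variate Gaussian cumulative distribution functions:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def sun_pdf(theta, xi, Omega, Delta, gamma, Gamma):
    """Density of SUN_{q,h}(xi, Omega, Delta, gamma, Gamma) at theta."""
    omega = np.diag(np.sqrt(np.diag(Omega)))       # omega = (Omega o I_q)^{1/2}
    # Correlation matrix Omega_bar = omega^{-1} Omega omega^{-1}.
    Omega_bar = np.linalg.solve(omega, np.linalg.solve(omega, Omega).T)
    u = np.linalg.solve(omega, theta - xi)
    arg = gamma + Delta.T @ np.linalg.solve(Omega_bar, u)
    cond_cov = Gamma - Delta.T @ np.linalg.solve(Omega_bar, Delta)
    h = len(gamma)
    num = mvn(mean=np.zeros(h), cov=cond_cov, allow_singular=True).cdf(arg)
    den = mvn(mean=np.zeros(h), cov=Gamma).cdf(gamma)
    return mvn(mean=xi, cov=Omega).pdf(theta) * num / den
```

With ∆ = 0 the skewing factor cancels and the density reduces to the Gaussian φ q (θ − ξ; Ω); with q = h = 1 it recovers the classical skew-normal.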
An additional generative additive representation of the sun relies on linear combinations of Gaussian and truncated normal random variables, thereby facilitating sampling from the sun. In particular, recalling Azzalini and Capitanio (2014, Sect. 7.1.2) and Arellano-Valle and Azzalini (2006), if θ ∼ sun q,h (ξ, Ω, ∆, γ, Γ), then

θ = d ξ + ω(U 0 + ∆Γ −1 U 1 ),   (7)

where U 0 ∼ N q (0, Ω̄ − ∆Γ −1 ∆') is independent of U 1 , with U 1 distributed as a N h (0, Γ) truncated to the region {u 1 ∈ R h : u 1 + γ > 0}. As discussed in Azzalini and Bacchieri (2010) and Gupta et al. (2013), the first and second order moments further require the evaluation of h-variate Gaussian cumulative distribution functions Φ h (·), thus affecting computational tractability in large h settings (e.g., Botev, 2017). In these situations, Monte Carlo integration provides an effective solution, especially when independent samples can be generated efficiently. Therefore, we mostly focus on improved Monte Carlo inference under model (1)-(2) exploiting the sun properties, and refer to Azzalini and Bacchieri (2010) and Gupta et al. (2013) for closed-form expressions of the moments.
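The additive representation directly yields an i.i.d. sampler. The minimal sketch below assumes the standard form θ = ξ + ω(U 0 + ∆Γ −1 U 1 ) and uses naive rejection for the truncated normal term, which is viable only when the acceptance probability Φ h (γ; Γ) is not too small:

```python
import numpy as np

def r_sun(R, xi, Omega, Delta, gamma, Gamma, seed=None):
    """R i.i.d. draws from SUN_{q,h} via the additive representation:
    theta = xi + omega (U0 + Delta Gamma^{-1} U1), with U0 Gaussian and
    U1 a N_h(0, Gamma) truncated to {u : u + gamma > 0}."""
    rng = np.random.default_rng(seed)
    q, h = len(xi), len(gamma)
    omega = np.sqrt(np.diag(Omega))                 # diagonal of the scale matrix
    Omega_bar = Omega / np.outer(omega, omega)      # correlation matrix
    A = np.linalg.solve(Gamma, Delta.T).T           # Delta Gamma^{-1}
    cov0 = Omega_bar - A @ Delta.T                  # covariance of the Gaussian part
    out = np.empty((R, q))
    for r in range(R):
        U0 = rng.multivariate_normal(np.zeros(q), cov0)
        while True:                                 # naive rejection step for U1
            U1 = rng.multivariate_normal(np.zeros(h), Gamma)
            if np.all(U1 + gamma > 0):
                break
        out[r] = xi + omega * (U0 + A @ U1)
    return out
```

With q = h = 1, ∆ = 0.8 and γ = 0 this reproduces a skew-normal with mean 0.8(2/π) 1/2 ≈ 0.64, while ∆ = 0 recovers plain Gaussian draws.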

Filtering and predictive distributions
To obtain the exact form of the filtering and predictive distributions under (1)-(2), let us start from p(θ 1 | y 1 ).
This first quantity characterizes the initial step of the filter recursion, and its derivation within Lemma 1 provides the key intuitions to obtain the state predictive p(θ t | y 1:t−1 ) and filtering p(θ t | y 1:t ) densities, for any t ≥ 2. Lemma 1 states that p(θ 1 | y 1 ) is a sun density. In the following, consistent with the notation of Sect. 2, whenever Ω is a q × q covariance matrix, the associated matrices ω andΩ are defined as ω = (Ω I q ) 1/2 andΩ = ω −1 Ωω −1 , respectively. All the proofs can be found in Appendix A, and leverage conjugacy properties of the sun in probit models.
The first result on this property has been derived by Durante (2019) for static univariate Bayesian probit regression. Here, we take a substantially different perspective by focusing on online inference in both multivariate and time-varying probit models, which requires novel and non-straightforward extensions. As seen in Soyer and Sung (2013) and Chib and Greenberg (1998), the increased complexity of this endeavor typically motivates a separate treatment relative to the static univariate case.

Lemma 1
Under the dynamic probit model in (1)-(2), the first-step filtering distribution (θ 1 | y 1 ) ∼ sun p,m (ξ 1|1 , Ω 1|1 , ∆ 1|1 , γ 1|1 , Γ 1|1 ), with parameters defined by recursive equations applied to the quantities in (1)-(2). Hence p(θ 1 | y 1 ) is a sun density with parameters that can be obtained via tractable arithmetic expressions applied to the quantities defining model (1)-(2). Exploiting the results in Lemma 1, the general filter updates for the multivariate dynamic probit model can be obtained by induction for t ≥ 2 and are presented in Theorem 1.
Theorem 1 Let (θ t−1 | y 1:t−1 ) be the sun filtering distribution at time t − 1. Then, the one-step-ahead state predictive distribution at t is a sun with parameters ξ t|t−1 , Ω t|t−1 , ∆ t|t−1 , γ t|t−1 and Γ t|t−1 defined by recursive equations that update the parameters at t − 1.

Moreover, the filtering distribution at time t is again a sun,
with parameters defined by recursive equations, where Γ t|t is a full-rank correlation matrix. We also provide closed-form expressions for the predictive density of the multivariate binary response data y t . Indeed, the prediction of y t ∈ {0, 1} m given the data y 1:t−1 is a primary goal in applications of dynamic probit models. In our setting, this task requires the derivation of the predictive density p(y t | y 1:t−1 ) which coincides, under model (1)-(2), with the integral of p(y t | θ t ) with respect to the state predictive density p(θ t | y 1:t−1 ) from (9). Corollary 1 shows that p(y t | y 1:t−1 ) has an explicit form.
Corollary 1 Under the dynamic probit model in (1)-(2), the predictive density is p(y t | y 1:t−1 ) = Φ mt (γ t|t ; Γ t|t )/Φ m(t−1) (γ t|t−1 ; Γ t|t−1 ) (11) for every time t, with parameters γ t|t , Γ t|t , γ t|t−1 and Γ t|t−1 defined as in Theorem 1.
Hence, the evaluation of probabilities of future events is possible via explicit calculations after marginalizing out analytically the states with respect to their predictive density. As is clear from (11), this requires the calculation of Gaussian cumulative distribution functions whose dimension increases with t and m. Efficient evaluation of such integrals is possible for small-to-moderate t and m via recent methods (Botev, 2017), but this solution is impractical for large t and m, as seen in Table 1. In Sect. 4, we develop novel Monte Carlo strategies to address this issue and enhance scalability. This is done by exploiting Theorem 1 to improve current solutions.

Smoothing distribution
We now consider smoothing distributions. In this case, the focus is on the distribution of the entire states' sequence θ 1:n , or a subset of it, given all data y 1:n . Theorem 2 shows that also the smoothing density p(θ 1:n | y 1:n ) belongs to the sun class. To state the result, let ξ ∈ R pn denote the prior mean of θ 1:n , with blocks E(θ s ) = G s 1 a 0 ∈ R p for every s = 1, . . . , n, where G s 1 = G s · · · G 1 . Similarly, letting G s l = G s · · · G l , also the (pn) × (pn) prior covariance matrix Ω of θ 1:n has a block structure computable from the system matrices in (2). Then, (θ 1:n | y 1:n ) ∼ sun pn,mn (ξ 1:n|n , Ω 1:n|n , ∆ 1:n|n , γ 1:n|n , Γ 1:n|n ), with parameters defined as ξ 1:n|n = ξ, Ω 1:n|n = Ω and ∆ 1:n|n = Ω̄ωD s −1 , with the remaining parameters following from analogous expressions. Since the sun is closed under marginalization and linear combinations, it follows from Theorem 2 that the smoothing distribution for any combination of states is still a sun. In particular, direct application of the results in Azzalini and Capitanio (2014, Sect. 7.1.2) yields the marginal smoothing distribution for any state θ t reported in Corollary 2.
Corollary 2 Under the model in (1)-(2), the marginal smoothing distribution at any time t ≤ n is the sun density in (13), with parameters obtained from those in Theorem 2 via marginalization. When t = n, (13) gives the filtering distribution at n.
Another important consequence of Theorem 2 is the availability of a closed-form expression for the marginal likelihood p(y 1:n ), which is provided in Corollary 3.
This closed-form result is useful in several contexts, including estimation of unknown system parameters via marginal likelihood maximization, and full Bayesian inference through mcmc or variational inference methods.

Inference via Monte Carlo methods
As discussed in Sects. 2 and 3, inference without sampling from (9), (10) or (12) can proceed either via the closed-form expressions available for key functionals of the sun (e.g., Azzalini and Capitanio, 2014) or by relying on numerical integration. However, these strategies require evaluations of multivariate Gaussian cumulative distribution functions, which tend to be impractical as t grows or when the focus is on complex functionals.
In such situations, Monte Carlo integration provides an accurate solution to evaluate the generic functionals E[g(θ t ) | y 1:t ], E[g(θ t ) | y 1:t−1 ] and E[g(θ 1:n ) | y 1:n ] of the filtering, predictive and smoothing distributions via averages over samples θ (r) t|t , θ (r) t|t−1 and θ (r) 1:n|n drawn from p(θ t | y 1:t ), p(θ t | y 1:t−1 ) and p(θ 1:n | y 1:n ), respectively. For example, if the evaluation of (11) is demanding, the observations' predictive density can be easily computed as the Monte Carlo average of p(y t | θ (r) t|t−1 ) over r = 1, . . . , R. To be implemented, the above approach requires an efficient strategy to sample from (9), (10) and (12). Exploiting the sun properties and recent results in Botev (2017), an algorithm to draw independent and identically distributed samples from the exact sun distributions in (9), (10) and (12) is developed in Sect. 4.1. As illustrated in Sect. 5, such a technique is more accurate than state-of-the-art methods and can be efficiently implemented in small-to-moderate dimensional time series. In Sect. 4.2 we develop, instead, novel sequential Monte Carlo schemes that allow scalable online learning in high-dimensional settings and have optimality properties (Doucet et al., 2000) which shed new light also on existing strategies (e.g., Andrieu and Doucet, 2002).
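For instance, the observations' predictive probability can be approximated by averaging p(y t | θ t ) over draws from the state predictive distribution. A minimal sketch, using the probit form p(y t | θ t ) = Φ m (B t F t θ t ; B t V t B t ') and assuming the usual sign convention B t = diag(2y 1t − 1, . . . , 2y mt − 1), which is an assumption here since B t is not defined in this excerpt:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def predictive_prob(y_t, theta_pred_samples, F_t, V_t):
    """Monte Carlo estimate of p(y_t | y_{1:t-1}) from R predictive draws,
    averaging p(y_t | theta_t) = Phi_m(B_t F_t theta_t; B_t V_t B_t')."""
    B = np.diag(2.0 * np.asarray(y_t, dtype=float) - 1.0)  # assumed sign convention
    cov = B @ V_t @ B.T
    m = len(y_t)
    vals = [mvn(mean=np.zeros(m), cov=cov).cdf(B @ F_t @ th)
            for th in theta_pred_samples]
    return float(np.mean(vals))
```

With a single binary series and predictive draws concentrated at zero, the estimate is 1/2 for either outcome, as expected by symmetry.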

Independent identically distributed sampling
As discussed in Sect. 1, mcmc and sequential Monte Carlo methods to sample from p(θ t | y 1:t ), p(θ t | y 1:t−1 ) and p(θ 1:n | y 1:n ) are available. However, the commonly recommended practice, if feasible, is to rely on independent and identically distributed (i.i.d.) samples. Here, we derive a Monte Carlo algorithm to address this goal with a main focus on the smoothing distribution, and discuss direct modifications to allow sampling also in the filtering and predictive case. Indeed, Monte Carlo inference is particularly suitable for batch settings, although, as discussed later, the proposed routine is practically useful also when the focus is on filtering and predictive distributions, since i.i.d. samples are simulated rapidly, for each t, in small-to-moderate dimensions.
Exploiting the closed-form expression of the smoothing distribution in Theorem 2, and the additive representation (7) of the sun, i.i.d. samples for θ 1:n|n from the smoothing distribution (12) can be obtained via a linear combination between independent samples from (pn)-variate Gaussians and (mn)-variate truncated normals. Algorithm 1 provides the detailed pseudo-code for this novel strategy, whose outputs are i.i.d. samples from the joint smoothing density p(θ 1:n | y 1:n ). Here, the most computationally intensive step is the sampling from tn mn (0, Γ 1:n|n ; A γ 1:n|n ), which denotes the multivariate normal distribution N mn (0, Γ 1:n|n ) truncated to the region A γ 1:n|n = {u 1 ∈ R mn : u 1 + γ 1:n|n > 0}. In fact, although efficient Hamiltonian Monte Carlo solutions are available (Pakman and Paninski, 2014), these strategies do not provide independent samples. More recently, an accept-reject method based on minimax tilting has been proposed by Botev (2017) to improve the acceptance rate of classical rejection sampling, while avoiding mixing issues of mcmc. This routine is available in the R library TruncatedNormal and allows efficient sampling from multivariate truncated normals with a dimension of a few hundred, thereby providing effective Monte Carlo inference via Algorithm 1 in small-to-moderate dimensional time series where mn is of the order of a few hundred.
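To fix ideas, a naive rejection sampler for the truncated normal term is shown below as a conceptual stand-in for the minimax-tilting sampler of Botev (2017), which is not reimplemented here. Its acceptance probability equals Φ h (γ; Γ), which decays quickly with the dimension h; this collapse is precisely the issue the tilted accept-reject method mitigates:

```python
import numpy as np

def truncated_mvn_naive(R, gamma, Gamma, seed=None):
    """Naive rejection sampling from N_h(0, Gamma) truncated to
    A_gamma = {u : u + gamma > 0}. Illustrative only: the acceptance
    rate Phi_h(gamma; Gamma) collapses as the dimension h grows."""
    rng = np.random.default_rng(seed)
    h = len(gamma)
    draws, n_proposals = [], 0
    while len(draws) < R:
        u = rng.multivariate_normal(np.zeros(h), Gamma)
        n_proposals += 1
        if np.all(u + gamma > 0):
            draws.append(u)
    return np.array(draws), R / n_proposals      # samples, empirical acceptance
```

In one dimension with γ = 0 the acceptance rate is Φ 1 (0; 1) = 0.5; with mn in the hundreds the naive rate becomes vanishingly small, motivating the tilted proposal.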
Clearly, the availability of an i.i.d. sampling scheme from the smoothing distribution overcomes the need for mcmc methods and particle smoothers. The first set of strategies usually faces mixing or time-inefficiency issues, especially in imbalanced binary settings (Johndrow et al., 2019), whereas the second class of routines tends to be computationally intensive and subject to particle degeneracy (Doucet and Johansen, 2009).
When the focus is on Monte Carlo inference for the marginal smoothing density p(θ t | y 1:n ) at a specific time t, Algorithm 1 requires minor adaptations relying again on the additive representation of the sun in (13), under similar arguments considered for the joint smoothing setting. This latter routine can also be used to sample from the filtering distribution in (10) by applying such a scheme with n = t to obtain i.i.d. samples for θ t|t from p(θ t | y 1:t ). Leveraging realizations from the filtering distribution at time t − 1, i.i.d. samples for θ t|t−1 from the predictive density p(θ t | y 1:t−1 ) can be simply obtained via the direct application of the state equation (2), which provides predictive samples once the filtering draws are propagated forward. As a result, efficient Monte Carlo inference in small-to-moderate dimensional dynamic probit models is possible also for filtering and predictive distributions.

Algorithm 1: Independent and identically distributed sampling from p(θ 1:n | y 1:n )
[1] Sample U (1) 0 1:n|n , . . . , U (R) 0 1:n|n independently from a N pn (0, Ω̄ 1:n|n − ∆ 1:n|n Γ −1 1:n|n ∆ 1:n|n ).
[2] Sample U (1) 1 1:n|n , . . . , U (R) 1 1:n|n independently from tn mn (0, Γ 1:n|n ; A γ 1:n|n ).
[3] Compute θ (r) 1:n|n = ξ 1:n|n + ω 1:n|n (U (r) 0 1:n|n + ∆ 1:n|n Γ −1 1:n|n U (r) 1 1:n|n ), for r = 1, . . . , R.
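Turning filtering draws into predictive draws only requires one pass through the state equation. A minimal sketch, assuming the standard Gaussian transition θ t = G t θ t−1 + ε t with ε t ∼ N p (0, W t ), consistent with the role of G t and W t in (2):

```python
import numpy as np

def predictive_samples(filtering_samples, G_t, W_t, seed=None):
    """Map i.i.d. draws from p(theta_{t-1} | y_{1:t-1}) into i.i.d. draws
    from p(theta_t | y_{1:t-1}) via theta_t = G_t theta_{t-1} + eps_t."""
    rng = np.random.default_rng(seed)
    filtering_samples = np.asarray(filtering_samples)
    R, p = filtering_samples.shape
    eps = rng.multivariate_normal(np.zeros(p), W_t, size=R)
    return filtering_samples @ G_t.T + eps
```

Because the transition noise is independent of the filtering draws, each output is an exact draw from the predictive distribution whenever the inputs are exact filtering draws.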

Sequential Monte Carlo sampling
When the dimension of the dynamic probit model (1)-(2) grows, sampling from multivariate truncated Gaussians in Algorithm 1 might yield computational bottlenecks (Botev, 2017). This is particularly likely to occur in series monitored on a fine time grid. Indeed, in several applications, the number of time series m is typically small, whereas the length of the time window can be large. To address this issue and allow scalable online filtering and prediction also in large t settings, we first derive in Sect. 4.2.1 a particle filter which exploits the sun results to obtain optimality properties, in the sense of Doucet et al. (2000). We then develop a broader class of partially collapsed procedures that update via lookahead particle filters (Lin et al., 2013) only the multivariate truncated normal term in the sun additive representation, while keeping the Gaussian component exact. As outlined in Sect. 4.2.2, such a broad class of partially collapsed lookahead particle filters comprises, as a special case, the Rao-Blackwellized particle filter developed by Andrieu and Doucet (2002). This provides novel theoretical support to the notable performance of such a strategy, which was originally motivated, in the context of dynamic probit models, also by the lack of a closed-form optimal particle filter for the states.

"Optimal" particle filter
The first proposed strategy belongs to the class of sequential importance sampling-resampling (sisr) algorithms that provide default strategies in particle filtering (e.g., Doucet et al., 2000, 2001; Durbin and Koopman, 2012). For each time t, these routines sample R trajectories for θ 1:t|t from p(θ 1:t | y 1:t ), known as particles, conditioned on those produced at t − 1, by iterating, in time, between the two steps summarized below.
[1] For r = 1, . . . , R, sample θ (r) t from the proposal π(θ t | θ (r) 1:t−1 , y 1:t ) and compute the corresponding importance weight w (r) t .
[2] Resample the R trajectories with probabilities proportional to the importance weights.
As is clear from the above steps, the performance of sisr relies on the choice of π(θ t | θ 1:t−1 , y 1:t ). Such a density should allow tractable sampling along with efficient evaluation of the importance weights, and should be also carefully specified to propose effective candidate samples. Recalling Doucet et al. (2000), the optimal proposal is π(θ t | θ 1:t−1 , y 1:t ) = p(θ t | θ t−1 , y t ), with importance weights w t ∝ p(y t | θ t−1 ). Indeed, conditioned on θ 1:t−1|t−1 and y 1:t , this choice minimizes the variance of the weights, thus limiting degeneracy issues and improving mixing. Unfortunately, in several dynamic models, tractable sampling from p(θ t | θ t−1 , y t ) and the direct evaluation of p(y t | θ t−1 ) is not possible (Doucet et al., 2000). As outlined in Corollary 4, this is not the case for dynamic probit models. In particular, by leveraging the proof of Theorem 1 and the closure properties of the sun, sampling from p(θ t | θ t−1 , y t ) is straightforward and p(y t | θ t−1 ) has a simple form.
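The sisr structure just described can be sketched generically. The toy below tracks only the current particles of a hypothetical scalar state (not full trajectories, and not the paper's Algorithm 2); `propose` and `weight` stand for π(θ t | θ t−1 , y t ) and the unnormalized importance weight, which under the optimal proposal reduces to p(y t | θ t−1 ):

```python
import numpy as np

def sisr(y, propose, weight, R, seed=None):
    """Generic sequential importance sampling-resampling skeleton."""
    rng = np.random.default_rng(seed)
    particles = np.zeros(R)                      # hypothetical scalar state at t = 0
    for y_t in y:
        prev = particles
        # Step [1]: propagate each particle through the proposal and weight it.
        new = np.array([propose(th, y_t, rng) for th in prev])
        w = np.array([weight(th, nw, y_t) for th, nw in zip(prev, new)])
        w = w / w.sum()
        # Step [2]: multinomial resampling proportional to the weights.
        particles = new[rng.choice(R, size=R, p=w)]
    return particles
```

As a sanity check, a bootstrap-style run on a toy model with observations y t = θ t + noise concentrates the particles near the data after a few iterations.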
Corollary 4 For every time t = 1, . . . , n, the optimal importance density p(θ t | θ t−1 , y t ) is a sun, whereas the importance weights w t ∝ p(y t | θ t−1 ) are available in closed form in (15), with parameters defined by recursive equations.
As clarified in Corollary 4, the weights p(y t | θ t−1 ) for the generated trajectories are available analytically in (15) and do not depend on the sampled values of the particle at time t. This allows the implementation of the more efficient auxiliary particle filter (auf) (Pitt and Shephard, 1999) by simply reversing the order of the sampling and resampling steps, thereby obtaining a performance gain (Andrieu and Doucet, 2002). Algorithm 2 illustrates the pseudo-code of the proposed "optimal" auxiliary filter, which exploits the additive representation of the sun and Corollary 4. Note that, unlike for Algorithm 1, such a sequential sampling strategy requires sampling at each step from a truncated normal whose dimension does not depend on t, thus facilitating scalable sequential inference in large t studies. Samples from the predictive distribution can be obtained from those of the filtering as discussed in Sect. 4.1.
Despite having optimality properties, a close inspection of Algorithm 2 shows that the states' particles at t − 1 affect both the Gaussian component, via ξ t|t,t−1 , and the truncated normal term, via γ t|t,t−1 , in the sun additive representation of (θ t | y 1:t , θ t−1 ). An improved scheme can instead collapse out the Gaussian part and perform particle filtering only on the truncated normal samples U 1 1:t|t . Here, we present a class of procedures to sequentially generate these samples, which are then combined with realizations of the exact Gaussian component via equation (16), with z 1:t|t ∼ tn mt (F 1:t ξ 1:t|t , F 1:t Ω 1:t|t F 1:t + V 1:t ; A y1:t ) and s 1:t|t = [(DΩ 1:t|t D + Λ) ∘ I mt ] 1/2 , where D and Λ are defined as in Sect. 3.2, setting n = t. Note that the multivariate truncated normal distribution for z 1:t|t actually coincides with the conditional distribution of z 1:t given y 1:t under model (3)-(5). Indeed, by marginalizing out θ 1:t in p(z 1:t | θ 1:t ) = ∏ t s=1 φ m (z s − F s θ s ; V s ) = φ mt (z 1:t − F 1:t θ 1:t ; V 1:t ) with respect to its multivariate normal distribution derived in the proof of Theorem 2, we have p(z 1:t ) = φ mt (z 1:t − F 1:t ξ 1:t|t ; F 1:t Ω 1:t|t F 1:t + V 1:t ) and, as a direct consequence, we obtain p(z 1:t | y 1:t ) ∝ p(z 1:t )p(y 1:t | z 1:t ), which is the kernel of a tn mt (F 1:t ξ 1:t|t , F 1:t Ω 1:t|t F 1:t + V 1:t ; A y1:t ) density.
The above analytical derivations clarify that in order to sample recursively from U 1 1:t|t it is sufficient to apply equation (16) to sequential realizations of z 1:t|t from the joint conditional density p(z 1:t | y 1:t ), induced by model (3)-(5), after collapsing out θ 1:t . While basic sisr algorithms for p(z 1:t | y 1:t ), combined with the exact sampling from the Gaussian component U 0 t|t , are expected to yield an improved performance relative to the particle filter developed in Sect. 4.2.1, here we adapt an even broader class of lookahead particle filters (Lin et al., 2013), which includes the basic sisr as a special case. To introduce the general lookahead idea note that p(z 1:t | y 1:t ) = p(z t−k+1:t | z 1:t−k , y 1:t )p(z 1:t−k | y 1:t ), where k is a pre-specified delay offset. Moreover, as a direct consequence of the dependence structure displayed in Fig. 2, we also have that p(z t−k+1:t | z 1:t−k , y 1:t ) = p(z t−k+1:t | z 1:t−k , y t−k+1:t ) for any generic k. Hence, to sequentially generate realizations of z 1:t|t from p(z 1:t | y 1:t ), we can first sample z 1:t−k|t from p(z 1:t−k | y 1:t ) by extending, via sisr, the trajectory z 1:t−k−1|t−1 with optimal proposal p(z t−k | z 1:t−k−1 = z 1:t−k−1|t−1 , y t−k:t ), and then draw the last k terms in z 1:t|t from p(z t−k+1:t | z 1:t−k = z 1:t−k|t , y t−k+1:t ). Note that when k = 0 this final operation is not necessary, and the particles' updating in the first step reduces to basic sisr. Values of k in {1, . . . , n − 1} induce, instead, a lookahead structure in which at the current time t the optimal proposal for the delayed particles leverages information of response data y t−k:t that are not only contemporaneous to z t−k , i.e., y t−k , but also future, namely y t−k+1 , . . . , y t .
In this way, the samples from the sub-trajectory z 1:t−k|t of z 1:t|t at time t are more compatible with the sampling density p(z 1:t | y 1:t ) of interest and hence, when completed with the last k terms drawn from p(z t−k+1:t | z 1:t−k = z 1:t−k|t , y t−k+1:t ), produce a sequential sampling scheme from p(z 1:t | y 1:t ) with improved mixing and reduced degeneracy issues relative to basic sisr. Although the magnitude of such gains clearly grows with k, as illustrated in Sect. 5, setting k = 1 already provides empirical evidence of improved performance relative to basic sisr, without major computational costs.
To implement the aforementioned strategy it is first necessary to ensure that the lookahead proposal belongs to a class of random variables which allow efficient sampling, while having a tractable closed-form expression for the associated importance weights. Proposition 1 shows that this is the case under model (3)-(5).

Proposition 1 Under the augmented model in (3)-(5), the lookahead proposal mentioned above has the form of the z t−k marginal of p(z t−k:t | z 1:t−k−1 , y t−k:t ), where p(z t−k:t | z 1:t−k−1 , y t−k:t ) is the density of a truncated multivariate normal, with parameters r t−k:t|t−k−1 = E(z t−k:t | z 1:t−k−1 ) and S t−k:t|t−k−1 = var(z t−k:t | z 1:t−k−1 ). The importance weights w t = w(z 1:t−k ) are, instead, proportional to a ratio of multivariate Gaussian cumulative distribution functions, where the mean vectors are µ t = B t−k:t r t−k:t|t−k−1 and µ̃ t = B t−k:t−1 r t−k:t−1|t−k−1 , whereas the covariance matrices are defined as Σ t = B t−k:t S t−k:t|t−k−1 B t−k:t and Σ̃ t = B t−k:t−1 S t−k:t−1|t−k−1 B t−k:t−1 .

To complete the procedure for sampling from p(z 1:t | y 1:t ) we further require p(z t−k+1:t | z 1:t−k , y t−k+1:t ). As clarified in Proposition 2, also such a quantity is the density of a multivariate truncated normal.
Note that the expression of the importance weights in equation (18) does not depend on z t−k and, hence, also in this case the resampling step can be performed before sampling from (17), thus leading to an auf routine. Besides improving efficiency, such a strategy allows combining the particle generation in (17) with Kalman filter (kf) updates for the collapsed Gaussian component, as summarized in Algorithm 3.

Algorithm 3: Lookahead particle filter to draw from p(θ t | y 1:t ), for t = 1, . . . , n [auf version with kf steps]
Set k, and initialize a (r) 0|0 = a 0 for r = 1, . . . , R and P 0|0 = P 0 .
for t from 1 to k do
[1] Sample θ (r) t|t from Algorithm 1 [this can be done efficiently in an exact manner since k is usually small].
for t from k + 1 to n do
[2] Define the vectors and matrices that are required to perform steps [3] and [4], and compute S t−k:t|t−k−1 as in Sect. 4.2.2.
[3] Implement the resampling step under the auf version.
[3.1] For r = 1, . . . , R, calculate the importance weight w (r) t .

To clarify Algorithm 3, note that p(θ t | z 1:t ) is the filtering density of the Gaussian dynamic linear model defined in (4)-(5), for which the Kalman filter can be directly implemented, once the trajectory z 1:t|t has been generated from p(z 1:t | y 1:t ) via the lookahead filter. Let P t−k−1|t−k−1 = var(θ t−k−1 | z 1:t−k−1 ) and a t−k|t−k−1 = E(θ t−k | z 1:t−k−1 ). In particular, the formulation of the dynamic model in (4)-(5) implies that r t−k:t|t−k−1 = E(z t−k:t | z 1:t−k−1 ) = E(F t−k:t θ t−k:t | z 1:t−k−1 ), and, therefore, r t−k:t|t−k−1 can be expressed as a function of a t−k|t−k−1 via the direct application of the law of iterated expectations, by stacking the m-dimensional vectors F t−k a t−k|t−k−1 , . . . , F t a t|t−k−1 obtained by propagating a t−k|t−k−1 through the state equation, where l is defined as in Sect. 3.2. A similar reasoning can be applied to write the covariance matrix S t−k:t|t−k−1 = var(z t−k:t | z 1:t−k−1 ) as a function of P t−k|t−k−1 . In particular, letting l − = l − 1, the m × m diagonal blocks of S t−k:t|t−k−1 can be obtained sequentially after noticing that the l-th block equals F t−k+l− P t−k+l−|t−k−1 F t−k+l− + V t−k+l− for every l = 1, . . . , k + 1, where the states' covariance matrix P t−k+l−|t−k−1 at time t − k + l − can be expressed as a function of P t−k|t−k−1 via the recursive equations P t−k+l−|t−k−1 = G t−k+l− P t−k+l−−1|t−k−1 G t−k+l− + W t−k+l− , for every l = 2, . . . , k + 1. Moreover, letting l − = l − 1 and s − = s − 1, also the off-diagonal blocks can be obtained in a related manner, for every s = 2, . . . , k + 1 and l = 1, . . . , s − 1, with the matrix P t−k+l−|t−k−1 expressed as a function of P t−k|t−k−1 via the recursive equations reported above.

According to these results, the partially collapsed lookahead particle filter for sampling recursively from p(θ t | y 1:t ) simply requires storing and updating, for each particle trajectory, the sufficient statistics a t−k|t−k−1 and P t−k|t−k−1 via Kalman filter recursions applied to model (4)-(5), with every z t replaced by the particles generated under the lookahead routine. As previously discussed, this updating requires only the moments a t−k|t−k−1 and P t−k|t−k−1 computed recursively as a function of the delayed particles' trajectories. This yields a computational complexity per iteration that is constant with time, as it does not require computing quantities whose dimension grows with t. In addition, as discussed in Remark 1, such a dual interpretation, combined with our sun closed-form results, provides novel theoretical support to the Rao-Blackwellized particle filter introduced by Andrieu and Doucet (2002).

Remark 1 The Rao-Blackwellized particle filter by Andrieu and Doucet (2002) for p(θ t | y 1:t ) can be directly obtained as a special case of Algorithm 3, setting k = 0.
Consistent with Remark 1, the Rao-Blackwellized idea (Andrieu and Doucet, 2002) coincides with a partially collapsed filter which only updates, without lookahead strategies, the truncated normal component in the sun additive representation of the states' filtering distribution, while maintaining the Gaussian term exact. Hence, although this method was originally motivated, in the context of dynamic probit models, also by the apparent lack of an "optimal" closed-form sisr for the states' filtering distribution, our results show that such a strategy is expected to yield improved performance relative to the "optimal" particle filter for sampling directly from p(θ t | y 1:t ). In fact, unlike this filter, which is available in closed form according to Sect. 4.2.1, the Rao-Blackwellized idea avoids the unnecessary autocorrelation in the Gaussian component of the sun representation, and relies on an optimal particle filter for the multivariate truncated normal part. In addition, Remark 1 and the derivation of the whole class of partially collapsed lookahead filters suggest that setting k > 0 is expected to yield further gains relative to the Rao-Blackwellized particle filter; see Sect. 5 for quantitative evidence supporting these results.
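For concreteness, the moment recursions of Sect. 4.2.2 above can be sketched in code for a time-invariant specification. This is an illustrative sketch, not part of the original algorithms; the helper name `build_lookahead_moments` is hypothetical.

```python
import numpy as np

def build_lookahead_moments(F, G, W, V, a_pred, P_pred, k):
    """Illustrative sketch of the Sect. 4.2.2 recursions: given a_{t-k|t-k-1}
    (a_pred) and P_{t-k|t-k-1} (P_pred), build r_{t-k:t|t-k-1} and
    S_{t-k:t|t-k-1} for a time-invariant model with observation equation
    z_s = F theta_s + v_s, v_s ~ N(0, V), and states theta_s = G theta_{s-1} + w_s,
    w_s ~ N(0, W)."""
    m, p = F.shape
    # State predictive moments a_{t-k+l-1|t-k-1} and P_{t-k+l-1|t-k-1},
    # l = 1, ..., k+1, via P_{t-k+l-1|t-k-1} = G P_{t-k+l-2|t-k-1} G' + W.
    a_blocks, P_blocks = [a_pred], [P_pred]
    for _ in range(k):
        a_blocks.append(G @ a_blocks[-1])
        P_blocks.append(G @ P_blocks[-1] @ G.T + W)
    # r: stack the m-dimensional vectors F a_{t-k+l-1|t-k-1}.
    r = np.concatenate([F @ a for a in a_blocks])
    # S: diagonal blocks F P F' + V; off-diagonal blocks propagate P through G.
    S = np.zeros((m * (k + 1), m * (k + 1)))
    for l in range(k + 1):
        S[l*m:(l+1)*m, l*m:(l+1)*m] = F @ P_blocks[l] @ F.T + V
        cross = P_blocks[l]          # cov(theta_., theta_{t-k+l-1} | z_{1:t-k-1})
        for s in range(l + 1, k + 1):
            cross = G @ cross
            block = F @ cross @ F.T  # cov(z_{t-k+s-1}, z_{t-k+l-1} | z_{1:t-k-1})
            S[s*m:(s+1)*m, l*m:(l+1)*m] = block
            S[l*m:(l+1)*m, s*m:(s+1)*m] = block.T
    return r, S
```

For k = 0 the function returns the usual one-step Kalman predictive moments, consistent with Remark 1.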

Illustration on financial time series
Recalling Sects. 1-4, our core contribution in this article is not on developing innovative dynamic models for binary data with improved ability in recovering some ground-truth generative process, but on providing novel closed-form methods for filtering, prediction and smoothing in dynamic probit models. Nonetheless, predicting the direction of stock market movements has attracted considerable interest in recent years (e.g., Kim and Han, 2000; Kara et al., 2011; Atkins et al., 2018), with common approaches combining a wide variety of technical indicators and news information to forecast stock market directions via complex machine learning methods. Here, we show how a similar predictive performance can be obtained via a simple and interpretable dynamic probit regression for y t , which combines past information on the opening directions of the cac40 with those of the nikkei225, regarded as binary covariates x t with dynamic coefficients. Since the Japanese market opens before the French one, x t is available prior to y t and, hence, provides a valid predictor for each day t.
Recalling the above discussion and leveraging the default model specifications in these settings (e.g., Soyer and Sung, 2013), we rely on a dynamic probit regression for y t with two independent random walk processes for the coefficients θ t = (θ 1t , θ 2t ). Letting F t = (1, x t ) and pr(y t = 1 | θ t ) = Φ(θ 1t + θ 2t x t ; 1), such a model can be expressed as in equations (1)-(2), where θ 0 ∼ N 2 (a 0 , P 0 ), whereas W is a time-invariant diagonal matrix. To implement this routine, we set a 0 = (0, 0) and P 0 = diag(3, 3) following the guidelines in Gelman et al. (2008) and Chopin and Ridgway (2017).

Figure 4: For every t = 1, . . . , 97, the accuracy is measured via the median Wasserstein distance (over 100 replicated experiments) between the empirical filtering distribution computed from 10^3, 10^4 and 10^5 particles, respectively, and the one obtained by direct evaluation of the associated exact density from (10) on two grids of 2000 equally spaced values for θ 1t and θ 2t . This allows computing, for every t = 1, . . . , 97, the ranking of each sampling scheme in terms of accuracy in approximating the exact filtering density at time t, and deriving the associated barplot summarizing the distribution of the rankings over the whole window.

The extended Kalman filter relies on Gaussian approximations of the filtering distribution, which may be poorly performing when imbalances in the data induce skewness. The bootstrap particle filter (Gordon et al., 1993) provides, instead, a general sisr that relies on the importance density p(θ t | θ t−1 ), thus failing to account effectively for the information in y t when proposing particles. Rao-Blackwellized sequential Monte Carlo (Andrieu and Doucet, 2002) aims at providing an alternative particle filter, which also addresses the apparent unavailability of an analytic form for the "optimal" particle filter (Doucet et al., 2000).
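A minimal simulation of this model specification can be sketched as follows. This is an illustrative sketch: the state innovation variances in `W_diag` are an arbitrary choice for illustration, not values estimated in this section.

```python
import numpy as np
from math import erf, sqrt

def Phi(u):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def simulate_dynamic_probit(x, a0=(0.0, 0.0), P0_diag=(3.0, 3.0),
                            W_diag=(0.01, 0.01), seed=0):
    """Simulate the illustrative model: theta_t = theta_{t-1} + eps_t (two
    independent random walks, theta_0 ~ N2(a0, P0)) and
    y_t ~ Bernoulli(Phi(theta_1t + theta_2t * x_t))."""
    rng = np.random.default_rng(seed)
    n = len(x)
    theta = np.empty((n, 2))
    y = np.empty(n, dtype=int)
    state = rng.normal(a0, np.sqrt(P0_diag))       # draw theta_0
    for t in range(n):
        state = state + rng.normal(0.0, np.sqrt(W_diag))
        theta[t] = state
        prob = Phi(theta[t, 0] + theta[t, 1] * x[t])
        y[t] = rng.binomial(1, prob)
    return theta, y
```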
The authors overcome this issue by proposing a sequential Monte Carlo strategy for the Rao-Blackwellized density p(z 1:t | y 1:t ) of the partially observed Gaussian responses z 1:t in model (3)-(5) and compute, for each trajectory z 1:t|t , relevant moments of (θ t | z 1:t|t ) via classical Kalman filter updates -applied to model (4)-(5) -which are then averaged across the particles to obtain Monte Carlo estimates for the moments of (θ t | y 1:t ).
As specified in Remark 1, this solution, when adapted to draw samples from p(θ t | y 1:t ), is a special case of the sequential strategy in Sect. 4.2.2, with no lookahead, i.e., k = 0.
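Under this description, a k = 0 partially collapsed filter can be sketched as follows. This is an illustrative reimplementation for m = 1, with a simple rejection sampler standing in for the specialised truncated normal samplers (e.g., Botev, 2017); all function names are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def Phi(u):
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def sample_truncnorm(rng, mean, sd, positive):
    # Rejection sampler for a univariate Gaussian truncated to a half-line;
    # adequate for this sketch, though tail-robust samplers are preferable.
    while True:
        z = rng.normal(mean, sd)
        if (z > 0) == positive:
            return z

def rao_blackwellized_filter(y, x, G, W, a0, P0, R=500, seed=0):
    """Sketch of a k = 0 Rao-Blackwellized filter (Andrieu and Doucet, 2002)
    for the dynamic probit model with m = 1: each particle trajectory z_{1:t}
    is summarised by the Kalman sufficient statistic a_{t|t} of p(theta_t | z_{1:t});
    the covariance P_{t|t} does not depend on z and is shared across particles."""
    rng = np.random.default_rng(seed)
    p = len(a0)
    a = np.tile(np.asarray(a0, float), (R, 1))   # a_{t|t}, one row per particle
    P = np.asarray(P0, float).copy()             # shared P_{t|t}
    means_out = []
    for t in range(len(y)):
        F = np.array([1.0, x[t]])[:p]
        # Kalman prediction step.
        a_pred = a @ G.T
        P_pred = G @ P @ G.T + W
        r = a_pred @ F                           # r_{t|t-1}, one value per particle
        S = F @ P_pred @ F + 1.0                 # S_{t|t-1} = F P F' + 1 (probit scale)
        # "Optimal" weights p(y_t | z_{1:t-1}) and resampling.
        w = np.array([Phi((2 * y[t] - 1) * ri / sqrt(S)) for ri in r])
        idx = rng.choice(R, size=R, p=w / w.sum())
        a_pred, r = a_pred[idx], r[idx]
        # Sample z_t from its truncated normal conditional, then Kalman update.
        z = np.array([sample_truncnorm(rng, ri, sqrt(S), y[t] == 1) for ri in r])
        K = (P_pred @ F) / S                     # Kalman gain
        a = a_pred + np.outer(z - r, K)
        P = P_pred - np.outer(K, F @ P_pred)
        means_out.append(a.mean(axis=0))         # Monte Carlo E(theta_t | y_{1:t})
    return np.array(means_out)
```

The fact that P t|t can be updated once, rather than per particle, reflects the well-known property that the Kalman covariance recursion does not depend on the observed trajectory.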
Although the above methods yield state-of-the-art solutions, the exact sun results derived in this article provide a natural benchmark for comparing them. For the two dynamic state variables θ 1t and θ 2t , the accuracy of each sampling scheme is measured via the Wasserstein distance (e.g., Villani, 2008) between the empirical filtering distribution computed, for every time t = 1, . . . , 97, from R = 10^3, R = 10^4 and R = 10^5 particles produced by that specific scheme and the one obtained via direct evaluation of the exact sun density from (10).

Table 1: For the states θ 1t and θ 2t , averaged accuracy in approximating the exact sun filtering distribution at t = 1, . . . , 97, and computational cost for obtaining a sample of dimension R from such a filtering distribution at time t. For each scheme, the accuracy is measured via the Wasserstein distance between the empirical filtering distribution computed from 10^3, 10^4 and 10^5 particles, respectively, and the one obtained via direct evaluation of the associated exact sun density from (10).

Note also that, although ekf and rao-b focus, mostly, on moments of (θ t | y 1:t ), such strategies can be adapted to sample from an approximation of the filtering distribution. As shown in Figure 4, the i.i.d. sampler, whose cost grows with t, becomes progressively impractical in long series, thus motivating scalable particle filters with linear cost in t, such as boot, rao-b, opt and la-1. In our basic R implementation, we found that the proposed i.i.d. sampler has reasonable runtimes (of a couple of minutes) also for larger series with mt ≈ 300.
However, in much higher dimensions the particle filters become orders of magnitude faster and still practically effective.
As expected, the opt filter in Sect. 4.2.1 tends to improve the performance of boot, since this strategy is optimal within the class where boot is defined. Fig. 4 and Table 1 provide empirical evidence in support of this argument, while displaying additional improvements of the lookahead strategy derived in Sect. 4.2.2 over rao-b, even when k is set just to 1, i.e., la-1. As shown in Table 1, the complexities of la-1 and rao-b are of the same order, except for sampling from bivariate truncated normals under la-1 instead of univariate ones as in rao-b. This holds for any fixed k, with the additional sampling cost being C(m[k + 1]).
However, consistent with the results in Fig. 4 and Table 1, it suffices to set k quite small to already obtain some improvements.

The filtering and predictive distributions of the probability process pr(y t = 1 | θ t ) can be easily obtained by applying the function Φ(θ 1t + x t θ 2t ; 1) to the particles of the states' filtering and predictive distributions. In line with Fig. 3, a positive opening of the nikkei225 provides, in general, a high estimate for the probability that y t = 1, whereas a negative opening tends to favor the event y t = 0. However, the strength of this relation evolves over time, with some periods showing less evident shifts in the probability process when x t changes from 1 to 0. One-step-ahead prediction, leveraging the samples of the predictive distribution for the probability process, led to a correct classification rate of 66.34%, which is comparable to those obtained under more complex procedures combining a wide variety of inputs to predict stock market directions via state-of-the-art machine learning methods (e.g., Kim and Han, 2000; Kara et al., 2011; Atkins et al., 2018).
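The mapping from state particles to the probability process, and the resulting classification rule, can be sketched as follows (illustrative helpers, not from the original implementation):

```python
import numpy as np
from math import erf, sqrt

def Phi(u):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def probability_particles(theta_particles, x_t):
    """Map state particles (an R x 2 array) at time t to draws from the
    distribution of pr(y_t = 1 | theta_t) = Phi(theta_1t + x_t * theta_2t)."""
    return np.array([Phi(th1 + x_t * th2) for th1, th2 in theta_particles])

def classification_rate(prob_means, y):
    """One-step-ahead correct classification rate, predicting y_t = 1 when the
    predictive mean of the probability process exceeds 0.5."""
    return np.mean((np.asarray(prob_means) > 0.5).astype(int) == np.asarray(y))
```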

Discussion
This article shows that filtering, predictive and smoothing densities in multivariate dynamic probit models have a sun kernel and the associated parameters can be computed via tractable expressions. As discussed in Sects.

Appendix A: Proofs of the main results
Proof of Lemma 1. To prove Lemma 1, note that, by applying Bayes' rule, we obtain p(θ 1 | y 1 ) ∝ p(θ 1 )p(y 1 | θ 1 ), where p(θ 1 ) = φ p (θ 1 − G 1 a 0 ; G 1 P 0 G 1 + W 1 ) can be obtained by noting that θ 1 = G 1 θ 0 + ε 1 in (2), with θ 0 ∼ N p (a 0 , P 0 ) and ε 1 ∼ N p (0, W 1 ). The form of the probability mass function of (y 1 | θ 1 ) is instead a direct consequence of equation (1). Hence, combining these results and recalling (6), it is clear that p(θ 1 | y 1 ) is proportional to the density of a sun with suitably specified parameters, whose expressions are obtained by matching the above product with the kernel of (6), with s 1 −1 as in Lemma 1. Note that this term is introduced to make Γ 1|1 a correlation matrix, as required in the sun parametrization (Arellano-Valle and Azzalini, 2006). Recalling Durante (2019), and substituting these quantities in the kernel of the sun density (6), we obtain the expression in Lemma 1, thus proving the result. To prove that Ω * 1|1 is a correlation matrix, replace the identity I m with B 1 V 1 B 1 in the proof of Theorem 1 by Durante (2019).
Proof of Theorem 1. Recalling equation (2), the proof for p(θ t | y 1:t−1 ) in (9) requires studying the variable G t θ t−1 + ε t , given y 1:t−1 , where (θ t−1 | y 1:t−1 ) has the sun filtering distribution at time t − 1 and ε t ∼ N p (0, W t ), with ε t ⊥ y 1:t−1 . To address this goal, first note that, by the closure properties of the sun family under linear transformations (Azzalini and Capitanio, 2014, Sect. 7.1.2), (G t θ t−1 | y 1:t−1 ) is itself a sun random variable. Hence, to conclude the proof of equation (9), we only need to obtain the distribution of the sum of this variable and the noise ε t ∼ N p (0, W t ). This can be accomplished by considering the moment generating function of such a sum, as done by Azzalini and Capitanio (2014, Sect. 7.1.2) to prove closure under convolution. Indeed, it is straightforward to note that the product of the moment generating functions of ε t and (G t θ t−1 | y 1:t−1 ) leads to the moment generating function of a sun random variable having the parameters in (9), including ξ t|t−1 = G t ξ t−1|t−1 . Regarding equation (10), note that p(θ t | y 1:t ) coincides with the posterior density in the probit model with likelihood p(y t | θ t ) and sun prior p(θ t | y 1:t−1 ) from (9). Hence, (10) can be derived from Corollary 4 in Durante (2019), replacing the matrix I m as in the proof of Lemma 1.

Proof of Corollary 1. To prove Corollary 1, re-write the quantity of interest in terms of the kernel of the predictive density from (9).
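A sketch of the moment generating function argument used in the proof of Theorem 1, assuming the sun mgf reported in Azzalini and Capitanio (2014, Sect. 7.1.2), where ξ, Ω, ∆, γ, Γ and the diagonal scale ω denote the parameters of (G t θ t−1 | y 1:t−1 ):

```latex
% mgf of the sun variable (G_t \theta_{t-1} \mid y_{1:t-1}) and of the noise:
M_{G_t\theta_{t-1}\mid y_{1:t-1}}(s)
  = \exp\!\big\{\xi^{\top}s + \tfrac{1}{2}\,s^{\top}\Omega s\big\}\,
    \frac{\Phi_m\!\big(\gamma + \Delta^{\top}\omega s;\,\Gamma\big)}
         {\Phi_m(\gamma;\,\Gamma)},
\qquad
M_{\varepsilon_t}(s) = \exp\!\big\{\tfrac{1}{2}\,s^{\top}W_t s\big\}.
% Their product replaces \Omega with \Omega + W_t while leaving the
% \Phi_m ratio unchanged (up to rescaling \Delta by the new diagonal scale
% matrix), and is therefore again the mgf of a sun random variable,
% yielding the parameters in equation (9).
```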
Proof of Corollary 3. The expression for the marginal likelihood follows by noting that p(y 1:n ) is the normalizing constant of the smoothing density. Indeed, p(y 1:n ) = ∫ p(y 1:n | θ 1:n )p(θ 1:n )dθ 1:n , where the integrand coincides with the kernel of the smoothing density, so that the whole integral is equal to Φ mn (γ 1:n|n ; Γ 1:n|n ).
Proof of Corollary 4. The proof of Corollary 4 is similar to that of Lemma 1. Indeed, the proposal p(θ t | θ t−1 , y t ) is proportional to the product between the likelihood p(y t | θ t ) and the Gaussian transition density p(θ t | θ t−1 ), so that the sun form in (14) follows from the same argument used for Lemma 1. To derive the importance weights in (15), it suffices to notice that the marginal likelihood p(y t | θ t−1 ) coincides with the normalizing constant of the sun in (14).

Appendix B: Derivation of computational costs
In this section we derive the computational costs of the algorithms discussed in Sects. 4 and 5. Let us first consider Algorithm 1, with an initial focus on the smoothing distribution. For this routine, the matrix computations to obtain the parameters of interest require O(n 3 [p 3 + m 3 ]) operations. Regarding the sampling cost to obtain R draws, step [1] requires O(p 3 n 3 + Rp 2 n 2 ) operations, since we have to first compute the Cholesky decomposition of Ω 1:n|n − ∆ 1:n|n Γ −1 1:n|n ∆ 1:n|n in O(p 3 n 3 ), and then multiply each independent sample by the resulting lower triangular matrix, at O(Rp 2 n 2 ) total cost.
Step [2] requires, instead, to obtain a minimax exponentially-tilted estimate at O(m 3 n 3 ) cost (Botev, 2017) and then perform O(n 2 m 2 C(mn)) operations for each independent sample, where C(d) denotes the average number of proposed draws required per accepted sample in Botev (2017), when the dimension of the truncated normal is d. Hence, the overall cost of Algorithm 1 is O(n 3 (p 3 + m 3 ) + Rn 2 [p 2 + m 2 C(mn)]). If the interest is in the filtering distribution, which coincides with the marginal smoothing at n = t, it is sufficient to sample U 0 n|n instead of U 0 1:n|n . Hence, the overall cost for R samples reduces to O(tp 3 + t 3 m 3 + R[p 2 + t 2 m 2 C(mt)]).
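The two-stage cost of the Gaussian part of step [1] can be made concrete with a short sketch; `sample_gaussian_part` is a hypothetical helper name, and the argument stands for the matrix whose Cholesky factor is needed.

```python
import numpy as np

def sample_gaussian_part(Omega_minus, R, seed=None):
    """Gaussian component of the sun additive representation: one Cholesky
    factorisation at O(d^3) cost (done once), followed by an O(R d^2)
    matrix multiply to map R standard normal draws to the target covariance."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Omega_minus)       # O(d^3), lower triangular factor
    return rng.standard_normal((R, Omega_minus.shape[0])) @ L.T  # O(R d^2)
```

Since the factorisation is shared across draws, the per-sample cost is quadratic rather than cubic in the dimension, which is the source of the Rp 2 n 2 term above.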
We now consider the computational costs of the particle filters discussed in Sects. 4 and 5. For each t, the cost is due to the computation of parameters, the sampling steps and the evaluation of the importance weights. Starting with the "optimal" particle filter in Sect. 4.2.1, the matrix operations for computing the quantities in steps [3.1]-[3.3] of Algorithm 2 have an overall cost, for the R samples, of O(m 3 + pm 2 + p 2 m + Rpm + Rp 2 ). The sampling costs are, instead, O(p 3 + Rp 2 ) and O(m 3 + Rm 2 C(m)) for the Gaussian and truncated normal terms, respectively.
To conclude the derivation of the computational costs, it is necessary to derive those associated with the evaluation of the importance weights. For all the particle fil-ters analyzed, such weights are obtained by evaluating in R different points the cumulative distribution function of a zero mean multivariate normal with fixed covariance matrix. To facilitate comparison, we assume that this evaluation relies on a Monte Carlo estimate based on M samples in all the particle filters. For the "optimal" particle filter, this step requires O(m 3 + M m 2 ) operations to obtain the samples, plus O(M Rm) for computing the Monte Carlo estimate. Combining these results, the overall cost for the "optimal" particle filter at time t is O(t(p 3 +m 3 )+tR[p 2 +pm+m 2 C(m)]+tM [m 2 +Rm]).
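The Monte Carlo evaluation of the weights can be sketched as follows (an illustrative helper, assuming all R evaluation points share the same zero-mean covariance matrix, as stated above):

```python
import numpy as np

def mc_gaussian_cdf_weights(b_points, Sigma, M=10000, seed=None):
    """Monte Carlo estimate of Phi_m(b; Sigma) at R evaluation points sharing
    the same covariance: draw M samples once (O(m^3 + M m^2)), then compare
    each sample with every point (O(M R m)), mirroring the cost derivation."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)                              # O(m^3)
    samples = rng.standard_normal((M, Sigma.shape[0])) @ L.T   # O(M m^2)
    # For each point b, estimate pr(X <= b) by the fraction of samples below b.
    return np.array([(samples <= b).all(axis=1).mean() for b in b_points])
```

Accuracy-oriented implementations would instead use specialised estimators for Gaussian orthant probabilities; this sketch only illustrates where the m 3 , M m 2 and M Rm terms arise.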
Let us now derive the cost of the Rao-Blackwellized algorithm by Andrieu and Doucet (2002). In this case, adapting the notation of the original paper to the one of Sect. 4.2.2, it can be noticed that one kf step requires O(p 3 + Rp 2 + Rpm + m 3 ) operations for the computation of P t|t−1 , a t|t−1 , S t|t−1 , r t|t−1 , P t|t and a t|t , at any t. As for the sampling part, it first requires R draws from an m-variate truncated normal. Exploiting the same arguments considered for the previous algorithms, this step has an O(m 3 + Rm 2 C(m)) cost. The sampling from the final Gaussian filtering distribution p(θ t | z 1:t = z 1:t|t ) of direct interest requires instead O(p 3 + Rp 2 ) operations. Leveraging again the derivations for the previous algorithms, the computation of the importance weights has cost O(m 3 + M m 2 + RM m). Therefore, the overall cost of the sequential filtering procedure at time t is O(t(p 3 +m 3 )+tR[p 2 +pm+m 2 C(m)]+tM [m 2 +Rm]).
The above derivations for the Rao-Blackwellized algorithm directly extend to the partially collapsed lookahead particle filter shown in Algorithm 3. In fact, at each t, Algorithm 3 performs analogous operations, with the m-dimensional quantities involved in the Rao-Blackwellized updates replaced by their (k + 1)m-dimensional lookahead counterparts. Hence, adapting the cost of the Rao-Blackwellized algorithm to this broader setting, we have that the overall cost of Algorithm 3 at time t is O(t(k₊p³ + k₊³m³) + tR[k₊p² + k₊pm + k₊²m²C(k₊m)] + tM[k₊²m² + Rk₊m]), where k₊ = k + 1. Note that, in practice, k is set equal to a pre-specified small constant and, therefore, the actual implementation cost reduces to O(t(p³ + m³) + tR[p² + pm + m²C(k₊m)] + tM[m² + Rm]), where k₊ only enters in C(k₊m).
The bootstrap particle filter leverages the proposal p(θ t | θ t−1 ), with importance weights given by the likelihood in equation (1). Hence, exploiting similar arguments considered for the previous routines yields a cost O(t(p 3 + m 3 ) + tR(p 2 + pm) + tM [m 2 + Rm]).
Finally, note that the cost of the extended Kalman filter (Uhlmann, 1992) is lower than the one of the particle filters since no sampling is involved, except for the Monte Carlo evaluation of the multivariate probit likelihood. In particular, at each t, one has to invert a p × p and an m × m matrix, plus computing the likelihood, which yields a total cost at t of O(t[p 3 + m 3 + M m 2 ]).