When computing ELPD-based exact LOO-CV for a Bayesian model we need to compute the log leave-one-out predictive densities \(\log {p(y_i | y_{-i})}\) for every response value \(y_i, \, i = 1, \ldots , N\), where \(y_{-i}\) denotes all response values except observation \(i\). To obtain \(p(y_i | y_{-i})\), we need to have access to the pointwise likelihood \(p(y_i\,|\, y_{-i}, \theta )\) and integrate over the model parameters \(\theta \):
$$\begin{aligned} p(y_i\,|\,y_{-i}) = \int p(y_i\,|\, y_{-i}, \theta ) \, p(\theta \,|\, y_{-i}) \,d \theta \end{aligned}$$
(1)
Here, \(p(\theta \,|\, y_{-i})\) is the leave-one-out posterior distribution for \(\theta \), that is, the posterior distribution for \(\theta \) obtained by fitting the model while holding out the \(i\)th observation (in Sect. 3, we will show how refitting the model to data \(y_{-i}\) can be avoided).
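In practice, the integral in (1) is approximated using posterior draws. For instance, given \(S\) draws \(\theta ^{(s)}\) from the leave-one-out posterior \(p(\theta \,|\, y_{-i})\), a simple Monte Carlo estimate (stated here only for illustration) is
$$\begin{aligned} p(y_i\,|\,y_{-i}) \approx \frac{1}{S} \sum _{s=1}^S p(y_i\,|\, y_{-i}, \theta ^{(s)}), \end{aligned}$$
which requires only the pointwise likelihood evaluated at the posterior draws.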
If the observation model is formulated directly as the product of the pointwise observation models, we call it a factorized model. In this case, the likelihood is also the product of the pointwise likelihood contributions \(p(y_i\,|\, y_{-i}, \theta )\). To better illustrate possible structures of the observation models, we formally divide \(\theta \) into two parts, observation-specific latent variables \(f = (f_1, \ldots , f_N)\) and hyperparameters \(\psi \), so that \(p(y_i\,|\, y_{-i}, \theta ) = p(y_i\,|\, y_{-i}, f_i, \psi )\). Depending on the model, one of the two parts of \(\theta \) may also be empty. In very simple models, such as linear regression models, latent variables are not explicitly represented and response values are conditionally independent given \(\psi \), so that \(p(y_i\,|\, y_{-i}, f_i, \psi ) = p(y_i \,|\, \psi )\) (see Fig. 1a). The full likelihood can then be written in the familiar form
$$\begin{aligned} p(y \,|\, \psi ) = \prod _{i=1}^N p(y_i \,|\, \psi ), \end{aligned}$$
(2)
where \(y = (y_1, \ldots , y_N)\) denotes the vector of all responses. When the likelihood factorizes this way, the conditional pointwise log-likelihood can be obtained easily by computing \(\log p(y_i\,|\, \psi )\) for each \(i\) with computational cost \(O(N)\).
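As a small illustration of this factorized computation, consider the following Python sketch (not part of the model description above; the simulated data and the single posterior draw of \(\psi = (\beta , \sigma )\) are purely hypothetical), which evaluates the pointwise log-likelihood of a normal linear regression in \(O(N)\):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N)])         # design matrix
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.3, size=N)  # simulated responses

# one hypothetical posterior draw of psi = (beta, sigma)
beta_draw, sigma_draw = np.array([0.9, 0.6]), 0.35

# O(N): one univariate log-density evaluation per observation
pointwise_loglik = norm.logpdf(y, loc=X @ beta_draw, scale=sigma_draw)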
If directional paths between consecutive responses are added, the responses are no longer conditionally independent, but the model still factorizes into simple terms with Markovian dependency. This is common in time-series models. For example, in an autoregressive model of order 1 (see Fig. 1b), the pointwise likelihoods are given by \(p(y_i \,|\, y_{i-1}, \psi )\). Other time-series models have observation-specific latent variables \(f_i\) and conditionally independent responses, so that the pointwise likelihoods simplify to \(p(y_i\,|\, y_{-i}, f_i, \psi ) = p(y_i \,|\, f_i)\). In models without directional paths between the latent values \(f\) (see Fig. 1c), such as latent Gaussian processes (GPs; e.g., Rasmussen 2003) or spatial conditional autoregressive (CAR) models (e.g., Gelfand and Vounatsou 2003), an explicit joint prior over \(f\) is imposed. In models with directional paths between the latent values \(f\) (see Fig. 1d), such as hidden Markov models (HMMs; e.g., Rabiner and Juang 1986), the joint prior over \(f\) is defined implicitly via the directional dependencies. What is more, estimation can make use of the latent Markov property of such models, for example, using the Kalman filter (e.g., Welch et al. 1995). In all of these cases (i.e., Fig. 1a–d), the factorization property is retained and the computational cost of the pointwise log-likelihood contributions remains \(O(N)\).
Yet, there are several reasons why a non-factorized observation model (see Fig. 1e) may be necessary or preferred. In non-factorized models, the joint likelihood of the response values \(p(y \,|\, \theta )\) is not factorized into observation-specific components, but rather given directly as one joint expression. For some models, an analytical factorized formulation is simply not available, in which case we speak of a non-factorizable model. Even in models whose observation model can be factorized in principle, it may still be preferable to use a non-factorized form. This is true in particular for models with observation-specific latent variables (see Fig. 1c, d), as a non-factorized formulation in which the latent variables have been integrated out is often more efficient and numerically stable. For example, a latent GP combined with a Gaussian observation model can be fit more efficiently by marginalizing over \(f\) and formulating the GP directly on the responses \(y\) (e.g., Rasmussen 2003). Such marginalization has the additional advantage that both exact and approximate leave-one-out predictive estimation become more stable. This is because, in the factorized formulation, leaving out the response \(y_i\) also implies treating the corresponding latent variable \(f_i\) as missing, which is then identified only through the joint prior over \(f\). If this prior is weak, the posterior of \(f_i\) is highly influenced by the single observation \(y_i\), and the leave-one-out predictions of \(y_i\) may be unstable, both numerically and because of estimation error due to finite MCMC sampling or similar finite approximations.
Whether a non-factorized model is used out of necessity or for reasons of efficiency and stability, it comes at the cost of having no direct access to the leave-one-out predictive densities (1) and thus to the overall leave-one-out predictive accuracy. In theory, we can express the observation-specific likelihoods in terms of the joint likelihood via
$$\begin{aligned} p(y_i \,|\, y_{-i}, \theta ) = \frac{p(y \,|\, \theta )}{p(y_{-i} \,|\, \theta )} = \frac{p(y \,|\, \theta )}{\int p(y \,|\, \theta ) \, d y_i}, \end{aligned}$$
(3)
but the expression on the right-hand side of (3) may not always have an analytical solution. Computing \(\log p(y_i \,|\, y_{-i}, \theta )\) for non-factorized models is therefore often impossible, or at least inefficient and numerically unstable. However, there is a large class of multivariate normal and Student-\(t\) models for which we will provide efficient analytical solutions in this paper.
Non-factorized normal models
The density of the \(N\)-dimensional multivariate normal distribution of the vector \(y\) is given by
$$\begin{aligned} p(y | \mu , \Sigma ) = \frac{1}{\sqrt{(2 \pi )^N |\Sigma |}} \exp \left( -\frac{1}{2}(y - \mu )^{\mathrm{T}} \Sigma ^{-1} (y - \mu ) \right) \end{aligned}$$
(4)
with mean vector \(\mu \) and covariance matrix \(\Sigma \). Often \(\mu \) and \(\Sigma \) are functions of the model parameters \(\theta \), that is, \(\mu = \mu (\theta )\) and \(\Sigma = \Sigma (\theta )\), but for notational convenience we omit the potential dependence of \(\mu \) and \(\Sigma \) on \(\theta \) unless it is relevant. Using standard multivariate normal theory (e.g., Tong 2012), we know that for the \(i\)th observation the conditional distribution \(p(y_i | y_{-i}, \theta )\) is univariate normal with mean
$$\begin{aligned} {\tilde{\mu }}_{i} = \mu _i + \sigma _{i,-i} \Sigma ^{-1}_{-i} (y_{-i} - \mu _{-i}) \end{aligned}$$
(5)
and variance
$$\begin{aligned} {\tilde{\sigma }}_{i} = \sigma _{ii} - \sigma _{i,-i} \Sigma ^{-1}_{-i} \sigma _{-i,i}. \end{aligned}$$
(6)
In the equations above, \(\mu _{-i}\) is the mean vector without the \(i\)th element, \(\Sigma _{-i}\) is the covariance matrix without the \(i\)th row and column (\(\Sigma ^{-1}_{-i}\) is its inverse), \(\sigma _{i,-i}\) and \(\sigma _{-i,i}\) are the \(i\)th row and column vectors of \(\Sigma \) without the \(i\)th element, and \(\sigma _{ii}\) is the \(i\)th diagonal element of \(\Sigma \). Equations (5) and (6) can be used to compute the pointwise log-likelihood values as
$$\begin{aligned} \log p(y_i \,|\, y_{-i},\theta ) = - \frac{1}{2}\log (2\pi {\tilde{\sigma }}_{i}) - \frac{1}{2}\frac{(y_i-{\tilde{\mu }}_{i})^2}{{\tilde{\sigma }}_{i}}. \end{aligned}$$
(7)
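For reference, Eqs. (5)–(7) can be implemented directly, as in the following Python sketch (a sketch only, assuming dense numpy arrays; the function name is ours and the function would be called once per posterior draw):

import numpy as np

def loo_loglik_normal_brute_force(y, mu, Sigma):
    """Pointwise log p(y_i | y_{-i}, theta) via Eqs. (5)-(7):
    Sigma_{-i} is inverted separately for each i, giving O(N^{k+1})."""
    N = len(y)
    loglik = np.empty(N)
    for i in range(N):
        keep = np.delete(np.arange(N), i)
        Sigma_mi_inv = np.linalg.inv(Sigma[np.ix_(keep, keep)])   # Sigma_{-i}^{-1}
        sigma_i_mi = Sigma[i, keep]                               # sigma_{i,-i}
        mu_tilde = mu[i] + sigma_i_mi @ Sigma_mi_inv @ (y[keep] - mu[keep])      # Eq. (5)
        sigma_tilde = Sigma[i, i] - sigma_i_mi @ Sigma_mi_inv @ Sigma[keep, i]   # Eq. (6)
        loglik[i] = (-0.5 * np.log(2 * np.pi * sigma_tilde)
                     - 0.5 * (y[i] - mu_tilde) ** 2 / sigma_tilde)               # Eq. (7)
    return loglik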
Evaluating Eq. (7) for each \(y_i\) and each posterior draw \(\theta _s\) then constitutes the input for the LOO-CV computations. However, the resulting procedure is quite inefficient. Computation is usually dominated by the \(O(N^k)\) cost of computing \(\Sigma _{-i}^{-1}\), where \(k\) depends on the structure of \(\Sigma \): if \(\Sigma \) is dense, then \(k = 3\); for sparse \(\Sigma \) or reduced-rank computations, \(2 < k < 3\). Since \(\Sigma _{-i}^{-1}\) must be computed separately for each \(i\), the overall complexity is actually \(O(N^{k + 1})\).
Additionally, if \(\Sigma _{-i}\) also depends on the model parameters \(\theta \) in a non-trivial manner, which is the case for most models of practical relevance, it needs to be inverted for each of the \(S\) posterior draws. In most applications the overall complexity is therefore \(O(S N^{k+1})\), which is impractical in most situations. Accordingly, we seek more efficient expressions for \({\tilde{\mu }}_{i}\) and \({\tilde{\sigma }}_{i}\) that make these computations feasible in practice.
Proposition 1
If y is multivariate normal with mean vector \(\mu \) and covariance matrix \(\Sigma \), then the conditional mean and variance of \(y_i\) given \(y_{-i}\) for any observation i can be computed as
$$\begin{aligned} {\tilde{\mu }}_{i}= & {} y_i - \frac{g_i}{{\bar{\sigma }}_{ii}}, \end{aligned}$$
(8)
$$\begin{aligned} {\tilde{\sigma }}_{i}= & {} \frac{1}{{\bar{\sigma }}_{ii}}, \end{aligned}$$
(9)
where \(g_i = \left[ \Sigma ^{-1} (y - \mu )\right] _i\) and \({\bar{\sigma }}_{ii} = \left[ \Sigma ^{-1}\right] _{ii}\).
The proof is based on results from Sundararajan and Keerthi (2001) and is provided in the Appendix. Contrary to the brute force computations in (5) and (6), where \(\Sigma _{-i}\) has to be inverted separately for each \(i\), Eqs. (8) and (9) require inverting the full covariance matrix \(\Sigma \) only once, after which it can be reused for each \(i\). This reduces the computational cost to \(O(N^k)\) if \(\Sigma \) is independent of \(\theta \) and \(O(S N^k)\) otherwise. If the model is parameterized in terms of the covariance matrix \(\Sigma = \Sigma (\theta )\), it is not possible to reduce the complexity further, as inverting \(\Sigma \) is unavoidable. However, if the model is parameterized directly through the inverse of \(\Sigma \), that is, \(\Sigma ^{-1} = \Sigma ^{-1}(\theta )\), the complexity goes down to \(O(S N^2)\). Note that the latter is not possible in the brute force approach, as both \(\Sigma \) and \(\Sigma ^{-1}\) are required.
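For concreteness, Proposition 1 can be implemented as in the following Python sketch (a sketch only, assuming dense numpy arrays; \(\Sigma ^{-1}\) is passed directly, either precomputed from \(\Sigma \) or available analytically, and the function name is ours):

import numpy as np

def loo_loglik_normal(y, mu, Sigma_inv):
    """Pointwise log p(y_i | y_{-i}, theta) via Eqs. (8)-(9):
    a single Sigma^{-1} is reused for all observations."""
    g = Sigma_inv @ (y - mu)          # g_i = [Sigma^{-1} (y - mu)]_i
    sigma_bar = np.diag(Sigma_inv)    # sigma_bar_ii = [Sigma^{-1}]_ii
    mu_tilde = y - g / sigma_bar      # Eq. (8)
    sigma_tilde = 1.0 / sigma_bar     # Eq. (9)
    return (-0.5 * np.log(2 * np.pi * sigma_tilde)
            - 0.5 * (y - mu_tilde) ** 2 / sigma_tilde)

Comparing its output with the brute-force sketch above on small simulated examples is a convenient sanity check when implementing Proposition 1 for a concrete model.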
Non-factorized Student-t models
Several generalizations of the multivariate normal distribution have been suggested, perhaps most notably the multivariate Student-\(t\) distribution (Zellner 1976), which has an additional positive degrees of freedom parameter \(\nu \) that controls the tails of the distribution. If \(\nu \) is small, the tails are much fatter than those of the normal distribution. If \(\nu \) is large, the multivariate Student-\(t\) distribution becomes more similar to the corresponding multivariate normal distribution and is equal to the latter for \(\nu \rightarrow \infty \). As \(\nu \) can be estimated alongside the other model parameters in Student-\(t\) models, the thickness of the tails is flexibly adjusted based on information from the observed response values and the prior. The (multivariate) Student-\(t\) distribution has been studied in various places (e.g., Zellner 1976; O’Hagan 1979; Fernández and Steel 1999; Zhang and Yeung 2010; Piché et al. 2012; Shah et al. 2014). For example, Student-\(t\) processes which are based on the multivariate Student-\(t\) distribution constitute a generalization of Gaussian processes while retaining most of the latter’s favorable properties (Shah et al. 2014).
The density of the \(N\)-dimensional multivariate Student-\(t\) distribution of the vector \(y\) is given by
$$\begin{aligned} p(y | \nu , \mu , \Sigma ) = \frac{\Gamma ((\nu + N) / 2)}{\Gamma (\nu / 2)} \frac{1}{\sqrt{(\nu \pi )^N |\Sigma |}} \left( 1 + \frac{1}{\nu } (y - \mu )^{\mathrm{T}} \Sigma ^{-1} (y - \mu ) \right) ^{-(\nu + N)/2} \end{aligned}$$
(10)
with degrees of freedom \(\nu \), location vector \(\mu \), and scale matrix \(\Sigma \). The mean of \(y\) is \(\mu \) if \(\nu > 1\), and the covariance matrix of \(y\) is \(\frac{\nu }{\nu -2}\Sigma \) if \(\nu > 2\). Similar to the multivariate normal case, the conditional distribution of the \(i\)th observation given all other observations and the model parameters, \(p(y_i | y_{-i}, \theta )\), can be computed analytically.
Proposition 2
If y is multivariate Student-t with degrees of freedom \(\nu \), location vector \(\mu \), and scale matrix \(\Sigma \), then the conditional distribution of \(y_i\) given \(y_{-i}\) for any observation i is univariate Student-t with parameters
$$\begin{aligned} {\tilde{\nu }}_i= & {} \nu + N - 1, \end{aligned}$$
(11)
$$\begin{aligned} {\tilde{\mu }}_{i}= & {} \mu _i + \sigma _{i,-i} \Sigma ^{-1}_{-i}(y_{-i} - \mu _{-i}), \end{aligned}$$
(12)
$$\begin{aligned} {\tilde{\sigma }}_{i}= & {} \frac{\nu + \beta _{-i}}{\nu + N - 1} \left( \sigma _{ii} - \sigma _{i,-i} \Sigma ^{-1}_{-i} \sigma _{-i,i} \right) , \end{aligned}$$
(13)
where
$$\begin{aligned} \beta _{-i} = (y_{-i} - \mu _{-i})^{\mathrm{T}} \Sigma ^{-1}_{-i} (y_{-i} - \mu _{-i}). \end{aligned}$$
(14)
A proof based on results of Shah et al. (2014) is given in the Appendix. Here \({\tilde{\mu }}_{i}\) is the same as in the normal case and \({\tilde{\sigma }}_{i}\) is the same up to the correction factor \(\frac{\nu + \beta _{-i}}{\nu + N - 1}\), which approaches \(1\) for \(\nu \rightarrow \infty \) as one would expect. Based on the above equations, we can compute the pointwise log-likelihood values in the Student-\(t\) case as
$$\begin{aligned} \log p(y_i \,|\, y_{-i},\theta )&= \log (\Gamma (({\tilde{\nu }}_i + 1) / 2)) - \log (\Gamma ({\tilde{\nu }}_i / 2)) - \frac{1}{2}\log ({\tilde{\nu }}_i \pi {\tilde{\sigma }}_{i} ) \nonumber \\&\quad - \frac{{\tilde{\nu }}_i + 1}{2} \log \left( 1 + \frac{1}{{\tilde{\nu }}_i} \frac{(y_i-{\tilde{\mu }}_{i})^2}{{\tilde{\sigma }}_{i}} \right) . \end{aligned}$$
(15)
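For reference, this brute-force computation can be sketched in Python as follows (a sketch only, assuming dense numpy arrays; the function name is ours and the function would be called once per posterior draw):

import numpy as np
from scipy.special import gammaln

def loo_loglik_student_brute_force(y, nu, mu, Sigma):
    """Pointwise log p(y_i | y_{-i}, theta) via Eqs. (11)-(15):
    Sigma_{-i} is inverted separately for each i."""
    N = len(y)
    loglik = np.empty(N)
    for i in range(N):
        keep = np.delete(np.arange(N), i)
        Sigma_mi_inv = np.linalg.inv(Sigma[np.ix_(keep, keep)])   # Sigma_{-i}^{-1}
        r = y[keep] - mu[keep]
        sigma_i_mi = Sigma[i, keep]                               # sigma_{i,-i}
        nu_tilde = nu + N - 1                                     # Eq. (11)
        mu_tilde = mu[i] + sigma_i_mi @ Sigma_mi_inv @ r          # Eq. (12)
        beta_mi = r @ Sigma_mi_inv @ r                            # Eq. (14)
        sigma_tilde = ((nu + beta_mi) / (nu + N - 1)
                       * (Sigma[i, i] - sigma_i_mi @ Sigma_mi_inv @ Sigma[keep, i]))  # Eq. (13)
        loglik[i] = (gammaln((nu_tilde + 1) / 2) - gammaln(nu_tilde / 2)
                     - 0.5 * np.log(nu_tilde * np.pi * sigma_tilde)
                     - (nu_tilde + 1) / 2
                     * np.log1p((y[i] - mu_tilde) ** 2 / (nu_tilde * sigma_tilde)))   # Eq. (15)
    return loglik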
This approach has the same overall computational cost of \(O(S N^{k+1})\) as the non-optimized normal case and is therefore quite inefficient. Fortunately, the efficiency can again be improved.
Proposition 3
If y is multivariate Student-t with degrees of freedom \(\nu \), location vector \(\mu \), and scale matrix \(\Sigma \), then the conditional location and scale of \(y_i\) given \(y_{-i}\) for any observation i can be computed as
$$\begin{aligned} {\tilde{\mu }}_{i}= & {} y_i - \frac{g_i}{{\bar{\sigma }}_{ii}}, \end{aligned}$$
(16)
$$\begin{aligned} {\tilde{\sigma }}_{i}= & {} \frac{\nu + \beta _{-i}}{\nu + N - 1} \frac{1}{{\bar{\sigma }}_{ii}}, \end{aligned}$$
(17)
with
$$\begin{aligned} \beta _{-i} = (y_{-i} - \mu _{-i})^{\mathrm{T}} \left( \left[ \Sigma ^{-1}\right] _{-i,-i} - \frac{{\bar{\sigma }}_{-i,i} {\bar{\sigma }}_{-i,i}^{\mathrm{T}}}{{\bar{\sigma }}_{ii}} \right) (y_{-i} - \mu _{-i}), \end{aligned}$$
(18)
where \(g_i = \left[ \Sigma ^{-1} (y - \mu )\right] _i\), \({\bar{\sigma }}_{ii} = \left[ \Sigma ^{-1}\right] _{ii}\), \({\bar{\sigma }}_{-i,i} = \left[ \Sigma ^{-1}\right] _{-i,i}\) is the \(i\)th column vector of \(\Sigma ^{-1}\) without the \(i\)th element, and \(\left[ \Sigma ^{-1}\right] _{-i,-i}\) is \(\Sigma ^{-1}\) without its \(i\)th row and column.
The proof is provided in the Appendix. After inverting \(\Sigma \), computing \(\beta _{-i}\) for a single \(i\) is an \(O(N^2)\) operation, which needs to be repeated for each observation, so the cost of computing \(\beta _{-i}\) for all observations is \(O(N^3)\). The cost of inverting \(\Sigma \) remains \(O(N^k)\), and the overall cost is therefore dominated by \(O(N^3)\), or \(O(S N^3)\) if \(\Sigma \) depends on the model parameters \(\theta \) in a non-trivial manner. Unlike in the normal case, we cannot reduce the computational cost below \(O(S N^3)\) even if the model is parameterized directly in terms of \(\Sigma ^{-1} = \Sigma ^{-1}(\theta )\) and thus avoids matrix inversion altogether. However, this is still substantially more efficient than the brute force approach, which requires \(O(S N^{k+1}) > O(S N^3)\) operations.
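The following Python sketch implements Proposition 3 (a sketch only, assuming dense numpy arrays; \(\Sigma ^{-1}\) is passed directly and reused across observations, and the function name is ours):

import numpy as np
from scipy.special import gammaln

def loo_loglik_student(y, nu, mu, Sigma_inv):
    """Pointwise log p(y_i | y_{-i}, theta) via Eqs. (16)-(18):
    a single Sigma^{-1} is reused; each beta_{-i} costs O(N^2)."""
    N = len(y)
    r = y - mu
    g = Sigma_inv @ r                  # g_i = [Sigma^{-1} (y - mu)]_i
    sigma_bar = np.diag(Sigma_inv)     # sigma_bar_ii = [Sigma^{-1}]_ii
    nu_tilde = nu + N - 1
    mu_tilde = y - g / sigma_bar       # Eq. (16)
    loglik = np.empty(N)
    for i in range(N):
        keep = np.delete(np.arange(N), i)
        s = Sigma_inv[keep, i]         # sigma_bar_{-i,i}
        # [Sigma^{-1}]_{-i,-i} - sigma_bar_{-i,i} sigma_bar_{-i,i}^T / sigma_bar_ii
        M = Sigma_inv[np.ix_(keep, keep)] - np.outer(s, s) / sigma_bar[i]
        beta_mi = r[keep] @ M @ r[keep]                               # Eq. (18)
        sigma_tilde = (nu + beta_mi) / (nu + N - 1) / sigma_bar[i]    # Eq. (17)
        loglik[i] = (gammaln((nu_tilde + 1) / 2) - gammaln(nu_tilde / 2)
                     - 0.5 * np.log(nu_tilde * np.pi * sigma_tilde)
                     - (nu_tilde + 1) / 2
                     * np.log1p((y[i] - mu_tilde[i]) ** 2 / (nu_tilde * sigma_tilde)))
    return loglik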
Example: lagged SAR models
It often requires additional work to take a given multivariate normal or Student-\(t\) model and express it in the form required to apply the equations for the predictive mean and standard deviation. Consider, for example, the lagged simultaneous autoregressive (SAR) model (Cressie 1992; Haining and Haining 2003; LeSage and Pace 2009), a spatial model with many applications in both the social sciences (e.g., economics) and natural sciences (e.g., ecology). The model is given by
$$\begin{aligned} y = \rho W y + \eta + \epsilon , \end{aligned}$$
(19)
or equivalently
$$\begin{aligned} (I - \rho W) y = \eta + \epsilon , \end{aligned}$$
(20)
where \(\rho \) is a scalar spatial correlation parameter and \(W\) is a user-defined matrix of weights. The diagonal entries of \(W\) are zero (\(w_{ii} = 0\)), and the off-diagonal entries \(w_{ij}\) are larger the closer units \(i\) and \(j\) are to each other, although most of them are zero as well, so that \(W\) is typically sparse. In a linear model, the predictor term is \(\eta = X \beta \), with design matrix \(X\) and regression coefficients \(\beta \), but the definition of the lagged SAR model holds for arbitrary \(\eta \), so these results are not restricted to the linear case. See LeSage and Pace (2009), Sect. 2.3, for a more detailed introduction to SAR models. A general discussion of predictions in SAR models from a frequentist perspective can be found in Goulard et al. (2017).
If we have \(\epsilon \sim \mathrm {N}(0, \sigma ^2 I)\), with residual variance \(\sigma ^2\) and identity matrix \(I\) of dimension \(N\), it follows that
$$\begin{aligned} (I - \rho W) y \sim \mathrm {N}(\eta , \sigma ^2 I), \end{aligned}$$
(21)
but this standard way of expressing the model is not compatible with the requirements of Proposition 1. To make the lagged SAR model reconcilable with this proposition we need to rewrite it as follows (conditional on \(\rho \), \(\eta \), and \(\sigma ^2\)):
$$\begin{aligned} y \sim \mathrm {N}\left( (I - \rho W)^{-1} \eta ,\, \sigma ^2 (I - \rho W)^{-1} (I - \rho W)^{-\mathrm{T}} \right) , \end{aligned}$$
(22)
or more compactly, with \({\widetilde{W}} = (I - \rho W)\),
$$\begin{aligned} y \sim \mathrm {N}\left( {\widetilde{W}}^{-1} \eta ,\, \sigma ^2 ({\widetilde{W}}^{\mathrm{T}} {\widetilde{W}})^{-1} \right) . \end{aligned}$$
(23)
Written in this way, the lagged SAR model has the required form (4). Accordingly, we can compute the leave-one-out predictive densities with Eqs. (8) and (9), replacing \(\mu \) with \({\widetilde{W}}^{-1} \eta \) and taking the covariance matrix \(\Sigma \) to be \(\sigma ^2 ({\widetilde{W}}^{\mathrm{T}} {\widetilde{W}})^{-1}\). This implies \(\Sigma ^{-1} = \sigma ^{-2} {\widetilde{W}}^{\mathrm{T}} {\widetilde{W}}\), so the overall computational cost is dominated by computing \({\widetilde{W}}^{-1} \eta \). In SAR models, \(W\) is usually sparse and so is \({\widetilde{W}}\). Thus, if sparse matrix operations are used, the computational cost of obtaining \(\Sigma ^{-1}\) is less than \(O(N^2)\) and that of computing \({\widetilde{W}}^{-1} \eta \) is less than \(O(N^3)\), depending on the number of non-zeros and the fill pattern. Since \({\widetilde{W}}\) depends on the parameter \(\rho \) in a non-trivial manner, \({\widetilde{W}}^{-1} \eta \) needs to be computed for each posterior draw, which implies an overall computational cost of less than \(O(S N^3)\).
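For instance, a Python sketch of this computation for the normal lagged SAR model (assuming \(W\) is available as a scipy sparse matrix, \(\rho \), \(\eta \), and \(\sigma ^2\) are taken from a single posterior draw, and the function name is ours) is:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def loo_loglik_sar_normal(y, rho, W, eta, sigma2):
    """Apply Proposition 1 to the lagged SAR model (23) for one posterior draw."""
    N = len(y)
    W_tilde = sp.identity(N, format="csc") - rho * W      # I - rho W (sparse)
    mu = spsolve(W_tilde, eta)                             # W_tilde^{-1} eta
    # Sigma^{-1} = sigma^{-2} W_tilde^T W_tilde
    # (a fully sparse implementation could avoid this dense conversion)
    Sigma_inv = (W_tilde.T @ W_tilde).toarray() / sigma2
    g = Sigma_inv @ (y - mu)
    sigma_bar = np.diag(Sigma_inv)
    mu_tilde = y - g / sigma_bar                           # Eq. (8)
    sigma_tilde = 1.0 / sigma_bar                          # Eq. (9)
    return (-0.5 * np.log(2 * np.pi * sigma_tilde)
            - 0.5 * (y - mu_tilde) ** 2 / sigma_tilde)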
If the residuals are Student-\(t\) distributed, we can apply analogous transformations as above to arrive at the Student-\(t\) distribution for the responses
$$\begin{aligned} y \sim \mathrm {t}\left( \nu ,\, {\widetilde{W}}^{-1} \eta ,\, \sigma ^2 ({\widetilde{W}}^{\mathrm{T}} {\widetilde{W}})^{-1} \right) , \end{aligned}$$
(24)
with computational cost dominated by the computation of the \(\beta _{-i}\) from Eq. (18), which is in \(O(S N^3)\).
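The Student-\(t\) case can reuse the sketch of Proposition 3 given above; a hypothetical usage, with all quantities (\(y\), \(\nu \), \(\rho \), \(W\), \(\eta \), \(\sigma ^2\)) taken from a single posterior draw, would be:

import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

# hypothetical usage: combine the SAR construction above with the
# Proposition 3 sketch loo_loglik_student() for the Student-t SAR model (24)
W_tilde = sp.identity(len(y), format="csc") - rho * W
mu = spsolve(W_tilde, eta)
Sigma_inv = (W_tilde.T @ W_tilde).toarray() / sigma2
loglik_i = loo_loglik_student(y, nu, mu, Sigma_inv)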
Studying leave-one-out predictive densities in SAR models is related to considering impact measures, that is, measures that quantify how changes in the predictor variables of a given observation \(i\) affect the responses of other observations \(j \ne i\) as well as the obtained parameter estimates (see LeSage and Pace 2009, Sect. 2.7). A detailed discussion of this topic is beyond the scope of the present paper.