Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC
Abstract
Leave-one-out cross-validation (LOO) and the widely applicable information criterion (WAIC) are methods for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values. LOO and WAIC have various advantages over simpler estimates of predictive error such as AIC and DIC but are less used in practice because they involve additional computational steps. Here we lay out fast and stable computations for LOO and WAIC that can be performed using existing simulation draws. We introduce an efficient computation of LOO using Pareto-smoothed importance sampling (PSIS), a new procedure for regularizing importance weights. Although WAIC is asymptotically equal to LOO, we demonstrate that PSIS-LOO is more robust in the finite case with weak priors or influential observations. As a byproduct of our calculations, we also obtain approximate standard errors for estimated predictive errors and for comparison of predictive errors between two models. We implement the computations in an R package called loo and demonstrate using models fit with the Bayesian inference package Stan.
Keywords
Bayesian computation · Leave-one-out cross-validation (LOO) · K-fold cross-validation · Widely applicable information criterion (WAIC) · Stan · Pareto smoothed importance sampling (PSIS)

1 Introduction
After fitting a Bayesian model we often want to measure its predictive accuracy, for its own sake or for purposes of model comparison, selection, or averaging (Geisser and Eddy 1979; Hoeting et al. 1999; Vehtari and Lampinen 2002; Ando and Tsay 2010; Vehtari and Ojanen 2012). Cross-validation and information criteria are two approaches to estimating out-of-sample predictive accuracy using within-sample fits (Akaike 1973; Stone 1977). In this article we consider computations using the log-likelihood evaluated at the usual posterior simulations of the parameters. Computation time for the predictive accuracy measures should be negligible compared to the cost of fitting the model and obtaining posterior draws in the first place.
Exact cross-validation requires refitting the model with different training sets. Approximate leave-one-out cross-validation (LOO) can be computed easily using importance sampling (IS; Gelfand et al. 1992; Gelfand 1996) but the resulting estimate is noisy, as the variance of the importance weights can be large or even infinite (Peruggia 1997; Epifani et al. 2008). Here we propose to use Pareto smoothed importance sampling (PSIS), a new approach that provides a more accurate and reliable estimate by fitting a Pareto distribution to the upper tail of the distribution of the importance weights. PSIS allows us to compute LOO using importance weights that would otherwise be unstable.
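As a concrete reference point for the instability discussed above, the raw IS-LOO estimate can be written directly in terms of the pointwise log-likelihood draws: with importance ratios \(r_s = 1/p(y_i\mid \theta ^s)\), the self-normalized importance sampling estimate of \(p(y_i\mid y_{-i})\) reduces to a harmonic mean over the posterior draws. A minimal sketch in Python (the loo package itself is written in R; the function name is ours):

```python
import numpy as np

def is_loo_elpd(log_lik):
    """Raw IS-LOO from an S x n matrix of values log p(y_i | theta^s).

    With ratios r_s = 1/p(y_i | theta^s), the self-normalized estimate
    of p(y_i | y_{-i}) is the harmonic mean
    [ (1/S) sum_s 1/p(y_i | theta^s) ]^{-1}, computed here in log space
    for numerical stability.
    """
    S = log_lik.shape[0]
    # log harmonic mean: log S - logsumexp(-log_lik) over the draws
    loo_i = np.log(S) - np.logaddexp.reduce(-log_lik, axis=0)
    return loo_i.sum(), loo_i
```

The heavy right tail of the ratios \(r_s\) is exactly what makes this raw estimate noisy, motivating the smoothing described next.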
The widely applicable or Watanabe-Akaike information criterion (WAIC; Watanabe 2010) can be viewed as an improvement on the deviance information criterion (DIC) for Bayesian models. DIC has gained popularity in recent years, in part through its implementation in the graphical modeling package BUGS (Spiegelhalter et al. 2002; Spiegelhalter et al. 1994, 2003), but it is known to have some problems, which arise in part from not being fully Bayesian in that it is based on a point estimate (van der Linde 2005; Plummer 2008). For example, DIC can produce negative estimates of the effective number of parameters in a model and it is not defined for singular models. WAIC is fully Bayesian in that it uses the entire posterior distribution, and it is asymptotically equal to Bayesian cross-validation. Unlike DIC, WAIC is invariant to parametrization and also works for singular models.
Although WAIC is asymptotically equal to LOO, we demonstrate that PSIS-LOO is more robust in the finite case with weak priors or influential observations. We provide diagnostics for both PSIS-LOO and WAIC that indicate when these approximations are likely to have large errors and when more computationally intensive methods such as K-fold cross-validation should be used instead. Fast and stable computation and diagnostics for PSIS-LOO allow the safe use of this new method in routine statistical practice. As a byproduct of our calculations, we also obtain approximate standard errors for estimated predictive errors and for the comparison of predictive errors between two models.
We implement the computations in a package for R (R Core Team 2016) called loo (Vehtari et al. 2016a, b) and demonstrate using models fit with the Bayesian inference package Stan (Stan Development Team 2016a, b).^{1} All the computations are fast compared to the typical time required to fit the model in the first place. Although the examples provided in this paper all use Stan, the loo package is independent of Stan and can be used with models estimated by other software packages or custom user-written algorithms.
2 Estimating out-of-sample pointwise predictive accuracy using posterior simulations
Instead of the log predictive density \(\log p(\tilde{y}_i\mid y)\), other utility (or cost) functions \(u(p(\tilde{y}\mid y),\tilde{y})\) could be used, such as classification error. Here we take the log score as the default for evaluating the predictive density (Geisser and Eddy 1979; Bernardo and Smith 1994; Gneiting and Raftery 2007).
2.1 Leave-one-out cross-validation
2.1.1 Raw importance sampling
For simple models the variance of the importance weights may be computed analytically. The necessary and sufficient conditions for the variance of the case-deletion importance sampling weights to be finite for a Bayesian linear model are given by Peruggia (1997). Epifani et al. (2008) extend the analytical results to generalized linear models and nonlinear Michaelis-Menten models. However, these conditions cannot be computed analytically in general.
Koopman et al. (2009) propose to use the maximum likelihood fit of the generalized Pareto distribution to the upper tail of the distribution of the importance ratios and use the fitted parameters to form a test for whether the variance of the importance ratios is finite. If the hypothesis test suggests the variance is infinite then they abandon importance sampling.
2.1.2 Truncated importance sampling
2.1.3 Pareto smoothed importance sampling
We can improve the LOO estimate using Pareto smoothed importance sampling (PSIS; Vehtari and Gelman 2015), which applies a smoothing procedure to the importance weights. We briefly review the motivation and steps of PSIS here, before moving on to focus on the goals of using and evaluating predictive information criteria.
As noted above, the distribution of the importance weights used in LOO may have a long right tail. We use the empirical Bayes estimate of Zhang and Stephens (2009) to fit a generalized Pareto distribution to the tail (the 20% largest importance ratios). By examining the shape parameter k of the fitted Pareto distribution, we are able to obtain sample-based estimates of the existence of the moments (Koopman et al. 2009). This extends the diagnostic approach of Peruggia (1997) and Epifani et al. (2008) to be used routinely with IS-LOO for any model with a factorizing likelihood.
Epifani et al. (2008) show that when estimating the leaveoneout predictive density, the central limit theorem holds if the distribution of the weights has finite variance. These results can be extended via the generalized central limit theorem for stable distributions. Thus, even if the variance of the importance weight distribution is infinite, if the mean exists then the accuracy of the estimate improves as additional posterior draws are obtained.
 1. Fit the generalized Pareto distribution to the 20% largest importance ratios \(r_s\) as computed in (6). The computation is done separately for each held-out data point i. In simulation experiments with thousands and tens of thousands of draws, we have found that the fit is not sensitive to the specific cutoff value (for consistent estimation, the proportion of the samples above the cutoff should get smaller as the number of draws increases).

 2. Stabilize the importance ratios by replacing the M largest ratios by the expected values of the order statistics of the fitted generalized Pareto distribution,
$$\begin{aligned} F^{-1}\left( \frac{z-1/2}{M}\right) , \quad z=1,\ldots ,M, \end{aligned}$$
where M is the number of simulation draws used to fit the Pareto (in this case, \(M=0.2\,S\)) and \(F^{-1}\) is the inverse-CDF of the generalized Pareto distribution. Label these new weights as \(\tilde{w}_i^s\) where, again, s indexes the simulation draws and i indexes the data points; thus, for each i there is a distinct vector of S weights.

 3. To guarantee finite variance of the estimate, truncate each vector of weights at \(S^{3/4}\bar{w}_i\), where \(\bar{w}_i\) is the average of the S smoothed weights corresponding to the distribution holding out data point i. Finally, label these truncated weights as \(w^s_i\).

The estimated shape parameter k of the fitted generalized Pareto distribution indicates the reliability of the resulting estimate:

- If \(k<\frac{1}{2}\), the variance of the raw importance ratios is finite, the central limit theorem holds, and the estimate converges quickly.
- If k is between \(\frac{1}{2}\) and 1, the variance of the raw importance ratios is infinite but the mean exists, the generalized central limit theorem for stable distributions holds, and the convergence of the estimate is slower. The variance of the PSIS estimate is finite but may be large.
- If \(k>1\), the variance and the mean of the raw ratios distribution do not exist. The variance of the PSIS estimate is finite but may be large.
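The three smoothing steps and the tail-shape diagnostic can be sketched as follows. This is an illustrative Python stand-in, not the loo package implementation: it uses scipy's maximum likelihood generalized Pareto fit in place of the Zhang and Stephens (2009) empirical Bayes estimate, and all names are ours.

```python
import numpy as np
from scipy import stats

def psis_smooth(ratios, tail_frac=0.2):
    """Sketch of the three PSIS steps for one held-out point i.

    ratios: raw importance ratios r_s, s = 1..S.
    Returns the smoothed, truncated weights and the fitted tail
    shape k-hat used as the reliability diagnostic.
    """
    r = np.asarray(ratios, dtype=float)
    S = r.size
    M = int(np.ceil(tail_frac * S))      # draws in the tail (M = 0.2 S)
    order = np.argsort(r)
    tail_idx = order[-M:]                # indices of the M largest ratios
    cutoff = r[order[-M - 1]]            # largest ratio below the tail

    # Step 1: fit a generalized Pareto distribution to the exceedances
    # over the cutoff (plain ML fit here, as a stand-in).
    k, _, sigma = stats.genpareto.fit(r[tail_idx] - cutoff, floc=0.0)

    # Step 2: replace the M largest ratios by the expected order
    # statistics F^{-1}((z - 1/2)/M), z = 1..M, of the fitted GPD.
    z = (np.arange(1, M + 1) - 0.5) / M
    w = r.copy()
    w[tail_idx] = cutoff + stats.genpareto.ppf(z, k, loc=0.0, scale=sigma)

    # Step 3: truncate at S^{3/4} * mean(w) to guarantee finite variance.
    w = np.minimum(w, S ** 0.75 * w.mean())
    return w, k
```

In practice the fitted k-hat is inspected per data point, with large values flagging observations for which the full posterior is a poor importance sampling proposal.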
The additional computational cost of sampling directly from each \(p(\theta ^s\mid y_{-i})\) is approximately the same as sampling from the full posterior, but it is recommended if the number of problematic data points is not too high.
A more robust model may also help because importance sampling is less likely to work well if the marginal posterior \(p(\theta ^s\mid y)\) and LOO posterior \(p(\theta ^s\mid y_{-i})\) are very different. This is more likely to happen with a non-robust model and highly influential observations. A robust model may reduce the sensitivity to one or several highly influential observations, as we show in the examples in Sect. 4.
2.2 WAIC
The effective number of parameters \(\widehat{p}_\mathrm{waic}\) can be used as a measure of the complexity of the model, but it should not be overinterpreted, as the original goal is to estimate the difference between lpd and elpd. As shown by Gelman et al. (2014) and demonstrated also in Sect. 4, in the case of a weak prior, \(\widehat{p}_\mathrm{waic}\) can severely underestimate the difference between lpd and elpd. For \(\widehat{p}_\mathrm{waic}\) there is no theory similar to that for the moments of the importance sampling weight distribution, but based on our simulation experiments it seems that \(\widehat{p}_\mathrm{waic}\) is unreliable if any of the terms \(V_{s=1}^S \log p(y_i\mid \theta ^s)\) exceeds 0.4.
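For reference, \(\widehat{p}_\mathrm{waic}\) and the pointwise variance terms used in the 0.4 diagnostic can be computed from an S x n matrix of log-likelihood draws. A minimal Python sketch (illustrative; the loo package is the reference implementation):

```python
import numpy as np

def waic(log_lik):
    """WAIC from an S x n matrix of values log p(y_i | theta^s).

    Returns elpd_waic, p_waic, and the pointwise posterior variances
    V_{s=1}^S log p(y_i | theta^s) used in the > 0.4 diagnostic.
    """
    S, n = log_lik.shape
    # lpd_i: log of the posterior-mean predictive density, stably in log space
    lpd = np.logaddexp.reduce(log_lik, axis=0) - np.log(S)
    # p_waic_i: posterior sample variance of the log-likelihood terms
    v = log_lik.var(axis=0, ddof=1)
    elpd_waic = (lpd - v).sum()
    return elpd_waic, v.sum(), v
```

Any element of the returned variance vector exceeding roughly 0.4 flags the WAIC estimate as unreliable for that observation.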
2.3 Kfold crossvalidation
In this paper we focus on leave-one-out cross-validation and WAIC, but, for statistical and computational reasons, it can make sense to cross-validate using \(K \ll n\) hold-out sets. In some ways, K-fold cross-validation is simpler than leave-one-out cross-validation, but in other ways it is not. K-fold cross-validation requires refitting the model K times, which can be computationally expensive, whereas approximate LOO methods, such as PSIS-LOO, require fitting the model only once.
If in PSIS-LOO \({\hat{k}}>0.7\) for a few i, we recommend sampling directly from each corresponding \(p(\theta ^s\mid y_{-i})\), but if there are more than K problematic i, then we recommend checking the results using K-fold cross-validation. Vehtari and Lampinen (2002) demonstrate cases where IS-LOO fails (according to effective sample size estimates instead of the \({\hat{k}}\) diagnostic proposed here) for a large number of i and K-fold-CV produces more reliable results.
In most cases we recommend partitioning the data into subsets by randomly permuting the observations and then systematically dividing them into K subgroups. If the subjects are exchangeable, that is, the order does not contain information, then there is no need for random selection, but if the order does contain information, e.g. in survival studies where later patients have shorter follow-ups, then randomization is useful. In some cases it may be useful to stratify to obtain better balance among groups. See Vehtari and Lampinen (2002), Arlot and Celisse (2010), and Vehtari and Ojanen (2012) for further discussion of these points.
Because the data can be divided into K groups in many different ways, K-fold-CV introduces additional variance into the estimates, which is also evident in our experiments. This variance can be reduced by repeating K-fold-CV several times with different permutations in the data division, but this will further increase the computational cost.
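The recommended division, randomly permuting the observations and then systematically assigning them to K subgroups, takes only a few lines. An illustrative Python sketch (the function name is ours):

```python
import numpy as np

def kfold_indices(n, K=10, seed=None):
    """Randomly permute n observations and divide them systematically
    into K roughly equal held-out subsets, returned as index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    # taking every K-th element of the permutation yields balanced folds
    return [np.sort(perm[k::K]) for k in range(K)]
```

Stratified or repeated divisions, as discussed above, can be layered on top of this basic scheme.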
2.4 Data division
The purpose of using LOO or WAIC is to estimate the accuracy of the predictive distribution \(p(\tilde{y}_i\mid y)\). Computation of PSIS-LOO and WAIC (and AIC and DIC) is based on computing the terms \(\log p(y_i\mid y)=\log \int p(y_i\mid \theta )p(\theta \mid y)d\theta \) assuming some agreed-upon division of the data y into individual data points \(y_i\). Although often \(y_i\) will denote a single scalar observation, in the case of hierarchical data it may denote a group of observations. For example, in cognitive or medical studies we may be interested in prediction for a new subject (or patient), and thus it is natural in cross-validation to consider an approach where \(y_i\) would denote all observations for a single subject and \(y_{-i}\) would denote the observations for all the other subjects. In theory, we can use PSIS-LOO and WAIC in this case, too, but as the number of observations per subject increases they are less likely to work well. The fact that importance sampling is difficult in higher dimensions is well known and is demonstrated for IS-LOO by Vehtari and Lampinen (2002) and for PSIS by Vehtari and Gelman (2015). The same problem can also be shown to hold for WAIC. If diagnostics warn about the reliability of PSIS-LOO (or WAIC), then K-fold cross-validation can be used, taking the hierarchical structure into account when dividing the data, as demonstrated, for example, by Vehtari and Lampinen (2002).
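When \(y_i\) denotes all observations for one subject, the K-fold division should keep each subject's observations together so that held-out predictions mimic prediction for a new subject. A hedged Python sketch of such a grouped division (names are ours):

```python
import numpy as np

def grouped_kfold(group_ids, K=10, seed=None):
    """Assign whole groups (e.g. subjects) to folds, so that all
    observations from one subject are held out together.

    group_ids: per-observation group labels.
    Returns a per-observation fold number in 0..K-1.
    """
    group_ids = np.asarray(group_ids)
    groups = np.unique(group_ids)
    rng = np.random.default_rng(seed)
    # randomly order the groups, then deal them out round-robin to folds
    perm = rng.permutation(len(groups))
    fold_of_group = {g: int(perm[i]) % K for i, g in enumerate(groups)}
    return np.array([fold_of_group[g] for g in group_ids])
```

Stratifying groups by size or covariates before assignment can further improve balance among folds.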
3 Implementation in Stan
We have set up code to implement LOO, WAIC, and K-fold cross-validation in R and Stan so that users will have a quick and convenient way to assess and compare model fits. Implementation is not automatic, though, because of the need to compute the separate factors \(p(y_i\mid \theta )\) in the likelihood. Stan works with the joint density and in its usual computations does not “know” which parts come from the prior and which from the likelihood. Nor does Stan in general make use of any factorization of the likelihood into pieces corresponding to each data point. Thus, to compute these measures of predictive fit in Stan, the user needs to explicitly code the factors of the likelihood (actually, the terms of the log-likelihood) as a vector. We can then pull apart the separate terms and compute cross-validation and WAIC at the end, after all simulations have been collected. Sample code for carrying out this procedure using Stan and the loo R package is provided in Appendix. This code can be adapted to apply our procedure in other computing languages.
Although the implementation is not automatic when writing custom Stan programs, we can create implementations that are automatic for users of our new rstanarm R package (Gabry and Goodrich 2016). rstanarm provides a high-level interface to Stan that enables the user to specify many of the most common applied Bayesian regression models using standard R modeling syntax (e.g. like that of glm). The models are then estimated using Stan’s algorithms and the results are returned to the user in a form similar to the fitted model objects to which R users are accustomed. For the models implemented in rstanarm, we have preprogrammed many tasks, including computing and saving the pointwise predictive measures and importance ratios which we use to compute WAIC and PSIS-LOO. The loo method for rstanarm models requires no additional programming from the user after fitting a model, as we can compute all of the needed quantities internally from the contents of the fitted model object and then pass them to the functions in the loo package. Examples of using loo with rstanarm can be found in the rstanarm vignettes, and we also provide an example in Appendix 3 of this paper.
4 Examples
We illustrate with six simple examples: two examples from our earlier research in computing the effective number of parameters in a hierarchical model, three examples that were used by Epifani et al. (2008) to illustrate the estimation of the variance of the weight distribution, and one example of a multilevel regression from our earlier applied research. For each example we used the Stan default of 4 chains run for 1000 warmup and 1000 post-warmup iterations, yielding a total of 4000 saved simulation draws. With Gibbs sampling or random-walk Metropolis, 4000 is not a large number of simulation draws. The algorithm used by Stan is Hamiltonian Monte Carlo with the No-U-Turn Sampler (Hoffman and Gelman 2014), which is much more efficient, and 1000 draws is already more than sufficient in many real-world settings. In these examples we followed standard practice and monitored convergence and effective sample sizes as recommended by Gelman et al. (2013). We performed 100 independent replications of all experiments to obtain estimates of variation. For the exact LOO results and convergence plots we ran longer chains to obtain a total of 100,000 draws (except for the radon example, which is much slower to run).
4.1 Example: Scaled 8 schools
Table 1: In a controlled study, independent randomized experiments were conducted in 8 different high schools to estimate the effect of special preparation for college admission tests
School  Estimated effect, \(y_j\)  Standard error of estimate, \(\sigma _j\) 

A  28  15 
B  8  10 
C  \(-\)3  16 
D  7  11 
E  \(-\)1  9 
F  1  11 
G  18  10 
H  12  18 
For our first example we take an analysis of an education experiment used by Gelman et al. (2014) to demonstrate the use of information criteria for hierarchical Bayesian models.
The goal of the study was to measure the effects of a test preparation program conducted in eight different high schools in New Jersey. A separate randomized experiment was conducted in each school, and the administrators of each school implemented the program in their own way. Rubin (1981) performed a Bayesian metaanalysis, partially pooling the eight estimates toward a common mean. The model has the form, \(y_i\sim \mathrm{N}(\theta _i,\sigma ^2_i)\) and \(\theta _i\sim \mathrm{N}(\mu ,\tau ^2)\), for \(i=1,\ldots ,n=8\), with a uniform prior distribution on \((\mu ,\tau )\). The measurements \(y_i\) and uncertainties \(\sigma _i\) are the estimates and standard errors from separate regressions performed for each school, as shown in Table 1. The test scores for the individual students are no longer available.
This model has eight parameters but they are constrained through their hierarchical distribution and are not estimated independently; thus we would anticipate the effective number of parameters should be some number between 1 and 8.
To better illustrate the behavior of LOO and WAIC, we repeat the analysis, rescaling the data points y by a factor ranging from 0.1 to 4 while keeping the standard errors \(\sigma \) unchanged. With a small data scaling factor the hierarchical model nears complete pooling and with a large data scaling factor the model approaches separate fits to the data for each school. Figure 1 shows \(\widehat{\text{ elpd }}\) for the various LOO approximation methods as a function of the scaling factor, based on 4000 simulation draws at each grid point.
In the case of exact LOO, \(\widehat{\mathrm{lpd}} - \widehat{\mathrm{elpd}}_\mathrm{loo}\) can be larger than p. As the prior for \(\theta _i\) approaches flatness, the log predictive density \(\log p_{\mathrm{post}(-i)}(y_i) \rightarrow -\infty \). At the same time, the full posterior becomes an inadequate approximation to \(p_{\mathrm{post}(-i)}\) and all approximations become poor approximations to the actual out-of-sample prediction error under the model.
WAIC starts to fail when one of the posterior variances of the log predictive densities exceeds 0.4. LOO approximations work well even if the tail shape k of the generalized Pareto distribution is between \(\frac{1}{2}\) and 1, and the variance of the raw importance ratios is infinite. The error of LOO approximations increases with k, with a clearer difference between the methods when \(k>0.7\).
4.2 Example: Simulated 8 schools
In the previous example, we used exact LOO as the gold standard. In this section, we generate simulated data from the same statistical model and compare predictive performance on independent test data. Even when the number of observations n is fixed, as the scale of the population distribution increases we observe the effect of weak prior information in hierarchical models discussed in the previous section and by Gelman et al. (2014). Comparing the error, bias and variance of the various approximations, we find that PSISLOO offers the best balance.
For \(i=1,\ldots ,n=8\), we simulate \(\theta _{0,i}\sim \mathrm{N}(\mu _0,\tau ^2_0)\) and \(y_i\sim \mathrm{N}(\theta _{0,i},\sigma ^2_{0,i})\), where we set \(\sigma _{0,i}=10,\mu _0=0\), and \(\tau _0\in \{1,2,\ldots ,30\}\). The simulated data is similar to the real 8 schools data, for which the empirical estimate is \(\hat{\tau }\approx 10\). For each value of \(\tau _0\) we generate 100 training sets of size 8 and one test data set of size 1000. Posterior inference is based on 4000 draws for each constructed model.
Figure 2a shows the root mean square error (RMSE) for the various LOO approximation methods as a function of \(\tau _0\), the scale of the population distribution. When \(\tau _0\) is large all of the approximations eventually have ever increasing RMSE, while exact LOO has an upper limit. For medium scales the approximations have smaller RMSE than exact LOO. As discussed later, this is explained by the difference in the variance of the estimates. For small scales WAIC has slightly smaller RMSE than the other methods (including exact LOO).
Watanabe (2010) shows that WAIC gives an asymptotically unbiased estimate of the out-of-sample prediction error (this does not hold for hierarchical models with weak prior information, as shown by Gelman et al. 2014), but exact LOO is slightly biased as the LOO posteriors use only \(n-1\) observations. WAIC’s different behavior can be understood through the truncated Taylor series correction to the lpd, that is, not using the entire series will bias it towards lpd (see Sect. 2.2). The bias in LOO is negligible when n is large, but with small n it can be larger.
Figure 2b shows RMSE for the bias-corrected LOO approximations using the first-order correction of Burman (1989). For small scales the error of the bias-corrected LOOs is smaller than that of WAIC. When the scale increases the RMSEs are close to those of the uncorrected versions. Although the bias correction is easy to compute, the difference in accuracy is negligible for most applications.
We shall discuss Fig. 2c in a moment, but first consider Fig. 3, which shows the RMSE of the approximation methods and the lpd of observed data decomposed into bias and standard deviation. All methods (except the lpd of observed data) have small biases and variances with small population distribution scales. Bias corrected exact LOO has practically zero bias for all scale values but the highest variance. When the scale increases the LOO approximations eventually fail and bias increases. As the approximations start to fail, there is a certain region where implicit shrinkage towards the lpd of observed data decelerates the increase in RMSE as the variance is reduced, even if the bias continues to grow.
If the goal were to minimize the RMSE for smaller and medium scales, we could also shrink exact LOO and increase shrinkage in approximations. Figure 2c shows the RMSE of the LOO approximations with two new choices. Truncated Importance Sampling LOO with very heavy truncation (to \(\root 4 \of {S}\bar{r}\)) closely matches the performance of WAIC. In the experiments not included here, we also observed that adding more correct Taylor series terms to WAIC will make it behave similar to Truncated Importance Sampling with less truncation (see discussion of Taylor series expansion in Sect. 2.2). Shrunk exact LOO (\(\alpha \cdot \mathrm{elpd}_\mathrm{loo} + (1\alpha )\cdot \text{ lpd }\), with \(\alpha =0.85\) chosen by hand for illustrative purposes only) has a smaller RMSE for small and medium scale values as the variance is reduced, but the price is increased bias at larger scale values.
4.3 Example: Linear regression for stack loss data
To check the performance of the proposed diagnostic we analyze the stack loss data used by Peruggia (1997), for which one of the importance weight distributions is analytically proven to have infinite variance.
High estimates of the tail shape parameter \(\hat{k}\) indicate that the full posterior is not a good importance sampling approximation to the desired leave-one-out posterior, and thus the observation is surprising according to the model. It is natural to consider an alternative model. We tried replacing the normal observation model with a Student-t to make the model more robust to the possible outlier. Figure 6 shows the distribution of the estimated tail shapes \({\hat{k}}\) and estimation errors for PSIS-LOO compared to LOO in 100 independent Stan runs for the Student-t linear regression model. The estimated tail shapes and the errors in computing this component of LOO are smaller than with the Gaussian model.
4.4 Example: Nonlinear regression for Puromycin reaction data
As a nonlinear regression example, we use the Puromycin biochemical reaction data also analyzed by Epifani et al. (2008). For a group of cells not treated with the drug Puromycin, there are \(n = 11\) measurements of the initial velocity of a reaction, \(V_i\), obtained when the concentration of the substrate was set at a given positive value, \(c_i\). The dependence of velocity on concentration is given by the Michaelis-Menten relation, \(V_i \sim \text{ N }(mc_i/(\kappa + c_i), \sigma ^2)\). Epifani et al. (2008) show that the raw importance ratios for observation \(i=1\) have infinite variance.
Figure 7 shows the distribution of the estimated tail shapes k and estimation errors compared to LOO in 100 independent Stan runs. The estimates of the tail shape k for \(i=1\) suggest that the variance of the raw importance ratios is infinite. However, the generalized central limit theorem for stable distributions still holds and we can get an accurate estimate of the corresponding term in LOO. We could obtain more draws to reduce the Monte Carlo error, or again consider a more robust model.
4.5 Example: Logistic regression for leukemia survival
Epifani et al. (2008) show that the raw importance ratios for data point \(i=15\) have infinite variance. Figure 8 shows the distribution of the estimated tail shapes k and estimation errors compared to LOO in 100 independent Stan runs. The estimates of the tail shape k for \(i=15\) suggest that the mean and variance of the raw importance ratios do not exist. Thus the generalized central limit theorem does not hold.
Figure 9 shows that if we continue sampling, the tail shape estimate stays above 1 and \(\widehat{\mathrm{elpd}}_i\) will not converge.
Large estimates for the tail shape parameter indicate that the full posterior is not a good importance sampling approximation for the desired leaveoneout posterior, and thus the observation is surprising. The original model used the white blood cell count directly as a predictor, and it would be natural to use its logarithm instead. Figure 10 shows the distribution of the estimated tail shapes k and estimation errors compared to LOO in 100 independent Stan runs for this modified model. Both the tail shape values and errors are now smaller.
4.6 Example: Multilevel regression for radon contamination
Gelman and Hill (2007) describe a study conducted by the United States Environmental Protection Agency designed to measure levels of the carcinogen radon in houses throughout the United States. In high concentrations radon is known to cause lung cancer and is estimated to be responsible for several thousands of deaths every year in the United States. Here we focus on the sample of 919 houses in the state of Minnesota, which are distributed (unevenly) throughout 85 counties.
The sample size in this example \((n=919)\) is not huge but is large enough that it is important to have a computational method for LOO that is fast for each data point. Although the MCMC for the full posterior inference (using four parallel chains) finished in only 93 s, the computations for exact brute-force LOO require fitting the model 919 times and took more than 20 h to complete (MacBook Pro, 2.6 GHz Intel Core i7). With the same hardware the PSIS-LOO computations took less than 5 s.
Table 2: Root mean square error for different computations of LOO as determined from a simulation study, in each case based on running Stan to obtain 4000 posterior draws and repeating 100 times
Method  8 schools  Stacks-N  Stacks-t  Puromycin  Leukemia  Leukemia-log  Radon 

PSIS-LOO  0.21  0.21  0.12  0.20  1.33  0.18  0.34 
IS-LOO  0.28  0.37  0.12  0.28  1.43  0.21  0.39 
TIS-LOO  0.19  0.37  0.12  0.27  1.80  0.18  0.36 
WAIC  0.40  0.68  0.12  0.46  2.30  0.29  1.30 
PSIS-LOO+  0.21  0.11  0.12  0.10  0.11  0.18  0.34 
10-fold-CV  −  1.34  1.01  −  1.62  1.40  2.87 
\(10\times 10\)-fold-CV  −  0.46  0.38  −  0.43  0.36  − 
Table 3: Partial replication of Table 2 using 16,000 posterior draws in each case
Method  8 schools  Stacks-N  Stacks-t  Puromycin  Leukemia  Leukemia-log  Radon 

PSIS-LOO  0.19  0.12  0.07  0.10  1.02  0.09  0.18 
IS-LOO  0.13  0.21  0.07  0.25  1.21  0.11  0.24 
TIS-LOO  0.15  0.27  0.07  0.17  1.60  0.09  0.24 
WAIC  0.40  0.67  0.09  0.44  2.27  0.25  1.30 
4.7 Summary of examples
Table 2 compares the performance of Pareto smoothed importance sampling, raw importance sampling, truncated importance sampling, and WAIC for estimating expected out-of-sample prediction accuracy for each of the examples in Sects. 4.1–4.6. Models were fit in Stan to obtain 4000 simulation draws. In each case, the distributions come from 100 independent simulations of the entire fitting process, and the root mean squared error is evaluated by comparing to exact LOO, which was computed by separately fitting the model to each leave-one-out dataset for each example. The last three lines of Table 2 additionally show the performance of PSIS-LOO combined with direct sampling for the problematic i with \({\hat{k}}>0.7\) (PSIS-LOO+), 10-fold-CV, and 10 times repeated 10-fold-CV.^{5} For the Stacks-N, Puromycin, and Leukemia examples, there was one i with \({\hat{k}}>0.7\), and thus the improvement has the same computational cost as the full posterior inference. 10-fold-CV has higher RMSE than the LOO approximations except in the Leukemia case. The higher RMSE of 10-fold-CV is due to the additional variance from the data division. Repeated 10-fold-CV has smaller RMSE than basic 10-fold-CV, but the cost of computation is then already 100 times that of the original full posterior inference. These results show that K-fold-CV is needed only if the LOO approximations fail badly (see also the results in Vehtari and Lampinen 2002).
As measured by root mean squared error, PSIS consistently performs well. In general, when IS-LOO has problems it is because of the high variance of the raw importance weights, while TIS-LOO and WAIC have problems because of bias. Table 3 shows a replication using 16,000 Stan draws for each example. The results are similar, and PSIS-LOO improves the most given additional draws.
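For readers who want to see the mechanics behind these comparisons, here is a minimal numpy sketch of plain IS-LOO and the TIS truncation rule of Ionides (2008) on a toy conjugate normal model. It is an illustration only, not the PSIS procedure and not the loo package implementation; the function name and toy setup are our own.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: normal model with known unit sd; posterior draws for the
# mean under a flat prior are N(ybar, 1/n).
n, S = 20, 4000
y = rng.normal(0.0, 1.0, n)
theta = rng.normal(y.mean(), 1.0 / np.sqrt(n), S)

# S x n matrix of pointwise log-likelihoods
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - theta[:, None]) ** 2

def elpd_loo_is(log_lik, truncate=False):
    """IS-LOO (and optionally TIS-LOO) from a log-likelihood matrix."""
    S = log_lik.shape[0]
    # Raw importance ratios r_s = 1 / p(y_i | theta_s), in log space
    log_r = -log_lik
    log_r = log_r - log_r.max(axis=0)  # stabilize before exponentiating
    r = np.exp(log_r)
    if truncate:
        # TIS rule: cap each weight at sqrt(S) times the mean weight
        r = np.minimum(r, np.sqrt(S) * r.mean(axis=0))
    w = r / r.sum(axis=0)
    # p(y_i | y_{-i}) is approximated by sum_s w_s p(y_i | theta_s)
    return np.log((w * np.exp(log_lik)).sum(axis=0)).sum()

print(elpd_loo_is(log_lik), elpd_loo_is(log_lik, truncate=True))
```

In this well-behaved conjugate example the truncation barely binds, so the two estimates nearly coincide; the differences in Tables 2 and 3 arise in the harder examples where the raw ratios have heavy tails.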
5 Standard errors and model comparison
We next consider some approaches for assessing the uncertainty of cross-validation and WAIC estimates of prediction error. We present these methods in a separate section rather than in our main development because, as discussed below, the diagnostics can be difficult to interpret when the sample size is small.
5.1 Standard errors
These standard errors come from considering the n data points as a sample from a larger population or, equivalently, as independent realizations of an error model. One can also compute Monte Carlo standard errors arising from the finite number of simulation draws using the formula from Gelman et al. (2013), which uses both between- and within-chain information and is implemented in Stan. In practice we expect Monte Carlo standard errors not to be of much interest because we would hope to have enough simulations that the computations are stable, but it can make sense to check that they are low enough to be negligible compared to sampling error (which scales like 1 / n rather than 1 / S).
The standard error (23) and the corresponding formula for \(\text{ se }\,(\widehat{\mathrm{elpd}}_\mathrm{waic})\) have two difficulties when the sample size is low. First, the n terms are not strictly independent because they are all computed from the same set of posterior simulations \(\theta ^s\). This is a generic issue when evaluating the standard error of any cross-validated estimate. Second, the terms in any of these expressions can come from highly skewed distributions, so the second moment might not give a good summary of uncertainty. Both of these problems should subside as n becomes large. For small n, one could instead compute nonparametric error estimates using a Bayesian bootstrap on the computed log-likelihood values corresponding to the n data points (Vehtari and Lampinen 2002).
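The Bayesian bootstrap alternative can be sketched in a few lines: draw Dirichlet(1, …, 1) weights over the n points and recompute the weighted total for each draw. The pointwise values below are simulated stand-ins, not output from any real model fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pointwise elpd contributions for n data points
elpd_i = rng.normal(-1.3, 0.8, size=50)
n = elpd_i.size

# Normal-approximation standard error, as in (23): sqrt(n * Var(elpd_i))
se_normal = np.sqrt(n * elpd_i.var(ddof=1))

# Bayesian bootstrap: Dirichlet(1,...,1) weights over the n points
B = 4000
w = rng.dirichlet(np.ones(n), size=B)   # B x n weight vectors
elpd_bb = n * (w * elpd_i).sum(axis=1)  # each draw is a weighted total
se_bb = elpd_bb.std(ddof=1)

print(se_normal, se_bb)
```

With roughly symmetric pointwise values the two estimates agree closely; the Bayesian bootstrap earns its keep when the distribution of the terms is strongly skewed and the second moment is a poor summary.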
5.2 Model comparison
When comparing two fitted models, we can estimate the difference in their expected predictive accuracy by the difference in \(\widehat{\mathrm{elpd}}_\mathrm{loo}\) or \(\widehat{\mathrm{elpd}}_\mathrm{waic}\) (multiplied by \(-2\), if desired, to be on the deviance scale). To compute the standard error of this difference we can use a paired estimate to take advantage of the fact that the same set of n data points is used to fit both models.
As before, these calculations should be most useful when n is large, because then non-normality of the distribution is not such an issue when estimating the uncertainty of these sums.
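A sketch of the paired calculation, using hypothetical pointwise LOO contributions for two models: because the per-point contributions are positively correlated across models fit to the same data, the paired standard error is much smaller than one that ignores the pairing.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
# Hypothetical pointwise LOO contributions from two fitted models;
# model B is slightly better and strongly correlated with model A.
elpd_a = rng.normal(-1.2, 0.7, n)
elpd_b = elpd_a + rng.normal(0.15, 0.3, n)

diff = elpd_b - elpd_a
elpd_diff = diff.sum()
se_diff = np.sqrt(n * diff.var(ddof=1))   # paired standard error
# Unpaired SE ignores the correlation and overstates the uncertainty:
se_unpaired = np.sqrt(n * (elpd_a.var(ddof=1) + elpd_b.var(ddof=1)))

print(elpd_diff, se_diff, se_unpaired)
```

The paired estimate is the one reported for the arsenic example in Sect. 5.3.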
In any case, we suspect that these standard error formulas, for all their flaws, should give a better sense of uncertainty than what is obtained using the current standard approach for comparing differences of deviances to a \(\chi ^2\) distribution, a practice that is derived for Gaussian linear models or asymptotically and, in any case, only applies to nested models.
Further research needs to be done to evaluate the performance in model comparison of (24) and the corresponding standard error formula for LOO. Cross-validation and WAIC should not be used to select a single model among a large number of models due to a selection-induced bias as demonstrated, for example, by Piironen and Vehtari (2016).
5.3 Model comparison using pointwise prediction errors
We can also compare models in their leave-one-out errors, point by point. We illustrate with an analysis of a survey of residents from a small area in Bangladesh that was affected by arsenic in drinking water. Respondents with elevated arsenic levels in their wells were asked if they were interested in getting water from a neighbor’s well, and a series of models were fit to predict this binary response given various information about the households (Gelman and Hill 2007).
Figure 12 shows the pointwise results for the arsenic example. The scattered blue dots on the left side of Fig. 12a and on the lower right of Fig. 12b correspond to data points which Model A fits particularly poorly—that is, large negative contributions to the expected log predictive density. We can also sum these n terms to yield an estimated difference in \(\mathrm{elpd}_\mathrm{loo}\) of 16.4 with a standard error of 4.4. This standard error derives from the finite sample size and is scaled by the variation in the differences displayed in Fig. 12; it is not a Monte Carlo error and does not decline to 0 as the number of Stan simulation draws increases.
6 Discussion
This paper has focused on the practicalities of implementing LOO, WAIC, and K-fold cross-validation within a Bayesian simulation environment, in particular the coding of the log-likelihood in the model, the computation of the information measures, and the stabilization of weights to enable an approximation of LOO without refitting the model.
Some difficulties persist, however. As discussed above, any predictive accuracy measure involves two definitions: (1) the choice of what part of the model to label as “the likelihood”, which is directly connected to which potential replications are being considered for out-of-sample prediction; and (2) the factorization of the likelihood into “data points”, which is reflected in the later calculations of expected log predictive density.
Some choices of replication can seem natural for a particular dataset but less so in other comparable settings. For example, the 8 schools data are available only at the school level and so it seems natural to treat the school-level estimates as data. But if the original data had been available, we would surely have defined the likelihood based on the individual students’ test scores. It is an awkward feature of predictive error measures that they might be determined based on computational convenience or data availability rather than fundamental features of the problem. To put it another way, we are assessing the fit of the model to the particular data at hand.
Finally, these methods all have limitations. The concern with WAIC is that formula (12) is an asymptotic expression for the bias of lpd for estimating out-of-sample prediction error and is only an approximation for finite samples. Cross-validation (whether calculated directly by refitting the model to several different data subsets, or approximated using importance sampling as we did for LOO) has a different problem in that it relies on inference from a smaller subset of the data being close to inference from the full dataset, an assumption that is typically but not always true.
For example, as we demonstrated in Sect. 4.1, in a hierarchical model with only one data point per group, PSIS-LOO and WAIC can dramatically understate prediction accuracy. Another setting where LOO (and cross-validation more generally) can fail is in models with weak priors and sparse data. For example, consider logistic regression with flat priors on the coefficients and data that happen to be so close to separation that the removal of a single data point can induce separation and thus infinite parameter estimates. In this case the LOO estimate of average prediction accuracy will be zero (that is, \(\widehat{\mathrm{elpd}}_\mathrm{isloo}\) will be \(-\infty \)) if it is calculated to full precision, even though predictions of future data from the actual fitted model will have bounded loss. Such problems should not arise asymptotically with a fixed model and increasing sample size but can occur with actual finite data, especially in settings where models are increasing in complexity and are insufficiently constrained.
That said, quick estimates of out-of-sample prediction error can be valuable for summarizing and comparing models, as can be seen from the popularity of AIC and DIC. For Bayesian models, we prefer PSIS-LOO and K-fold cross-validation to those approximations which are based on point estimation.
Footnotes
 1.
The loo R package is available from CRAN and https://github.com/stan-dev/loo. The corresponding code for Matlab, Octave, and Python is available at https://github.com/avehtari/PSIS.
 2.
In Gelman et al. (2013), the variance-based \(p_\mathrm{waic}\) defined here is called \(p_{\mathrm{waic}\, 2}\). There is also a mean-based formula, \(p_{\mathrm{waic}\, 1}\), which we do not use here.
 3.
Smoothed density estimates were made using a logistic Gaussian process (Vehtari and Riihimäki 2014).
 4.
As expected, the two slightly high estimates for k correspond to particularly influential observations, in this case houses with extremely low radon measurements.
 5.
10-fold-CV results were not computed for data sets with \(n\le 11\), and 10 times repeated 10-fold-CV was not feasible for the radon example due to the computation time required.
 6.
The code in the generated quantities block is written using the new syntax introduced in Stan version 2.10.0.
 7.
For models fit to large datasets it can be infeasible to store the entire log-likelihood matrix in memory. A function for computing the log-likelihood from the data and the posterior draws of the relevant parameters may be specified instead of the log-likelihood matrix (the necessary data and draws are supplied as an additional argument), and columns of the log-likelihood matrix are then computed as needed. This requires less memory than storing the entire log-likelihood matrix and allows loo to be used with much larger datasets.
 8.
In statistics there is a tradition of looking at deviance, while in computer science the log score is more popular, so we report both.
 9.
The extract_log_lik() function used in the example is a convenience function for extracting the log-likelihood matrix from a fitted Stan model, provided that the user has computed and stored the pointwise log-likelihood in their Stan program (see, for example, the generated quantities block in 1). The argument parameter_name (defaulting to “log_lik”) can also be supplied to indicate which parameter or generated quantity corresponds to the log-likelihood.
Acknowledgments
We thank Bob Carpenter, Avraham Adler, Joona Karjalainen, Sean Raleigh, Sumio Watanabe, and Ben Lambert for helpful comments, Juho Piironen for R help, Tuomas Sivula for Python port, and the U.S. National Science Foundation, Institute of Education Sciences, and Office of Naval Research for partial support of this research.
References
 Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N., Csaki, F. (eds.) Proceedings of the Second International Symposium on Information Theory, pp. 267–281. Akademiai Kiado, Budapest (1973)
 Ando, T., Tsay, R.: Predictive likelihood for Bayesian model selection and averaging. Int. J. Forecast. 26, 744–763 (2010)
 Arlot, S., Celisse, A.: A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010)
 Bernardo, J.M., Smith, A.F.M.: Bayesian Theory. Wiley, New York (1994)
 Burman, P.: A comparative study of ordinary cross-validation, \(v\)-fold cross-validation and the repeated learning-testing methods. Biometrika 76, 503–514 (1989)
 Epifani, I., MacEachern, S.N., Peruggia, M.: Case-deletion importance sampling estimators: central limit theorems and related results. Electron. J. Stat. 2, 774–806 (2008)
 Gabry, J., Goodrich, B.: rstanarm: Bayesian applied regression modeling via Stan. R package version 2.10.0 (2016). http://mc-stan.org/interfaces/rstanarm
 Geisser, S., Eddy, W.: A predictive approach to model selection. J. Am. Stat. Assoc. 74, 153–160 (1979)
 Gelfand, A.E.: Model determination using sampling-based methods. In: Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (eds.) Markov Chain Monte Carlo in Practice, pp. 145–162. Chapman and Hall, London (1996)
 Gelfand, A.E., Dey, D.K., Chang, H.: Model determination using predictive distributions with implementation via sampling-based methods. In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (eds.) Bayesian Statistics 4, pp. 147–167. Oxford University Press, Oxford (1992)
 Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis, 3rd edn. CRC Press, London (2013)
 Gelman, A., Hill, J.: Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge (2007)
 Gelman, A., Hwang, J., Vehtari, A.: Understanding predictive information criteria for Bayesian models. Stat. Comput. 24, 997–1016 (2014)
 Gneiting, T., Raftery, A.E.: Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 102, 359–378 (2007)
 Hoeting, J., Madigan, D., Raftery, A.E., Volinsky, C.: Bayesian model averaging. Stat. Sci. 14, 382–417 (1999)
 Hoffman, M.D., Gelman, A.: The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15, 1593–1623 (2014)
 Ionides, E.L.: Truncated importance sampling. J. Comput. Graph. Stat. 17, 295–311 (2008)
 Koopman, S.J., Shephard, N., Creal, D.: Testing the assumptions behind importance sampling. J. Econom. 149, 2–11 (2009)
 Peruggia, M.: On the variability of case-deletion importance sampling weights in the Bayesian linear model. J. Am. Stat. Assoc. 92, 199–207 (1997)
 Piironen, J., Vehtari, A.: Comparison of Bayesian predictive methods for model selection. Stat. Comput. (2016) (in press). http://link.springer.com/article/10.1007/s11222-016-9649-y
 Plummer, M.: Penalized loss functions for Bayesian model comparison. Biostatistics 9, 523–539 (2008)
 R Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2016). https://www.R-project.org/
 Rubin, D.B.: Estimation in parallel randomized experiments. J. Educ. Stat. 6, 377–401 (1981)
 Spiegelhalter, D.J., Best, N.G., Carlin, B.P., van der Linde, A.: Bayesian measures of model complexity and fit. J. R. Stat. Soc. B 64, 583–639 (2002)
 Spiegelhalter, D., Thomas, A., Best, N., Gilks, W., Lunn, D.: BUGS: Bayesian inference using Gibbs sampling. MRC Biostatistics Unit, Cambridge, England (1994, 2003). http://www.mrc-bsu.cam.ac.uk/bugs/
 Stan Development Team: The Stan C++ Library, version 2.10.0 (2016a). http://mc-stan.org/
 Stan Development Team: RStan: the R interface to Stan, version 2.10.1 (2016b). http://mc-stan.org/interfaces/rstan.html
 Stone, M.: An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. J. R. Stat. Soc. B 36, 44–47 (1977)
 van der Linde, A.: DIC in variable selection. Stat. Neerl. 1, 45–56 (2005)
 Vehtari, A., Gelman, A.: Pareto smoothed importance sampling (2015). arXiv:1507.02646
 Vehtari, A., Gelman, A., Gabry, J.: loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models. R package version 0.1.6 (2016a). https://github.com/stan-dev/loo
 Vehtari, A., Mononen, T., Tolvanen, V., Sivula, T., Winther, O.: Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. J. Mach. Learn. Res. 17, 1–38 (2016b)
 Vehtari, A., Lampinen, J.: Bayesian model assessment and comparison using cross-validation predictive densities. Neural Comput. 14, 2439–2468 (2002)
 Vehtari, A., Ojanen, J.: A survey of Bayesian predictive methods for model assessment, selection and comparison. Stat. Surv. 6, 142–228 (2012)
 Vehtari, A., Riihimäki, J.: Laplace approximation for logistic Gaussian process density estimation and regression. Bayesian Anal. 9, 425–448 (2014)
 Watanabe, S.: Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res. 11, 3571–3594 (2010)
 Zhang, J., Stephens, M.A.: A new and efficient estimation method for the generalized Pareto distribution. Technometrics 51, 316–325 (2009)