Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC

Abstract

Leave-one-out cross-validation (LOO) and the widely applicable information criterion (WAIC) are methods for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values. LOO and WAIC have various advantages over simpler estimates of predictive error such as AIC and DIC but are less used in practice because they involve additional computational steps. Here we lay out fast and stable computations for LOO and WAIC that can be performed using existing simulation draws. We introduce an efficient computation of LOO using Pareto-smoothed importance sampling (PSIS), a new procedure for regularizing importance weights. Although WAIC is asymptotically equal to LOO, we demonstrate that PSIS-LOO is more robust in the finite case with weak priors or influential observations. As a byproduct of our calculations, we also obtain approximate standard errors for estimated predictive errors and for comparison of predictive errors between two models. We implement the computations in an R package called loo and demonstrate using models fit with the Bayesian inference package Stan.

Notes

  1. The loo R package is available from CRAN and https://github.com/stan-dev/loo. The corresponding code for Matlab, Octave, and Python is available at https://github.com/avehtari/PSIS.

  2. In Gelman et al. (2013), the variance-based \(p_\mathrm{waic}\) defined here is called \(p_{\mathrm{waic}\, 2}\). There is also a mean-based formula, \(p_{\mathrm{waic}\, 1}\), which we do not use here.

  3. Smoothed density estimates were made using a logistic Gaussian process (Vehtari and Riihimäki 2014).

  4. As expected, the two slightly high estimates for k correspond to particularly influential observations, in this case houses with extremely low radon measurements.

  5. 10-fold-CV results were not computed for data sets with \(n\le 11\), and 10 times repeated 10-fold-CV was not feasible for the radon example due to the computation time required.

  6. The code in the generated quantities block is written using the new syntax introduced in Stan version 2.10.0.

  7. For models fit to large datasets it can be infeasible to store the entire log-likelihood matrix in memory. A function for computing the log-likelihood from the data and posterior draws of the relevant parameters may be specified instead of the log-likelihood matrix (the necessary data and draws are supplied as an additional argument), and columns of the log-likelihood matrix are computed as needed. This requires less memory than storing the entire log-likelihood matrix and allows loo to be used with much larger datasets.

  8. In statistics there is a tradition of looking at deviance, while in computer science the log score is more popular, so we return both.

  9. The extract_log_lik() function used in the example is a convenience function for extracting the log-likelihood matrix from a fitted Stan model provided that the user has computed and stored the pointwise log-likelihood in their Stan program (see, for example, the generated quantities block in Appendix 1). The argument parameter_name (defaulting to “log_lik”) can also be supplied to indicate which parameter or generated quantity corresponds to the log-likelihood.

References

  1. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N., Csaki, F. (eds.) Proceedings of the Second International Symposium on Information Theory, pp. 267–281. Akademiai Kiado, Budapest (1973)

  2. Ando, T., Tsay, R.: Predictive likelihood for Bayesian model selection and averaging. Int. J. Forecast. 26, 744–763 (2010)

  3. Arlot, S., Celisse, A.: A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010)

  4. Bernardo, J.M., Smith, A.F.M.: Bayesian Theory. Wiley, New York (1994)

  5. Burman, P.: A comparative study of ordinary cross-validation, \(v\)-fold cross-validation and the repeated learning-testing methods. Biometrika 76, 503–514 (1989)

  6. Epifani, I., MacEachern, S.N., Peruggia, M.: Case-deletion importance sampling estimators: central limit theorems and related results. Electron. J. Stat. 2, 774–806 (2008)

  7. Gabry, J., Goodrich, B.: rstanarm: Bayesian applied regression modeling via Stan. R package version 2.10.0. (2016). http://mc-stan.org/interfaces/rstanarm

  8. Geisser, S., Eddy, W.: A predictive approach to model selection. J. Am. Stat. Assoc. 74, 153–160 (1979)

  9. Gelfand, A.E.: Model determination using sampling-based methods. In: Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (eds.) Markov Chain Monte Carlo in Practice, pp. 145–162. Chapman and Hall, London (1996)

  10. Gelfand, A.E., Dey, D.K., Chang, H.: Model determination using predictive distributions with implementation via sampling-based methods. In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (eds.) Bayesian Statistics, 4th edn, pp. 147–167. Oxford University Press, Oxford (1992)

  11. Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis, 3rd edn. CRC Press, London (2013)

  12. Gelman, A., Hill, J.: Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge (2007)

  13. Gelman, A., Hwang, J., Vehtari, A.: Understanding predictive information criteria for Bayesian models. Stat. Comput. 24, 997–1016 (2014)

  14. Gneiting, T., Raftery, A.E.: Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 102, 359–378 (2007)

  15. Hoeting, J., Madigan, D., Raftery, A.E., Volinsky, C.: Bayesian model averaging. Stat. Sci. 14, 382–417 (1999)

  16. Hoffman, M.D., Gelman, A.: The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15, 1593–1623 (2014)

  17. Ionides, E.L.: Truncated importance sampling. J. Comput. Graph. Stat. 17, 295–311 (2008)

  18. Koopman, S.J., Shephard, N., Creal, D.: Testing the assumptions behind importance sampling. J. Econom. 149, 2–11 (2009)

  19. Peruggia, M.: On the variability of case-deletion importance sampling weights in the Bayesian linear model. J. Am. Stat. Assoc. 92, 199–207 (1997)

  20. Piironen, J., Vehtari, A.: Comparison of Bayesian predictive methods for model selection. Stat. Comput. (2016) (In press). http://link.springer.com/article/10.1007/s11222-016-9649-y

  21. Plummer, M.: Penalized loss functions for Bayesian model comparison. Biostatistics 9, 523–539 (2008)

  22. R Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2016). https://www.R-project.org/

  23. Rubin, D.B.: Estimation in parallel randomized experiments. J. Educ. Stat. 6, 377–401 (1981)

  24. Spiegelhalter, D.J., Best, N.G., Carlin, B.P., van der Linde, A.: Bayesian measures of model complexity and fit. J. R. Stat. Soc. B 64, 583–639 (2002)

  25. Spiegelhalter, D., Thomas, A., Best, N., Gilks, W., Lunn, D.: BUGS: Bayesian inference using Gibbs sampling. MRC Biostatistics Unit, Cambridge, England (1994, 2003). http://www.mrc-bsu.cam.ac.uk/bugs/

  26. Stan Development Team: The Stan C++ Library, version 2.10.0 (2016a). http://mc-stan.org/

  27. Stan Development Team: RStan: the R interface to Stan, version 2.10.1 (2016b). http://mc-stan.org/interfaces/rstan.html

  28. Stone, M.: An asymptotic equivalence of choice of model cross-validation and Akaike’s criterion. J. R. Stat. Soc. B 36, 44–47 (1977)

  29. van der Linde, A.: DIC in variable selection. Stat. Neerl. 1, 45–56 (2005)

  30. Vehtari, A., Gelman, A.: Pareto smoothed importance sampling (2015). arXiv:1507.02646

  31. Vehtari, A., Gelman, A., Gabry, J.: loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models. R package version 0.1.6 (2016a). https://github.com/stan-dev/loo

  32. Vehtari, A., Mononen, T., Tolvanen, V., Sivula, T., Winther, O.: Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. J. Mach. Learn. Res. 17, 1–38 (2016b)

  33. Vehtari, A., Lampinen, J.: Bayesian model assessment and comparison using cross-validation predictive densities. Neural Comput. 14, 2439–2468 (2002)

  34. Vehtari, A., Ojanen, J.: A survey of Bayesian predictive methods for model assessment, selection and comparison. Stat. Surv. 6, 142–228 (2012)

  35. Vehtari, A., Riihimäki, J.: Laplace approximation for logistic Gaussian process density estimation and regression. Bayesian Anal. 9, 425–448 (2014)

  36. Watanabe, S.: Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res. 11, 3571–3594 (2010)

  37. Zhang, J., Stephens, M.A.: A new and efficient estimation method for the generalized Pareto distribution. Technometrics 51, 316–325 (2009)

Acknowledgments

We thank Bob Carpenter, Avraham Adler, Joona Karjalainen, Sean Raleigh, Sumio Watanabe, and Ben Lambert for helpful comments, Juho Piironen for R help, Tuomas Sivula for Python port, and the U.S. National Science Foundation, Institute of Education Sciences, and Office of Naval Research for partial support of this research.

Author information

Corresponding author

Correspondence to Aki Vehtari.

Additional information

An erratum to this article is available at http://dx.doi.org/10.1007/s11222-016-9709-3.

Appendix: Implementation in Stan and R

Appendix 1: Stan code for computing and storing the pointwise log-likelihood

We illustrate how to write Stan code that computes and stores the pointwise log-likelihood using the arsenic example from Sect. 5.3. We save the program in the file logistic.stan:

[Listing a: Stan program logistic.stan, with the pointwise log-likelihood stored as log_lik in the generated quantities block]

We have defined the log-likelihood as a vector log_lik in the generated quantities block so that the individual terms will be saved by Stan (see Note 6). It would seem desirable to compute the terms of the log-likelihood directly without requiring the repetition of code, perhaps by flagging the appropriate lines in the model or by identifying the log-likelihood as those lines in the model that are defined relative to the data. But there are so many ways of writing any model in Stan (anything goes as long as it produces the correct log posterior density, up to an arbitrary constant) that we cannot see any general way at this time for computing LOO and WAIC without repeating the likelihood part of the code. The good news is that the additional computations are relatively cheap: sitting as they do in the generated quantities block (rather than in the transformed parameters and model blocks), the expressions for the terms of the log posterior need only be computed once per saved iteration rather than once per HMC leapfrog step, and no gradient calculations are required.
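For readers working outside Stan, the same pointwise quantity can be computed after the fact from saved posterior draws. The following Python sketch does this for a logistic regression. It is an illustration only, not the Stan program from this appendix: the function names log_inv_logit and pointwise_log_lik are hypothetical, and the draws are assumed to be available as plain lists of coefficient vectors.

```python
import math


def log_inv_logit(eta):
    """Numerically stable log(1 / (1 + exp(-eta)))."""
    if eta >= 0:
        return -math.log1p(math.exp(-eta))
    return eta - math.log1p(math.exp(eta))


def pointwise_log_lik(draws, X, y):
    """Return an S x n matrix (list of rows) of log p(y_i | theta^s).

    draws: S posterior draws of the coefficient vector, each of length P
    X:     n rows of P predictors
    y:     n binary outcomes (0 or 1)
    """
    log_lik = []
    for b in draws:                     # one row per saved draw
        row = []
        for x_i, y_i in zip(X, y):      # one column per observation
            eta = sum(x_ij * b_j for x_ij, b_j in zip(x_i, b))
            # Bernoulli-logit log density: log p if y = 1, log(1 - p) if y = 0
            row.append(log_inv_logit(eta) if y_i == 1 else log_inv_logit(-eta))
        log_lik.append(row)
    return log_lik
```

As in the generated quantities block, no gradients are involved: each saved draw contributes one row, computed once.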

Appendix 2: The loo R package for LOO and WAIC

The loo R package provides the functions loo() and waic() for efficiently computing PSIS-LOO and WAIC for fitted Bayesian models using the methods described in this paper.

These functions take as their argument an \(S \times n\) log-likelihood matrix, where S is the size of the posterior sample (the number of retained draws) and n is the number of data points (see Note 7). The required means and variances across simulations are calculated and then used to compute the effective number of parameters and LOO or WAIC.
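To make the role of the means and variances concrete, the following Python sketch computes the pointwise WAIC terms from an \(S \times n\) log-likelihood matrix, using the variance-based \(p_\mathrm{waic}\) (see Note 2). It is a sketch under stated assumptions rather than the loo package's code: the function names are hypothetical and the matrix is assumed to be a list of S rows of length n.

```python
import math


def log_mean_exp(values):
    """Stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def sample_variance(values):
    """Unbiased sample variance (divides by S - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def waic_pointwise(log_lik):
    """Return (lpd_i, p_waic_i, elpd_waic_i), each a list of length n."""
    S = len(log_lik)
    if S < 2:
        raise ValueError("need at least two posterior draws")
    n = len(log_lik[0])
    lpd, p_waic, elpd_waic = [], [], []
    for i in range(n):
        column = [log_lik[s][i] for s in range(S)]
        lpd_i = log_mean_exp(column)      # log of the posterior mean likelihood
        p_i = sample_variance(column)     # variance of the log-likelihood over draws
        lpd.append(lpd_i)
        p_waic.append(p_i)
        elpd_waic.append(lpd_i - p_i)
    return lpd, p_waic, elpd_waic
```

Summing the third list over the n data points gives the WAIC estimate of elpd, and summing the second gives the estimated effective number of parameters.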

The loo() function returns \(\widehat{\mathrm{elpd}}_\mathrm{loo}\), \(\widehat{p}_\mathrm{loo}\), and \(\mathrm{looic} = -2\, \widehat{\mathrm{elpd}}_\mathrm{loo}\) (to provide the output on the conventional scale of “deviance” or AIC; see Note 8), the pointwise contributions of each of these measures, and standard errors. The waic() function computes the analogous quantities for WAIC. Also returned by the loo() function is the estimated shape parameter \(\hat{k}\) for the generalized Pareto fit to the importance ratios for each leave-one-out distribution. These computations could also be implemented directly in Stan C++, perhaps following the rule that the calculations are performed if there is a variable named log_lik. The loo R package, however, is more general and does not require that a model be fit using Stan, as long as an appropriate log-likelihood matrix is supplied.
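The relation between the pointwise contributions, the totals, and the standard errors can be sketched in a few lines of Python. The sketch assumes the standard error of a sum of n pointwise terms is computed as \(\sqrt{n\, V}\), where V is the sample variance of those terms. The function name summarize_pointwise and the dictionary keys are hypothetical and do not mirror the structure of the object returned by loo().

```python
import math


def summarize_pointwise(elpd_i):
    """Sum pointwise elpd terms and attach approximate standard errors."""
    n = len(elpd_i)
    if n < 2:
        raise ValueError("need at least two data points")
    total = sum(elpd_i)
    mean = total / n
    var = sum((e - mean) ** 2 for e in elpd_i) / (n - 1)
    se = math.sqrt(n * var)        # standard error of the sum of n terms
    return {
        "elpd": total,
        "se_elpd": se,
        "ic": -2.0 * total,        # deviance scale, e.g. looic = -2 * elpd_loo
        "se_ic": 2.0 * se,
    }
```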

Using the loo package. Below, we provide R code for preparing and running the logistic regression for the arsenic example in Stan. After fitting the model we then use the loo package to compute LOO and WAIC (see Note 9).

[Listing b: R code that fits the logistic regression in Stan and computes LOO and WAIC with the loo package]

The printed output shows \(\widehat{\mathrm{elpd}}_\mathrm{loo}\), \(\widehat{p}_\mathrm{loo}\), \(\mathrm{looic}\), and their standard errors:

[Listing c: printed output of the loo object]

By default, the estimates for the shape parameter \(\hat{k}\) of the generalized Pareto distribution are also checked and a message is displayed informing the user if any \(\hat{k}\) are problematic (see the end of Sect. 2.1). In the example above the message tells us that all of the estimates for \(\hat{k}\) are fine. However, if any \(\hat{k}\) were between \(1/2\) and \(1\) or greater than \(1\) the message would instead look something like this:

[Listing d: example message reporting problematic \(\hat{k}\) estimates]

If there are any warnings then it can be useful to visualize the estimates to check which data points correspond to the large \(\hat{k}\) values. A plot of the \(\hat{k}\) estimates can also be generated using plot(loo1) and the list returned by the loo function also contains the full vector of \(\hat{k}\) values.
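Given the full vector of \(\hat{k}\) values, locating the flagged data points is a simple grouping step. The Python sketch below sorts observation indices into the ranges named above. The function name classify_pareto_k and the group labels are hypothetical, and the treatment of values exactly equal to 1/2 or 1 is an assumption of this sketch.

```python
def classify_pareto_k(k_values):
    """Group observation indices (0-based) by the range of their k-hat estimate."""
    groups = {"ok": [], "between_half_and_one": [], "above_one": []}
    for i, k in enumerate(k_values):
        if k > 1.0:
            groups["above_one"].append(i)
        elif k > 0.5:
            groups["between_half_and_one"].append(i)
        else:
            groups["ok"].append(i)   # assumption: k <= 1/2 counts as fine
    return groups
```

The indices in the two flagged groups are the data points worth inspecting, for example as candidates for influential observations.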

Model comparison. To compare this model to a second model on their values of LOO we can use the compare function:

[Listing e: R code calling compare() and storing the result as loo_diff]

This new object, loo_diff, contains the estimated difference of expected leave-one-out prediction errors between the two models, along with the standard error:

[Listing f: printed output of loo_diff]
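Because both models are evaluated on the same n data points, the standard error of the difference can be computed from the paired pointwise differences rather than from the two separate standard errors. The Python sketch below illustrates this. The function name compare_elpd is hypothetical, and the sign convention (second model minus first) is an assumption of the sketch, not a statement about the output of compare().

```python
import math


def compare_elpd(pointwise_a, pointwise_b):
    """Return (elpd difference B - A, standard error) from paired pointwise terms."""
    if len(pointwise_a) != len(pointwise_b):
        raise ValueError("both models must be evaluated on the same data points")
    n = len(pointwise_a)
    if n < 2:
        raise ValueError("need at least two data points")
    diffs = [b - a for a, b in zip(pointwise_a, pointwise_b)]
    total = sum(diffs)
    mean = total / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return total, math.sqrt(n * var)   # paired standard error of the summed difference
```

When the two models make similar errors on the same observations, the paired differences vary less than the individual terms, so this standard error can be much smaller than either model's own.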

Code for WAIC. For WAIC the code is analogous and the objects returned have the same structure (except there are no Pareto \(k\) estimates). The compare() function can also be used to estimate the difference in WAIC between two models:

[Listing g: R code computing WAIC for the two models and comparing them with compare()]

Appendix 3: Using the loo R package with rstanarm models

Here we show how to fit the model for the radon example from Sect. 4.6 and carry out PSIS-LOO using the rstanarm and loo packages.

[Listing h: R code fitting modelA and modelB for the radon example with rstanarm]

After fitting the models we can pass the fitted model objects modelA and modelB directly to rstanarm’s loo method and it will call the necessary functions from the loo package internally.

[Listing i: R code calling loo on modelA and modelB]

This returns:

[Listing j: printed output]

If there are warnings about large values of the estimated Pareto shape parameter \(\hat{k}\) for the importance ratios, rstanarm is also able to automatically carry out the procedure we call PSIS-LOO+ (see Sect. 4.7). That is, rstanarm can refit the model, leaving out these problematic observations one at a time and computing their elpd contributions directly. Then these values are combined with the results from PSIS-LOO for the other observations and returned to the user. We recommend this when there are only a few large \(\hat{k}\) estimates. If there are many of them then we recommend \(K\)-fold cross-validation, which is also implemented in the latest release of rstanarm.
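The combination step of this procedure can be sketched as follows in Python. This is not rstanarm's implementation: the function name, the callback exact_elpd_for (standing in for a refit of the model without observation i followed by direct evaluation of its log predictive density), and the cutoff argument k_threshold are all hypothetical.

```python
def combine_with_exact_loo(elpd_psis, k_values, exact_elpd_for, k_threshold):
    """Replace PSIS-LOO terms with exact LOO terms where k-hat is too large.

    elpd_psis:      pointwise PSIS-LOO estimates, one per observation
    k_values:       estimated Pareto shape parameters, one per observation
    exact_elpd_for: callable i -> exact elpd contribution of observation i
                    (assumed to refit the model with observation i left out)
    k_threshold:    cutoff above which an observation is refit
    """
    combined = list(elpd_psis)
    refit = []
    for i, k in enumerate(k_values):
        if k > k_threshold:
            combined[i] = exact_elpd_for(i)   # one model refit per flagged point
            refit.append(i)
    return combined, refit
```

The cost is one refit per flagged observation, which is why the approach is attractive only when few observations are flagged.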

Appendix 4: Stan code for \(K\)-fold cross-validation

To implement \(K\)-fold cross-validation we repeatedly partition the data, with each partition fitting the model to the training set and using it to predict the holdout set. The code for cross-validation does not look so generic because of the need to repeatedly partition the data. However, in any particular example the calculations are not difficult to implement, the main challenge being the increase in computation time by roughly a factor of \(K\). We recommend doing the partitioning in R (or Python, or whichever data-processing environment is being used) and then passing the training data and holdout data to Stan in two pieces.
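As one possible way of doing the partitioning in Python, the sketch below splits the observation indices into K folds of nearly equal size and then forms the training and holdout index sets for a given fold. The function names and the fixed random seed are assumptions of the sketch. Indices are 0-based, so they would need shifting by one before being used to subset data destined for Stan's 1-based arrays.

```python
import random


def kfold_indices(n, K, seed=0):
    """Randomly split indices 0..n-1 into K folds whose sizes differ by at most one."""
    if not 1 < K <= n:
        raise ValueError("K must satisfy 1 < K <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # seeded for reproducibility
    return [sorted(idx[k::K]) for k in range(K)]


def train_holdout_split(folds, k):
    """Return (training indices, holdout indices) when fold k is held out."""
    holdout = list(folds[k])
    train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
    return train, holdout
```

Each of the K fits then receives the rows indexed by train as training data and the rows indexed by holdout as holdout data.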

Again we illustrate with the logistic regression for the arsenic example. We start with the model from above, but we pass in both the training data (N_t, y_t, X_t) and the holdout set (N_h, y_h, X_h), augmenting the data block accordingly. We then alter the generated quantities block to operate on the holdout data:

[Listing k: Stan program for the logistic regression with training data (N_t, y_t, X_t) and holdout data (N_h, y_h, X_h)]

LOO could also be implemented in this way, setting \(N_t\) to \(N-1\) and \(N_h\) to 1. But, as discussed in the article, for large datasets it is more practical to approximate LOO using importance sampling on the draws from the posterior distribution fit to the entire dataset.

Cite this article

Vehtari, A., Gelman, A. & Gabry, J. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat Comput 27, 1413–1432 (2017). https://doi.org/10.1007/s11222-016-9696-4

Keywords

  • Bayesian computation
  • Leave-one-out cross-validation (LOO)
  • K-fold cross-validation
  • Widely applicable information criterion (WAIC)
  • Stan
  • Pareto smoothed importance sampling (PSIS)