Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC

Vehtari, Aki; Gelman, Andrew; Gabry, Jonah

doi:10.1007/s11222-016-9696-4

Published: 30 August 2016

Volume 27, pages 1413–1432, (2017)

Statistics and Computing

An Erratum to this article was published on 18 October 2016

Abstract

Leave-one-out cross-validation (LOO) and the widely applicable information criterion (WAIC) are methods for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values. LOO and WAIC have various advantages over simpler estimates of predictive error such as AIC and DIC but are less used in practice because they involve additional computational steps. Here we lay out fast and stable computations for LOO and WAIC that can be performed using existing simulation draws. We introduce an efficient computation of LOO using Pareto-smoothed importance sampling (PSIS), a new procedure for regularizing importance weights. Although WAIC is asymptotically equal to LOO, we demonstrate that PSIS-LOO is more robust in the finite case with weak priors or influential observations. As a byproduct of our calculations, we also obtain approximate standard errors for estimated predictive errors and for comparison of predictive errors between two models. We implement the computations in an R package called loo and demonstrate using models fit with the Bayesian inference package Stan.

Notes

1. The loo R package is available from CRAN and https://github.com/stan-dev/loo. The corresponding code for Matlab, Octave, and Python is available at https://github.com/avehtari/PSIS.
2. In Gelman et al. (2013), the variance-based \(p_\mathrm{waic}\) defined here is called \(p_{\mathrm{waic}\, 2}\). There is also a mean-based formula, \(p_{\mathrm{waic}\, 1}\), which we do not use here.
3. Smoothed density estimates were made using a logistic Gaussian process (Vehtari and Riihimäki 2014).
4. As expected, the two slightly high estimates for k correspond to particularly influential observations, in this case houses with extremely low radon measurements.
5. 10-fold-CV results were not computed for data sets with \(n\le 11\), and 10 times repeated 10-fold-CV was not feasible for the radon example due to the computation time required.
6. The code in the generated quantities block is written using the new syntax introduced in Stan version 2.10.0.
7. For models fit to large datasets it can be infeasible to store the entire log-likelihood matrix in memory. A function for computing the log-likelihood from the data and posterior draws of the relevant parameters may be specified instead of the log-likelihood matrix (the necessary data and draws are supplied as an additional argument), and columns of the log-likelihood matrix are computed as needed. This requires less memory than storing the entire log-likelihood matrix and allows loo to be used with much larger datasets.
8. In statistics there is a tradition of looking at deviance, while in computer science the log score is more popular, so we return both.
9. The extract_log_lik() function used in the example is a convenience function for extracting the log-likelihood matrix from a fitted Stan model provided that the user has computed and stored the pointwise log-likelihood in their Stan program (see, for example, the generated quantities block in Appendix 1). The argument parameter_name (defaulting to “log_lik”) can also be supplied to indicate which parameter or generated quantity corresponds to the log-likelihood.

References

Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N., Csaki, F. (eds.) Proceedings of the Second International Symposium on Information Theory, pp. 267–281. Akademiai Kiado, Budapest (1973)
Ando, T., Tsay, R.: Predictive likelihood for Bayesian model selection and averaging. Int. J. Forecast. 26, 744–763 (2010)
Arlot, S., Celisse, A.: A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010)
Bernardo, J.M., Smith, A.F.M.: Bayesian Theory. Wiley, New York (1994)
Burman, P.: A comparative study of ordinary cross-validation, \(v\)-fold cross-validation and the repeated learning-testing methods. Biometrika 76, 503–514 (1989)
Epifani, I., MacEachern, S.N., Peruggia, M.: Case-deletion importance sampling estimators: central limit theorems and related results. Electron. J. Stat. 2, 774–806 (2008)
Gabry, J., Goodrich, B.: rstanarm: Bayesian applied regression modeling via Stan. R package version 2.10.0. (2016). http://mc-stan.org/interfaces/rstanarm
Geisser, S., Eddy, W.: A predictive approach to model selection. J. Am. Stat. Assoc. 74, 153–160 (1979)
Gelfand, A.E.: Model determination using sampling-based methods. In: Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (eds.) Markov Chain Monte Carlo in Practice, pp. 145–162. Chapman and Hall, London (1996)
Gelfand, A.E., Dey, D.K., Chang, H.: Model determination using predictive distributions with implementation via sampling-based methods. In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (eds.) Bayesian Statistics, 4th edn, pp. 147–167. Oxford University Press, Oxford (1992)
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis, 3rd edn. CRC Press, London (2013)
Gelman, A., Hill, J.: Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge (2007)
Gelman, A., Hwang, J., Vehtari, A.: Understanding predictive information criteria for Bayesian models. Stat. Comput. 24, 997–1016 (2014)
Gneiting, T., Raftery, A.E.: Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 102, 359–378 (2007)
Hoeting, J., Madigan, D., Raftery, A.E., Volinsky, C.: Bayesian model averaging. Stat. Sci. 14, 382–417 (1999)
Hoffman, M.D., Gelman, A.: The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15, 1593–1623 (2014)
Ionides, E.L.: Truncated importance sampling. J. Comput. Graph. Stat. 17, 295–311 (2008)
Koopman, S.J., Shephard, N., Creal, D.: Testing the assumptions behind importance sampling. J. Econom. 149, 2–11 (2009)
Peruggia, M.: On the variability of case-deletion importance sampling weights in the Bayesian linear model. J. Am. Stat. Assoc. 92, 199–207 (1997)
Piironen, J., Vehtari, A.: Comparison of Bayesian predictive methods for model selection. Stat. Comput. (2016) (In press). http://link.springer.com/article/10.1007/s11222-016-9649-y
Plummer, M.: Penalized loss functions for Bayesian model comparison. Biostatistics 9, 523–539 (2008)
R Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2016). https://www.R-project.org/
Rubin, D.B.: Estimation in parallel randomized experiments. J. Educ. Stat. 6, 377–401 (1981)
Spiegelhalter, D.J., Best, N.G., Carlin, B.P., van der Linde, A.: Bayesian measures of model complexity and fit. J. R. Stat. Soc. B 64, 583–639 (2002)
Spiegelhalter, D., Thomas, A., Best, N., Gilks, W., Lunn, D.: BUGS: Bayesian inference using Gibbs sampling. MRC Biostatistics Unit, Cambridge, England (1994, 2003). http://www.mrc-bsu.cam.ac.uk/bugs/
Stan Development Team: The Stan C++ Library, version 2.10.0 (2016a). http://mc-stan.org/
Stan Development Team: RStan: the R interface to Stan, version 2.10.1 (2016b). http://mc-stan.org/interfaces/rstan.html
Stone, M.: An asymptotic equivalence of choice of model cross-validation and Akaike’s criterion. J. R. Stat. Soc. B 36, 44–47 (1977)
van der Linde, A.: DIC in variable selection. Stat. Neerl. 1, 45–56 (2005)
Vehtari, A., Gelman, A.: Pareto smoothed importance sampling (2015). arXiv:1507.02646
Vehtari, A., Gelman, A., Gabry, J.: loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models. R package version 0.1.6 (2016a). https://github.com/stan-dev/loo
Vehtari, A., Mononen, T., Tolvanen, V., Sivula, T., Winther, O.: Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. J. Mach. Learn. Res. 17, 1–38 (2016b)
Vehtari, A., Lampinen, J.: Bayesian model assessment and comparison using cross-validation predictive densities. Neural Comput. 14, 2439–2468 (2002)
Vehtari, A., Ojanen, J.: A survey of Bayesian predictive methods for model assessment, selection and comparison. Stat. Surv. 6, 142–228 (2012)
Vehtari, A., Riihimäki, J.: Laplace approximation for logistic Gaussian process density estimation and regression. Bayesian Anal. 9, 425–448 (2014)
Watanabe, S.: Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res. 11, 3571–3594 (2010)
Zhang, J., Stephens, M.A.: A new and efficient estimation method for the generalized Pareto distribution. Technometrics 51, 316–325 (2009)

Acknowledgments

We thank Bob Carpenter, Avraham Adler, Joona Karjalainen, Sean Raleigh, Sumio Watanabe, and Ben Lambert for helpful comments, Juho Piironen for R help, Tuomas Sivula for Python port, and the U.S. National Science Foundation, Institute of Education Sciences, and Office of Naval Research for partial support of this research.

Author information

Authors and Affiliations

Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Espoo, Finland
Aki Vehtari
Department of Statistics, Columbia University, New York, USA
Andrew Gelman & Jonah Gabry

Corresponding author

Correspondence to Aki Vehtari.

Additional information

An erratum to this article is available at http://dx.doi.org/10.1007/s11222-016-9709-3.

Appendix: Implementation in Stan and R

1.1 Appendix 1: Stan code for computing and storing the pointwise log-likelihood

We illustrate how to write Stan code that computes and stores the pointwise log-likelihood using the arsenic example from Sect. 5.3. We save the program in the file logistic.stan:

We have defined the log-likelihood as a vector log_lik in the generated quantities block so that the individual terms will be saved by Stan.^{Footnote 6} It would seem desirable to compute the terms of the log-likelihood directly without requiring the repetition of code, perhaps by flagging the appropriate lines in the model or by identifying the log likelihood as those lines in the model that are defined relative to the data. But there are so many ways of writing any model in Stan—anything goes as long as it produces the correct log posterior density, up to any arbitrary constant—that we cannot see any general way at this time for computing LOO and WAIC without repeating the likelihood part of the code. The good news is that the additional computations are relatively cheap: sitting as they do in the generated quantities block (rather than in the transformed parameters and model blocks), the expressions for the terms of the log posterior need only be computed once per saved iteration rather than once per HMC leapfrog step, and no gradient calculations are required.
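To make the stored quantity concrete, the following Python sketch evaluates the same kind of pointwise log-likelihood outside Stan. It is not the paper's Stan program: it assumes a logistic regression, and the names X, y, beta_draws, and pointwise_log_lik are hypothetical, chosen only for illustration. Each row of the result corresponds to one saved posterior draw and each column to one observation.

```python
import math

def log_inv_logit(eta):
    """Numerically stable log(1 / (1 + exp(-eta)))."""
    if eta >= 0:
        return -math.log1p(math.exp(-eta))
    return eta - math.log1p(math.exp(eta))

def pointwise_log_lik(X, y, beta_draws):
    """Return an S x n nested list with entries log p(y_i | beta^s).

    Assumes a Bernoulli likelihood with a logit link (an assumption of
    this sketch, matching a logistic regression).
    """
    log_lik = []
    for beta in beta_draws:              # one row per posterior draw
        row = []
        for x_i, y_i in zip(X, y):       # one column per observation
            eta = sum(b * x for b, x in zip(beta, x_i))
            # log p(y=1) = log inv_logit(eta); log p(y=0) = log inv_logit(-eta)
            row.append(log_inv_logit(eta) if y_i == 1 else log_inv_logit(-eta))
        log_lik.append(row)
    return log_lik

# Toy inputs (made up for illustration): 3 observations, 4 draws.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
y = [1, 0, 1]
beta_draws = [[0.1, 0.8], [0.0, 1.0], [-0.2, 1.1], [0.3, 0.7]]
log_lik = pointwise_log_lik(X, y, beta_draws)
```

As in the generated quantities block, nothing here needs gradients: the log-likelihood is simply evaluated once per saved draw.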

1.2 Appendix 2: The loo R package for LOO and WAIC

The loo R package provides the functions loo() and waic() for efficiently computing PSIS-LOO and WAIC for fitted Bayesian models using the methods described in this paper.

These functions take as their argument an \(S \times n\) log-likelihood matrix, where S is the size of the posterior sample (the number of retained draws) and n is the number of data points.^{Footnote 7} The required means and variances across simulations are calculated and then used to compute the effective number of parameters and LOO or WAIC.
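As a rough illustration of how means and variances across simulations enter the calculation, the Python sketch below computes WAIC from an \(S \times n\) log-likelihood matrix using the variance-based \(p_\mathrm{waic}\) (see the Notes). It is a minimal sketch, not the loo package's implementation, and the function names and toy matrix are assumptions made for this example.

```python
import math

def log_mean_exp(values):
    """log of the mean of exp(values), computed stably."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def sample_variance(values):
    """Sample variance with an S - 1 denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def waic_pointwise(log_lik):
    """Return a list of (elpd_waic_i, p_waic_i), one pair per data point."""
    n = len(log_lik[0])
    out = []
    for i in range(n):
        col = [row[i] for row in log_lik]    # draws for observation i
        lppd_i = log_mean_exp(col)           # log pointwise predictive density
        p_waic_i = sample_variance(col)      # variance-based penalty term
        out.append((lppd_i - p_waic_i, p_waic_i))
    return out

# Toy S x n matrix (S = 4 draws, n = 3 points), made up for illustration.
log_lik = [[-0.9, -1.4, -0.3],
           [-1.1, -1.0, -0.5],
           [-0.8, -1.7, -0.4],
           [-1.0, -1.2, -0.2]]
pw = waic_pointwise(log_lik)
elpd_waic = sum(e for e, _ in pw)
p_waic = sum(p for _, p in pw)
waic = -2.0 * elpd_waic
```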

The loo() function returns \(\widehat{\mathrm{elpd}}_\mathrm{loo},\widehat{p}_\mathrm{loo},\mathrm{looic} =-2\, \widehat{\mathrm{elpd}}_\mathrm{loo}\) (to provide the output on the conventional scale of “deviance” or AIC),^{Footnote 8} the pointwise contributions of each of these measures, and standard errors. The waic() function computes the analogous quantities for WAIC. Also returned by the loo() function is the estimated shape parameter \({\hat{k}}\) for the generalized Pareto fit to the importance ratios for each leave-one-out distribution. These computations could also be implemented directly in Stan C++, perhaps following the rule that the calculations are performed if there is a variable named log_lik. The loo R package, however, is more general and does not require that a model be fit using Stan, as long as an appropriate log-likelihood matrix is supplied.
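The relation between the pointwise contributions, the totals, and the standard errors can be sketched in a few lines of Python. This assumes the standard error of a sum of n pointwise terms is estimated as the square root of n times their sample variance; the function name and dictionary keys are hypothetical and do not mirror the structure of the object returned by loo().

```python
import math

def summarize_pointwise(elpd_pointwise):
    """Sum pointwise elpd values and attach a standard error.

    The deviance-scale criterion is -2 times the elpd estimate, so its
    standard error is 2 times the elpd standard error.
    """
    n = len(elpd_pointwise)
    total = sum(elpd_pointwise)
    mean = total / n
    var = sum((e - mean) ** 2 for e in elpd_pointwise) / (n - 1)
    se = math.sqrt(n * var)
    return {"elpd": total, "se_elpd": se, "ic": -2.0 * total, "se_ic": 2.0 * se}

# Toy pointwise values, made up for illustration.
summary = summarize_pointwise([-1.0, -2.0, -3.0, -2.0])
```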

Using the loo package. Below, we provide R code for preparing and running the logistic regression for the arsenic example in Stan. After fitting the model we then use the loo package to compute LOO and WAIC.^{Footnote 9}

The printed output shows \(\widehat{\mathrm{elpd}}_\mathrm{loo}\), \(\widehat{p}_\mathrm{loo}\), \(\mathrm{looic}\), and their standard errors:

By default, the estimates for the shape parameter \(\hat{k}\) of the generalized Pareto distribution are also checked and a message is displayed informing the user if any \(\hat{k}\) are problematic (see the end of Sect. 2.1). In the example above the message tells us that all of the estimates for \(\hat{k}\) are fine. However, if any \(\hat{k}\) were between \(1/2\) and \(1\) or greater than \(1\) the message would instead look something like this:

If there are any warnings then it can be useful to visualize the estimates to check which data points correspond to the large \(\hat{k}\) values. A plot of the \(\hat{k}\) estimates can also be generated using plot(loo1) and the list returned by the loo function also contains the full vector of \(\hat{k}\) values.
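Given the full vector of \(\hat{k}\) values, finding the data points behind a warning is a matter of binning. The Python sketch below is an illustration only: it is not the loo package's code and does not reproduce its message text. The bin names are hypothetical, and placing a value of exactly \(1/2\) in the lowest bin and exactly \(1\) in the middle bin is an assumption of this sketch.

```python
def summarize_pareto_k(k_hats):
    """Count k-hat values per range and list the indices of flagged points."""
    counts = {"at_most_half": 0, "between_half_and_one": 0, "above_one": 0}
    flagged = []                      # 0-based indices with k-hat > 1/2
    for i, k in enumerate(k_hats):
        if k > 1.0:
            counts["above_one"] += 1
            flagged.append(i)
        elif k > 0.5:
            counts["between_half_and_one"] += 1
            flagged.append(i)
        else:
            counts["at_most_half"] += 1
    return counts, flagged

# Toy k-hat values, made up for illustration.
counts, flagged = summarize_pareto_k([0.1, 0.6, 1.3, 0.5])
```

The indices in flagged identify the observations worth inspecting.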

Model comparison. To compare this model to a second model on their values of LOO we can use the compare() function:

This new object, loo_diff, contains the estimated difference of expected leave-one-out prediction errors between the two models, along with the standard error:
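The difference and its standard error follow from the pointwise values of the two models, because both are evaluated on the same data points. The Python sketch below assumes the standard error is the square root of n times the sample variance of the pointwise differences. The function name, the toy inputs, and the sign convention (second model minus first) are choices made for this sketch and are not taken from the loo package.

```python
import math

def compare_pointwise(elpd_a, elpd_b):
    """Return (elpd_diff, se_diff) for model B minus model A."""
    if len(elpd_a) != len(elpd_b):
        raise ValueError("both models must be evaluated on the same data points")
    diffs = [b - a for a, b in zip(elpd_a, elpd_b)]   # paired differences
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return sum(diffs), math.sqrt(n * var)

# Toy pointwise elpd values for two models, made up for illustration.
elpd_diff, se_diff = compare_pointwise([-1.0, -2.0, -1.5], [-0.8, -2.1, -1.0])
```

Working with paired differences, rather than differencing the two totals and combining their separate standard errors, accounts for the fact that the pointwise terms of the two models are typically correlated.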

Code for WAIC. For WAIC the code is analogous and the objects returned have the same structure (except there are no Pareto \(k\) estimates). The compare() function can also be used to estimate the difference in WAIC between two models:

1.3 Appendix 3: Using the loo R package with rstanarm models

Here we show how to fit the model for the radon example from Sect. 4.6 and carry out PSIS-LOO using the rstanarm and loo packages.

After fitting the models we can pass the fitted model objects modelA and modelB directly to rstanarm’s loo method and it will call the necessary functions from the loo package internally.

This returns:

If there are warnings about large values of the estimated Pareto shape parameter \(\hat{k}\) for the importance ratios, rstanarm is also able to automatically carry out the procedure we call PSIS-LOO+ (see Sect. 4.7). That is, rstanarm can refit the model, leaving out these problematic observations one at a time and computing their elpd contributions directly. Then these values are combined with the results from PSIS-LOO for the other observations and returned to the user. We recommend this when there are only a few large \(\hat{k}\) estimates. If there are many of them then we recommend \(K\)-fold cross-validation, which is also implemented in the latest release of rstanarm.
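The combination step can be sketched as follows in Python. This is an illustration of the idea only, not rstanarm's implementation: the function name is hypothetical, the threshold is left as an argument rather than assumed, and exact_elpd stands in for whatever routine refits the model without observation i and evaluates its log predictive density.

```python
def combine_with_exact_loo(elpd_psis, k_hats, k_threshold, exact_elpd):
    """Replace PSIS-LOO terms that have large k-hat by directly computed terms.

    exact_elpd(i) is a user-supplied callable (hypothetical here) that
    refits the model without observation i and returns log p(y_i | y_-i).
    """
    combined = list(elpd_psis)
    refit = [i for i, k in enumerate(k_hats) if k > k_threshold]
    for i in refit:                   # one refit per problematic observation
        combined[i] = exact_elpd(i)
    return combined, refit

# Toy values; the lookup table stands in for an actual model refit.
exact_values = {2: -3.1}
combined, refit = combine_with_exact_loo(
    elpd_psis=[-1.0, -1.2, -2.4],
    k_hats=[0.2, 0.4, 0.9],
    k_threshold=0.7,
    exact_elpd=lambda i: exact_values[i],
)
```

The cost is one additional model fit per flagged observation, which is why the approach suits cases with only a few large estimates.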

1.4 Appendix 4: Stan code for \(K\)-fold cross-validation

To implement \(K\)-fold cross-validation we repeatedly partition the data, fitting the model to the training set of each partition and using it to predict the corresponding holdout set. The code for cross-validation does not look so generic because of the need to repeatedly partition the data. However, in any particular example the calculations are not difficult to implement, the main challenge being the increase in computation time by roughly a factor of \(K\). We recommend doing the partitioning in R (or Python, or whichever data-processing environment is being used) and then passing the training data and holdout data to Stan in two pieces.
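A minimal Python sketch of the partitioning step is given below. The function name and the use of a seeded shuffle are assumptions of this example. Each returned pair of index lists would be used to build the training and holdout pieces that are passed to Stan.

```python
import random

def kfold_indices(n, K, seed=0):
    """Split indices 0..n-1 into K folds; return (train, holdout) pairs."""
    if not 2 <= K <= n:
        raise ValueError("K must satisfy 2 <= K <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)              # random assignment to folds
    folds = [sorted(idx[k::K]) for k in range(K)]
    splits = []
    for holdout in folds:
        held = set(holdout)
        train = [i for i in range(n) if i not in held]
        splits.append((train, holdout))
    return splits

# Each observation is held out exactly once across the K fits.
splits = kfold_indices(n=10, K=5)
```

Fixing the seed keeps the partition reproducible, so that two models can be compared on the same folds.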

Again we illustrate with the logistic regression for the arsenic example. We start with the model from above, but we pass in both the training data (N_t, y_t, X_t) and the holdout set (N_h, y_h, X_h), augmenting the data block accordingly. We then alter the generated quantities block to operate on the holdout data:

LOO could also be implemented in this way, setting \(N_t\) to \(N-1\) and \(N_h\) to 1. But, as discussed in the article, for large datasets it is more practical to approximate LOO using importance sampling on the draws from the posterior distribution fit to the entire dataset.
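For orientation, the Python sketch below shows the plain importance sampling approximation, in which the importance ratio for leaving out observation i at draw s is proportional to \(1/p(y_i \mid \theta^s)\). It deliberately omits the Pareto smoothing of the largest ratios, so it is not PSIS-LOO and can be unstable when a few ratios dominate. The function name and toy matrix are assumptions of this example.

```python
import math

def raw_is_loo_pointwise(log_lik):
    """Unsmoothed importance sampling estimate of log p(y_i | y_-i).

    With ratios 1 / p(y_i | theta^s), the estimate reduces to
    -log( mean_s exp(-log_lik[s][i]) ), evaluated stably below.
    """
    S = len(log_lik)
    n = len(log_lik[0])
    elpd = []
    for i in range(n):
        neg = [-row[i] for row in log_lik]        # log importance ratios
        m = max(neg)
        log_mean_ratio = m + math.log(sum(math.exp(v - m) for v in neg) / S)
        elpd.append(-log_mean_ratio)
    return elpd

# Toy S x n matrix (S = 2 draws, n = 2 points), made up for illustration.
elpd_loo_i = raw_is_loo_pointwise([[-1.0, -0.5], [-2.0, -0.5]])
```

All n terms are obtained from a single fit to the full dataset, in contrast to the n refits required by the direct approach.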

About this article

Cite this article

Vehtari, A., Gelman, A. & Gabry, J. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat Comput 27, 1413–1432 (2017). https://doi.org/10.1007/s11222-016-9696-4

Received: 23 January 2016
Accepted: 22 August 2016
Published: 30 August 2016
Issue Date: September 2017
DOI: https://doi.org/10.1007/s11222-016-9696-4
