Abstract
We consider the mean prediction error of a classification or regression procedure as well as its cross-validation estimate, and investigate the variance of this estimate as a function of an arbitrary cross-validation design. We decompose this variance into a scalar product of coefficients and certain covariance expressions, such that the coefficients depend solely on the resampling design and the covariances depend solely on the data’s probability distribution. We rewrite this scalar product in a form in which the initially large number of summands can be reduced step by step to three, provided that a quadratic approximation to the core covariances is valid. We give an analytical example in which this quadratic approximation holds exactly. Moreover, in this example, we show that the leave-p-out estimator of the error depends on p only through a constant and can, therefore, be written in a much simpler form. Furthermore, there is an unbiased estimator of the variance of K-fold cross-validation, in contrast to a claim in the literature. As a consequence, we can show that balanced incomplete block designs have smaller variance than K-fold cross-validation. This property is confirmed in a real data example from the UCI machine learning repository. Finally, we show how to find balanced incomplete block designs in practice.
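To make the notion of a balanced design concrete, the following is a minimal sketch in Python, not taken from the article. It assumes that a cross-validation design is given as a list of test sets over observations 0, ..., n-1, and it checks the defining property of a balanced incomplete block design: all test sets have the same size and every pair of observations occurs together in the same number of test sets. The function names `pair_counts` and `is_balanced` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from collections import Counter


def pair_counts(test_sets, n):
    """For each unordered pair of observations, count the test sets containing both."""
    counts = Counter({pair: 0 for pair in combinations(range(n), 2)})
    for block in test_sets:
        for pair in combinations(sorted(set(block)), 2):
            counts[pair] += 1
    return counts


def is_balanced(test_sets, n):
    """True if all test sets have equal size and every pair co-occurs equally often."""
    sizes = {len(set(block)) for block in test_sets}
    lambdas = set(pair_counts(test_sets, n).values())
    return len(sizes) == 1 and len(lambdas) == 1


# K-fold cross-validation: a partition of 6 observations into 3 folds of size 2.
k_fold = [(0, 1), (2, 3), (4, 5)]

# Leave-2-out: every pair of the 6 observations serves once as a test set.
leave_2_out = list(combinations(range(6), 2))

# Fano plane: 7 observations, 7 test sets of size 3, each pair together exactly once.
fano = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]

print(is_balanced(k_fold, 6))       # pairs in the same fold co-occur, others never do
print(is_balanced(leave_2_out, 6))
print(is_balanced(fano, 7))
```

In this sketch the K-fold partition is not pairwise balanced, whereas the complete leave-2-out design and the Fano plane are; the latter uses far fewer test sets than the complete leave-3-out design on 7 observations.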
References
Arlot, S., and A. Celisse. 2010. A survey of cross-validation procedures for model selection. Statistics Surveys 4:40–79.
Bailey, R. A., and P. J. Cameron. 2013. Using graphs to find the best block designs. In Topics in structural graph theory, vol. 147 of Encyclopedia Math. Appl., 282–317. Cambridge, UK: Cambridge University Press.
Bengio, Y., and Y. Grandvalet. 2003/2004. No unbiased estimator of the variance of K-fold cross-validation. Journal of Machine Learning Research 5:1089–105.
Cramér, H. 1946. Mathematical methods of statistics. Princeton Mathematical Series, vol. 9. Princeton, NJ: Princeton University Press.
Fuchs, M., R. Hornung, R. De Bin, and A. L. Boulesteix. 2013. A u-statistic estimator for the variance of resampling-based error estimators. Technical report, Ludwig Maximilian University of Munich, Munich, Germany.
Hastie, T., R. Tibshirani, and J. Friedman. 2009. The elements of statistical learning, 2nd ed. Springer Series in Statistics. New York, NY: Springer.
Hoeffding, W. 1948. A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics 19:293–325.
I-Cheng, Y. 2007. Modeling slump flow of concrete using second-order regressions and artificial neural networks. Cement and Concrete Composites 29 (6):474–80.
Lee, A. J. 1990. U-statistics, vol. 110 of Statistics: Textbooks and monographs. New York, NY: Marcel Dekker.
Maesono, Y. 1998. Asymptotic comparisons of several variance estimators and their effects for Studentizations. Annals of the Institute of Statistical Mathematics 50 (3):451–70. doi:10.1023/A:1003521327411.
Nadeau, C., and Y. Bengio. 2003. Inference for the generalization error. Machine Learning 52 (3):239–81. doi:10.1023/A:1024068626366.
Stinson, D. R. 2004. Combinatorial designs. New York, NY: Springer-Verlag.
Tang, B. 1999. Balanced bootstrap in sample surveys and its relationship with balanced repeated replication. Journal of Statistical Planning and Inference 81 (1):121–27.
Wallis, W. D., ed. 1996. Computational and constructive design theory, vol. 368 of Mathematics and its applications. Dordrecht, The Netherlands: Kluwer Academic. doi:10.1007/978-1-4757-2497-4.
Wang, Q., and B. Lindsay. 2014. Variance estimation of a general u-statistic with application to cross-validation. Statistica Sinica 24:1117–41.
Zhang, Q., and P. Z. G. Qian. 2013. Designs for crossvalidating approximation models. Biometrika 100 (4):997–1004.
Additional information
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/ujsp.
Supplemental data for this article can be accessed on the publisher’s website.
Cite this article
Fuchs, M., Krautenbacher, N. Minimization and estimation of the variance of prediction errors for cross-validation designs. J Stat Theory Pract 10, 420–443 (2016). https://doi.org/10.1080/15598608.2016.1158675