Minimization and estimation of the variance of prediction errors for cross-validation designs

Journal of Statistical Theory and Practice

Abstract

We consider the mean prediction error of a classification or regression procedure together with its cross-validation estimate, and investigate the variance of this estimate as a function of an arbitrary cross-validation design. We decompose this variance into a scalar product of coefficients and certain covariance expressions, such that the coefficients depend solely on the resampling design and the covariances depend solely on the data’s probability distribution. We rewrite this scalar product so that the initially large number of summands can be reduced step by step to three, provided that a quadratic approximation to the core covariances is valid. We present an analytical example in which this quadratic approximation holds exactly. Moreover, in this example, we show that the leave-p-out estimator of the error depends on p only through a constant and can therefore be written in a much simpler form. Furthermore, there is an unbiased estimator of the variance of K-fold cross-validation, in contrast to a claim in the literature. As a consequence, we can show that balanced incomplete block designs have smaller variance than K-fold cross-validation. This property is confirmed in a real data example from the UCI machine learning repository. Finally, we show how to find balanced incomplete block designs in practice.
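To make the notion of a balanced incomplete block design concrete, the following minimal Python sketch checks the defining balance property: all blocks have the same size, and every pair of observations occurs together in the same number of blocks. The function names `pair_counts` and `is_balanced_design` are hypothetical and not taken from the article, and the abstract does not say how blocks are mapped to learning or test sets, so the sketch covers only the combinatorial property. The Fano plane serves as a standard small example, and a partition into folds serves as the K-fold counterpart.

```python
from collections import Counter
from itertools import combinations


def pair_counts(n_points, blocks):
    """For every unordered pair of points, count the blocks containing both."""
    counts = Counter({pair: 0 for pair in combinations(range(n_points), 2)})
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            counts[pair] += 1
    return counts


def is_balanced_design(n_points, blocks):
    """True if all blocks have equal size and every pair of points occurs
    together in the same positive number of blocks (illustrative check only)."""
    sizes = {len(block) for block in blocks}
    lambdas = set(pair_counts(n_points, blocks).values())
    return len(sizes) == 1 and len(lambdas) == 1 and lambdas != {0}


# Fano plane: 7 points, 7 blocks of size 3, every pair together exactly once.
FANO = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5),
        (1, 4, 6), (2, 3, 6), (2, 4, 5)]

# A 3-fold partition of 6 points: pairs inside a fold meet once, others never.
THREE_FOLD = [(0, 1), (2, 3), (4, 5)]

if __name__ == "__main__":
    print(is_balanced_design(7, FANO))
    print(is_balanced_design(6, THREE_FOLD))
```

The contrast is the point of the example: a K-fold partition treats pairs of observations unequally (a pair is either always or never in the same fold), whereas a balanced design treats all pairs alike.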

References

  • Arlot, S., and A. Celisse. 2010. A survey of cross-validation procedures for model selection. Statistics Surveys 4:40–79.

  • Bailey, R. A., and P. J. Cameron. 2013. Using graphs to find the best block designs. In Topics in structural graph theory, vol. 147 of Encyclopedia Math. Appl., 282–317. Cambridge, UK: Cambridge University Press.

  • Bengio, Y., and Y. Grandvalet. 2003/2004. No unbiased estimator of the variance of K-fold cross-validation. Journal of Machine Learning Research 5:1089–105.

  • Cramér, H. 1946. Mathematical methods of statistics. Princeton Mathematical Series, vol. 9. Princeton, NJ: Princeton University Press.

  • Fuchs, M., R. Hornung, R. De Bin, and A. L. Boulesteix. 2013. A u-statistic estimator for the variance of resampling-based error estimators. Technical report, Ludwig Maximilian University of Munich, Munich, Germany.

  • Hastie, T., R. Tibshirani, and J. Friedman. 2009. The elements of statistical learning, 2nd ed. Springer Series in Statistics. New York, NY: Springer.

  • Hoeffding, W. 1948. A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics 19:293–325.

  • I-Cheng, Y. 2007. Modeling slump flow of concrete using second-order regressions and artificial neural networks. Cement and Concrete Composites 29 (6):474–80.

  • Lee, A. J. 1990. U-statistics, vol. 110 of Statistics: Textbooks and monographs. New York, NY: Marcel Dekker.

  • Maesono, Y. 1998. Asymptotic comparisons of several variance estimators and their effects for Studentizations. Annals of the Institute of Statistical Mathematics 50 (3):451–70. doi:10.1023/A:1003521327411.

  • Nadeau, C., and Y. Bengio. 2003. Inference for the generalization error. Machine Learning 52:239–81. doi:10.1023/A:1024068626366.

  • Stinson, D. R. 2004. Combinatorial designs. New York, NY: Springer-Verlag.

  • Tang, B. 1999. Balanced bootstrap in sample surveys and its relationship with balanced repeated replication. Journal of Statistical Planning and Inference 81 (1):121–27.

  • Wallis, W. D., ed. 1996. Computational and constructive design theory, vol. 368 of Mathematics and its applications. Dordrecht, The Netherlands: Kluwer Academic. doi:10.1007/978-1-4757-2497-4.

  • Wang, Q., and B. Lindsay. 2014. Variance estimation of a general u-statistic with application to cross-validation. Statistica Sinica 24:1117–41.

  • Zhang, Q., and P. Z. G. Qian. 2013. Designs for crossvalidating approximation models. Biometrika 100 (4):997–1004.

Author information

Corresponding author

Correspondence to Mathias Fuchs.

Additional information

Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/ujsp.

Supplemental data for this article can be accessed on the publisher’s website.

About this article

Cite this article

Fuchs, M., Krautenbacher, N. Minimization and estimation of the variance of prediction errors for cross-validation designs. J Stat Theory Pract 10, 420–443 (2016). https://doi.org/10.1080/15598608.2016.1158675

  • DOI: https://doi.org/10.1080/15598608.2016.1158675
