Inference for the Generalization Error

Nadeau, Claude; Bengio, Yoshua

doi:10.1023/A:1024068626366

Published: September 2003

Machine Learning, Volume 52, pages 239–281 (2003)

Abstract

To compare learning algorithms, experimental results reported in the machine learning literature often use statistical tests of significance to support the claim that a new learning algorithm generalizes better. Such tests should take into account the variability due to the choice of training set, and not only the variability due to the test examples, as is often the case. Ignoring the training-set variability could lead to gross underestimation of the variance of the cross-validation estimator, and to the wrong conclusion that the new algorithm is significantly better when it is not. We perform a theoretical investigation of the variance of a variant of the cross-validation estimator of the generalization error that takes into account the variability due to the randomness of the training set as well as of the test examples. Our analysis shows that all the variance estimators that are based only on the results of the cross-validation experiment must be biased. This analysis allows us to propose new estimators of this variance. We show, via simulations, that tests of hypothesis about the generalization error using those new variance estimators have better properties than tests involving variance estimators currently in use and listed in Dietterich (1998). In particular, the new tests have correct size and good power. That is, the new tests do not reject the null hypothesis too often when the hypothesis is true, but they tend to frequently reject the null hypothesis when the latter is false.
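The abstract does not state the new variance estimators explicitly. As a rough illustration only, the sketch below contrasts a naive t-statistic with the "corrected resampled t" statistic that is commonly attributed to the full paper, in which the sample variance of per-split differences is scaled by 1/J + n_test/n_train instead of 1/J. The function name, argument names, and input format are assumptions made for this example, not the authors' code.

```python
import math
import statistics


def resampled_t_statistics(diffs, n_train, n_test):
    """Return (naive_t, corrected_t) for per-split loss differences.

    diffs   : difference in test loss between two algorithms on each of
              J random train/test splits (hypothetical input format).
    n_train : number of training examples in each split.
    n_test  : number of test examples in each split.
    """
    j = len(diffs)
    if j < 2:
        raise ValueError("need at least two splits")
    mean_diff = statistics.fmean(diffs)
    sample_var = statistics.variance(diffs)  # unbiased, divides by J - 1
    if sample_var == 0:
        raise ValueError("differences have zero variance")

    # Naive statistic: treats the J differences as independent draws,
    # ignoring that the training sets overlap across splits.
    naive_t = mean_diff / math.sqrt(sample_var / j)

    # Corrected statistic (assumed form): inflate the variance by
    # n_test / n_train to account for the dependence between splits.
    corrected_t = mean_diff / math.sqrt((1.0 / j + n_test / n_train) * sample_var)
    return naive_t, corrected_t


if __name__ == "__main__":
    # Made-up differences, for illustration only.
    example_diffs = [0.02, 0.01, 0.03, 0.02, 0.04, 0.00, 0.02, 0.03, 0.01, 0.02]
    naive, corrected = resampled_t_statistics(example_diffs, n_train=900, n_test=100)
    print(f"naive t = {naive:.3f}, corrected t = {corrected:.3f}")
```

Under this assumed form the corrected statistic is always smaller in magnitude than the naive one, so a test based on it rejects the null hypothesis less readily. Both statistics would typically be compared against a Student t distribution with J - 1 degrees of freedom.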

References

Blake, C., Keogh, E., & Merz, C.-J. (1998). UCI repository of machine learning databases.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 24:6, 2350–2383.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Wadsworth International Group.
Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:2, 1–47.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Springer-Verlag.
Dietterich, T. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:7, 1895–1924.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Monographs on Statistics and Applied Probability 57. New York, NY: Chapman & Hall.
Everitt, B. (1977). The analysis of contingency tables. London: Chapman & Hall.
Goutte, C. (1997). Note on free lunches and cross-validation. Neural Computation, 9:6, 1053–1059.
Hinton, G., Neal, R., Tibshirani, R., & DELVE team members. (1995). Assessing learning procedures using DELVE. Technical report, University of Toronto, Department of Computer Science.
Kearns, M., & Ron, D. (1997). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Tenth Annual Conference on Computational Learning Theory (pp. 152–162). Morgan Kaufmann.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (pp. 1137–1143). Morgan Kaufmann.
Kolen, J., & Pollack, J. (1991). Back propagation is sensitive to initial conditions. Advances in Neural Information Processing Systems (pp. 860–867). San Francisco, CA: Morgan Kaufmann.
Nadeau, C., & Bengio, Y. (1999). Inference for the generalisation error. Technical report 99s-25, CIRANO.
Vapnik, V. (1982). Estimation of dependences based on empirical data. Berlin: Springer-Verlag.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.
Wolpert, D., & Macready, W. (1995). No free lunch theorems for search. Technical report SFI-TR-95-02-010, The Santa Fe Institute.
Zhu, H., & Rohwer, R. (1996). No free lunch for cross validation. Neural Computation, 8:7, 1421–1426.

Author information

Authors and Affiliations

Health Canada, AL0900B1, Ottawa, ON, Canada, K1A 0L2
Claude Nadeau
CIRANO and Dept. IRO, Université de Montréal, C.P. 6128 Succ. Centre-Ville, Montréal, Quebec, Canada, H3C 3J7
Yoshua Bengio

About this article

Cite this article

Nadeau, C., Bengio, Y. Inference for the Generalization Error. Machine Learning 52, 239–281 (2003). https://doi.org/10.1023/A:1024068626366

Issue Date: September 2003
DOI: https://doi.org/10.1023/A:1024068626366