Data Mining and Knowledge Discovery, Volume 1, Issue 1, pp. 55–77

On Bias, Variance, 0/1—Loss, and the Curse-of-Dimensionality

  • Jerome H. Friedman
Article

Abstract

The classification problem is considered in which an output variable y assumes discrete values with respective probabilities that depend upon the simultaneous values of a set of input variables x = {x_1, ..., x_n}. At issue is how error in the estimates of these probabilities affects classification error when the estimates are used in a classification rule. These effects are seen to be somewhat counterintuitive in both their strength and nature. In particular, the bias and variance components of the estimation error combine to influence classification in a very different way than with squared error on the probabilities themselves. Certain types of (very high) bias can be canceled by low variance to produce accurate classification. This can dramatically mitigate the effect of the bias associated with some simple estimators like “naive” Bayes, and the bias induced by the curse-of-dimensionality on nearest-neighbor procedures. This helps explain why such simple methods are often competitive with and sometimes superior to more sophisticated ones for classification, and why “bagging/aggregating” classifiers can often improve accuracy. These results also suggest simple modifications to these procedures that can (sometimes dramatically) further improve their classification performance.

Keywords: classification, bias, variance, curse-of-dimensionality, bagging, naive Bayes, nearest-neighbors
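The abstract's central claim, that a heavily biased but stable probability estimate can still classify as well as the Bayes rule while an unbiased but noisy one cannot, is easy to check numerically. The Python sketch below is not from the paper; the true class probability and the two hypothetical estimators (one biased with low variance, one unbiased with high variance) are invented purely for illustration.

```python
# A minimal Monte Carlo sketch (assumptions, not the paper's experiment):
# at a fixed query point x the true P(y=1|x) is 0.7, so the Bayes rule
# predicts class 1 with error 0.3.  Estimator A is badly biased but stable;
# estimator B is unbiased but noisy.  Under squared error A looks worse,
# under 0/1 loss A is better, which is the abstract's point.
import numpy as np

rng = np.random.default_rng(0)
n_rep = 100_000           # replications of the training-set randomness
p_true = 0.7              # true P(y = 1 | x) at the query point
bayes_error = 1 - p_true  # error of the optimal (Bayes) decision rule

# Estimator A: large bias (centered at 0.95), tiny variance.
p_hat_a = rng.normal(loc=0.95, scale=0.02, size=n_rep)
# Estimator B: no bias (centered at 0.70), large variance.
p_hat_b = rng.normal(loc=0.70, scale=0.20, size=n_rep)

for name, p_hat in [("A: biased, low variance", p_hat_a),
                    ("B: unbiased, high variance", p_hat_b)]:
    p_hat = np.clip(p_hat, 0.0, 1.0)
    mse = np.mean((p_hat - p_true) ** 2)   # squared error on the probability
    wrong_side = np.mean(p_hat < 0.5)      # estimate lands on the wrong side of 1/2
    err01 = bayes_error + wrong_side * (2 * p_true - 1)  # expected 0/1 error
    print(f"{name:27s}  MSE = {mse:.3f}   expected 0/1 error = {err01:.3f}")
```

Run as written, estimator A has the larger squared error on the probability (about 0.06 versus 0.04) yet attains the Bayes error rate of 0.30, because its estimate essentially never crosses the 1/2 decision boundary; estimator B crosses it in roughly 16% of replications and so classifies noticeably worse (about 0.36). High bias is harmless under 0/1 loss as long as it does not push the estimate across the decision boundary.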


References

  1. Bellman, R.E. 1961. Adaptive Control Processes. Princeton University Press.
  2. Breiman, L. 1995. Bagging predictors. Dept. of Statistics, University of California, Berkeley, Technical Report.
  3. Breiman, L. 1996. Bias, variance, and arcing classifiers. Dept. of Statistics, University of California, Technical Report (revised).
  4. Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. 1984. Classification and Regression Trees. Wadsworth.
  5. Chow, W.S. and Chen, Y.C. 1992. A new fast algorithm for effective training of neural classifiers. Pattern Recognition, 25:423–429.
  6. Dietterich, T.G. and Kong, E.B. 1995. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Dept. of Computer Science, Oregon State University, Technical Report.
  7. Efron, B. and Tibshirani, R. 1995. Cross-validation and the bootstrap: Estimating the error rate of a prediction rule. Dept. of Statistics, Stanford University, Technical Report.
  8. Fix, E. and Hodges, J.L. 1951. Discriminatory analysis, nonparametric discrimination: Consistency properties. Randolph Field, Texas: U.S. Air Force School of Aviation Medicine, Technical Report No. 4.
  9. Friedman, J.H. 1985. Classification and multiple response regression through projection pursuit. Dept. of Statistics, Stanford University, Technical Report LCS012.
  10. Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Computation, 4:1–48.
  11. Good, I.J. 1965. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. M.I.T. Press.
  12. Hand, D.J. 1982. Kernel Discriminant Analysis. Chichester: Research Studies Press.
  13. Heckerman, D., Geiger, D., and Chickering, D. 1994. Learning Bayesian networks: The combination of knowledge and statistical data. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pp. 293–301, AAAI Press and M.I.T. Press.
  14. Henley, W.E. and Hand, D.J. 1996. A k-nearest neighbour classifier for assessing consumer credit risk. The Statistician, 45:77–95.
  15. Holte, R.C. 1993. Very simple classification rules perform well on most commonly used data sets. Machine Learning, 11:63–90.
  16. Kohavi, R. and Wolpert, D.H. 1996. Bias plus variance decomposition for zero-one loss functions. Dept. of Computer Science, Stanford University, Technical Report.
  17. Kohonen, T. 1990. The self-organizing map. Proceedings of the IEEE, 78:1464–1480.
  18. Langley, P., Iba, W., and Thompson, K. 1992. An analysis of Bayesian classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 223–228, AAAI Press and M.I.T. Press.
  19. Lippmann, R. 1989. Pattern classification using neural networks. IEEE Communications Magazine, 11:47–64.
  20. McLachlan, G.J. 1992. Discriminant Analysis and Statistical Pattern Recognition. Wiley.
  21. Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
  22. Rosen, D.B., Burke, H.B., and Goodman, P.H. 1995. Local learning methods in high dimension: Beating the bias-variance dilemma via recalibration. Workshop Machines That Learn: Neural Networks for Computing, Snowbird, Utah.
  23. Tibshirani, R. 1996. Bias, variance and prediction error for classification rules. Dept. of Statistics, University of Toronto, Technical Report.
  24. Titterington, D.M., Murray, G.D., Murray, L.S., Spiegelhalter, D.J., Skene, A.M., Habbema, J.D.F., and Gelpke, G.J. 1981. Comparison of discrimination techniques applied to a complex data set of head injured patients. J. Roy. Statist. Soc. A, 144:145–175.

Copyright information

© Kluwer Academic Publishers 1997

Authors and Affiliations

  • Jerome H. Friedman
    1. Department of Statistics and Stanford Linear Accelerator Center, Stanford University, USA
