# On Bias, Variance, 0/1—Loss, and the Curse-of-Dimensionality

## Abstract

The classification problem is considered in which an output variable *y* assumes discrete values with respective probabilities that depend upon the simultaneous values of a set of input variables x = {x_1, ..., x_n}. At issue is how error in the estimates of these probabilities affects classification error when the estimates are used in a classification rule. These effects are seen to be somewhat counterintuitive in both their strength and nature. In particular, the bias and variance components of the estimation error combine to influence classification in a very different way than with squared error on the probabilities themselves. Certain types of (very high) bias can be canceled by low variance to produce accurate classification. This can dramatically mitigate the effect of the bias associated with some simple estimators like “naive” Bayes, and the bias induced by the curse-of-dimensionality on nearest-neighbor procedures. This helps explain why such simple methods are often competitive with and sometimes superior to more sophisticated ones for classification, and why “bagging/aggregating” classifiers can often improve accuracy. These results also suggest simple modifications to these procedures that can (sometimes dramatically) further improve their classification performance.
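The contrast between squared error and classification error can be made concrete with a small two-class sketch. It assumes, purely for illustration, that the probability estimate at a given input is approximately normally distributed, and that class 1 is predicted whenever the estimate exceeds 1/2. The function names and all numeric values below are hypothetical and are not taken from the paper.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def squared_error(f, mean_est, sd_est):
    """Expected squared error of the estimate: bias^2 + variance."""
    return (mean_est - f) ** 2 + sd_est ** 2


def wrong_side_prob(f, mean_est, sd_est):
    """Probability that the estimate falls on the opposite side of 1/2
    from the true probability f, so that the rule disagrees with the
    optimal (Bayes) decision. Assumes a normal estimate (illustrative)."""
    p_below_half = normal_cdf((0.5 - mean_est) / sd_est)
    return p_below_half if f > 0.5 else 1.0 - p_below_half


f_true = 0.6  # hypothetical true P(y = 1 | x)

# Hypothetical estimators, given as (mean, standard deviation):
biased_low_var = (0.95, 0.05)    # large bias, same side of 1/2 as f_true
unbiased_high_var = (0.60, 0.20)  # no bias, large variance
wrong_side_bias = (0.45, 0.05)   # modest bias, but across the 1/2 boundary

for name, (m, s) in [
    ("biased, low variance", biased_low_var),
    ("unbiased, high variance", unbiased_high_var),
    ("bias across boundary", wrong_side_bias),
]:
    print(f"{name:24s} squared error = {squared_error(f_true, m, s):.4f}  "
          f"P(wrong decision) = {wrong_side_prob(f_true, m, s):.4f}")
```

Under these assumptions the heavily biased, low-variance estimator has roughly three times the squared error of the unbiased one (0.125 versus 0.04), yet it almost never crosses the 1/2 boundary, whereas the unbiased estimator lands on the wrong side about 31% of the time. The third case shows why only certain types of bias are harmless: once the bias pushes the mean estimate across the boundary, low variance makes the wrong decision more likely, not less.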
