## Abstract

Despite its simplicity, the naive Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how it performs in domains where the predicted value is numeric, because in this case, predictions are more sensitive to inaccurate probability estimates.

This paper shows how to apply the naive Bayes methodology to numeric prediction (i.e., regression) tasks by modeling the probability distribution of the target value with kernel density estimators, and compares it to linear regression, locally weighted linear regression, and a method that produces “model trees”—decision trees with linear regression functions at the leaves. Although we exhibit an artificial dataset for which naive Bayes is the method of choice, on real-world datasets it is almost uniformly worse than locally weighted linear regression and model trees. The comparison with linear regression depends on the error measure: for one measure naive Bayes performs similarly, while for another it is worse. We also show that standard naive Bayes applied to regression problems by discretizing the target value performs similarly badly. We then present empirical evidence that isolates naive Bayes' independence assumption as the culprit for its poor performance in the regression setting. These results indicate that the simplistic statistical assumption that naive Bayes makes is indeed more restrictive for regression than for classification.
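The approach described above can be sketched in code. The following is a minimal illustration, not the paper's implementation: it estimates p(y) with a univariate Gaussian kernel density estimator and each p(x_j | y) via a bivariate kernel estimate of p(x_j, y) divided by p(y), combines them under the naive independence assumption, and predicts the posterior-weighted mean of the training target values. The class name, the fixed shared bandwidth, and the choice of evaluating the posterior only at training targets are simplifying assumptions for illustration.

```python
import numpy as np

def gauss_kernel(u):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

class NBKernelRegressor:
    """Illustrative naive Bayes for regression via kernel density estimation.

    Assumes a single shared bandwidth for all variables; a real
    implementation would select bandwidths per variable from the data.
    """

    def __init__(self, bandwidth=0.5):
        self.h = bandwidth

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y, dtype=float)
        return self

    def predict(self, Xq):
        Xq = np.asarray(Xq, dtype=float)
        grid = self.y  # evaluate the posterior at the training target values
        # p(y): univariate KDE over the target values, shape (n,)
        ky = gauss_kernel((grid[:, None] - self.y[None, :]) / self.h)
        py = ky.mean(axis=1) / self.h
        preds = []
        for xq in Xq:
            log_post = np.log(py + 1e-300)  # start from the prior p(y)
            for j in range(self.X.shape[1]):
                # p(x_j | y) approximated by the joint KDE p(x_j, y) / p(y)
                kx = gauss_kernel((xq[j] - self.X[:, j]) / self.h)
                pxy = (ky * kx[None, :]).mean(axis=1) / self.h ** 2
                log_post += np.log(pxy + 1e-300) - np.log(py + 1e-300)
            # Prediction: posterior-weighted mean of candidate target values
            w = np.exp(log_post - log_post.max())
            preds.append(np.sum(w * grid) / np.sum(w))
        return np.array(preds)
```

On a symmetric toy problem such as five points with y equal to x, querying the midpoint yields the midpoint target by symmetry, which gives a quick sanity check of the posterior weighting.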
