Machine Learning, Volume 41, Issue 1, pp 5–25

Technical Note: Naive Bayes for Regression

  • Eibe Frank
  • Leonard Trigg
  • Geoffrey Holmes
  • Ian H. Witten


Despite its simplicity, the naive Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how it performs in domains where the predicted value is numeric, because in this case, predictions are more sensitive to inaccurate probability estimates.

This paper shows how to apply the naive Bayes methodology to numeric prediction (i.e., regression) tasks by modeling the probability distribution of the target value with kernel density estimators, and compares it to linear regression, locally weighted linear regression, and a method that produces “model trees”—decision trees with linear regression functions at the leaves. Although we exhibit an artificial dataset for which naive Bayes is the method of choice, on real-world datasets it is almost uniformly worse than locally weighted linear regression and model trees. The comparison with linear regression depends on the error measure: for one measure naive Bayes performs similarly, while for another it is worse. We also show that standard naive Bayes applied to regression problems by discretizing the target value performs similarly badly. We then present empirical evidence that isolates naive Bayes' independence assumption as the culprit for its poor performance in the regression setting. These results indicate that the simplistic statistical assumption that naive Bayes makes is indeed more restrictive for regression than for classification.
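The core idea of the abstract, naive Bayes with kernel density estimates of a numeric target, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes Gaussian kernels, hand-picked fixed bandwidths (`h_a`, `h_y`), numeric attributes only, and a prediction taken as the posterior mean over an evenly spaced grid of target values. All function and parameter names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde_1d(points, x, h):
    """One-dimensional kernel density estimate of p(x) with bandwidth h."""
    return sum(gaussian_kernel((x - p) / h) for p in points) / (len(points) * h)


def kde_2d(pairs, a, y, h_a, h_y):
    """Product-kernel estimate of the joint density p(a, y)."""
    total = sum(
        gaussian_kernel((a - pa) / h_a) * gaussian_kernel((y - py) / h_y)
        for pa, py in pairs
    )
    return total / (len(pairs) * h_a * h_y)


def nb_regress_predict(X, y, query, grid_size=50, h_a=0.5, h_y=0.5):
    """Predict a numeric target under the naive Bayes independence assumption.

    Uses p(y | a) proportional to p(y) * prod_i p(a_i | y), where each
    p(a_i | y) = p(a_i, y) / p(y) comes from kernel density estimates.
    The posterior is evaluated on a grid and its mean is returned.
    """
    lo, hi = min(y), max(y)
    grid = [lo + (hi - lo) * k / (grid_size - 1) for k in range(grid_size)]

    weights = []
    for t in grid:
        p_t = kde_1d(y, t, h_y)          # prior density of the target at t
        w = p_t
        for i, a in enumerate(query):    # one factor per attribute
            pairs = [(row[i], target) for row, target in zip(X, y)]
            w *= kde_2d(pairs, a, t, h_a, h_y) / p_t
        weights.append(w)

    total = sum(weights)                 # normalizing constant over the grid
    return sum(t * w for t, w in zip(grid, weights)) / total


# Toy data: the target is twice the single attribute.
X_train = [[float(v)] for v in range(11)]
y_train = [2.0 * row[0] for row in X_train]
prediction = nb_regress_predict(X_train, y_train, [5.0])
```

Because every attribute contributes an independent factor, correlated attributes are effectively counted more than once, which is the kind of effect the abstract attributes to the independence assumption. In this sketch a query far from all training data can drive every grid weight to zero, so a practical version would need to guard the final division.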

Keywords: naive Bayes, regression, model trees, linear regression, locally weighted regression



Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Eibe Frank, Department of Computer Science, University of Waikato, Hamilton, New Zealand
  • Leonard Trigg, Department of Computer Science, University of Waikato, Hamilton, New Zealand
  • Geoffrey Holmes, Department of Computer Science, University of Waikato, Hamilton, New Zealand
  • Ian H. Witten, Department of Computer Science, University of Waikato, Hamilton, New Zealand
