Machine Learning, Volume 74, Issue 1, pp 39–74

Discretization for naive-Bayes learning: managing discretization bias and variance

Abstract

Quantitative attributes are usually discretized in Naive-Bayes learning. We establish simple conditions under which discretization is equivalent to use of the true probability density function during naive-Bayes learning. The use of different discretization techniques can be expected to affect the classification bias and variance of generated naive-Bayes classifiers, effects we name discretization bias and variance. We argue that by properly managing discretization bias and variance, we can effectively reduce naive-Bayes classification error. In particular, we supply insights into managing discretization bias and variance by adjusting the number of intervals and the number of training instances contained in each interval. We accordingly propose proportional discretization and fixed frequency discretization, two efficient unsupervised discretization methods that are able to effectively manage discretization bias and variance. We evaluate our new techniques against four key discretization methods for naive-Bayes classifiers. The experimental results support our theoretical analyses by showing that with statistically significant frequency, naive-Bayes classifiers trained on data discretized by our new methods are able to achieve lower classification error than those trained on data discretized by current established discretization methods.
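
The two proposed methods reduce to simple cut-point rules, so a short sketch may help make them concrete. The following is a minimal Python/NumPy sketch, not the authors' code: the sqrt(n) sizing for proportional discretization and the default sufficient frequency m = 30 for fixed frequency discretization follow the paper's description, while tie handling and the treatment of a final undersized interval are simplified.

```python
import numpy as np

def proportional_discretization(values):
    # PD: set both the number of intervals and the expected number of
    # instances per interval to ~sqrt(n), so that discretization bias and
    # variance both decrease as training data grow.
    n = len(values)
    k = max(1, int(round(np.sqrt(n))))
    # Interior quantile points yield k equal-frequency intervals of
    # ~sqrt(n) instances each.
    probs = np.linspace(0.0, 1.0, k + 1)[1:-1]
    return np.quantile(values, probs)

def fixed_frequency_discretization(values, m=30):
    # FFD: place a cut after every m-th sorted instance, holding the
    # per-interval frequency (hence discretization variance) roughly
    # constant; additional data form additional intervals, reducing bias.
    sorted_vals = np.sort(np.asarray(values))
    n = len(sorted_vals)
    return sorted_vals[m:n:m]  # cut points at every m-th instance
```

For example, on 1000 values, `proportional_discretization` produces about 32 intervals of roughly 31 instances each, while `fixed_frequency_discretization` produces intervals of 30 instances regardless of n; `np.digitize(x, cuts)` then maps each instance to its interval index for naive-Bayes probability estimation.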

Keywords

Discretization · Naive-Bayes learning · Bias · Variance

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  1. Australian Taxation Office, Box Hill, Australia
  2. Faculty of Information Technology, Monash University, Clayton, Australia
