# Discretization for naive-Bayes learning: managing discretization bias and variance

## Abstract

Quantitative attributes are usually discretized in naive-Bayes learning. We establish simple conditions under which discretization is equivalent to using the true probability density function during naive-Bayes learning. Different discretization techniques can be expected to affect the classification bias and variance of the resulting naive-Bayes classifiers, effects we name *discretization bias* and *variance*. We argue that properly managing discretization bias and variance can effectively reduce naive-Bayes classification error. In particular, we supply insights into managing discretization bias and variance by adjusting the number of intervals and the number of training instances contained in each interval. We accordingly propose *proportional discretization* and *fixed frequency discretization*, two efficient unsupervised discretization methods that effectively manage discretization bias and variance. We evaluate our new techniques against four key discretization methods for naive-Bayes classifiers. The experimental results support our theoretical analyses: with statistically significant frequency, naive-Bayes classifiers trained on data discretized by our new methods achieve lower classification error than those trained on data discretized by currently established discretization methods.
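The trade-off between the number of intervals and the number of training instances per interval can be illustrated with a minimal Python sketch. The abstract does not state the rules behind the two proposed methods, so the sketch rests on assumptions: proportional discretization is taken to set both the interval count and the interval frequency to roughly the square root of the training set size, and fixed frequency discretization is taken to place about `m` instances in each interval, with `m=30` as an illustrative default. The function names are hypothetical and this is not the authors' implementation.

```python
import math
from bisect import bisect_right


def equal_frequency_cuts(values, n_intervals):
    """Cut points that split the values into n_intervals bins of near-equal size."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for k in range(1, n_intervals):
        idx = k * n // n_intervals
        # Place the cut midway between the two neighbouring sorted values.
        cut = (ordered[idx - 1] + ordered[idx]) / 2
        # Skip repeated cut points caused by many identical values.
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


def proportional_cuts(values):
    """Assumed rule: interval count and interval frequency both near sqrt(n)."""
    n_intervals = max(1, math.isqrt(len(values)))
    return equal_frequency_cuts(values, n_intervals)


def fixed_frequency_cuts(values, m=30):
    """Assumed rule: about m training instances per interval (m is illustrative)."""
    n_intervals = max(1, len(values) // m)
    return equal_frequency_cuts(values, n_intervals)


def interval_index(x, cuts):
    """Index of the interval that a value x falls into, given sorted cut points."""
    return bisect_right(cuts, x)
```

Under these assumptions, the interval count grows with the training set size in both cases. With the proportional rule the interval frequency grows as well, whereas the fixed frequency rule holds it near `m` and lets only the interval count increase.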

### Keywords

Discretization, Naive-Bayes Learning, Bias, Variance