Abstract
This paper argues that two commonly used discretization approaches, fixed k-interval discretization and entropy-based discretization, have suboptimal characteristics for naive-Bayes classification. This analysis leads to a new discretization method, Proportional k-Interval Discretization (PKID), which adjusts both the number and the size of the discretized intervals to the number of training instances, thus seeking an appropriate trade-off between the bias and variance of the probability estimation for naive-Bayes classifiers. We justify PKID in theory and test it on a wide cross-section of datasets. Our experimental results suggest that, in comparison to its alternatives, PKID provides naive-Bayes classifiers with competitive classification performance on smaller datasets and better classification performance on larger datasets.
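The core idea in the abstract — coupling both the number of intervals and the instances per interval to the training-set size, so that each is roughly the square root of n — can be sketched as a simple equal-frequency procedure. This is an illustrative sketch only, not the authors' implementation: it assumes k = floor(sqrt(n)) intervals and ignores the paper's finer boundary rules (e.g. handling of ties and leftover instances).

```python
import math

def pkid_discretize(values):
    """Sketch of Proportional k-Interval Discretization (PKID).

    Both the number of intervals and the instances per interval
    are set to roughly sqrt(n), so both grow with training size:
    more data buys lower-variance estimates AND finer intervals
    (lower bias), rather than fixing one of the two in advance.
    """
    n = len(values)
    k = max(1, int(math.sqrt(n)))   # ~sqrt(n) intervals
    size = n // k                   # ~sqrt(n) instances per interval
    ordered = sorted(values)
    # Cut points fall between consecutive equal-frequency blocks.
    cuts = [ordered[i * size] for i in range(1, k)]

    def interval_index(x):
        # Interval id = number of cut points strictly below x.
        return sum(1 for c in cuts if x > c)

    return cuts, interval_index
```

A naive-Bayes learner would then estimate, for each class, the frequency of each interval index in place of a density over the raw continuous attribute; with a fixed small k those frequency estimates have low variance but high bias, while PKID lets both shrink as n grows.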
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
Cite this paper
Yang, Y., Webb, G.I. (2001). Proportional k-Interval Discretization for Naive-Bayes Classifiers. In: De Raedt, L., Flach, P. (eds) Machine Learning: ECML 2001. ECML 2001. Lecture Notes in Computer Science(), vol 2167. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44795-4_48
DOI: https://doi.org/10.1007/3-540-44795-4_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42536-6
Online ISBN: 978-3-540-44795-5