Abstract
In supervised machine learning, some algorithms are restricted to discrete data and have to discretize continuous attributes. Many discretization methods, based on statistical criteria, information content, or other specialized criteria, have been studied in the past. In this paper, we propose the discretization method Khiops, based on the chi-square statistic. In contrast with the related methods ChiMerge and ChiSplit, this method optimizes the chi-square criterion globally over the whole discretization domain and does not require any stopping criterion. A theoretical study followed by experiments demonstrates the robustness and the good predictive performance of the method.
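To make the distinction between a local and a global chi-square criterion concrete, the following is a minimal sketch, not the Khiops algorithm itself. It assumes a discretization is summarized as a contingency table of class counts per interval; the function names `chi_square` and `merge_rows` and the example counts are hypothetical and chosen only for illustration.

```python
def chi_square(table):
    """Pearson chi-square statistic of a contingency table.

    Rows are intervals of the discretized attribute, columns are classes.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty rows or columns
                stat += (observed - expected) ** 2 / expected
    return stat


def merge_rows(table, i):
    """Return a new table in which adjacent intervals i and i+1 are merged."""
    merged = [a + b for a, b in zip(table[i], table[i + 1])]
    return table[:i] + [merged] + table[i + 2:]


# Hypothetical class counts: three adjacent intervals, two classes.
counts = [[10, 0], [8, 2], [1, 9]]

# Global view: chi-square of the whole interval-by-class table.
global_stat = chi_square(counts)

# Local view: chi-square restricted to one pair of adjacent intervals.
local_stat = chi_square(counts[0:2])

# Effect of merging intervals 0 and 1, measured on the whole table.
after_merge = chi_square(merge_rows(counts, 0))

print(global_stat, local_stat, after_merge)
```

In this sketch the local statistic looks only at two neighboring intervals, whereas the global statistic scores the complete discretization, so the effect of a merge can be judged against the whole table.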
References
Bertelsen, R., & Martinez, T. R. (1994). Extending ID3 through discretization of continuous inputs. In Proceedings of the 7th Florida Artificial Intelligence Research Symposium (pp. 122–125). Florida AI Research Society.
Bertier, P., & Bouroche, J. M. (1981). Analyse des données multidimensionnelles. Presses Universitaires de France.
Blake, C. L., & Merz, C. J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
Boullé, M. (2001). Khiops: Discrétisation des attributs numériques pour le Data Mining. Note technique NT/FTR&D/7339. France Telecom R&D.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. California: Wadsworth International.
Burdsall, B., & Giraud-Carrier, C. (1997). Evolving fuzzy prototypes for efficient data clustering. In Proceedings of the Second International ICSC Symposium on Fuzzy Logic and Applications (ISFL'97) (pp. 217–223).
Catlett, J. (1991). On changing continuous attributes into ordered discrete attributes. In Proceedings of the European Working Session on Learning (pp. 87–102). Springer-Verlag.
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In Proceedings of the 12th International Conference on Machine Learning (pp. 194–202). San Francisco, CA: Morgan Kaufmann.
Fayyad, U., & Irani, K. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8, 87–102.
Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11, 63–90.
Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29:2, 119–127.
Kerber, R. (1991). ChiMerge discretization of numeric attributes. In Proceedings of the 10th International Conference on Artificial Intelligence (pp. 123–128).
Langley, P., Iba, W., & Thompson, K. (1992). An analysis of Bayesian classifiers. In Proceedings of the 10th National Conference on Artificial Intelligence (pp. 223–228). AAAI Press.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Zighed, D. A., Rabaseda, S., & Rakotomalala, R. (1998). Fusinter: A method for discretization of continuous attributes for supervised learning. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6:33, 307–326.
Zighed, D. A., & Rakotomalala, R. (2000). Graphes d'induction (pp. 327–359). HERMES Science Publications.
Cite this article
Boullé, M. Khiops: A Statistical Discretization Method of Continuous Attributes. Machine Learning 55, 53–69 (2004). https://doi.org/10.1023/B:MACH.0000019804.29836.05
DOI: https://doi.org/10.1023/B:MACH.0000019804.29836.05