Using Resampling Techniques for Better Quality Discretization
Many supervised induction algorithms require discrete data, but real data often contain both discrete and continuous attributes. Quality discretization of continuous attributes is an important problem that affects the accuracy, complexity, variance and understandability of the induction model. Usually, discretization and other statistical processes are applied to subsets of the population, because the entire population is practically inaccessible. For this reason, we argue that a discretization performed on a sample is only an estimate of the discretization of the entire population. Most existing discretization methods partition the attribute range into two or more intervals using a single cut point or a set of cut points. In this paper, we introduce two variants of a resampling technique (such as the bootstrap) to generate a set of candidate discretization points, thereby improving discretization quality by providing a better estimate of the entire population. The goal of this paper is therefore to observe whether this type of resampling can lead to better quality discretization points, which opens up a new paradigm for the construction of soft decision trees.
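The idea of generating candidate discretization points by resampling can be illustrated with a minimal sketch. The abstract does not specify the two variants, so the code below is only one plausible reading, under stated assumptions: each bootstrap resample is discretized with a single entropy-minimising binary cut point, and the collected cut points form the candidate set. The function names (`entropy`, `best_cut_point`, `bootstrap_cut_points`), the parameters `n_resamples` and `seed`, and the toy data are hypothetical and chosen for illustration.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut_point(values, labels):
    """Return the midpoint that minimises weighted class entropy, or None.

    Assumption: a single binary cut chosen by class-information entropy;
    the paper's own discretization criterion may differ.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut


def bootstrap_cut_points(values, labels, n_resamples=200, seed=0):
    """Collect one candidate cut point per bootstrap resample."""
    rng = random.Random(seed)
    n = len(values)
    cuts = []
    for _ in range(n_resamples):
        # Draw n indices with replacement (the bootstrap step).
        idx = [rng.randrange(n) for _ in range(n)]
        cut = best_cut_point([values[i] for i in idx], [labels[i] for i in idx])
        if cut is not None:  # skipped when all resampled values are equal
            cuts.append(cut)
    return cuts


# Hypothetical toy data: one continuous attribute and a binary class.
values = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
candidates = bootstrap_cut_points(values, labels)
print(Counter(candidates).most_common(3))
```

The spread of the collected cut points gives a rough picture of how stable a single-sample cut point is. A soft decision tree could, for instance, use that spread to define a fuzzy boundary instead of one crisp threshold, although how the paper does this is not stated in the abstract.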
Keywords: Bootstrap, discretization, resampling