
Optimum simultaneous discretization with data grid models in supervised classification: a Bayesian model selection approach

  • Marc Boullé
Regular Article

Abstract

In the domain of data preparation for supervised classification, filter methods for variable ranking are time efficient. However, their intrinsic univariate limitation prevents them from detecting redundancies or constructive interactions between variables. This paper introduces a new method to automatically, rapidly and reliably extract the classificatory information of a pair of input variables. It is based on a simultaneous partitioning of the domain of each input variable, into intervals in the numerical case and into groups of categories in the categorical case. The resulting input data grid makes it possible to quantify the joint information between the two input variables and the output variable. The best joint partitioning is found by maximizing a Bayesian model selection criterion. Intensive experiments demonstrate the benefits of the approach, especially the significant improvement of accuracy for classification tasks.
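The scoring step sketched in the abstract can be illustrated in code: partition two numeric inputs into intervals, cross the intervals into a data grid, and score the grid by how well its cells separate the classes. This is a minimal, hypothetical stand-in, not the paper's actual MODL criterion: `grid_score` is an illustrative name, and the BIC-style penalty below is a generic substitute for the Bayesian model selection criterion derived in the paper.

```python
import numpy as np

def grid_score(x1, x2, y, b1, b2, n_classes):
    """Score a joint interval partition of (x1, x2) given class labels y.

    Simplified stand-in for a Bayesian model selection criterion:
    multinomial log-likelihood of the classes within each grid cell,
    minus a BIC-style penalty that grows with the number of cells.
    b1 and b2 are the interval boundaries on each input variable.
    """
    i1 = np.digitize(x1, b1)               # interval index of each instance on x1
    i2 = np.digitize(x2, b2)               # interval index on x2
    cells = i1 * (len(b2) + 1) + i2        # flattened 2-D cell index
    n = len(y)
    loglik = 0.0
    for c in np.unique(cells):             # occupied cells only
        counts = np.bincount(y[cells == c], minlength=n_classes)
        p = counts / counts.sum()          # class distribution in the cell
        loglik += (counts[p > 0] * np.log(p[p > 0])).sum()
    n_cells = (len(b1) + 1) * (len(b2) + 1)
    penalty = 0.5 * n_cells * (n_classes - 1) * np.log(n)
    return loglik - penalty

# On an XOR-style class pattern, a univariate split is useless, but the
# joint 2x2 grid with boundaries at zero separates the classes cleanly:
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 400)
x2 = rng.uniform(-1, 1, 400)
y = ((x1 > 0) ^ (x2 > 0)).astype(int)
s_grid = grid_score(x1, x2, y, np.array([0.0]), np.array([0.0]), 2)
s_none = grid_score(x1, x2, y, np.array([]), np.array([]), 2)
```

Here `s_grid` exceeds `s_none`, so the penalized score selects the joint partition, mirroring the abstract's point that bivariate grids detect interactions that univariate filters miss; the search over candidate boundaries is the paper's optimization problem.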

Keywords

Data preparation · Discretization · Feature selection · Model selection · Supervised classification

Mathematics Subject Classification (2000)

62H17 · 62H20 · 62H30



Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  1. Orange Labs, Lannion, France
