
Soft Computing, Volume 20, Issue 1, pp 173–188

ur-CAIM: improved CAIM discretization for unbalanced and balanced data

  • Alberto Cano
  • Dat T. Nguyen
  • Sebastián Ventura (corresponding author)
  • Krzysztof J. Cios
Methodologies and Application

Abstract

Supervised discretization is one of the basic data preprocessing techniques used in data mining. CAIM (class-attribute interdependence maximization) is a discretization algorithm for data whose classes are known. However, newly arising challenges, such as the presence of unbalanced data sets, call for new algorithms capable of handling them in addition to balanced data. This paper presents a new discretization algorithm named ur-CAIM, which improves on the CAIM algorithm in three important ways. First, it generates more flexible discretization schemes while producing a small number of intervals. Second, the quality of the intervals is improved based on the distribution of the data classes, which leads to better classification performance on balanced and, especially, unbalanced data. Third, the runtime of the algorithm is lower than CAIM’s. The algorithm is designed to be parameter-free, and it self-adapts to the problem complexity and the data class distribution. ur-CAIM was compared with 9 well-known discretization methods on 28 balanced and 70 unbalanced data sets. The results were contrasted through non-parametric statistical tests, which show that our proposal outperforms CAIM and many of the other methods on both types of data, but especially on unbalanced data, which is its significant advantage.
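
For readers unfamiliar with the baseline, the following is a minimal sketch of the criterion that the original CAIM algorithm (Kurgan and Cios 2004) maximizes for a single attribute: the average, over intervals, of the squared count of the dominant class divided by the number of samples in the interval. It is not the ur-CAIM criterion, which this abstract does not spell out. The function name `caim_criterion`, its arguments, and the interval convention (each interval is open on the left and closed on the right, except the first, which also includes its lower bound) are assumptions made for illustration.

```python
from collections import Counter


def caim_criterion(values, labels, boundaries):
    """Return the CAIM value of a discretization scheme for one attribute.

    boundaries is a sorted list [d0, d1, ..., dn] defining n intervals.
    Hypothetical helper: interval r covers (d_{r-1}, d_r], and the first
    interval also includes d0.
    """
    n_intervals = len(boundaries) - 1
    # One class histogram per interval (a column of the quanta matrix).
    counts = [Counter() for _ in range(n_intervals)]
    for value, label in zip(values, labels):
        for r in range(n_intervals):
            low, high = boundaries[r], boundaries[r + 1]
            if low < value <= high or (r == 0 and value == low):
                counts[r][label] += 1
                break

    total = 0.0
    for histogram in counts:
        interval_size = sum(histogram.values())
        if interval_size:  # skip empty intervals to avoid division by zero
            total += max(histogram.values()) ** 2 / interval_size
    return total / n_intervals


values = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]
clean_split = caim_criterion(values, labels, [1, 3.5, 6])
no_split = caim_criterion(values, labels, [1, 6])
```

With these toy data, the scheme that separates the two classes cleanly scores higher than the single-interval scheme, which is the behavior a greedy search over candidate boundaries would exploit.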

Keywords

Supervised discretization, Class-attribute interdependency maximization, Unbalanced data, Classification

Notes

Acknowledgments

This work has been supported by the National Institutes of Health grant 1R01HD056235-01A1 (KJC), the NSF GANN grant (DTN), the Regional Government of Andalusia and the Ministry of Science and Technology project TIN-2011-22408 (SV, AC), and the Ministry of Education FPU grant AP2010-0042 (AC).


Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Alberto Cano (1)
  • Dat T. Nguyen (3)
  • Sebastián Ventura (1, 2), corresponding author
  • Krzysztof J. Cios (3, 4)
  1. Department of Computer Science and Numerical Analysis, University of Cordoba, Córdoba, Spain
  2. Computer Sciences Department, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
  3. Department of Computer Science, Virginia Commonwealth University, Richmond, USA
  4. IITiS Polish Academy of Sciences, Gliwice, Poland
