
A Data Mining Software Package Including Data Preparation and Reduction: KEEL

  • Salvador García
  • Julián Luengo
  • Francisco Herrera
Chapter
Part of the Intelligent Systems Reference Library book series (ISRL, volume 72)

Abstract

KEEL is an open source Data Mining software tool widely used in research and real-life applications. Most of the algorithms described throughout the book, if not all of them, are implemented and publicly available in this Data Mining platform. Since KEEL enables the user to create and run single or concatenated preprocessing techniques on the data, this software is carefully introduced in this chapter, intuitively guiding the reader through the steps needed to set up all the data preparations that might be required. It is also worth noting that the experimental analyses carried out in this book were conducted using KEEL, allowing the reader to quickly reproduce, compare, and adapt the results presented here. An extensive review of Data Mining software tools is presented in Sect. 10.1. Among them, we focus on the open source KEEL platform in Sect. 10.2, providing details of its main features and usage. For the practitioner's interest, the most commonly used data sources are introduced in Sect. 10.3, and the steps needed to integrate any new algorithm into the platform in Sect. 10.4. Once results have been obtained, the appropriate comparison guidelines are provided in Sect. 10.5. The most important aspects of the tool are summarized in Sect. 10.6.
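As an illustration of the kind of pairwise comparison guideline discussed for Sect. 10.5, the sketch below applies the Wilcoxon signed-rank test to per-data-set accuracies of two classifiers. This is not KEEL itself but a minimal Python example assuming SciPy is available; the accuracy values are hypothetical.

```python
# Hedged sketch (not part of KEEL): nonparametric pairwise comparison of
# two classifiers over several data sets via the Wilcoxon signed-rank test.
from scipy.stats import wilcoxon

# Hypothetical per-data-set test accuracies for classifiers A and B.
acc_a = [0.81, 0.75, 0.92, 0.68, 0.88, 0.79, 0.85, 0.73, 0.90, 0.77]
acc_b = [0.78, 0.74, 0.89, 0.70, 0.84, 0.76, 0.83, 0.71, 0.87, 0.75]

# The test ranks the paired differences, so it makes no normality assumption.
stat, p_value = wilcoxon(acc_a, acc_b)
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("The difference between the classifiers is statistically significant.")
```

A Wilcoxon test over paired per-data-set scores is a common alternative to a paired t-test when the usual parametric conditions (normality, homoscedasticity) cannot be guaranteed across heterogeneous benchmark data sets.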

Keywords

Data Mining, Data Mining Algorithms, Learning Classifier Systems, Subgroup Discovery, Multiple Instance Learning


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Salvador García (1)
  • Julián Luengo (2)
  • Francisco Herrera (3)
  1. Department of Computer Science, University of Jaén, Jaén, Spain
  2. Department of Civil Engineering, University of Burgos, Burgos, Spain
  3. Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
