Memetic Computing

Volume 2, Issue 3, pp 183–199

Stratified prototype selection based on a steady-state memetic algorithm: a study of scalability

  • Joaquín Derrac
  • Salvador García
  • Francisco Herrera
Regular Research Paper

Abstract

Prototype selection (PS) is a data reduction process suited to refining the training set of a data mining algorithm. Performing PS on existing datasets can be inefficient, especially as the size of the problem increases. In recent years, however, several techniques have been developed to overcome the drawbacks caused by the lack of scalability of classical PS approaches. One of these techniques is known as stratification. In this study, we test the combination of stratification with a previously published steady-state memetic algorithm for PS on various problems, ranging from 50,000 to more than 1 million instances. We compare it with several well-known PS methods and conduct an in-depth study of the effects of stratification on the behavior of the selected method, focusing on its time complexity, accuracy and convergence capabilities. Furthermore, the trade-off between accuracy and efficiency of the proposed combination is analyzed, leading to the conclusion that it is a highly suitable option for performing PS when the size of the problem exceeds the capabilities of classical PS methods.

Keywords

Data reduction · Memetic algorithm · Stratification · Scaling up · Prototype selection · Nearest neighbor rule
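
The abstract describes the stratification strategy only at a high level: the training set is split into smaller, disjoint strata, the PS method is run independently on each stratum, and the selected prototypes are joined into a single reduced set that feeds the nearest neighbor classifier. The sketch below is a minimal illustration of that idea, not code from the paper; the function names are ours, the per-stratum selector is a simple Wilson-style editing rule used as a stand-in for the steady-state memetic algorithm, and the random split only roughly preserves class proportions, unlike the class-balanced strata of the original stratification scheme.

```python
# Minimal sketch of stratified prototype selection (illustrative only).
import numpy as np

def one_nn_predict(X_train, y_train, X_test):
    """Classify each test instance with the 1-nearest-neighbor rule (Euclidean)."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        preds.append(y_train[np.argmin(d)])
    return np.array(preds)

def edit_stratum(X, y):
    """Placeholder PS method for one stratum: keep instances whose nearest
    neighbor within the stratum shares their class (leave-one-out editing).
    A stand-in for the steady-state memetic algorithm used in the paper."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf  # exclude the instance itself
        if y[np.argmin(d)] == y[i]:
            keep.append(i)
    return X[keep], y[keep]

def stratified_ps(X, y, n_strata, seed=0):
    """Split the training set into disjoint strata, run PS on each stratum
    independently, and join the selected prototypes into one reduced set."""
    idx = np.random.default_rng(seed).permutation(len(X))
    sel_X, sel_y = [], []
    for part in np.array_split(idx, n_strata):
        Xs, ys = edit_stratum(X[part], y[part])
        sel_X.append(Xs)
        sel_y.append(ys)
    return np.vstack(sel_X), np.concatenate(sel_y)
```

Because each stratum is processed independently, the selector's cost is paid only on stratum-sized subsets rather than on the full training set, which is the source of the scalability gain studied in the paper.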

Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  • Joaquín Derrac 1
  • Salvador García 2
  • Francisco Herrera 1
  1. Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Granada, Spain
  2. Department of Computer Science, University of Jaén, Jaén, Spain