
A divide-and-conquer recursive approach for scaling up instance selection algorithms

Data Mining and Knowledge Discovery

Abstract

Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly being produced. However, although current algorithms are useful for fairly large datasets, they run into scaling problems when the number of instances reaches hundreds of thousands or millions. In the best case these algorithms have a time complexity of O(n²), n being the number of instances, so when we face huge problems most of them are simply not applicable. This paper presents a recursive divide-and-conquer approach to instance selection for instance-based learning on very large problems. Our method divides the original training set into small subsets and applies the instance selection algorithm to each of them; the selected instances are then rejoined into a new training set, and the same procedure of partitioning and selection is repeated. In this way, the divide-and-conquer principle is applied recursively. The proposed method matches, and for storage reduction even improves on, the results of well-known standard algorithms, with a very significant reduction in execution time. An extensive comparison on 30 datasets from the UCI Machine Learning Repository shows the usefulness of our method. Additionally, the method is applied to 5 huge datasets ranging from 300,000 to more than a million instances, with very good results and fast execution times.
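The abstract's partition-select-rejoin loop is easy to state in code. The sketch below is a minimal illustration, not the authors' implementation: it assumes Wilson's editing (ENN) as the base selector, a fixed subset size, and a stopping rule (halt when the pool fits in one subset or stops shrinking). All function names and parameters here are inventions of the sketch.

```python
import random

def enn_select(Xs, ys, k=3):
    """Base selector (a stand-in, not the paper's choice): Wilson's editing.
    Keep an instance only if the majority of its k nearest neighbours,
    found by brute-force squared Euclidean distance, share its label."""
    kept = []
    for i, (xi, yi) in enumerate(zip(Xs, ys)):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(Xs) if j != i
        )
        votes = [ys[j] for _, j in dists[:k]]
        if votes.count(yi) > k // 2:  # majority of neighbours agree
            kept.append(i)
    return kept

def recursive_selection(X, y, select=enn_select, subset_size=1000, seed=0):
    """Divide-and-conquer wrapper: randomly partition the current pool into
    subsets of at most subset_size, run the base selector on each, pool the
    survivors, and repeat on the pooled result until it fits in a single
    subset or stops shrinking. Returns indices of the selected instances."""
    rng = random.Random(seed)
    pool = list(range(len(X)))
    while True:
        rng.shuffle(pool)  # random partition of the surviving instances
        survivors = []
        for start in range(0, len(pool), subset_size):
            chunk = pool[start:start + subset_size]
            kept = select([X[i] for i in chunk], [y[i] for i in chunk])
            survivors.extend(chunk[i] for i in kept)
        done = len(survivors) == len(pool) or len(survivors) <= subset_size
        pool = survivors
        if done:
            return sorted(pool)

# Toy usage: two 2-D Gaussian clusters, 1,200 instances in total
random.seed(0)
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(600)] \
  + [(random.gauss(4, 1), random.gauss(4, 1)) for _ in range(600)]
y = [0] * 600 + [1] * 600
selected = recursive_selection(X, y, subset_size=200)
print(f"kept {len(selected)} of {len(X)} instances")
```

The scaling argument behind the scheme: for a base algorithm that is quadratic in the number of instances, each pass over subsets of size s costs on the order of (n/s)·s² = n·s operations, i.e. linear in n for fixed s. That is the source of the large speed-ups the abstract reports over running the same algorithm once on all n instances.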



Author information

Correspondence to Nicolás García-Pedrajas.

Additional information

Responsible editor: Eamonn Keogh.


About this article

Cite this article

de Haro-García, A., García-Pedrajas, N. A divide-and-conquer recursive approach for scaling up instance selection algorithms. Data Min Knowl Disc 18, 392–418 (2009). https://doi.org/10.1007/s10618-008-0121-2

