
A divide-and-conquer recursive approach for scaling up instance selection algorithms

Data Mining and Knowledge Discovery

Abstract

Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly being produced. However, although current algorithms are useful for fairly large datasets, they run into scaling problems when the number of instances reaches hundreds of thousands or millions. In the best case these algorithms have a time complexity of O(n²), n being the number of instances, so when we face huge problems most of them are simply not applicable. This paper presents a recursive divide-and-conquer approach to instance selection for instance-based learning on very large problems. Our method divides the original training set into small subsets and applies the instance selection algorithm to each of them; the selected instances are then rejoined into a new training set, and the same procedure of partitioning and selection is repeated. In this way, the divide-and-conquer principle is applied recursively. The proposed method matches, and for storage reduction even improves on, the results of well-known standard algorithms, with a very significant reduction in execution time. An extensive comparison on 30 datasets from the UCI Machine Learning Repository shows the usefulness of our method. Additionally, the method is applied to 5 huge datasets ranging from 300,000 to more than a million instances, with very good results and fast execution times.
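The abstract's partition-select-rejoin loop is easy to state in code. The sketch below is a minimal illustration, not the authors' implementation: it assumes Wilson's editing (ENN) as the base selector, a fixed subset size, and a stopping rule (halt when the pool fits in one subset or stops shrinking). All function names and parameters here are inventions of the sketch.

```python
import random

def enn_select(Xs, ys, k=3):
    """Base selector (a stand-in, not the paper's choice): Wilson's editing.
    Keep an instance only if the majority of its k nearest neighbours,
    found by brute-force squared Euclidean distance, share its label."""
    kept = []
    for i, (xi, yi) in enumerate(zip(Xs, ys)):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(Xs) if j != i
        )
        votes = [ys[j] for _, j in dists[:k]]
        if votes.count(yi) > k // 2:  # majority of neighbours agree
            kept.append(i)
    return kept

def recursive_selection(X, y, select=enn_select, subset_size=1000, seed=0):
    """Divide-and-conquer wrapper: randomly partition the current pool into
    subsets of at most subset_size, run the base selector on each, pool the
    survivors, and repeat on the pooled result until it fits in a single
    subset or stops shrinking. Returns indices of the selected instances."""
    rng = random.Random(seed)
    pool = list(range(len(X)))
    while True:
        rng.shuffle(pool)  # random partition of the surviving instances
        survivors = []
        for start in range(0, len(pool), subset_size):
            chunk = pool[start:start + subset_size]
            kept = select([X[i] for i in chunk], [y[i] for i in chunk])
            survivors.extend(chunk[i] for i in kept)
        done = len(survivors) == len(pool) or len(survivors) <= subset_size
        pool = survivors
        if done:
            return sorted(pool)

# Toy usage: two 2-D Gaussian clusters, 1,200 instances in total
random.seed(0)
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(600)] \
  + [(random.gauss(4, 1), random.gauss(4, 1)) for _ in range(600)]
y = [0] * 600 + [1] * 600
selected = recursive_selection(X, y, subset_size=200)
print(f"kept {len(selected)} of {len(X)} instances")
```

The scaling argument behind the scheme: for a base algorithm that is quadratic in the number of instances, each pass over subsets of size s costs on the order of (n/s)·s² = n·s operations, i.e. linear in n for fixed s. That is the source of the large speed-ups the abstract reports over running the same algorithm once on all n instances.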



Author information

Correspondence to Nicolás García-Pedrajas.

Additional information

Responsible editor: Eamonn Keogh.


About this article

Cite this article

de Haro-García, A., García-Pedrajas, N. A divide-and-conquer recursive approach for scaling up instance selection algorithms. Data Min Knowl Disc 18, 392–418 (2009). https://doi.org/10.1007/s10618-008-0121-2

