A Comparison of Two Strategies for Scaling Up Instance Selection in Huge Datasets

de Haro-García, Aida; Pérez-Rodríguez, Javier; García-Pedrajas, Nicolás

doi:10.1007/978-3-642-25274-7_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7023)

Included in the following conference series:

Conference of the Spanish Association for Artificial Intelligence

Abstract

Instance selection is becoming increasingly relevant because of the huge amount of data that is constantly being produced. However, although current algorithms are useful for fairly large datasets, many scaling problems arise when the number of instances reaches hundreds of thousands or millions. Most instance selection algorithms have a complexity of at least O(n²), n being the number of instances. For huge problems scalability becomes an issue, and most of these algorithms are not applicable.

Recently, two general methods for scaling up instance selection algorithms have been published in the literature: stratification and democratization. Both methods can deal successfully with large datasets. In this paper we compare these two methods on very large and huge datasets of up to 50,000,000 instances. We also test their performance on huge datasets that are class-imbalanced. The comparison uses a parallel implementation of both methods to fully exploit their possibilities.
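To make the scaling argument concrete, the following is a minimal sketch of the general idea of stratification: the data are split into disjoint strata and a quadratic-cost selector is run on each stratum independently. It is not the parallel implementation evaluated in the paper. The selector (`nearest_neighbor_edit`, a toy rule that keeps an instance only if its nearest neighbour shares its class), the function names, and the `stratum_size` and `seed` parameters are all assumptions made for illustration.

```python
import random


def nearest_neighbor_edit(points, labels):
    """Toy O(m^2) selector (an assumption, not the paper's algorithm):
    keep an instance only if its nearest neighbour, excluding itself,
    carries the same class label."""
    keep = []
    for i, p in enumerate(points):
        best_j, best_d = None, float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue
            d = sum((a - b) ** 2 for a, b in zip(p, q))  # squared Euclidean distance
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None and labels[best_j] == labels[i]:
            keep.append(i)
    return keep


def stratified_selection(points, labels, stratum_size, seed=0):
    """Shuffle the indices, cut them into disjoint strata, run the selector
    on each stratum independently, and return the sorted union of the
    indices that were kept."""
    idx = list(range(len(points)))
    random.Random(seed).shuffle(idx)
    selected = []
    for start in range(0, len(idx), stratum_size):
        stratum = idx[start:start + stratum_size]
        kept = nearest_neighbor_edit([points[i] for i in stratum],
                                     [labels[i] for i in stratum])
        selected.extend(stratum[k] for k in kept)  # map back to global indices
    return sorted(selected)


def pairwise_cost(n, stratum_size=None):
    """Number of distance evaluations the toy selector performs on n
    instances, either on the whole set or stratum by stratum."""
    if stratum_size is None:
        return n * (n - 1)
    full, rest = divmod(n, stratum_size)
    return full * stratum_size * (stratum_size - 1) + rest * (rest - 1)
```

Under these assumptions, `pairwise_cost(50_000_000)` is roughly 2.5 × 10^15 distance evaluations, whereas `pairwise_cost(50_000_000, 1000)` is about 5 × 10^10 for an illustrative stratum size of 1,000. Because the strata are processed independently, the loop over strata is also the natural unit to distribute across processors.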

Although both methods show very good behavior in terms of testing error, storage reduction and execution time, democratization achieves better overall performance.

This work was supported in part by the project TIN2008-03151 of the Spanish Ministry of Science and Innovation and the project P09-TIC-4623 of the Junta de Andalucía.

Author information

Authors and Affiliations

Department of Computing and Numerical Analysis, University of Córdoba, Spain
Aida de Haro-García, Javier Pérez-Rodríguez & Nicolás García-Pedrajas

Editor information

Editors and Affiliations

Computer Science School, University of the Basque Country, Pº Manuel de Lardizabal 1, 20018, Donostia-San Sebastian, Spain
Jose A. Lozano
Computing Systems Department, University of Castilla-La Mancha, Campus Universitario s/n, 02071, Albacete, Spain
José A. Gámez
Dep. Statistics, O.R. and Computation, University of La Laguna, 38271, La Laguna, S.C. Tenerife, Spain
José A. Moreno

About this paper

Cite this paper

de Haro-García, A., Pérez-Rodríguez, J., García-Pedrajas, N. (2011). A Comparison of Two Strategies for Scaling Up Instance Selection in Huge Datasets. In: Lozano, J.A., Gámez, J.A., Moreno, J.A. (eds) Advances in Artificial Intelligence. CAEPIA 2011. Lecture Notes in Computer Science, vol 7023. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25274-7_7

DOI: https://doi.org/10.1007/978-3-642-25274-7_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25273-0
Online ISBN: 978-3-642-25274-7
eBook Packages: Computer Science, Computer Science (R0)
