Abstract
Data reduction in data mining selects or generates the most representative instances of the input data in order to reduce the original complex instance space and better define the decision boundaries between classes. In theory, reduction techniques should enable learning algorithms to be applied to large-scale problems. Nevertheless, standard algorithms struggle with the increase in size and complexity of today's problems. The objective of this chapter is to provide several ideas, algorithms, and techniques for tackling the data reduction problem on Big Data. We begin by analyzing the first ideas on scalable data reduction in single-machine environments. Then we present a distributed data reduction method that solves many of the scalability problems of the sequential approaches. Next we provide a use case of data reduction algorithms in Big Data. Lastly, we study a recent development in data reduction for high-speed streaming systems.
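To make the instance-selection idea concrete, the sketch below implements Hart's classic condensed nearest neighbor (CNN) rule, one of the earliest data reduction techniques the chapter builds on: it retains only those instances needed for a 1-NN classifier to label the rest of the training set correctly. This is a minimal single-machine illustration in plain Python (the function name `condensed_nn` is ours), not the distributed method presented in the chapter.

```python
import math

def condensed_nn(points, labels):
    """Hart's Condensed Nearest Neighbor rule: keep a subset of
    (point, label) pairs that classifies every training instance
    correctly under the 1-NN rule with Euclidean distance."""
    def nn_label(store, x):
        # Label of the stored instance nearest to x
        return min(store, key=lambda s: math.dist(s[0], x))[1]

    store = [(points[0], labels[0])]      # seed with the first instance
    changed = True
    while changed:                        # repeat until no new absorptions
        changed = False
        for x, y in zip(points, labels):
            if nn_label(store, x) != y:   # misclassified -> absorb it
                store.append((x, y))
                changed = True
    return store

# Two well-separated clusters: CNN keeps far fewer points than the original set.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = [0, 0, 0, 1, 1, 1]
reduced = condensed_nn(data, labels)
```

On this toy set the reduced store contains one representative per cluster, illustrating how instance selection shrinks the instance space while preserving the decision boundary.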
© 2020 Springer Nature Switzerland AG
Cite this chapter
Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Data Reduction for Big Data. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39104-1
Online ISBN: 978-3-030-39105-8
eBook Packages: Computer Science (R0)