Abstract
Data reduction in data mining selects or generates the most representative instances of the input data in order to reduce the original complex instance space and better define the decision boundaries between classes. In theory, reduction techniques should enable learning algorithms to be applied to large-scale problems. Nevertheless, standard algorithms struggle with the increase in size and complexity of today's problems. The objective of this chapter is to provide several ideas, algorithms, and techniques for tackling the data reduction problem on Big Data. We begin by analyzing the first ideas on scalable data reduction in single-machine environments. Then we present a distributed data reduction method that solves many of the scalability problems of the sequential approaches. Next we provide a use case of data reduction algorithms in Big Data. Lastly, we study a recent development in data reduction for high-speed streaming systems.
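To make the instance-selection idea concrete, the sketch below implements Hart's classic condensed nearest neighbor (CNN) rule, one of the earliest data reduction techniques the chapter builds on: it retains only those instances needed for a 1-NN classifier to label the rest of the training set correctly. This is a minimal single-machine illustration in plain Python (the function name `condensed_nn` is ours), not the distributed method presented in the chapter.

```python
import math

def condensed_nn(points, labels):
    """Hart's Condensed Nearest Neighbor rule: keep a subset of
    (point, label) pairs that classifies every training instance
    correctly under the 1-NN rule with Euclidean distance."""
    def nn_label(store, x):
        # Label of the stored instance nearest to x
        return min(store, key=lambda s: math.dist(s[0], x))[1]

    store = [(points[0], labels[0])]      # seed with the first instance
    changed = True
    while changed:                        # repeat until no new absorptions
        changed = False
        for x, y in zip(points, labels):
            if nn_label(store, x) != y:   # misclassified -> absorb it
                store.append((x, y))
                changed = True
    return store

# Two well-separated clusters: CNN keeps far fewer points than the original set.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = [0, 0, 0, 1, 1, 1]
reduced = condensed_nn(data, labels)
```

On this toy set the reduced store contains one representative per cluster, illustrating how instance selection shrinks the instance space while preserving the decision boundary.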
© 2020 Springer Nature Switzerland AG
Cite this chapter
Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Data Reduction for Big Data. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39104-1
Online ISBN: 978-3-030-39105-8
eBook Packages: Computer Science (R0)