Data Reduction for Big Data

  • Julián Luengo
  • Diego García-Gil
  • Sergio Ramírez-Gallego
  • Salvador García
  • Francisco Herrera


Data reduction in data mining selects or generates the most representative instances of the input data in order to shrink the original, complex instance space and better define the decision boundaries between classes. In theory, reduction techniques should enable the application of learning algorithms to large-scale problems. In practice, however, standard algorithms struggle with the growing size and complexity of today's problems. The objective of this chapter is to provide ideas, algorithms, and techniques for tackling the data reduction problem on Big Data. We begin by analyzing the first approaches to scalable data reduction in single-machine environments. We then present a distributed data reduction method that resolves many of the scalability problems of the sequential approaches. Next, we provide a use case of data reduction algorithms on Big Data. Lastly, we study a recent development in data reduction for high-speed streaming systems.
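To make the idea of instance selection concrete, the following is a minimal sketch of Hart's classic Condensed Nearest Neighbor (CNN) rule, one of the earliest reduction techniques discussed in this line of work: it retains only the instances needed for a 1-NN classifier to label the full training set correctly. This is an illustrative single-machine sketch, not the chapter's distributed method; the function name and toy data are ours.

```python
import numpy as np

def cnn_select(X, y, max_passes=10, seed=0):
    """Return indices of a condensed subset S of (X, y) such that a
    1-NN classifier over S labels every training instance correctly
    (or until max_passes scans are exhausted)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    S = [int(order[0])]                 # seed the subset with one instance
    for _ in range(max_passes):
        changed = False
        for i in order:
            # classify instance i with 1-NN over the current subset S
            d = np.linalg.norm(X[S] - X[i], axis=1)
            nearest = S[int(np.argmin(d))]
            if y[nearest] != y[i]:      # misclassified -> absorb into S
                S.append(int(i))
                changed = True
        if not changed:                 # consistent subset reached
            break
    return np.array(S)

# Toy data: two well-separated classes
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])
y = np.array([0, 0, 0, 1, 1, 1])
subset = cnn_select(X, y)
print(len(subset), "of", len(X), "instances retained")
```

On well-separated data like this, CNN typically keeps only a handful of boundary instances, which is exactly the reduction behavior that becomes problematic to scale: each scan is quadratic in the number of instances, motivating the distributed designs the chapter presents.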



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Julián Luengo (1)
  • Diego García-Gil (1)
  • Sergio Ramírez-Gallego (2)
  • Salvador García (1)
  • Francisco Herrera (1)

  1. Department of Computer Science and AI, University of Granada, Granada, Spain
  2. DOCOMO Digital España, Madrid, Spain
