On the Use of Random Discretization and Dimensionality Reduction in Ensembles for Big Data

  • Diego García-Gil
  • Sergio Ramírez-Gallego
  • Salvador García
  • Francisco Herrera
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10870)


Massive data growth in recent years has made data reduction techniques especially popular because of their ability to shrink these enormous volumes of data, also called Big Data. Random Projection Random Discretization is an innovative ensemble method that uses two data reduction techniques to create more informative data: Random Discretization, proposed by its authors, and Random Projections (RP). However, RP has some shortcomings that can be solved by more powerful methods such as Principal Component Analysis (PCA). To tackle this problem, we propose a new ensemble method, named Random Discretization Dimensionality Reduction Ensemble, that uses the Apache Spark framework and PCA for dimensionality reduction. In our experiments on five Big Data datasets, we show that our proposal achieves better prediction performance than the original algorithm and Random Forest.
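
As a rough illustration of the discretization half of such an ensemble, the sketch below builds several randomly discretized views of a tiny dataset. It rests on assumptions the abstract does not state: cut points are drawn at random from the observed values of each feature, and the names `random_cut_points`, `discretize`, and `random_discretization` are hypothetical. It runs on a single machine in plain Python rather than on Apache Spark, and it omits the PCA step and the base classifiers entirely.

```python
import bisect
import random


def random_cut_points(values, n_bins, rng):
    """Pick up to n_bins - 1 cut points by sampling observed values (assumed scheme)."""
    k = min(n_bins - 1, len(values))
    return sorted(rng.sample(list(values), k))


def discretize(values, cuts):
    """Map each value to the index of the interval it falls into."""
    return [bisect.bisect_right(cuts, v) for v in values]


def random_discretization(rows, n_bins, seed):
    """Discretize every column of `rows` with its own random cut points."""
    rng = random.Random(seed)
    columns = list(zip(*rows))
    discretized = []
    for col in columns:
        cuts = random_cut_points(col, n_bins, rng)
        discretized.append(discretize(col, cuts))
    # Transpose back so each output row lines up with its input row.
    return [list(r) for r in zip(*discretized)]


# Toy data: five instances, two numeric features.
rows = [[0.1, 5.0], [0.4, 3.0], [0.35, 8.0], [0.9, 1.0], [0.7, 6.0]]

# One differently discretized view per ensemble member (three members here).
views = [random_discretization(rows, n_bins=3, seed=s) for s in range(3)]
```

Each view would then be combined with a reduced-dimension representation and passed to its own base learner. Varying the seed per member is what gives the ensemble its diversity in this sketch.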


Big Data, Ensemble, Discretization, Apache Spark, PCA, Data reduction



This contribution is supported by FEDER, the Spanish National Research Projects TIN2014-57251-P and TIN2017-89517-P, and the Project BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Diego García-Gil (1), email author
  • Sergio Ramírez-Gallego (1)
  • Salvador García (1)
  • Francisco Herrera (1)
  1. Department of Computer Science and Artificial Intelligence, University of Granada, CITIC-UGR, Granada, Spain
