On the Use of Random Discretization and Dimensionality Reduction in Ensembles for Big Data
The massive growth of data in recent years has made data reduction techniques especially popular, because they can shrink these enormous volumes of data, commonly referred to as Big Data. Random Projection Random Discretization is an innovative ensemble method that combines two data reduction techniques to create more informative data: Random Discretization, which was proposed together with the method, and Random Projections (RP). However, RP has some shortcomings that more powerful methods such as Principal Components Analysis (PCA) can overcome. To tackle this problem, we propose a new ensemble method, named Random Discretization Dimensionality Reduction Ensemble, that runs on the Apache Spark framework and uses PCA for dimensionality reduction. In our experiments on five Big Data datasets, our proposal achieves better prediction performance than both the original algorithm and Random Forest.
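To make the discretization idea concrete, the following is a minimal single-machine sketch, not the Spark implementation described above. The abstract does not spell out how Random Discretization chooses its intervals; the sketch assumes that, for each feature, cut points are drawn at random from the observed values, so that every ensemble member sees a differently binned view of the same data. The function names (`random_cut_points`, `discretize`, `random_discretization`), the toy data, and the parameters are hypothetical, and the PCA step is omitted.

```python
import bisect
import random


def random_cut_points(values, n_bins, rng):
    """Assumed rule: sample n_bins - 1 observed values and use them as cut points."""
    return sorted(rng.sample(values, n_bins - 1))


def discretize(values, cuts):
    """Map each value to the index of the interval it falls into (0 .. len(cuts))."""
    return [bisect.bisect_right(cuts, v) for v in values]


def random_discretization(rows, n_bins, seed):
    """Discretize every column of `rows`, each with its own random cut points."""
    rng = random.Random(seed)
    binned_columns = []
    for column in zip(*rows):
        cuts = random_cut_points(list(column), n_bins, rng)
        binned_columns.append(discretize(column, cuts))
    # Transpose back so the result has one list per original row.
    return [list(row) for row in zip(*binned_columns)]


# Toy data: four instances with two numeric features (illustrative values only).
rows = [[0.1, 5.0], [0.4, 3.0], [0.9, 8.0], [0.7, 1.0]]

# One discretized view per ensemble member; a different seed gives different bins.
views = [random_discretization(rows, n_bins=3, seed=s) for s in range(3)]
```

In a full ensemble, each view would be combined with a reduced representation of the original features (PCA in the proposal) and used to train one base classifier, with predictions aggregated across members.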
Keywords: Big Data, Ensemble, Discretization, Apache Spark, PCA, Data reduction
This contribution is supported by FEDER, the Spanish National Research Projects TIN2014-57251-P and TIN2017-89517-P, and the Project BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016.