Abstract
In the new era of Big Data, the exponential increase in volume is usually accompanied by an explosion in the number of features. Dimensionality reduction arises as a possible solution for enabling large-scale learning with millions of dimensions. Nevertheless, like any other family of algorithms, reduction methods require a redesign so that they can work at such magnitudes. In particular, they must be prepared to tackle the explosive combinatorial effects of "the curse of Big Dimensionality" while embracing the benefits of the "blessing side of dimensionality" (poorly correlated features). In this chapter we analyze the problems and benefits derived from "the curse of Big Dimensionality" and how it has spread across many fields, such as the life sciences and the Internet. We then survey the contributions that address the large-scale dimensionality reduction problem. Next, as a case study, we examine in depth the design and behavior of one of the most popular selection frameworks in this field. Finally, we review the contributions related to dimensionality reduction in Big Data streams.
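The selection frameworks the abstract alludes to are largely information-theoretic filters: each feature is scored by its mutual information with the class label, and low-relevance features are discarded. A minimal, self-contained sketch of such a relevance filter for discrete data — an illustration of the general idea, not the chapter's actual distributed implementation — might look like:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), with counts folded in
        mi += p_xy * math.log2(p_xy * n * n / (px[x] * py[y]))
    return mi

def rank_features(X, y):
    """Return feature indices sorted by decreasing relevance I(X_i; y)."""
    scores = {i: mutual_information([row[i] for row in X], y)
              for i in range(len(X[0]))}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: feature 0 copies the label, feature 1 is a constant.
X = [[0, 1], [1, 1], [0, 1], [1, 1]]
y = [0, 1, 0, 1]
print(rank_features(X, y))  # → [0, 1]: the informative feature ranks first
```

Practical frameworks refine this univariate ranking with a redundancy penalty between already-selected features (as in mRMR) and distribute the per-feature score computation across a cluster; the ranking step itself stays the same.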
Notes
- 1.
- 2.
Although the feature generation machine is not a distributed method as such, it has been included here because of its outstanding relevance in the comparison.
- 3.
A broadcast operation in Spark sends a single copy of a variable to each node.
- 4.
- 5.
Note that the whole memory available in the cluster was only available from the 10-core configuration onward.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Dimensionality Reduction for Big Data. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_4
Print ISBN: 978-3-030-39104-1
Online ISBN: 978-3-030-39105-8