Abstract
In the new era of Big Data, the exponential increase in volume is usually accompanied by an explosion in the number of features. Dimensionality reduction arises as a possible solution for enabling large-scale learning with millions of dimensions. Nevertheless, like any other family of algorithms, reduction methods require a redesign so that they can work at such magnitudes. In particular, they must be prepared to tackle the explosive combinatorial effects of "the curse of Big Dimensionality" while embracing the benefits of the "blessing side of dimensionality" (poorly correlated features). In this chapter we analyze the problems and benefits derived from "the curse of Big Dimensionality" and how it has spread across many fields, such as the life sciences and the Internet. We then survey the contributions that address the large-scale dimensionality reduction problem. Next, as a case study, we examine in depth the design and behavior of one of the most popular selection frameworks in this field. Finally, we review the contributions related to dimensionality reduction in Big Data streams.
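The selection frameworks the abstract alludes to are largely information-theoretic filters: each feature is scored by its mutual information with the class label, and low-relevance features are discarded. A minimal, self-contained sketch of such a relevance filter for discrete data — an illustration of the general idea, not the chapter's actual distributed implementation — might look like:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), with counts folded in
        mi += p_xy * math.log2(p_xy * n * n / (px[x] * py[y]))
    return mi

def rank_features(X, y):
    """Return feature indices sorted by decreasing relevance I(X_i; y)."""
    scores = {i: mutual_information([row[i] for row in X], y)
              for i in range(len(X[0]))}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: feature 0 copies the label, feature 1 is a constant.
X = [[0, 1], [1, 1], [0, 1], [1, 1]]
y = [0, 1, 0, 1]
print(rank_features(X, y))  # → [0, 1]: the informative feature ranks first
```

Practical frameworks refine this univariate ranking with a redundancy penalty between already-selected features (as in mRMR) and distribute the per-feature score computation across a cluster; the ranking step itself stays the same.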
Notes
- 1.
- 2.
Although the feature generation machine is not a distributed method as such, it has been included here because of its outstanding relevance in the comparison.
- 3.
A broadcast operation in Spark sends a single copy of a variable to each node.
- 4.
- 5.
Note that the whole memory available in the cluster was only available from the 10-core configuration onward.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Dimensionality Reduction for Big Data. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_4
Print ISBN: 978-3-030-39104-1
Online ISBN: 978-3-030-39105-8