
Dimensionality Reduction for Big Data

Chapter in: Big Data Preprocessing

Abstract

In the new era of Big Data, the exponential increase in volume is usually accompanied by an explosion in the number of features. Dimensionality reduction arises as a possible solution to enable large-scale learning with millions of dimensions. Nevertheless, as with any other family of algorithms, reduction methods require an upgrade in their design so that they can work at such magnitudes. In particular, they must be prepared to tackle the explosive combinatorial effects of the "curse of Big Dimensionality" while embracing the benefits of its "blessing side" (poorly correlated features). In this chapter we analyze the problems and benefits derived from the "curse of Big Dimensionality" and how it has spread across many fields, such as the life sciences and the Internet. We then review the contributions that address the large-scale dimensionality reduction problem. Next, as a case study, we examine in depth the design and behavior of one of the most popular selection frameworks in this field. Finally, we survey the contributions related to dimensionality reduction in Big Data streams.
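The chapter's own framework is not reproduced here, but as a rough illustration of what large-scale feature selection can look like in practice, the hedged PySpark sketch below applies Spark MLlib's ChiSqSelector to a toy DataFrame; the dataset, application name, and parameter values are assumptions chosen only for demonstration.

```python
# Illustrative sketch only (not the framework analyzed in the chapter):
# a distributed filter-style feature selector using Spark MLlib's ChiSqSelector.
from pyspark.sql import SparkSession
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("fs-sketch").getOrCreate()

# Toy data; in a Big Data setting this DataFrame would hold millions of rows
# with very high-dimensional (often sparse) feature vectors.
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0, 3.0, 1.0]), 1.0),
     (Vectors.dense([0.0, 1.0, 2.0, 0.0]), 0.0),
     (Vectors.dense([1.0, 0.0, 3.0, 0.0]), 0.0)],
    ["features", "label"])

# Keep the two features most associated with the label (chi-squared ranking)
selector = ChiSqSelector(numTopFeatures=2, featuresCol="features",
                         labelCol="label", outputCol="selected")
model = selector.fit(df)
model.transform(df).show(truncate=False)
```

Because the selector is expressed over a partitioned DataFrame, the same code runs unchanged over data spread across a cluster, which is what makes MLlib-style selectors attractive when the dimensionality reaches millions of features.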


Notes

  1. https://en.wikipedia.org/wiki/Camera_phone.

  2. Although the feature generation machine is not a distributed method as such, it has been included here for its outstanding relevance in the comparison.

  3. The broadcast operation in Spark sends a single read-only copy of a variable to each node (see the sketch after these notes).

  4. http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-homepage.html.

  5. Note that the whole memory of the cluster was only available from the 10-core configuration onward.
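Note 3 refers to Spark's broadcast mechanism. The following minimal PySpark sketch, with hypothetical names and sizes, shows how a set of selected feature indices can be broadcast once so that every executor receives a single read-only copy instead of re-serializing the set with every task:

```python
# Minimal, illustrative sketch (not taken from the chapter) of Spark's
# broadcast mechanism applied to a hypothetical feature selection result.
from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="broadcast-sketch")

# Hypothetical indices kept by an earlier feature selection step
selected_idx = set(range(0, 1000, 10))
bc_idx = sc.broadcast(selected_idx)  # one copy shipped to each node

# Project every record onto the broadcast feature indices
rows = sc.parallelize([[float(i) for i in range(1000)] for _ in range(4)])
reduced = rows.map(lambda r: [v for i, v in enumerate(r) if i in bc_idx.value])
print(len(reduced.first()))  # 100 features kept out of 1000
```

Broadcasting the reduced feature set this way is a common pattern when a selection result must be applied to every partition of a large dataset.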



Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Dimensionality Reduction for Big Data. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_4


  • DOI: https://doi.org/10.1007/978-3-030-39105-8_4


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-39104-1

  • Online ISBN: 978-3-030-39105-8

  • eBook Packages: Computer Science, Computer Science (R0)
