Progress in Artificial Intelligence, Volume 5, Issue 2, pp 65–75

Feature selection for high-dimensional data

  • Verónica Bolón-Canedo
  • Noelia Sánchez-Maroño
  • Amparo Alonso-Betanzos
Regular Paper

Abstract

This paper offers a comprehensive approach to feature selection in the scope of classification problems, explaining the foundations, real application problems and challenges of feature selection in the context of high-dimensional data. First, we focus on the basis of feature selection, providing a review of its history and basic concepts. Then, we address different topics in which feature selection plays a crucial role, such as microarray data, intrusion detection, or medical applications. Finally, we delve into the open challenges that researchers in the field must deal with if they are to confront the advent of “Big Data” and, more specifically, “Big Dimensionality”.

Keywords

Feature selection · High-dimensional data · Big Data

Notes

Acknowledgments

This research has been financially supported in part by the Ministerio de Economía y Competitividad of the Spanish Government through the research project TIN 2012-37954, partially funded by FEDER funds of the European Union, and by the Consellería de Industria of the Xunta de Galicia through the research project GRC2014/035. V. Bolón-Canedo acknowledges the support of the Xunta de Galicia under postdoctoral Grant code ED481B 2014/164-0.

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Departamento de Computación, Universidade da Coruña, A Coruña, Spain
