Knowledge and Information Systems, Volume 53, Issue 3, pp 551–577

Recent advances in feature selection and its applications

Survey Paper

Abstract

Feature selection is one of the key problems in machine learning and data mining. This review paper gives a brief historical background of the field, followed by a selection of challenges of particular current interest, such as feature selection for high-dimensional small-sample-size data, large-scale data, and secure feature selection. Along with these challenges, several hot topics in feature selection have emerged, e.g., stable feature selection, multi-view feature selection, distributed feature selection, multi-label feature selection, online feature selection, and adversarial feature selection. Recent advances on these topics are then surveyed. For each topic, the existing problems are analyzed, and current solutions to these problems are presented and discussed. In addition to these topics, some representative applications of feature selection are introduced, such as applications in bioinformatics, social media, and multimedia retrieval.

Keywords

Feature selection, Survey, Data mining

Notes

Acknowledgements

This work was partially supported by the Natural Science Foundation of China (Nos. 61603197, 91646116), the Ministry of Education/China Mobile joint research Grant under Project No. 5-10, the Scientific, Technological Support Project (Society) of Jiangsu Province (No. BE2016776), the Natural Science Foundation of Jiangsu Province (No. BK20140885), the Six talent peaks project in Jiangsu Province under Grant XYDXXJS-CXTD-006, and the Jiangsu Qinlan Project.

Copyright information

© Springer-Verlag London 2017

Authors and Affiliations

  1. School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing, China
  2. Jiangsu Key Laboratory of Big Data Security and Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing, China
  3. School of Computer Science, Florida International University, Miami, USA
  4. School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, USA
