Coordinating Discernibility and Independence Scores of Variables in a 2D Space for Efficient and Accurate Feature Selection
Feature selection removes redundant and irrelevant features from the original features of exemplars, so that a sparse and representative feature subset can be identified for building a more efficient and accurate classifier. This paper presents novel definitions for the discernibility and independence scores of a feature, and then constructs a two-dimensional (2D) space, with a feature's independence as the y-axis and its discernibility as the x-axis, to rank the importance of features. The new method is named FSDI (Feature Selection based on the Discernibility and Independence of a feature). The discernibility score measures how well a feature distinguishes instances from different classes, while the independence score measures a feature's redundancy. All features are plotted in the 2D space according to their discernibility and independence coordinates, and the area of the rectangle spanned by a feature's discernibility and independence is used as the criterion to rank feature importance. The top-k features, whose importance is much higher than that of the remaining ones, are selected to form the sparse and representative feature subset for building an efficient and accurate classifier. Experimental results on five classical gene expression datasets demonstrate that the proposed FSDI algorithm selects gene subsets efficiently and achieves the best classification performance. Our method offers a good solution to the bottleneck caused by the high time complexity of existing gene subset selection algorithms.
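The ranking criterion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the discernibility and independence scores have already been computed (their exact definitions are given in the paper), and it ranks features by the rectangle area, i.e. the product of the two scores.

```python
import numpy as np

def fsdi_rank(discernibility, independence, k):
    """Rank features by the area of the rectangle spanned by their
    discernibility (x-axis) and independence (y-axis) scores in the
    2D space, and return the indices of the top-k features.
    """
    # Rectangle area = discernibility * independence for each feature
    importance = np.asarray(discernibility) * np.asarray(independence)
    # Sort in descending order of importance and keep the top k
    return np.argsort(importance)[::-1][:k]

# Toy scores for five features (illustrative values, not from the paper)
d = [0.9, 0.2, 0.7, 0.4, 0.8]
i = [0.8, 0.9, 0.3, 0.6, 0.7]
print(fsdi_rank(d, i, k=2))  # indices of the two features with largest areas
```

Here the areas are 0.72, 0.18, 0.21, 0.24, and 0.56, so features 0 and 4 are selected.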
Keywords: Discernibility · Independence · Feature selection · Gene subset selection
We are much obliged to those who shared the gene expression datasets with us. This work is supported in part by the National Natural Science Foundation of China under Grant No. 31372250, by the Key Science and Technology Program of Shaanxi Province of China under Grant No. 2013K12-03-24, by the Fundamental Research Funds for the Central Universities under Grant No. GK201503067, and by the Innovation Funds of Graduate Programs at Shaanxi Normal University under Grant No. 2015CXS028.