Abstract
Feature selection has become a clear necessity in many bioinformatics applications, especially when the number of features is very large. For instance, classification of hereditary disease genes/proteins plays a significant role in the prediction and diagnosis of diseases. In this regard, some knowledge of how well each feature supports prediction is needed. Distinctive features and their relevance to the class labels are decisive in designing efficient classifiers. Indeed, excluding redundant and/or irrelevant features, without incurring much loss of information, can reduce the processing cost while improving the predictor’s performance. Consequently, feature selection is a preliminary task in most biological studies. Traditionally, biological data analysis has relied on common feature selection techniques that treat data instances as independent objects and so do not consider possible inter-relations among them. Yet such inter-relations exist: for instance, protein–protein interactions (PPIs) govern a wide range of biological processes, including cell-to-cell interactions and metabolic and developmental control. Linked data instances tend to have more similar characteristics than unlinked ones, so accounting for these inter-relations alongside the data content is beneficial in feature selection. To incorporate the data inter-relations (e.g., PPIs in biological data) along with the data content in selecting more effective features, a novel feature selection algorithm is proposed. The method works in an unsupervised manner to handle unlabeled biological data, since most real-world genes/proteins have no label. It optimizes a novel objective function that incorporates both the inter-relations of data instances and their content, with the aim of identifying the most relevant and non-redundant features and extracting the top-ranked ones. An efficient iterative algorithm is developed to optimize this objective function.
To assess the method, three well-known evaluation criteria are examined on several real-world biological datasets, and the results are compared against state-of-the-art feature selection methods. The experiments demonstrate the effectiveness of the proposed approach.
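As a rough illustration of how links between instances can inform unsupervised feature ranking, the sketch below computes the classical Laplacian score on a link graph. This is not the algorithm proposed in the article, whose objective function is not given in the abstract. It is a simple, well-known baseline, and using an interaction (link) adjacency matrix in place of the usual nearest-neighbour graph is an assumption made here for illustration. The function name `link_laplacian_scores` and the toy data are hypothetical.

```python
import numpy as np

def link_laplacian_scores(X, A):
    """Laplacian score of each feature with respect to a link graph.

    X : (n_samples, n_features) data matrix (the "content").
    A : (n_samples, n_samples) symmetric, nonnegative adjacency matrix
        (the "inter-relations", e.g. a PPI network); assumed to contain
        at least one edge.
    Lower scores mean the feature varies smoothly over linked instances.
    Features with no variance get a score of +inf (ranked last).
    """
    X = np.asarray(X, dtype=float)
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                      # node degrees
    L = np.diag(d) - A                     # unnormalized graph Laplacian
    mean = (d @ X) / d.sum()               # degree-weighted mean per feature
    Xc = X - mean                          # centered features
    num = np.sum(Xc * (L @ Xc), axis=0)    # f^T L f for each feature
    den = np.sum(Xc * (d[:, None] * Xc), axis=0)  # f^T D f for each feature
    scores = np.full(X.shape[1], np.inf)
    ok = den > 1e-12
    scores[ok] = num[ok] / den[ok]
    return scores

# Toy graph: two fully linked groups {0,1,2} and {3,4,5}, joined by edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

# Feature 0 follows the linked groups, feature 1 alternates, feature 2 is constant.
X = np.array([
    [1.0, 1.0, 2.0],
    [1.0, 0.0, 2.0],
    [1.0, 1.0, 2.0],
    [0.0, 0.0, 2.0],
    [0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0],
])

scores = link_laplacian_scores(X, A)
ranking = np.argsort(scores)  # best (smoothest over links) first
print(scores, ranking)
```

In this toy example the feature that agrees with the link structure receives the lowest score, which is the intuition behind using inter-relations in addition to content. A score of this kind uses only the links to define smoothness and does not address redundancy among the selected features, which the abstract names as a goal of the proposed method.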
Cite this article
Hoseini, E., Mansoori, E.G. Unsupervised feature selection in linked biological data. Pattern Anal Applic 22, 999–1013 (2019). https://doi.org/10.1007/s10044-018-0707-2
DOI: https://doi.org/10.1007/s10044-018-0707-2