Unsupervised feature selection in linked biological data

Hoseini, Elham; Mansoori, Eghbal G.

doi:10.1007/s10044-018-0707-2

Theoretical Advances
Published: 27 April 2018

Volume 22, pages 999–1013, (2019)
Pattern Analysis and Applications

Abstract

Feature selection has become a clear need in many bioinformatics applications, especially when the number of features is very large. For instance, the classification of hereditary disease genes/proteins plays a significant role in the prediction and diagnosis of diseases, and this requires some knowledge of how useful each feature is for making predictions. Distinctive features and their relevance to the class labels are decisive in designing efficient classifiers. Indeed, excluding redundant and/or irrelevant features, without incurring much loss of information, can reduce the processing cost while improving the predictor's performance. Consequently, feature selection is a preliminary task in most biological studies. Traditionally, biological data analysis relies on common feature selection techniques that treat the data instances as independent objects and so do not consider possible inter-relations among them. Protein–protein interactions (PPIs), for instance, govern a wide range of biological processes, including cell-to-cell interactions and metabolic and developmental control. Linked instances tend to have more similar characteristics than unlinked ones, so accounting for these inter-relations alongside the data content should be beneficial in feature selection. To incorporate the data inter-relations (e.g., PPIs in biological data) along with the data content in selecting more effective features, a novel feature selection algorithm is proposed. The method works in an unsupervised manner to handle unlabeled biological data, since most real-world genes/proteins have no label. It optimizes a novel objective function that incorporates both the inter-relations of the data instances and their content, with the aim of identifying the most relevant and non-redundant features and extracting the top-ranked ones. An efficient iterative algorithm is developed to optimize this objective function.
To assess the method, three well-known evaluation criteria are applied to several real-world biological datasets, and the results are compared against state-of-the-art feature selection methods. The experiments demonstrate the effectiveness of the proposed approach.
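
The abstract does not state the objective function, so the sketch below is not the proposed algorithm. It only illustrates the underlying intuition that features varying little across linked instances are preferable, using the well-known Laplacian score as a stand-in technique with a link graph (for example, a PPI network) as the similarity matrix. The function names `laplacian_scores` and `select_top_k` and the toy data are assumptions made for illustration.

```python
def laplacian_scores(X, A):
    """Laplacian score of each feature; lower means smoother over the links.

    X: list of n rows, each a list of d feature values (data content).
    A: symmetric n x n list of non-negative link weights (e.g. 1 if two
       proteins interact, 0 otherwise). Assumed to have at least one link.
    """
    n, d = len(X), len(X[0])
    deg = [sum(row) for row in A]          # node degrees
    total = sum(deg)
    scores = []
    for r in range(d):
        f = [X[i][r] for i in range(n)]
        # Remove the degree-weighted mean of the feature.
        mean = sum(deg[i] * f[i] for i in range(n)) / total
        f = [v - mean for v in f]
        # f^T L f = 1/2 * sum_ij A_ij (f_i - f_j)^2: variation across links.
        smooth = 0.5 * sum(A[i][j] * (f[i] - f[j]) ** 2
                           for i in range(n) for j in range(n))
        # f^T D f: degree-weighted variance of the feature.
        var = sum(deg[i] * f[i] ** 2 for i in range(n))
        # Constant features carry no information; rank them last.
        scores.append(smooth / var if var > 1e-12 else float("inf"))
    return scores


def select_top_k(X, A, k):
    """Indices of the k features with the lowest Laplacian score."""
    scores = laplacian_scores(X, A)
    return sorted(range(len(scores)), key=lambda r: scores[r])[:k]


# Toy data: two groups of three mutually linked instances.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
# Feature 0 follows the link structure, feature 1 ignores it,
# feature 2 is constant.
X = [
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

print(laplacian_scores(X, A))
print(select_top_k(X, A, 1))
```

On the toy data, the feature that agrees within each linked group receives the lowest score and is selected first. The Laplacian score ranks features one at a time and therefore does not address redundancy between features, which the proposed method is stated to handle.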

Author information

Authors and Affiliations

Shiraz University, Shiraz, Iran
Elham Hoseini & Eghbal G. Mansoori

Corresponding author

Correspondence to Elham Hoseini.

About this article

Cite this article

Hoseini, E., Mansoori, E.G. Unsupervised feature selection in linked biological data. Pattern Anal Applic 22, 999–1013 (2019). https://doi.org/10.1007/s10044-018-0707-2

Received: 21 April 2017
Accepted: 08 April 2018
Published: 27 April 2018
Issue Date: 01 August 2019
DOI: https://doi.org/10.1007/s10044-018-0707-2
