Unsupervised feature selection in linked biological data

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

Feature selection has become essential in many bioinformatics applications, especially when the number of features is very large. For instance, classifying hereditary disease genes/proteins plays a significant role in the prediction and diagnosis of diseases, and this requires some knowledge of how useful each feature is for making predictions. Distinctive features that are relevant to the class labels are decisive in designing efficient classifiers. Indeed, excluding redundant and/or irrelevant features, without incurring much loss of information, can reduce the processing cost while improving the predictor's performance. Consequently, feature selection is a preliminary task in most biological studies. Traditionally, biological data analysis methods use common feature selection techniques that treat the data instances as independent objects and therefore do not consider their possible inter-relations. However, biological instances are often linked; for instance, protein–protein interactions (PPIs) are involved in a wide range of biological processes, including cell-to-cell interactions and metabolic and developmental control. Linked data instances tend to have more similar characteristics than unlinked ones, so accounting for these inter-relations alongside the data content is beneficial in feature selection. To incorporate the data inter-relations (e.g., PPIs in biological data) along with the data content when selecting more effective features, a novel feature selection algorithm is proposed. The method works in an unsupervised manner to handle unlabeled biological data, since most real-world genes/proteins have no label. It optimizes a novel objective function that incorporates both the inter-relations of the data instances and their content, with the aim of identifying the most relevant and non-redundant features and extracting the top-ranked ones. An efficient iterative algorithm is developed to optimize this objective function. To assess the method, three well-known evaluation criteria are applied to several real-world biological datasets, and the results are compared against state-of-the-art feature selection methods. The experiments demonstrate the effectiveness of the proposed approach.
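The abstract does not state the objective function or the iterative update rules, so the proposed algorithm cannot be reproduced from this page. As a rough illustration of the general idea only (scoring features using both the links between instances and their content), the sketch below substitutes a different, simpler technique: a Laplacian-score style criterion computed on a graph that blends a link adjacency matrix with a content similarity matrix. The function name `linked_laplacian_scores`, the mixing weight `alpha`, the cosine content affinity, and the toy data are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def linked_laplacian_scores(X, A, alpha=0.5):
    """Laplacian-score style feature scores on a combined link/content graph.

    X     : (n_instances, n_features) content matrix.
    A     : (n_instances, n_instances) symmetric, non-negative link matrix
            (e.g. a PPI adjacency matrix); assumed to have no self-loops.
    alpha : hypothetical mixing weight; 1.0 uses links only, 0.0 content only.

    Returns one score per feature. Lower means the feature varies more
    smoothly over the graph, which is treated here as "more relevant".
    """
    X = np.asarray(X, dtype=float)
    A = np.asarray(A, dtype=float)

    # Content affinity: cosine similarity between instances, clipped at zero.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    Xn = X / norms
    S = np.clip(Xn @ Xn.T, 0.0, None)
    np.fill_diagonal(S, 0.0)

    # Blend link structure and content similarity into one affinity matrix.
    W = alpha * A + (1.0 - alpha) * S
    d = W.sum(axis=1)            # node degrees
    L = np.diag(d) - W           # unnormalised graph Laplacian

    # Remove the degree-weighted mean of every feature.
    Xc = X - (d @ X) / d.sum()

    num = np.sum(Xc * (L @ Xc), axis=0)           # f^T L f per feature
    den = np.sum(Xc * (d[:, None] * Xc), axis=0)  # f^T D f per feature
    return num / np.maximum(den, 1e-12)


# Toy data (hypothetical): six instances forming two linked triangles.
A_demo = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                A_demo[i, j] = 1.0

# Feature 0 follows the link communities; feature 1 does not.
X_demo = np.array([[1, 1], [1, 5], [1, 1], [5, 5], [5, 1], [5, 5]], dtype=float)

scores_links_only = linked_laplacian_scores(X_demo, A_demo, alpha=1.0)
ranking = np.argsort(scores_links_only)  # ascending: best feature first
```

A one-shot score like this ranks features independently and does not address redundancy between them, which the abstract says the proposed method does handle; it is meant only to make the phrase "inter-relations alongside content" concrete.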

Notes

  1. http://www.ebi.ac.uk/intact/.

  2. http://www.uniprot.org/uniprot/.

  3. http://www.hprd.org/.

  4. http://het.io/disease-genes/.

Author information

Corresponding author

Correspondence to Elham Hoseini.

About this article

Cite this article

Hoseini, E., Mansoori, E.G. Unsupervised feature selection in linked biological data. Pattern Anal Applic 22, 999–1013 (2019). https://doi.org/10.1007/s10044-018-0707-2
