Feature selection using feature dissimilarity measure and density-based clustering: Application to biological data

Journal of Biosciences

Abstract

Dimensionality reduction has become a routine step in modelling complex biological systems. Many feature selection techniques have been reported in the literature to improve model performance in terms of accuracy and speed. In the present article, an unsupervised feature selection technique is proposed that uses the maximum information compression index as the dissimilarity measure and the well-known density-based clustering technique DBSCAN to identify the largest natural group of dissimilar features. The algorithm is fast and relatively insensitive to user-supplied parameters. Moreover, the method automatically determines how many features are required and identifies them. We used the proposed method to reduce the dimensionality of a number of benchmark data sets of varying sizes, and compared its performance extensively with that of other well-known feature selection methods.
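To make the dissimilarity measure concrete, the following is a minimal sketch of the maximum information compression index for a pair of features, assuming its standard definition as the smaller eigenvalue of the 2x2 covariance matrix of the pair. The function names `mici` and `dissimilarity_matrix` are hypothetical, and the abstract does not give the authors' own implementation or the way DBSCAN is applied to the result.

```python
import math


def mici(x, y):
    """Smaller eigenvalue of the 2x2 covariance matrix of features x and y.

    Assumed definition of the maximum information compression index:
    it is 0 when the two features are linearly dependent and grows as
    they become less linearly related.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population variances and covariance of the feature pair.
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    trace = var_x + var_y
    det = var_x * var_y - cov_xy ** 2
    # Clamp to zero to guard against tiny negative values from rounding.
    disc = max(trace * trace - 4.0 * det, 0.0)
    return max(0.5 * (trace - math.sqrt(disc)), 0.0)


def dissimilarity_matrix(features):
    """Symmetric matrix of pairwise index values; each item is one feature column."""
    m = len(features)
    d = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            d[i][j] = d[j][i] = mici(features[i], features[j])
    return d
```

A matrix of this kind could be handed to any DBSCAN implementation that accepts precomputed distances. How the clusters are then turned into a selected feature subset is specific to the article and is not reproduced here.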



Author information

Corresponding author

Correspondence to Sanghamitra Bandyopadhyay.

Additional information

[Sengupta D, Aich I and Bandyopadhyay S 2015 Feature selection using feature dissimilarity measure and density-based clustering: Application to biological data. J. Biosci.] DOI 10.1007/s12038-015-9556-y

About this article

Cite this article

Sengupta, D., Aich, I. & Bandyopadhyay, S. Feature selection using feature dissimilarity measure and density-based clustering: Application to biological data. J Biosci 40, 721–730 (2015). https://doi.org/10.1007/s12038-015-9556-y

  • DOI: https://doi.org/10.1007/s12038-015-9556-y
