Parallel Selection of Informative Genes for Classification

  • Michael Slavik
  • Xingquan Zhu
  • Imad Mahgoub
  • Muhammad Shoaib
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5462)

Abstract

In this paper, we argue that existing gene selection methods become ineffective at selecting important genes when the number of samples and the data dimensionality grow sufficiently large. As a solution, we propose two approaches for parallel gene selection, both based on the well-known ReliefF feature selection method. In the first design, denoted PReliefF_p, the input data are split into non-overlapping subsets assigned to cluster nodes. Each node carries out gene selection by applying ReliefF to its own subset, without interacting with the other nodes. The final ranking of the genes is generated by gathering the weight vectors from all nodes. In the second design, PReliefF_g, each node dynamically updates a global weight vector, so the gene selection results of one node can boost the selection on the other nodes. Experimental results on real-world microarray expression data show that PReliefF_p and PReliefF_g achieve a speedup factor nearly equal to the number of nodes. When combined with several popular classification methods, classifiers built from the genes selected by either method achieve the same or better accuracy than classifiers built from the genes selected by the original ReliefF method.
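
To make the two designs concrete, below is a minimal sketch of PReliefF_p in Python. This is not the authors' implementation (the paper's cluster implementation presumably uses MPI, as MPICH2 is cited in the references); the function names relieff and prelieff_p are illustrative, and the sketch simplifies ReliefF to a single nearest hit and miss per instance rather than averaging over k nearest neighbors. Each worker runs ReliefF on its own disjoint subset of samples, and the local weight vectors are merged into one gene ranking. PReliefF_g would instead synchronize the weight vector across workers during the main loop (e.g., via an MPI allreduce) so every node benefits from the others' updates.

    # Hypothetical sketch of PReliefF_p: disjoint data partitions, local
    # ReliefF per "node", merged weight vectors. Names and the k=1
    # hit/miss simplification are illustrative assumptions.
    import numpy as np
    from multiprocessing import Pool

    def relieff(args):
        X, y = args
        n, d = X.shape
        span = X.max(axis=0) - X.min(axis=0)   # per-gene value range
        span[span == 0] = 1.0                  # avoid division by zero
        w = np.zeros(d)
        for i in range(n):                     # use every sample as R (m = n)
            dist = np.abs(X - X[i]).sum(axis=1)        # L1 distances to R
            dist[i] = np.inf                           # exclude R itself
            same, diff = y == y[i], y != y[i]
            hit = np.where(same, dist, np.inf).argmin()   # nearest hit
            miss = np.where(diff, dist, np.inf).argmin()  # nearest miss
            # reward genes separating classes, penalize genes varying
            # within a class (normalized per gene by its value range)
            w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / (span * n)
        return w

    def prelieff_p(X, y, nodes=4):
        # Split samples into disjoint subsets, run ReliefF independently
        # on each "node", then merge local weight vectors by averaging.
        parts = np.array_split(np.arange(len(y)), nodes)
        with Pool(nodes) as pool:
            ws = pool.map(relieff, [(X[p], y[p]) for p in parts])
        return np.mean(ws, axis=0)             # final gene weights

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.random((200, 1000))                # 200 samples, 1000 genes
        y = (X[:, 0] + X[:, 1] > 1).astype(int)    # genes 0, 1 are informative
        w = prelieff_p(X, y)
        print("top genes:", np.argsort(w)[::-1][:5])

In this partitioned scheme the workers never communicate until the final merge, which is why the speedup can approach the number of nodes: the only sequential cost is the split and the weight-vector gather at the end.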

Keywords

Support Vector Machine · Weight Vector · Gene Selection · Main Loop · Informative Gene

References

  1. Golub, T., et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999)
  2. Xiong, M., et al.: Biomarker identification by feature wrappers. Genome Research 11, 1878–1887 (2001)
  3. Baker, S., Kramer, B.: Identifying genes that contribute most to good classification in microarrays. BMC Bioinformatics 7, 407 (2006)
  4. Segal, E., et al.: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics 34(2), 166–176 (2003)
  5. Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
  6. Hua, J., et al.: Optimal number of features as a function of sample size for various classification rules. Bioinformatics 21, 1509–1515 (2005)
  7. Zhan, J., Deng, H.: Gene selection for classification of microarray data based on the Bayes error. BMC Bioinformatics 8, 370 (2007)
  8. Diaz, R., Alvarez, S.: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7, 3 (2006)
  9. Mamitsuka, H.: Selecting features in microarray classification using ROC curves. Pattern Recognition 39, 2393–2404 (2006)
  10. Dobbin, K., et al.: How large a training set is needed to develop a classifier for microarray data? Clinical Cancer Research 14(1) (2008)
  11. Mukherjee, S., Roberts, S.: A Theoretical Analysis of Gene Selection. In: Proc. of IEEE Computer Society Bioinformatics Conference, pp. 131–141 (2004)
  12. Li, T., et al.: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 20, 2429–2437 (2004)
  13. Statnikov, A., et al.: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21(5), 631–643 (2005)
  14. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco (1999)
  15. Plackett, R.: Karl Pearson and the Chi-Squared Test. International Statistical Review 51(1), 59–72 (1983)
  16. Robnik-Šikonja, M., Kononenko, I.: Theoretical and Empirical Analysis of ReliefF and RReliefF. Machine Learning 53, 23–69 (2003)
  17. Gropp, W., et al.: MPICH2 User's Guide (2008), http://www.mcs.anl.gov/research/projects/mpich2/index.php
  18. Kohavi, R., John, G.: Wrappers for Feature Subset Selection. Artificial Intelligence 97(1-2), 273–324 (1997)
  19. Kent Ridge Biomedical Data Set Repository, http://sdmc.i2r.a-star.edu.sg/rp/

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Michael Slavik¹
  • Xingquan Zhu¹
  • Imad Mahgoub¹
  • Muhammad Shoaib¹

  1. Department of Computer Science & Engineering, Florida Atlantic University, Boca Raton, USA