Skip to main content

Data Intensive Computing: A Biomedical Case Study in Gene Selection and Filtering

  • Chapter
  • First Online:
Handbook of Data Intensive Computing

Abstract

In data intensive computing environments where the number of samples and data dimensions grow sufficiently large, existing methods in Bioinformatics research are not effective for selecting important genes. In this chapter, we propose two approaches for parallel selection of genes, both are based on the well known { ReliefF} feature selection method and cluster computing environments. In the first design, denoted by { PReliefF} p , the input data are split into non-overlapping subsets assigned to cluster nodes. Each node carries out gene selection by using the { ReliefF} method on its own subset, without interaction with other clusters. The final ranking of the genes for selection is generated by gathering weight vectors from all nodes. In the second design, namely { PReliefF} g , each node dynamically updates global weight vectors so the gene selection results in one node can be used to boost the selection process for other nodes. Experimental results from real-world microarray expression data show that { PReliefF} p and { PReliefF} g nearly perfectly speedup to the number of nodes involved in the computing. When combined with several popular classification methods, the classifiers built from the genes selected from both methods have the same or even better methods than the genes selected from the original ReliefF method.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 89.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 119.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 169.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    The project described was supported by Award Number R01GM086707 from the National Institute Of General Medical Sciences (NIGMS) at the National Institutes of Health (NIH). The content is solely the responsibility of the authors and does not necessarily represent the official views of NIGMS or NIH.

References

  1. Moore, R., Baru, C., Marciano, R., Rajasekar, A., and Wan, M., Data-Intensive Computing, in, The Grid: Blueprint for a New Computing Infrastructure, Foster, I., and C. Kesselman, Morgan Kaufmann, San Francisco, 1999.

    Google Scholar 

  2. Rosenthal, A., Mork, P., Li, M., Stanford, J., Koester, D., and Reynolds, P., Cloud computing: A new business paradigm for biomedical information sharing, Journal of Biomedical Informatics, 43(2):342–353, 2010.

    Article  Google Scholar 

  3. Liora, X.: Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using Meandre, in Proceedings of the 11th Annual conference on Genetic and evolutionary computation, GECCO, (2009).

    Google Scholar 

  4. Fox, G., Qiu, X., Beason, S., Choi, J., Ekanayake, J., Gunarathne, T., Rho, M., Tang, H., Devadasan, N., and Liu, G., Biomedical Case Studies in Data Intensive Computing, in Proceedings of the 1st International Conference on Cloud Computing, CloudCom’09, (2009).

    Google Scholar 

  5. Zhu, X., Li, B., Wu, X., He, D., and Zhang, C., CLAP: Collaborative Pattern Mining for Distributed Information Systems, Decision Support Systems, http://you.myipcn.org/science/article/pii/S0167923611001102, (2011).

  6. Slavik, M. and Zhu, X. and Mahgoub, I. and Shoaib, M.: Parallel Selection of Informative Genes for Classification, in Proceedings of the First International Conference on Bioinformatics and Computational Biology (BICoB), New Orleans, April (2009).

    Google Scholar 

  7. Kamal, A., Gene Selection for Sample Sets with Biased Distributes, Master Thesis, Florida Atlantic University, http://www.cse.fau.edu/ \(\tilde{}\)xqzhu/students/akamal_thesis_2009.pdf, (2009)

  8. Researchers Pinpoint Genes Involved in Breast Cancer Growth, Cancer Celll, University of Illinois at Chicago, http://www.hopkinsbreastcenter.org/artemis/200308/feature6.html, July 22, (2003).

  9. Logsdon, C., Simeone, D., Binkley, C., Arumugam, T., Greenson, J., Giordano, T., Misek, D., and Hanash, S., Molecular profiling of pancreatic adenocarcinoma and chronic pancreatitis identifies multiple genes differentially regulated in pancreatic cancer, Cancer Research, 63:2649–2657, (2003).

    Google Scholar 

  10. Golub, T. et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286:531–537, (1999).

    Article  Google Scholar 

  11. Xiong, M. et al.: Biomarker identification by feature wrappers, Genome Research, 11: 1878–1887, (2001).

    Google Scholar 

  12. Baker, S. and Kramer, B.: Identifying genes that contribute most to good classification in microarrays, BMC Bioinformatics, 7:407, (2006).

    Article  Google Scholar 

  13. Segal, E. et al.: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics 34(2):166–176, 2003

    Article  Google Scholar 

  14. Quinlan, J.: C4.5: Programs for Machine learning M. Kaufmann (1993)

    Google Scholar 

  15. Hua, J. et al.: Optimal number of features as a function of sample size for various classification rules. Bioinformatics, 21:1509–1515, (2005).

    Article  Google Scholar 

  16. Zhan, J. and Deng, H., Gene selection for classification of microarray data based on the Bayes error, BMC Bioinformatics, 8:370, (2007).

    Article  Google Scholar 

  17. Diaz, R. and Alvarez, S.: Gene selection and classification of microarray data using random forest, BMC Bioinformatics, 7:3, (2006).

    Article  Google Scholar 

  18. Mamitsuka, H.: Selecting features in microarray classification using ROC curves, Pattern Recognition, 39:2393–2404, (2006).

    Article  MATH  Google Scholar 

  19. Dobbin, K. et al.: How large a training set is needed to develop a classifier for microarray data, Clinical Cancer Research, 14(1), (2008).

    Google Scholar 

  20. Mukherjee, S. and Roberts, S.: A Theoretical Analysis of Gene Selection, Proc. of IEEE Computer Society Bioinformatics Conference, 131–141, 2004.

    Google Scholar 

  21. Li T. et al., A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression, Bioinformatics, 20:2429–2437, 2004

    Article  Google Scholar 

  22. Statnikov A. et al., A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5):631–643, 2005.

    Article  Google Scholar 

  23. Witten, Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques Morgan Kaufmann (1999)

    Google Scholar 

  24. Plackett, R., Karl Pearson and the Chi-Squared Test. International Statistical Review, 51(1): 59–72, 1983

    Article  MATH  MathSciNet  Google Scholar 

  25. Robnik-Šikonja, Marko, Kononenko, Igor: Theoretical and Empirical Analysis of ReliefF and RReliefF Mach. Learn., Vol. 53, 23–69 (2003)

    Google Scholar 

  26. Gropp, W. et al.: MPICH2 User’s Guide Avail: http://www.mcs.anl.gov/research/projects/mpich2/index.php (2008)

  27. Kohavi, R. and John, G, Wrappers for Feature Subset Selection, Artificial Intelligence, 97(1-2):273–324, 1997.

    Article  MATH  Google Scholar 

  28. Kent Ridge Biomedical Data Set Repository, http://sdmc.i2r.a-star.edu.sg/rp/

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Xingquan Zhu .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2011 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Slavik, M., Zhu, X., Mahgoub, I., Khoshgoftaar, T., Narayanan, R. (2011). Data Intensive Computing: A Biomedical Case Study in Gene Selection and Filtering. In: Furht, B., Escalante, A. (eds) Handbook of Data Intensive Computing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1415-5_22

Download citation

  • DOI: https://doi.org/10.1007/978-1-4614-1415-5_22

  • Published:

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-1414-8

  • Online ISBN: 978-1-4614-1415-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics