Data Intensive Computing: A Biomedical Case Study in Gene Selection and Filtering

Slavik, Michael; Zhu, Xingquan; Mahgoub, Imad; Khoshgoftaar, Taghi; Narayanan, Ramaswamy

doi:10.1007/978-1-4614-1415-5_22

Abstract

In data intensive computing environments, where the number of samples and the number of data dimensions both grow very large, existing gene selection methods in bioinformatics research become ineffective. In this chapter, we propose two approaches for parallel gene selection, both based on the well-known ReliefF feature selection method and designed for cluster computing environments. In the first design, denoted PReliefF_p, the input data are split into non-overlapping subsets that are assigned to cluster nodes. Each node carries out gene selection by running ReliefF on its own subset, without interacting with other nodes. The final gene ranking is generated by gathering the weight vectors from all nodes. In the second design, denoted PReliefF_g, each node dynamically updates global weight vectors, so that the gene selection results from one node can be used to boost the selection process on other nodes. Experimental results on real-world microarray expression data show that PReliefF_p and PReliefF_g achieve nearly linear speedup with respect to the number of nodes used. When combined with several popular classification methods, classifiers built from the genes selected by either method perform as well as, or even better than, classifiers built from the genes selected by the original ReliefF method.
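
The partition-based design (PReliefF_p) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made for illustration. The per-node scorer is a simplified two-class Relief with one nearest hit and one nearest miss rather than full ReliefF, features are assumed to be pre-scaled to [0, 1], the "nodes" are simulated sequentially in one process, and the gathered weight vectors are combined by averaging (the abstract does not state the combination rule). The names `relief_weights` and `prelieff_p` are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Row = Sequence[float]


def _nearest(X: Sequence[Row], y: Sequence[int], i: int, same_class: bool) -> Optional[int]:
    """Index of the sample closest to X[i] (Manhattan distance) with the same or a different label."""
    best, best_d = None, float("inf")
    for j in range(len(X)):
        if j == i or (y[j] == y[i]) != same_class:
            continue
        d = sum(abs(a - b) for a, b in zip(X[i], X[j]))
        if d < best_d:
            best, best_d = j, d
    return best


def relief_weights(X: Sequence[Row], y: Sequence[int], n_iter: int, rng: random.Random) -> List[float]:
    """Simplified Relief (assumption: two classes, one hit and one miss per sampled instance)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(n_iter):
        i = rng.randrange(len(X))
        hit = _nearest(X, y, i, same_class=True)
        miss = _nearest(X, y, i, same_class=False)
        if hit is None or miss is None:
            continue  # this subset lacks a hit or a miss for the sampled instance
        for f in range(n_features):
            # Reward features that differ on the nearest miss, penalize those that differ on the nearest hit.
            w[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n_iter
    return w


def prelieff_p(X: List[Row], y: List[int], n_nodes: int, n_iter: int, seed: int = 0) -> Tuple[List[float], List[int]]:
    """Score each non-overlapping subset independently, then gather and average the weight vectors."""
    partial = []
    for node in range(n_nodes):
        # Round-robin split: sample k goes to node k % n_nodes (an assumed partitioning scheme).
        Xs, ys = X[node::n_nodes], y[node::n_nodes]
        partial.append(relief_weights(Xs, ys, n_iter, random.Random(seed + node)))
    n_features = len(X[0])
    weights = [sum(w[f] for w in partial) / n_nodes for f in range(n_features)]
    ranking = sorted(range(n_features), key=lambda f: -weights[f])  # best feature first
    return weights, ranking


def demo() -> Tuple[List[float], List[int]]:
    """Toy data: feature 0 equals the class label, feature 1 is uniform noise."""
    rng = random.Random(0)
    y = [(i // 4) % 2 for i in range(40)]  # labels arranged so every subset sees both classes
    X = [[float(label), rng.random()] for label in y]
    return prelieff_p(X, y, n_nodes=4, n_iter=20, seed=1)


if __name__ == "__main__":
    weights, ranking = demo()
    print("weights:", weights)
    print("ranking:", ranking)
```

Because each subset is scored without reference to the others, the loop over nodes in `prelieff_p` has no cross-iteration dependencies, which is the property that lets the work be spread over cluster nodes. A sketch of PReliefF_g would differ in that the nodes share and update weight vectors during scoring instead of only combining them at the end.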

Notes

1. The project described was supported by Award Number R01GM086707 from the National Institute of General Medical Sciences (NIGMS) at the National Institutes of Health (NIH). The content is solely the responsibility of the authors and does not necessarily represent the official views of NIGMS or NIH.

Author information

Authors and Affiliations

Department of Computer & Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, 33431, USA
Michael Slavik, Imad Mahgoub & Taghi Khoshgoftaar
Centre for Quantum Computation and Intelligent Systems, University of Technology, Sydney, NSW, 2007, Australia
Xingquan Zhu
Charles E. Schmidt College of Science, Florida Atlantic University, Boca Raton, FL, 33431, USA
Ramaswamy Narayanan

Corresponding author

Correspondence to Xingquan Zhu.

Editor information

Editors and Affiliations

Dept. of Computer Science & Engineering, Florida Atlantic University, Boca Raton, 33431, Florida, USA
Borko Furht
LexisNexis, Boca Raton, 33487, Florida, USA
Armando Escalante

About this chapter

Cite this chapter

Slavik, M., Zhu, X., Mahgoub, I., Khoshgoftaar, T., Narayanan, R. (2011). Data Intensive Computing: A Biomedical Case Study in Gene Selection and Filtering. In: Furht, B., Escalante, A. (eds) Handbook of Data Intensive Computing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1415-5_22

DOI: https://doi.org/10.1007/978-1-4614-1415-5_22
Published: 11 November 2011
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1414-8
Online ISBN: 978-1-4614-1415-5
eBook Packages: Computer Science, Computer Science (R0)
