Abstract
In data-intensive computing environments, where the number of samples and data dimensions grows very large, existing methods in bioinformatics research are not effective for selecting important genes. In this chapter, we propose two approaches for parallel selection of genes, both based on the well-known ReliefF feature selection method and on cluster computing environments. In the first design, denoted PReliefF_p, the input data are split into non-overlapping subsets that are assigned to cluster nodes. Each node carries out gene selection by applying ReliefF to its own subset, without interacting with other nodes. The final ranking of the genes is generated by gathering the weight vectors from all nodes. In the second design, denoted PReliefF_g, each node dynamically updates global weight vectors, so that the gene selection results from one node can be used to boost the selection process on other nodes. Experimental results on real-world microarray expression data show that PReliefF_p and PReliefF_g achieve near-perfect speedup relative to the number of nodes involved in the computation. When combined with several popular classification methods, the classifiers built from the genes selected by both methods perform as well as, or even better than, those built from the genes selected by the original ReliefF method.
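The first design described above (split the samples, score genes locally, gather the weight vectors) can be illustrated with a minimal Python sketch. It is not the chapter's implementation and rests on several assumptions. A simplified two-class Relief with one nearest hit and one nearest miss stands in for full ReliefF. Expression values are assumed to be scaled to [0, 1]. A plain loop stands in for the cluster nodes. The gathered weight vectors are combined by averaging, which is only one possible rule, since the abstract does not state how they are merged. All function names (`relief_weights`, `split_samples`, `prelieff_p`, `rank_genes`) are hypothetical.

```python
from typing import List, Sequence, Tuple

# One sample: (expression values scaled to [0, 1], class label)
Sample = Tuple[Sequence[float], int]


def _distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Manhattan distance between two expression profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def relief_weights(samples: List[Sample]) -> List[float]:
    """Simplified Relief: one nearest hit and one nearest miss per sample.

    This is a stand-in for ReliefF (assumption), which uses k neighbours
    and handles multiple classes and missing values.
    """
    n_genes = len(samples[0][0])
    weights = [0.0] * n_genes
    for i, (xi, yi) in enumerate(samples):
        hits = [x for j, (x, y) in enumerate(samples) if j != i and y == yi]
        misses = [x for x, y in samples if y != yi]
        if not hits or not misses:
            continue  # cannot score this sample inside this subset
        hit = min(hits, key=lambda x: _distance(xi, x))
        miss = min(misses, key=lambda x: _distance(xi, x))
        for g in range(n_genes):
            # Reward genes that differ across classes, penalise genes
            # that differ within the same class.
            weights[g] += (abs(xi[g] - miss[g]) - abs(xi[g] - hit[g])) / len(samples)
    return weights


def split_samples(samples: List[Sample], n_nodes: int) -> List[List[Sample]]:
    """Split the samples into n_nodes non-overlapping subsets (round-robin)."""
    return [samples[k::n_nodes] for k in range(n_nodes)]


def prelieff_p(samples: List[Sample], n_nodes: int) -> List[float]:
    """Score genes independently per subset, then gather the weight vectors.

    The loop below stands in for independent cluster nodes; averaging the
    gathered vectors is an assumed combination rule.
    """
    local = [relief_weights(part) for part in split_samples(samples, n_nodes)]
    n_genes = len(samples[0][0])
    return [sum(w[g] for w in local) / n_nodes for g in range(n_genes)]


def rank_genes(weights: List[float]) -> List[int]:
    """Return gene indices ordered from highest to lowest weight."""
    return sorted(range(len(weights)), key=lambda g: weights[g], reverse=True)
```

Because each subset is scored without reference to the others, the per-node work in this sketch needs no communication until the final gather step, which is the property the first design relies on. In a real cluster the loop in `prelieff_p` would be replaced by one process per node, for example through an MPI gather.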
Notes
1. The project described was supported by Award Number R01GM086707 from the National Institute of General Medical Sciences (NIGMS) at the National Institutes of Health (NIH). The content is solely the responsibility of the authors and does not necessarily represent the official views of NIGMS or NIH.
Copyright information
© 2011 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Slavik, M., Zhu, X., Mahgoub, I., Khoshgoftaar, T., Narayanan, R. (2011). Data Intensive Computing: A Biomedical Case Study in Gene Selection and Filtering. In: Furht, B., Escalante, A. (eds) Handbook of Data Intensive Computing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1415-5_22
DOI: https://doi.org/10.1007/978-1-4614-1415-5_22
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1414-8
Online ISBN: 978-1-4614-1415-5
eBook Packages: Computer Science, Computer Science (R0)