A Comparative Study of Classification-Based Machine Learning Methods for Novel Disease Gene Prediction

Le, Duc-Hau; Xuan Hoai, Nguyen; Kwon, Yung-Keun

doi:10.1007/978-3-319-11680-8_46

Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 326)

Abstract

Predicting novel genes associated with a disease is an important problem in biomedical research. Early approaches to this problem were annotation-based. Later, as high-throughput technologies caused gene/protein interaction data to grow quickly and cover almost the entire genome and proteome, network-based methods became prominent. Besides these two families of methods, the prediction problem can also be approached with machine learning techniques, because it can be formulated as a classification task. To date, a number of supervised learning techniques and various types of gene/protein annotation data have been used to solve the disease gene classification/prediction problem. However, to the best of our knowledge, no study has compared these methods on comprehensive biomedical annotation data. In addition, it is generally accepted that no single classifier outperforms all others on every classification problem. Therefore, in this study, we compare the disease gene prediction performance of several supervised learning techniques that have been used in the literature, such as Decision Tree Learning, k-Nearest Neighbor, Naive Bayesian, Artificial Neural Networks, and Support Vector Machines. We additionally assess Random Forest, a relatively new decision-tree-based ensemble learning method. The simulation results indicate that Random Forest achieved the best performance of all the methods. In addition, all methods remain stable when the set of known disease genes used as positive training samples changes.
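The classification formulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the synthetic features, the 5-fold interleaved split, the accuracy metric, and all function names (`make_toy_genes`, `knn_predict`, `gnb_predict`, `cross_val_accuracy`) are assumptions made for illustration. Known disease genes are labeled 1, all other genes 0, and two of the classifier families named above (k-Nearest Neighbor and a Gaussian form of Naive Bayes) are compared on the same folds.

```python
import math
import random


def make_toy_genes(n_pos=40, n_neg=40, n_features=5, seed=0):
    """Synthetic stand-in for annotation features (assumption, not real data).

    Label 1 = known disease gene, label 0 = other gene.
    """
    rng = random.Random(seed)
    data = []
    for label, n, mean in ((1, n_pos, 1.0), (0, n_neg, 0.0)):
        for _ in range(n):
            features = [rng.gauss(mean, 1.0) for _ in range(n_features)]
            data.append((features, label))
    rng.shuffle(data)
    return data


def knn_predict(train, x, k=5):
    """k-Nearest Neighbor: majority vote among the k closest training genes."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def gnb_predict(train, x):
    """Gaussian Naive Bayes: independent normal density per class and feature."""
    best_label, best_score = None, -math.inf
    for label in (0, 1):
        rows = [f for f, l in train if l == label]
        score = math.log(len(rows) / len(train))  # log class prior
        for j, value in enumerate(x):
            col = [r[j] for r in rows]
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mu) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_val_accuracy(data, predict, folds=5):
    """Interleaved k-fold cross-validation; returns overall accuracy."""
    correct = 0
    for i in range(folds):
        test = data[i::folds]
        train = [d for j, d in enumerate(data) if j % folds != i]
        correct += sum(predict(train, x) == y for x, y in test)
    return correct / len(data)


genes = make_toy_genes()
classifiers = (("kNN", knn_predict), ("NaiveBayes", gnb_predict))
results = {name: cross_val_accuracy(genes, fn) for name, fn in classifiers}
for name, acc in results.items():
    print(f"{name}: accuracy = {acc:.2f}")
```

Evaluating every classifier on identical folds keeps the comparison fair. Rerunning with a different `seed` changes which genes act as positive samples, which is one rough way to probe the kind of stability the abstract reports.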

Author information

Authors and Affiliations

Center for IT Services, Water Resources University, 175 Tay Son, Dong Da, Hanoi, Vietnam
Duc-Hau Le
Information Technology Research and Development Center, Hanoi University, Hanoi, Vietnam
Nguyen Xuan Hoai
School of Electrical Engineering, University of Ulsan, 93 Daehak-ro, Nam-gu, Ulsan, 680-749, Republic of Korea
Yung-Keun Kwon

Editor information

Editors and Affiliations

Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam
Viet-Ha Nguyen
Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam
Anh-Cuong Le
School of Knowledge Science, Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Van-Nam Huynh

Cite this paper

Le, DH., Xuan Hoai, N., Kwon, YK. (2015). A Comparative Study of Classification-Based Machine Learning Methods for Novel Disease Gene Prediction. In: Nguyen, VH., Le, AC., Huynh, VN. (eds) Knowledge and Systems Engineering. Advances in Intelligent Systems and Computing, vol 326. Springer, Cham. https://doi.org/10.1007/978-3-319-11680-8_46

DOI: https://doi.org/10.1007/978-3-319-11680-8_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11679-2
Online ISBN: 978-3-319-11680-8
eBook Packages: Engineering, Engineering (R0)
