A Comparative Study of Classification-Based Machine Learning Methods for Novel Disease Gene Prediction

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 326)

Abstract

Predicting novel genes associated with a disease is an important problem in biomedical research. Early approaches to this problem were annotation-based. Later, high-throughput technologies produced gene/protein interaction data that grew quickly and now cover almost the entire genome and proteome, so network-based methods have become prominent. Besides these two families of methods, the prediction problem can also be approached with machine learning techniques, because it can be formulated as a classification task. To date, a number of supervised learning techniques and various types of gene/protein annotation data have been used to solve the disease gene classification/prediction problem. However, to the best of our knowledge, no study has compared these methods on comprehensive biomedical annotation data. In addition, it is generally accepted that no single classifier outperforms all others on every classification problem. Therefore, in this study, we compare the disease gene prediction performance of several supervised learning techniques that have been used in the literature, such as Decision Tree Learning, k-Nearest Neighbor, Naive Bayes, Artificial Neural Networks and Support Vector Machines. We additionally assess Random Forest, a relatively new decision-tree-based ensemble learning method. The simulation results indicate that Random Forest achieved the best performance of all the methods. Also, all methods are stable with respect to changes in the set of known disease genes used as positive training samples.
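As an illustration of the kind of comparison the abstract describes, the following is a minimal sketch in plain Python, not the authors' experimental setup. It assumes synthetic two-feature data in place of real gene annotations, compares only two of the listed classifiers (k-Nearest Neighbor and Gaussian Naive Bayes), and scores them by accuracy under 5-fold cross-validation. All function names, the data generator, and the parameter values are hypothetical choices made for this example.

```python
import math
import random
from statistics import mean, pstdev


def make_data(n_per_class=60, seed=0):
    """Synthetic stand-in for gene feature vectors (assumption, not real data).

    Label 1 plays the role of a known disease gene, label 0 of any other gene.
    """
    rng = random.Random(seed)
    data = []
    for label, centre in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            features = [rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)]
            data.append((features, label))
    rng.shuffle(data)
    return data


def knn_predict(train, x, k=5):
    """Majority vote among the k training samples closest to x."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def gnb_fit(train):
    """Estimate a class prior and per-feature mean/std for each class."""
    model = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        cols = list(zip(*rows))
        model[label] = (
            len(rows) / len(train),
            [mean(c) for c in cols],
            [max(pstdev(c), 1e-6) for c in cols],  # guard against zero std
        )
    return model


def gnb_predict(model, x):
    """Pick the class with the larger Gaussian log-posterior."""
    def log_posterior(label):
        prior, mus, sigmas = model[label]
        total = math.log(prior)
        for value, mu, sd in zip(x, mus, sigmas):
            total += -math.log(sd) - 0.5 * ((value - mu) / sd) ** 2
        return total
    return max((0, 1), key=log_posterior)


def cross_validate(data, n_folds=5):
    """Return the cross-validated accuracy of each classifier."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    correct = {"kNN": 0, "NaiveBayes": 0}
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = gnb_fit(train)
        for x, y in test:
            correct["kNN"] += knn_predict(train, x) == y
            correct["NaiveBayes"] += gnb_predict(model, x) == y
    return {name: hits / len(data) for name, hits in correct.items()}


if __name__ == "__main__":
    for name, accuracy in cross_validate(make_data()).items():
        print(f"{name}: accuracy = {accuracy:.3f}")
```

Rerunning `cross_validate(make_data(seed=s))` for several seeds gives a rough analogue of checking stability when the positive training samples change, although a real study would resample actual known disease genes and would likely use a ranking-oriented metric in addition to accuracy.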

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Le, DH., Xuan Hoai, N., Kwon, YK. (2015). A Comparative Study of Classification-Based Machine Learning Methods for Novel Disease Gene Prediction. In: Nguyen, VH., Le, AC., Huynh, VN. (eds) Knowledge and Systems Engineering. Advances in Intelligent Systems and Computing, vol 326. Springer, Cham. https://doi.org/10.1007/978-3-319-11680-8_46

  • DOI: https://doi.org/10.1007/978-3-319-11680-8_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11679-2

  • Online ISBN: 978-3-319-11680-8

  • eBook Packages: Engineering, Engineering (R0)
