Disease gene identification by using graph kernels and Markov random fields

Abstract

Genes associated with similar diseases are often functionally related. This principle is supported by many biological data sources, such as disease phenotype similarities, protein complexes, protein-protein interactions, pathways and gene expression profiles. Integrating multiple types of biological data is an effective way to identify disease genes for many genetic diseases. To capture gene-disease associations based on biological networks, a kernel-based MRF method is proposed that combines graph kernels with the Markov random field (MRF) method. In the proposed method, three kinds of kernels are employed to describe the overall relationships of vertices in five biological networks, and a novel weighted MRF method is developed to integrate those data. In addition, an improved Gibbs sampling procedure and a novel parameter estimation method are proposed to generate predictions from the kernel-based MRF method. Numerical experiments are carried out by integrating known gene-disease associations, protein complexes, protein-protein interactions, pathways and gene expression profiles. The proposed kernel-based MRF method is evaluated using leave-one-out cross-validation and achieves an AUC score of 0.771 when all of these data sources are integrated, which suggests that it is promising compared with many existing methods.
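To make the combination of a graph kernel with an MRF more concrete, the following is a minimal sketch on a toy network, not the method proposed in the article. The abstract does not name the three kernels, the weighting scheme, or the improved sampler, so everything here is an assumption chosen for illustration: a diffusion kernel K = exp(-beta L) on a single network, a plain auto-logistic conditional with hand-picked parameters `alpha` and `gamma`, and a standard Gibbs sampler. All function and variable names are hypothetical.

```python
import numpy as np


def diffusion_kernel(adjacency, beta=0.5):
    """Diffusion kernel K = exp(-beta * L), where L = D - A is the graph Laplacian.

    Computed through the eigendecomposition of the symmetric matrix L.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T


def gibbs_marginals(kernel, labels, unknown, alpha=-1.0, gamma=4.0,
                    n_sweeps=500, burn_in=100, seed=0):
    """Estimate P(x_i = 1) for unlabeled vertices with a plain Gibbs sampler.

    Assumed conditional (illustrative, not the article's weighted model):
        logit P(x_i = 1 | rest) = alpha + gamma * sum_{j != i} K_ij * x_j
    """
    rng = np.random.default_rng(seed)
    weights = kernel.copy()
    np.fill_diagonal(weights, 0.0)        # a vertex does not influence itself
    x = np.array(labels, dtype=float)
    x[unknown] = rng.integers(0, 2, size=len(unknown))  # random initial states
    totals = np.zeros(len(unknown))
    for sweep in range(n_sweeps):
        for k, i in enumerate(unknown):
            logit = alpha + gamma * (weights[i] @ x)
            p = 1.0 / (1.0 + np.exp(-logit))
            x[i] = float(rng.random() < p)  # resample x_i from its conditional
            if sweep >= burn_in:
                totals[k] += p              # average conditional probabilities
    return totals / (n_sweeps - burn_in)


# Toy network: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
adjacency = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adjacency[u, v] = adjacency[v, u] = 1.0

K = diffusion_kernel(adjacency, beta=0.5)

# Vertices 0 and 1 are known disease genes (1); 4 and 5 are treated as
# non-disease (0); vertices 2 and 3 are unlabeled candidates.
labels = [1, 1, 0, 0, 0, 0]
unknown = [2, 3]
marginals = gibbs_marginals(K, labels, unknown)
print(dict(zip(unknown, marginals)))
```

In this toy setting vertex 2 sits in the same triangle as both known disease genes, so its estimated marginal is higher than that of vertex 3. The kernel lets vertices that are not directly connected still influence each other, which is the motivation for using kernels rather than raw adjacency in the neighborhood term.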

Author information

Corresponding author

Correspondence to Fang-Xiang Wu.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Chen, B., Li, M., Wang, J. et al. Disease gene identification by using graph kernels and Markov random fields. Sci. China Life Sci. 57, 1054–1063 (2014). https://doi.org/10.1007/s11427-014-4745-8

  • DOI: https://doi.org/10.1007/s11427-014-4745-8

Keywords

  • disease gene identification
  • data integration
  • Markov random field
  • graph kernel
  • Bayesian analysis