Disease Gene Classification with Metagraph Representations

  • Sezin Kircali Ata
  • Yuan Fang
  • Min Wu
  • Xiao-Li Li
  • Xiaokui Xiao
Part of the Methods in Molecular Biology book series (MIMB, volume 1807)


This chapter exploits metagraphs, network-based representations of proteins in a protein-protein interaction network, to identify candidate disease-causing proteins. Protein-protein interaction (PPI) networks are effective tools for studying the functional roles of proteins in the development of various diseases. However, they are insufficient without additional biological knowledge about proteins, such as their molecular functions and biological processes. To enhance PPI networks, we therefore also use the biological properties of individual proteins. More specifically, we integrate keywords from the UniProt database that describe protein properties into the PPI network and construct a novel heterogeneous PPI-Keyword (PPIK) network consisting of both proteins and keywords. Because proteins with similar functional duties or involved in the same metabolic pathway tend to have similar topological characteristics, we propose to represent them with metagraphs. Compared to a traditional network motif or subgraph, a metagraph can capture topological arrangements through not only protein-protein interactions but also protein-keyword associations. We feed these metagraph representations into classifiers for disease protein prediction and conduct experiments on three different PPI databases. The results show that the proposed method consistently increases disease protein prediction performance across various classifiers, by 15.3% in AUC on average. It outperforms the diffusion-based (e.g., RWR) and module-based baselines by 13.8–32.9% in overall disease protein prediction. For breast cancer protein prediction, it outperforms RWR, PRINCE, and the module-based baselines by 6.6–14.2%. Finally, our predictions also correlate better with literature findings from the PubMed database.
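The idea of a PPIK network and metagraph-based protein representations can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy proteins and keywords are invented, the function names `build_ppik` and `metagraph_features` are hypothetical, and the three counted patterns are simple stand-ins, not the metagraphs actually used in the chapter.

```python
from collections import defaultdict
from itertools import combinations


def build_ppik(ppi_edges, protein_keywords):
    """Build a toy PPIK network: protein-protein adjacency plus protein-keyword sets."""
    neighbors = defaultdict(set)
    for a, b in ppi_edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    keywords = {p: set(kws) for p, kws in protein_keywords.items()}
    return dict(neighbors), keywords


def metagraph_features(p, neighbors, keywords):
    """Count instances of three illustrative (hypothetical) patterns around protein p."""
    kw_p = keywords.get(p, set())
    nbrs = neighbors.get(p, set())
    # Pattern 1: p interacts with q, and both are annotated with the same keyword.
    m1 = sum(len(kw_p & keywords.get(q, set())) for q in nbrs)
    # Pattern 2: two interaction partners of p share a keyword with each other.
    m2 = sum(
        len(keywords.get(q, set()) & keywords.get(r, set()))
        for q, r in combinations(sorted(nbrs), 2)
    )
    # Pattern 3: p and another protein q are linked through two common keywords.
    m3 = 0
    for q, kw_q in keywords.items():
        if q != p:
            shared = len(kw_p & kw_q)
            m3 += shared * (shared - 1) // 2
    return [m1, m2, m3]


# Invented toy data, not taken from any PPI database or from UniProt.
ppi_edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
protein_keywords = {
    "P1": {"Kinase", "Apoptosis"},
    "P2": {"Kinase", "Apoptosis"},
    "P3": {"Kinase"},
    "P4": {"Transport"},
}

neighbors, keywords = build_ppik(ppi_edges, protein_keywords)
features = {p: metagraph_features(p, neighbors, keywords) for p in protein_keywords}
for protein, vector in sorted(features.items()):
    print(protein, vector)
```

Each protein ends up with a fixed-length count vector, which is the kind of input a standard classifier could be trained on, given known disease labels for some of the proteins.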

Key words

  • Protein-protein interaction
  • UniProt keywords
  • Metagraph
  • Protein representations
  • Disease protein prediction



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Sezin Kircali Ata (1)
  • Yuan Fang (2)
  • Min Wu (3)
  • Xiao-Li Li (3, corresponding author)
  • Xiaokui Xiao (4)

  1. Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
  2. School of Information Systems, Singapore Management University, Singapore, Singapore
  3. Data Analytics Department, Institute for Infocomm Research, Singapore, Singapore
  4. School of Computing, National University of Singapore, Singapore, Singapore
