Prediction of Human Gene - Phenotype Associations by Exploiting the Hierarchical Structure of the Human Phenotype Ontology

  • Giorgio Valentini
  • Sebastian Köhler
  • Matteo Re
  • Marco Notaro
  • Peter N. Robinson
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9043)


The Human Phenotype Ontology (HPO) provides a conceptualization of phenotype information and a tool for the computational analysis of human diseases. It covers a wide range of phenotypic abnormalities encountered in human diseases and its terms (classes) are structured according to a directed acyclic graph. In this context the prediction of the phenotypic abnormalities associated to human genes is a key tool to stratify patients into disease subclasses that share a common biological or pathophisiological basis. Methods are being developed to predict the HPO terms that are associated for a given disease or disease gene, but most such methods adopt a simple ”flat” approach, that is they do not take into account the hierarchical relationships of the HPO, thus loosing important a priori information about HPO terms. In this contribution we propose a novel Hierarchical Top-Down (HTD) algorithm that associates a specific learner to each HPO term and then corrects the predictions according to the hierarchical structure of the underlying DAG. Genome-wide experimental results relative to a complex HPO DAG including more than 4000 HPO terms show that the proposed hierarchical-aware approach significantly improves predictions obtained with flat methods, especially in terms of precision/recall results.


Human Phenotype Ontology term prediction Ensemble methods Hierarchical classification methods Disease gene prioritization 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Robinson, P., Krawitz, P., Mundlos, S.: Strategies for exome and genome sequence data analysis in disease-gene discovery projects. Cin. Genet. 80, 127–132 (2011)CrossRefGoogle Scholar
  2. 2.
    Robinson, P., Kohler, S., Bauer, S., Seelow, D., Horn, D., Mundlos, S.: The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am. J. Hum. Genet. 83, 610–615 (2008)CrossRefGoogle Scholar
  3. 3.
    Amberger, J., Bocchini, C., Amosh, A.: A new face and new challenges for Online Mendelian inheritance in Man (OMIM). Hum. Mutat. 32, 564–567 (2011)CrossRefGoogle Scholar
  4. 4.
    Kohler, S., et al.: The human phenotype ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Research 42(Database issue), D966–D974 (2014)Google Scholar
  5. 5.
    Moreau, Y., Tranchevent, L.: Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nature Rev. Genet. 13(8), 523–536 (2012)CrossRefGoogle Scholar
  6. 6.
    McGary, K., Lee, I., Marcotte, E.: Broad network-based predictability of Saccharomyces cerevisiae gene loss-of-function phenotypes. Genome Biology 8(R258) (2007)Google Scholar
  7. 7.
    Mehan, M., Nunez-Iglesias, J., Dai, C., Waterman, M., Zhou, X.: An integrative modular approach to systematically predict gene-phenotype associations. BMC Bioinformatics 11(suppl. 1) (2010)Google Scholar
  8. 8.
    Wang, P., et al.: Inference of gene-phenotype associations via protein-protein interaction and orthology. PLoS One 8(10) (2013)Google Scholar
  9. 9.
    Musso, G., et al.: Novel cardiovascular gene functions revealed via systematic phenotype prediction in zebrafish. Development 141, 224–235 (2014)CrossRefGoogle Scholar
  10. 10.
    Cerri, R., de Carvalho, A.: Hierarchical multilabel protein function prediction using local neural networks. In: Norberto de Souza, O., Telles, G.P., Palakal, M. (eds.) BSB 2011. LNCS, vol. 6832, pp. 10–17. Springer, Heidelberg (2011)CrossRefGoogle Scholar
  11. 11.
    Silla, C., Freitas, A.: A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22(1-2), 31–72 (2011)CrossRefzbMATHMathSciNetGoogle Scholar
  12. 12.
    Valentini, G.: True Path Rule hierarchical ensembles for genome-wide gene function prediction. IEEE ACM Transactions on Computational Biology and Bioinformatics 8(3), 832–847 (2011)CrossRefMathSciNetGoogle Scholar
  13. 13.
    Cesa-Bianchi, N., Re, M., Valentini, G.: Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference. Machine Learning 88(1), 209–241 (2012)CrossRefzbMATHMathSciNetGoogle Scholar
  14. 14.
    Obozinski, G., Lanckriet, G., Grant, C., Jordan, M., Noble, W.: Consistent probabilistic output for protein function prediction. Genome Biology 9(S6) (2008)Google Scholar
  15. 15.
    Schietgat, L., Vens, C., Struyf, J., Blockeel, H., Dzeroski, S.: Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics 11(2) (2010)Google Scholar
  16. 16.
    Valentini, G.: Hierarchical Ensemble Methods for Protein Function Prediction. ISRN Bioinformatics 2014(Article ID 901419), 34 pages (2014)Google Scholar
  17. 17.
    Gene Ontology Consortium: Gene Ontology annotations and resources. Nucleic Acids Research 41, D530–D535 (2013)Google Scholar
  18. 18.
    Cormen, T., Leiserson, C., Rivest, R.: Introduction to Algorithms. MIT Press, Boston (2009)zbMATHGoogle Scholar
  19. 19.
    Apweiler, R., Attwood, T., Bairoch, A., Bateman, A., et al.: The interpro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Research 29(1), 37–40 (2001)CrossRefGoogle Scholar
  20. 20.
    Finn, R., Tate, J., Mistry, J., Coggill, P., Sammut, J., Hotz, H., Ceric, G., Forslund, K., Eddy, S., Sonnhammer, E., Bateman, A.: The Pfam protein families database. Nucleic Acids Research 36, D281–D288 (2008)Google Scholar
  21. 21.
    Attwood, T.: The prints database: a resource for identification of protein families. Brief Bioinform. 3(3), 252–263 (2002)CrossRefGoogle Scholar
  22. 22.
    Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., Cuche, B., De Castro, E., Lachaize, C., Langendijk-Genevaux, P., Sigrist, C.: The 20 years of prosite. Nucleic Acids Research 36, D245–D249 (2008)Google Scholar
  23. 23.
    Schultz, J., Milpetz, F., Bork, P., Ponting, C.: Smart, a simple modular architecture research tool: identification of signaling domains. Proceedings of the National Academy of Sciences 95(11), 5857–5864 (1998)CrossRefGoogle Scholar
  24. 24.
    Gough, J., Karplus, K., Hughey, R., Chothia, C.: Assignment of homology to genome sequences using a library of hidden markov models that represent all proteins of known structure. Journal of Molecular Biology 313(4), 903–919 (2001)CrossRefGoogle Scholar
  25. 25.
    Valentini, G., Paccanaro, A., Caniza, H., Romero, A., Re, M.: An extensive analysis of disease-gene associations using network integration and fast kernel-based gene prioritization methods. Artificial Intelligence in Medicine 61(2), 63–78 (2014)CrossRefGoogle Scholar
  26. 26.
    Wu, G., Feng, X., Stein, L.: A human functional protein interaction network and its application to cancer data analysis. Genome Biol. 11, R53 (2010)Google Scholar
  27. 27.
    Lee, I., Blom, U., Wang, P.I., Shim, J., Marcotte, E.: Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Research 21(7), 1109–1121 (2011)CrossRefGoogle Scholar
  28. 28.
    Re, M., Valentini, G.: Cancer module genes ranking using kernelized score functions. BMC Bioinformatics 13(suppl.14/S3) (2012)Google Scholar
  29. 29.
    Re, M., Mesiti, M., Valentini, G.: A Fast Ranking Algorithm for Predicting Gene Functions in Biomolecular Networks. IEEE ACM Transactions on Computational Biology and Bioinformatics 9(6), 1812–1818 (2012)CrossRefGoogle Scholar
  30. 30.
    Re, M., Valentini, G.: Network-based Drug Ranking and Repositioning with respect to DrugBank Therapeutic Categories. IEEE/ACM Transactions on Computational Biology and Bioinformatics 10(6), 1359–1371 (2013)CrossRefGoogle Scholar
  31. 31.
    Oliver, S.: Guilt-by-association goes global. Nature 403, 601–603 (2000)CrossRefGoogle Scholar
  32. 32.
    Smola, A.J., Kondor, R.: Kernels and regularization on graphs. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 144–158. Springer, Heidelberg (2003)CrossRefGoogle Scholar
  33. 33.
    Zhu, X., et al.: Semi-supervised learning with gaussian fields and harmonic functions. In: Proc. of the 20th Int. Conf. on Machine Learning, Washintgton DC, USA (2003)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Giorgio Valentini
    • 1
  • Sebastian Köhler
    • 2
  • Matteo Re
    • 1
  • Marco Notaro
    • 3
  • Peter N. Robinson
    • 2
    • 4
  1. 1.AnacletoLab - DI, Dipartimento di InformaticaUniversità degli Studi di MilanoItaly
  2. 2.Institut fur Medizinische Genetik und HumangenetikCharité - Universitatsmedizin BerlinGermany
  3. 3.Dipartimento di BioscienzeUniversità degli Studi di MilanoItaly
  4. 4.Institute of Bioinformatics, Department of Mathematics and Computer ScienceFreie Universitat BerlinGermany

Personalised recommendations