Validation Pipeline for Computational Prediction of Genomics Annotations

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9874)

Abstract

Controlled biomolecular annotations are key concepts in computational genomics and proteomics, since they describe the functional features of genes and their products in a form that is both simple and amenable to computation. Despite their importance, many of these annotations are missing, and the available ones contain errors and inconsistencies; furthermore, discovering and validating new annotations is very time-consuming. For these reasons, many computer scientists have recently developed machine-learning algorithms that computationally predict new gene-function relationships. While several of these methods have been easily adapted to bioinformatics from other domains, validating them remains a challenging part of any computational pipeline. Here, we propose a validation procedure based on three sub-phases, which can assess the precision of any algorithm's predictions with a reliable degree of accuracy. We show validation results obtained for Gene Ontology annotations of Homo sapiens genes that demonstrate the effectiveness of our validation approach.

Keywords

Validation; Gene Ontology; Biomolecular annotations; Receiver Operating Characteristic; ROC curves; Genomic and Proteomic Data Warehouse (GPDW)

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Princess Margaret Cancer Centre, University of Toronto, Ontario, Canada
  2. Dipartimento di Elettronica Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
