Random Perturbations of Term Weighted Gene Ontology Annotations for Discovering Gene Unknown Functionalities

  • Giacomo Domeniconi
  • Marco Masseroli
  • Gianluca Moro
  • Pietro Pinoli
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 553)


Computational analyses for biomedical knowledge discovery greatly benefit from descriptions of gene and protein functional features expressed through controlled terminologies and ontologies, i.e., from their controlled annotations. In recent years, several databases of such annotations have become available; yet these annotations are incomplete, and only some of them represent highly reliable, human-curated information. To predict and discover unknown or missing annotations, existing approaches use unsupervised learning algorithms. We propose a new learning method that allows supervised algorithms to be applied to unsupervised problems, achieving much better annotation predictions. This method, which extends our preceding work with data weighting techniques, is based on generating artificial labeled training sets through random perturbations of the original data. We tested it on nine Gene Ontology annotation datasets; the results show that our approach is effective at predicting novel annotations and outperforms state-of-the-art unsupervised methods.
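
The idea of building an artificial labeled training set by random perturbation can be illustrated with a minimal sketch. It assumes that annotations are stored as a binary gene-by-term matrix and that a perturbation consists of randomly deleting known annotations; the function names, the deletion probability `drop_prob`, and the toy matrix are hypothetical and are not taken from the paper. Term weighting is omitted here.

```python
import random


def perturb_annotations(matrix, drop_prob, seed=0):
    """Return a copy of a binary gene-by-term matrix in which each
    known annotation (1) is deleted (set to 0) with probability drop_prob.
    Assumption: deletion is the only kind of perturbation applied."""
    rng = random.Random(seed)
    return [
        [0 if cell == 1 and rng.random() < drop_prob else cell for cell in row]
        for row in matrix
    ]


def build_training_set(matrix, term_index, drop_prob, seed=0):
    """Build (features, labels) for one ontology term.
    Features: rows of the perturbed matrix.
    Labels: the annotation of each gene to the term in the original matrix."""
    features = perturb_annotations(matrix, drop_prob, seed)
    labels = [row[term_index] for row in matrix]
    return features, labels


# Toy annotation matrix: rows are genes, columns are ontology terms.
annotations = [
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]

X, y = build_training_set(annotations, term_index=2, drop_prob=0.3, seed=42)
```

Under these assumptions, any supervised classifier could then be trained on `X` and `y`; a gene that the trained model assigns to a term it is not currently annotated with would be a candidate novel annotation.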


Gene ontology, Biomolecular annotation prediction, Bioinformatics, Knowledge discovery, Supervised learning, Term weighting



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Giacomo Domeniconi (1)
  • Marco Masseroli (2)
  • Gianluca Moro (1)
  • Pietro Pinoli (2)
  1. DISI, Università degli Studi di Bologna, Cesena, Italy
  2. DEIB, Politecnico di Milano, Milan, Italy
