Committee-Based Active Learning to Select Negative Examples for Predicting Protein Functions

  • Marco Frasca
  • Maryam Sepehri
  • Alessandro Petrini
  • Giuliano Grossi
  • Giorgio Valentini
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11925)


The Automated Functional Prediction (AFP) of proteins has become a challenging problem in bioinformatics and biomedicine, aiming to handle and interpret the extremely large proteomes of several eukaryotic organisms. A central issue in AFP is that public repositories of protein functions, e.g. the Gene Ontology (GO), lack well-defined sets of negative examples from which to learn accurate classifiers. In this paper we investigate the Query by Committee paradigm of active learning to select the negatives that are most informative for the classifier and for the protein function to be inferred. We validated our approach by predicting Gene Ontology functions for S. cerevisiae proteins.
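As a rough illustration of the Query by Committee idea mentioned in the abstract, the sketch below ranks candidate proteins by how strongly a committee of classifiers disagrees on them and keeps the most contested ones. It is not the authors' procedure: vote entropy is just one common disagreement measure, and the names `vote_entropy`, `select_negatives`, the toy threshold committee, and the protein ids are all hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy (in bits) of the committee's label votes for one example."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_negatives(candidates, committee, k):
    """Return the k candidate ids on which the committee disagrees most.

    candidates: dict mapping a protein id to a feature vector (assumed layout)
    committee:  list of callables, each mapping a feature vector to 0 or 1
    """
    scored = []
    for pid, x in candidates.items():
        votes = [member(x) for member in committee]
        scored.append((vote_entropy(votes), pid))
    # Highest disagreement first; ties are broken by id for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [pid for _, pid in scored[:k]]


# Toy committee: three threshold rules on a single score (illustrative only).
committee = [lambda x, t=t: int(x[0] > t) for t in (0.3, 0.5, 0.7)]
candidates = {"P1": [0.1], "P2": [0.4], "P3": [0.6], "P4": [0.9]}
selected = select_negatives(candidates, committee, k=2)
```

In this toy setting the committee is unanimous on the extreme scores and split on the intermediate ones, so the intermediate candidates are the ones returned. In a real setting the committee members would be trained classifiers and the candidates would be proteins not annotated with the function of interest.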


Keywords: Query by Committee, Active learning, Protein function prediction



This work was supported by the grant titled "Machine learning algorithms to handle label imbalance in biomedical taxonomies", code PSR2017_DIP_010_MFRAS, Università degli Studi di Milano.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Marco Frasca (1)
  • Maryam Sepehri (1)
  • Alessandro Petrini (1)
  • Giuliano Grossi (1)
  • Giorgio Valentini (1)
  1. Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
