Abstract
Automated Functional Prediction (AFP) of proteins has become a challenging problem in bioinformatics and biomedicine, driven by the need to handle and interpret the very large proteomes of several eukaryotic organisms. A central issue in AFP is that public repositories of protein functions, e.g. the Gene Ontology (GO), lack well-defined sets of negative examples from which to learn accurate classifiers. In this paper we investigate the Query by Committee paradigm of active learning to select the negatives that are most informative for the classifier and for the protein function to be inferred. We validated our approach by predicting Gene Ontology functions for S. cerevisiae proteins.
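As a rough illustration of the Query by Committee idea, the sketch below ranks candidate negative proteins by how strongly a committee of classifiers disagrees on them and keeps the most contested ones. It is a minimal sketch, not the procedure used in the paper: the vote-entropy disagreement measure, the toy threshold classifiers, the single feature, and the names `vote_entropy` and `select_informative_negatives` are all assumptions made for illustration.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Committee disagreement on one example (0 means full agreement)."""
    n = len(votes)
    return sum(-(c / n) * math.log(c / n) for c in Counter(votes).values())


def select_informative_negatives(committee, candidates, k):
    """Return the k candidate ids on which the committee disagrees most.

    committee  : list of callables mapping a feature vector to 0 or 1
    candidates : dict {protein_id: feature_vector} of non-positive proteins
    (both are hypothetical structures chosen for this sketch)
    """
    scores = {
        pid: vote_entropy([member(x) for member in committee])
        for pid, x in candidates.items()
    }
    # Sort by decreasing disagreement; ties are broken by id for determinism.
    ranked = sorted(scores, key=lambda pid: (-scores[pid], pid))
    return ranked[:k]


# Toy committee: three threshold rules on one hypothetical feature.
committee = [lambda x, t=t: int(x[0] > t) for t in (0.2, 0.5, 0.8)]
candidates = {"P1": [0.1], "P2": [0.4], "P3": [0.6], "P4": [0.9]}
selected = select_informative_negatives(committee, candidates, k=2)
```

In this toy setting the committee agrees on P1 and P4 and splits on P2 and P3, so the latter two are the candidates that would be queried. A real committee would consist of trained classifiers rather than fixed thresholds.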
Keywords
- Query By Committee
- Active learning
- Protein function prediction
Notes
1. A balanced seed training set counterbalances the predominance of 0 labels.
Acknowledgments
This work was supported by the grant titled "Machine learning algorithms to handle label imbalance in biomedical taxonomies", code PSR2017_DIP_010_MFRAS, Università degli Studi di Milano.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Frasca, M., Sepehri, M., Petrini, A., Grossi, G., Valentini, G. (2020). Committee-Based Active Learning to Select Negative Examples for Predicting Protein Functions. In: Raposo, M., Ribeiro, P., Sério, S., Staiano, A., Ciaramella, A. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2018. Lecture Notes in Computer Science, vol 11925. Springer, Cham. https://doi.org/10.1007/978-3-030-34585-3_7
DOI: https://doi.org/10.1007/978-3-030-34585-3_7
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34584-6
Online ISBN: 978-3-030-34585-3
eBook Packages: Computer Science, Computer Science (R0)