Abstract
Gene function prediction is a complex multilabel classification problem with several distinctive features: the hierarchical relationships between functional classes, the presence of multiple sources of biomolecular data, the imbalance between positive and negative examples for each class, and the complexity arising from the whole-ontology and genome-wide dimensions. Unlike previous work, which mostly addressed each of these issues in isolation, we explore the interaction and potential synergy of hierarchical multilabel methods, data fusion methods, and cost-sensitive approaches on whole-ontology and genome-wide gene function prediction. Besides classical top-down hierarchical multilabel ensemble methods, our experiments consider two recently proposed multilabel methods: one based on the approximation of the Bayesian optimal classifier with respect to the hierarchical loss, and one based on a heuristic approach inspired by the true path rule for biological functional ontologies. Our experiments show that the key factors for the success of hierarchical ensemble methods are the integration and synergy among multilabel hierarchical, data fusion, and cost-sensitive approaches, as well as the strategy for selecting negative examples.
Editors: Grigorios Tsoumakas, Min-Ling Zhang, and Zhi-Hua Zhou.
Cite this article
Cesa-Bianchi, N., Re, M. & Valentini, G. Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference. Mach Learn 88, 209–241 (2012). https://doi.org/10.1007/s10994-011-5271-6
DOI: https://doi.org/10.1007/s10994-011-5271-6