Predicting Gene Function using Predictive Clustering Trees

Inductive Databases and Constraint-Based Data Mining

Abstract

In this chapter, we show how the predictive clustering tree framework can be used to predict the functions of genes. Gene function prediction is an example of a hierarchical multi-label classification (HMC) task: a gene may have multiple functions, and these functions are organized in a hierarchy. The hierarchy is either a tree, in which each function has at most one parent, or a directed acyclic graph (DAG), in which a function may have multiple parents.
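The distinction between tree-structured and DAG-structured hierarchies can be illustrated with a minimal Python sketch. It is not taken from the chapter: the function names A to D, the helpers `is_tree` and `with_ancestors`, and the parent-list representation are all hypothetical. The sketch also assumes the usual HMC convention that a gene annotated with a function is implicitly annotated with all of that function's ancestors, which the abstract itself does not state.

```python
from typing import Dict, List, Set

# Hypothetical toy hierarchies: each function maps to the list of its parents.
TREE_PARENTS: Dict[str, List[str]] = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
}
DAG_PARENTS: Dict[str, List[str]] = {
    "A": [],
    "B": [],
    "C": ["A"],
    "D": ["B", "C"],  # two parents, so this hierarchy is a DAG, not a tree
}


def is_tree(parents: Dict[str, List[str]]) -> bool:
    """Return True if every function has at most one parent."""
    return all(len(p) <= 1 for p in parents.values())


def with_ancestors(labels: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Close a label set upward: add every ancestor of every label.

    Assumes the hierarchy constraint (a label implies all its ancestors).
    """
    closed: Set[str] = set()
    stack = list(labels)
    while stack:
        node = stack.pop()
        if node not in closed:
            closed.add(node)
            stack.extend(parents[node])
    return closed


if __name__ == "__main__":
    print(is_tree(TREE_PARENTS), is_tree(DAG_PARENTS))
    print(sorted(with_ancestors({"D"}, DAG_PARENTS)))
```

Under these assumptions, a gene labelled only with D in the DAG example ends up with the label set {A, B, C, D}, because D inherits ancestors along both of its parent paths.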

Author information

Corresponding author

Correspondence to Celine Vens.

Copyright information

© 2010 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Vens, C., Schietgat, L., Struyf, J., Blockeel, H., Kocev, D., Džeroski, S. (2010). Predicting Gene Function using Predictive Clustering Trees. In: Džeroski, S., Goethals, B., Panov, P. (eds) Inductive Databases and Constraint-Based Data Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-7738-0_15

  • DOI: https://doi.org/10.1007/978-1-4419-7738-0_15

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4419-7737-3

  • Online ISBN: 978-1-4419-7738-0

  • eBook Packages: Computer Science, Computer Science (R0)
