Decision Trees for Hierarchical Multilabel Classification: A Case Study in Functional Genomics

  • Hendrik Blockeel
  • Leander Schietgat
  • Jan Struyf
  • Sašo Džeroski
  • Amanda Clare
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4213)

Abstract

Hierarchical multilabel classification (HMC) is a variant of classification where instances may belong to multiple classes organized in a hierarchy. The task is relevant for several application domains. This paper presents an empirical study of decision tree approaches to HMC in the area of functional genomics. We compare learning a single HMC tree (which makes predictions for all classes together) to learning a set of regular classification trees (one for each class). Interestingly, on all 12 datasets we use, the HMC tree wins on all fronts: it is faster to learn and to apply, it is easier to interpret, and its predictive performance is similar to or better than that of the set of regular trees. It turns out that HMC tree learning is more robust to overfitting than regular tree learning.
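
The setting described in the abstract can be illustrated with a minimal sketch. It assumes, as is common in HMC but not stated above, that an instance belonging to a class also belongs to all of that class's ancestors. The toy hierarchy, class names, and function names are hypothetical and are not taken from the paper. Under this encoding, a single HMC tree would predict the whole vector at once, whereas the alternative approach would learn one binary tree per vector component.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
# These class names are illustrative only and do not come from the paper.
PARENT = {
    "1": None,
    "1/1": "1",
    "1/2": "1",
    "2": None,
    "2/1": "2",
}
CLASSES = sorted(PARENT)  # fixed ordering of the vector components


def with_ancestors(labels):
    """Return the label set closed under the ancestor relation (assumed constraint)."""
    closed = set()
    for label in labels:
        while label is not None:
            closed.add(label)
            label = PARENT[label]
    return closed


def encode(labels):
    """Encode a label set as a 0/1 vector with one component per class."""
    closed = with_ancestors(labels)
    return [1 if c in closed else 0 for c in CLASSES]


# An instance annotated with two leaf classes from different branches.
example = encode({"1/2", "2/1"})
print(dict(zip(CLASSES, example)))
```

Each component of the vector corresponds to one class, so the "set of regular trees" baseline amounts to one binary target per component, while the HMC tree treats the vector as a single structured target.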

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Hendrik Blockeel (1)
  • Leander Schietgat (1)
  • Jan Struyf (1, 2)
  • Sašo Džeroski (3)
  • Amanda Clare (4)

  1. Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium
  2. Dept. of Biostatistics and Medical Informatics, Univ. of Wisconsin, Madison, USA
  3. Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
  4. Department of Computer Science, University of Wales Aberystwyth, UK
