Consistency of Probabilistic Classifier Trees

  • Krzysztof Dembczyński
  • Wojciech Kotłowski
  • Willem Waegeman
  • Róbert Busa-Fekete
  • Eyke Hüllermeier
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9852)

Abstract

Label tree classifiers are commonly used for efficient multi-class and multi-label classification. They represent a predictive model in the form of a tree-like hierarchy of (internal) classifiers, each of which is trained on a simpler (often binary) subproblem, and predictions are made by (greedily) following these classifiers’ decisions from the root to a leaf of the tree. Unfortunately, this approach does not normally ensure consistency for different losses on the original prediction task, even if the internal classifiers are consistent for their subtasks. In this paper, we thoroughly analyze a class of methods referred to as probabilistic classifier trees (PCTs). Because probabilistic classifiers are trained at the internal nodes of the hierarchy, these methods allow the tree structure to be searched in a more sophisticated manner, thereby producing predictions of a less greedy nature. Our main result is a regret bound for 0/1 loss, which can easily be extended to ranking-based losses. In this regard, PCTs nicely complement a related approach called filter trees (FTs), and can indeed be seen as a natural alternative to them. We compare the two approaches both theoretically and empirically.
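
To make the search idea concrete, the sketch below (in Python) illustrates PCT inference via uniform-cost (best-first) search over a label tree: each internal node carries a probabilistic classifier that scores its child branches, and the predicted label is the leaf maximizing the product of branch probabilities along its root-to-leaf path. This is a minimal illustration of the general technique, not the paper's algorithm or notation; the names `Node`, `pct_predict`, and the classifier callables are assumptions introduced here.

```python
import heapq
import math

class Node:
    """A label-tree node. Leaves carry a label; internal nodes carry a
    probabilistic classifier mapping an input x to one probability per child."""
    def __init__(self, label=None, children=None, classifier=None):
        self.label = label               # set only at leaves
        self.children = children or []   # non-empty only at internal nodes
        self.classifier = classifier     # callable: x -> branch probabilities

def pct_predict(root, x):
    """Return (label, probability) of the leaf with maximal path probability.

    Works on negated log-probabilities so a min-heap acts as a max-heap over
    path probabilities. Since extending a path can only shrink its probability,
    the first leaf popped is guaranteed optimal (uniform-cost search).
    """
    counter = 0                           # tie-breaker so nodes are never compared
    heap = [(0.0, counter, root)]         # (negative log path prob, tie, node)
    while heap:
        neg_logp, _, node = heapq.heappop(heap)
        if not node.children:                        # reached a leaf: optimal
            return node.label, math.exp(-neg_logp)
        probs = node.classifier(x)                   # one probability per child
        for child, p in zip(node.children, probs):
            if p > 0.0:
                counter += 1
                heapq.heappush(heap, (neg_logp - math.log(p), counter, child))
    raise ValueError("tree has no leaves")

if __name__ == "__main__":
    # Toy 4-label balanced tree with fixed (input-independent) classifiers,
    # purely to show the call pattern.
    leaves = [Node(label=i) for i in range(4)]
    left = Node(children=leaves[:2], classifier=lambda x: [0.3, 0.7])
    right = Node(children=leaves[2:], classifier=lambda x: [0.9, 0.1])
    root = Node(children=[left, right], classifier=lambda x: [0.4, 0.6])
    print(pct_predict(root, x=None))      # -> (2, ~0.54), since 0.6 * 0.9 = 0.54
```

Following only the most probable child at each node would recover the usual greedy label-tree prediction; the best-first search above instead returns the exact mode of the estimated conditional label distribution, which is the kind of less greedy prediction the abstract refers to.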

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Krzysztof Dembczyński (1)
  • Wojciech Kotłowski (1)
  • Willem Waegeman (2)
  • Róbert Busa-Fekete (3)
  • Eyke Hüllermeier (3)

  1. Poznan University of Technology, Poznań, Poland
  2. Ghent University, Ghent, Belgium
  3. Paderborn University, Paderborn, Germany