Machine Learning, Volume 92, Issue 1, pp 65–89

Beam search algorithms for multilabel learning

  • Abhishek Kumar
  • Shankar Vembu
  • Aditya Krishna Menon
  • Charles Elkan

Abstract

Multilabel learning is a machine learning task that is important for applications, but challenging. A recent method for multilabel learning called probabilistic classifier chains (PCCs) has several appealing properties. However, PCCs suffer from the computational issue that inference (i.e., predicting the label of an example) requires time exponential in the number of tags. Also, PCC accuracy is sensitive to the ordering of the tags while training. In this paper, we show how to use the classical technique of beam search to solve both these problems. Specifically, we show how to apply beam search to make inference tractable, and how to integrate beam search with training to determine a suitable tag ordering. Experimental results on a range of datasets show that the proposed improvements yield a state-of-the-art method for multilabel learning.
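As a rough illustration of the inference idea described in the abstract, the sketch below applies beam search to a chain of conditional label probabilities. It is not the authors' implementation: the function `beam_search_pcc`, the callback `cond_prob(j, prefix)` (assumed to return the probability that tag j is on, given the example and the tags already assigned), and the toy probabilities are all hypothetical. With beam width b and m tags, the loop evaluates the chain at most b times per tag, rather than scoring all 2^m tag vectors.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical interface: cond_prob(j, prefix) returns
# P(y_j = 1 | x, y_1, ..., y_{j-1}) for one fixed example x.
CondProb = Callable[[int, Tuple[int, ...]], float]


def beam_search_pcc(cond_prob: CondProb, num_labels: int,
                    beam_width: int) -> Tuple[Tuple[int, ...], float]:
    """Approximate the most probable tag vector under a classifier chain."""
    # Each beam entry is (partial tag assignment, log-probability so far).
    beam: List[Tuple[Tuple[int, ...], float]] = [((), 0.0)]
    for j in range(num_labels):
        candidates = []
        for prefix, logp in beam:
            p_on = cond_prob(j, prefix)
            # Extend the prefix with tag j switched off (0) and on (1).
            for bit, p in ((0, 1.0 - p_on), (1, p_on)):
                if p > 0.0:
                    candidates.append((prefix + (bit,), logp + math.log(p)))
        # Keep only the beam_width most probable partial assignments.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
    best, best_logp = beam[0]
    return best, math.exp(best_logp)


def toy_cond_prob(j: int, prefix: Tuple[int, ...]) -> float:
    """Made-up two-tag chain, chosen so that greedy inference is suboptimal."""
    if j == 0:
        return 0.55                               # P(y_1 = 1 | x)
    return 0.4 if prefix[0] == 1 else 0.95        # P(y_2 = 1 | x, y_1)


greedy_labels, greedy_prob = beam_search_pcc(toy_cond_prob, 2, beam_width=1)
beam_labels, beam_prob = beam_search_pcc(toy_cond_prob, 2, beam_width=2)
```

In this toy chain, a beam of width 1 behaves like greedy chain inference and commits to the first tag too early, while a beam of width 2 keeps both prefixes alive and recovers the jointly most probable tag vector.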

Keywords

Multilabel classification, Probabilistic models, Beam search, Structured prediction


Copyright information

© The Author(s) 2013

Authors and Affiliations

  • Abhishek Kumar (1)
  • Shankar Vembu (2)
  • Aditya Krishna Menon (1)
  • Charles Elkan (1)
  1. Department of Computer Science and Engineering, University of California, San Diego, USA
  2. Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
