Machine Learning, Volume 107, Issue 5, pp 859–886

Online multi-label dependency topic models for text classification

  • Sophie Burkhardt
  • Stefan Kramer


Multi-label text classification is an increasingly important field, as large amounts of text data are available and extracting relevant information matters in many application contexts. Probabilistic generative models are the basis of a number of popular text mining methods such as Naive Bayes or Latent Dirichlet Allocation. However, Bayesian models for multi-label text classification are often overly complicated, because they must account for label dependencies and skewed label frequencies while at the same time preventing overfitting. To solve this problem, we employ the same technique that contributed to the success of deep learning in recent years: greedy layer-wise training. Applying this technique in the supervised setting prevents overfitting and leads to better classification accuracy. The intuition behind this approach is to learn the labels first and subsequently add a more abstract layer to represent dependencies among the labels. This allows using a relatively simple hierarchical topic model, which can easily be adapted to the online setting. We show that our method successfully models dependencies online for large-scale multi-label datasets with many labels and improves over the baseline method that does not model dependencies. The same strategy, greedy layer-wise training, also makes the batch variant competitive with existing, more complex multi-label topic models.


Keywords: Multi-label classification, Online learning, LDA, Topic model



The authors would like to thank PRIME Research for their support.



Copyright information

© The Author(s) 2017

Authors and Affiliations

  1. Institute of Computer Science, Johannes Gutenberg-University of Mainz, Mainz, Germany
