Statistical topic models for multi-label document classification

Rubin, Timothy N.; Chambers, America; Smyth, Padhraic; Steyvers, Mark

doi:10.1007/s10994-011-5272-5

Published: 29 December 2011

Machine Learning, Volume 88, pages 157–208 (2012)

Abstract

Machine learning approaches to multi-label document classification have to date largely relied on discriminative modeling techniques such as support vector machines. A drawback of these approaches is that performance rapidly drops off as the total number of labels and the number of labels per document increase. This problem is amplified when the label frequencies exhibit the type of highly skewed distributions that are often observed in real-world datasets. In this paper we investigate a class of generative statistical topic models for multi-label documents that associate individual word tokens with different labels. We investigate the advantages of this approach relative to discriminative models, particularly with respect to classification problems involving large numbers of relatively rare labels. We compare the performance of generative and discriminative approaches on document labeling tasks ranging from datasets with several thousand labels to datasets with tens of labels. The experimental results indicate that probabilistic generative models can achieve competitive multi-label classification performance compared to discriminative methods, and have advantages for datasets with many labels and skewed label frequencies.
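
The idea of associating individual word tokens with different labels can be illustrated with a small sketch. This is not the model from the paper: it is a simplified illustration that assumes the per-label word distributions are already known, and the function names (`token_label_posteriors`, `rank_labels`) and the toy probabilities are hypothetical. Each token is softly assigned to labels in proportion to a label prior times the probability of the word under that label, and the document's labels are then ranked by their average share of tokens.

```python
def token_label_posteriors(tokens, label_word_probs, label_prior):
    """For each token, return P(label | token), proportional to
    label_prior[label] * P(token | label).

    Assumes label_prior and label_word_probs share the same label keys.
    """
    posteriors = []
    for token in tokens:
        weights = {
            label: label_prior[label] * word_probs.get(token, 0.0)
            for label, word_probs in label_word_probs.items()
        }
        total = sum(weights.values())
        if total == 0.0:
            # Token has zero probability under every label:
            # fall back to the (normalized) prior.
            weights = dict(label_prior)
            total = sum(weights.values())
        posteriors.append({label: w / total for label, w in weights.items()})
    return posteriors


def rank_labels(tokens, label_word_probs, label_prior):
    """Rank labels by their average posterior share of the document's tokens."""
    totals = {label: 0.0 for label in label_word_probs}
    for posterior in token_label_posteriors(tokens, label_word_probs, label_prior):
        for label, prob in posterior.items():
            totals[label] += prob
    n_tokens = len(tokens) or 1  # avoid division by zero for empty documents
    scores = [(label, total / n_tokens) for label, total in totals.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy values chosen purely for illustration (not taken from the paper).
toy_label_word_probs = {
    "sports": {"game": 0.5, "team": 0.4, "market": 0.1},
    "finance": {"market": 0.6, "stock": 0.3, "game": 0.1},
}
toy_label_prior = {"sports": 0.5, "finance": 0.5}

if __name__ == "__main__":
    document = ["game", "team", "market"]
    for label, score in rank_labels(document, toy_label_word_probs, toy_label_prior):
        print(f"{label}: {score:.3f}")
```

Because every token contributes to the score of each label, a rare label can still be ranked highly when a few of its characteristic words appear in the document. A full generative topic model would also learn the word distributions and per-document label weights from training data, which this sketch leaves out.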

Author information

Authors and Affiliations

Department of Cognitive Sciences, University of California, Irvine, Irvine, CA, 92697, USA
Timothy N. Rubin & Mark Steyvers
Department of Computer Science, University of California, Irvine, Irvine, CA, 92697, USA
America Chambers & Padhraic Smyth

Corresponding author

Correspondence to Timothy N. Rubin.

Additional information

Editors: Grigorios Tsoumakas, Min-Ling Zhang, and Zhi-Hua Zhou.

About this article

Cite this article

Rubin, T.N., Chambers, A., Smyth, P. et al. Statistical topic models for multi-label document classification. Mach Learn 88, 157–208 (2012). https://doi.org/10.1007/s10994-011-5272-5

Received: 01 October 2010
Accepted: 09 November 2011
Published: 29 December 2011
Issue Date: July 2012
DOI: https://doi.org/10.1007/s10994-011-5272-5
