Abstract
Machine learning approaches to multi-label document classification have to date largely relied on discriminative modeling techniques such as support vector machines. A drawback of these approaches is that performance rapidly drops off as the total number of labels and the number of labels per document increase. This problem is amplified when the label frequencies exhibit the type of highly skewed distributions that are often observed in real-world datasets. In this paper we investigate a class of generative statistical topic models for multi-label documents that associate individual word tokens with different labels. We investigate the advantages of this approach relative to discriminative models, particularly with respect to classification problems involving large numbers of relatively rare labels. We compare the performance of generative and discriminative approaches on document labeling tasks ranging from datasets with several thousand labels to datasets with tens of labels. The experimental results indicate that probabilistic generative models can achieve competitive multi-label classification performance compared to discriminative methods, and have advantages for datasets with many labels and skewed label frequencies.
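As a rough illustration of what it means to associate individual word tokens with labels, the sketch below computes, for each token in a document, a normalized distribution over that document's labels. It is a toy under stated assumptions, not the models studied in the paper: the function name `token_label_posteriors`, the `smoothing` constant, the uniform prior over the document's labels, and the hand-written word distributions are all hypothetical.

```python
from typing import Dict, List


def token_label_posteriors(
    tokens: List[str],
    doc_labels: List[str],
    word_given_label: Dict[str, Dict[str, float]],
    smoothing: float = 1e-6,
) -> List[Dict[str, float]]:
    """Return p(label | token) for each token, restricted to the document's labels.

    Assumes a uniform prior over the document's labels, so the posterior is
    proportional to p(token | label). Unseen words get a small smoothing value.
    """
    posteriors = []
    for token in tokens:
        # Unnormalized score of each candidate label for this token.
        scores = {
            label: word_given_label[label].get(token, smoothing)
            for label in doc_labels
        }
        total = sum(scores.values())
        posteriors.append({label: score / total for label, score in scores.items()})
    return posteriors


# Hypothetical per-label word distributions (illustrative values only).
word_given_label = {
    "sports": {"goal": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"goal": 0.1, "team": 0.1, "vote": 0.8},
}

posteriors = token_label_posteriors(
    tokens=["goal", "vote"],
    doc_labels=["sports", "politics"],
    word_given_label=word_given_label,
)
for token, dist in zip(["goal", "vote"], posteriors):
    print(token, dist)
```

In this toy setting each token's probability mass is split among only the labels attached to its document, so a word such as "goal" is credited mostly to "sports" even though the document also carries the "politics" label.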
Additional information
Editors: Grigorios Tsoumakas, Min-Ling Zhang, and Zhi-Hua Zhou.
Cite this article
Rubin, T.N., Chambers, A., Smyth, P. et al. Statistical topic models for multi-label document classification. Mach Learn 88, 157–208 (2012). https://doi.org/10.1007/s10994-011-5272-5
DOI: https://doi.org/10.1007/s10994-011-5272-5