Supervised labeled latent Dirichlet allocation for document categorization

Li, Ximing; Ouyang, Jihong; Zhou, Xiaotang; Lu, You; Liu, Yanhui

doi:10.1007/s10489-014-0595-0

Published: 25 November 2014

Volume 42, pages 581–593, (2015)

Applied Intelligence

Abstract

Recently, supervised topic modeling approaches have received considerable attention. However, the representative labeled latent Dirichlet allocation (L-LDA) method has a tendency to over-focus on the pre-assigned labels, and does not give potentially lost labels and common semantics sufficient consideration. To overcome these problems, we propose an extension of L-LDA, namely supervised labeled latent Dirichlet allocation (SL-LDA), for document categorization. Our model makes two fundamental assumptions, i.e., Prior 1 and Prior 2, that relax the restriction of label sampling and extend the concept of topics. In this paper, we develop a Gibbs expectation-maximization algorithm to learn the SL-LDA model. Quantitative experimental results demonstrate that SL-LDA is competitive with state-of-the-art approaches on both single-label and multi-label corpora.
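
The label restriction that the abstract attributes to L-LDA can be illustrated with a minimal sketch. The code below is a toy collapsed Gibbs sampler for the baseline L-LDA as it is commonly described, in which each token may only be assigned to a topic that appears among its document's pre-assigned labels. It is not the SL-LDA model or the Gibbs expectation-maximization algorithm of this paper, whose Prior 1 and Prior 2 are not specified here. The function name `llda_gibbs`, the hyperparameter values, and the toy corpus are illustrative assumptions.

```python
import numpy as np


def llda_gibbs(docs, labels, n_topics, n_vocab,
               alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Toy collapsed Gibbs sampler for baseline L-LDA (illustrative only).

    docs:   list of lists of word ids in [0, n_vocab)
    labels: list of lists of topic ids each document is allowed to use
    Returns (document-topic counts, topic-word counts).
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # tokens of doc d in topic k
    n_kw = np.zeros((n_topics, n_vocab))     # tokens of word w in topic k
    n_k = np.zeros(n_topics)                 # total tokens in topic k
    z = []

    # Initialise every token with a topic drawn from its document's labels.
    for d, doc in enumerate(docs):
        allowed = np.asarray(labels[d])
        z_d = rng.choice(allowed, size=len(doc))
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            allowed = np.asarray(labels[d])
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # L-LDA restriction: only the document's own labels compete.
                p = ((n_dk[d, allowed] + alpha)
                     * (n_kw[allowed, w] + beta)
                     / (n_k[allowed] + n_vocab * beta))
                k = allowed[rng.choice(len(allowed), p=p / p.sum())]
                # Record the new assignment.
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    return n_dk, n_kw


# Hypothetical toy corpus: document 0 carries label 0, document 1 labels 1 and 2.
docs = [[0, 1, 0], [2, 3, 3]]
labels = [[0], [1, 2]]
n_dk, n_kw = llda_gibbs(docs, labels, n_topics=3, n_vocab=4)
```

In this sketch a topic that is not among a document's labels can never receive any of that document's tokens, so a label missing from the annotation stays invisible to the sampler. That hard constraint is the over-focus on pre-assigned labels that the abstract says SL-LDA relaxes.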

Notes

In fact, \(Dir^{\prime }(\overrightarrow \alpha ,{y_{d}})\) is a pseudo-PDF, because \({\int }_{- \infty }^{\infty } Dir^{\prime }(\overrightarrow \alpha ,{y_{d}}) = 1 - \theta ^{*}\). Here, it is used to sample the absent topic components in \(\overrightarrow {{{\Lambda }_{d}}}\).
http://people.csail.mit.edu/jrennie/20Newsgroups/
http://mallet.cs.umass.edu
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
http://www.cs.princeton.edu/~blei/topicmodeling.html
http://www.ml-thu.net/~jun/software.html
http://www.kecl.ntt.co.jp/as/members/ueda/yahoo.tar.gz
Results for the Yahoo! Health collection are very similar.
We set K = 100 for Yahoo! Arts and K = 240 for RCV1-v2.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61170092, 61133011, and 61103091.

Author information

Authors and Affiliations

College of Computer Science and Technology, Jilin University, Changchun, China
Ximing Li, Jihong Ouyang, Xiaotang Zhou, You Lu & Yanhui Liu
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
Ximing Li, Jihong Ouyang, Xiaotang Zhou, You Lu & Yanhui Liu

Corresponding author

Correspondence to Jihong Ouyang.

About this article

Cite this article

Li, X., Ouyang, J., Zhou, X. et al. Supervised labeled latent Dirichlet allocation for document categorization. Appl Intell 42, 581–593 (2015). https://doi.org/10.1007/s10489-014-0595-0

Published: 25 November 2014
Issue Date: April 2015
DOI: https://doi.org/10.1007/s10489-014-0595-0
