
Supervised labeled latent Dirichlet allocation for document categorization

Applied Intelligence

Abstract

Recently, supervised topic modeling approaches have received considerable attention. However, the representative labeled latent Dirichlet allocation (L-LDA) method has a tendency to over-focus on the pre-assigned labels, and does not give potentially lost labels and common semantics sufficient consideration. To overcome these problems, we propose an extension of L-LDA, namely supervised labeled latent Dirichlet allocation (SL-LDA), for document categorization. Our model makes two fundamental assumptions, i.e., Prior 1 and Prior 2, that relax the restriction of label sampling and extend the concept of topics. In this paper, we develop a Gibbs expectation-maximization algorithm to learn the SL-LDA model. Quantitative experimental results demonstrate that SL-LDA is competitive with state-of-the-art approaches on both single-label and multi-label corpora.
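The label restriction that the abstract attributes to L-LDA can be illustrated with a minimal sketch of collapsed Gibbs sampling in which each token's topic is drawn only from its document's pre-assigned labels. This is not the SL-LDA algorithm or the Gibbs expectation-maximization procedure proposed in the paper; the function name `llda_gibbs`, the hyperparameter values, and the toy corpus are assumptions made for illustration only.

```python
import random


def llda_gibbs(docs, labels, num_topics, vocab_size,
               alpha=0.1, beta=0.01, iters=50, seed=0):
    """Hypothetical L-LDA-style sampler: topics are restricted to each document's labels.

    docs   -- list of documents, each a list of word ids in [0, vocab_size)
    labels -- list of label lists; labels[d] holds the topic ids allowed in doc d
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]                # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]   # topic-word counts
    n_k = [0] * num_topics                                 # tokens per topic
    z = []                                                 # topic assignment per token

    # Initialise every token with a topic chosen from its document's label set.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.choice(labels[d])
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            allowed = labels[d]
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Standard collapsed-LDA conditional, evaluated only over
                # the allowed labels; all other topics have zero probability.
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in allowed
                ]
                k = rng.choices(allowed, weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_kw


# Toy corpus (assumed): two documents over a four-word vocabulary.
toy_docs = [[0, 1, 0], [2, 3, 2]]
toy_labels = [[0], [1, 2]]
assignments, topic_word = llda_gibbs(toy_docs, toy_labels, num_topics=3, vocab_size=4)
```

Because the sampler never assigns a token to a topic outside `labels[d]`, a label missing from the annotation can never be recovered, which is the over-focus on pre-assigned labels that the abstract describes.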


Notes

  1. In fact, \(Dir^{\prime}(\overrightarrow{\alpha}, y_{d})\) is a pseudo-PDF, because \(\int_{-\infty}^{\infty} Dir^{\prime}(\overrightarrow{\alpha}, y_{d}) = 1 - \theta\). Here, it is used to sample the absence topic components in \(\overrightarrow{\Lambda_{d}}\).

  2. http://people.csail.mit.edu/jrennie/20Newsgroups/

  3. http://mallet.cs.umass.edu

  4. http://www.csie.ntu.edu.tw/~cjlin/libsvm/

  5. http://www.cs.princeton.edu/~blei/topicmodeling.html

  6. http://www.ml-thu.net/~jun/software.html

  7. http://www.kecl.ntt.co.jp/as/members/ueda/yahoo.tar.gz

  8. Results for the Yahoo! Health collection are very similar.

  9. We set K = 100 for Yahoo! Arts and K = 240 for RCV1-v2.

References

  1. Ali D, Faqir M (2012) Group topic modeling for academic knowledge discovery. Appl Intell 36(4):870–886


  2. Andrieu C, Freitas ND, Doucet A, Jordan MI (2003) An introduction to MCMC for machine learning. Mach Learn 50(1):5–43


  3. Blei DM, Lafferty JD (2007) A correlated topic model of science. Ann Appl Stat 1(1):17–35


  4. Blei DM, McAuliffe JD (2007) Supervised topic models. In: Neural information processing systems

  5. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022


  6. Fei-Fei L, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition, vol 2, pp 524–531

  7. Heinrich G (2005) Parameter estimation for text analysis. http://www.arbylon.net/publications/textest

  8. Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp 50–57

  9. Jaegul C, Changhyun L, Chandan KR, Park H (2013) Utopian: user-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Trans Vis Comput Graph 19(12):1992–2001


  10. Ji S, Tang L, Yu S, Ye J (2008) Extracting shared subspace for multi-label classification. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, pp 381–389

  11. Kim D, Kim S, Oh A (2012) Dirichlet process with mixed random measures: a nonparametric topic model for labeled data. In: 29th International conference on machine learning, pp 727–734

  12. Lacoste-Julien S, Sha F, Jordan MI (2009) Disclda: discriminative learning for dimensionality reduction and classification. In: Neural information processing systems, pp 897–904

  13. Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397


  14. Quelhas P, Monay F, Odobez JM, Gatica-Perez D, Tuytelaars T, Van Gool L (2005) Modeling scenes with local descriptors and latent aspects. Comput Vis IEEE Int Conf 1:883–890


  15. Ramage D, Hall D, Nallapati R, Manning CD (2009) Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In: Conference on empirical methods in natural language processing, pp 248–256. Association for Computational Linguistics

  16. Ramage D, Manning CD, Dumais S (2011) Partially labeled topic models for interpretable text mining. In: ACM SIGKDD international conference on knowledge discovery and data mining, pp 457–465

  17. Rubin TN, Chambers A, Smyth P, Steyvers M (2012) Statistical topic models for multi-label document classification. Mach Learn 88(1–2):157–208


  18. Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1–47


  19. Wallach H (2006) Topic modeling: beyond bag-of-words. In: Proceedings of the 23rd international conference on Machine learning, pp 977–984. ACM

  20. Xie P, Xing EP (2013) Integrating document clustering and topic modeling. In: Proceedings of the 20th conference on uncertainty in artificial intelligence, pp 694–703

  21. Xu Y, Guo R (2014) An improved nu-twin support vector machine. Appl Intell 41(1):42–54


  22. Zhang ML, Zhang K (2010) Multi-label learning by exploiting label dependency. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, pp 999–1008

  23. Zhu J, Ahmed A, Xing E (2009) Medlda: maximum margin supervised topic models for regression and classification. In: Proceedings of the 26th annual international conference on machine learning, pp 1257–1264. ACM

  24. Zhu J, Ahmed A, Xing E (2012) Medlda: maximum margin supervised topic models


Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61170092, 61133011, and 61103091.

Author information

Corresponding author

Correspondence to Jihong Ouyang.

About this article

Cite this article

Li, X., Ouyang, J., Zhou, X. et al. Supervised labeled latent Dirichlet allocation for document categorization. Appl Intell 42, 581–593 (2015). https://doi.org/10.1007/s10489-014-0595-0

  • DOI: https://doi.org/10.1007/s10489-014-0595-0
