
Twin labeled LDA: a supervised topic model for document classification

Published in Applied Intelligence 50, 4602–4615 (2020)

Abstract

Statistical topic modeling approaches such as Latent Dirichlet Allocation (LDA) have been widely applied to document classification. However, standard LDA is a fully unsupervised algorithm, so there is growing interest in incorporating prior information into the topic modeling procedure. Effective approaches have been developed to model different kinds of prior information, for example observed labels, hidden labels, correlations among labels, and label frequencies; however, these methods are often computationally expensive because of their model complexity. In this paper, we propose a new supervised topic model for document classification, Twin Labeled LDA (TL-LDA), which runs two parallel topic modeling processes: one incorporates prior label information through hierarchical Dirichlet distributions, and the other models grouping tags that carry prior knowledge about label correlation. The two processes are independent of each other, so TL-LDA can be trained efficiently with multi-threaded parallel computing. Quantitative experiments against state-of-the-art approaches show that our model achieves the best scores on both rank-based and binary prediction metrics for single-label classification, and the best scores on three metrics (One Error, Micro-F1, and Macro-F1) for multi-label classification, on both power-law and non-power-law datasets. These results show that, by fully modeling prior knowledge, our model attains strong performance and generalizability on document classification. Further comparisons with recent work also indicate that the proposed model is competitive with state-of-the-art approaches.
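To make the parallel-training idea concrete, the sketch below is a minimal, hypothetical illustration only, not the authors' implementation: it runs two independent Labeled-LDA-style collapsed Gibbs samplers, one restricted to each document's observed labels and one restricted to assumed grouping tags, in separate threads over a toy corpus. It omits TL-LDA's hierarchical Dirichlet priors, and the function `gibbs_labeled_lda`, the toy word ids, and the `group_topics` assignments are illustrative inventions; the threads here only demonstrate that the two chains share no state, not an actual speedup in CPython.

```python
# Illustrative sketch (assumed, not TL-LDA itself): two independent Labeled-LDA-style
# collapsed Gibbs samplers trained concurrently, mirroring the abstract's claim that
# the label-side and grouping-tag-side processes can be run in parallel.
import threading
import numpy as np

def gibbs_labeled_lda(docs, doc_topics, n_topics, n_vocab,
                      alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampler where each document may only use its own topic set."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # document-topic counts
    nkw = np.zeros((n_topics, n_vocab))     # topic-word counts
    nk = np.zeros(n_topics)                 # per-topic totals
    assign = []
    for d, words in enumerate(docs):
        zs = [int(rng.choice(doc_topics[d])) for _ in words]
        assign.append(zs)
        for w, z in zip(words, zs):
            ndk[d, z] += 1; nkw[z, w] += 1; nk[z] += 1
    for _ in range(n_iter):
        for d, words in enumerate(docs):
            cand = np.array(doc_topics[d])  # topics this document is allowed to use
            for i, w in enumerate(words):
                z = assign[d][i]
                ndk[d, z] -= 1; nkw[z, w] -= 1; nk[z] -= 1
                p = (ndk[d, cand] + alpha) * (nkw[cand, w] + beta) / (nk[cand] + n_vocab * beta)
                z = int(cand[rng.choice(len(cand), p=p / p.sum())])
                assign[d][i] = z
                ndk[d, z] += 1; nkw[z, w] += 1; nk[z] += 1
    return nkw / nkw.sum(axis=1, keepdims=True)  # per-topic word distributions

# Toy data: word ids per document, observed labels, and assumed grouping tags.
docs = [[0, 1, 2, 1], [2, 3, 4, 4], [0, 4, 3, 2]]
label_topics = [[0], [1], [0, 1]]   # topics allowed by each document's labels
group_topics = [[0], [0], [1]]      # hypothetical grouping tags (label-correlation prior)

results = {}
t_label = threading.Thread(target=lambda: results.update(
    label=gibbs_labeled_lda(docs, label_topics, n_topics=2, n_vocab=5)))
t_group = threading.Thread(target=lambda: results.update(
    group=gibbs_labeled_lda(docs, group_topics, n_topics=2, n_vocab=5)))
t_label.start(); t_group.start(); t_label.join(); t_group.join()
print("label-side topics:\n", results["label"])
print("group-side topics:\n", results["group"])
```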


Notes

  1. https://github.com/timothyrubin/DependencyLDA


Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61772352, and in part by the Science and Technology Planning Project of Sichuan Province under Grant Nos. 2019YFG0400, 2018GZDZX0031, 2018GZDZX0004, 2017GZDZX0003, 2018JY0182, and 19ZDYF1286.

Author information

Corresponding author

Correspondence to Wei Wang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, W., Guo, B., Shen, Y. et al. Twin labeled LDA: a supervised topic model for document classification. Appl Intell 50, 4602–4615 (2020). https://doi.org/10.1007/s10489-020-01798-x

