An Improved LDA Topic Modeling Method Based on Partition for Medium and Long Texts

Published in: Annals of Data Science

Abstract

Latent Dirichlet Allocation (LDA) is a topic model that represents a document as a distribution over multiple topics and expresses each topic as a distribution over words, mining the semantic relationships hidden in text. However, traditional LDA ignores some of the semantic features hidden inside the semantic structure of medium and long documents. Rather than applying the original LDA at the document level, it is better to refine each document into different semantic topic units. In this paper, we propose an improved LDA topic model based on partition (LDAP) for medium and long texts. LDAP preserves the benefits of the original LDA while refining the modeling granularity from the document level to the semantic topic level, which is particularly suitable for topic modeling of medium and long texts. Extensive classification experiments on the Fudan University corpus and the Sogou Lab corpus demonstrate that LDAP achieves better performance than other topic models such as LDA, HDP, LSA and doc2vec.
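The abstract does not spell out the partition procedure, so the following Python sketch is only an illustration of the general "partition, then topic-model" idea, assuming gensim and a naive sentence-based partition; the names partition, lda_doc and lda_unit are hypothetical and do not reproduce the authors' LDAP algorithm.

```python
# Illustration only: "partition then topic-model" in the spirit of LDAP,
# not the authors' exact algorithm. Assumes gensim is installed.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

documents = [
    "topic models learn word distributions. long documents often mix several themes",
    "neural embeddings represent text. doc2vec learns a vector for each document",
]

def partition(doc):
    # Hypothetical, naive partition: treat each sentence as one semantic unit.
    return [seg.split() for seg in doc.split(".") if seg.strip()]

# Baseline: document-level LDA, one bag of words per whole document.
doc_tokens = [doc.replace(".", " ").split() for doc in documents]
dictionary = Dictionary(doc_tokens)
lda_doc = LdaModel([dictionary.doc2bow(t) for t in doc_tokens],
                   id2word=dictionary, num_topics=2, passes=10, random_state=0)

# Finer granularity: one bag of words per semantic unit of every document.
unit_tokens = [unit for doc in documents for unit in partition(doc)]
lda_unit = LdaModel([dictionary.doc2bow(t) for t in unit_tokens],
                    id2word=dictionary, num_topics=2, passes=10, random_state=0)

# A document's topic mixture can then be aggregated over its units.
for i, doc in enumerate(documents):
    unit_dists = [lda_unit.get_document_topics(dictionary.doc2bow(u), minimum_probability=0.0)
                  for u in partition(doc)]
    print(f"document {i}: unit-level topic distributions -> {unit_dists}")
```

Aggregating the unit-level distributions back into a document representation (for example, by averaging) is one simple choice; the paper's LDAP presumably defines its own partition and aggregation scheme, which this sketch does not reproduce.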



Acknowledgements

This research was supported in part by the National Natural Science Foundation of China [Grant Nos. 71771034, 71421001], Science and Technology Program of Jieyang (2017xm041), and the Scientific and Technological Innovation Foundation of Dalian (2018J11CY009).

Author information

Corresponding author

Correspondence to Chonghui Guo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Guo, C., Lu, M. & Wei, W. An Improved LDA Topic Modeling Method Based on Partition for Medium and Long Texts. Ann. Data. Sci. 8, 331–344 (2021). https://doi.org/10.1007/s40745-019-00218-3

