Abstract
Latent Dirichlet Allocation (LDA) is a topic model that represents each document as a distribution over topics and each topic as a distribution over words, uncovering semantic relationships hidden in text. However, traditional LDA ignores some of the semantic features hidden in the internal semantic structure of medium and long texts. Rather than modeling topics at the document level, as the original LDA does, it is better to divide each document into separate semantic topic units. In this paper, we propose an improved LDA topic model based on partition (LDAP) for medium and long texts. LDAP preserves the benefits of the original LDA while refining the modeling granularity from the document level to the semantic topic level, which makes it particularly suitable for topic modeling of medium and long texts. Extensive classification experiments on the Fudan University corpus and the Sougou Lab corpus demonstrate that LDAP outperforms other topic models, such as LDA, HDP, LSA and doc2vec.
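To make the idea of partition-level modeling concrete, the following is a minimal Python sketch, not the LDAP algorithm itself. The abstract does not say how documents are partitioned or how unit-level topics are combined, so two illustrative assumptions are made here: documents are split on blank lines (a stand-in for semantic partitioning), and unit-level topic distributions are merged into a document-level distribution by a length-weighted average. The function names `partition`, `gibbs_lda`, and `document_topics` are hypothetical, and the sampler is a plain collapsed Gibbs sampler for standard LDA.

```python
import random
from collections import Counter


def partition(doc):
    """Split a document into units on blank lines.

    Assumption: paragraph breaks stand in for semantic topic units.
    """
    return [p.split() for p in doc.split("\n\n") if p.strip()]


def gibbs_lda(units, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for standard LDA over token lists.

    Returns one smoothed topic distribution per unit.
    """
    rng = random.Random(seed)
    vocab_size = len({w for u in units for w in u})
    unit_topic = [[0] * n_topics for _ in units]       # counts n[u][k]
    topic_word = [Counter() for _ in range(n_topics)]  # counts n[k][w]
    topic_total = [0] * n_topics                       # counts n[k]
    assign = []
    # Random initial topic assignment for every token.
    for u, words in enumerate(units):
        zs = []
        for w in words:
            k = rng.randrange(n_topics)
            zs.append(k)
            unit_topic[u][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for u, words in enumerate(units):
            for i, w in enumerate(words):
                # Remove the token's current assignment from the counts.
                k = assign[u][i]
                unit_topic[u][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the full conditional.
                weights = [
                    (unit_topic[u][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[u][i] = k
                unit_topic[u][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return [
        [(c + alpha) / (len(words) + n_topics * alpha) for c in counts]
        for counts, words in zip(unit_topic, units)
    ]


def document_topics(docs, n_topics):
    """Model topics per unit, then average back to the document level.

    Assumption: length-weighted averaging is an illustrative choice.
    """
    units, owner = [], []
    for d, doc in enumerate(docs):
        for unit in partition(doc):
            units.append(unit)
            owner.append(d)
    theta = gibbs_lda(units, n_topics)
    result = []
    for d in range(len(docs)):
        rows = [(len(units[i]), theta[i])
                for i in range(len(units)) if owner[i] == d]
        if not rows:  # empty document: fall back to a uniform distribution
            result.append([1.0 / n_topics] * n_topics)
            continue
        total = sum(n for n, _ in rows)
        result.append([sum(n * t[k] for n, t in rows) / total
                       for k in range(n_topics)])
    return result
```

Calling `document_topics(docs, n_topics)` on a list of strings returns one topic distribution per document. The resulting vectors could then be passed to a classifier, which is how the abstract describes the evaluation, though the classifier and features used in the paper are not stated here.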
Acknowledgements
This research was supported in part by the National Natural Science Foundation of China [Grant Nos. 71771034, 71421001], Science and Technology Program of Jieyang (2017xm041), and the Scientific and Technological Innovation Foundation of Dalian (2018J11CY009).
Cite this article
Guo, C., Lu, M. & Wei, W. An Improved LDA Topic Modeling Method Based on Partition for Medium and Long Texts. Ann. Data. Sci. 8, 331–344 (2021). https://doi.org/10.1007/s40745-019-00218-3
DOI: https://doi.org/10.1007/s40745-019-00218-3