An Improved LDA Topic Modeling Method Based on Partition for Medium and Long Texts

Guo, Chonghui; Lu, Menglin; Wei, Wei

doi:10.1007/s40745-019-00218-3

Published: 25 April 2019

Volume 8, pages 331–344, (2021)
Annals of Data Science

Chonghui Guo¹,
Menglin Lu¹ &
Wei Wei²

Abstract

Latent Dirichlet Allocation (LDA) is a topic model that represents each document as a distribution over topics and each topic as a distribution over words, mining the semantic relationships hidden in text. However, traditional LDA models a document as a whole and therefore ignores some of the semantic features carried by the internal semantic structure of medium and long texts. Rather than modeling topics at the document level with the original LDA, it is better to first divide the document into semantic topic units. In this paper, we propose an improved LDA topic model based on partition (LDAP) for medium and long texts. LDAP preserves the benefits of the original LDA while refining the modeling granularity from the document level to the semantic topic level, which makes it particularly suitable for topic modeling of medium and long texts. Extensive classification experiments on the Fudan University corpus and the Sogou Lab corpus demonstrate that LDAP outperforms other models such as LDA, HDP, LSA and doc2vec.
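The partition idea can be illustrated with a minimal sketch. The abstract does not specify how LDAP partitions a document or how unit-level results are combined, so everything below is an assumption made for illustration: blank-line paragraphs stand in for semantic topic units, a small collapsed Gibbs sampler stands in for LDA inference, and a length-weighted average turns unit-level topic distributions back into a document-level one. The function names `partition`, `lda_gibbs` and `document_topics` are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict

def partition(doc, sep="\n\n"):
    """Split a document into token lists, one per unit.
    Assumption: blank-line paragraphs approximate semantic topic units."""
    return [u.split() for u in doc.split(sep) if u.strip()]

def lda_gibbs(units, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over units (not whole documents).
    Returns one smoothed topic distribution per unit."""
    rng = random.Random(seed)
    vocab_size = len({w for u in units for w in u})
    n_uk = [[0] * n_topics for _ in units]               # unit-topic counts
    n_kw = [defaultdict(int) for _ in range(n_topics)]   # topic-word counts
    n_k = [0] * n_topics                                 # tokens per topic
    z = []                                               # topic of each token
    for i, unit in enumerate(units):
        assignments = []
        for w in unit:
            k = rng.randrange(n_topics)                  # random initial topic
            assignments.append(k)
            n_uk[i][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(assignments)
    for _ in range(n_iter):
        for i, unit in enumerate(units):
            for j, w in enumerate(unit):
                k = z[i][j]                              # remove current assignment
                n_uk[i][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                weights = [
                    (n_uk[i][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[i][j] = k                              # record resampled topic
                n_uk[i][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return [
        [(c + alpha) / (len(unit) + n_topics * alpha) for c in counts]
        for counts, unit in zip(n_uk, units)
    ]

def document_topics(unit_thetas, unit_lengths):
    """Length-weighted average of unit-level topic distributions.
    Assumption: this aggregation rule is illustrative, not the paper's."""
    total = sum(unit_lengths)
    n_topics = len(unit_thetas[0])
    return [
        sum(theta[k] * n for theta, n in zip(unit_thetas, unit_lengths)) / total
        for k in range(n_topics)
    ]

# Toy document with two paragraphs on different subjects.
doc = "stock market shares price stock\n\nfootball match goal team match"
units = partition(doc)
unit_thetas = lda_gibbs(units, n_topics=2)
doc_theta = document_topics(unit_thetas, [len(u) for u in units])
```

Here `unit_thetas` holds one topic distribution per paragraph and `doc_theta` a single distribution for the whole document, which could then feed a classifier. A real pipeline would fit the model on units drawn from an entire corpus rather than from one toy document.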

Acknowledgements

This research was supported in part by the National Natural Science Foundation of China [Grant Nos. 71771034, 71421001], Science and Technology Program of Jieyang (2017xm041), and the Scientific and Technological Innovation Foundation of Dalian (2018J11CY009).

Author information

Authors and Affiliations

Institute of Systems Engineering, Dalian University of Technology, Dalian, 116024, Liaoning, People’s Republic of China
Chonghui Guo & Menglin Lu
Center for Energy, Environment and Economy Research, Zhengzhou University, Zhengzhou, 450001, Henan, People’s Republic of China
Wei Wei

Corresponding author

Correspondence to Chonghui Guo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Guo, C., Lu, M. & Wei, W. An Improved LDA Topic Modeling Method Based on Partition for Medium and Long Texts. Ann. Data. Sci. 8, 331–344 (2021). https://doi.org/10.1007/s40745-019-00218-3

Received: 25 September 2018
Revised: 02 April 2019
Accepted: 17 April 2019
Published: 25 April 2019
Issue Date: June 2021
DOI: https://doi.org/10.1007/s40745-019-00218-3
