Semantic similarity measure for topic modeling using latent Dirichlet allocation and collapsed Gibbs sampling

Ajinaja, Micheal Olalekan; Adetunmbi, Adebayo Olusola; Ugwu, Chukwuemeka Christian; Popoola, Olugbemiga Solomon

doi:10.1007/s42044-022-00124-7

Research
Published: 08 November 2022

Volume 6, pages 81–94, (2023)

Iran Journal of Computer Science

Micheal Olalekan Ajinaja¹,
Adebayo Olusola Adetunmbi¹,
Chukwuemeka Christian Ugwu¹ &
Olugbemiga Solomon Popoola²

Abstract

Automatically extracting topics from large amounts of text is one of the main uses of natural language processing (NLP). The latent Dirichlet allocation (LDA) technique is frequently used to extract topics from pre-processed documents based on word frequency. One of the main problems of LDA is that the extracted topics are of poor quality when a document does not coherently belong to a single topic. Gibbs sampling, by contrast, operates word by word: it updates the topic assignment of one word at a time, which makes it applicable to documents that cover a variety of topics. To improve the quality of the extracted topics, this paper develops a hybrid semantic similarity measure for topic modeling that combines LDA with Gibbs sampling to maximize the coherence score. The proposed model was evaluated on an unstructured dataset taken from a public repository. In the evaluation, the proposed LDA-Gibbs model achieved a coherence score of 0.52650, compared with 0.46504 for LDA. The proposed multi-level model therefore yields topics of higher quality.
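The word-by-word update described in the abstract can be illustrated with a minimal sketch of collapsed Gibbs sampling for LDA. This is not the authors' implementation: the function name `collapsed_gibbs_lda`, the symmetric priors `alpha` and `beta`, their default values, and the toy corpus are all assumptions made for illustration, and the coherence scoring used in the paper is not reproduced here.

```python
import random


def collapsed_gibbs_lda(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids.

    alpha and beta are assumed symmetric Dirichlet priors (illustrative values).
    """
    rng = random.Random(seed)
    # Count tables: document-topic, topic-word, and per-topic totals.
    n_dk = [[0] * num_topics for _ in docs]
    n_kw = [[0] * vocab_size for _ in range(num_topics)]
    n_k = [0] * num_topics

    # Start from a random topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token's assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised full conditional for each candidate topic t,
                # given the assignments of all other tokens.
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic of this single word and restore counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw


# Hypothetical toy corpus: word ids index into this vocabulary.
vocab = ["gene", "dna", "cell", "goal", "team", "match"]
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 4, 3], [0, 2, 1, 4, 5]]

assignments, doc_topic, topic_word = collapsed_gibbs_lda(
    toy_docs, num_topics=2, vocab_size=len(vocab))

# Show the three most frequent words assigned to each topic.
for k, counts in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda w: counts[w], reverse=True)[:3]
    print(f"topic {k}:", [vocab[w] for w in top])
```

Because each step resamples the topic of one token conditioned on all the others, a single document can end up with words assigned to several topics, which is the property the abstract relies on for mixed-topic documents.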

Data availability

The data that support the findings of this study are openly available in Newsgroup Master Datasets at https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json.

Funding

There was no outside funding for the study.

Author information

Authors and Affiliations

Department of Computer Science, Federal University of Technology, Akure, Nigeria
Micheal Olalekan Ajinaja, Adebayo Olusola Adetunmbi & Chukwuemeka Christian Ugwu
Department of Computer Science, Osun State College of Education, Ila-Orangun, Nigeria
Olugbemiga Solomon Popoola

Contributions

All authors contributed to the conception and design of the research. Data gathering, material preparation, and analysis were performed by MOA. The first draft of the manuscript was written by AAO, UCC, and POS. All authors read and approved the final manuscript and agreed to its full content, including the author list and contributions.

Corresponding author

Correspondence to Micheal Olalekan Ajinaja.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to disclose about the article's content.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ajinaja, M.O., Adetunmbi, A.O., Ugwu, C.C. et al. Semantic similarity measure for topic modeling using latent Dirichlet allocation and collapsed Gibbs sampling. Iran J Comput Sci 6, 81–94 (2023). https://doi.org/10.1007/s42044-022-00124-7

Received: 16 August 2022
Accepted: 25 October 2022
Published: 08 November 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s42044-022-00124-7
