
Semantic similarity measure for topic modeling using latent Dirichlet allocation and collapsed Gibbs sampling

  • Research
  • Published in Iran Journal of Computer Science

Abstract

Automatically extracting topics from large amounts of text is one of the main applications of natural language processing (NLP). Latent Dirichlet allocation (LDA) is frequently used to extract topics from pre-processed documents based on word frequency. A major weakness of LDA is that the extracted topics are of poor quality when a document does not cohere around a single topic. Collapsed Gibbs sampling, by contrast, operates word by word, updating the topic assignment of each individual word, so it can handle documents that span several topics. To improve the quality of the extracted topics, this paper develops a hybrid semantic similarity measure for topic modeling that combines LDA with collapsed Gibbs sampling to maximize the coherence score. To verify the effectiveness of the proposed model, an unstructured dataset was taken from a public repository. The evaluation shows that the proposed LDA-Gibbs model achieved a coherence score of 0.52650, compared with 0.46504 for standard LDA, indicating that the proposed multi-level model extracts topics of better quality.
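
As an illustration only (not the authors' implementation), the sketch below shows how an LDA model can be trained and scored with a topic-coherence measure using the gensim library. Note that gensim's LdaModel is estimated with online variational Bayes rather than the collapsed Gibbs sampler described in the paper, and the toy corpus, topic count, coherence variant (c_v), and other parameters are assumptions made for the example.

    # Illustrative sketch: train an LDA model and compute a topic-coherence score.
    # gensim's LdaModel uses variational Bayes, not collapsed Gibbs sampling; the
    # corpus, num_topics, and the c_v coherence variant are assumptions for the demo.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel
    from gensim.models.coherencemodel import CoherenceModel

    # `texts` stands in for the pre-processed, tokenized documents.
    texts = [["graphics", "image", "format", "file"],
             ["space", "orbit", "launch", "shuttle"],
             ["image", "file", "graphics", "color"]]

    dictionary = Dictionary(texts)                        # word <-> id mapping
    corpus = [dictionary.doc2bow(doc) for doc in texts]   # bag-of-words vectors

    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=2, passes=10, random_state=42)

    # Coherence is the quality measure the paper reports (0.46504 for LDA vs
    # 0.52650 for LDA-Gibbs); c_v is one common choice of coherence variant.
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
    print(f"c_v coherence: {cm.get_coherence():.5f}")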

Data availability

The data that support the findings of this study are openly available in Newsgroup Master Datasets at https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json.
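
For convenience, a minimal loading sketch using pandas is shown below; the 'content' column name is an assumption based on common usage of this file and may need to be adjusted to the actual schema.

    # Minimal loading sketch (assumes pandas is installed and the posts are stored
    # in a 'content' column; adjust if the file's schema differs).
    import pandas as pd

    URL = "https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json"
    df = pd.read_json(URL)              # one record per newsgroup post
    documents = df["content"].tolist()  # raw text to be pre-processed for LDA
    print(len(documents), "documents loaded")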


Funding

This study received no external funding.

Author information


Contributions

All authors contributed to the conception and design of the study. Data collection, material preparation, and analysis were performed by MOA. The first draft of the manuscript was written by AAO, UCC, and POS. All authors read and approved the final manuscript and agreed to the full content of the article, including the author list and the stated contributions.

Corresponding author

Correspondence to Micheal Olalekan Ajinaja.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ajinaja, M.O., Adetunmbi, A.O., Ugwu, C.C. et al. Semantic similarity measure for topic modeling using latent Dirichlet allocation and collapsed Gibbs sampling. Iran J Comput Sci 6, 81–94 (2023). https://doi.org/10.1007/s42044-022-00124-7
