
Normalized Approach to Find Optimal Number of Topics in Latent Dirichlet Allocation (LDA)

  • Conference paper
  • First Online:
Proceedings of International Conference on Trends in Computational and Cognitive Engineering

Abstract

Feature extraction is one of the most challenging tasks in the Machine Learning (ML) arena: the more features one can extract correctly, the more accurate the knowledge one can derive from data. Latent Dirichlet Allocation (LDA) is a form of topic modeling used to extract features from text data, but finding the optimal number of topics, on which the success of LDA depends, is tremendously challenging, especially when there is no prior knowledge about the data. Some studies suggest perplexity, some the Rate of Perplexity Change (RPC), and some coherence as the method for finding the number of topics that achieves both high accuracy and low processing time for LDA. In this study, the authors propose two new methods, Normalized Absolute Coherence (NAC) and Normalized Absolute Perplexity (NAP), for predicting the optimal number of topics. The authors run standard ML experiments to measure and compare the reliability of the existing methods (perplexity, coherence, RPC) and of the proposed NAC and NAP in searching for an optimal number of topics in LDA. The study shows that NAC and NAP work better than the existing methods, and that perplexity, coherence, and RPC can be misleading when estimating the optimal number of topics.
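The exact NAC and NAP formulas are not given in this preview, so the sketch below only illustrates the general selection procedure the abstract describes: score several candidate topic counts, normalize the absolute scores, and pick the best candidate. The min-max normalization, the example scores, and the variable names are assumptions for illustration; RPC follows the definition of Zhao et al. (2015), the absolute rate of perplexity change between consecutive topic counts.

```python
# Hedged sketch: choosing a topic count from per-model scores.
# The scores below are hypothetical stand-ins for values that would
# come from fitting LDA models at each candidate topic count.

def minmax_normalize(scores):
    """Scale a list of scores to [0, 1] (assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def rate_of_perplexity_change(topic_counts, perplexities):
    """RPC_i = |(P_i - P_{i-1}) / (t_i - t_{i-1})| (Zhao et al., 2015)."""
    return [abs((perplexities[i] - perplexities[i - 1])
                / (topic_counts[i] - topic_counts[i - 1]))
            for i in range(1, len(topic_counts))]

# Candidate topic counts with hypothetical LDA scores.
topic_counts = [5, 10, 15, 20, 25]
coherences   = [0.42, 0.55, 0.61, 0.58, 0.50]       # higher is better
perplexities = [980.0, 870.0, 820.0, 800.0, 795.0]  # lower is better

nac = minmax_normalize([abs(c) for c in coherences])
nap = minmax_normalize([abs(p) for p in perplexities])
rpc = rate_of_perplexity_change(topic_counts, perplexities)

best_by_nac = topic_counts[nac.index(max(nac))]  # maximize normalized coherence
best_by_nap = topic_counts[nap.index(min(nap))]  # minimize normalized perplexity
```

In practice, the coherence and perplexity values would be produced by a topic-modeling library (e.g., gensim's `CoherenceModel` and `LdaModel.log_perplexity`) over a grid of topic counts; the normalization step then makes scores from differently sized models directly comparable.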



Author information

Correspondence to Mahedi Hasan.


Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Hasan, M., Rahman, A., Karim, M.R., Khan, M.S.I., Islam, M.J. (2021). Normalized Approach to Find Optimal Number of Topics in Latent Dirichlet Allocation (LDA). In: Kaiser, M.S., Bandyopadhyay, A., Mahmud, M., Ray, K. (eds) Proceedings of International Conference on Trends in Computational and Cognitive Engineering. Advances in Intelligent Systems and Computing, vol 1309. Springer, Singapore. https://doi.org/10.1007/978-981-33-4673-4_27
