
Normalized Approach to Find Optimal Number of Topics in Latent Dirichlet Allocation (LDA)

  • Conference paper
  • First Online:
Proceedings of International Conference on Trends in Computational and Cognitive Engineering

Abstract

Feature extraction is one of the most challenging tasks in the Machine Learning (ML) arena: the more features one can extract correctly, the more accurate the knowledge one can derive from data. Latent Dirichlet Allocation (LDA) is a form of topic modeling used to extract features from text data, but finding the optimal number of topics, on which the success of LDA depends, is tremendously challenging, especially when there is no prior knowledge about the data. Some studies suggest perplexity, some the Rate of Perplexity Change (RPC), and some coherence as the method for finding the number of topics that achieves both high accuracy and low processing time for LDA. In this study, the authors propose two new methods, Normalized Absolute Coherence (NAC) and Normalized Absolute Perplexity (NAP), for predicting the optimal number of topics. The authors run standard ML experiments to measure and compare the reliability of the existing methods (perplexity, coherence, RPC) and of the proposed NAC and NAP in searching for an optimal number of topics in LDA. The study shows that NAC and NAP work better than the existing methods, and that perplexity, coherence, and RPC can be misleading when estimating the optimal number of topics.
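The exact NAC and NAP formulas are not given in this preview, so the sketch below only illustrates the general selection procedure the abstract describes: score several candidate topic counts, normalize the absolute scores, and pick the best candidate. The min-max normalization, the example scores, and the variable names are assumptions for illustration; RPC follows the definition of Zhao et al. (2015), the absolute rate of perplexity change between consecutive topic counts.

```python
# Hedged sketch: choosing a topic count from per-model scores.
# The scores below are hypothetical stand-ins for values that would
# come from fitting LDA models at each candidate topic count.

def minmax_normalize(scores):
    """Scale a list of scores to [0, 1] (assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def rate_of_perplexity_change(topic_counts, perplexities):
    """RPC_i = |(P_i - P_{i-1}) / (t_i - t_{i-1})| (Zhao et al., 2015)."""
    return [abs((perplexities[i] - perplexities[i - 1])
                / (topic_counts[i] - topic_counts[i - 1]))
            for i in range(1, len(topic_counts))]

# Candidate topic counts with hypothetical LDA scores.
topic_counts = [5, 10, 15, 20, 25]
coherences   = [0.42, 0.55, 0.61, 0.58, 0.50]       # higher is better
perplexities = [980.0, 870.0, 820.0, 800.0, 795.0]  # lower is better

nac = minmax_normalize([abs(c) for c in coherences])
nap = minmax_normalize([abs(p) for p in perplexities])
rpc = rate_of_perplexity_change(topic_counts, perplexities)

best_by_nac = topic_counts[nac.index(max(nac))]  # maximize normalized coherence
best_by_nap = topic_counts[nap.index(min(nap))]  # minimize normalized perplexity
```

In practice, the coherence and perplexity values would be produced by a topic-modeling library (e.g., gensim's `CoherenceModel` and `LdaModel.log_perplexity`) over a grid of topic counts; the normalization step then makes scores from differently sized models directly comparable.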



Author information

Correspondence to Mahedi Hasan.


Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Hasan, M., Rahman, A., Karim, M.R., Khan, M.S.I., Islam, M.J. (2021). Normalized Approach to Find Optimal Number of Topics in Latent Dirichlet Allocation (LDA). In: Kaiser, M.S., Bandyopadhyay, A., Mahmud, M., Ray, K. (eds) Proceedings of International Conference on Trends in Computational and Cognitive Engineering. Advances in Intelligent Systems and Computing, vol 1309. Springer, Singapore. https://doi.org/10.1007/978-981-33-4673-4_27
