Topic Modeling for Text Classification

  • Conference paper
Emerging Technology in Modelling and Graphics

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 937))

Abstract

Topic models are statistical algorithms for discovering the latent semantic structures of a large body of text. The volume of textual data we encounter in day-to-day life is now far beyond what we can handle manually. Topic models offer a way to understand and manage these vast collections of unstructured text. Although they first emerged as a text-mining tool, topic models have since found applications in various other fields. This paper presents a comparative study of LSA and the commonly used TF-IDF approach for text classification and shows that LSA yields better accuracy in classifying texts. The novelty of the paper lies in using a much sparser representation than the usual TF-IDF; in addition, LSA can pick up synonymous words through the topics it learns. The paper also proposes a method, based on the concept of entropy, that further increases the accuracy of text classification.
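As a rough illustration of two ingredients named in the abstract, the sketch below computes TF-IDF weights and the Shannon entropy of a term's distribution over class labels. It is a minimal sketch under stated assumptions, not the authors' method: the abstract does not describe how entropy is used, so `class_entropy`, the toy corpus, and the particular TF-IDF variant (relative term frequency times `log(N / df)`) are all hypothetical choices made here. LSA would go one step further and apply a truncated singular value decomposition to the resulting term-document matrix; that step is omitted.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


def class_entropy(term, docs, labels):
    """Shannon entropy (in bits) of the class labels of documents containing `term`.

    Hypothetical helper: a low value means the term is concentrated in few classes.
    """
    counts = Counter(label for doc, label in zip(docs, labels) if term in doc)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # term never occurs, so there is no distribution to measure
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy corpus with two classes.
docs = [
    "the match ended with a late goal".split(),
    "the team won the match".split(),
    "the bank raised interest rates".split(),
    "the market fell as rates rose".split(),
]
labels = ["sport", "sport", "finance", "finance"]

weights = tf_idf(docs)
```

In this toy corpus, "the" occurs in every document, so its TF-IDF weight is zero and its class entropy is at the two-class maximum of one bit, whereas "match" occurs only in sport documents and has zero entropy. A term with low class entropy is, under this assumed scoring, a more discriminative feature for classification.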

Acknowledgements

We gratefully acknowledge the contribution of MUST Research Club and the CU Data Science Group. MUST Research Club is a non-profit organization registered under the Society Act of India. It is dedicated to promoting excellence and competence in the fields of data science, cognitive computing, artificial intelligence, machine learning and advanced analytics for the benefit of society. The Calcutta University Data Science Group works to create solutions to societal problems, as well as generic solutions to common data science and engineering issues. It is a forum that brings together researchers, industry practitioners, scholars and interns to form working groups.

Author information

Corresponding author

Correspondence to Pinaki Prasad Guha Neogi.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Neogi, P.P.G., Das, A.K., Goswami, S., Mustafi, J. (2020). Topic Modeling for Text Classification. In: Mandal, J., Bhattacharya, D. (eds) Emerging Technology in Modelling and Graphics. Advances in Intelligent Systems and Computing, vol 937. Springer, Singapore. https://doi.org/10.1007/978-981-13-7403-6_36

  • DOI: https://doi.org/10.1007/978-981-13-7403-6_36

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-7402-9

  • Online ISBN: 978-981-13-7403-6

  • eBook Packages: Engineering, Engineering (R0)
