Abstract
Topic models are statistical algorithms for discovering the latent semantic structure of a large body of text. The volume of textual data we encounter every day is far beyond what we can process manually, and topic models offer a way to understand and organise such vast collections of unstructured text. Although they first emerged as text-mining tools, topic models have since found applications in many other fields. This paper presents a thorough comparative study of latent semantic analysis (LSA) against the commonly used TF-IDF approach for text classification and shows that LSA yields better classification accuracy. The novelty of the work lies in using a much sparser representation than the usual TF-IDF one, and in LSA's ability to recover a topic even when documents express it through synonymous words. The paper further proposes an entropy-based method that additionally improves classification accuracy.
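To make the comparison concrete, the following is a minimal sketch of an LSA-versus-TF-IDF classification experiment, not the authors' exact pipeline. It assumes scikit-learn and uses the 20 Newsgroups corpus as a stand-in dataset; the classifier (logistic regression) and the number of LSA components (300) are illustrative choices rather than the settings used in the paper.

```python
# Minimal sketch: plain TF-IDF features vs. LSA (TF-IDF followed by truncated SVD)
# for text classification. The dataset, classifier and hyperparameters are
# placeholders; the paper's own corpus and settings may differ.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# Baseline: classify directly on the high-dimensional sparse TF-IDF matrix.
tfidf_clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
tfidf_clf.fit(train.data, train.target)
print("TF-IDF accuracy:", accuracy_score(test.target, tfidf_clf.predict(test.data)))

# LSA: project the TF-IDF vectors onto a low-rank "topic" space via truncated SVD,
# which maps co-occurring (e.g. synonymous) terms onto shared latent dimensions.
lsa_clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=300, random_state=0),  # illustrative rank
    Normalizer(copy=False),                          # re-normalise after SVD
    LogisticRegression(max_iter=1000),
)
lsa_clf.fit(train.data, train.target)
print("LSA accuracy:", accuracy_score(test.target, lsa_clf.predict(test.data)))
```

The LSA pipeline differs from the baseline only in the truncated-SVD projection step, so any accuracy difference can be attributed to the latent-topic representation rather than to the classifier.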
Acknowledgements
We gratefully acknowledge the contributions of the MUST Research Club and the Calcutta University (CU) Data Science Group. The MUST Research Club is a non-profit organisation registered under the Societies Act of India and is dedicated to promoting excellence and competence in data science, cognitive computing, artificial intelligence, machine learning and advanced analytics for the benefit of society. The Calcutta University Data Science Group develops solutions to societal problems as well as generic solutions to common data science and engineering issues; it is a forum that brings together researchers, industry practitioners, scholars and interns.