Abstract
Topic models are statistical algorithms for discovering the latent semantic structure of a large body of text. The volume of textual data we encounter every day is far beyond what we can process manually, and topic models offer a way to understand and organise such vast collections of unstructured text. Although they first emerged as text-mining tools, topic models have since found applications in many other fields. This paper presents a thorough comparative study of latent semantic analysis (LSA) against the commonly used TF-IDF approach for text classification and shows that LSA yields better classification accuracy. The novelty of the work lies in using a much sparser representation than the usual TF-IDF one, and in LSA's ability to recover a topic even when documents express it through synonymous words. The paper further proposes an entropy-based method that additionally improves classification accuracy.
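To make the comparison concrete, the following is a minimal sketch of an LSA-versus-TF-IDF classification experiment, not the authors' exact pipeline. It assumes scikit-learn and uses the 20 Newsgroups corpus as a stand-in dataset; the classifier (logistic regression) and the number of LSA components (300) are illustrative choices rather than the settings used in the paper.

```python
# Minimal sketch: plain TF-IDF features vs. LSA (TF-IDF followed by truncated SVD)
# for text classification. The dataset, classifier and hyperparameters are
# placeholders; the paper's own corpus and settings may differ.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# Baseline: classify directly on the high-dimensional sparse TF-IDF matrix.
tfidf_clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
tfidf_clf.fit(train.data, train.target)
print("TF-IDF accuracy:", accuracy_score(test.target, tfidf_clf.predict(test.data)))

# LSA: project the TF-IDF vectors onto a low-rank "topic" space via truncated SVD,
# which maps co-occurring (e.g. synonymous) terms onto shared latent dimensions.
lsa_clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=300, random_state=0),  # illustrative rank
    Normalizer(copy=False),                          # re-normalise after SVD
    LogisticRegression(max_iter=1000),
)
lsa_clf.fit(train.data, train.target)
print("LSA accuracy:", accuracy_score(test.target, lsa_clf.predict(test.data)))
```

The LSA pipeline differs from the baseline only in the truncated-SVD projection step, so any accuracy difference can be attributed to the latent-topic representation rather than to the classifier.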
Acknowledgements
We gratefully acknowledge the contributions of the MUST Research Club and the Calcutta University (CU) Data Science Group. The MUST Research Club is a non-profit organisation registered under the Societies Act of India and is dedicated to promoting excellence and competence in data science, cognitive computing, artificial intelligence, machine learning and advanced analytics for the benefit of society. The Calcutta University Data Science Group develops solutions to societal problems as well as generic solutions to common data science and engineering issues; it is a forum that brings together researchers, industry practitioners, scholars and interns.