Abstract
Text categorization has a variety of applications, such as sentiment analysis of users' tweets and sorting blog posts into categories. The real-time data available for categorization is usually unstructured, so an efficient preprocessing algorithm can help achieve better accuracy. Term frequency–inverse document frequency (tf-idf) and word2vec word embeddings are widely used to represent text before a classification model is applied. To show the effect of these techniques on text categorization, we compare the accuracies of multi-class text categorization algorithms such as Support Vector Machine (SVM), Logistic Regression and K-Nearest Neighbor (KNN) on each representation. The TagMyNews dataset is used to train the models. The results indicate that word2vec is the more effective technique, as it yields higher accuracy for all the classification methods (KNN: 79.38%, SVM: 93.59%, Logistic Regression: 87.46%) than tf-idf (KNN: 73.37%, SVM: 84%, Logistic Regression: 73.98%).
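To make the tf-idf side of the comparison concrete, the following is a minimal sketch in plain Python, not the authors' pipeline. It assumes the unsmoothed weighting tf(t, d) · ln(N / df(t)), a whitespace tokenizer, and a KNN classifier based on cosine similarity; the abstract specifies none of these details. The four training sentences and their labels are hypothetical toy data, not drawn from TagMyNews, and all function names are illustrative. word2vec is not shown because it requires a trained embedding model.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def fit_idf(tokenized_docs):
    """Compute idf(t) = ln(N / df(t)) over the training documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse {term: tf * idf} vector; terms unseen in training are dropped."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(query_vec, train_vecs, train_labels, k=1):
    """Majority label among the k training vectors most similar to the query."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query_vec, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical toy corpus (assumption: NOT taken from TagMyNews).
train_texts = [
    "team wins the football match",
    "player scores in the final match",
    "stocks fall as market closes",
    "bank raises interest rates in market",
]
train_labels = ["sport", "sport", "business", "business"]

train_tokens = [tokenize(t) for t in train_texts]
idf = fit_idf(train_tokens)
train_vecs = [tfidf_vector(tokens, idf) for tokens in train_tokens]

queries = ["football team wins", "market interest rates"]
predictions = [
    knn_predict(tfidf_vector(tokenize(q), idf), train_vecs, train_labels, k=1)
    for q in queries
]
print(predictions)
```

In this weighting, a term that appears in every training document gets an idf of zero and so contributes nothing to similarity. Practical implementations often use smoothed variants of the formula, which is one reason reported accuracies can differ between toolkits.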
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Arora, M., Mittal, V., Aggarwal, P. (2021). Enactment of tf-idf and word2vec on Text Categorization. In: Abraham, A., Castillo, O., Virmani, D. (eds) Proceedings of 3rd International Conference on Computing Informatics and Networks. Lecture Notes in Networks and Systems, vol 167. Springer, Singapore. https://doi.org/10.1007/978-981-15-9712-1_17
DOI: https://doi.org/10.1007/978-981-15-9712-1_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9711-4
Online ISBN: 978-981-15-9712-1
eBook Packages: Engineering, Engineering (R0)