Enactment of tf-idf and word2vec on Text Categorization

Arora, Monika; Mittal, Vrinda; Aggarwal, Priyanka

doi:10.1007/978-981-15-9712-1_17

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 167)

Abstract

Text categorization has a variety of applications, such as sentiment analysis of users' tweets and sorting blog posts into categories. The real-time data available for categorization is usually unstructured, so an efficient preprocessing algorithm can help achieve better accuracy. Term frequency–inverse document frequency (tf-idf) and word2vec word embeddings are widely used to represent text before a classification model is applied. To show the effect of these techniques on text categorization, we compare the accuracies of multi-class text categorization algorithms, namely Support Vector Machine (SVM), Logistic Regression and K-Nearest Neighbor (KNN), when each is trained on features from each technique. The TagMyNews dataset is used to train the models. The results indicate that word2vec is the more effective technique, as it yields higher accuracy for every classification method (KNN: 79.38%, SVM: 93.59%, Logistic Regression: 87.46%) than tf-idf (KNN: 73.37%, SVM: 84%, Logistic Regression: 73.98%).
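
To make the tf-idf side of this comparison concrete, the following is a minimal sketch in Python. It is not the authors' pipeline: the abstract does not say which tf-idf variant or KNN settings were used, so the weighting here (term frequency normalized by document length, unsmoothed idf of log(N/df)), the cosine-similarity 1-nearest-neighbour rule, the helper names fit_idf, tfidf and cosine, and the four-document toy corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Compute idf = log(N / df) for every term in the training documents."""
    n_docs = len(docs)
    # Document frequency: number of training documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def tfidf(doc, idf):
    """Map a tokenized document to a sparse {term: weight} vector.

    Terms never seen during training have no idf and are dropped.
    """
    counts = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in counts.items() if t in idf}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy corpus (hypothetical, not drawn from TagMyNews).
train_docs = [
    "team wins the football match".split(),
    "player scores in the match".split(),
    "stocks fall as the market drops".split(),
    "market rallies as stocks rise".split(),
]
train_labels = ["sport", "sport", "business", "business"]

idf = fit_idf(train_docs)
train_vecs = [tfidf(doc, idf) for doc in train_docs]

# Classify an unseen headline with a 1-nearest-neighbour rule:
# take the label of the most similar training document.
query_vec = tfidf("stocks and the market".split(), idf)
scores = [cosine(query_vec, vec) for vec in train_vecs]
predicted = train_labels[scores.index(max(scores))]
print(predicted)
```

Note how "the", which occurs in three of the four training documents, receives a much smaller idf than "stocks" or "market", so it contributes little to the similarity. A word2vec representation would instead map each word to a dense learned vector, which this sketch does not attempt.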

Author information

Authors and Affiliations

Information Technology, BPIT, New Delhi, India
Monika Arora, Vrinda Mittal & Priyanka Aggarwal

Corresponding author

Correspondence to Monika Arora.

Editor information

Editors and Affiliations

Scientific Network for Innovation and Research Excellence, Machine Intelligence Research Labs (MIR Labs), Auburn, WA, USA
Ajith Abraham
Tijuana Institute of Technology, Tijuana, Mexico
Oscar Castillo
Bhagwan Parshuram Institute of Technology, New Delhi, India
Deepali Virmani

About this paper

Cite this paper

Arora, M., Mittal, V., Aggarwal, P. (2021). Enactment of tf-idf and word2vec on Text Categorization. In: Abraham, A., Castillo, O., Virmani, D. (eds) Proceedings of 3rd International Conference on Computing Informatics and Networks. Lecture Notes in Networks and Systems, vol 167. Springer, Singapore. https://doi.org/10.1007/978-981-15-9712-1_17

DOI: https://doi.org/10.1007/978-981-15-9712-1_17
Published: 15 March 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9711-4
Online ISBN: 978-981-15-9712-1
eBook Packages: Engineering, Engineering (R0)