
Enactment of tf-idf and word2vec on Text Categorization

  • Conference paper
Proceedings of 3rd International Conference on Computing Informatics and Networks

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 167)

Abstract

Text categorization has a variety of applications, such as sentiment analysis of users' tweets and sorting blog posts into categories. The real-time data available for categorization is usually unstructured, so an efficient preprocessing algorithm can help achieve better accuracy. Term frequency–inverse document frequency (tf-idf) and word2vec word embeddings are widely used to represent text before a classification model is applied. To show the effect of these techniques on text categorization, we compare the accuracies of multi-class text categorization algorithms, namely Support Vector Machine (SVM), Logistic Regression, and K-Nearest Neighbor (KNN), on each representation. The TagMyNews dataset is used to train the models. The results indicate that word2vec is the more effective technique, as it yields higher accuracy for every classification method (KNN: 79.38%, SVM: 93.59%, Logistic Regression: 87.46%) than tf-idf (KNN: 73.37%, SVM: 84%, Logistic Regression: 73.98%).
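
To make the tf-idf side of this comparison concrete, the following is a minimal sketch in plain Python of tf-idf weighting followed by a cosine-similarity nearest-neighbor classifier. It is not the authors' implementation: the four training sentences and their labels are invented toy data (not TagMyNews), the smoothed idf formula is one common variant chosen here as an assumption, and the helper names (`fit_idf`, `tfidf_vector`, `knn_predict`) are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; real pipelines would also strip
    # punctuation and stop words.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    # This is one common variant, assumed here for illustration.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(doc, idf):
    # Sparse tf-idf vector (dict), L2-normalised; unseen terms are dropped.
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(query, train_vecs, labels, idf, k=1):
    # Majority vote among the k most similar training documents.
    q = tfidf_vector(query, idf)
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(q, train_vecs[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy corpus, for illustration only.
train_docs = [
    "stocks fall as markets react to interest rates",
    "team wins the final match in extra time",
    "central bank raises interest rates again",
    "striker scores twice in the cup final",
]
train_labels = ["business", "sport", "business", "sport"]

idf = fit_idf(train_docs)
train_vecs = [tfidf_vector(d, idf) for d in train_docs]
prediction = knn_predict("bank cuts rates as markets slide",
                         train_vecs, train_labels, idf)
print(prediction)
```

Terms that occur in fewer documents receive a larger idf weight, so a rare, topical word such as "bank" counts for more than one shared across several documents. A word2vec pipeline would replace `tfidf_vector` with dense vectors learned from word co-occurrence, which this sketch does not attempt.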



Author information

Corresponding author

Correspondence to Monika Arora.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Arora, M., Mittal, V., Aggarwal, P. (2021). Enactment of tf-idf and word2vec on Text Categorization. In: Abraham, A., Castillo, O., Virmani, D. (eds) Proceedings of 3rd International Conference on Computing Informatics and Networks. Lecture Notes in Networks and Systems, vol 167. Springer, Singapore. https://doi.org/10.1007/978-981-15-9712-1_17

  • DOI: https://doi.org/10.1007/978-981-15-9712-1_17

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-9711-4

  • Online ISBN: 978-981-15-9712-1

  • eBook Packages: Engineering, Engineering (R0)
