Empirical Study to Evaluate the Performance of Classification Algorithms on Public Datasets

  • Conference paper

Part of the book series: Lecture Notes in Electrical Engineering ((LNEE,volume 545))

Abstract

A huge amount of data is now stored as electronic documents on the World Wide Web. Text classification algorithms are widely used to assign such documents to a fixed number of predefined classes. These algorithms differ in their scope of applicability and in their performance, so finding an appropriate algorithm for a given dataset has become an important concern for researchers who need to solve practical problems quickly. This paper presents an experimental comparison of five widely used text classification algorithms: decision tree (C5.0), support vector machine, K-nearest neighbor, Naïve Bayes, and neural network. Each classifier is built with both TF and TF-IDF feature selection methods and evaluated on four public datasets, namely 20news-bydate, ohsumed-first-20000-docs, Reuters 21578-Apte-90 Cat, and 20 Newsgroup. The experimental results are examined from multiple perspectives and summarized to indicate how useful each algorithm is on each dataset.
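To make the distinction between the TF and TF-IDF representations concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the abstract does not state which weighting formulas were used, so the raw-count TF and the unsmoothed `log(N / df)` inverse document frequency below are assumptions, as are the function names and the toy corpus.

```python
import math
from collections import Counter


def term_frequencies(docs):
    """Return one Counter of raw term counts per tokenized document."""
    return [Counter(doc) for doc in docs]


def tf_idf(docs):
    """Weight each raw term count by log(N / df).

    N is the number of documents and df is the number of documents that
    contain the term. This unsmoothed variant is an assumption; other
    TF-IDF variants add smoothing or normalization.
    """
    tf = term_frequencies(docs)
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once, so this counts
    # the number of documents in which each term occurs.
    df = Counter(term for counts in tf for term in counts)
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in counts.items()}
        for counts in tf
    ]


# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the game went to overtime".split(),
    "the patient received the treatment".split(),
    "the market closed higher".split(),
]

tf = term_frequencies(docs)
weights = tf_idf(docs)
```

Under these assumptions, a term such as "the" that occurs in every document keeps a high TF count but receives a TF-IDF weight of zero, while a term confined to one document keeps a positive weight. Either representation can then be passed as a feature vector to any of the five classifiers.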

Author information

Corresponding author

Correspondence to S. M. Bramesh.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Bramesh, S.M., Anil Kumar, K.M. (2019). Empirical Study to Evaluate the Performance of Classification Algorithms on Public Datasets. In: Sridhar, V., Padma, M., Rao, K. (eds) Emerging Research in Electronics, Computer Science and Technology. Lecture Notes in Electrical Engineering, vol 545. Springer, Singapore. https://doi.org/10.1007/978-981-13-5802-9_41

  • DOI: https://doi.org/10.1007/978-981-13-5802-9_41

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-5801-2

  • Online ISBN: 978-981-13-5802-9

  • eBook Packages: Engineering, Engineering (R0)
