A Comparative Analysis of Logistic Regression, Random Forest and KNN Models for the Text Classification

  • Original Paper
Augmented Human Research

Abstract

Textual documents are now generated in huge volumes, and there is an urgent need to organize them into a proper structure so that they can be classified and their categories properly defined. Text classification is the key technology for gaining insight into textual information and organizing it: each document is assigned to a class by determining the type of its content. Based on the machine learning algorithms used in this paper, the text classification system is divided into four stages, namely text pre-treatment, text representation, implementation of the classifier, and classification. In this paper, a BBC news text classification system is designed. In the classifier implementation stage, the authors separately chose logistic regression, random forest and K-nearest neighbour as classification algorithms. These classifiers were then tested, analysed and compared with each other to reach a conclusion. The experiments show that the BBC news text classification model achieves satisfactory results with the algorithms tested on the data set. The comparison is based on five parameters, namely precision, accuracy, F1-score, support and confusion matrix. The classifier that scores highest across these parameters is deemed the best machine learning algorithm for the BBC news data set.
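
The five comparison parameters named above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the three category labels and the toy predictions are assumptions made for illustration only, and recall is computed solely because the F1-score is defined from precision and recall.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix: rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_report(y_true, y_pred, labels):
    """Precision, recall, F1-score and support for each class."""
    cm = confusion_matrix(y_true, y_pred, labels)
    report = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]                          # correctly predicted as this class
        predicted = sum(row[i] for row in cm)  # column total: all predictions of this class
        support = sum(cm[i])                   # row total: true documents of this class
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return report


# Hypothetical toy data (not results from the paper).
labels = ["business", "sport", "tech"]
y_true = ["business", "business", "sport", "sport", "sport", "tech"]
y_pred = ["business", "sport", "sport", "sport", "tech", "tech"]

cm = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(y_true, y_pred)
report = per_class_report(y_true, y_pred, labels)

for label in labels:
    print(label, report[label])
print("accuracy:", round(acc, 3))
```

Note that support (the number of true documents per class) and the confusion matrix are descriptive rather than scores to be maximised, so in a sketch like this a "highest" classifier would be chosen on precision, accuracy and F1-score, with the other two used to interpret where errors occur.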

Availability of Data and Material

All relevant data and material are presented in the main paper.

Acknowledgements

The authors are grateful to Indus University and the School of Technology, Pandit Deendayal Petroleum University, for permission to publish this research.

Funding

Not Applicable.

Author information

Contributions

All authors made substantial contributions to this manuscript. KS, HP, DS and MS participated in drafting the manuscript. KS, HP and DS wrote the main manuscript; all authors discussed the results and their implications for the manuscript at all stages.

Corresponding author

Correspondence to Manan Shah.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Shah, K., Patel, H., Sanghvi, D. et al. A Comparative Analysis of Logistic Regression, Random Forest and KNN Models for the Text Classification. Augment Hum Res 5, 12 (2020). https://doi.org/10.1007/s41133-020-00032-0

  • DOI: https://doi.org/10.1007/s41133-020-00032-0
