A Comparative Analysis of Logistic Regression, Random Forest and KNN Models for the Text Classification

Shah, Kanish; Patel, Henil; Sanghvi, Devanshi; Shah, Manan

doi:10.1007/s41133-020-00032-0

Original Paper
Published: 05 March 2020

Volume 5, article number 12 (2020)

Augmented Human Research

Abstract

A huge volume of textual documents is now being generated, and there is an urgent need to organize them in a proper structure so that they can be classified and their categories properly defined. Text classification is the key technology for gaining insight into textual information and organizing it. Documents are assigned to classes by determining the text type of their content. Based on the machine learning algorithms used in this paper, the text classification system is divided into four stages: text pre-treatment, text representation, implementation of the classifier, and classification. In this paper, a BBC news text classification system is designed. In the classifier implementation stage, the authors chose logistic regression, random forest and K-nearest neighbour as the classification algorithms. These classifiers were then tested, analysed and compared with each other to reach a conclusion. The experimental results show that the BBC news text classification model achieves satisfactory results with the algorithms tested on the data set. The comparison is based on five parameters: precision, accuracy, F1-score, support and confusion matrix. The classifier that scores highest across these parameters is termed the best machine learning algorithm for the BBC news data set.
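The five comparison parameters named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the function names, the three category labels and the toy predictions are assumptions made purely for illustration, and recall is computed only because the F1-score is defined as the harmonic mean of precision and recall.

```python
# Illustrative sketch only: toy labels and predictions, not results from the paper.

def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix whose rows are true classes and columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_report(y_true, y_pred, labels):
    """Compute precision, recall, F1-score and support for each class."""
    matrix = confusion_matrix(y_true, y_pred, labels)
    report = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        support = sum(matrix[i])                    # documents truly in this class
        predicted = sum(row[i] for row in matrix)   # documents assigned to this class
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return report


# Hypothetical categories and predictions for six documents.
labels = ["business", "sport", "tech"]
y_true = ["business", "business", "sport", "sport", "sport", "tech"]
y_pred = ["business", "sport", "sport", "sport", "tech", "tech"]

matrix = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(y_true, y_pred)
report = per_class_report(y_true, y_pred, labels)

print(matrix)
print(f"accuracy = {acc:.3f}")
for label in labels:
    print(label, report[label])
```

Running the same functions on the predictions of each classifier (logistic regression, random forest, K-nearest neighbour) would give one report per model, which is the kind of side-by-side comparison the abstract describes. Note that support is a count of true instances per class rather than a quality score, so it is the same for every classifier evaluated on the same test set.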

Availability of Data and Material

All relevant data and material are presented in the main paper.

Acknowledgements

The authors are grateful to Indus University and School of Technology, Pandit Deendayal Petroleum University for the permission to publish this research.

Funding

Not Applicable.

Author information

Authors and Affiliations

Department of Computer Engineering, Indus University, Ahmedabad, Gujarat, India
Kanish Shah, Henil Patel & Devanshi Sanghvi
Department of Chemical Engineering, School of Technology, Pandit Deendayal Petroleum University, Gandhinagar, Gujarat, India
Manan Shah

Contributions

All authors made substantial contributions to this manuscript. KS, HP, DS and MS participated in drafting the manuscript. KS, HP and DS wrote the main manuscript; all authors discussed the results and their implications at all stages.

Corresponding author

Correspondence to Manan Shah.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Shah, K., Patel, H., Sanghvi, D. et al. A Comparative Analysis of Logistic Regression, Random Forest and KNN Models for the Text Classification. Augment Hum Res 5, 12 (2020). https://doi.org/10.1007/s41133-020-00032-0

Received: 27 August 2019
Revised: 06 February 2020
Accepted: 13 February 2020
Published: 05 March 2020
DOI: https://doi.org/10.1007/s41133-020-00032-0
