Abstract
Textual documents are now generated in huge volumes, and there is an urgent need to organize them so that they can be classified into well-defined categories. Text classification is the key technology for gaining insight into textual information and organizing it. Documents are assigned to classes by determining the type of their content. The text classification system in this paper, built on several machine learning algorithms, is divided into four sections: text pre-treatment, text representation, implementation of the classifier and classification. A BBC news text classification system is designed. In the classifier implementation section, the authors separately chose logistic regression, random forest and K-nearest neighbour as classification algorithms. These classifiers were then tested, analysed and compared with each other to reach a conclusion. The experiments show that the BBC news text classification model achieves satisfactory results with the algorithms tested on the data set. The comparison is based on five parameters: precision, accuracy, F1-score, support and confusion matrix. The classifier that scores highest across these parameters is deemed the best machine learning algorithm for the BBC news data set.
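The four sections named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus below is hypothetical and stands in for the BBC news data, TF-IDF is assumed as the text representation, and only a K-nearest neighbour classifier with cosine similarity is shown. All function names are illustrative. The evaluation step computes accuracy, a confusion matrix and per-class precision, F1-score and support (recall is included because F1 depends on it).

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for the BBC news data (not the real data set).
TRAIN = [
    ("the team won the football match", "sport"),
    ("the striker scored a late goal", "sport"),
    ("shares fell as the bank cut profit forecasts", "business"),
    ("the company reported strong quarterly profit", "business"),
]
TEST = [
    ("the goal decided the match", "sport"),
    ("bank shares and profit rose", "business"),
]

def preprocess(text):
    """Pre-treatment: lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def fit_idf(token_lists):
    """Representation, step 1: smoothed inverse document frequency per term."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(tokens, idf):
    """Representation, step 2: L2-normalised TF-IDF vector as a sparse dict."""
    tf = Counter(t for t in tokens if t in idf)  # unseen terms are dropped
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def knn_predict(vec, train_vecs, train_labels, k=1):
    """Classifier: majority vote among the k most similar training documents."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(vec, pair[0]), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def evaluate(y_true, y_pred):
    """Accuracy, confusion matrix, and per-class precision, recall, F1, support."""
    labels = sorted(set(y_true) | set(y_pred))
    # confusion[true_label][predicted_label] = count
    confusion = {t: {p: 0 for p in labels} for t in labels}
    for t, p in zip(y_true, y_pred):
        confusion[t][p] += 1
    report = {}
    for c in labels:
        tp = confusion[c][c]
        predicted = sum(confusion[t][c] for t in labels)
        support = sum(confusion[c].values())
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": support}
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, confusion, report

# Run the four stages end to end on the toy data.
train_tokens = [preprocess(text) for text, _ in TRAIN]
train_labels = [label for _, label in TRAIN]
idf = fit_idf(train_tokens)
train_vecs = [tfidf(tokens, idf) for tokens in train_tokens]

y_true = [label for _, label in TEST]
y_pred = [knn_predict(tfidf(preprocess(text), idf), train_vecs, train_labels)
          for text, _ in TEST]
accuracy, confusion, report = evaluate(y_true, y_pred)
```

In a comparison like the one described, logistic regression or random forest would take the place of `knn_predict`, while the pre-treatment, representation and evaluation steps stay the same, so the classifiers are scored on identical inputs.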
Availability of Data and Material
All relevant data and material are presented in the main paper.
Acknowledgements
The authors are grateful to Indus University and School of Technology, Pandit Deendayal Petroleum University for the permission to publish this research.
Funding
Not Applicable.
Author information
Contributions
All the authors made substantial contributions to this manuscript. KS, HP, DS and MS participated in drafting the manuscript. KS, HP and DS wrote the main manuscript; all the authors discussed the results and their implications for the manuscript at all stages.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Shah, K., Patel, H., Sanghvi, D. et al. A Comparative Analysis of Logistic Regression, Random Forest and KNN Models for the Text Classification. Augment Hum Res 5, 12 (2020). https://doi.org/10.1007/s41133-020-00032-0
DOI: https://doi.org/10.1007/s41133-020-00032-0