Abstract
In supervised machine learning, achieving good classification performance on the minority classes of an imbalanced dataset is a major challenge. In such situations, a model tends to be biased toward the majority classes, which results in poor performance on the remaining classes. This paper examines how the Synthetic Minority Over-Sampling Technique (SMOTE) helps in multinomial text classification on an imbalanced dataset. The performance of SMOTE was examined with the Naive Bayes (NB) and Extreme Gradient Boosting (XGBoost) algorithms. The dataset consisted of 701 questions about lifestyle problems, collected from college students residing in Mumbai. Results showed that XGBoost with SMOTE (XGBoost + SMOTE) worked better on the imbalanced dataset than NB with SMOTE (NB + SMOTE), NB without SMOTE (NB-SMOTE), and XGBoost without SMOTE (XGBoost-SMOTE). The average classification accuracy for Naive Bayes (with and without SMOTE) was 68.0%, while the average accuracy for XGBoost was 71.0%. When choosing between XGBoost and NB, researchers can opt for XGBoost with SMOTE when working with a multinomial imbalanced dataset.
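As a rough illustration of the oversampling idea referred to in the abstract, the sketch below generates synthetic minority-class points by interpolating between a minority sample and one of its nearest minority-class neighbours. It is a minimal sketch of the core SMOTE step only, not the authors' pipeline: the function name `smote_oversample`, its parameters, and the toy two-dimensional vectors (standing in for numeric text features such as term weights) are all assumptions made for illustration.

```python
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority-class neighbours.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest first
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sq_dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Random point on the segment between base and neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy feature vectors for one minority class (assumed values)
minority = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9], [0.3, 1.0]]
new_points = smote_oversample(minority, n_new=6, k=2)
```

Because each synthetic point lies on a line segment between two real minority samples, the new points stay inside the region already occupied by that class. In practice the oversampling would typically be applied to the training split only, before fitting a classifier such as NB or XGBoost.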
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chaturvedi, A., Yadav, S., Ansari, M.A.M.H., Kanojia, M. (2022). Comparative Multinomial Text Classification Analysis of Naïve Bayes and XGBoost with SMOTE on Imbalanced Dataset. In: Das, A.K., Nayak, J., Naik, B., Dutta, S., Pelusi, D. (eds) Computational Intelligence in Pattern Recognition. Advances in Intelligent Systems and Computing, vol 1349. Springer, Singapore. https://doi.org/10.1007/978-981-16-2543-5_29
DOI: https://doi.org/10.1007/978-981-16-2543-5_29
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2542-8
Online ISBN: 978-981-16-2543-5
eBook Packages: Intelligent Technologies and Robotics (R0)