
Comparative Multinomial Text Classification Analysis of Naïve Bayes and XGBoost with SMOTE on Imbalanced Dataset

  • Conference paper
Computational Intelligence in Pattern Recognition

Abstract

In supervised machine learning with an imbalanced dataset, achieving good classification performance on minority classes is a major challenge. In such situations, a machine learning model becomes biased toward the majority classes, which results in poor performance on the remaining classes. This paper examines how the Synthetic Minority Over-Sampling Technique (SMOTE) helps in multinomial text classification on an imbalanced dataset. The performance of SMOTE was examined with the Naive Bayes (NB) and Extreme Gradient Boosting (XGBoost) algorithms. A total of 701 questions related to lifestyle problems were collected from college students residing in Mumbai. Results showed that XGBoost with SMOTE (XGBoost + SMOTE) performed better on the imbalanced dataset than NB with SMOTE (NB + SMOTE), NB without SMOTE (NB - SMOTE), and XGBoost without SMOTE (XGBoost - SMOTE). The average classification accuracy for Naive Bayes (with and without SMOTE) was 68.0%, while the average accuracy for XGBoost was 71.0%. When choosing between XGBoost and NB, researchers can opt for XGBoost with SMOTE to work on a multinomial imbalanced dataset.
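
The comparison described in the abstract can be reproduced in outline with scikit-learn, imbalanced-learn, and xgboost. The sketch below is illustrative only: the vectorizer settings, train/test split, and hyperparameters are assumptions rather than the authors' reported configuration, and the `texts`/`labels` inputs are hypothetical placeholders for the 701 collected questions.

```python
# Minimal sketch of the four compared setups (NB/XGBoost, with/without SMOTE).
# Assumes scikit-learn, imbalanced-learn, and xgboost are installed; all
# parameter choices below are illustrative, not the paper's configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier


def compare_models(texts, labels):
    # texts: list of question strings; labels: integer-coded problem classes
    X = TfidfVectorizer().fit_transform(texts)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=42
    )

    # Oversample only the training split; very small minority classes may
    # require lowering SMOTE's k_neighbors (default is 5).
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

    candidates = {
        "NB + SMOTE": (MultinomialNB(), X_res, y_res),
        "NB - SMOTE": (MultinomialNB(), X_tr, y_tr),
        "XGBoost + SMOTE": (XGBClassifier(eval_metric="mlogloss"), X_res, y_res),
        "XGBoost - SMOTE": (XGBClassifier(eval_metric="mlogloss"), X_tr, y_tr),
    }
    for name, (model, X_fit, y_fit) in candidates.items():
        model.fit(X_fit, y_fit)
        print(name, accuracy_score(y_te, model.predict(X_te)))
```

Note that SMOTE is applied only to the training split, so synthetic samples never leak into the held-out evaluation data, which is a standard precaution when oversampling.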




Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Chaturvedi, A., Yadav, S., Ansari, M.A.M.H., Kanojia, M. (2022). Comparative Multinomial Text Classification Analysis of Naïve Bayes and XGBoost with SMOTE on Imbalanced Dataset. In: Das, A.K., Nayak, J., Naik, B., Dutta, S., Pelusi, D. (eds) Computational Intelligence in Pattern Recognition. Advances in Intelligent Systems and Computing, vol 1349. Springer, Singapore. https://doi.org/10.1007/978-981-16-2543-5_29

