Improving Multi-class Text Classification Using Balancing Techniques

  • Conference paper
Artificial Intelligence: Theories and Applications (ICAITA 2022)

Abstract

Social media platforms and micro-blogging websites have grown in popularity in recent years. People use these platforms to express their thoughts and feelings about items, people, and events, producing a massive amount of textual data that should be exploited. Sentiment analysis is one of the tools used to take advantage of this text data: it classifies text into classes such as positive, negative, and neutral, or into a number of star-rating classes. It has been investigated by many researchers in several languages. Deep learning approaches such as CNN, RNN, and LSTM applied to balanced datasets have given more efficient results than classical machine learning approaches such as SVM, NB, and LR. Furthermore, the emergence of BERT has revolutionized the text classification field, including sentiment analysis tasks. The main problem is that in datasets collected from social media platforms, certain classes dominate others; that is, the datasets are imbalanced. As a result, classifiers lose efficiency. This paper addresses this issue by introducing an ensemble of mathematical balancing techniques to increase the efficiency of sentiment analysis models based on the BERT scheme. The obtained results are significant: our two main metrics, AVG-Recall and F1-PN, are 17% and 19% higher, respectively, than those of the same classifiers applied to the imbalanced dataset.
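
The abstract names neither the individual balancing techniques nor the exact definitions of its two metrics, so the following is only a minimal sketch under stated assumptions. It uses random oversampling as a stand-in for one simple balancing technique, which is not necessarily among those in the paper's ensemble. It also assumes that AVG-Recall is the unweighted mean of per-class recall and that F1-PN is the mean of the F1 scores of the positive and negative classes. All function names and the label strings `"positive"` and `"negative"` are hypothetical.

```python
import random
from collections import defaultdict


def random_oversample(texts, labels, seed=0):
    """Illustrative balancing step (an assumption, not the paper's method):
    duplicate minority-class samples at random until every class has as
    many samples as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def per_class_scores(y_true, y_pred, label):
    """Return (recall, F1) for one class treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


def avg_recall(y_true, y_pred):
    """Assumed definition: unweighted mean of recall over the true classes."""
    classes = sorted(set(y_true))
    recalls = [per_class_scores(y_true, y_pred, c)[0] for c in classes]
    return sum(recalls) / len(recalls)


def f1_pn(y_true, y_pred, positive="positive", negative="negative"):
    """Assumed definition: mean of the positive-class and negative-class F1."""
    f1_pos = per_class_scores(y_true, y_pred, positive)[1]
    f1_neg = per_class_scores(y_true, y_pred, negative)[1]
    return (f1_pos + f1_neg) / 2
```

In such a setup, oversampling would be applied to the training split only, and both metrics would be computed on an untouched test split, so that duplicated samples cannot inflate the scores. Because both metrics average over classes without weighting by class size, a classifier that ignores a minority class is penalized even when its overall accuracy looks high.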



Corresponding author

Correspondence to Laouni Mahmoudi.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mahmoudi, L., Salem, M. (2023). Improving Multi-class Text Classification Using Balancing Techniques. In: Salem, M., Merelo, J.J., Siarry, P., Bachir Bouiadjra, R., Debakla, M., Debbat, F. (eds) Artificial Intelligence: Theories and Applications. ICAITA 2022. Communications in Computer and Information Science, vol 1769. Springer, Cham. https://doi.org/10.1007/978-3-031-28540-0_21

  • DOI: https://doi.org/10.1007/978-3-031-28540-0_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28539-4

  • Online ISBN: 978-3-031-28540-0

  • eBook Packages: Computer Science, Computer Science (R0)
