Abstract
Society needs systems that detect hateful and offensive content in order to build a healthy and safe environment. However, current research in this field still has four major shortcomings: deficient pre-processing techniques, indifference to data imbalance, models with modest performance, and a lack of practical applications. This paper develops an intelligent system that addresses these shortcomings. First, we propose an efficient pre-processing technique to clean comments collected from Vietnamese social media. Second, we propose a novel hate speech detection (HSD) model for Vietnamese that combines a pre-trained PhoBERT model with a Text-CNN model. Third, we apply easy data augmentation (EDA) techniques to handle imbalanced data and improve the performance of the classification models. In addition, we conducted various baseline experiments to compare the proposed model against state-of-the-art (SOTA) methods. The experimental results show that the proposed PhoBERT-CNN model outperforms SOTA methods, achieving F1-scores of 67.46% and 98.45% on two benchmark datasets, ViHSD and HSD-VLSP, respectively. Finally, we built a streaming HSD application to demonstrate the practicality of the proposed system.
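To make the data-imbalance step more concrete, the sketch below shows two of the four EDA operations (random swap and random deletion) applied only to a minority class. It is an illustration under stated assumptions, not the authors' implementation: the function names, the whitespace tokenization, the deletion probability of 0.1, and the number of copies per sample are all hypothetical choices, and the abstract does not say which EDA operations or settings were used.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random pairs of positions exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment_minority(samples, minority_label, copies, seed=0):
    """Append `copies` perturbed variants of every minority-class sample.

    `samples` is a list of (text, label) pairs. Tokens are split on
    whitespace, which is an assumption made for this sketch only.
    """
    rng = random.Random(seed)
    augmented = list(samples)
    for text, label in samples:
        if label != minority_label:
            continue
        tokens = text.split()
        for k in range(copies):
            # Alternate between the two operations so both are exercised.
            if k % 2 == 0:
                new_tokens = random_swap(tokens, 1, rng)
            else:
                new_tokens = random_deletion(tokens, 0.1, rng)
            augmented.append((" ".join(new_tokens), label))
    return augmented
```

Augmenting only the under-represented labels increases their share of the training set without touching the majority class, which is the usual motivation for using text augmentation against class imbalance.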
Notes
The examples in this article are given to demonstrate the seriousness of the hate speech problem. They are based on actual online data and do not reflect the authors’ opinions.
Acknowledgments
This research was supported by The VNUHCM-University of Information Technology’s Scientific Research Support Fund.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Quoc Tran, K., Trong Nguyen, A., Hoang, P.G. et al. Vietnamese hate and offensive detection using PhoBERT-CNN and social media streaming data. Neural Comput & Applic 35, 573–594 (2023). https://doi.org/10.1007/s00521-022-07745-w
DOI: https://doi.org/10.1007/s00521-022-07745-w