Skip to main content
Log in

A comparison of text preprocessing techniques for hate and offensive speech detection in Twitter

  • Original Article
  • Published:
Social Network Analysis and Mining Aims and scope Submit manuscript

Abstract

Preprocessing is a crucial step for each task related to text classification. Preprocessing can have a significant impact on classification performance, but at present there are few large-scale studies evaluating the effectiveness of preprocessing techniques and their combinations. In this work, we explore the impact of 26 widely used text preprocessing techniques on the performance of hate and offensive speech detection algorithms. We evaluate six common machine learning models, such as logistic regression, random forest, linear support vector classifier, convolutional neural network, bidirectional encoder representations from transformers (BERT), and RoBERTa, on four common Twitter benchmarks. Our results show that some preprocessing techniques are useful for improving the accuracy of models while others may even cause a loss of efficiency. In addition, the effectiveness of preprocessing techniques varies depending on the chosen dataset and the classification method. We also explore two ways to combine the techniques that have proved effective during a separate evaluation. Our results show that combining techniques can produce different results. In our experiments, combining techniques works better for traditional machine learning methods than for other methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2

Similar content being viewed by others

Notes

  1. https://github.com/s/preprocessor.

  2. https://github.com/savoirfairelinux/num2words.

  3. https://github.com/carpedm20/emoji.

  4. https://github.com/barrust/pyspellchecker.

  5. https://github.com/snguyenthanh/better_profanity.

  6. https://github.com/matchado/HashTagSplitter.

  7. https://huggingface.co/bert-base-uncased.

  8. https://huggingface.co/roberta-base.

References

  • Alam S, Yao N (2019) The impact of preprocessing steps on the accuracy of machine learning algorithms in sentiment analysis. Comput Math Org Theory 25:319–335

    Article  Google Scholar 

  • Alfina I, Mulia R, Fanany MI, Ekanata Y (2017) Hate speech detection in the Indonesian language: a dataset and preliminary study. In: 2017 international conference on advanced computer science and information systems (ICACSIS). IEEE, pp 233–238

  • Alonso P, Saini R, Kovacs G (2020) TheNorth at SemEval-2020 task 12: hate speech detection using Roberta. In: Proceedings of the fourteenth workshop on semantic evaluation, pp 2197–2202

  • Alrehili A (2019) Automatic hate speech detection on social media: a brief survey. In: 2019 IEEE/ACS 16th international conference on computer systems and applications (AICCSA). IEEE, pp 1–6

  • Alshalan R, Al-Khalifa H (2020) A deep learning approach for automatic hate speech detection in the Saudi Twittersphere. Appl Sci 10(23):8614

    Article  Google Scholar 

  • Ameer I, Siddiqui MHF, Sidorov G, Gelbukh A (2019) CIC at SemEval-2019 task 5: simple yet very efficient approach to hate speech detection, aggressive behavior detection, and target classification in Twitter. In: Proceedings of the 13th international workshop on semantic evaluation, pp 382–386

  • Angiani G, Ferrari L, Fontanini T, Fornacciari P, Iotti E, Magliani F, Manicardi S (2016) A comparison between preprocessing techniques for sentiment analysis in twitter. In: KDWeb

  • Ashraf N, Rafiq A, Butt S, Shehzad HMF, Sidorov G, Gelbukh AF (2022) Youtube based religious hate speech and extremism detection dataset with machine learning baselines. J Intell Fuzzy Syst 42:4769–4777

    Article  Google Scholar 

  • Badjatiya P, Gupta S, Gupta M, Varma V (2017) Deep learning for hate speech detection in tweets. In: Proceedings of the 26th international conference on world wide web companion, pp 759–760

  • Bai Q, Dan Q, Mu Z, Yang M (2019) A systematic review of emoji: current research and future perspectives. Front Psychol 10:2221

    Article  Google Scholar 

  • Balouchzahi F, Shashirekha H (2020) Las for hasoc-learning approaches for hate speech and offensive content identification. In: FIRE (working notes), pp 145–151

  • Banerjee S, Sarkar M, Agrawal N, Saha P, Das M (2021) Exploring transformer based models to identify hate speech and offensive content in English and Indo-Aryan languages. arXiv preprint arXiv:2111.13974

  • Barbieri F, Camacho-Collados J, Espinosa Anke L, Neves L (2020) TweetEval: unified benchmark and comparative evaluation for tweet classification. In: Findings of the association for computational linguistics: EMNLP 2020, pp. 1644–1650. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.148 . https://aclanthology.org/2020.findings-emnlp.148

  • Baruah A, Barbhuiya F, Dey K (2019) ABARUAH at SemEval-2019 task 5: bi-directional LSTM for hate speech detection. In: Proceedings of the 13th international workshop on semantic evaluation, pp 371–376

  • Basile V, Bosco C, Fersini E, Debora N, Patti V, Pardo FMR, Rosso P, Sanguinetti M (2019) SemEval-2019 task 5: multilingual detection of hate speech against immigrants and women in Twitter. In: 13th international workshop on semantic evaluation. Association for Computational Linguistics, pp 54–63

  • Bhandari A, Shah SB, Thapa S, Naseem U, Nasim M (2023) CrisisHateMM: multimodal analysis of directed and undirected hate speech in text-embedded images from Russia-Ukraine conflict. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) workshops, pp 1993–2002

  • Bird S (2006) NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL 2006 interactive presentation sessions, pp 69–72

  • Bölücü N, Canbay P (2021) Hate speech and offensive content identification with graph convolutional networks. In: Forum for information retrieval evaluation (working notes)(FIRE), CEUR-WS.org, pp 44–51

  • Caselli T, Basile V, Mitrović J, Granitzer M (2021) HateBERT: retraining BERT for abusive language detection in English. In: Proceedings of the 5th workshop on online abuse and harms (WOAH 2021), pp 17–25

  • Caselli T, Basile V, Mitrović J, Kartoziya I, Granitzer M (2020) I feel offended, don’t be abusive! implicit/explicit messages in offensive and abusive language. In: Proceedings of the 12th language resources and evaluation conference, pp 6193–6202

  • Chollet F et al. Keras. https://github.com/fchollet/keras

  • Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave É, Ott M, Zettlemoyer L, Stoyanov V (2020) Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 8440–8451

  • Das AK, Al Asif A, Paul A, Hossain MN (2021) Bangla hate speech detection on social media using attention-based recurrent neural network. J Intell Syst 30(1):578–591

    Google Scholar 

  • Davidson T, Bhattacharya D, Weber I (2019) Racial bias in hate speech and abusive language detection datasets. In: Proceedings of the third workshop on abusive language online, pp 25–35

  • Davidson T, Warmsley D, Macy M, Weber I (2017) Automated hate speech detection and the problem of offensive language. In: Proceedings of the international AAAI conference on web and social media, vol 11, pp 512–515

  • Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, volume 1 (long and short papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186

  • Do HT-T, Huynh HD, Van Nguyen K, Nguyen NL-T, Nguyen AG-T (2019) Hate speech detection on Vietnamese social media text using the bidirectional-LSTM model. In: The sixth international workshop on Vietnamese language and speech processing VLSP 2019

  • Dogru HB, Tilki S, Jamil A, Hameed AA (2021) Deep learning-based classification of news texts using Doc2vec model. In: 2021 1st international conference on artificial intelligence and data analytics (CAIDA). IEEE, pp 91–96

  • Fersini E, Nozza D, Rosso P (2018) Overview of the Evalita 2018 task on automatic misogyny identification (AMI). In: CEUR workshop proceedings. CEUR-WS, vol 2263, pp 1–9

  • Fersini E, Rosso P, Anzovino M (2018) Overview of the task on automatic misogyny identification at IberEval 2018. In: CEUR workshop proceedings. CEUR-WS, vol 2150, pp 214–228

  • Fortuna P, Nunes S (2018) A survey on automatic detection of hate speech in text. ACM Comput Surv CSUR 51(4):1–30

    Google Scholar 

  • Fromknecht J, Palmer A (2020) UNT linguistics at SemEval-2020 task 12: linear SVC with pre-trained word embeddings as document vectors and targeted linguistic features. In: Proceedings of the fourteenth workshop on semantic evaluation, pp 2209–2215

  • Garain A, Basu A (2019) The titans at SemEval-2019 task 5: detection of hate speech against immigrants and women in Twitter. In: Proceedings of the 13th international workshop on semantic evaluation, pp 494–497

  • Garouani M, Chrita H, Kharroubi J (2021) Sentiment analysis of Moroccan tweets using text mining. In: Digital technologies and applications: proceedings of ICDTA 21, Fez, Morocco. Springer, pp 597–608

  • Glazkova A, Kadantsev M, Glazkov M (2021) Fine-tuning of pre-trained transformers for hate, offensive, and profane content detection in English and Marathi. In: FIRE 2021 working notes, pp 52–62

  • Guibon G, Ochs M, Bellot P (2016) From emojis to sentiment analysis. In: WACAI 2016

  • Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

    Article  Google Scholar 

  • Huang X, Xing L, Dernoncourt F, Paul MJ (2020) Multilingual twitter corpus and baselines for evaluating demographic bias in hate speech recognition. In: LREC

  • Hu R, Dorris W, Vishwamitra N, Luo F, Costello M (2020) On the impact of word representation in hate speech and offensive language detection and explanation. In: Proceedings of the tenth ACM conference on data and application security and privacy, pp 171–173

  • Jianqiang Z, Xiaolin G (2017) Comparison research on text pre-processing methods on twitter sentiment analysis. IEEE Access 5:2870–2879

    Article  Google Scholar 

  • Joulin A, Grave E, Bojanowski P, Mikolov T (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759

  • Kadhim AI (2018) An evaluation of preprocessing techniques for text classification. Int J Comput Sci Inf Secur IJCSIS 16(6):22–32

    Google Scholar 

  • Kaibi I, Satori H (2019) A comparative evaluation of word embeddings techniques for twitter sentiment analysis. In: 2019 international conference on wireless technologies, embedded and intelligent systems (WITS). IEEE, pp 1–4

  • Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1746–1751

  • Kirk H, Yin W, Vidgen B, Röttger P (2023) SemEval-2023 task 10: explainable detection of online sexism. In: Proceedings of the 17th international workshop on semantic evaluation (SemEval-2023). Association for Computational Linguistics, Toronto, Canada, pp 2193–2210. https://aclanthology.org/2023.semeval-1.305

  • Kodali P, Bhatnagar A, Ahuja N, Shrivastava M, Kumaraguru P (2022) HashSet—a dataset for hashtag segmentation. arXiv preprint arXiv:2201.06741

  • Krouska A, Troussas C, Virvou M (2016) The effect of preprocessing techniques on twitter sentiment analysis. In: 2016 7th international conference on information, intelligence, systems & applications (IISA). IEEE, pp 1–5

  • Liao W, Zeng B, Liu J, Wei P, Cheng X, Zhang W (2021) Multi-level graph neural network for text sentiment analysis. Comput Electr Eng 92:107096

    Article  Google Scholar 

  • Li M, Liao S, Okpala E, Tong M, Costello M, Cheng L, Hu H, Luo F (2021) COVID-hateBERT: a pre-trained language model for COVID-19 related hate speech detection. In: 2021 20th IEEE international conference on machine learning and applications (ICMLA), pp 233–238. IEEE

  • Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692

  • Loshchilov I, Hutter F (2018) Decoupled weight decay regularization. In: International conference on learning representations

  • Luu ST, Nguyen HP, Van Nguyen K, Nguyen NL-T (2020) Comparison between traditional machine learning models and neural network models for Vietnamese hate speech detection. In: 2020 RIVF international conference on computing and communication technologies (RIVF). IEEE, pp 1–6

  • MacAvaney S, Yao H-R, Yang E, Russell K, Goharian N, Frieder O (2019) Hate speech detection: challenges and solutions. PLoS ONE 14(8):0221152

    Article  Google Scholar 

  • Mandl T, Modha S, Kumar M A, Chakravarthi BR (2020) Overview of the HASOC track at fire 2020: hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German. In: Forum for information retrieval evaluation, pp 29–32

  • Mandl T, Modha S, Majumder P, Patel D, Dave M, Mandlia C, Patel A (2019) Overview of the HASOC track at fire 2019: hate speech and offensive content identification in Indo-European languages. In: Proceedings of the 11th forum for information retrieval evaluation, pp 14–17

  • Menini S, Aprosio AP, Tonelli S (2021) Abuse is contextual, what about NLP? The role of context in abusive language annotation and detection. arXiv preprint arXiv:2103.14916

  • Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781

  • Mishra AK, Saumya S, Kumar A (2020) Iiit_dwd@hasoc 2020: identifying offensive content in Indo-European languages. In: FIRE (working notes), pp 139–144

  • Modha S, Mandl T, Majumder P, Satapara S, Patel T, Madhu H (2022) Overview of the HASOC subtrack at fire 2022: identification of conversational hate-speech in Hindi-English code-mixed and German language. Working notes of FIRE

  • Modha S, Mandl T, Shahi GK, Madhu H, Satapara S, Ranasinghe T, Zampieri M (2021) Overview of the HASOC subtrack at fire 2021: hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech. In: Forum for information retrieval evaluation, pp 1–3

  • Mohammad F (2018) Is preprocessing of text really worth your time for toxic comment classification? In: Proceedings on the international conference on artificial intelligence (ICAI). The Steering Committee of The World Congress in Computer Science, Computer, pp 447–453

  • Montejo-Ráez A, Jiménez-Zafra SM, Garcia-Cumbreras MA, Díaz-Galiano MC (2019) SINAI-DL at SemEval-2019 task 5: recurrent networks and data augmentation by paraphrasing. In: Proceedings of the 13th international workshop on semantic evaluation, pp 480–483

  • Naseem U, Razzak I, Hameed IA (2019) Deep context-aware embedding for abusive and hate speech detection on twitter. Aust J Intell Inf Process Syst 15(3):69–76

    Google Scholar 

  • Naseem U, Razzak I, Eklund PW (2021) A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter. Multimedia Tools Appl 80(28):35239–35266

    Article  Google Scholar 

  • Nobata C, Tetreault J, Thomas A, Mehdad Y, Chang Y (2016) Abusive language detection in online user content. In: Proceedings of the 25th international conference on world wide web, pp 145–153

  • Nugroho K, Noersasongko E, Fanani AZ, Basuki RS (2019) Improving random forest method to detect hatespeech and offensive word. In: 2019 international conference on information and communications technology (ICOIACT). IEEE, pp 514–518

  • Oliveira DN, Merschmann LHDC (2021) Joint evaluation of preprocessing tasks with classifiers for sentiment analysis in Brazilian Portuguese language. Multimedia Tools Appl 80:15391–15412

    Article  Google Scholar 

  • Oriola O, Kotzé E (2020) Evaluating machine learning techniques for detecting offensive and hate speech in South African tweets. IEEE Access 8:21496–21509. https://doi.org/10.1109/ACCESS.2020.2968173

    Article  Google Scholar 

  • Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32

  • Pavlopoulos J, Sorensen J, Laugier L, Androutsopoulos I (2021) SemEval-2021 task 5: toxic spans detection. In: Proceedings of the 15th international workshop on semantic evaluation (SemEval-2021), pp 59–69

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830

    MathSciNet  Google Scholar 

  • Pennington J, Socher R, Manning C.D (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543

  • Plaza-Del-Arco FM, Molina-González MD, Ureña-López LA, Martín-Valdivia MT (2021) A multi-task learning approach to hate speech detection leveraging sentiment analysis. IEEE Access 9:112478–112489

    Article  Google Scholar 

  • Poletto F, Basile V, Sanguinetti M, Bosco C, Patti V (2021) Resources and benchmark corpora for hate speech detection: a systematic review. Lang Resour Eval 55(2):477–523

    Article  Google Scholar 

  • Porter MF (2001) Snowball: a language for stemming algorithms

  • Rajapakse TC (2019) Simple transformers. https://github.com/ThilinaRajapakse/simpletransformers

  • Ramachandran D, Parvathi R (2019) Analysis of twitter specific preprocessing technique for tweets. Procedia Comput Sci 165:245–251

    Article  Google Scholar 

  • Ranasinghe T, Hettiarachchi H (2020) Brums at SemEval-2020 task 12: transformer based multilingual offensive language identification in social media. In: Proceedings of the fourteenth workshop on semantic evaluation, pp 1906–1915

  • Renault T (2020) Sentiment analysis and machine learning in finance: a comparison of methods and models on one million messages. Digit Finance 2(1–2):1–13

    Article  Google Scholar 

  • Reuter J, Pereira-Martins J, Kalita J (2016) Segmenting Twitter hashtags. Int J Nat Lang Comput 5(4):23–36

    Article  Google Scholar 

  • Rogers A, Kovaleva O, Rumshisky A (2020) A primer in BERTology: what we know about how BERT works. Trans Assoc Comput Ling 8:842–866

    Google Scholar 

  • Saeed AM, Ismael AN, Rasul DL, Majeed RS, Rashid TA (2022) Hate speech detection in social media for the Kurdish language. In: Proceedings of the ICR’22 international conference on innovations in computing research. Springer, pp 253–260

  • Saeed NM, Helal NA, Badr NL, Gharib TF (2018) The impact of spam reviews on feature-based sentiment analysis. In: 2018 13th international conference on computer engineering and systems (ICCES). IEEE, pp 633–639

  • Saeed NM, Helal NA, Badr NL, Gharib TF (2020) An enhanced feature-based sentiment analysis approach. Wiley Interdiscip Rev Data Min Knowl Discov 10(2):1347

    Article  Google Scholar 

  • Saeed RM, Rady S, Gharib TF (2021) Optimizing sentiment classification for Arabic opinion texts. Cogn Comput 13(1):164–178

    Article  Google Scholar 

  • Saeed RM, Rady S, Gharib TF (2022) An ensemble approach for spam detection in Arabic opinion texts. J King Saud Univ Comput Inf Sci 34(1):1407–1416

    Google Scholar 

  • Schmidt A, Wiegand M (2019) A survey on hate speech detection using natural language processing. In: Proceedings of the fifth international workshop on natural language processing for social media, April 3, 2017, Valencia, Spain. Association for Computational Linguistics, pp 1–10

  • Silva SC, Ferreira TC, Ramos RMS, Paraboni I (2020) Data driven and psycholinguistics motivated approaches to hate speech detection. Computación y Sistemas 24

  • Štrimaitis R, Stefanovič P, Ramanauskaitė S, Slotkienė A (2021) Financial context news sentiment analysis for the Lithuanian language. Appl Sci 11(10):4443

    Article  Google Scholar 

  • Symeonidis S, Effrosynidis D, Arampatzis A (2018) A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis. Expert Syst Appl 110:298–310

    Article  Google Scholar 

  • Thapa S, Jafri FA, Hürriyetoğlu A, Vargas F, Lee RK-W, Naseem U (2023) Multimodal hate speech event detection—shared task 4, CASE 2023. In: Proceedings of the 6th workshop on challenges and applications of automated extraction of socio-political events from text (CASE)

  • Toraman C, Şahinuç F, Yılmaz EH (2022) Large-scale hate speech detection with cross-domain transfer. arXiv preprint arXiv:2203.01111

  • Wallace E, Wang Y, Li S, Singh S, Gardner M (2019) Do NLP models know numbers? Probing numeracy in embeddings. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 5307–5315

  • Wang B, Ding Y, Liu S, Zhou X (2019) Ynu_wb at HASOC 2019: ordered neurons LSTM with attention for identifying hate speech and offensive language. In: FIRE (working notes), pp 191–198

  • Wang S, Liu J, Ouyang X, Sun Y (2020) Galileo at SemEval-2020 task 12: multi-lingual learning for offensive language identification using pre-trained language models. In: Proceedings of the fourteenth workshop on semantic evaluation, pp 1448–1455

  • Wang D, Liu P, Zheng Y, Qiu X, Huang X-J (2020) Heterogeneous graph neural networks for extractive document summarization. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 6209–6219

  • Wiedemann G, Yimam SM, Biemann C (2020) UHH-LT at SemEval-2020 task 12: fine-tuning of pre-trained transformer networks for offensive language detection. arXiv preprint arXiv:2004.11493

  • Wiegand M, Ruppenhofer J, Kleinbauer T (2019) Detection of abusive language: the problem of biased datasets. In: Proceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, volume 1 (long and short papers), pp 602–608

  • Yin W, Zubiaga A (2021) Towards generalisable hate speech detection: a review on obstacles and solutions. PeerJ Comput Sci 7:598

    Article  Google Scholar 

  • Zampieri M, Malmasi S, Nakov P, Rosenthal S, Farra N, Kumar R (2019) SemEval-2019 task 6: identifying and categorizing offensive language in social media (offenseval). In: Proceedings of the 13th international workshop on semantic evaluation, pp 75–86

  • Zampieri M, Nakov P, Rosenthal S, Atanasova P, Karadzhov G, Mubarak H, Derczynski L, Pitenis Z, Çöltekin Ç (2020) SemEval-2020 task 12: multilingual offensive language identification in social media (offenseval 2020). In: Proceedings of the fourteenth workshop on semantic evaluation, pp 1425–1447

  • Zhang Z, Han X, Liu Z, Jiang X, Sun M, Liu Q (2019) ERNIE: enhanced language representation with informative entities. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 1441–1451

  • Zhou Y, Yang Y, Liu H, Liu X, Savage N (2020) Deep learning based fusion approach for hate speech detection. IEEE Access 8:128923–128929

    Article  Google Scholar 

  • Zhou X, Yong Y, Fan X, Ren G, Song Y, Diao Y, Yang L, Lin H (2021) Hate speech detection based on sentiment knowledge sharing. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pp 7158–7166

Download references

Author information

Authors and Affiliations

Authors

Contributions

The author confirms sole responsibility for the following: study conception and design, data collection, analysis and interpretation of results, and manuscript preparation.

Corresponding author

Correspondence to Anna Glazkova.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Appendices

Appendix A: Lists of emoticons, contractions, and acronyms

  • Converting emoticons to words (17)::D, )),:),:)), etc. \(\rightarrow \) happy; ((,:(,:((, etc. \(\rightarrow \) sad; XD \(\rightarrow \) laugh;:* \(\rightarrow \) kiss.

  • Decontraction (22): won’t \(\rightarrow \) will not; shan’t \(\rightarrow \) shall not; can’t \(\rightarrow \) can not; couldn’t \(\rightarrow \) could not; ’re \(\rightarrow \) are; ’s \(\rightarrow \) is; ’d \(\rightarrow \) would not; ’ll \(\rightarrow \) will not; ’ve \(\rightarrow \) have; ’m \(\rightarrow \) am.

  • Replacing acronyms (23): ?4U \(\rightarrow \) question for you, 2moro \(\rightarrow \) tomorrow, 2nte \(\rightarrow \) tonight, asap \(\rightarrow \) as soon as possible, atm \(\rightarrow \) at the moment, awsm \(\rightarrow \) awesome, b2w \(\rightarrow \) back to work, bff \(\rightarrow \) best friends forever, brb \(\rightarrow \) be right back, btw \(\rightarrow \) by the way, cu \(\rightarrow \) see you, cus \(\rightarrow \) see you soon, deti \(\rightarrow \) don’t even think it, diky \(\rightarrow \) do i know you, g2g \(\rightarrow \) got to go, gr8 \(\rightarrow \) great, idc \(\rightarrow \) i don’t care, idk \(\rightarrow \) i don’t care, lmk \(\rightarrow \) let me know, lol \(\rightarrow \) laughing out loud, lu \(\rightarrow \) love you, ly \(\rightarrow \) love you, myob \(\rightarrow \) mind your own business, omg \(\rightarrow \) oh my god, pls \(\rightarrow \) please, plz \(\rightarrow \) please, rip \(\rightarrow \) rest in peace, rofl \(\rightarrow \) rolling on the floor, sep \(\rightarrow \) someone else’s problem, sup \(\rightarrow \) what’s up, tnx \(\rightarrow \) thanks, ttyl \(\rightarrow \) talk to you later.

Appendix B: The main parameters of LR, RF, and LSVC

See Table 13.

Table 13 Parameters for LR, RF, and LSVC
Table 14 Parameters for grid search
Fig. 3
figure 3

CNN architecture

Appendix C: The architecture of CNN

See Table 14 and Fig. 3.

Appendix D: The values of standard deviation across the folds for separate evaluation of preprocessing techniques

See Tables 15 and 16.

Table 15 Separate evaluation of techniques for logistic regression, random forest, and linear support vector classifier, standard deviation across the folds (%)
Table 16 Separate evaluation of techniques for Convolutional Neural Network, BERT, and RoBERTa (RB), standard deviation across the folds (%)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Glazkova, A. A comparison of text preprocessing techniques for hate and offensive speech detection in Twitter. Soc. Netw. Anal. Min. 13, 155 (2023). https://doi.org/10.1007/s13278-023-01156-y

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s13278-023-01156-y

Keywords

Navigation