
UHated: hate speech detection in Urdu language using transfer learning

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

Social media has become a driving force for social change in the global society. Events that take place in one part of the world can quickly reverberate across the globe due to the vast amount of data generated on these platforms. However, developers of these platforms face numerous challenges in keeping cyberspace as inclusive and healthy as possible. In recent years, there has been an increase in offensive and hate speech on social media. Manual efforts to address this issue have been inadequate due to the vast scope of the problem. Therefore, there is a need for an automated technique that can detect and remove offensive and hateful comments before they can cause harm. In this research, we use transfer learning to utilize pre-trained FastText Urdu word embeddings and multi-lingual BERT embeddings (RoBERTa) for our task. We also develop an Urdu language hate lexicon and use it to create an annotated dataset of 7800 Urdu tweets. Our results show that RoBERTa is able to achieve a macro F1-score of 0.82 on our multi-class classification task, outperforming deep learning and machine learning baseline models.
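The macro F1-score reported above is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code. The function name `macro_f1` and the integer labels 0, 1, 2 are illustrative assumptions, since the abstract does not name the classes.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = set(y_true) | set(y_pred)
    if not labels:
        raise ValueError("at least one example is required")

    # Count true positives, false positives and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the example was not p
            fn[t] += 1  # true class t was missed

    # Per-class F1 = 2*TP / (2*TP + FP + FN); 0.0 if the class has no counts.
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder labels (0, 1, 2) standing in for the task's classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
```

Here the per-class F1 scores are 0.5, 0.8 and 2/3, so the macro average is about 0.656. Because each class contributes equally, this metric penalizes a model that ignores a minority class more than accuracy would.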

Notes

  1. https://www.coe.int/en/web/european-commission-against-racism-and-intolerance/hate-speech-and-violence.

  2. A spreadsheet containing the hate lexicons is available at: shorturl.at/vDO26.

  3. A spreadsheet containing the hate speech dataset is available at: shorturl.at/vDO26.

Author information

Corresponding author

Correspondence to Muhammad Umair Arshad.

About this article

Cite this article

Arshad, M.U., Ali, R., Beg, M.O. et al. UHated: hate speech detection in Urdu language using transfer learning. Lang Resources & Evaluation 57, 713–732 (2023). https://doi.org/10.1007/s10579-023-09642-7

  • DOI: https://doi.org/10.1007/s10579-023-09642-7
