Predicting the type and target of offensive social media posts in Marathi

Zampieri, Marcos; Ranasinghe, Tharindu; Chaudhari, Mrinal; Gaikwad, Saurabh; Krishna, Prajwal; Nene, Mayuresh; Paygude, Shrunali

doi:10.1007/s13278-022-00906-8

Original Article
Published: 09 July 2022

Social Network Analysis and Mining
Volume 12, article number 77 (2022)

Marcos Zampieri¹, Tharindu Ranasinghe², Mrinal Chaudhari¹, Saurabh Gaikwad¹, Prajwal Krishna¹, Mayuresh Nene¹ & Shrunali Paygude¹

Abstract

Offensive language is very common on social media, motivating platforms to invest in strategies that make communities safer. These strategies include developing robust machine learning systems capable of recognizing offensive content online. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English and a few other high-resource languages such as French, German, and Spanish. In this paper, we address this gap by tackling offensive language identification in Marathi, a low-resource Indo-Aryan language spoken in India. We introduce the Marathi Offensive Language Dataset v.2.0, or MOLD 2.0, and present multiple experiments on this dataset. MOLD 2.0 is a much larger version of MOLD, with annotation expanded to levels B (type) and C (target) of the popular OLID taxonomy. MOLD 2.0 is the first hierarchical offensive language dataset compiled for Marathi, thus opening new avenues for research in low-resource Indo-Aryan languages. Finally, we also introduce SeMOLD, a larger dataset annotated following the semi-supervised methods presented in SOLID (Rosenthal et al. 2021).
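The hierarchical annotation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the label names of the original OLID taxonomy (Zampieri et al. 2019): level A is OFF/NOT, level B is TIN/UNT (targeted insult vs. untargeted), and level C is IND/GRP/OTH (individual, group, other). The class and function names below are hypothetical and are not taken from the MOLD 2.0 release, whose file format may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Label sets of the OLID taxonomy (assumed to carry over to MOLD 2.0).
LEVEL_A = {"OFF", "NOT"}         # offensive vs. not offensive
LEVEL_B = {"TIN", "UNT"}         # targeted insult vs. untargeted
LEVEL_C = {"IND", "GRP", "OTH"}  # individual, group, other


@dataclass(frozen=True)
class OlidAnnotation:
    """Hypothetical container for one post's three-level annotation."""
    level_a: str
    level_b: Optional[str] = None
    level_c: Optional[str] = None


def is_consistent(ann: OlidAnnotation) -> bool:
    """Check that lower levels are only filled in when the hierarchy allows it."""
    if ann.level_a not in LEVEL_A:
        return False
    if ann.level_a == "NOT":
        # Non-offensive posts carry no type or target label.
        return ann.level_b is None and ann.level_c is None
    if ann.level_b not in LEVEL_B:
        return False
    if ann.level_b == "UNT":
        # Untargeted offensive posts carry no target label.
        return ann.level_c is None
    # Targeted insults must name a target category.
    return ann.level_c in LEVEL_C
```

Under these assumptions, `OlidAnnotation("OFF", "TIN", "GRP")` is a valid annotation, while `OlidAnnotation("NOT", "TIN", "IND")` is rejected because a non-offensive post cannot have a type or target.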

Notes

Dataset available at https://github.com/tharindudr/MOLD.
Tweepy Python library documentation is available at https://www.tweepy.org/.
Marathi FastText embeddings are available at https://fasttext.cc/docs/en/crawl-vectors.html.
Marathi word embeddings are available at https://www.cfilt.iitb.ac.in/~diptesh/embeddings/.
DeepOffense is available as a pip package at https://pypi.org/project/deepoffense/.

References

Alakrot A, Murray L, Nikolov NS (2018) Towards accurate detection of offensive language in online communication in Arabic. Procedia Comput Sci 142:315–320
Aroyehun ST, Gelbukh A (2018) Aggression detection in social media: using deep neural networks, data augmentation, and pseudo labeling. In: Proceedings of TRAC
Basile V, Bosco C, Fersini E, Nozza D, Patti V, Pardo FMR, Rosso P, Sanguinetti M (2019) SemEval-2019 task 5: multilingual detection of hate speech against immigrants and women in Twitter. In: Proceedings of SemEval
Bassignana E, Basile V, Patti V (2018) Hurtlex: a multilingual lexicon of words to hurt. In: Proceedings of CLiC-it
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:1
Carletta J (1996) Assessing agreement on classification tasks: the kappa statistic. Comput Linguist 22(2):249–254
Chiril P, Benamara Zitoune F, Moriceau V, Coulomb-Gully M, Kumar A (2019) Multilingual and multitarget hate speech detection in tweets. In: Proceedings of TALN
Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave E, Ott M, Zettlemoyer L, Stoyanov V (2019) Unsupervised cross-lingual representation learning at scale. In: Proceedings of ACL
Çöltekin Ç (2020) A corpus of Turkish offensive language on social media. In: Proceedings of LREC
Dadvar M, Trieschnigg D, Ordelman R, de Jong F (2013) Improving cyberbullying detection with user context. In: Proceedings of ECIR
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL
Fišer D, Erjavec T, Ljubešić N (2017) Legal framework, dataset and annotation schema for socially unacceptable on-line discourse practices in Slovene. In: Proceedings of ALW
Fortuna P, da Silva JR, Wanner L, Nunes S, et al (2019) A hierarchically-labeled Portuguese hate speech dataset. In: Proceedings of ALW
Gaikwad SS, Ranasinghe T, Zampieri M, Homan C (2021) Cross-lingual offensive language identification for low resource languages: the case of Marathi. In: Proceedings of RANLP
Ghadery E, Moens M-F (2020) LIIR at SemEval-2020 task 12: a cross-lingual augmentation approach for multilingual offensive language identification. In: Proceedings of SemEval
Goudjil M, Koudil M, Bedda M, Ghoggali N (2018) A novel active learning method using SVM for text classification. Int J Autom Comput 15(3):290–298
Hettiarachchi H, Ranasinghe T (2019) Emoji powered capsule network to detect type and target of offensive posts in social media. In: Proceedings of RANLP
Kakwani D, Kunchukuttan A, Golla S, NC G, Bhattacharyya A, Khapra MM, Kumar P (2020) IndicNLPSuite: monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020
Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of EMNLP
Kumar R, Ojha AK, Malmasi S, Zampieri M (2020) Evaluating aggression identification in social media. In: Proceedings of TRAC
Kumar R, Ojha AK, Malmasi S, Zampieri M (2018) Benchmarking aggression identification in social media. In: Proceedings of TRAC
Kumar S, Kumar S, Kanojia D, Bhattacharyya P (2020) A passage to India: pre-trained word embeddings for Indian languages. In: Proceedings of SLTU
Liu P, Li W, Zou L (2019) NULI at SemEval-2019 task 6: transfer learning for offensive language detection using bidirectional transformers. In: Proceedings of SemEval
Malmasi S, Zampieri M (2017) Detecting hate speech in social media. In: Proceedings of RANLP
Mandl T, Modha S, Majumder P, Patel D, Dave M, Mandlia C, Patel A (2019) Overview of the HASOC track at FIRE 2019: hate speech and offensive content identification in Indo-European languages. In: Proceedings of FIRE
Mandl T, Modha S, Kumar MA, Chakravarthi BR (2020) Overview of the HASOC track at FIRE 2020: hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German. In: Proceedings of FIRE
Modha S, Mandl T, Shahi GK, Madhu H, Satapara S, Ranasinghe T, Zampieri M (2021) Overview of the HASOC Subtrack at FIRE 2021: hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech. In: Proceedings of FIRE
Mubarak H, Rashed A, Darwish K, Samih Y, Abdelali A (2021) Arabic offensive language on Twitter: analysis and experiments. In: Proceedings of WANLP
Pamungkas EW, Patti V (2019) Cross-domain and cross-lingual abusive language detection: a hybrid approach with deep learning and a multilingual lexicon. In: Proceedings of ACL: SRW
Pitenis Z, Zampieri M, Ranasinghe T (2020) Offensive language identification in Greek. In: Proceedings of LREC
Poletto F, Stranisci M, Sanguinetti M, Patti V, Bosco C (2017) Hate speech annotation: analysis of an Italian Twitter corpus. In: Proceedings of CLiC-it
Ranasinghe T, Zampieri M (2021) An evaluation of multilingual offensive language identification methods for the languages of India. Information 12(8):1
Ranasinghe T, Zampieri M (2020) Multilingual offensive language identification with cross-lingual embeddings. In: Proceedings of EMNLP
Ranasinghe T, Zampieri M (2021) Multilingual offensive language identification for low-resource languages. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)
Ranasinghe T, Zampieri M (2021) MUDES: multilingual detection of offensive spans. In: Proceedings of NAACL
Ranasinghe T, Hettiarachchi H (2020) BRUMS at SemEval-2020 task 12: transformer based multilingual offensive language identification in social media. In: Proceedings of SemEval
Ranasinghe T, Sarkar D, Zampieri M, Ororbia A (2021) WLV-RIT at SemEval-2021 task 5: a neural transformer framework for detecting toxic spans. In: Proceedings of SemEval
Ridenhour M, Bagavathi A, Raisi E, Krishnan S (2020) Detecting online hate speech: approaches using weak supervision and network embedding models. arXiv preprint arXiv:2007.12724
Rosenthal S, Atanasova P, Karadzhov G, Zampieri M, Nakov P (2021) SOLID: a large-scale semi-supervised dataset for offensive language identification. In: Findings of ACL
Sarkar D, Zampieri M, Ranasinghe T, Ororbia A (2021) fBERT: a neural transformer for identifying offensive content. In: Findings of the Association for Computational Linguistics: EMNLP 2021, pp 1792–1798
Schwarm SE, Ostendorf M (2005) Reading level assessment using support vector machines and statistical language models. In: Proceedings of ACL
Tulkens S, Hilte L, Lodewyckx E, Verhoeven B, Daelemans W (2016) A dictionary-based approach to racism detection in Dutch social media. In: Proceedings of TA-COS
Wiegand M, Siegel M, Ruppenhofer J (2018) Overview of the GermEval 2018 shared task on the identification of offensive language. In: Proceedings of GermEval
Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, von Platen P, Ma C, Jernite Y, Plu J, Xu C, Le Scao T, Gugger S, Drame M, Lhoest Q, Rush A (2020) Transformers: state-of-the-art natural language processing. In: Proceedings of EMNLP
Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV (2019) XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of NeurIPS
Yao M, Chelmis C, Zois D-S (2019) Cyberbullying ends here: towards robust detection of cyberbullying in social media. In: Proceedings of WWW
Zampieri M, Malmasi S, Nakov P, Rosenthal S, Farra N, Kumar R (2019) Predicting the type and target of offensive posts in social media. In: Proceedings of NAACL
Zampieri M, Nakov P, Rosenthal S, Atanasova P, Karadzhov G, Mubarak H, Derczynski L, Pitenis Z, Çöltekin C (2020) SemEval-2020 Task 12: multilingual offensive language identification in social media (OffensEval 2020). In: Proceedings of SemEval
Zhang J, Chang J, Danescu-Niculescu-Mizil C, Dixon L, Hua Y, Taraborelli D, Thain N (2018) Conversations gone awry: detecting early signs of conversational failure. In: Proceedings of ACL

Author information

Authors and Affiliations

Rochester Institute of Technology, Rochester, NY, USA
Marcos Zampieri, Mrinal Chaudhari, Saurabh Gaikwad, Prajwal Krishna, Mayuresh Nene & Shrunali Paygude
University of Wolverhampton, Wolverhampton, UK
Tharindu Ranasinghe

Corresponding author

Correspondence to Marcos Zampieri.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zampieri, M., Ranasinghe, T., Chaudhari, M. et al. Predicting the type and target of offensive social media posts in Marathi. Soc. Netw. Anal. Min. 12, 77 (2022). https://doi.org/10.1007/s13278-022-00906-8

Received: 09 March 2022
Revised: 09 June 2022
Accepted: 10 June 2022
Published: 09 July 2022
DOI: https://doi.org/10.1007/s13278-022-00906-8
