OLID-BR: offensive language identification dataset for Brazilian Portuguese

Trajano, Douglas; Bordini, Rafael H.; Vieira, Renata

doi:10.1007/s10579-023-09657-0

Original Paper
Published: 03 May 2023

Language Resources and Evaluation (2023)

Douglas Trajano¹,
Rafael H. Bordini¹ &
Renata Vieira²

Abstract

Social media has revolutionized the manner in which our society is interconnected. While this extensive connectivity offers numerous benefits, it is also accompanied by significant drawbacks, particularly in terms of the proliferation of fake news and the vast dissemination of hate speech. Identifying offensive comments is a critical task for ensuring the safety of users, which is why industry and academia have been working on developing solutions to this problem. Prior research on hate speech detection has predominantly focused on the English language, with few studies devoted to other languages such as Portuguese. This paper introduces the Offensive Language Identification Dataset for Brazilian Portuguese (OLID-BR), a high-quality NLP dataset for offensive language detection, which we make publicly available. The dataset contains 6,354 (extendable to 13,538) comments labeled using a fine-grained three-layer annotation schema compatible with datasets in other languages, which allows the training of multilingual/cross-lingual models. The five NLP tasks available in OLID-BR allow the detection of offensive comments, the classification of the types of offenses such as racism, LGBTQphobia, sexism, xenophobia, and so on, the identification of the type and the target of offensive comments, and the extraction of toxic spans of offensive comments. All those tasks can enhance the capabilities of content moderation systems by providing deep contextual analysis or highlighting the spans that make a text toxic. We further experiment with and evaluate the dataset using state-of-the-art BERT-based and NER models, which demonstrates the usefulness of OLID-BR for the development of toxicity detection systems for Portuguese texts.
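To make the layered annotation described in the abstract concrete, the following is a minimal sketch of how one annotated comment might be represented and how toxic spans could be recovered from it. The class name, field names, and label strings are hypothetical and are not taken from the released dataset. The sketch also assumes that toxic spans are stored as character offsets into the comment text, which the abstract does not confirm.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AnnotatedComment:
    """Hypothetical record layout; field names are illustrative only."""
    text: str
    is_offensive: bool                                        # offensive vs. not offensive
    offense_labels: List[str] = field(default_factory=list)   # e.g. "racism", "sexism"
    is_targeted: Optional[bool] = None                        # type of offense (targeted or not)
    target_type: Optional[str] = None                         # who or what is targeted
    toxic_offsets: List[int] = field(default_factory=list)    # assumed: character offsets


def extract_toxic_spans(text: str, offsets: List[int]) -> List[str]:
    """Group consecutive character offsets into contiguous substrings."""
    # Drop duplicates and offsets that fall outside the text.
    valid = sorted({i for i in offsets if 0 <= i < len(text)})
    spans: List[str] = []
    start: Optional[int] = None
    prev: Optional[int] = None
    for i in valid:
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            # A gap closes the current run and opens a new one.
            spans.append(text[start:prev + 1])
            start = prev = i
    if start is not None:
        spans.append(text[start:prev + 1])
    return spans


# Placeholder English example; real records are Brazilian Portuguese comments.
example = AnnotatedComment(
    text="you are a total idiot",
    is_offensive=True,
    offense_labels=["insult"],
    is_targeted=True,
    target_type="individual",
    toxic_offsets=list(range(16, 21)),
)
toxic = extract_toxic_spans(example.text, example.toxic_offsets)
```

A content moderation system could use the recovered substrings to highlight the part of a comment that makes it toxic, as the abstract suggests.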

Availability of data and materials

The dataset created in this work is available on Kaggle (https://www.kaggle.com/dougtrajano/olidbr) and HuggingFace (https://huggingface.co/datasets/dougtrajano/olid-br).
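As a rough sketch of how the HuggingFace copy might be obtained programmatically, the snippet below derives the repository identifier from the URL above and passes it to `load_dataset` from the `datasets` library. It assumes that the `datasets` package is installed, that the repository is publicly readable, and that its default configuration loads without extra arguments. The helper names `hub_id_from_url` and `load_olid_br` are hypothetical.

```python
from urllib.parse import urlparse

# URL exactly as given in the availability statement.
HF_DATASET_URL = "https://huggingface.co/datasets/dougtrajano/olid-br"


def hub_id_from_url(url: str) -> str:
    """Turn a HuggingFace dataset page URL into an 'owner/name' identifier."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a HuggingFace dataset URL: {url}")
    return "/".join(parts[1:])


def load_olid_br():
    """Download the dataset; requires network access and the `datasets` package."""
    # Imported lazily so the helper above works without the dependency.
    from datasets import load_dataset
    return load_dataset(hub_id_from_url(HF_DATASET_URL))


dataset_id = hub_id_from_url(HF_DATASET_URL)
```

Calling `load_olid_br()` should return the available splits. Column names are not listed here, so inspect the returned object before relying on any particular field.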

Code availability

The source code and experiments for this paper are available on the GitHub platform at https://dougtrajano.github.io/olid-br/ and https://dougtrajano.github.io/ToChiquinho/.

Acknowledgements

We gratefully acknowledge the financial support of Uol EdTech, Brazilian National Council for Scientific and Technological Development (CNPq), and Portuguese Foundation for Science and Technology (FCT) under the projects CEECIND/01997/2017, UIDB/00057/2020.

Author information

Authors and Affiliations

School of Technology, Pontifical Catholic University of Rio Grande do Sul - PUCRS, Porto Alegre, Brazil
Douglas Trajano & Rafael H. Bordini
CIDEHUS, University of Evora, Évora, Portugal
Renata Vieira

Contributions

All authors contributed to the conception and design of the work. DT contributed to the data collection, data processing, annotation process, data analysis, and performed the experiments. All authors reviewed the work, contributed to the writing of the manuscript, and approved the final manuscript.

Corresponding author

Correspondence to Douglas Trajano.

Ethics declarations

Conflict of interest

The authors declared no potential conflicts of interest concerning this article’s research, authorship, and publication.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the Research Ethics Committee of the Pontifical Catholic University of Rio Grande do Sul (PUCRS) and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Trajano, D., Bordini, R.H. & Vieira, R. OLID-BR: offensive language identification dataset for Brazilian Portuguese. Lang Resources & Evaluation (2023). https://doi.org/10.1007/s10579-023-09657-0

Accepted: 27 March 2023
Published: 03 May 2023
DOI: https://doi.org/10.1007/s10579-023-09657-0
