OLID-BR: offensive language identification dataset for Brazilian Portuguese

  • Original Paper
Language Resources and Evaluation

Abstract

Social media has revolutionized the way our society is interconnected. While this extensive connectivity offers numerous benefits, it also has significant drawbacks, particularly the proliferation of fake news and the wide dissemination of hate speech. Identifying offensive comments is a critical task for ensuring the safety of users, which is why industry and academia have been working on solutions to this problem. Prior research on hate speech detection has predominantly focused on English, with few studies devoted to other languages such as Portuguese. This paper introduces the Offensive Language Identification Dataset for Brazilian Portuguese (OLID-BR), a high-quality NLP dataset for offensive language detection, which we make publicly available. The dataset contains 6,354 (extendable to 13,538) comments labeled using a fine-grained three-layer annotation schema compatible with datasets in other languages, which allows the training of multilingual/cross-lingual models. The five NLP tasks available in OLID-BR allow the detection of offensive comments, the classification of offense types (such as racism, LGBTQphobia, sexism, and xenophobia), the identification of the type and the target of offensive comments, and the extraction of toxic spans from offensive comments. All of these tasks can enhance the capabilities of content moderation systems by providing deep contextual analysis or highlighting the spans that make a text toxic. We further experiment with and evaluate the dataset using state-of-the-art BERT-based and NER models, which demonstrates the usefulness of OLID-BR for the development of toxicity detection systems for Portuguese texts.
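To make the annotation layers and the toxic-span task more concrete, the following is a minimal sketch of how one annotated comment might be represented. It assumes the three layers follow an OLID-style hierarchy (offensive or not, targeted or untargeted, type of target) and that toxic spans are stored as character offsets; the abstract does not spell out either detail, and every class, field, and function name here is hypothetical rather than the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Comment:
    """One annotated comment. All field names are hypothetical."""
    text: str
    is_offensive: bool                    # assumed layer 1: offensive or not
    is_targeted: Optional[bool] = None    # assumed layer 2: targeted or untargeted
    target_type: Optional[str] = None     # assumed layer 3: who is targeted
    toxicity_labels: List[str] = field(default_factory=list)  # e.g. "racism"
    toxic_offsets: List[int] = field(default_factory=list)    # character indices


def contiguous_spans(offsets: List[int]) -> List[Tuple[int, int]]:
    """Group character offsets into inclusive (start, end) runs."""
    spans: List[Tuple[int, int]] = []
    for i in sorted(set(offsets)):
        if spans and i == spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], i)   # extend the current run
        else:
            spans.append((i, i))            # start a new run
    return spans


def toxic_substrings(comment: Comment) -> List[str]:
    """Return the pieces of text covered by the toxic offsets."""
    return [comment.text[s:e + 1]
            for s, e in contiguous_spans(comment.toxic_offsets)]


def is_consistent(comment: Comment) -> bool:
    """Assumed hierarchy: a lower layer is set only if the one above is positive."""
    if not comment.is_offensive:
        return comment.is_targeted is None and comment.target_type is None
    if not comment.is_targeted:
        return comment.target_type is None
    return comment.target_type is not None


# Made-up English example, not taken from the dataset.
text = "you are a total idiot, honestly"
start = text.index("idiot")
example = Comment(
    text=text,
    is_offensive=True,
    is_targeted=True,
    target_type="individual",
    toxicity_labels=["insult"],
    toxic_offsets=list(range(start, start + len("idiot"))),
)
print(toxic_substrings(example), is_consistent(example))
```

Storing spans as character offsets keeps the annotation independent of any tokenizer, which is convenient when the same data feeds both a BERT-based classifier and an NER-style span extractor.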

Availability of data and materials

The dataset created in this work is available on Kaggle (https://www.kaggle.com/dougtrajano/olidbr) and HuggingFace (https://huggingface.co/datasets/dougtrajano/olid-br).
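As a rough sketch of how the HuggingFace copy could be fetched, the snippet below uses `load_dataset` from the third-party `datasets` library with the repository ID taken from the URL above. The helper name `load_olid_br` is hypothetical, and the split names, column names, and access conditions of the repository are not stated here, so treat them as unknown until inspected.

```python
REPO_ID = "dougtrajano/olid-br"  # repository ID taken from the HuggingFace URL


def load_olid_br(split=None):
    """Download OLID-BR from the HuggingFace Hub (requires network access).

    Hypothetical helper: with split=None it returns every split the
    repository defines; otherwise it returns only the requested split.
    """
    # Imported lazily so this module can be imported even when the
    # third-party `datasets` package is not installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

Printing the object returned by `load_olid_br()` is a reasonable first step, since it shows which splits and columns the repository actually provides. The repository may also require authentication or acceptance of usage terms before the download succeeds.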

Code availability

The source code and experiments for this paper are available on the GitHub platform at https://dougtrajano.github.io/olid-br/ and https://dougtrajano.github.io/ToChiquinho/.

Notes

  1. https://github.com/LaCAfe/Dataset-Hatespeech.

  2. https://github.com/rogersdepelle/OffComBR.

  3. https://developer.twitter.com/.

  4. https://www.perspectiveapi.com/.

  5. https://appen.com/.

  6. https://www.mturk.com/.

  7. https://www.perspectiveapi.com/.

  8. https://www.fecap.br/curta-duracao/comunicacao-nao-violenta/.

  9. https://spacy.io/.

Acknowledgements

We gratefully acknowledge the financial support of Uol EdTech, Brazilian National Council for Scientific and Technological Development (CNPq), and Portuguese Foundation for Science and Technology (FCT) under the projects CEECIND/01997/2017, UIDB/00057/2020.

Author information

Contributions

All authors contributed to the conception and design of the work. DT contributed to the data collection, data processing, annotation process, data analysis, and performed the experiments. All authors reviewed the work, contributed to the writing of the manuscript, and approved the final manuscript.

Corresponding author

Correspondence to Douglas Trajano.

Ethics declarations

Conflict of interest

The authors declared no potential conflicts of interest concerning this article’s research, authorship, and publication.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the Research Ethics Committee of the Pontifical Catholic University of Rio Grande do Sul (PUCRS) and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Trajano, D., Bordini, R.H. & Vieira, R. OLID-BR: offensive language identification dataset for Brazilian Portuguese. Lang Resources & Evaluation (2023). https://doi.org/10.1007/s10579-023-09657-0

  • DOI: https://doi.org/10.1007/s10579-023-09657-0
