OLID-BR: offensive language identification dataset for Brazilian Portuguese

Social media has transformed how society is interconnected. This connectivity offers numerous benefits, but it also brings significant drawbacks, particularly the proliferation of fake news and the wide dissemination of hate speech. Identifying offensive comments is critical for ensuring the safety of users, which is why industry and academia have been working on solutions to this problem. Prior research on hate speech detection has predominantly focused on English, with few studies devoted to other languages such as Portuguese. This paper introduces the Offensive Language Identification Dataset for Brazilian Portuguese (OLID-BR), a high-quality NLP dataset for offensive language detection, which we make publicly available. The dataset contains 6,354 (extendable to 13,538) comments labeled using a fine-grained three-layer annotation schema compatible with datasets in other languages, which allows the training of multilingual/cross-lingual models. The five NLP tasks available in OLID-BR allow the detection of offensive comments; the classification of offense types such as racism, LGBTQphobia, sexism, and xenophobia; the identification of the type and the target of offensive comments; and the extraction of toxic spans from offensive comments. Together, these tasks can enhance content moderation systems by providing deep contextual analysis or by highlighting the spans that make a text toxic. We further evaluate the dataset using state-of-the-art BERT-based and NER models, demonstrating the usefulness of OLID-BR for developing toxicity detection systems for Portuguese texts.

Availability of data and materials

The dataset created in this work is available on Kaggle and HuggingFace.

Code availability

The source code and experiments for this paper are available on GitHub.

We gratefully acknowledge the financial support of Uol EdTech, the Brazilian National Council for Scientific and Technological Development (CNPq), and the Portuguese Foundation for Science and Technology (FCT) under the projects CEECIND/01997/2017 and UIDB/00057/2020.

All authors contributed to the conception and design of the work. DT contributed to the data collection, data processing, annotation process, and data analysis, and performed the experiments. All authors reviewed the work, contributed to the writing of the manuscript, and approved the final manuscript.

Correspondence to Douglas Trajano.

The authors declared no potential conflicts of interest concerning this article’s research, authorship, and publication.

All procedures performed in studies involving human participants were in accordance with the ethical standards of the Research Ethics Committee of the Pontifical Catholic University of Rio Grande do Sul (PUCRS) and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Trajano, D., Bordini, R.H. & Vieira, R. OLID-BR: offensive language identification dataset for Brazilian Portuguese. Lang Resources & Evaluation (2023).

