Domain-adaptive pre-training on a BERT model for the automatic detection of misogynistic tweets in Spanish

Rodríguez, Dalia A.; Diaz-Escobar, Julia; Díaz-Ramírez, Arnoldo; Trujillo, Leonardo

doi:10.1007/s13278-023-01128-2

Original Article
Published: 29 September 2023

Volume 13, article number 126 (2023)

Social Network Analysis and Mining

Dalia A. Rodríguez¹,
Julia Diaz-Escobar¹,
Arnoldo Díaz-Ramírez¹ &
…
Leonardo Trujillo²

Abstract

Violence against women is a major social issue. One in every three women worldwide has been subjected to physical or sexual violence. The pervasive violence against women in the physical world, the ever-growing presence of social media in our lives, and its lack of content moderation have led to an influx of misogynistic social media content. We contribute to preventing violence against women by introducing a BERT architecture with domain-adaptive pre-training to automatically detect misogynistic tweets in Spanish. We used the IberEval 2018 Spanish dataset for automatic misogyny identification, obtaining an accuracy of 84.60%, precision of 79.64%, recall of 86.70%, and F1 score of 83.02%, outperforming the state of the art. We also conducted a manual error analysis and discovered 469 mislabeled tweets and a misogynistic bias in the IberEval 2018 Spanish dataset. Our debiased model outperformed the current literature on automatic misogyny detection with an accuracy of 84.35%, precision of 84.64%, recall of 83.93%, and F1 score of 84.28%. Lastly, we addressed the need for misogyny detection on other social media by experimenting with a manually curated and labeled dataset of Facebook comments in Spanish for automatic misogyny detection, on which the model reached an accuracy of 87.85%. Misogyny is a complex social issue, so an interdisciplinary approach might benefit future models for automatically detecting misogyny.
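
The F1 scores quoted in the abstract can be related to the quoted precision and recall through the standard definition of F1 as their harmonic mean. The short Python sketch below is not taken from the article; the function name `f1_score` and the dictionary labels are illustrative assumptions, and only the numeric values come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in the same units, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, as quoted in the abstract.
# The dictionary keys are illustrative labels, not names used by the authors.
reported = {
    "IberEval 2018 Spanish dataset": (79.64, 86.70, 83.02),
    "debiased model": (84.64, 83.93, 84.28),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.2f}, reported F1 = {f1:.2f}")
```

Under this definition, both reported F1 values agree with the reported precision and recall to two decimal places.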



Author information

Authors and Affiliations

Department of Computer Systems, Tecnológico Nacional de México/IT Mexicali, Av Tecnológico s/n, Mexicali, 21376, Baja California, Mexico
Dalia A. Rodríguez, Julia Diaz-Escobar & Arnoldo Díaz-Ramírez
Department of Electrics and Electronics, Tecnológico Nacional de México/IT Tijuana, Blvd. Industrial s/n, Tijuana, 22430, Baja California, Mexico
Leonardo Trujillo

Contributions

D.R. and A.D-R. conceived the presented idea. D.R. and J.D-E. developed the theory, and D.R. performed the computations under the supervision of J.D-E. and L.T. A.D-R. and J.D-E. verified the analytical methods. A.D-R., J.D-E., and L.T. supervised the findings of this work. D.R. wrote the manuscript with input from all authors. All authors discussed the results and contributed to the final manuscript.

Corresponding author

Correspondence to Arnoldo Díaz-Ramírez.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rodríguez, D.A., Diaz-Escobar, J., Díaz-Ramírez, A. et al. Domain-adaptive pre-training on a BERT model for the automatic detection of misogynistic tweets in Spanish. Soc. Netw. Anal. Min. 13, 126 (2023). https://doi.org/10.1007/s13278-023-01128-2

Received: 31 January 2023
Revised: 13 May 2023
Accepted: 06 September 2023
Published: 29 September 2023
DOI: https://doi.org/10.1007/s13278-023-01128-2
