1 Introduction

It is widely agreed that language models for natural language processing capture reality very accurately; so accurately, in fact, that they also capture undesirable or unfair associations. Bolukbasi et al. (2016) showed how word embeddings capture associations such as a man being a computer programmer while a woman is a home-maker. Caliskan et al. (2017) later showed how data-driven models in artificial intelligence capture all kinds of prejudices and human-like biases. This behavior is not exclusive to word embedding models; it is still present in more recent and more complex models (Garrido-Muñoz et al., 2021). A clear example is the recent GPT-3 model (Abid et al., 2021), which shows a bias against the Muslim religion by associating it with violence in a large number of cases. These associations are also present in widely used pretrained models like BERT (Bender et al., 2021).

These models are embedded in multiple systems and applications, so undesirable associations may be reflected directly or indirectly in their output. We call these types of associations bias and define bias as any prejudice for or against a person, group, or thing. Bias can be reflected in various dimensions such as gender (Bhardwaj et al., 2021; Zhao et al., 2018b), race (Nadeem et al., 2021; Manzini et al., 2019), religion (Babaeianjelodar et al., 2020), ideology (McGuffie & Newhouse, 2020), ethnicity (Groenwold et al., 2020), sexual orientation, age, disability or even appearance (Nangia et al., 2020).

Dealing with bias in language models mostly involves two different tasks: evaluation (measuring how biased a model is) and mitigation (preventing or reducing the bias in a model). Most of the work on bias in language modelling is focused on the English language. In this paper, we present a novel framework for gender bias evaluation and apply it to some of the most popular Spanish pretrained language models, including multilingual ones where Spanish is among the supported languages. Our approach to bias evaluation builds on previous literature, opting for a mechanism that measures differences in the probability distribution of certain words in a masked language task. Our proposal focuses on adjectives as target terms, and measurements are made at a higher level of abstraction by grouping adjectives into semantic classes according to different classification schemes. In order to establish a base case, a set of simple templates has been prepared, each containing a masked position where an adjective should go. For each template, we measure and compare the suggestions that each model generates. By comparing the bias of models towards certain categories of adjectives, the method eases the interpretation of this bias. Our main findings are that the analyzed pretrained models exhibit different levels of bias and that gender bias is mainly focused on body appearance when comparing the adjectives proposed for male versus female subjects.

The remainder of this article is organized as follows. In Section 2 we discuss why bias-free resources are needed, what the impact of bias on society is, and the legislative changes it is prompting. Section 3 reviews previous work on bias evaluation. Section 4 describes the design of our evaluation method. Section 5 presents the results of applying this method to several Spanish models to evaluate their degree of gender bias. Finally, we provide some brief conclusions and outline future work.

2 The need for unbiased models

The presence of bias in a model is a symptom of multiple issues throughout the training process. In the first place, the problem might lie in the data fed to the model. If the data source under-represents one value of a protected attribute (e.g. gender) relative to the others (e.g. male versus female), then model predictions will favour the most represented value, and the underestimation of the minority class will be accentuated (Blanzeisky & Cunningham, 2021).

Unequal representation is particularly problematic when it ends up affecting sensitive decision systems, as happened with a U.S. healthcare algorithm that underestimated the illnesses of black people (Obermeyer et al., 2019). Language models suffer from this problem as well; for example, Ramezanzadehmoghadam et al. (2021) studied the distribution of gender (male/female) and race (Caucasian/African) in the BERT vocabulary against the Labour Market Distribution and found that the model’s vocabulary contains 100% of the studied male and female names but only 33% of the male African names and 11% of the female African names.

Bias is also present in models trained for classification tasks (named entity recognition, sentiment analysis, text classification, and so on), as those models are trained from human annotations that may not adequately represent reality, or whose annotators may manifest their personal biases in the labelling process. Al Kuwatly et al. (2020) explore whether the demographic characteristics of annotators produce biased labelling and find that they do: characteristics such as language proficiency, age range, and educational level affect labelling, while features such as gender do not make a difference in the studied task.

We could also consider biased training methods or even biased source media. It is worth asking questions such as: Does the model behave in the same way across different classes? Is the model able to encode words with unusual language-specific characters? If the model recognizes people, does it operate with the same quality regardless of race or literacy level? We consider questions like these necessary. Bias analysis can go even further and study the full pipeline of processes involved in training the final model. For example, BERT encodes words as tokens, and some words are encoded as a pair of tokens instead of a single token. Does this affect the output in some way? Is the effect the same for the different classes? Any aspect may exhibit a side effect in terms of biased output in a language model, as these components (tokenizers, encoders, decoders...) learn patterns from massive collections of real-world texts.

We have multiple examples of Artificial Intelligence (AI) models that turned out to be biased, such as Amazon’s recruiting tool that penalized women (Dastin, 2018). Apple’s credit card applied an algorithm that set different limits for men and women (Kelion, 2019). Google removed the word gorilla from Google Photos when it was discovered that the system labelled black people with that word (Simonite, 2018). Gary (2019) collected more examples of AI systems showing unfair, unethical, or abusive behavior. Once a biased model is used in production systems, its bias and prejudice will feed into other systems and into society’s perception (Kay et al., 2015).

The proliferation of these non-transparent and non-auditable AI-based models is prompting proposals for changes in European legislation, from setting up an agency for AI monitoring in Spain (Europa Press, 2021) to the ban on systems that exploit vulnerabilities of protected groups due to their age or physical or mental disability (Jane Wakefield, 2021; MacCarthy & Propp, 2021). Legislation is also being passed to ensure the transparency of AI systems by establishing obligations that high-risk AI systems must meet to be considered reliable, among them requirements related to data quality, documentation, traceability, transparency, human oversight, accuracy, and robustness (European Commission, 2021). These rules are complemented by Article 13(2)(f) of the GDPR, which specifies that, in certain cases, in order to ensure transparent and fair data processing, the data controller must provide meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject (European Commission, 2018).

3 Bias in deep learning models

The bias phenomenon in deep language models has been clearly identified (Garrido-Muñoz et al., 2021). The study of bias usually involves two main tasks: the evaluation task, in which the aim is to characterize the bias and make measurements to quantify it; and the mitigation task, in which the objective is to eliminate the bias or mitigate its effect. After mitigation, the same techniques used in the evaluation are used to check whether the measured bias has been reduced or not. Not all works focus on both aspects of the bias problem. In our case, we have focused this work on the evaluation of gender bias in pretrained language models for Spanish based on deep neural networks.

One of the first forms of bias in language models was found in word embeddings. Bolukbasi et al. (2016) highlight how the model captures strong associations between some professions and the male gender and other professions and the female gender. According to the model analysed, a man would be a teacher, a programmer, or a doctor, while a woman would preferably be a housewife, a nurse, or a receptionist. Indeed, in the embedding space, the analogy father\(\rightarrow\) doctor, mother\(\rightarrow\)? is resolved with nurse; for the model there is no such thing as a female doctor. Caliskan et al. (2017) later showed how AI models capture all kinds of prejudices and human-like biases. This work performed the first tests on racial bias, quantifying it by comparing the model’s results for typically African names versus typically European names, confirming that there is indeed bias in the studied model. The authors also question the impact that this could have on NLP applications such as sentiment analysis. Ideally, the outcome of sentiment analysis of the reviews of a film, product, or company should not be affected by the names of its protagonists, workers, or other people involved, but this cannot be ensured given the unequal treatment of proper names.

The study of bias based on measuring associations is continued by Caliskan et al. (2017), who developed WEAT (Word Embedding Association Test) as a mechanism to measure the association between two concepts based on the cosine of their vector representations.

This test measures the association between a set of words and a set of attributes by applying cosine similarity to measure the distance between the vectors representing the embeddings of these words. It does so for a pair of sets of words of equal sizes, such as European American names = {Adam, Harry, Josh, Roger,...} and African American names = {Alonzo, Jamel, Theo, Alphonse,...} with respect to two attribute sets to which the association is to be studied. For example, to measure whether there is a positive or negative association of names with respect to their origin, it uses the attribute sets Pleasant = { caress, freedom, health, love, peace, cheer,...} and Unpleasant = {abuse, crash, filth, murder, sickness}.
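For illustration, this association test can be sketched in a few lines of code. The following is a minimal, illustrative implementation of a WEAT-style effect size, assuming word vectors are available through a dictionary-like object `emb`; the word lists are abbreviated from the example above and the snippet is not the original authors' code.

```python
# Minimal, illustrative WEAT-style effect size (not the original implementation).
# `emb` is assumed to map each word to a NumPy vector from some embedding model.
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(word, A, B, emb):
    """s(w, A, B): mean similarity of `word` to attribute set A minus to set B."""
    return (np.mean([cosine(emb[word], emb[a]) for a in A])
            - np.mean([cosine(emb[word], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    """Effect size of the differential association of target sets X, Y with A, B."""
    sx = [association(x, A, B, emb) for x in X]
    sy = [association(y, A, B, emb) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

# Abbreviated sets from the example above:
# weat_effect_size(["Adam", "Harry", "Josh"], ["Alonzo", "Jamel", "Theo"],
#                  ["freedom", "love", "peace"], ["abuse", "murder", "sickness"], emb)
```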

This technique was widely used and adapted to sentences, as SEAT (Sentence Encoder Association Test) (May et al., 2019), and even to context-dependent neural models such as BERT, under the new name CEAT (Contextualized Embedding Association Test) (Babaeianjelodar et al., 2020), or in variants like SWEAT (Bianchi et al., 2021), which also considers polarized behaviors between values for a single concept.

All this literature studies bias as a property of the vector space models themselves. In the case of attention-based models (Vaswani et al., 2017), such as BERT, the task is much more complex, as we have to deal with language models and not just word models. An alternative way to study a model’s behavior is by means of its results, rather than its internal encoding mechanisms. To this end, it is possible to continue with the association approach. Following it, multiple datasets have been proposed, such as Winobias (Zhao et al., 2018a) with 3160 sentences for the study of bias in co-reference resolution; StereoSet (Nadeem et al., 2021) with 170,000 sentences for the study of stereotypes associated with race, gender, profession or religion; the more recent BOLD set (Babaeianjelodar et al., 2020) with 23,679 sentences; StereoImmigrants (Sánchez-Junquera et al., 2021), an annotated dataset on stereotypes towards immigrants, in Spanish, consisting of 1685 stereotyped examples and 2019 non-stereotyped examples; or the contribution of Nangia et al. (2020) with CrowS-Pairs, which contains 1508 sentence pairs to measure stereotypes in a total of 9 different categories. Unfortunately, almost all of the corpus creation efforts specialized in bias detection and evaluation target English, which leaves a big gap in resources to study bias in non-English language models.

To understand how these benchmark datasets work, we describe StereoSet (Nadeem et al., 2021) in more detail. It proposes two types of tests based on predefined sentences. The first leaves a gap in a sentence and offers three options: a word corresponding to a stereotype, one corresponding to an anti-stereotype, and a random unrelated word. It is thus possible to measure which of the three the model is most likely to select and, therefore, whether the model replicates the bias (stereotype) or moves away from it (anti-stereotype). The second test consists of a set of context sentences, each accompanied by three follow-up sentences: one stereotyped, one anti-stereotyped, and one unrelated. The intention is the same as before: the sentence the model considers most likely reveals whether the model is stereotyped or not. StereoSet provides an extensive set of such tests with the corresponding stereotype annotations. Our approach takes from this work the idea of measuring bias from a given context by means of the model’s ability to fill a mask in a predefined text.

StereoSet is not the only context-based work. Bartl et al. (2020) proposes to study bias by capturing the probability of association between a term referring to a profession and another referring to gender, for English and German. For example, the template <person_subject> is a <profession> would generate profession sentences such as he is a teacher and she is a teacher, or My brother is a kindergarten teacher and My sister is a kindergarten teacher. This type of test works very well for English because adjectives and determiners lack gender inflection, so writing these patterns is not very difficult. For a heavily inflected language like Spanish, this approach is not easy to implement.

The work by Nozza et al. (2021) presents an evaluation framework based on text completion. By counting how many times the word selected by the language model was in the HurtLex lexicon, it is possible to measure how stereotyped the model is according to the lexicon categories. Yet, an overall metric of how biased the model is can hardly be drawn from this method. Besides, as the authors point out, considering only harmful expressions misses other stereotypes related to gender bias such as "men are more intelligent" or "the value of a woman depends on her beauty".

In a previous work (Muñoz et al., 2022), a system was built to provide the technological infrastructure needed to implement the approach detailed in this paper. That system was an early attempt to analyse gender bias in deep learning models using a visual approach, and the corresponding tool demonstration was the starting point of the more exhaustive study presented here.

4 Design of the evaluation framework

In order to evaluate how biased a language model is towards a specific protected attribute (gender, in our case), we have designed a method based on a masked language task under the following hypothesis: a language model is considered to be gender-biased if it presents significant differences in the probability distribution of adjectives between male sentences and their female counterparts.

Below is the notation of the concepts that are part of the evaluation framework (See Table 1).

Table 1 Base notation used in the proposed evaluation framework

4.1 Feasibility of previous methods to the Spanish case

Several works attempt to measure bias in English models, as we have seen. Our first approach was to translate the published evaluation frameworks into Spanish, but the peculiarities of how Spanish treats gender in a sentence forced us to design an evaluation corpus from scratch. While grammatical gender in English applies mainly to third-person singular personal pronouns, in Spanish it applies to nouns, articles, adjectives, participles, pronouns and certain verb forms. Besides, we wanted an evaluation method for gender bias able to produce understandable results, rather than just general divergence metrics between male and female cases. To this end, we compare categories of adjectives, ensuring that these categories have semantic coherence.

For example, neither StereoSet nor the work by Bartl et al. (2020) can be directly translated, nor can exactly the same masking mechanism be applied to Spanish, as gender may affect nouns, adjectives, determiners and articles. Table 2 illustrates this with a more complete example from the latter work. If we use the same approach to translate this example into Spanish, we encounter some difficulties (Table 3): while in the English version it is possible to study the probability of gender with respect to the element that sets the context (the profession), in Spanish it is not, as the probability is spread over each of the words that vary as the gender of the sentence changes. Since it is not possible to study associations in isolation, this approach had to be adapted.

Table 2 Templates example
Table 3 Translated example

4.2 Method

Our method evaluates the response of the different models both internally and externally. To do so, we create a set of sentences with a masked word. For each sentence, we generate a tuple containing a version of the sentence for each of the possible values of our protected attribute. In our case, the protected attribute is gender and there are two classes to study, male and female, so we have a variant for each. These templates hide one of the words with a mask; in our case, they hide adjectives referring to the subject of the phrase. For each template, we obtain the 10 suggestions from the model with the highest probability. The first measure is thus the probability of a given word for a given template on every model. The second measure is the retrieval status value (RSV), that is, the rank: the RSV assigned to a suggestion is 11 minus its position in the list (as we only consider the top 10). Therefore, the RSV of the model's top suggestion is 10, that of the second one is 9, and so on.

The probability is relevant, as it exposes the bias phenomenon in more detail and helps in understanding it. However, it is the ranking of the words that determines how the model generates texts, so the relative order between word candidates is more relevant than their absolute probabilities.
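As an illustration of how these two measures can be collected, the following sketch uses the Hugging Face fill-mask pipeline; the model identifier and the sentences are merely illustrative, and the actual mask token depends on each model's tokenizer.

```python
# Sketch of collecting the two measures (probability and RSV) for a template pair.
# The model identifier and sentences are illustrative; each model has its own mask token.
from transformers import pipeline

fill = pipeline("fill-mask", model="BSC-TeMU/roberta-base-bne", top_k=10)

def top10_with_rsv(template):
    """Return (token, probability, RSV) for the model's top-10 suggestions."""
    sentence = template.replace("<mask>", fill.tokenizer.mask_token)
    suggestions = fill(sentence)                    # already sorted by probability
    return [(s["token_str"].strip(), s["score"], 10 - rank)
            for rank, s in enumerate(suggestions)]  # rank 0 -> RSV 10, rank 1 -> 9, ...

male = top10_with_rsv("Él ha conseguido el trabajo ya que es muy <mask>")
female = top10_with_rsv("Ella ha conseguido el trabajo ya que es muy <mask>")
```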

We then group these adjectives according to certain categorization criteria, explained later, which allows us to compare the variation of the ranking and probability values for each category between classes. In the following section, the method is detailed step by step.

4.3 Evaluation patterns and number of proposals from models

The first step is to prepare the sentences. For each of the templates in the template set \(T = \{T_1, T_2,..., T_t\}\), one sentence must be prepared for each of the values of the protected attribute \(V = \{ V_1, V_2,..., V_v\}\).

For our use case with the protected attribute gender, we have two protected values (V\(_\mathrm{{male}}\), V\(_\mathrm{{female}}\)) and a set of 96 base sentences. Therefore, a total of 192 sentences with regard to the protected attribute are generated.

For example, a valid pair is Él ha conseguido el trabajo ya que es muy <mask> and Ella ha conseguido el trabajo ya que es muy <mask> (respectively, in English, He got the job as he is very <mask> and She got the job as she is very <mask>). With this type of sentence, we are clearly looking for some kind of adjective or qualifier about the subject. As previously mentioned, in this work we focus on gender with the classes male and female; however, the framework is extensible to the study of other types of biases.

To generate the sentences, a set of 8 templates was defined and populated with 12 different subjects. Table 4 shows the male version of the templates together with an indicative English translation.

Table 4 Some of the proposed templates

The set of sentences is intended, on the one hand, to favour the elicitation of adjectives by the model and, on the other hand, to provide sufficient variety to explore the predictions of the models independently of characteristics such as sentence length. Table 5 shows one of the sentences together with its variations for both classes.

Table 5 One of the proposed templates with its 12 versions

For a given template t and a sentence s (generated from that template), the model being evaluated generates a probability distribution over words, from which we keep the top 10 suggestions \(W^{t,s} = (w^{t,s}_1, w^{t,s}_2,..., w^{t,s}_{10})\), where \(Prob(w^{t,s}_j)\) is the probability of the word at position j in the list of suggestions. It is important to note that, depending on the downstream task, considering only the most probable word might not be enough to measure bias in the model, as it is usual to introduce some randomness to avoid determinism when generating texts.

As not all the words returned by the model are adjectives, we use a PoS tagger to retrieve the part-of-speech tag of each suggested word. We only classify the ones with the AQ tag, which stands for qualifying adjective.
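A sketch of this filtering step is shown below. Since the specific tagger is secondary here, spaCy's Spanish pipeline is used as a stand-in, which emits the universal POS tag ADJ rather than the EAGLES tag AQ used in the paper.

```python
# Sketch of the adjective-filtering step. spaCy's Spanish pipeline is used here
# as a stand-in tagger; it emits the universal tag ADJ rather than the EAGLES tag AQ.
import spacy

nlp = spacy.load("es_core_news_sm")  # requires: python -m spacy download es_core_news_sm

def keep_adjectives(suggestions, template):
    """Keep only the suggested words tagged as adjectives when placed in the template."""
    kept = []
    for word, prob, rsv in suggestions:
        doc = nlp(template.replace("<mask>", word))
        tags = [tok.pos_ for tok in doc if tok.text == word]
        if tags and tags[0] == "ADJ":
            kept.append((word, prob, rsv))
    return kept
```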

Tables 6 and 7 show the ratio of adjectives obtained by the models for the male and female cases respectively, that is, out of all the words generated by each model over all the templates, how many were tagged as AQ. It is striking that the MarIA base model obtains the second and third places, while the large version obtains the penultimate places for both male and female.

Table 6 The proportion of adjectives for male templates
Table 7 The proportion of adjectives for female templates

4.4 Adjectives categorization

To understand the differences between the results of each model for the male and female versions, the adjectives obtained need to be assigned to previously defined categories. We therefore want classification schemes that allow us to detect differences in the results and to interpret those differences in a more semantic way. The categorizations have been made by consensus among the authors of this work. We have explored three different categorization schemes for adjectives (a toy lookup sketch of the categorization follows the list):

  1. Visible/Invisible, Positive/Negative. The baseline proposal is to classify the adjectives along two dimensions: the first dimension answers the question “Does the adjective refer to a visible characteristic?”, while the second answers the question “Is the adjective positive or negative?” We then have \(\mid C_{visibility\_polarity}\mid\) = 4, with the labels: Visible+, Visible-, Invisible+ and Invisible-.

  2. Accept/Reject, Self/Other, Love/Status. Wiggins (1979) proposes to categorize adjectives using three dimensions, with two possible values each. The first dimension distinguishes between accepting and rejecting: for example, to say that someone is kind or hard-working is to accept them for those characteristics, while to say that they are lazy would be considered rejecting. The second dimension is self/other; since all prepared sentences refer to others, we consider that this dimension always categorizes as “other”. The third dimension distinguishes between love and status, with love referring to the emotional plane and status to the social one. Combining these three dimensions would give eight categories, but given that the second dimension is always “other”, we are left with four possible combinations. Some examples can be found in Table 8. Therefore, we have \(\mid C_{psychological\_taxonomy}\mid\) = 4, with the labels: accept_love, accept_status, reject_status, and reject_love. One of the main problems with this categorization scheme is that it is not entirely clear which category to choose for some adjectives. The other major problem is that the original study focuses on personality traits, leaving all kinds of adjectives referring to the body out of this categorization.

     Table 8 Examples
  3. Supersenses. Tsvetkov et al. (2014) proposes a taxonomy of supersenses for adjectives. This taxonomy covers the set of all possible adjectives better than trait-based studies like the previous one. The categories proposed are perception, spatial, temporal, motion, substance, weather, body, feeling, mind, behavior, social, quantity and misc. Since we are drawing adjectives referring to people, given the context we provide in the sentences, the categories of perception, spatial, temporal, motion, substance, weather, quantity and misc are left out of the study. Therefore, \(\mid C_{supersenses}\mid\) = 5, with the labels body, feeling, mind, behavior and social.
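Computationally, each of these manually built schemes amounts to a lookup table from adjectives to category labels. The following toy sketch illustrates the idea; the adjectives and their assignments are illustrative examples only, not the authors' full annotation.

```python
# Toy illustration of the manual categorization as lookup tables.
# The adjectives and labels below are illustrative examples, not the full annotation.
SUPERSENSES = {
    "guapa": "body", "fuerte": "body",
    "feliz": "feeling", "inteligente": "mind",
    "trabajadora": "behavior", "famosa": "social",
}
VISIBILITY_POLARITY = {
    "guapa": "Visible+", "fea": "Visible-",
    "inteligente": "Invisible+", "vaga": "Invisible-",
}

def categorize(word, scheme):
    """Return the category of `word` under the given scheme, or None if unannotated."""
    return scheme.get(word)
```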

4.5 Metrics

From these categories, two values are obtained. The first is the Bias Probability Index (BPI), based on the probability assigned by the model to the word filling the mask, which is an internal measure of the model. The BPI is computed for each category of the chosen classification scheme of adjectives (so we have \(BPI_{C_i}, \forall C_i \in C\)); therefore, we can observe how a model is biased towards male or female in that dimension (i.e. category). The second is the Bias Rank Index (BRI), which is based on the retrieval status value (RSV), that is, the score derived from the position of the predicted word in the model's suggestion list. The item with the largest probability has a value of 10 (as we take the top 10 suggested adjectives from the model), the second most likely gets 9, and so on. This serves as an external measure of the model, as it describes the model's behavior without any hint of its internal values. For each model, we compute these metrics as the aggregate of probabilities or RSVs at the category level and for each value of the protected attribute, that is, for the male and female versions of the patterns. To make the values comparable, we normalize each category by the number of adjectives it contains.

Here is the formal notation of these two measures:

$$\begin{aligned} Prob_{C_i} = \frac{1}{N_i} \sum _{t = 1}^{\mid T \mid } \sum _{s = 1}^{\mid S_t \mid } \sum _{j = 1}^{10} Prob(w^{t,s}_j) \mid w^{t,s}_j \in C_{i} \end{aligned}$$
(1)
$$\begin{aligned} RSV_{C_i} = \frac{1}{N_i} \sum _{t = 1}^{\mid T \mid } \sum _{s = 1}^{\mid S_t \mid } \sum _{j = 1}^{10} (11 - j) \mid w^{t,s}_j \in C_{i} \end{aligned}$$
(2)

where:

T set of templates

\(S_t\) set of sentences generated from template t

\(w^{t,s}_j\) word at order j proposed by the model for sentence s in template t

\(C_i\) category i of adjectives

\(N_i\) total number of adjectives generated that are included in category \(C_i\)

In the end, we have a value of BPI and BRI for every value (male and female) in each category. The difference between these male and female measurements will provide a final bias value:

$$\begin{aligned} BPI_{C_i} = Prob^{male}_{C_i} - Prob^{female}_{C_i} \end{aligned}$$
(3)
$$\begin{aligned} BRI_{C_i} = RSV^{male}_{C_i} - RSV^{female}_{C_i} \end{aligned}$$
(4)
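The following sketch shows one possible reading of Eqs. (1)-(4): probability and RSV are aggregated per category and per gender (normalizing by the number of adjectives observed in each category), and the male minus female difference is then taken. The data layout is assumed, not taken from the authors' code.

```python
# One possible reading of Eqs. (1)-(4). `records[gender]` is assumed to be a list
# of (word, probability, rsv, category) tuples collected over all sentences.
from collections import defaultdict

def aggregate(records_for_gender):
    """Per-category mean probability and mean RSV (Eqs. 1-2)."""
    prob, rsv, count = defaultdict(float), defaultdict(float), defaultdict(int)
    for _word, p, r, cat in records_for_gender:
        prob[cat] += p
        rsv[cat] += r
        count[cat] += 1
    return ({c: prob[c] / count[c] for c in count},
            {c: rsv[c] / count[c] for c in count})

def bias_indices(records):
    """BPI and BRI per category (Eqs. 3-4) for the binary gender case."""
    prob_m, rsv_m = aggregate(records["male"])
    prob_f, rsv_f = aggregate(records["female"])
    cats = set(prob_m) & set(prob_f)
    bpi = {c: prob_m[c] - prob_f[c] for c in cats}
    bri = {c: rsv_m[c] - rsv_f[c] for c in cats}
    return bpi, bri
```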

Note that, in the case of a bias analysis related to a protected attribute with more than two values (like sexual orientation, nationality, profession or ethnicity), the metrics above can be generalized as the average distance between aggregated probabilities and ranks per category, so the proposed method can be applied to any type of bias analysis (see Eqs. 5 and 6).

$$\begin{aligned} BPI_{C_i} = \left( \frac{\mid V \mid ^2 - \mid V \mid }{2}\right) ^{-1} \sum _{j=1}^{\mid V \mid } \sum _{k=j+1}^{\mid V \mid } \left( Prob^j_{C_i} - Prob^k_{C_i}\right) \end{aligned}$$
(5)
$$\begin{aligned} BRI_{C_i} = \left( \frac{\mid V \mid ^2 - \mid V \mid }{2}\right) ^{-1} \sum _{j=1}^{\mid V \mid } \sum _{k=j+1}^{\mid V \mid } \left( RSV^j_{C_i} - RSV^k_{C_i}\right) \end{aligned}$$
(6)
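Under the reading above, the generalized index of Eq. (5) is simply the average pairwise difference of the aggregated values across all values of the protected attribute, as in the following sketch (shown for the probability index only).

```python
# Sketch of the generalized BPI of Eq. (5): the average pairwise difference of
# the aggregated probabilities over all values of the protected attribute.
from itertools import combinations

def generalized_bpi(prob_by_value, category):
    """`prob_by_value` maps each attribute value (e.g. "male", "female", ...)
    to its per-category aggregated probabilities."""
    pairs = list(combinations(prob_by_value, 2))
    diffs = [prob_by_value[a][category] - prob_by_value[b][category] for a, b in pairs]
    return sum(diffs) / len(pairs)
```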

5 Experiments

We have applied the method to several models (the best known in the literature and the most downloaded from Huggingface's repository). For these models, the three categorization schemes have been used to measure gender bias. The categorization process was carried out using the expert judgment method in three iterations: we made a first independent categorization, then shared and discussed the results, identifying discrepancies; based on that, the criteria were refined and improved, and the process was repeated until a high level of agreement was reached.

In order to visually portray bias, we use tables that contain a numerical value in each cell, indicating the degree of disparity between male and female. When the value is negative, it indicates a bias towards the female class and the cell background is colored red; conversely, a positive value signifies a bias towards the male class and the cell background is colored blue. The intensity of the color indicates the level of bias: the highest and lowest values in each column are rendered with the most intense colors, while cells with values close to 0, where no bias is observed, are the least intense. Such a graphical representation helps in identifying and understanding the extent of gender bias within a model.
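A table of this kind can be produced, for example, with pandas styling; the following sketch uses placeholder values and hypothetical model names, and only approximates the column-wise color scaling described above.

```python
# Sketch of the red/blue diverging table. Values and model names are placeholders.
import pandas as pd

bpi = pd.DataFrame(
    {"body": [-0.21, -0.08, 0.02], "behavior": [0.12, 0.03, -0.01]},
    index=["model-a", "model-b", "model-c"],  # hypothetical model names
)

# Column-wise gradient: negative (female bias) towards red, positive (male bias)
# towards blue; this approximates the described scaling, which is centered near 0.
styled = bpi.style.background_gradient(cmap="RdBu", axis=0).format("{:+.2f}")
html = styled.to_html()
```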

5.1 Models analysed

Several models available for Spanish in the repository maintained by the Huggingface project have been evaluated. Huggingface is the main repository of deep-learning-based language models for NLP tasks (Wolf et al., 2020). A very large share of researchers, along with a large community from industry, use the models found in this repository, and most of the major models that are domain-adapted or fine-tuned to specific tasks are shared through it.

The models selected were pretrained with a masked language modeling objective on Spanish texts. For our study, the selected models had to produce adequate predictions, that is, for the given sentences whose masked positions had to be replaced with words, only complete Spanish words (not subwords) should be proposed. Consequently, some models were discarded for not providing predictions in Spanish, and others for not giving complete terms, possibly because they are not really trained for the task under which they are listed. This left us with 20 functional models out of the 26 models found in the repository at the time of our research. They are listed in Table 9.

Table 9 Spanish language models selected for evaluation from the hugging face repository
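The screening described above can be automated along the following lines; the check shown is an assumption about how such a filter could be implemented, not the authors' exact procedure.

```python
# Sketch of a screening check: keep a model only if its mask predictions for a
# Spanish probe sentence are complete words (no subword pieces). Illustrative only.
from transformers import pipeline

def produces_complete_words(model_name, probe="Ella es muy {mask}."):
    fill = pipeline("fill-mask", model=model_name, top_k=10)
    sentence = probe.format(mask=fill.tokenizer.mask_token)
    for suggestion in fill(sentence):
        token = suggestion["token_str"].strip()
        # WordPiece continuations keep a "##" prefix; non-alphabetic strings are
        # also taken as a sign that the prediction is not a complete Spanish word.
        if token.startswith("##") or not token.isalpha():
            return False
    return True
```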

These models are based either on BERT (Devlin et al., 2019) or RoBERTa (Zhuang et al., 2021), except one, which is based on ELECTRA (Clark et al., 2020). They either focus on Spanish or include Spanish among the supported languages. They are intended for general use, except for ALBERTI, which is trained on poetry, and the BSC-TeMU/RoBERTalex model, trained on legal texts. Although these two models differ from the rest, we consider it interesting to evaluate whether gender bias is also present in such domain-oriented models.

5.2 Results

For every sentence in the pattern corpus, we obtain the top 10 tokens that the model suggests to fill in the mask for both the male and female versions. For each token, we also store its rank among the 10 suggestions and its probability (sigmoid of the logit output) of filling the mask according to the model itself. Table 10 shows the adjectives generated by the model for a sample pair of male/female sentences and their probabilities (scores). Table 11 shows the example translated.

Table 10 Outputs by MarIA-base model for sentences “El maestro es el más <mask>” and “La maestra es la más <mask>”
Table 11 The previous example translated. The translated template is the same for male and female: “The teacher is the most <mask>”

5.2.1 Visible/invisible, positive/negative

Table 12 shows how each category exhibits different behavior. The Visible+ category is strongly biased towards the female class, with quite large differences in general; BETO and ELECTRICIDAD stand out. The Invisible+ category presents behavior that depends on the model, with very popular models such as BETO or ELECTRICIDAD biased towards the male version, while other models such as Recognai or ALBERTI lean towards the female version. The Visible- category is quite balanced and the differences are small. Finally, in the Invisible- category, the male version predominates, and we can observe strong variations if, instead of looking at the external state of the model (RSV), we look at the internal one (probability), in models like ALBERTI (the probability difference is 3.57 times greater than the RSV difference) or BERTIN in its random version (2.24 times greater).

Table 12 Differences between male and female for visibility-polarity categories

From these results, we can already sense a certain bias towards women for visible and positive adjectives, which could be adjectives related to physique, and a bias towards men for non-visible adjectives, which could be related to personality. This phenomenon is better understood with the other categorizations, as described below.

5.2.2 Accept/reject, love/status

Again, scores and tables are recomputed, but based on a different grouping of the adjectives. In this section, we explore the results (see Table 13) according to the Accept/Reject, Love/Status categorization scheme proposed by Wiggins (1979).

Table 13 Differences between male and female using Wiggins’ categories

Under these categories, we can see a certain tendency to associate men with positive status in models such as BETO, MarIA, Geotrend, Amine or Recognai, and women with sentimental characteristics (love) in models such as MarIA, Recognai, ALBERTI or Geotrend. However, it is by no means generalized. The reject and love categories are, in general, less unbalanced, except for ALBERTI and BERT-multilingual. Finally, reject+status as well as reject+love are slightly unbalanced toward men in general, but nothing is particularly significant.

In general, we do not find this categorization very useful. It only allows us to sense a certain imbalance in the material on which the models are trained, relating women more to the sentimental plane and men to status. To better understand how gender bias is present, we explored a last categorization scheme that moves away from personality traits and allows a larger set of adjectives to be categorized in a clearer and more comprehensive way.

5.2.3 Supersenses

Under the Supersenses categorization scheme proposed by Tsvetkov et al. (2014), we observe behavior similar to that reported for the first scheme (see Table 14). Almost all models give more weight to the category referring to physical appearance (body) in the female version than in the male counterpart.

Table 14 Differences between male and female under Supersenses categorization

We can see that the likelihood of the model suggesting body-related words is higher when predicting words to fill the mask in female templates. This occurs for all models in the RSV variable referring to the ranking, and for 19 out of 20 models according to the probability metric. In the cases where it does not occur, the difference is minimal, which implies a cleaner pair of models in terms of gender bias. Only some BERTIN models show a slight bias, while the other models (BETO, MarIA, ELECTRICIDAD, MMG, BERT-multilingual, Geotrend, RoBERTalex and Recognai) show a strong bias towards the female class.

For the behavior category we observe the opposite situation: in 11 of the 20 models the probability is much higher for male sentences, while four of the models are strongly biased toward women. For the social category, the labels go mainly to the male class, although the difference is not very high. For the feel category, the behavior is more balanced and more attenuated, except for RoBERTa and ALBERTI in favour of the female class and a couple of the BERTIN models in favour of the male class; in general, this category is not strongly biased.

In Figs. 1, 2 and 3 we can see how the adjectives are distributed proportionally across the categories for three of the models. We can easily see the important differences in the body category: these three models generate more adjectives related to the body for the female templates according to the categories of the supersenses scheme.

Fig. 1 Radar chart for MarIA base

Fig. 2 Radar chart for BETO uncased

Fig. 3 Radar chart for BERTIN stepwise

6 Conclusions and future work

It is evident that there are certain biases in Spanish language models, as we found great differences in the way women are talked about with respect to men. Some of the most important models, such as BETO or the recent MarIA, among others, present a strong bias towards the body when talking about women and towards behavior when talking about men. For example, in the MarIA base model (BSC-TeMU/roberta-base-bne), for the pair of templates "La chica es la más [MASK]" and "El chico es el más [MASK]" (translated "The girl/boy is the most [MASK]") we observe a huge difference. The top 8 results for the female version refer to the woman's body ("guapa, sexy, bonita, bella, linda, fea, hermosa, mona"), while for the male version this only happens in half of the results ("guapo, listo, sexy, bonito, grande, fuerte, rápido, lindo"). This should be taken into account when considering these models for decision-making in real-world environments, as the evident shift in how the model treats male versus female features could move a system away from fair predictions.

This work proposes an approach to finding biases in Spanish models. The method, which can be easily generalized to other types of biases, provides coherent metrics to compute inter-class imbalances among the different values a protected attribute may take. Besides, the use of meaningful classification schemes provides insights into the way the models are biased, which could serve as supporting information for bias studies in terms of explainability. In this regard, it is important to use classification schemes that are adequate for the type of bias under study in order to achieve this ability to understand the specific behavior of a model.

There are multiple paths to take when studying bias; here we describe some approaches for future work. For the evaluation part, we foresee creating corpora that represent other dimensions beyond gender, such as ethnicity or religion, or less obvious classes, such as socio-economic status. In addition to creating other corpora, the proposed method could be applied using resources such as the EXIST dataset (Rodríguez-Sánchez et al., 2021) for identifying sexism. By using this dataset, we could generate a set of labeled phrases that can be transformed into templates, enabling us to obtain a more representative and accurate set of phrases that reflect reality, which can then be used to perform the proposed evaluation.

Another way to extend this study is to apply the method to models oriented to other specific tasks, such as text generation or sentiment prediction. A biased model that is part of an automatic content moderation system can be very harmful.

Additionally, the existence of a dataset focused on gender bias like EXIST (Rodríguez-Sánchez et al., 2021) could help evaluate how bias-mitigated models perform against non-mitigated versions, as different sequence probabilities would result from these models when analyzing a sexist text.

Further in the future, once an evaluation method is available, we plan to research methods and strategies to mitigate the bias and then evaluate again to see how effective the mitigation was. Mitigation measures have mostly been applied, again, to English models. Many of the available techniques are neither trivially adaptable to other languages nor easy to automate, so exploring this direction is challenging.

7 Final remarks on reproducibility

Our tool for exploring the model suggestions for each sentence, the statistics of adjectives in the models, the charts with the proportion per category for each model, and the tables that visually compare the differences between the models for each category are available. Both the tool and the research source code can be found at: https://github.com/IsGarrido/Evaluating-Gender-Bias-in-Spanish-Deep-Learning-Models.