1 Introduction

It is widely agreed that language models for natural language processing capture reality very accurately; so accurately, in fact, that they also capture undesirable or unfair associations. Bolukbasi et al. (2016) showed how word embeddings capture associations such as a man being a computer programmer while a woman is a home-maker. Caliskan et al. (2017) later showed how data-driven models in artificial intelligence capture all kinds of prejudices and human-like biases. This behavior is not exclusive to word embedding models; it is still present in more recent and more complex models (Garrido-Muñoz et al., 2021). A clear example is the recent GPT-3 model (Abid et al., 2021), which shows a bias against the Muslim religion by associating it with violence in a large number of cases. These associations are also present in widely used pretrained models like BERT (Bender et al., 2021).

These models are embedded in multiple systems and applications, so undesirable associations may be reflected directly or indirectly in their output. We call these types of associations bias and define bias as any prejudice for or against a person, group, or thing. Bias can be reflected in various dimensions such as gender (Bhardwaj et al., 2021; Zhao et al., 2018b), race (Nadeem et al., 2021; Manzini et al., 2019), religion (Babaeianjelodar et al., 2020), ideology (McGuffie & Newhouse, 2020), ethnicity (Groenwold et al., 2020), sexual orientation, age, disability or even appearance (Nangia et al., 2020).

Dealing with bias in language models mostly involves two different tasks: evaluation (measuring how biased a model is) and mitigation (preventing or reducing the bias in a model). Most of the work on bias in language modelling is focused on the English language. In this paper, we present a novel framework for gender bias evaluation and apply it to some of the most popular Spanish pretrained language models, including multilingual ones where Spanish is among the supported languages. Our approach to bias evaluation builds on previous literature, opting for a mechanism that measures differences in the probability distribution of certain words in a masked language task. Our proposal focuses on adjectives as target terms, and measurements are made at a higher level of abstraction by grouping adjectives into semantic classes according to different classification schemes. In order to establish a base case, a set of simple templates has been prepared, each containing a masked position where an adjective should go. For each template, we measure and compare the suggestions that each model generates. By comparing the bias of models towards certain categories of adjectives, the method eases the interpretation of this bias. Our main findings are that the analyzed pretrained models exhibit different levels of bias and that gender bias is mainly focused on body appearance when comparing the adjectives proposed for male versus female subjects.

The remainder of this article is organized as follows. In Section 2 we discuss why bias-free resources are needed, what the impact of bias on society is, and the legislative changes it is prompting. Section 3 reviews previous work on bias evaluation. Section 4 describes the design of our evaluation method. Section 5 presents the results of applying this method to several Spanish models to evaluate their degree of gender bias. Finally, we provide some brief conclusions and outline future work.

2 The need for unbiased models

The presence of bias in a model is a symptom of multiple issues throughout the training process. In the first place, the problem might lie in the data fed to the model. If the data source under-represents one value of a protected attribute (e.g. gender) relative to the others (e.g. male versus female), then model predictions will favour the most represented value, and the underestimation of the minority class will be accentuated (Blanzeisky & Cunningham, 2021).

Unequal representation is particularly problematic when it ends up affecting sensitive decision systems, as happened with a U.S. healthcare algorithm that underestimated the illnesses of black people (Obermeyer et al., 2019). Language models suffer from this problem as well; for example, Ramezanzadehmoghadam et al. (2021) studied the distribution of gender (male/female) and race (Caucasian/African) in the BERT vocabulary against the Labour Market Distribution and found that the model’s vocabulary contains 100% of the studied male and female names but only 33% of the male African names and 11% of the female African names.

Bias is also present in models trained for classification tasks (named entity recognition, sentiment analysis, text classification, and so on), as those models are trained from human annotations that may not adequately represent reality, or whose annotators may manifest their personal biases in the labelling process. Al Kuwatly et al. (2020) explore whether the demographic characteristics of annotators produce biased labelling and find that they do: characteristics such as language proficiency, age range, and educational level affect labelling, while features such as gender do not make a difference in the studied task.

We could also consider biased training methods or even biased source media. It is worth asking questions such as: Does the model behave in the same way across different classes? Is the model able to encode words with unusual language-specific characters? If the model recognizes people, does it operate with the same quality regardless of race or literacy level? We consider questions like these necessary. Bias analysis can go even further and study the full pipeline of processes involved in training the final model. For example, BERT encodes words as tokens, and some words are encoded as a pair of tokens instead of a single token. Does this affect the output in some way? Is the effect the same for the different classes? Any aspect may exhibit a side effect in terms of biased output in a language model, as these components (tokenizers, encoders, decoders...) learn patterns from massive collections of real-world texts.

We have multiple examples of Artificial Intelligence (AI) models that turned out to be biased, such as Amazon’s recruiting tool that penalized women (Dastin, 2018). Apple’s credit card applied an algorithm that set different limits for men and women (Kelion, 2019). Google removed the word gorilla from Google Photos when it was discovered that the system labelled black people with that word (Simonite, 2018). Gary (2019) collected more examples of AI systems showing unfair, unethical, or abusive behavior. Once a biased model is used in production systems, its bias and prejudice will feed into other systems and into society’s perception (Kay et al., 2015).

The proliferation of these non-transparent and non-auditable AI-based models is prompting proposals for changes in European legislation, from setting up an agency for AI monitoring in Spain (Europa Press, 2021) to the ban on systems that exploit vulnerabilities of protected groups due to their age or physical or mental disability (Jane Wakefield, 2021; MacCarthy & Propp, 2021). Legislation is also being passed to ensure the transparency of AI systems by establishing obligations that high-risk AI systems must meet to be considered reliable, among them requirements related to data quality, documentation, traceability, transparency, human oversight, accuracy, and robustness (European Commission, 2021). These rules are complemented by Article 13(2)(f) of the GDPR, which specifies that, in certain cases, in order to ensure transparent and fair data processing, the data controller must provide meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject (European Commission, 2018).

3 Bias in deep learning models

The bias phenomenon in deep language models has been clearly identified (Garrido-Muñoz et al., 2021). The study of bias usually involves two main tasks: the evaluation task, in which the aim is to characterize the bias and make measurements to quantify it; and the mitigation task, in which the objective is to eliminate the bias or mitigate its effect. After mitigation, the same techniques used in the evaluation are used to check whether the measured bias has been reduced or not. Not all works focus on both aspects of the bias problem. In our case, we have focused this work on the evaluation of gender bias in pretrained language models for Spanish based on deep neural networks.

One of the first forms of bias in language models was found in word embeddings. Bolukbasi et al. (2016) highlight how the model captures strong associations between some professions and the male gender and other professions and the female gender. According to the model analysed, a man would be a teacher, a programmer, or a doctor, while a woman would preferably be a housewife, a nurse, or a receptionist. Indeed, in the embedding space, the analogy father\(\rightarrow\) doctor, mother\(\rightarrow\)? is resolved with nurse; for the model there is no such thing as a female doctor. Caliskan et al. (2017) later showed how AI models capture all kinds of prejudices and human-like biases. This work performed the first tests on racial bias, quantifying it by comparing the model’s results for typically African names versus typically European names, confirming that there is indeed bias in the studied model. The authors also question the impact that this could have on NLP applications such as sentiment analysis. Ideally, the outcome of sentiment analysis of the reviews of a film, product, or company should not be affected by the names of its protagonists, workers, or other people involved, but this cannot be ensured given the unequal treatment of proper names.

The study of bias based on measuring associations is continued by Caliskan et al. (2017), who developed WEAT (Word Embedding Association Test) as a mechanism to measure the association between two concepts based on the cosine of their vector representations.

This test measures the association between a set of words and a set of attributes by applying cosine similarity to measure the distance between the vectors representing the embeddings of these words. It does so for a pair of sets of words of equal sizes, such as European American names = {Adam, Harry, Josh, Roger,...} and African American names = {Alonzo, Jamel, Theo, Alphonse,...} with respect to two attribute sets to which the association is to be studied. For example, to measure whether there is a positive or negative association of names with respect to their origin, it uses the attribute sets Pleasant = { caress, freedom, health, love, peace, cheer,...} and Unpleasant = {abuse, crash, filth, murder, sickness}.
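For illustration, this association test can be sketched in a few lines of code. The following is a minimal, illustrative implementation of a WEAT-style effect size, assuming word vectors are available through a dictionary-like object `emb`; the word lists are abbreviated from the example above and the snippet is not the original authors' code.

```python
# Minimal, illustrative WEAT-style effect size (not the original implementation).
# `emb` is assumed to map each word to a NumPy vector from some embedding model.
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(word, A, B, emb):
    """s(w, A, B): mean similarity of `word` to attribute set A minus to set B."""
    return (np.mean([cosine(emb[word], emb[a]) for a in A])
            - np.mean([cosine(emb[word], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    """Effect size of the differential association of target sets X, Y with A, B."""
    sx = [association(x, A, B, emb) for x in X]
    sy = [association(y, A, B, emb) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

# Abbreviated sets from the example above:
# weat_effect_size(["Adam", "Harry", "Josh"], ["Alonzo", "Jamel", "Theo"],
#                  ["freedom", "love", "peace"], ["abuse", "murder", "sickness"], emb)
```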

This technique was widely used and adapted to sentences, as SEAT (Sentence Encoder Association Test) (May et al., 2019), and even to context-dependent neural models such as BERT, under the new name CEAT (Contextualized Embedding Association Test) (Babaeianjelodar et al., 2020), or in variants like SWEAT (Bianchi et al., 2021), which also considers polarized behaviors between values for a single concept.

All this literature studies bias as a property of the vector space models themselves. In the case of attention-based models (Vaswani et al., 2017), such as BERT, the task is much more complex, as we have to deal with language models and not just word models. An alternative way to study a model’s behavior is by means of its results, rather than its internal encoding mechanisms. To this end, it is possible to continue with the association approach. Following it, multiple datasets have been proposed, such as Winobias (Zhao et al., 2018a) with 3160 sentences for the study of bias in co-reference resolution; StereoSet (Nadeem et al., 2021) with 170,000 sentences for the study of stereotypes associated with race, gender, profession or religion; the more recent BOLD set (Babaeianjelodar et al., 2020) with 23,679 sentences; StereoImmigrants (Sánchez-Junquera et al., 2021), an annotated dataset on stereotypes towards immigrants, in Spanish, consisting of 1685 stereotyped examples and 2019 non-stereotyped examples; or the contribution of Nangia et al. (2020) with CrowS-Pairs, which contains 1508 sentence pairs to measure stereotypes in a total of 9 different categories. Unfortunately, almost all of the corpus creation efforts specialized in bias detection and evaluation target English, which leaves a big gap in resources to study bias in non-English language models.

To understand how these benchmark datasets work, we describe StereoSet (Nadeem et al., 2021) in more detail. It proposes two types of tests based on predefined sentences. The first leaves a gap in a sentence and offers three options: a word corresponding to a stereotype, one corresponding to an anti-stereotype, and a random unrelated word. It is thus possible to measure which of the three the model is most likely to select and, therefore, whether the model replicates the bias (stereotype) or moves away from it (anti-stereotype). The second test consists of a set of context sentences, each accompanied by three follow-up sentences: one stereotyped, one anti-stereotyped, and one unrelated. The intention is the same as before: the sentence the model considers most likely reveals whether the model is stereotyped or not. StereoSet provides an extensive set of such tests with the corresponding stereotype annotations. Our approach takes from this work the idea of measuring bias from a given context by means of the model’s ability to fill a mask in a predefined text.

StereoSet is not the only context-based work. Bartl et al. (2020) proposes to study bias by capturing the probability of association between a term referring to a profession and another referring to gender, for English and German. For example, the template <person_subject> is a <profession> would generate profession sentences such as he is a teacher and she is a teacher, or My brother is a kindergarten teacher and My sister is a kindergarten teacher. This type of test works very well for English because adjectives and determiners lack gender inflection, so writing these patterns is not very difficult. For a heavily inflected language like Spanish, this approach is not easy to implement.

The work by Nozza et al. (2021) presents an evaluation framework based on text completion. By counting how many times the word selected by the language model was in the HurtLex lexicon, it is possible to measure how stereotyped the model is according to the lexicon categories. Yet, an overall metric of how biased the model is can hardly be drawn from this method. Besides, as the authors point out, considering only harmful expressions misses other stereotypes related to gender bias such as "men are more intelligent" or "the value of a woman depends on her beauty".

In a previous work (Muñoz et al., 2022), a system was built to provide the technological infrastructure needed to implement the approach detailed in this paper. That system was an early attempt to analyse gender bias in deep learning models using a visual approach, and the corresponding tool demonstration was the starting point of the more exhaustive study presented here.

4 Design of the evaluation framework

In order to evaluate how biased a language model is towards a specific protected attribute (gender, in our case), we have designed a method based on a masked language task under the following hypothesis: a language model is considered to be gender-biased if it presents significant differences in the probability distribution of adjectives between male sentences and their female counterparts.

Below is the notation of the concepts that are part of the evaluation framework (See Table 1).

Table 1 Base notation used in the proposed evaluation framework

4.1 Feasibility of previous methods to the Spanish case

Several works attempt to measure bias in English models, as we have seen. Our first approach was to translate the published evaluation frameworks into Spanish, but the peculiarities of how Spanish treats gender in a sentence forced us to design an evaluation corpus from scratch. While grammatical gender in English applies mainly to third-person singular personal pronouns, in Spanish it applies to nouns, articles, adjectives, participles, pronouns and certain verb forms. Besides, we wanted an evaluation method for gender bias able to produce understandable results, rather than just general divergence metrics between male and female cases. To this end, we compare categories of adjectives, ensuring that these categories have semantic coherence.

For example, neither StereoSet nor the work by Bartl et al. (2020) can be directly translated, nor can exactly the same masking mechanism be applied to Spanish, as gender may affect nouns, adjectives, determiners and articles. Table 2 illustrates this with a more complete example from the latter work. If we use the same approach to translate this example into Spanish, we encounter some difficulties (Table 3): while in the English version it is possible to study the probability of gender with respect to the element that sets the context (the profession), in Spanish it is not, as the probability is spread over each of the words that vary as the gender of the sentence changes. Since it is not possible to study associations in isolation, this approach had to be adapted.

Table 2 Templates example
Table 3 Translated example

4.2 Method

Our method evaluates the response of the different models both internally and externally. To do so, we create a set of sentences with a masked word. For each sentence, we generate a tuple containing a version of the sentence for each of the possible values of our protected attribute. In our case, the protected attribute is gender and there are two classes to study, male and female, so we have a variant for each. These templates hide one of the words with a mask; in our case, they hide adjectives referring to the subject of the phrase. For each template, we obtain the 10 suggestions from the model with the highest probability. The first measure is thus the probability of a given word for a given template on every model. The second measure is the retrieval status value (RSV), that is, the rank: the RSV assigned to a suggestion is 11 minus its position in the list (as we only consider the top 10). Therefore, the RSV of the model's top suggestion is 10, that of the second one is 9, and so on.

The probability is relevant, as it exposes the bias phenomenon in more detail and helps in understanding it. However, it is the ranking of the words that determines how the model generates texts, so the relative order between word candidates is more relevant than their absolute probabilities.
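As an illustration of how these two measures can be collected, the following sketch uses the Hugging Face fill-mask pipeline; the model identifier and the sentences are merely illustrative, and the actual mask token depends on each model's tokenizer.

```python
# Sketch of collecting the two measures (probability and RSV) for a template pair.
# The model identifier and sentences are illustrative; each model has its own mask token.
from transformers import pipeline

fill = pipeline("fill-mask", model="BSC-TeMU/roberta-base-bne", top_k=10)

def top10_with_rsv(template):
    """Return (token, probability, RSV) for the model's top-10 suggestions."""
    sentence = template.replace("<mask>", fill.tokenizer.mask_token)
    suggestions = fill(sentence)                    # already sorted by probability
    return [(s["token_str"].strip(), s["score"], 10 - rank)
            for rank, s in enumerate(suggestions)]  # rank 0 -> RSV 10, rank 1 -> 9, ...

male = top10_with_rsv("Él ha conseguido el trabajo ya que es muy <mask>")
female = top10_with_rsv("Ella ha conseguido el trabajo ya que es muy <mask>")
```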

We then group these adjectives according to certain categorization criteria, explained later, which allows us to compare the variation of the ranking and probability values for each category between classes. In the following section, the method is detailed step by step.

4.3 Evaluation patterns and number of proposals from models

The first step is to prepare the sentences. For each of the templates in the template set \(T = \{T_1, T_2,..., T_t\}\), one sentence must be prepared for each of the values of the protected attribute \(V = \{ V_1, V_2,..., V_v\}\).

For our use case with the protected attribute gender, we have two protected values (V\(_\mathrm{{male}}\), V\(_\mathrm{{female}}\)) and a set of 96 base sentences. Therefore, a total of 192 sentences with regard to the protected attribute are generated.

For example, a valid pair is Él ha conseguido el trabajo ya que es muy <mask> and Ella ha conseguido el trabajo ya que es muy <mask> (respectively, in English, He got the job as he is very <mask> and She got the job as she is very <mask>). With this type of sentence, we are clearly looking for some kind of adjective or qualifier about the subject. As previously mentioned, in this work we focus on gender with the classes male and female; however, the framework is extensible to the study of other types of biases.

To generate the sentences, a set of 8 templates was defined and populated with 12 different subjects. Table 4 shows the male version of the templates together with an indicative English translation.

Table 4 Some of the proposed templates

The set of sentences is intended, on the one hand, to favour the elicitation of adjectives by the model and, on the other hand, to provide sufficient variety to explore the predictions of the models independently of characteristics such as sentence length. Table 5 shows one of the sentences together with its variations for both classes.

Table 5 One of the proposed templates with its 12 versions

For a given template t and a sentence s (generated from that template), the model being evaluated generates a probability distribution over words, from which we keep the top 10 suggestions \(W^{t,s} = (w^{t,s}_1, w^{t,s}_2,..., w^{t,s}_{10})\), where \(Prob(w^{t,s}_j)\) is the probability of the word at position j in the list of suggestions. It is important to note that, depending on the downstream task, considering only the most probable word might not be enough to measure bias in the model, as it is usual to introduce some randomness to avoid determinism when generating texts.

As not all the words returned by the model are adjectives, we use a PoS tagger to retrieve the part-of-speech tag of each suggested word. We only classify the ones with the AQ tag, which stands for qualifying adjective.
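A sketch of this filtering step is shown below. Since the specific tagger is secondary here, spaCy's Spanish pipeline is used as a stand-in, which emits the universal POS tag ADJ rather than the EAGLES tag AQ used in the paper.

```python
# Sketch of the adjective-filtering step. spaCy's Spanish pipeline is used here
# as a stand-in tagger; it emits the universal tag ADJ rather than the EAGLES tag AQ.
import spacy

nlp = spacy.load("es_core_news_sm")  # requires: python -m spacy download es_core_news_sm

def keep_adjectives(suggestions, template):
    """Keep only the suggested words tagged as adjectives when placed in the template."""
    kept = []
    for word, prob, rsv in suggestions:
        doc = nlp(template.replace("<mask>", word))
        tags = [tok.pos_ for tok in doc if tok.text == word]
        if tags and tags[0] == "ADJ":
            kept.append((word, prob, rsv))
    return kept
```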

Tables 6 and 7 show the ratio of adjectives obtained by the models for the male and female cases respectively, that is, out of all the words generated by each model over all the templates, how many were tagged as AQ. It is striking that the MarIA base model obtains the second and third places, while the large version obtains the penultimate places for both male and female.

Table 6 The proportion of adjectives for male templates
Table 7 The proportion of adjectives for female templates

4.4 Adjectives categorization

To understand the differences between the results of each model for the male and female versions, the adjectives obtained need to be assigned to previously defined categories. We therefore want classification schemes that allow us to detect differences in the results and to interpret those differences in a more semantic way. The categorizations have been made by consensus among the authors of this work. We have explored three different categorization schemes for adjectives (a toy lookup sketch of the categorization follows the list):

  1. Visible/Invisible, Positive/Negative. The baseline proposal is to classify the adjectives along two dimensions: the first dimension answers the question “Does the adjective refer to a visible characteristic?”, while the second answers the question “Is the adjective positive or negative?” We then have \(\mid C_{visibility\_polarity}\mid\) = 4, with the labels: Visible+, Visible-, Invisible+ and Invisible-.

  2. Accept/Reject, Self/Other, Love/Status. Wiggins (1979) proposes to categorize adjectives using three dimensions, with two possible values each. The first dimension distinguishes between accepting and rejecting: for example, to say that someone is kind or hard-working is to accept them for those characteristics, while to say that they are lazy would be considered rejecting. The second dimension is self/other; since all prepared sentences refer to others, we consider that this dimension always categorizes as “other”. The third dimension distinguishes between love and status, with love referring to the emotional plane and status to the social one. Combining these three dimensions would give eight categories, but given that the second dimension is always “other”, we are left with four possible combinations. Some examples can be found in Table 8. Therefore, we have \(\mid C_{psychological\_taxonomy}\mid\) = 4, with the labels: accept_love, accept_status, reject_status, and reject_love. One of the main problems with this categorization scheme is that it is not entirely clear which category to choose for some adjectives. The other major problem is that the original study focuses on personality traits, leaving all kinds of adjectives referring to the body out of this categorization.

     Table 8 Examples
  3. Supersenses. Tsvetkov et al. (2014) proposes a taxonomy of supersenses for adjectives. This taxonomy covers the set of all possible adjectives better than trait-based studies like the previous one. The categories proposed are perception, spatial, temporal, motion, substance, weather, body, feeling, mind, behavior, social, quantity and misc. Since we are drawing adjectives referring to people, given the context we provide in the sentences, the categories of perception, spatial, temporal, motion, substance, weather, quantity and misc are left out of the study. Therefore, \(\mid C_{supersenses}\mid\) = 5, with the labels body, feeling, mind, behavior and social.
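Computationally, each of these manually built schemes amounts to a lookup table from adjectives to category labels. The following toy sketch illustrates the idea; the adjectives and their assignments are illustrative examples only, not the authors' full annotation.

```python
# Toy illustration of the manual categorization as lookup tables.
# The adjectives and labels below are illustrative examples, not the full annotation.
SUPERSENSES = {
    "guapa": "body", "fuerte": "body",
    "feliz": "feeling", "inteligente": "mind",
    "trabajadora": "behavior", "famosa": "social",
}
VISIBILITY_POLARITY = {
    "guapa": "Visible+", "fea": "Visible-",
    "inteligente": "Invisible+", "vaga": "Invisible-",
}

def categorize(word, scheme):
    """Return the category of `word` under the given scheme, or None if unannotated."""
    return scheme.get(word)
```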

4.5 Metrics

From these categories, two values are obtained. The first is the Bias Probability Index (BPI), based on the probability assigned by the model to the word filling the mask, which is an internal measure of the model. The BPI is computed for each category of the chosen classification scheme of adjectives (so we have \(BPI_{C_i}, \forall C_i \in C\)); therefore, we can observe how a model is biased towards male or female in that dimension (i.e. category). The second is the Bias Rank Index (BRI), which is based on the retrieval status value (RSV), that is, the score derived from the position of the predicted word in the model's suggestion list. The item with the largest probability has a value of 10 (as we take the top 10 suggested adjectives from the model), the second most likely gets 9, and so on. This serves as an external measure of the model, as it describes the model's behavior without any hint of its internal values. For each model, we compute these metrics as the aggregate of probabilities or RSVs at the category level and for each value of the protected attribute, that is, for the male and female versions of the patterns. To make the values comparable, we normalize each category by the number of adjectives it contains.

Here is the formal notation of these two measures:

$$\begin{aligned} Prob_{C_i} = \frac{1}{N_i} \sum _{t = 1}^{\mid T \mid } \sum _{s = 1}^{\mid S_t \mid } \sum _{j = 1}^{10} Prob(w^{t,s}_j) \mid w^{t,s}_j \in C_{i} \end{aligned}$$
(1)
$$\begin{aligned} RSV_{C_i} = \frac{1}{N_i} \sum _{t = 1}^{\mid T \mid } \sum _{s = 1}^{\mid S_t \mid } \sum _{j = 1}^{10} (11 - j) \mid w^{t,s}_j \in C_{i} \end{aligned}$$
(2)

where:

T set of templates

\(S_t\) set of sentences generated from template t

\(w^{t,s}_j\) word at order j proposed by the model for sentence s in template t

\(C_i\) category i of adjectives

\(N_i\) total number of adjectives generated that are included in category \(C_i\)

In the end, we have a value of BPI and BRI for every value (male and female) in each category. The difference between these male and female measurements will provide a final bias value:

$$\begin{aligned} BPI_{C_i} = Prob^{male}_{C_i} - Prob^{female}_{C_i} \end{aligned}$$
(3)
$$\begin{aligned} BRI_{C_i} = RSV^{male}_{C_i} - RSV^{female}_{C_i} \end{aligned}$$
(4)
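The following sketch shows one possible reading of Eqs. (1)-(4): probability and RSV are aggregated per category and per gender (normalizing by the number of adjectives observed in each category), and the male minus female difference is then taken. The data layout is assumed, not taken from the authors' code.

```python
# One possible reading of Eqs. (1)-(4). `records[gender]` is assumed to be a list
# of (word, probability, rsv, category) tuples collected over all sentences.
from collections import defaultdict

def aggregate(records_for_gender):
    """Per-category mean probability and mean RSV (Eqs. 1-2)."""
    prob, rsv, count = defaultdict(float), defaultdict(float), defaultdict(int)
    for _word, p, r, cat in records_for_gender:
        prob[cat] += p
        rsv[cat] += r
        count[cat] += 1
    return ({c: prob[c] / count[c] for c in count},
            {c: rsv[c] / count[c] for c in count})

def bias_indices(records):
    """BPI and BRI per category (Eqs. 3-4) for the binary gender case."""
    prob_m, rsv_m = aggregate(records["male"])
    prob_f, rsv_f = aggregate(records["female"])
    cats = set(prob_m) & set(prob_f)
    bpi = {c: prob_m[c] - prob_f[c] for c in cats}
    bri = {c: rsv_m[c] - rsv_f[c] for c in cats}
    return bpi, bri
```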

Note that, in the case of a bias analysis related to a protected attribute with more than two values (like sexual orientation, nationality, profession or ethnicity), the metrics above can be generalized as the average distance between aggregated probabilities and ranks per category, so the proposed method can be applied to any type of bias analysis (see Eqs. 5 and 6).

$$\begin{aligned} BPI_{C_i} = \left( \frac{\mid V \mid ^2 - \mid V \mid }{2}\right) ^{-1} \sum _{j=1}^{\mid V \mid } \sum _{k=j+1}^{\mid V \mid } \left( Prob^j_{C_i} - Prob^k_{C_i}\right) \end{aligned}$$
(5)
$$\begin{aligned} BRI_{C_i} = \left( \frac{\mid V \mid ^2 - \mid V \mid }{2}\right) ^{-1} \sum _{j=1}^{\mid V \mid } \sum _{k=j+1}^{\mid V \mid } \left( RSV^j_{C_i} - RSV^k_{C_i}\right) \end{aligned}$$
(6)
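Under the reading above, the generalized index of Eq. (5) is simply the average pairwise difference of the aggregated values across all values of the protected attribute, as in the following sketch (shown for the probability index only).

```python
# Sketch of the generalized BPI of Eq. (5): the average pairwise difference of
# the aggregated probabilities over all values of the protected attribute.
from itertools import combinations

def generalized_bpi(prob_by_value, category):
    """`prob_by_value` maps each attribute value (e.g. "male", "female", ...)
    to its per-category aggregated probabilities."""
    pairs = list(combinations(prob_by_value, 2))
    diffs = [prob_by_value[a][category] - prob_by_value[b][category] for a, b in pairs]
    return sum(diffs) / len(pairs)
```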

5 Experiments

We have applied the method to several models (the best known in the literature and the most downloaded from Huggingface's repository). For these models, the three categorization schemes have been used to measure gender bias. The categorization process was carried out using the expert judgment method in three iterations: we made a first independent categorization, then shared and discussed the results, identifying discrepancies; based on that, the criteria were refined and improved, and the process was repeated until a high level of agreement was reached.

In order to visually portray bias, we use tables that contain a numerical value in each cell, indicating the degree of disparity between male and female. When the value is negative, it indicates a bias towards the female class and the cell background is colored red; conversely, a positive value signifies a bias towards the male class and the cell background is colored blue. The intensity of the color indicates the level of bias: the highest and lowest values in each column are rendered with the most intense colors, while cells with values close to 0, where no bias is observed, are the least intense. Such a graphical representation helps in identifying and understanding the extent of gender bias within a model.
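A table of this kind can be produced, for example, with pandas styling; the following sketch uses placeholder values and hypothetical model names, and only approximates the column-wise color scaling described above.

```python
# Sketch of the red/blue diverging table. Values and model names are placeholders.
import pandas as pd

bpi = pd.DataFrame(
    {"body": [-0.21, -0.08, 0.02], "behavior": [0.12, 0.03, -0.01]},
    index=["model-a", "model-b", "model-c"],  # hypothetical model names
)

# Column-wise gradient: negative (female bias) towards red, positive (male bias)
# towards blue; this approximates the described scaling, which is centered near 0.
styled = bpi.style.background_gradient(cmap="RdBu", axis=0).format("{:+.2f}")
html = styled.to_html()
```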

5.1 Models analysed

Several models available for Spanish in the repository maintained by the Huggingface project have been evaluated. Huggingface is the main repository of deep-learning-based language models for NLP tasks (Wolf et al., 2020). A very large share of researchers, along with a large community from industry, use the models found in this repository, and most of the major models that are domain-adapted or fine-tuned to specific tasks are shared through it.

The models selected were pretrained with a masked language modeling objective on Spanish texts. For our study, the selected models had to produce adequate predictions, that is, for the given sentences whose masked positions had to be replaced with words, only complete Spanish words (not subwords) should be proposed. Consequently, some models were discarded for not providing predictions in Spanish, and others for not giving complete terms, possibly because they are not really trained for the task under which they are listed. This left us with 20 functional models out of the 26 models found in the repository at the time of our research. They are listed in Table 9.

Table 9 Spanish language models selected for evaluation from the hugging face repository
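The screening described above can be automated along the following lines; the check shown is an assumption about how such a filter could be implemented, not the authors' exact procedure.

```python
# Sketch of a screening check: keep a model only if its mask predictions for a
# Spanish probe sentence are complete words (no subword pieces). Illustrative only.
from transformers import pipeline

def produces_complete_words(model_name, probe="Ella es muy {mask}."):
    fill = pipeline("fill-mask", model=model_name, top_k=10)
    sentence = probe.format(mask=fill.tokenizer.mask_token)
    for suggestion in fill(sentence):
        token = suggestion["token_str"].strip()
        # WordPiece continuations keep a "##" prefix; non-alphabetic strings are
        # also taken as a sign that the prediction is not a complete Spanish word.
        if token.startswith("##") or not token.isalpha():
            return False
    return True
```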

These models are based either on BERT (Devlin et al., 2019) or RoBERTa (Zhuang et al., 2021), except one, which is based on ELECTRA (Clark et al., 2020). They either focus on Spanish or include Spanish among the supported languages. They are intended for general use, except for ALBERTI, which is trained on poetry, and the BSC-TeMU/RoBERTalex model, trained on legal texts. Although these two models differ from the rest, we consider it interesting to evaluate whether gender bias is also present in such domain-oriented models.

5.2 Results

For every sentence in the pattern corpus, we obtain the top 10 tokens that the model suggests to fill in the mask for both the male and female versions. For each token, we also store its rank among the 10 suggestions and its probability (sigmoid of the logit output) of filling the mask according to the model itself. Table 10 shows the adjectives generated by the model for a sample pair of male/female sentences and their probabilities (scores). Table 11 shows the example translated.

Table 10 Outputs by MarIA-base model for sentences “El maestro es el más <mask>” and “La maestra es la más <mask>”
Table 11 The previous example translated. The translated template is the same for male and female: “The teacher is the most <mask>”

5.2.1 Visible/invisible, positive/negative

Table 12 shows how each category exhibits different behavior. The Visible+ category is strongly biased towards the female class, with quite large differences in general; BETO and ELECTRICIDAD stand out. The Invisible+ category presents behavior that depends on the model, with very popular models such as BETO or ELECTRICIDAD biased towards the male version, while other models such as Recognai or ALBERTI lean towards the female version. The Visible- category is quite balanced and the differences are small. Finally, in the Invisible- category, the male version predominates, and we can observe strong variations if, instead of looking at the external state of the model (RSV), we look at the internal one (probability), in models like ALBERTI (the probability difference is 3.57 times greater than the RSV difference) or BERTIN in its random version (2.24 times greater).

Table 12 Differences between male and female for visibility-polarity categories

From these results, we can already sense a certain bias towards women for visible and positive adjectives, which could be adjectives related to physique, and a bias towards men for non-visible adjectives, which could be related to personality. This phenomenon is better understood with the other categorizations, as described below.

5.2.2 Accept/reject, love/status

Again, scores and tables are recomputed, but based on a different grouping of the adjectives. In this section, we explore the results (see Table 13) according to the Accept/Reject, Love/Status categorization scheme proposed by Wiggins (1979).

Table 13 Differences between male and female using Wiggins’ categories

Under these categories, we can see a certain tendency to associate men with positive status in models such as BETO, MarIA, Geotrend, Amine or Recognai, and women with sentimental characteristics (love) in models such as MarIA, Recognai, ALBERTI or Geotrend. However, it is by no means generalized. The reject and love categories are, in general, less unbalanced, except for ALBERTI and BERT-multilingual. Finally, reject+status as well as reject+love are slightly unbalanced toward men in general, but nothing is particularly significant.

In general, we do not find this categorization very useful. It only allows us to sense a certain imbalance in the material on which the models are trained, relating women more to the sentimental plane and men to status. To better understand how gender bias is present, we explored a last categorization scheme that moves away from personality traits and allows a larger set of adjectives to be categorized in a clearer and more comprehensive way.

5.2.3 Supersenses

Under the Supersenses categorization scheme proposed by Tsvetkov et al. (2014), we observe behavior similar to that reported for the first scheme (see Table 14). Almost all models give more weight to the category referring to physical appearance (body) in the female version than in the male counterpart.

Table 14 Differences between male and female under Supersenses categorization

We can see that the likelihood of the model suggesting body-related words is higher when predicting words to fill the mask in female templates. This occurs for all models in the RSV variable referring to the ranking, and for 19 out of 20 models according to the probability metric. In the cases where it does not occur, the difference is minimal, which implies a cleaner pair of models in terms of gender bias. Only some BERTIN models show a slight bias, while the other models (BETO, MarIA, ELECTRICIDAD, MMG, BERT-multilingual, Geotrend, RoBERTalex and Recognai) show a strong bias towards the female class.

For the behavior category we observe the opposite situation: in 11 of the 20 models the probability is much higher for male sentences, while four of the models are strongly biased toward women. For the social category, the labels go mainly to the male class, although the difference is not very high. For the feel category, the behavior is more balanced and more attenuated, except for RoBERTa and ALBERTI in favour of the female class and a couple of the BERTIN models in favour of the male class; in general, this category is not strongly biased.

In Figs. 1, 2 and 3 we can see how the adjectives are distributed proportionally across the categories for three of the models. We can easily see the important differences in the body category: these three models generate more adjectives related to the body for the female templates according to the categories of the supersenses scheme.

Fig. 1 Radar chart for MarIA base

Fig. 2 Radar chart for BETO uncased

Fig. 3 Radar chart for BERTIN stepwise

6 Conclusions and future work

It is evident that there are certain biases in Spanish language models, as we found great differences in the way women are talked about with respect to men. Some of the most important models, such as BETO or the recent MarIA, among others, present a strong bias towards the body when talking about women and towards behavior when talking about men. For example, in the MarIA base model (BSC-TeMU/roberta-base-bne), for the pair of templates "La chica es la más [MASK]" and "El chico es el más [MASK]" (translated "The girl/boy is the most [MASK]") we observe a huge difference. The top 8 results for the female version refer to the woman's body ("guapa, sexy, bonita, bella, linda, fea, hermosa, mona"), while for the male version this only happens in half of the results ("guapo, listo, sexy, bonito, grande, fuerte, rápido, lindo"). This should be taken into account when considering these models for decision-making in real-world environments, as the evident shift in how the model treats male versus female features could move a system away from fair predictions.

This work proposes an approach to finding biases in Spanish models. The method, which can be easily generalized to other types of biases, provides coherent metrics to compute inter-class imbalances among the different values a protected attribute may take. Besides, the use of meaningful classification schemes provides insights into the way the models are biased, which could serve as supporting information for bias studies in terms of explainability. In this regard, it is important to use classification schemes that are adequate for the type of bias under study in order to achieve this ability to understand the specific behavior of a model.

There are multiple paths to take when studying bias; here we describe some approaches for future work. For the evaluation part, we foresee creating corpora that represent other dimensions beyond gender, such as ethnicity or religion, or less obvious classes, such as socio-economic status. In addition to creating other corpora, the proposed method could be applied using resources such as the EXIST dataset (Rodríguez-Sánchez et al., 2021) for identifying sexism. By using this dataset, we could generate a set of labeled phrases that can be transformed into templates, enabling us to obtain a more representative and accurate set of phrases that reflect reality, which can then be used to perform the proposed evaluation.

Another way to extend this study is to apply the method to models oriented to other specific tasks, such as text generation or sentiment prediction. A biased model that is part of an automatic content moderation system can be very harmful.

Additionally, the existence of a dataset focused on gender bias like EXIST (Rodríguez-Sánchez et al., 2021) could help evaluate how bias-mitigated models perform against non-mitigated versions, as different sequence probabilities would result from these models when analyzing a sexist text.

Further in the future, once an evaluation method is available, we plan to research methods and strategies to mitigate the bias and then evaluate again to see how effective the mitigation was. Mitigation measures have mostly been applied, again, to English models. Many of the available techniques are neither trivially adaptable to other languages nor easy to automate, so exploring this direction is challenging.

7 Final remarks on reproducibility

Our tool for exploring the model suggestions for each sentence, the statistics of adjectives in the models, the charts with the proportion per category for each model, and the tables that visually compare the differences between the models for each category are available. Both the tool and the research source code can be found at: https://github.com/IsGarrido/Evaluating-Gender-Bias-in-Spanish-Deep-Learning-Models.