1 Introduction

Language models are used for a variety of applications, such as CV parsing for a job position or document ranking for web search [5, 33]. Recently, a big step forward in the field of natural language processing (NLP) was the introduction of language models based on word embeddings, i.e. representations of words as vectors in a multi-dimensional space. These models translate the semantics of words into geometric properties, so that terms with similar meanings tend to have their vectors close to each other, and the difference between two embeddings represents the relationship between their respective words [40]. For instance, it is possible to retrieve the analogy \(man: king = woman: queen\) because the difference vectors \(\overrightarrow{queen} - \overrightarrow{king}\) and \(\overrightarrow{woman} - \overrightarrow{man}\) share approximately the same direction.

Word embeddings boosted results in many NLP tasks, like sentiment analysis and question answering. However, despite the growing hype around them, these models have been shown to reflect the stereotypes of our society, even when the training phase is performed over text corpora written by professionals, such as news articles. For instance, they return sexist analogies like \(man: programmer = woman: homemaker\) [7]. The social bias in the geometry of the model is then inevitably reflected in downstream applications like web search, CV parsing or hate speech detection [3, 7, 43]. In turn, this phenomenon favours the spread of prejudice towards social categories that are already frequently penalised, such as women or African Americans.

Lately, sentence embeddings—vector representations of sentences based on word embeddings—have been increasing in popularity, achieving exceptional results in many language understanding tasks, such as semantic similarity or sentiment prediction [17, 49]. Training language models on large corpora, which often encapsulate historical bias in the form of social stereotypes, risks reinforcing the bias originally present in our society; as a result, training datasets should be adjusted to remove bias [54]. Therefore, it is of the utmost importance to expand research on how sentence embedding encoders internalise the semantics of natural languages. An important step in this direction is to define metrics that are able to reflect and quantify social bias in sentence encoders. Furthermore, studying and limiting the causes and consequences of bias in language models is an extremely important task [4, 6].

This work expands research on social bias in embedding-based models, focusing specifically on gender bias in sentence representations. First, we propose a method to estimate gender bias in sentence embeddings, highlighting the correlation between bias and stereotypical concepts in the sentence. Our solution, named bias score, is highly flexible and designed to be easily adapted to both different kinds of social biases (e.g. ethnic, religious) and various sentence encoders. Moreover, since gender bias is determined by the internalisation of stereotypical associations in language models, bias score makes it possible to identify the stereotyped sentences that are responsible for increasing gender bias in the output embeddings encoded by the model. Therefore, in the second part of the paper, we leverage bias score to retrieve the most stereotyped sentences from the Stanford Natural Language Inference corpus (SNLI) [9], a large text corpus suitable for training general-purpose sentence encoders, such as those proposed by [17] and [13]. We then outline two approaches to make SNLI fairer: removing entries associated with the highest bias scores, and performing data augmentation by compensating stereotyped sentences with their gender-swapped counterparts. Finally, we retrain a BiLSTM sentence encoder [17] on different, fairer versions of SNLI, testing and comparing it with its original counterpart from both the fairness and accuracy viewpoints in downstream tasks.

Our contributions in this work include: (a) a novel metric to estimate gender bias in sentence embeddings, leveraging the semantic importance of words and previous research on bias in word embeddings; (b) two methods to mitigate gender bias in sentence encoders by improving training data, performing data subtraction and data augmentation, respectively; (c) an analysis of the effect of such mitigation actions when retraining a BiLSTM sentence embedding encoder, with a comparison with traditional methods for gender bias mitigation; (d) a demonstration of the flexibility of our approach to be adapted to other language models, such as those based on transformer architectures.

The rest of the paper is structured as follows. Section 2 explores the state of the art on bias identification and reduction in language models, focusing on word and sentence embeddings. Section 3 introduces and defines bias score, our new metric for estimating gender bias in sentence representations. At the end of the section, we provide some examples of gender bias estimation via bias score. Section 4 first describes how to leverage bias score to make text corpora fairer, then explores a new approach to reduce bias in sentence encoders by retraining them on improved versions of their training data. Section 5 shows the results of our bias reduction methodology, discussing the benefits of the procedure from both the perspectives of quality and fairness. Section 6 describes how to extend our solution to transformer-based sentence encoders. Finally, Sect. 7 concludes the paper and outlines future work.

2 Related Work

Although language models are successfully used in a variety of applications, bias and fairness in NLP have received relatively little consideration until recent times, running the risk of favouring prejudice and strengthening stereotypes [14].

Static word embeddings were the first to be analysed. In 2016, they were shown to exhibit the so-called gender bias, defined as the cosine of the angle between the word embedding of a gender-neutral word and a one-dimensional subspace representing gender [7]. This approach was later adapted for non-binary social biases such as racial and religious bias [34]. A debiasing algorithm was also proposed to mitigate gender bias in word embeddings [7]; however, it was later shown that it fails to entirely capture and remove bias [24]. The Word Embedding Association Test (WEAT) [11] was created to measure bias in word embeddings following the pattern of the implicit-association test for humans. WEAT demonstrated the presence of harmful associations in GloVe [46] and word2vec [38, 39] embeddings.

Recently, a number of different approaches have extended the research field. A new debiasing procedure was proposed to reduce gender bias by introducing a term into the loss function used during the training phase of the model [48]. Additionally, [8] presented a regularisation procedure that aims at debiasing a language model by minimising the projection of encoder-trained embeddings onto a subspace that encodes gender. Similarly, [59] used model compression techniques, a type of regularisation technique, to reduce toxicity and bias originally present in generative language models. The system proposed by [32] mitigates bias by employing counterfactual data augmentation, proving that modifying the training data works better than changing the actual geometry of the embeddings. On a similar note, the approach described by [10] perturbs the original embedding training data to reduce the overall bias present in them. [27] presented a method to preserve gender-related information in feminine and masculine words while removing bias from stereotypical words. Still using GloVe as the language model, [56] described an innovative procedure called Double-Hard Debias to cope with changes in word frequency statistics that commonly have an undesirable impact on standard debiasing methods. Finally, [60] described a novel method exploiting causal inference to reduce not only gender bias associated with a gender direction, but also gender bias stemming from word embedding relations.

More recently, contextualised word embeddings like BERT [18] proved to be very accurate language models. However, despite the literature suggesting that they are generally less biased compared to their static counterparts [2], they still display a significant amount of social bias [36]. WEAT was extended to measure bias in sentence embedding encoders: the Sentence Encoder Association Test (SEAT) is again based on the evaluation of implicit associations and shows that modern sentence embeddings also exhibit social bias [36]. Meanwhile, attempts at debiasing sentence embeddings faced the issue of not being able to recognise neutral sentences, thus debiasing every representation regardless of the gender attributes in the original natural language sentence, leading to a loss of correct semantics [30].

Recently, [61] suggested the generation of implicit gender bias samples at sentence-level, which, along with a novel metric, can be used to accurately measure gender bias on contextualised embeddings. [28] proposed a fine-tuning method for debiasing word embeddings that can be applied to any pre-trained language model. Additionally, researchers have started working on generative transformer models. For instance, [25] proposed to mitigate gender disparity in text generation by learning a fair model with knowledge distillation. Last but not least, two comprehensive survey papers highlighted the latest advances on this front: [45] presents an overview of the most common debiasing methods in the context of vision and language research, while [37] proposes a deep empirical analysis of several bias mitigation techniques with different language models.

3 Gender Bias Estimation

Gender bias in word embeddings is typically estimated by computing the cosine similarity between word vectors and a gender direction identified in the vector space [7]. Cosine similarity is a popular metric to establish the semantic similarity of words based on the angle \(\theta\) between their embedding vectors \(\vec{u}\) and \(\vec{v}\): \(\cos(\theta) = \frac{\vec{u} \cdot \vec{v}}{\Vert \vec{u} \Vert \, \Vert \vec{v} \Vert}\). The closer \(\cos(\theta)\) is to 1, the higher the similarity between \(\vec{u}\) and \(\vec{v}\). In word embedding models, semantic similarity with respect to a gender direction (typically computed with PCA from multiple gender words) means that a word vector contains information about gender. Since only gender-neutral words can be biased, gender words like man or woman are assumed to contain correct gender information.
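As a minimal sketch (not the authors' code), assuming GloVe vectors are available as a dictionary of NumPy arrays and the gender direction \(\overrightarrow{g}\) of Sect. 3.2 has been computed, the word-level measurement can be written as follows:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_gender_bias(word: str, embeddings: dict, g: np.ndarray) -> float:
    """Signed gender bias of a gender-neutral word: positive values lean
    towards the female end of the gender direction, negative values
    towards the male end (sign convention of Sect. 3.2)."""
    return cosine(embeddings[word], g)
```

Under this sign convention, a word like nurse is expected to yield a positive score and a word like programmer a negative one, consistent with the stereotypes discussed above.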

Recently, sentence representations have become increasingly popular, but the approach used for measuring gender bias in word-level representations cannot be easily adopted for them, demanding a new methodology. The main problem is that gender-neutral sentences cannot be identified and listed: unlike words, sentences are infinite in number. Moreover, a sentence may contain gender bias even when it carries explicit gender information. Consider the sentence my mother is a nurse: the word mother contains correct gender semantics, but the word nurse is female-stereotyped. Table 1 shows that representations of gender-neutral sentences like someone is a nurse still contain a lot of “false” gender information due to the bias associated with the word nurse.

Table 1 Gender information from cosine similarity for sentence embeddings encoded by InferSent [17] and SBERT [49]

Therefore, it is important to distinguish between the amount of encoded gender information coming from gender words, and the amount coming from biased words. For this reason, we adopt a more dynamic approach: we start at the word level, using the cosine similarity between neutral word representations and the gender direction to estimate word-level gender bias. Then, we sum the bias of all the words in the sentence, normalising it with respect to the length of the sentence and to the contextualised semantic importance of each word. This decision is grounded on two observations: first, the semantics of a sentence largely depends on the semantics of the words contained in it; second, sentence embedding encoders are generally based on predefined word embedding models [17, 49]. In Sect. 3.5, we test our metric on sentence representations produced by InferSent [17], a sentence encoder by Facebook AI that achieved great results in a variety of natural language understanding tasks [16]. Since InferSent is based on GloVe [46] word vectors, we adopt GloVe representations to first quantify gender bias at word level.

3.1 Bias Score

An overview of the approach adopted to calculate bias score is illustrated in Fig. 1. We consider as inputs a sentence in natural language, n pairs of gender words, and a list of words with explicit gender connotation. The outputs are two estimates corresponding to the amount of female-related and male-related gender bias at sentence level. In particular, we consider four fundamental elements for gender bias estimation, representing the inputs to the bottom-right block in Fig. 1: (a) the gender direction \(\overrightarrow{g}\) previously identified in the vector space; (b) a list L of gender words in the same language as the encoder (here, English); (c) the semantic importance \(I_w\) of each word in the sentence according to the encoder; (d) the word-level embeddings of the input sentence.

Fig. 1 Overview of the framework to compute bias score

The amount of female-related and male-related gender bias is represented by a positive and a negative value, respectively, obtained from the sum of the gender bias associated with each word, estimated from the cosine similarity with respect to the gender direction. Since gender bias is a characteristic of gender-neutral words, gendered terms are excluded from the computation and their bias is always set to zero. In fact, if the bias score of gender words were considered, it would create an additive term, i.e. an offset in the final score, which might hide the real bias in the geometry of the sentence representation. In detail, for each neutral word w in the sentence, we compute its gender bias as the cosine similarity between its word vector \(vec_w\) and the gender direction \(\overrightarrow{g}\), and then we multiply it by the word importance \(I_w\). In particular, for a given sentence s:

$$\begin{aligned} BiasScore_{F}(s) \, = \, \sum _{\begin{array}{c} w \in s\\ w \notin L \end{array}} \; \underbrace{\cos (vec_w, \overrightarrow{g})}_{> 0} \; \times I_w \end{aligned}$$
(1)
$$\begin{aligned} BiasScore_{M}(s) \, = \, \sum _{\begin{array}{c} w \in s\\ w \notin L \end{array}} \; \underbrace{\cos (vec_w, \overrightarrow{g})}_{< 0} \; \times I_w \end{aligned}$$
(2)

Notice that, for each word w that is gender-neutral, \(w \notin L\). Also, word importance \(I_w\) is always a positive number, and the cosine similarity can be either positive or negative. Therefore, bias score keeps the estimations of gender bias towards the male and female directions separated. A slightly different approach allows us to derive a single value estimation of gender bias at sentence level, by computing the absolute value of each word-level bias:

$$\begin{aligned} Abs\text {-}BiasScore(s) \, = \, \sum _{\begin{array}{c} w \in s \\ w \notin L \end{array}} \, \mid \, \underbrace{\cos (vec_w, \overrightarrow{g}) \times I_w}_{\textit{word-level bias}} \, \mid \, \end{aligned}$$
(3)

This proves useful in certain situations, such as when sorting multiple sentences according to the total amount of associated bias score. In the following sections, we go into more detail by describing how we derive \(\overrightarrow{g}\), L and \(I_w\).
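A sketch of Eqs. 1–3 in code (ours, not the authors'), assuming the word vectors, the gender direction \(\overrightarrow{g}\), the gender word list L and the percentage importance \(I_w\) are available as derived in the following subsections:

```python
import numpy as np

def bias_scores(tokens, word_vectors, importance, L, g):
    """Return (BiasScore_F, BiasScore_M, Abs-BiasScore) for a tokenised sentence.

    tokens:       words of the sentence
    word_vectors: dict word -> embedding vector
    importance:   dict word -> percentage importance I_w (Sect. 3.4)
    L:            set of gender words (Sect. 3.3), whose bias is fixed to zero
    g:            gender direction (Sect. 3.2)
    """
    female, male, absolute = 0.0, 0.0, 0.0
    for w in tokens:
        if w in L or w not in word_vectors:
            continue                                  # skip gender words and OOV tokens
        v = word_vectors[w]
        cos = float(np.dot(v, g) / (np.linalg.norm(v) * np.linalg.norm(g)))
        word_bias = cos * importance.get(w, 0.0)      # word-level bias
        if word_bias > 0:
            female += word_bias                       # Eq. 1: female direction
        else:
            male += word_bias                         # Eq. 2: male direction
        absolute += abs(word_bias)                    # Eq. 3: absolute bias score
    return female, male, absolute
```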

3.2 Gender Direction

The first step of our method is to identify in the vector space a single dimension comprising the majority of the gender semantics of the model. The resulting dimension \(\overrightarrow{g}\), named gender direction, serves as the second term in the cosine similarity function, to establish the amount of gender semantics encoded in a vector for a given word, according to the model being analysed. We mainly test bias score on the sentence encoder InferSent [17], which is based on GloVe [46], a word embedding model with a vector space of 300 dimensions. In general, it is important to adopt word embedding models matched to the encoder under analysis. For instance, SBERT produces sentence representations based on word-level BERT embeddings [49]. See Sect. 6 for an example of bias score with SBERT.

Inside the vector space, the difference between two embeddings returns the direction that connects them. In the case of the embeddings \(\overrightarrow{she}\) and \(\overrightarrow{he}\), their difference vector \(\overrightarrow{she} - \overrightarrow{he}\) represents a one-dimensional subspace that identifies gender in GloVe. However, the difference vector \(\overrightarrow{woman} - \overrightarrow{man}\) also identifies gender, yet it represents a slightly different subspace than \(\overrightarrow{she} - \overrightarrow{he}\). Therefore, to avoid inconsistency, we take into consideration several pairs of gender words and perform a Principal Component Analysis (PCA) to reduce their dimensionality to one. In particular, we select the following ten pairs of gender words: woman–man, girl–boy, she–he, mother–father, daughter–son, gal–guy, female–male, her–his, herself–himself, Mary–John.
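A sketch of this PCA step with scikit-learn; it operates on the pair difference vectors, which is one common variant (the original hard-debiasing work instead centres each pair before PCA, yielding a very similar direction):

```python
import numpy as np
from sklearn.decomposition import PCA

GENDER_PAIRS = [("woman", "man"), ("girl", "boy"), ("she", "he"),
                ("mother", "father"), ("daughter", "son"), ("gal", "guy"),
                ("female", "male"), ("her", "his"),
                ("herself", "himself"), ("Mary", "John")]

def gender_direction(word_vectors):
    """Top principal component of the gender-pair difference vectors."""
    diffs = np.stack([word_vectors[f] - word_vectors[m] for f, m in GENDER_PAIRS])
    pca = PCA(n_components=1).fit(diffs)
    print("explained variance:", pca.explained_variance_ratio_[0])  # ~0.58 reported in Fig. 2a
    g = pca.components_[0]
    return g / np.linalg.norm(g)
```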

As shown in Fig. 2a, the top component resulting from the analysis is significantly more important than the other components, explaining 58% of the variance. We use this top component as gender direction, and we observe that embeddings of female words have a positive cosine with respect to it, whereas for male words we have a negative cosine.

Following the advice from [21], and to assess the quality of the gender direction obtained, we further perform PCA starting from an extended list of 50 pairs of gender words taken from [7], and compare the result with \(\overrightarrow{g}\). From the full list of pairs available in the authors' repository,Footnote 1 we select only those consisting of words present in GloVe. Figure 2b shows that again the top component is clearly the most important, explaining 41% of the variance. Moreover, its cosine similarity with respect to \(\overrightarrow{g}\) is above 93%, demonstrating the quality of the gender direction selected.

Fig. 2 Top ten components in PCA from using 10 and 50 pairs of gender words. The top component explains 58% and 41% of the variance, respectively

3.3 Gender Words

A list L of gender words is fundamental to estimate gender bias, because only gender-neutral entities can be biased. Since the subset \({\mathcal {N}}\) of gender-neutral words in the vocabulary of a language is very large, while the subset \({\mathcal {G}}\) of gender words is relatively small (especially in the case of the English language), we derive \({\mathcal {N}}\) as the difference between the complete vocabulary of the language \({\mathcal {V}}\) and the subset \({\mathcal {G}}\) of gender words: \({\mathcal {N}} = {\mathcal {V}} \setminus {\mathcal {G}}\). To achieve this, we define a list L of words containing as many of the elements of the subset \({\mathcal {G}}\) as possible. Therefore, gender bias is estimated for all elements \(w_n\) in the subset \({\mathcal {N}}\) (neutral words), whereas for all elements \(w_g\) in the subset \({\mathcal {G}}\) (gender words) the gender bias is always set to zero:

$$\begin{aligned} \forall \, w_n \in {\mathcal {N}}, \; bias(w_n) \ne 0, \\ \forall \, w_g \in {\mathcal {G}}, \; bias(w_g) = 0. \end{aligned}$$

For this reason, the elements of L are never considered when estimating gender bias. As a matter of fact, we consider the gender information encoded in their word embeddings to always be appropriately expressed. Examples of gender words include he, she, sister, father, councilman, heroine, princess.

Our list L contains a total of 6562 gender words, of which 409 and 388 are common nouns selected starting from the works of [7] and [63], respectively. All words are listed in their lower-cased and capitalised versions, in both singular and plural forms. Additionally, we added 5765 unique given names taken from Social Security card applications in the USA.Footnote 2
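A simplified sketch of how such a list could be assembled; the inputs are placeholders for the resources cited above, and irregular plurals would need dedicated handling:

```python
def build_gender_word_list(common_nouns, given_names):
    """Assemble the list L of gender words.

    common_nouns: singular gendered common nouns, e.g. "sister", "councilman"
    given_names:  given names, e.g. from US Social Security card applications
    """
    L = set()
    for noun in common_nouns:
        for form in (noun, noun + "s"):        # naive pluralisation, for illustration only
            L.add(form.lower())
            L.add(form.capitalize())
    for name in given_names:
        L.add(name.lower())
        L.add(name.capitalize())
    return L
```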

3.4 Word Importance

Following the approach described by [17], word importance is estimated based on the max-pooling operation commonly performed by sentence encoders to reduce a variable number of word representations to a fixed-size sentence embedding. Our approach consists of counting how many times a word representation is selected to be part of the sentence embedding during the max-pooling phase. In the case of InferSent, this is equivalent to counting the number of times that the max-pooling procedure selects the hidden state \(h_t\), for each time step t in the neural network underlying the language model, with \(t \in [0, \dots , T]\) and T equal to the number of words in the sentence. Note that \(h_t\) can be seen as a sentence representation centred on the word \(w_t\), i.e. the word at position t in the sentence.

We consider both the absolute importance of each word, and the percentage with respect to the total absolute importance of all the words in the sentence. For instance, in the example of Fig. 3, the absolute importance of the word saxophone is 1106, meaning that its vector representation is selected by the max-pooling procedure for 1106 dimensions out of the total 4096 dimensions of the sentence embeddings computed by InferSent. The percentage importance is \(\frac{1106}{4096} \approx 0.27\), meaning that the word counts for around 27% of the semantics of the sentence. In particular, the percentage importance is also independent of the length of the sentence, despite the fact that very long sentences generally have a more distributed semantics. For this reason, we choose to adopt the percentage importance for computing bias score.
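A sketch of the counting step, assuming access to the T word-level hidden states that the encoder max-pools into the final embedding (for InferSent, a T x 4096 matrix):

```python
import numpy as np

def word_importance(hidden_states, words):
    """Percentage importance of each word of a sentence.

    hidden_states: array of shape (T, D) with one BiLSTM hidden state per word;
                   the sentence embedding is the dimension-wise max over rows.
    words:         the T words of the sentence, in order.
    """
    winners = hidden_states.argmax(axis=0)                 # word selected for each dimension
    counts = np.bincount(winners, minlength=len(words))    # absolute importance per word
    return [(w, counts[t] / hidden_states.shape[1])        # percentage importance
            for t, w in enumerate(words)]
```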

Fig. 3 Word importance for the input sentence A man is playing the saxophone

3.5 Bias Score Examples

Table 2 illustrates a detailed example of gender bias estimation via bias score for the sentence She likes the pink dress. The example shows how gender stereotypes, such as the association between women and pink dresses, are heavily internalised in the final sentence representation: in fact, they account for the majority of the gender bias in the embedding according to bias score.

Table 2 Bias score example for the sentence She likes the pink dress

Additionally, we use bias score to estimate gender bias for sentences contained in SNLI, a large text corpus for training language models [9]. According to the experiments, the sentences with the highest bias score towards the male direction, estimated by applying Eq. 2, describe situations from popular sports like baseball and football, which are frequently associated with men and rarely with women. Similarly, the sentences with the highest bias score in the female direction, estimated via Eq. 1, illustrate female stereotypes, like participating in beauty pageants, applying make-up or working as a nurse. Table 3 displays the most-biased sentences in the SNLI corpus according to our metric, in both the male and female directions.

Results are similar when estimating the absolute bias score using the more general Eq. 3: top entries include sentences with a high score in either the female or male direction, like football players scoring touchdowns or the bikini is pink. Additionally, sexualised sentences like the pregnant sexy volleyball player is hitting the ball are also present.

Table 3 Highest bias scores for sentences in SNLI, towards the female and male directions

4 Gender Bias Reduction

Embedding-based language models learn stereotypical associations during the training phase, even if data are seemingly verified and safe [7]. Therefore, to mitigate gender bias and limit the internalisation of stereotypical conceptions, our approach aims to detect stereotyped entries in text corpora used for training language models, namely SNLI [9]. We explore two directions: removing stereotyped entries from the corpus, or compensating by adding counterfactual entries regarding gender. In this section, we describe in detail how to improve the fairness of a training corpus, and then test our intuition by retraining a sentence encoder on the new corpus obtained. Our goal is to improve the degree of fairness in the encoder, without losing accuracy in downstream tasks, by retraining it on a fairer and less stereotyped corpus of text. To evaluate both properties (quality and fairness), we test the newly retrained models with SentEval [16] and SEAT [36], respectively, as described in Sect. 5. An overview of the adopted methodology is illustrated in Fig. 4.

Fig. 4 Overview of the proposed methodology for gender bias reduction in sentence encoders

4.1 Training Corpus

The Stanford Natural Language Inference corpus (SNLI) is a large collection of English sentence pairs written by humans for textual inference tasks [9]. It is one of the largest resources of its kind, listing more than 570k pairs of sentences and more than 600k unique sentences in the train set alone. Each entry is composed of a premise sentence, a hypothesis sentence, and a gold label with one of three possible values: entailment, contradiction, neutral. The general goal of inference tasks is to predict whether the hypothesis sentence logically follows the premise sentence (entailment), contradicts it (contradiction), or is unrelated to it (neutral). According to the authors, the size and diversity of the dataset make it possible to train language models for sentence meaning representation. Table 4 illustrates some examples of SNLI entries.

Despite its widespread use, the literature has shown that SNLI contains gender and ethnic stereotypes, alongside harmful or pejorative language associated with social categories like women, Muslims and African Americans [51]. For this reason, our intuition is that by improving the fairness of SNLI, i.e. by getting rid of stereotypical concepts, we can effectively prevent natural language models trained on this corpus from internalising them.

Table 4 Sample entries from the SNLI corpus. Each entry contains a premise sentence, a hypothesis sentence and a gold label describing their relationship. C = contradiction, N = neutral, E = entailment

4.2 Improving SNLI Fairness

To improve the fairness of SNLI, we follow two approaches: data subtraction and data augmentation. The first reduces the number of entries in the corpus by removing stereotyped sentence pairs, identified from the bias score associated with their embeddings. The second adds entries to the corpus, in order to balance stereotyped sentence pairs and reach a higher degree of gender equality.

In both cases, the first step is to apply our metric to all the sentences contained in SNLI in order to find the entries associated with the highest bias score. This methodology is grounded on the observation that a high bias score is correlated with stereotyped or sexist sentences, as illustrated by the examples in Sect. 3.5. Moreover, SNLI has been shown to contain many stereotypical associations [51], proving to be a good candidate for our methodology.

4.2.1 Data Subtraction

Following this approach, we want to exclude sentences whose embedding exhibits a large amount of gender bias, without diminishing the size of the corpus too drastically. First, we compute bias score for both the premise and the hypothesis sentence of each sentence pair in the corpus, using the absolute bias score formula described in Eq. 3. Then, for each pair we only consider the higher of the two scores, i.e. the one associated with the premise embedding or the one associated with the hypothesis embedding, so that entries for which at least one of the two scores is too large are discarded.

We define four additional corpora derived from SNLI by subtraction of the 3%, 5%, 7% and 10% of entries with the highest associated bias score, respectively. Therefore, we set a threshold at the 97th, 95th, 93rd and 90th percentiles and discard entries with a bias score above the threshold. The distribution of bias score for all entries in SNLI and the four selected thresholds is illustrated in Fig. 5. It is evident from the chart that discarding 5% of the entries already halves the highest absolute bias score in the corpus, from 0.1498 to 0.0756. After removing the selected entries, we randomise and split each of the four resulting corpora (Sub90, Sub93, Sub95 and Sub97) into training, development and testing sets. Following the split in the original version of the corpus, we place 10k pairs each in the development and testing sets, and use the remaining pairs for training.
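A sketch of the subtraction step with pandas, assuming the per-sentence absolute bias scores have already been computed and stored in two columns (the column names are ours):

```python
import numpy as np
import pandas as pd

def subtract_biased_entries(snli: pd.DataFrame, percentile: float) -> pd.DataFrame:
    """Drop the entries whose worst-case bias score exceeds the given percentile.

    snli: one row per premise/hypothesis pair, with columns
          'premise_bias' and 'hypothesis_bias' (Abs-BiasScore, Eq. 3).
    percentile: e.g. 95 keeps entries below the 95th percentile (corpus Sub95).
    """
    worst = snli[["premise_bias", "hypothesis_bias"]].max(axis=1)
    threshold = np.percentile(worst, percentile)
    return snli[worst <= threshold].reset_index(drop=True)
```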

Fig. 5 Bias score distribution for entries in SNLI. The red lines correspond to the 90th, 93rd, 95th and 97th percentiles

4.2.2 Data Augmentation

To increase the number of entries in SNLI, we adopt an approach of counterfactual data augmentation based on duplicating sentence pairs by converting all female words to their corresponding male words, and vice versa. A similar approach for swapping gender entities is described by [62] and [32], but neither of the two works considers given names in the procedure. First, similarly to the approach used for data subtraction, we set a threshold at the 90th, 93rd, 95th and 97th percentiles, then perform the duplication for all sentence pairs in the corpus associated with a bias score higher than the threshold. If neither the premise nor the hypothesis contains gender words, the entry is not duplicated. On average, duplication affects around 60% of the considered entries. To avoid overfitting, each entry is only duplicated once.

To perform the duplication, we first tokenise the sentence with NLTK, a text processing library.Footnote 3 Each token represents either a word or a punctuation mark. Then, we consider only gender words, and specifically those for which there exists a female/male counterpart. After swapping them with their gender counterpart, we obtain a sentence with subjects of the opposite gender. For instance, he is a young boy is converted to she is a young girl, and my father is a singer becomes my mother is a singer. A total of 122 gender words are considered, mostly nouns regarding family members or occupations (e.g. uncle, aunt, chairman, chairwoman). Appendix A contains the full list of gender words involved in this procedure.

Moreover, we consider the 2500 most popular female and male given names in the USA, according to the Social Security AdministrationFootnote 4; they are used to convert female names to equally popular male names, and vice versa. For instance, the sentence Patrick is going to the supermarket becomes Rachel is going to the supermarket. Some examples of sentence pair duplication by gender-swapping are provided in Table 5. Finally, we randomise the four resulting corpora (Aug90, Aug93, Aug95 and Aug97) and split them into training, development and testing sets, again with 10k pairs in development and testing sets, and the rest reserved for training.
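A sketch of the swapping step with NLTK; SWAP is a hypothetical lookup table standing in for the 122 gender-word pairs of Appendix A and the popularity-matched given names:

```python
from nltk.tokenize import word_tokenize                      # requires nltk.download("punkt")
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Hypothetical excerpt of the swap table, stored in both directions.
SWAP = {"he": "she", "she": "he", "boy": "girl", "girl": "boy",
        "father": "mother", "mother": "father",
        "patrick": "rachel", "rachel": "patrick"}

def gender_swap(sentence: str) -> str:
    """Return the gender-swapped counterpart of a sentence."""
    swapped = []
    for tok in word_tokenize(sentence):
        counterpart = SWAP.get(tok.lower())
        if counterpart is None:
            swapped.append(tok)                              # not a gender word: keep as is
        elif tok[0].isupper():
            swapped.append(counterpart.capitalize())         # preserve capitalisation
        else:
            swapped.append(counterpart)
    return TreebankWordDetokenizer().detokenize(swapped)

# gender_swap("My father is a singer")  ->  "My mother is a singer"
```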

Table 5 Sentence pair duplication by gender-swapping in SNLI

4.3 Language Model and Parameters

We retrain a sentence encoder language model based on a bidirectional LSTM architecture (BiLSTM) developed by Facebook AI [17]. The network features a forward and a backward LSTM that read the input sentence in two opposite directions. The final output is a 4096-dimensional vector representation of a natural language sentence, obtained with a max-pooling discretisation technique. The authors train the network on SNLI with a supervised approach, hence our choice to focus on this specific corpus in our experiments.

The parameters used for retraining the BiLSTM network are those suggested by the original authors: batch size 64, SGD optimiser with a learning rate of 0.1 and weight decay of 0.99. Training is stopped when the learning rate falls below the threshold of \(10^{-5}\): since the network converges fairly quickly, as pointed out by [17], we set the maximum number of training epochs to 8. Finally, as underlying word-level representations, we select the largest and most powerful GloVe model available, trained on Common Crawl 840B.Footnote 5 For the retraining, we used a Linux machine with Ubuntu 18.04, 78 GB of RAM and a GeForce GTX 1060 GPU.
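For reference, the retraining configuration above can be summarised as a plain record (a convenience snapshot of the listed hyper-parameters, not the authors' training script):

```python
RETRAIN_CONFIG = {
    "encoder": "BiLSTM with max pooling (InferSent)",
    "sentence_dim": 4096,
    "word_vectors": "GloVe Common Crawl 840B (300d)",
    "batch_size": 64,
    "optimizer": "SGD",
    "learning_rate": 0.1,
    "weight_decay": 0.99,
    "lr_stop_threshold": 1e-5,   # training stops once the learning rate falls below this
    "max_epochs": 8,
}
```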

5 Experimental Results

The goal of our experimental setting is to confirm a reduction in the gender bias exhibited by the language model retrained by either data augmentation or subtraction. At the same time, it is fundamental to avoid any degradation in the semantic power of the resulting sentence embeddings.

For this reason, after separately retraining the model on the eight SNLI-derived corpora obtained with data subtraction and augmentation, we test them on both fairness and accuracy, comparing them with the original encoder trained on the full unedited SNLI corpus.

5.1 Fairness and Accuracy Metrics

Since our goal is to improve the fairness of sentence encoders, without losing accuracy in downstream tasks, we need to check both qualities. Therefore, the retrained models are evaluated using SEAT [36], a fairness test for sentence encoders, and SentEval [16], a toolkit for assessing the accuracy of sentence embedding models in a variety of natural language tasks.

5.1.1 SEAT

The Sentence Encoder Association Test, or in short SEAT [36], is a fairness test that adapts the well-known Word Embedding Association Test (WEAT) [11] to sentence encoders. Like WEAT, SEAT measures the stereotypical association between two sets of target concepts X, Y (e.g. sentences with male/female subjects) and two sets of attributes A, B (e.g. sentences related to career/family). Targets and attributes are built from simple semantically bleached templates like This is \(<term>\) or \(<term>\) is here, e.g. This is John. The magnitude of the association is measured by the so-called effect size, defined as:

$$\begin{aligned} d = \frac{\text {mean}_{x\in X} \;s(x,A,B) - \text {mean}_{y\in Y} \;s(y,A,B)}{\text {std}_{z\in X\cup Y}\; s(z,A,B)} \;, \end{aligned}$$

where \(s(w, A, B)\) is the difference of the mean cosine similarities between the embedding of the target concept w and the embeddings of the attributes in the sets A and B. A higher effect size reflects a stronger correlation between target concepts and attributes, thus a more severe association bias.
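A sketch of the computation from precomputed sentence embeddings (NumPy arrays), directly following the formula above:

```python
import numpy as np

def _cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): difference of mean cosine similarities of w with A and with B."""
    return np.mean([_cos(w, a) for a in A]) - np.mean([_cos(w, b) for b in B])

def effect_size(X, Y, A, B):
    """SEAT effect size d for target sets X, Y and attribute sets A, B."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y)
```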

SEAT contains ten tests adapted from the work of [11]. We focus only on those tests that contain gender target concepts or attributes, namely C6, C7, C8, and their alternative versions C6b, C7b and C8b, generated by replacing given names (e.g. John, Sarah) with group terms (e.g. male, women). In SEAT, positive scores describe stereotypical associations, like \(male \rightarrow career\) and \(female \rightarrow family\). Thus, the lower the score, the higher the fairness of the model. Additionally, we consider tests C1 and C2, to assess the retention of correct and ethical associations, such as \(instruments \rightarrow pleasant\) and \(weapons \rightarrow unpleasant\). In this case, a reduction in the score indicates a loss of correct associations, therefore it is desirable to maintain fairly high positive scores.

5.1.2 SentEval

We employ SentEval [16] to assess the quality of our models in terms of sentence representations. SentEval is a toolkit featuring a variety of natural language downstream tasks; to test our models we select twelve of them, namely MR [42] and SST [53] for sentiment analysis, CR on product reviews [26], SUBJ on subjectivity/objectivity [41], MPQA on opinion polarity [57], TREC for question answering [55], STS-Benchmark for semantic relatedness [12], SICK-E and SICK-R for semantic entailment and relatedness [35], STS14 for semantic similarity [1] and MRPC for paraphrase detection [19]. All tasks except STS14 require the model to be further trained on their respective corpus. For all the training phases, we use the default parameters suggested by the authors: 10-fold cross-validation, Adam optimisation, batch size of 64, tenacity of 5, and epoch size of 4.
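A hedged sketch of how a retrained encoder is plugged into the SentEval toolkit with the parameters above; `encoder.encode` stands in for whatever encoding call the evaluated model exposes:

```python
import numpy as np
import senteval

encoder = ...                                         # the retrained sentence encoder under evaluation

def prepare(params, samples):
    return                                            # nothing to precompute in this sketch

def batcher(params, batch):
    # batch is a list of tokenised sentences; return one embedding per sentence
    sentences = [" ".join(tokens) if tokens else "." for tokens in batch]
    return np.vstack([encoder.encode(s) for s in sentences])

params = {"task_path": "path/to/senteval/data", "usepytorch": True, "kfold": 10,
          "classifier": {"nhid": 0, "optim": "adam", "batch_size": 64,
                         "tenacity": 5, "epoch_size": 4}}

tasks = ["MR", "CR", "SUBJ", "MPQA", "SST2", "SST5", "TREC", "MRPC",
         "SICKEntailment", "SICKRelatedness", "STSBenchmark", "STS14"]
results = senteval.engine.SE(params, batcher, prepare).eval(tasks)
```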

5.2 Retrained Models Evaluation

Tables 6 and 7 show the effect size for all gender-related tests in SEAT and for the two association retention tests C1 and C2, for both subtraction and augmentation approaches, respectively.

Table 6 SEAT results for gender-related tests C6–C8b and retention tests C1–C2 for the data subtraction approach; the average effect size is computed on C6–C8b
Table 7 SEAT results for gender-related tests C6–C8b and retention tests C1–C2 for the data augmentation approach; the average effect size is computed on C6–C8b

Concerning the gender-related tests from C6 to C8b, there is a marked improvement for all our models in almost every test. C7 and C8 in particular see an improvement in the association score for every single model, with reductions of up to 0.635 and 0.703, respectively, both achieved by Aug90, i.e. the most augmented model tested. The same trend is evident in the remaining tests, with only C6b not showing significant improvements, probably due to the already very low association score in the original model. Furthermore, the average score from tests C6 to C8b again shows an improvement for all our models compared to the original, with the best scores achieved by Sub95, Sub90 and Aug90, suggesting that addressing a larger percentage of sentences in the training data reduces the amount of bias. Figure 6 depicts the trend of the average SEAT scores for both the data subtraction and data augmentation approaches: for augmentation, the trend clearly decreases as the percentage of sentence pairs subject to gender-swapping augmentation increases; on the other hand, models obtained from data subtraction fluctuate in their results, yet still perform better than the original model. Finally, C1 and C2 both highlight a retention of correct associations in all our models, with C2 also showing a reinforcement of the ethical correlations \(instruments \rightarrow pleasant\) and \(weapons \rightarrow unpleasant\).

Fig. 6 Average SEAT score across models for tests C6–C8b. The lower the score, the higher the degree of fairness

Tables 8 and 9 show the results from testing our models on the twelve downstream tasks provided by SentEval. In all tasks, the performance is very similar to that of the original model, with only slight deviations. These results are confirmed by the overall mean score across all tasks, illustrated in Table 10.

Table 8 SentEval results for all encoders trained on SNLI corpus adjusted by data subtraction and augmentation (above), and a comparison with encoders trained on hard-debiased (HD) embeddings (below)
Table 9 SentEval results for all encoders trained on SNLI corpus adjusted by data subtraction and augmentation (above), and a comparison with encoders trained on hard-debiased (HD) embeddings (below)
Table 10 Overall SentEval score across all tests, and difference compared to the original model

5.3 Analysis and Discussion

The results of our experiments show that bias score can be used to find stereotyped sentences, allowing the identification of stereotypes in text corpora used for training language models. Additionally, retraining encoders on fairer corpora of sentences, such as an augmented or size-reduced version of SNLI, proves to be an effective way to achieve more ethical and equally powerful language models. More specifically, the results from SEAT suggest that models obtained from both data augmentation and data subtraction can override unethical and gender-stereotypical associations, leading to better association scores. At the same time, correct associations are maintained, if not reinforced, meaning that the basic semantics of the language is retained. This is confirmed by the tests performed with SentEval, which show that all our models achieve results comparable to the original one, trained without any augmentation or subtraction of the training data. Additionally, the trend of the average SEAT scores depicted in Fig. 6 suggests that considering a higher percentage of sentence pairs from SNLI increasingly improves results, at least for the augmentation approach. Although the SentEval results suggest that accuracy remains steady regardless of the percentage of entries removed from the corpus or added to it, we expect a limit to exist beyond which the accuracy of the model starts decreasing, despite a continuous improvement from the fairness point of view.

In general, results confirm that gender bias in sentence encoders can be ascribed to the internalisation of stereotypical concepts encountered during the training phase. Therefore, removing or compensating for stereotyped entries in the training data improves the fairness of the final model.

5.4 Comparison with Hard-Debiasing

While our approach optimises the training procedure of language models by focusing on the training data, an alternative option is so-called vector-space manipulation [22, 45].

In particular, we compare our approach with hard-debiasing, a vector-space manipulation technique that reduces bias in word embeddings by forcibly removing gender semantics from all vectors associated with gender-neutral words [7]. We first hard-debias GloVe embeddings, then retrain the sentence encoder starting from the debiased embeddings and the full original SNLI corpus. We test the resulting model on both SentEval and SEAT. Results are presented in Tables 8, 9 and 10, and in Table 11, respectively. While we witness a noticeable improvement in the SEAT tests, the overall SentEval score sees an average degradation of 1.73 points. Moreover, performance in specific tasks like CR and TREC drops drastically, by up to 3.4 and 6.8 points, respectively. Similar results are obtained when adopting word2vec [38, 39] and the hard-debiased word2vec from [7] in place of GloVe: the SEAT tests improve considerably, at the price of a major reduction in many SentEval tasks, particularly in classification tasks such as CR, TREC and SST-5.

In summary, hard-debiasing can effectively improve the fairness of sentence encoders, but at the cost of a substantial loss of accuracy in downstream tasks. On the contrary, our approach maintains the accuracy and quality of the resulting sentence embeddings in all tasks, while still considerably improving the average SEAT scores for gender-related tests, by up to 19%. Since the drop in model accuracy is significant when using hard-debiased GloVe, we do not consider it advisable to combine our approach with hard-debiasing, as the quality of the final sentence representations would still be lower.

Table 11 SEAT results comparison between encoders trained on original GloVe or word2vec (w2v) and their hard-debiased (HD) version

6 Extension to Transformer-Based Models

In this section, we briefly describe how to adapt bias score to transformer-based sentence encoders. Additionally, we present some examples focused on the widespread BERT family of language models, particularly on the sentence encoder SBERT [49] based on BERT-Base for NLI with max pooling discretisation. We adopt SentenceTransformer,Footnote 6 a Python framework that makes it easy to switch from one language model to another without installing additional tools.

The main difference with respect to the methodology described in Sect. 3 is how to compute the gender direction \(\overrightarrow{g}\). In fact, contextualised encoders need pairs of sentences instead of pairs of words to fully capture the semantics of gender. To do so, we take more than 100 sentences randomly extracted from the following three datasets: POM [44], MELD [47], SST [53]. Then, we swap all female words to male words, and vice versa. Following the approach described by [30], the difference vector between the embedding of the original sentence and the embedding of its gender-swapped counterpart represents gender. Again, to solidify the methodology, we perform a PCA of the resulting difference vectors to find a single direction \(\overrightarrow{g}\). This approach proves to be extremely effective, resulting in a top component explaining 78% of the variance, as shown in Fig. 7a. Moreover, repeating the procedure with different pairs of gender sentences does not produce much difference in the resulting direction, with a similarity higher than 99%.
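A sketch of this step with the SentenceTransformer framework and scikit-learn; the checkpoint name is an assumption for a BERT-Base NLI model with max pooling, and `swap_fn` is a gender-swapping helper such as the one sketched in Sect. 4.2.2:

```python
import numpy as np
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-max-tokens")   # assumed BERT-Base NLI, max pooling

def sbert_gender_direction(sentences, swap_fn):
    """Gender direction for a contextualised encoder: top principal component of
    the differences between sentence embeddings and their gender-swapped versions."""
    originals = model.encode(sentences)
    swapped = model.encode([swap_fn(s) for s in sentences])
    pca = PCA(n_components=1).fit(originals - swapped)
    g = pca.components_[0]
    return g / np.linalg.norm(g)
```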

Concerning word importance, since our approach is based on max pooling, it is best to choose this discretisation option for the encoder as well. Moreover, it is worth noting that BERT-based models split sentences into tokens, which represent either entire words or sub-words. For this reason, importance is estimated at token level. An example is provided in Fig. 7b.

Fig. 7 On the left, top ten components in PCA to retrieve the gender direction for SBERT with BERT-Base vectors. On the right, token-level importance for the input sentence A man is playing the saxophone

Table 12 provides a detailed example of bias score, showing that, unlike in sentence encoders based on static word vectors like GloVe, in contextualised representations every word inherits gender information from the context of the sentence, in this case from the female subject. This feature makes it more difficult to separate the amount of correct gender information from the gender-biased information. However, the results in Table 13 show that BERT-based encoders still encapsulate a lot of gender bias, which is especially noticeable in gender-neutral sentences. In fact, in the first set of sentences, the amount of female bias with a female subject is much higher than the male bias with a male subject. Accordingly, with a neutral subject the same sentence still exhibits female bias, due to the stereotypical profession described. The second set of sentences again shows that female stereotypical associations are internalised more than the male ones; when the context of the sentence contains both a female and a male stereotype (pink dresses and playing football), the amount of bias in the female direction is higher, even when the subject is explicitly male. Moreover, with a neutral subject, the female bias is still very large regardless of the order in which the two concepts are presented. We could not find any explanation for this behaviour in the literature and we believe that it is due to the higher presence of female stereotypes in training corpora.

Table 12 Detailed bias score estimation for SBERT encoder with BERT-Base vectors for the sentence She likes the new pink dress
Table 13 Examples of male and female bias score with BERT-based sentence encoders, considering different gender stereotyped sentences

Finally, to adapt the methodology described in Sect. 4 to BERT-based encoders, we first note that training them from scratch is extremely time- and resource-consuming. However, they can be fine-tuned and adjusted with far less effort. To adapt our methodology to this scenario, we first identify text corpora commonly used to semantically fine-tune pre-trained language models, such as STS-Benchmark [12] and MultiNLI [58]. As in the methodology described in this work, the first step is to identify the most stereotypical entries in these corpora, improve their degree of fairness, and then use them to fine-tune a pre-trained transformer-based model. Appendix B illustrates preliminary results from the identification of stereotypical entries in the two aforementioned training corpora. Similar approaches based on fine-tuning for bias mitigation have shown promising results in the recent literature [15, 20, 23].

7 Conclusions and Future Work

In this paper, we proposed both an algorithm to estimate gender bias in sentence embeddings, based on a novel metric named bias score, and a method to mitigate gender bias in sentence encoders by retraining them on training data improved by performing either data subtraction or gender-swapping data augmentation.

Bias score discerns between gender bias and correct gender information encoded in a sentence embedding, and quantifies the presence of bias on the basis of the semantic importance of each word. We tested our solution on InferSent [17], searching for the most gender-biased representations in a corpus of natural language sentences. Moreover, bias estimation is also crucial for adapting procedures like debiasing to sentence embeddings, since they require biased sentence representations to be effectively identified [7]. Since bias score is proportional to the amount of stereotypical conceptions encapsulated in sentence representations, it allows the retrieval of stereotyped entries from text corpora used for training language models. In the second part of this work, we define fairer versions of the SNLI corpus by data subtraction and data augmentation of its most stereotyped entries: sentence encoders retrained on them prove to be less prone to making stereotypical associations than their original counterpart, while maintaining the same accuracy in a variety of natural language understanding tasks. This is crucial to maintain quality while reducing discrimination in a variety of web-related tasks, such as document search and ranking or hate speech detection.

Future work includes adapting bias score to different kinds of social bias (e.g. ethnic, religious) and further testing it on other sentence encoders such as SBERT [49]. Additionally, considering a higher percentage of SNLI for data subtraction and data augmentation may result in additional improvements in SEAT scores, either maintaining the same accuracy for the retrained encoder or confirming the hypothesis that beyond a certain threshold the performance of the model starts decreasing. Moreover, combining the two approaches of subtraction and augmentation may prove even more effective at reducing gender bias: this means removing sentence pairs associated with a high bias score and substituting them with their gender-swapped equivalents. Finally, additional comparisons with debiasing techniques such as the one proposed by [63] can be useful to highlight the strengths and weaknesses of both approaches.