1 Introduction

Pre-trained language models are widely used, for example for translation (Wu et al., 2016), opinion summarization (Zhu et al., 2013; Farzindar, 2014) and chatbots (Shawar and Atwell, 2007). OpenAI’s ChatGPT (OpenAI, 2023) chatbot, for instance, is now widely adopted and applied in practice, e.g., in education for lecture design (Extance, 2023), news article production (Liu, 2022) and content creation (Cao et al., 2023). Such powerful chatbots, along with other large language models, have been pre-trained on large collections of training data, mostly from online resources (Devlin et al., 2019; Brown et al., 2020).

Language models learn feature-rich vector representations of words, called word embeddings. Recent studies indicate that embeddings exhibit systematic patterns of stereotype discrimination, mirroring human biases (Zhang et al., 2021; Wolfe and Caliskan, 2021; Nadeem et al., 2021). For example, word embeddings assign a significantly higher probability to ‘he’ as the predicted pronoun for a surgeon in a simple sentence, but a notably elevated probability to ‘she’ as the pronoun associated with a nurse (Kumar et al., 2020).

Language is intertwined with culture due to differences in word usage behavior (Loveys et al., 2018), writing styles (Ma et al., 2022), common sense knowledge, debatable topics, and value systems (Hershcovich et al., 2022). Research has shown that these demographic differences in the task domain will harm the performance of downstream Natural Language Processing (NLP) tasks (Ma et al., 2022; Ghosh et al., 2021; Sun et al., 2021; González et al., 2020; Tan et al., 2020; Loveys et al., 2018).

An additional source of bias is differences in the amount of data each group contributes to a dataset. For example, most massive datasets are collected online. Regions with smaller populations contain fewer online users and are hence underrepresented in the training data. In particular, Zhang et al. (2021) show that most word embeddings reflect more of the language habits of European-educated males, neglecting other subsets of the population. This constitutes a biased selection of the population (Hershcovich et al., 2022; Ma et al., 2022) and raises concerns about the non-selected groups’ representation within the dataset (Hershcovich et al., 2022; Wolfe and Caliskan, 2021), which will probably cause harm in applications (Ghosh et al., 2021; Tan et al., 2020; González et al., 2020).

While research mostly focuses on cross-culture problems in cross-lingual language models, the monolingual model—English in this case—may not be free of culture difference bias either (Hershcovich et al., 2022). Recent studies on English language models focus on the geographic influence of non-traditional English-speaking (L2) regions (Tan et al., 2020; Ghosh et al., 2021), e.g., researchers discovered differences in emotional responses (Ghosh et al., 2021) and word usage (Ma et al., 2022) between L2 regions. Others investigate bias in models on NLP tasks for different social groups within one region (Zhang et al., 2021).

However, Ma et al. (2022) and related research study bias in resource-abundant regions, such as the US or India, where labeled datasets for NLP tasks are easier to access. In contrast, this paper facilitates the inclusion of resource-limited regions by reducing the reliance on labeled task data: we investigate regional bias at the word embedding level, before fine-tuning, where it can influence downstream task performance. Here, we focus on the ‘sequence output’ of BERT, where the probability distribution of each word is concatenated into a sequence array. Our proposed framework is not tailored to regional bias and can be used to investigate different sources of bias while including resource-limited groups. However, to demonstrate its power and close a research gap simultaneously, we focus on regional bias within the inner-circle English group of L1 regions that have previously been neglected by research. We ask the following research questions:

  1. Do regional differences in raw text data manifest in embedding space?

  2. What impact do regional differences have on the performance of downstream tasks?

The regional differences identified in RQ1 may arise from diverse topics or variations in word usage frequencies across different regions. Examining these variations serves not only to elucidate regional patterns but also offers valuable insights that can guide bias mitigation efforts, which is relevant to the bias revealed by the impact on downstream tasks in RQ2.

We approach the research questions using two standard NLP datasets, Sentiment140 and Reuters21578, containing tweets and news articles from six and four L1 English-speaking countries, respectively. We find that regional differences are indeed manifested in BERT embedding space as well as in regional feature space. Furthermore, these differences affect the performance of downstream learning tasks. In particular, we investigate sentiment classification and multilabel classification on the corresponding datasets and find significant drops in performance for underrepresented regions relative to the test set performance. These results imply that differences in embedding space indicate model performance gaps and are hence a suitable tool to analyze bias while including resource-limited groups.

The remainder of this article is organized as follows: Sect. 2 provides preliminary information for our methodology in Sect. 3. Section 4 introduces the experimental setup. Section 5 analyzes the results. Finally, Sect. 6 reviews related research before Sect. 7 concludes this paper.

2 Preliminary and Notation

Before diving into the details of our methodology, we outline the notation and provide brief preliminary definitions required for the remainder of this article. More details can be found in Appendix 1.

Notation Let \(\Phi \) denote a pre-trained language model, \(\Phi : W \rightarrow S\), where W and S denote the input and the output of \(\Phi \). Likewise, let \(\Omega \) denote a downstream NLP task model, \(\Omega : S \rightarrow T\), with input S and output T. Then let \(D_{s}\) and \(D_{t}\) denote the metric functions (distance functions) on S and T respectively: \(D_{s}: S \times S \rightarrow {\mathbb {R}}\), \(D_{t}: T \times T \rightarrow {\mathbb {R}}\). This paper aims to show that differences in embedding space, \(D_{s}(\Phi (W_{i}), \Phi (W_{j}))\), cause differences in the downstream NLP task performance, \(D_{t}(\Omega (\Phi (W_{i})), \Omega (\Phi (W_{j})))\), where \(W_{i}\) and \(W_{j}\) denote two group domains for \(\Phi \).

Wasserstein distance and Sinkhorn algorithm Wasserstein distance is a measure derived from the optimal transport problem, estimating the effort of transforming one shape into another. It can be used to measure the difference between two probability distributions, such as region-specific data distributions in embedding space (Cai and Lim, 2022). The Sinkhorn algorithm (Chizat et al., 2020) allows for efficient calculation of the Wasserstein distance.
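For illustration, the following minimal sketch (assuming the POT library, installed via `pip install pot`) computes an entropy-regularised Wasserstein distance between two sets of embedding vectors with the Sinkhorn solver; group sizes and dimensions are hypothetical.

```python
# Minimal sketch: Sinkhorn-approximated Wasserstein distance between two point clouds,
# assuming the POT (Python Optimal Transport) library. The random data stands in for
# two groups' embedding vectors.
import numpy as np
import ot  # Python Optimal Transport

def sinkhorn_wasserstein(X, Y, reg=0.05):
    """Entropy-regularised OT cost between point clouds X and Y (rows = embeddings),
    treating both as uniform empirical distributions."""
    a = ot.unif(X.shape[0])                 # uniform weights for group i
    b = ot.unif(Y.shape[0])                 # uniform weights for group j
    M = ot.dist(X, Y, metric="euclidean")   # pairwise ground-cost matrix
    return float(ot.sinkhorn2(a, b, M, reg))

rng = np.random.default_rng(0)
print(sinkhorn_wasserstein(rng.normal(size=(100, 512)),
                           rng.normal(size=(120, 512))))
```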

Linear discriminant analysis (LDA) Linear discriminant analysis (LDA) is a statistical method for finding new features (as linear combinations of the original features) that best discriminate between classes (in our case: regions). These new features constitute the axes of a new space, typically of a smaller dimension, which we refer to as the LDA space. LDA operates on the dataset alone using the regional labels and does not need a specific task.

Distance correlation Distance correlation measures the strength of the dependency between two variables, even if their dimensions differ (Edelmann et al., 2021). In contrast to Pearson’s correlation, distance correlation can capture nonlinear associations between variables. The population distance correlation is zero if and only if the variables are independent. This paper uses distance correlation to measure the dependency between the task performances of different regional groups. Distance correlation ranges from 0 (independence) to 1 (perfect dependence).
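As an illustration, here is a minimal sketch assuming the `dcor` package, with hypothetical per-model performance vectors for two regional test sets.

```python
# Minimal sketch: distance correlation between two performance-variation profiles,
# assuming the `dcor` package. The vectors below are hypothetical metric values of the
# same five fine-tuned models evaluated on two regional test sets.
import numpy as np
import dcor

perf_region_a = np.array([0.81, 0.79, 0.83, 0.80, 0.82])
perf_region_b = np.array([0.74, 0.72, 0.77, 0.73, 0.75])

# Ranges from 0 (independence) to 1; high values mean the two regions' performance
# varies in a strongly dependent way across models.
print(dcor.distance_correlation(perf_region_a, perf_region_b))
```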

Relation measurement To examine the relation between distances in embedding space and distances in downstream task performance, we use Spearman’s \(\rho \) and Kendall’s \(\tau \) rank correlation coefficients. \(\rho \) assesses the strength and direction of the relationship between two variables by examining how their ranks change, not their actual values. \(\tau \) directly counts how many pairs of data points agree or disagree in their order, making it less sensitive to tied values. Both coefficients range from \(-1\) (strong negative correlation) via 0 (independence) to 1 (strong positive correlation).
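A minimal sketch of this rank-correlation check using SciPy follows; the vectors of pairwise embedding distances and performance gaps are hypothetical placeholders.

```python
# Minimal sketch: Spearman's rho and Kendall's tau between hypothetical vectors of
# pairwise embedding distances and pairwise performance gaps (one entry per region pair).
from scipy.stats import kendalltau, spearmanr

embedding_distances = [0.42, 0.55, 0.31, 0.68, 0.47]
performance_gaps = [0.03, 0.05, 0.02, 0.07, 0.04]

rho, rho_p = spearmanr(embedding_distances, performance_gaps)
tau, tau_p = kendalltau(embedding_distances, performance_gaps)
print(f"Spearman rho={rho:.2f} (p={rho_p:.3f}), Kendall tau={tau:.2f} (p={tau_p:.3f})")
```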

3 Methodology

To investigate bias in pre-trained language models and analyze its impact on a specific dataset, we propose a methodology that is specifically tailored to include all sub-populations, even those with limited access to task-labeled datasets.

Fig. 1: The process of the proposed methodology

Figure 1 depicts the procedural flow of the proposed method, which compares regional features in text data X in three distinct ways:

  I. observe differences in performance for a downstream task y to quantify the impact of both regional bias and embedding bias,

  II. measure distances in embedding space, tracing the regional differences back to the embedding, and

  III. measure distances in LDA space to disentangle the effect of regional and embedding bias.

The subsequent Results Analysis stage examines the correlation between differences in intrinsic metrics (derived from II and III) and extrinsic metrics (derived from I). It is worth noting that this methodology focuses on intrinsic features obtained directly from the model’s output and word embeddings, diverging from the conventional practice of analyzing intermediate outputs of model layers (Leteno et al., 2023) or prediction probabilities over words for tasks like text classification (Nadeem et al., 2021; Lauscher et al., 2021) and text generation (Sun et al., 2022). This methodology helps illustrate the relationship between the knowledge features in models and the task performance in a straightforward way.

Additionally, our approach recommends the use of LDA as a tool to identify the feature subspace crucial for distinguishing group features. In essence, this method successfully unveils the intricate relationship between intrinsic and extrinsic metrics, especially in the context of multi-group data, such as multiple regional groups in our case study. Notably, the intrinsic metrics derived herein can serve as effective evaluation metrics for addressing bias in pre-trained language models. Subsequently, we discuss each involved aspect in detail.

Performance To measure performance differences, a subset of text data X is employed as the training data to fine-tune the pre-trained LLMs, incorporating an additional task performance layer tailored to the specific task y. Simultaneously, a subset of X serves as the test data, subjected to the fine-tuned models to yield comprehensive performance results. The performance evaluation metrics include accuracy, AUC (area under the ROC curve), precision scores, and recall scores. Subsequently, gaps (score differences) in these evaluation metrics are calculated on a group-wise basis. The type of fine-tuning layer is dynamically determined by the nature of the task y. If there exists more than one fine-tuned model, the distance correlation of these performance metrics is calculated.
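The following sketch illustrates the group-wise gap computation with scikit-learn metrics; the per-region containers and the "baseline" key are hypothetical placeholders rather than our actual data structures.

```python
# Minimal sketch: group-wise performance gaps relative to a baseline test set,
# using scikit-learn metrics for a binary task. Input dictionaries map a region name
# (plus a hypothetical "baseline" key) to labels, hard predictions, and scores.
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def region_gaps(y_true_by_region, y_pred_by_region, y_score_by_region, baseline="baseline"):
    scores = {}
    for region, y_true in y_true_by_region.items():
        y_pred, y_score = y_pred_by_region[region], y_score_by_region[region]
        scores[region] = {
            "accuracy": accuracy_score(y_true, y_pred),
            "auc": roc_auc_score(y_true, y_score),
            "precision": precision_score(y_true, y_pred),
            "recall": recall_score(y_true, y_pred),
        }
    base = scores[baseline]
    # Gap = region score minus baseline score, per metric.
    return {r: {m: v - base[m] for m, v in s.items()}
            for r, s in scores.items() if r != baseline}
```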

Distance correlation measures the dependency of the performance variance patterns of two datasets. Performance variance, in this context, denotes the fluctuations in performance on a test dataset resulting from alterations in the testing model. Two similar datasets are expected to have similar patterns of performance variance. For example, suppose one model improves the performance on dataset A by \(f\%\); then the performance improvement is expected to be close to \(f\%\) when the model is applied to dataset B if the two datasets are similar in features. This measure indicates how reliable model performance estimates based on the standard test datasets are.

Performance drops between regions can be caused by two different types of bias: true regional bias inherent in the dataset and bias that is already encoded in the embedding. To isolate the embedding bias, we subsequently calculate distances in embedding space as well as in LDA space, which circumvents using the embedding.

Embedding To measure embedding distances, the sequence embeddings, where the probability distribution arrays of words are concatenated into one array following the order of the word sequence, are extracted for the test data by regional groups. We opt for Wasserstein distance instead of the commonly used Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence. Wasserstein distance is preferred because it relaxes the assumption that the two probability distributions are measured in the same space (a detailed justification can be found in Appendix 1). Due to this property, Wasserstein distance can be generalised to other contexts, such as different languages, where the presence of a common subspace is not guaranteed. We then compute the Wasserstein distance in embedding space for any pair of regional groups using the Sinkhorn algorithm. This approach captures the nuanced differences in BERT embeddings, thereby contributing to the understanding of regional distinctions within the text data. Large regional differences in embedding space indicate large differences in word usage between those regions, whereas regions that are close in embedding space use similar language.
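A hedged sketch of this aspect is given below. It uses the Hugging Face `transformers` BERT as a stand-in for the TensorFlow Hub encoder from our experiments, and the regional text containers are hypothetical; see the repository for the actual pipeline.

```python
# Hedged sketch: extract flattened BERT sequence embeddings per regional group and
# compute pairwise Sinkhorn-approximated Wasserstein distances. The encoder
# (Hugging Face `transformers`) is a stand-in for the TF Hub model used in the paper.
import itertools
import ot
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def sequence_embeddings(texts, max_length=128):
    """One flattened sequence embedding (concatenated token vectors) per input text."""
    batch = tokenizer(texts, padding="max_length", truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch).last_hidden_state     # shape (n, seq_len, hidden)
    return out.reshape(out.shape[0], -1).numpy()

def region_pair_distances(texts_by_region, reg=0.05):
    """Wasserstein distance (Sinkhorn) for every pair of regional groups."""
    dists = {}
    for r1, r2 in itertools.combinations(sorted(texts_by_region), 2):
        X = sequence_embeddings(texts_by_region[r1])
        Y = sequence_embeddings(texts_by_region[r2])
        M = ot.dist(X, Y, metric="euclidean")        # pairwise ground-cost matrix
        dists[(r1, r2)] = float(ot.sinkhorn2(ot.unif(len(X)), ot.unif(len(Y)), M, reg))
    return dists
```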

LDA As in the embedding aspect, embeddings are extracted for the test data by regional groups. These embeddings are projected into LDA space to calculate their group distances, if applicable. Earlier studies (Zhao et al., 2019; Bolukbasi et al., 2016) have identified specific feature spaces using embeddings within a given feature set. For instance, Zhao et al. (2019) utilized gender-related sets (e.g., woman, girl, man, boy) to define the gender feature space. The present research aims to explore variations across multiple groups within the embedding space. LDA is employed to discover the feature space that effectively distinguishes between groups. LDA conducts this exploration without prior knowledge of the feature space and generalizes from binary-class to multi-class data.

In LDA space, the axes represent linear combinations of features, from embeddings in this context, with an emphasis on maximizing the distance between groups. The adoption of LDA techniques in this method serves the purpose of identifying the space that accentuates group identity features. It is essential to note, however, that LDA is not universally applicable. Its effectiveness is contingent upon equal group sizes. In instances of uneven group distribution, this paper employs a down-sampling strategy. LDA faces challenges in the Reuters21578 dataset because the total sample size falls below the number of features. Despite these limitations, LDA remains a valuable technique for visualizing group identity features.
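For concreteness, the following sketch (assuming scikit-learn, with hypothetical inputs) shows the down-sampling, LDA projection, and centroid-distance computation described above.

```python
# Hedged sketch: project group embeddings into LDA space and measure pairwise centroid
# distances, after down-sampling every group to the same size. X is an (n, d) embedding
# matrix and `regions` a length-n array of group labels; both are hypothetical inputs.
import itertools
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_centroid_distances(X, regions, n_per_group=100, seed=0):
    rng = np.random.default_rng(seed)
    regions = np.asarray(regions)
    # Down-sample each group to n_per_group items (assumes each group has at least that many).
    idx = np.concatenate([rng.choice(np.where(regions == r)[0], n_per_group, replace=False)
                          for r in np.unique(regions)])
    Xs, ys = X[idx], regions[idx]
    Z = LinearDiscriminantAnalysis().fit_transform(Xs, ys)   # (k-1)-dimensional LDA space
    centroids = {r: Z[ys == r].mean(axis=0) for r in np.unique(ys)}
    return {(a, b): float(np.linalg.norm(centroids[a] - centroids[b]))
            for a, b in itertools.combinations(sorted(centroids), 2)}
```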

Results analysis In the results analysis phase, the group performance gaps computed in the performance phase are juxtaposed with the group distances identified in embedding and LDA space. To assess the presence of bias in the embedding space, correlation coefficients such as Spearman’s \(\rho \) and Kendall’s \(\tau \) are employed. An impact on downstream task performance is inferred when the correlation strength is moderate to strong and accompanied by significant p-values.

4 Experimental setup

To scrutinize regional biases within large language models using our proposed methodology, we conduct experiments on two distinct datasets tailored for different NLP tasks. This section introduces the data we used and the experiment procedure, including pre-processing, training, and evaluation. We provide our implementation, results, and scripts in our repository, alongside supplementary materials containing additional details and results: https://github.com/anniejlu/regional_bias.

Fig. 2: Regional distribution in pre-processed datasets

Fig. 3: Data usage schema for Sentiment140 in our analysis

Datasets and preprocessing The availability of publicly accessible datasets for NLP task evaluation containing geographic information is limited. This experiment relies on two well-established datasets: Sentiment140 (training) (Go et al., 2009) and Reuters21578 (Apt’e et al., 1994). These datasets serve as standards for sentiment analysis and document multi-labeling tasks, respectively. See Table 1 for a dataset overview and Fig. 2 for the distribution of regions within each dataset.

Table 1: Training data summary

Fig. 4: Data usage schema for Reuters21578 in our analysis

The Sentiment140 training set comprises 1.6 million tweets with sentiment labels. The labeling process is performed using Go et al. (2009)’s classifier, which relies on emoticons present in the tweets, eliminating the need for human labeling. We obtain binary labels (positive and negative). All special characters and emoticons are removed from the text. Figure 3 demonstrates the process of filtering and sampling applied to the Sentiment140 dataset.

For performance evaluation (I), we randomly select five training sets, each containing 16,000 tweets. Subsequently, we use 5-fold cross-validation to split each set further. Four folds are utilized as training data, while the remaining fold is used as test data, serving as one of the baseline datasets for performance evaluation.

We randomly select a sample of 100,000 tweets from the entire dataset for the extraction of regional data based on the location content from the L1 countries Australia (AU), Canada (CA), New Zealand (NZ), the United Kingdom (UK), the United States (US), and South Africa (ZA). See Appendix “Data preprocessing” section for details.

Here, the “mixed” dataset refers to the combined set of all data with identified regions. We have 28,636 tweets together for the regional test dataset.

Tweets are particularly short, which causes problems in our LDA analysis (see Appendix 3.2 for details). To overcome these problems, we generate long texts from the Sentiment140 data by randomly picking 10 tweets per iteration to constitute a corpus, maximising the use of the model’s 128-token sequence size. We repeat this procedure 100 times for statistical stability. For each pair of comparisons, the embeddings of these long-text tweets are combined, and we measure the differences in their concatenated embeddings. We generally distinguish between short-text (using the original tweets) and long-text embeddings and indicate which one we use.
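A minimal sketch of this long-text construction follows; function and parameter names are illustrative only.

```python
# Minimal sketch: build pseudo-documents by repeatedly sampling 10 tweets and joining
# them, so the 128-token input window of the encoder is better utilised.
import numpy as np

def make_long_texts(tweets, tweets_per_doc=10, n_docs=100, seed=0):
    """Return n_docs pseudo-documents, each a concatenation of tweets_per_doc tweets."""
    rng = np.random.default_rng(seed)
    docs = []
    for _ in range(n_docs):
        picked = rng.choice(len(tweets), size=tweets_per_doc, replace=False)
        docs.append(" ".join(tweets[i] for i in picked))
    return docs
```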

Reuters21578 is a multilabel dataset of Reuters business news articles. There are 119 document topic labels such as “trade”, “crude”, and “nat-gas”. Each document can have one or multiple labels. News articles without any topic labels are removed. We then preprocess every article by removing newline characters and replacing tab spaces with single spaces. We construct a regional dataset by gathering data where the “place” field contains a single value, operating under the assumption that the authors of the news articles originate from the respective publishing countries. Articles labeled as “Multiple” denote publications in the five focus regions, as well as in multiple areas that encompass these five regions. We follow the train-test setting pre-defined by Apt’e et al. (1994). Figure 4 shows the details of the train-test split of the dataset.

Sampling To save computational costs for model training, we randomly draw 5 samples of 16,000 tweets each from Sentiment140, as illustrated in Fig. 3. For each sample, we use 5-fold cross-validation to obtain 5 training data samples (12,000 tweets) and 5 test data samples (4000 tweets). The test data samples serve as the baseline for the comparison of model performance across different region datasets and are labeled as “baseline” in the plots in Sect. 5. Due to dataset size constraints in Reuters21578, we construct a single pre-defined training data sample and test data sample as per Apt’e et al. (1994). The test data sample, denoted as “Test” in subsequent plots, serves as the benchmark for performance comparison. As shown in both Figs. 3 and 4, we construct regional datasets with stratified sampling: equal-sized samples are drawn from each region. We also create a “Mixed/Multiple” sample region by drawing samples exclusively from regions for which the distribution over the focus regions is known, serving as a baseline for datasets with known regional distributions.
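For illustration, here is a minimal sketch of the equal-sized regional sampling, assuming a pandas DataFrame with a 'region' column (a hypothetical schema).

```python
# Minimal sketch: stratified, equal-sized sampling per region, assuming a DataFrame
# with a 'region' column and at least n_per_region rows for every region.
import pandas as pd

def stratified_regional_sample(df: pd.DataFrame, n_per_region: int, seed: int = 0) -> pd.DataFrame:
    return (df.groupby("region", group_keys=False)
              .apply(lambda g: g.sample(n=n_per_region, random_state=seed)))
```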

Embedding and Model Training We choose BERT (Devlin et al., 2019) as the embedding in our experiments and proceed as depicted in Fig. 1: we fine-tune the BERT model for the downstream task (I) and project the test datasets into embedding space (for II and III). Aiming to simulate routine task training and testing, we use unstratified dataset samples as training data and as the baseline test dataset. The results on the baseline datasets represent the model’s general performance. For Sentiment140 data, an uncased English BERT with 4 hidden layers, a hidden size of 512, and 8 attention heads from TensorFlow Hub is fine-tuned with an added dense layer and a dropout layer.
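A hedged sketch of this fine-tuning setup follows. The exact TensorFlow Hub handles, versions, and hyperparameters are assumptions for illustration; the repository contains the actual configuration.

```python
# Hedged sketch: small uncased BERT (L-4, H-512, A-8) from TensorFlow Hub with a dropout
# layer and a dense sigmoid head for binary sentiment. Hub handles/versions are assumed.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers ops needed by the BERT preprocessor)

preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2",
    trainable=True)

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
bert_outputs = encoder(preprocess(text_input))
x = tf.keras.layers.Dropout(0.1)(bert_outputs["pooled_output"])
prediction = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(text_input, prediction)
model.compile(optimizer=tf.keras.optimizers.Adam(3e-5),
              loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
# model.fit(train_texts, train_labels, epochs=3)  # train_texts/labels are placeholders
```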

For Reuters21578 data, the multilabel classifier training adheres to the original settings in DocBERT (Adhikari et al., 2019). Due to data limitations, the multilabeling experiment cannot accommodate cross-validation settings. Consequently, this experiment features a singular multilabel classifier.

We provide the concrete hyperparameter settings in Appendix “Hyperparameters”.

Evaluation For performance evaluation (I) on Sentiment140, starting from the 28,636 tweets (see Fig. 3), we downsample each region to a sample size of 1000, aligned with the CA data (except for NZ and AU, where we use all tweets), as well as a mixed sample of equal size containing all regions (in their original distribution), to reduce the computational cost of transferring the tweets to embedding space. The sampling process is repeated 30 times to allow for statistical stability, and we report average accuracy, area under the curve (AUC), precision, and recall. For Reuters21578, the test dataset size is manageable and does not need to be reduced prior to the embedding space transfer.

For embedding differences (II), we sample data for each region with equal sizes of 100 for Sentiment140 and a down-sampled size of 35, aligned with the AU data, for Reuters21578. We then repeat the process 100 times. To emphasize the regional feature pattern, we aggregate the sequence embeddings into a longer sequence embedding. For instance, in Sentiment140, we concatenate 100 sequence embeddings to form a singular embedding, facilitating the measurement of Wasserstein distance. We report the mean Wasserstein distances for each pair of regions, representing the embedding difference between the two regions.

For LDA distances (III), we randomly sample 100 tweets or artificial long-text tweets and 35 tweets from each region and mixed-region group for Sentiment140 and Reuters21578, respectively. We proceed to project the group data into a dimension-reduced space using LDA and compute the distance between the centroids of clusters for each pair of groups, utilizing Euclidean distance. We repeat the process 100 times, and the mean distance values serve as the LDA distance for analysis. However, LDA fails to project Reuters21578 data to a reasonable space due to its small data size. The total sample size is 210 (including Australia) or 500 (excluding Australia), which is smaller than the dimension, 512, of BERT embeddings.

5 Results and discussion

In the following section, we discuss the outcomes of our study, exploring how regional differences in raw text data reflect in the embedding space and examining their influence on the performance of downstream tasks. This discussion builds upon the methodology and experimental setup outlined in the previous sections. Flowing into the result analysis are three types of results as illustrated in Fig. 1: (I) Performance gaps and distance correlations as a result of training a model for the specific tasks, i.e., sentiment analysis for the Sentiment140 dataset and multi-class labeling for Reuters21578, (II) Wasserstein distances of regional groups within the embedding space, i.e., BERT in our case, and (III) distances of regional groups within LDA space.

We structure this section according to our research questions, “Do regional differences in raw text data manifest in embedding space?” and “What impact do regional differences have on the performance of downstream tasks?”.

5.1 RQ1: Do regional differences in raw text data manifest in embedding space?

Fig. 5: The feature difference level of regions from the standard dataset in the embedding space is proportional to the region’s representation level. The share of each region in the standard dataset is known for Reuters21578 but unknown for Sentiment140 data. Thus, “Mixture” refers to the dataset merging the 6 object regions instead of the standard dataset. The concrete equations for computation are in Appendix 1

It is well known that underrepresentation of certain groups in the dataset can cause bias, since the model will focus on the majority groups and neglect others because they contribute little to the overall training error. Imbalance of regional groups is a problem in the datasets we use. Figure 5 illustrates the relationship between each region’s proportion within the standard test data and the difference of its BERT embeddings from the standard test data. The embedding space difference is measured by Wasserstein distance (see computation details in Appendix 1). It generally follows that the Wasserstein distance of regional data from the standard test data is smaller when the region possesses a larger share of the standard data. In other words, a larger proportion of regional data in the standard datasets may exert an influence on the features of pre-trained BERT embeddings. There is one notable exception in the case of Canadian English, which is closer to the standard test data than anticipated in both datasets. One potential explanation is that Canadian English and global English share similar exposure to the two mainstream English styles, American English and British English (Boberg, 2012). Further experiments are required to confirm this hypothesis.

Fig. 6: Regional differences in embedding and LDA space for Sentiment140 (left, center) and Reuters21578 (right). The colors are scaled individually for each subplot

We expected long-text data to contain more regional patterns and saw this confirmed in preliminary experiments (Appendix Fig. 12), so we use long-text data subsequently. We acknowledge that the synthetic long-text data complicates interpretation; however, the linguistic differences between L1 English regions are subtle, and the framework aims to capture patterns that are magnified by the extended corpus sequences.

Aiming to investigate the embedding space more closely, we illustrate the differences between the region groups in that space in Fig. 6a. UK English appears more similar to CA, AU, and NZ English than to the rest. South African data is closer to UK data than to US data. This implies that British English emerges as the central point among the language groups. British English is close to Australian English and New Zealand English – this is to be expected given that AU and NZ are former British colonies. These three regions form a proximity group. American English is far from this group. Canadian English and South African English fall between the triangle group and American English. Canadian English is closer to American English (understandable, given the countries’ geographic proximity), while South African English is closer to British English (which can be explained by colonial influences).

Reuters21578 data is considered long-text data as well, since it consists of news articles. Figure 6c demonstrates the regional pattern in the Reuters21578 data. Canadian data and US data, together with the standard test data and the multiple-region data, cluster closely together. Both Australian data and UK data are far from this group. This might be a result of the dominance of American English features in the standard dataset, as it occupies over \(50\%\) of the dataset, as shown in Fig. 5.

The observed differences in embedding space can stem from various contributing factors, including distinct semantic features from different words. Employing the LDA method allows us to further discern patterns in regional features among different regions. As depicted in Fig. 6b, a similar pattern emerges. Furthermore, the correlation with performance evaluation measures aligns between LDA results and BERT embeddings for long-text, as evident in the comparable color patterns in Fig. 7. The LDA results affirm that the distinctive patterns identified in the long-text embedding space largely arise from regional feature differences. LDA fails on Reuters21578 data due to its small data size, as described in Sect. 4.

Fig. 7: Spearman’s \(\rho \) (left) and Kendall’s \(\tau \) measure the relations of different results on Sentiment140. Values close to \(\pm 1\) indicate strong relationships, whereas values close to 0 indicate near-independence. ‘DC’ stands for distance correlation. The green blocks in the first two columns indicate a moderate positive relationship between performance gaps and the regional pattern differences. The red-to-orange blocks in the second column indicate a moderate negative relationship between performance variation and the regional pattern differences

Fig. 8: Spearman’s \(\rho \) (left) and Kendall’s \(\tau \) measure the relations of different results on Reuters21578. Values close to \(\pm 1\) indicate strong relationships, whereas values close to 0 indicate near-independence

5.2 RQ2: What impact do regional differences have on the performance of downstream tasks?

Based on the discussion in the previous section, we know that regional differences do manifest in embedding space. In this section, we discuss the impact of the identified bias on the performance of downstream tasks.

We investigate the impact of regional bias on performance through pairwise correlations of BERT embeddings and performance measures for both datasets, Sentiment140 and Reuters21578. The correlation depicts the relationship between performance and regional feature differences in BERT embeddings and thus presents the impact of the regional bias. Figures 7 and 8 show the correlation matrices (the p-value results of permutation tests for both coefficients are in Appendix Figs. 14 and C5). The impact on performance is observed from two perspectives: the performance scores and the performance responses to the change of models.

Fig. 9: Sentiment classification performance result distribution per region for Sentiment140

We evaluate the performance of the sentiment classification task with accuracy, AUC, precision scores, and recall scores. We measure the impact with correlation coefficients of Spearman and Kendall. The results in Fig. 7 show that long-text BERT embedding differences and LDA space distances have a moderate positive relationship with Accuracy and AUC for the sentiment classification task. This implies that the model performance on two datasets tends to have a larger discrepancy if they have distinct regional features. Recall that the differences from baseline data (standard test data) in embedding space are proportional to each region group’s representation level in the baseline data. The underrepresented region groups tend to have more distinct regional features from baseline data. Thus, the region groups with low representation power in standard test data tend to suffer a performance drop due to the regional feature differences.

The multi-label classification task shows a similar pattern when the performance on Australian data is excluded. The small size of the AU data (35 documents) contributes to an unanticipatedly high performance (see Fig. 10) despite large feature differences from the standard test dataset, as shown in Fig. 6c. Moreover, the differences in BERT embeddings have a moderate positive relationship with precision scores, and short-text Sentiment140 shows the same pattern. This suggests that when two datasets possess different features in the embedding space, the model’s performance on them, evaluated in precision scores, is likely to exhibit a more significant discrepancy. Reuters21578, consisting of news articles, exhibits both the long-text and the short-text embedding patterns because the model is applied directly to the embeddings of documents, which are themselves long. No impact on recall scores has been found for either dataset.

The performance drop for the region groups due to regional bias is demonstrated in Figs. 9 and 10. Sentiment classification performance on New Zealand data is below the average. Model performance on Australian data and UK data is also worse than on the baseline dataset. The sudden improvement of precision scores for positive labels can be explained by the imbalanced class distribution (see Table 2 in “Appendix 3.3”). No significant differences can be observed for the recall score, consistent with the absence of a regional feature impact on recall.

Fig. 10: Multi-label classification performance results per region for Reuters21578

Table 2: Class distribution across datasets. The values are the average counts over 100 repeated runs

In Fig. 10, multi-label classification on UK data has a substantial gap from US data. This can be explained by the large feature differences in the embedding space (see Appendix 3.3.2). The unexpectedly superior model performance on US data persists when we decrease the proportion of US data in the training data, until model training stops early due to data limitations. The resemblance between the Reuters21578 data (news articles) and the collective training data for BERT pre-training (mostly online articles) appears to hinder the model’s ability to grasp certain text features, such as word usage, during fine-tuning. This suggests a possibly dominant influence of US-specific characteristics in the pre-trained BERT model. However, this conjecture needs further experiments to verify.

Lastly, we investigate the impact on the performance variation patterns due to the change of models on regional data groups. The performance correlation examines whether a model exhibits similar enhancement or deterioration patterns on different datasets. When two datasets exhibit similarity, their performances are anticipated to align closely. Figure 7 shows a moderate negative relationship between the performance distance correlation and the long-text BERT embedding differences or long-text region group distances in LDA space. It suggests that model performance variance tends to be similar when two region datasets have similar regional features. This observation may raise concerns about estimating model performance across diverse datasets. It suggests that having a better-trained model does not automatically guarantee improved prediction results for all datasets, especially when datasets exhibit distinct regional features. The performance variation examination implies that some regional features are entangled with the feature space relevant to task performance. For example, the word “scheme” carries a negative connotation in the US but remains neutral in the UK. This divergence in sentiment could potentially impact the performance of the sentiment prediction task on the Sentiment140 dataset. This observation excludes the Reuters21578 dataset, as only one model is trained on it.

Overall, we observe regional feature differences in the embedding space. The variations from the baseline data (standard test data) captured in the embedding space are directly proportional to the representation level of each region group in the baseline data. Region groups with limited representation in standard test data are prone to experiencing a decline in performance due to regional bias. The investigation of the performance distance correlation indicates that possessing a well-trained model does not inherently ensure enhanced prediction outcomes across all datasets, particularly when those datasets manifest distinct regional features.

5.3 Summary

Differences in regional features within the embedding space can be quantified through distribution variances assessed by Wasserstein distance. Such differences reveal three English language groups (US, UK-AU-NZ, CA-ZA), as shown in Fig. 6. The observed performance disparities among regional groups, as indicated in Figs. 9 and 10, can be partially attributed to these regional feature differences, as evidenced by Spearman’s rho and Kendall’s tau in Figs. 7 and 8. The experimental findings underscore that these regional feature differences directly impact precision scores, accuracy, and AUC. Moreover, they demonstrate that such differences influence performance variations across diverse models fine-tuned on different datasets. This suggests that model performance estimates based on general datasets do not translate to regional datasets, which calls for additional caution when using pre-trained LLMs, particularly in underrepresented regions.

6 Related work

This section discusses recent research within related disciplines. Section 6.1 explains the distinctive features of this study compared to existing research on regional biases. Section 6.2 provides an overview of other types of bias found in language models.

6.1 Studies investigating regional differences in LLMs

González et al. (2020) and Sun et al. (2021) delved into the regional impacts on multi-lingual language models (LLMs). González et al. (2020) reveal that languages featuring anti-reflexive pronouns, like Swedish, may introduce unambiguous gender bias due to distinct grammar structures and national linguistic characteristics. They measure the bias by comparing the model performance on sentences with different types of pronouns, such as feminine pronouns. The results show that language models perform worse on sentences with feminine pronouns. Sun et al. (2021) propose three metrics for assessing cross-cultural proximity: language context ratio, literal translation quality, and emotional semantic distance. These metrics illustrate how different languages behave differently. Inspired by the emotional semantic distance, this work seeks to highlight feature differences in English across various regions within the inner circle, despite the common perception of shared culture among these inner-circle regions. Overall, this paper focuses on English language models, i.e., monolingual language models, while the above articles focus on multi-lingual language models.

Numerous studies concentrate on the regional influences on monolingual language models, particularly those focused on English. Tan et al. (2020) employ an inflectional perturbation strategy to generate adversarial attacks on pre-trained language models, simulating the language behaviours of second-language English speakers (L2). Ghosh et al. (2021) identify biases in existing toxicity detection models, noting a tendency to favor offensive words from underrepresented non-standard English (L2) regions. They evaluated the toxicity score sensitivity of country-specific words. The results show that the off-the-shelf toxicity detection model can only weakly detect toxic country-specific slang words that were probably unseen during training. Instead of investigating the model behavior discrepancy in text generation tasks or toxicity detection tasks for L2 English speakers, we explore the differences for L1 (first language) English speakers.

Zhang et al. (2021) demonstrate that pre-trained language models exhibit a bias towards the “white man with a high education level” on cloze-test data in the US, with the exception of BERT-large models. The authors compare the model selection with human selection and show that the model selection overlaps more with the selections of highly educated white men from the US community. This raises concerns about bias against minority social groups in existing language models. Their study investigates model performance differences among English speakers from different cultures within one region, mostly the United States. In contrast, this paper aims to study differences in regional identities, which include all demographic groups within a region, across different geographic areas.

Ma et al. (2022) propose a benchmark dataset for five English-speaking regions and illustrate the word usage differences and performance differences before and after learning regional features. They investigate the regional differences at the performance level, while this paper illustrates bias at the intrinsic level, i.e., as differences in embedding space. Nevertheless, the research by Ma et al. (2022) focuses on bias in resource-abundant regions. This paper strives to diminish dependence on labeled task data by examining regional bias at the word embedding level before the fine-tuning process, where it can potentially impact downstream task performance. However, the proposed dataset is not yet publicly available.

6.2 Beyond regional bias in LLMs

Observation of performance difference is one of the methods to quantify gender bias in NLP and is frequently expressed via accuracy, F1-scores, log-loss of the probability, and false positive rate (Stanczak and Augenstein, 2021). In this paper, we incorporate accuracy, AUC for binary prediction, precision, and recall to quantify bias in performance results but discard the F1-score due to its sensitivity to unbalanced datasets.

Another popular way to illustrate bias, especially stereotype bias, is to apply language models to a test template (Lauscher et al., 2021; Nadeem et al., 2021) in which words with stereotype attributes are missing and must be predicted. Bias is then presented by comparing the predictions. However, template prediction does not seem effective for investigating regional difference bias, as it is difficult to construct a lexicon that properly reflects the differences between regions.

Finally, following Bolukbasi et al. (2016)’s method, Zhao et al. (2019) measure gender with a gender axis, where gender-related words should cluster at two ends of the axis. Such an axis is found by finding a space where anchor sets (such as man:woman) have the largest variation. This method is tied to biases of features with two values, such as gender, and is not suitable for features with more values. To lift this constraint for our research, we use LDA to find the feature axis for multi-class features.

7 Conclusion and future work

Pre-trained language models are widely used despite being biased towards specific social and geographic groups. Particularly for regional bias, research focuses on L2 (English as a second language) regions but neglects bias within L1 (first language) regions, typically excluding low-resource regions entirely.

This paper introduced a novel approach for detecting bias in word embedding spaces for L1 regions characterized by similar culture and language behaviors, with a specific emphasis on addressing the challenges posed by low-resource regions. However, our proposed framework using Wasserstein distance in embedding space and LDA projection is general enough to extend to other types of bias.

We apply our framework to two specific datasets with two distinct downstream tasks. Our study demonstrates that regional bias (1) manifests in embedding space and (2) strongly impacts downstream task performance. When English language features exhibit greater distinctions between two regions, the performance gap of the model on datasets from these regions widens. Regions that are underrepresented in standard data sources or for which language features differ are particularly susceptible to regional bias and performance drops.

The findings indicate the importance of evaluating the performance of LLMs across diverse test datasets, as model efficacy can fluctuate significantly due to variations in feature distributions. Additionally, interpreting the results of LLMs necessitates careful consideration of whether the training datasets accurately reflect the characteristics of the target application data. For us as a research community, our findings imply that we need to dedicate further effort to uncovering hidden effects caused by regional bias in LLMs and to mitigating such bias.

Future research efforts will prioritize the generalization of our methodology and explore mitigation strategies for addressing biases within the inner group of the English language. This includes developing robust approaches to counteract biases in regions where language features deviate from the established standard data source. Another promising avenue of research is to extend the regional bias investigations to alternative representation formats, such as graph embeddings.

Our framework incorporates the embedding space to identify distinct features within groups, along with downstream tasks for evaluating performance. Wasserstein distance and LDA space distance are employed to quantify these distinctive group features. Bias is identified when task performances exhibit moderate or strong correlations with group feature differences, as indicated by significant p-values. In future research, we plan to explore the binary decision threshold for more refined bias detection.