1 Introduction

Differences between training and inference distributions are common in Natural Language Processing (NLP). Such differences arise, for instance, when the input data changes over time or when a model is applied to data from a new domain. This is known as distribution shift (Quionero et al., 2009). Consider the following example from Named Entity Recognition (NER):

Fig. 1: An example of differences in sentences that would be affected by a shift in the distribution of the NER training data

The example shows two phenomena (Fig. 1). First, the entities in the training example tend to be relatively well-known (e.g. EU) and are therefore highly likely to appear in the data sources used by the pre-trained language models (Devlin et al., 2019; Lee et al., 2019) that are widely employed for NLP tasks. Conversely, the entities in the inference example are unique to a particular domain (e.g. IETF). Second, the labels in the training example differ from those in the inference example. This is because entities from different domains have different types, such as "Organization" versus "Protocol", and the same type may be labelled differently, such as "Location" versus "Place". These phenomena correspond to two common shifts in NLP: input distribution shifts and label distribution shifts (Quionero et al., 2009). We show how these two types of shift can affect the performance of an existing classifier with a toy example in Fig. 2. The example depicts situations in which the test distribution differs from the training distribution, often because the underlying relationship between the input x and the label y has changed. When shifts happen, the performance of the pre-trained classifier (shown as dashed lines) often no longer holds. In this work we primarily focus on category shift in the label space (Lekhtman et al., 2021). Despite the substantial body of literature on measuring domain similarity (Dai et al., 2019), detecting when a shift occurs remains a challenging task in the field. This task is known as shift detection.

Fig. 2: Toy examples of input shift and label shift. The dashed lines indicate an existing machine-learning classifier that performs well at training time. We show two possible scenarios when the relationship of x and y changes for each type of shift

A key area where shift detection is useful is domain adaptation, which aims to adapt a model in the presence of distribution shifts (Csurka, 2017). One common supervised approach to adaptation is fine-tuning deep neural networks (Vu et al., 2020). While fine-tuning can be effective, it comes at a cost, such as determining how much additional data is required. To inform this decision, shift detection methods are frequently employed in other areas that use machine learning (Rabanser et al., 2019; Cobb & Van Looveren, 2022; Kulinski et al., 2020). This line of work frequently adopts statistical hypothesis testing as a principled approach to the problem (Rabanser et al., 2019; Cobb & Van Looveren, 2022). Statistical two-sample testing is a methodology for determining whether the distribution of the training data p is equivalent to the distribution of the test data q. While this approach has been explored for computer vision tasks involving high-dimensional data, it has seen limited application to NLP.

Hence, to better inform these decisions and quantify the potential impact of distribution shifts, this paper performs a systematic investigation of shifts across benchmark corpora using statistical tests, which have been widely adopted for shift detection in other machine learning contexts. We specifically focus on the NER task and detect distribution shifts across 12 datasets that are representative of various domains. We use word frequencies and sentence-level representations to characterize input distributions, and label frequencies to characterize label distributions. Appropriate statistical tests are identified for each representation and employed to detect and quantify shifts. We then investigate the impact of domain shift in both the input and the label space on performance in the supervised setting, and establish a relationship between the shift distance and the performance degradation. These results provide insights into which statistical test one needs to perform to make such a determination.

Summarizing, the contributions of this paper are as follows:

  • The systematic measurement of distribution shift between 12 NER benchmark datasets covering multiple domains.

  • The systematic measurement of how much a distribution shift impacts performance for NER, a prototypical NLP task.

  • Evidence that sentence-based representations provide better information for shift detection for the NER task.

2 Related work

Distribution shifts are prominent in real-world applications (Specia et al., 2020; Michel, 2021; Recht et al., 2019; Engstrom et al., 2019), leading to growing interest in detecting them for machine learning tasks (Rabanser et al., 2019; Cobb & Van Looveren, 2022; Kulinski et al., 2020).


Shift types In the broader landscape of machine learning, Wiles et al. (2022) conducted a fine-grained analysis of distribution shifts, classifying them as spurious correlation, low data drift, and unseen data shift. Additionally, they evaluated 19 different methods on both synthetic and real-world datasets for vision tasks.


The Use of Statistical Tests The use of statistical tests for dataset shift detection was brought to the fore by Rabanser et al. (2019). In their work, they developed a dataset shift detection framework which contains a dimensionality reduction component and a two-sample testing component. They investigated multiple combinations of methods for each component, and tested on artificially generated covariate and label distribution shifts. Recently, building on two-sample tests for shift detection, Cobb & Van Looveren (2022) developed a general drift detection framework borrowing machinery from causal inference. The framework is designed for situations in which the inference data are not expected to form an i.i.d. sample from the historical data distribution.


Domain similarity Within the field of NLP, researchers have explored various methods for measuring domain similarity in the context of domain adaptation, including target vocabulary coverage rate and language model perplexity (Dai et al., 2019). However, these methods work well only under the assumption that there is sufficient data from the source and target distributions. Therefore, in our work we adopt a non-parametric statistical hypothesis testing framework to detect shift without knowing the actual parameters of the population.


Shift detection in NLP Within NLP, Arora et al. (2021) focused on out-of-distribution texts, categorizing shifts into background shift and semantic shift. They compared model calibration and density estimation methods for shift detection across 14 pairs of natural language understanding datasets. We investigate different types of shifts than these works.

Given the importance of shift detection, a number of benchmark datasets have been developed (Koh et al., 2021; Malinin et al., 2022); however, none target the widely used task of NER.

Our work adds to this existing literature. First, we employ widely used labelled NER datasets and compare not only changes in field (e.g. science to finance) but also changes in text style (e.g. news-style text to social-media-style text). Second, we test the impact of representation choice on shift detection. Lastly, we provide new evidence for the impact of distribution shifts on task performance.

Table 1 List of annotated datasets for English NER from different domains

3 Methodology

Our methodology consists of the following steps: data collection, representation choice, statistical hypothesis testing, and shift impact measurement. For data collection, we acquire datasets from different domains. Domains are characterized by their language usage, ranging from the style of writing employed to the vocabulary particular to a given field. For all datasets, both the space of input text and the space of labels are considered. In terms of representations, two types of representations are used for the input and one for the labels. Statistical hypothesis tests appropriate for each representation are used to detect distribution shifts, and the calculated statistics are then used to measure the extent of a shift. Lastly, the impact of each shift on model performance is determined. We now walk through each of these steps in detail.

3.1 Data collection

We collected 12 datasets from different domains covering news, social media, encyclopedic content, finance, science, emails, and business. Table 1 shows the list of datasets with the year of publication, document source, domain and number of entity types. Table 2 shows the list of datasets and their entity types. We group the datasets into five categories, which we now describe in turn.

Table 2 List of NER datasets with corresponding entity types

3.1.1 Wiki data

GUM (Zeldes, 2017) (the Georgetown University Multilayer Corpus) is collected and expanded as part of a course curriculum. The current corpus contains texts from public wikis (e.g. Wikinews, Wikivoyage, wikiHow, Wikipedia) as well as social media sites (e.g. Reddit, YouTube). Example types include event, time, animal and abstract. Wikigold (Balasuriya et al., 2009) is a gold-standard NER dataset sourced from Wikipedia. Wikigold has standard types such as person and organization.

3.1.2 Informal text

Formal texts such as news and Wikipedia articles are normally verified by multiple people, sometimes even experts. Hence, the majority of the text has correct grammar and spelling. In comparison, user-generated informal data such as social media text often contains less formal language usage characterized by slang, poor grammar, misspellings, the use of satire, etc. BTC (Derczynski et al., 2016) (Broad Twitter Corpus) is a NER dataset sourced from Twitter that includes tweets not only on general topics but also on specific topics such as disasters. BTC includes 3 types: person, location and organization. WNUT17 (Derczynski et al., 2017) is a NER dataset whose text sources are Reddit, Twitter, YouTube and StackExchange comments. WNUT17 focuses especially on emerging and rare entities. The dataset contains 6 types, including creative, corporation and product, besides common types. CEREC (Dakle et al., 2020) is a large-scale corpus for entity resolution in email conversations. The emails are taken from the first large public email corpus, the Enron Email Corpus (Klimt & Yang, 2004), which contains emails of 150 employees of the Enron Corporation. CEREC contains standard types such as person as well as a digits type.

3.1.3 Specific fields

AnEm (Ohta et al., 2012) is a corpus annotated with species-independent anatomical entity mentions. The texts are academic papers from the biomedical scientific literature. AnEm contains 11 domain-specific types such as organ, cell and organism_substance. i2b2 (Uzuner, Ö., Luo, Y., Szolovits, P., 2007) is a corpus that contains unstructured clinical notes from the Research Patient Data Registry at Partners Healthcare. The dataset consists of 8 types such as hospital, phone and doctor. SEC-filings (Salinas Alvarado et al., 2015) (U.S. Security and Exchange Commission filings) is a randomly selected and manually annotated finance dataset. The texts are from public-domain financial reports. The dataset includes the standard types from CoNLL, i.e. organization, person, location and misc. SciERC (Luan et al., 2018) is a dataset that includes annotations for scientific entities in 500 scientific abstracts from AI conferences and workshop proceedings. The dataset focuses on science-related types including material, method and task. Re3d (Dstl & Laboratory, n.d.) was constructed from documents relevant to the defence and security analysis domain, specifically focusing on the conflict in Syria and Iraq. It includes domain-specific types such as weapons and military platform.

3.1.4 News

CoNLL-03 (Tjong Kim Sang & De Meulder, 2003) is a dataset in which the texts are taken from Reuters news stories from 1996 and 1997. It contains the standard types: person, location, organization and misc.

3.1.5 General

OntoNotes (Weischedel et al., 2017) is a large annotated corpus that consists of various genres of text, including news, conversational telephone speech, weblogs, newsgroups, broadcast, and talk shows. OntoNotes includes a large variety of types (18), covering common types as well as less common ones such as money and percent.

Even though the datasets are grouped into five categories, there is still overlap. Wiki-based datasets and OntoNotes or CoNLL belong to different categories, but they might share many similar general entities. This is because common entities in the news are highly likely to have Wikipedia pages. Intuitively, the “similarity” between datasets in the wiki group should be larger. Conversely, the “similarity” between the domain-specific financial dataset SEC and the news dataset CoNLL should be smaller. We introduce methods to statistically quantify the distance between datasets in the following sections.

For all datasets, we preprocess as follows. Duplicates are removed to prevent overfitting. Labels are unified across datasets, as shown in Appendix A: different datasets use different labels to refer to the same type, so, to better compare performance, we unify labels with the same semantic meaning. For example, 'person' and 'PER' are unified under the same label.

3.2 Shift detection and measurement

We use statistical testing to determine and measure shifts between datasets. Formally, given labeled source domain data \(\{({\varvec{x}}_1, {\varvec{y}}_1),\ldots , ({\varvec{x}}_n, {\varvec{y}}_n)\} \sim p\) and labeled target domain data \(\{(\varvec{x'}_1, \varvec{y'}_1),\ldots , (\varvec{x'}_n, \varvec{y'}_n)\} \sim q\), shift detection determines whether p equals q. The null hypothesis is \(H_0: p = q\) and the alternative hypothesis is \(H_1: p \ne q\). The test statistics are used as shift measurements. We investigate shifts occurring in both the input distribution \(p({\varvec{x}})\) and the label distribution \(p({\varvec{y}})\).

When forming the dataset pairs for hypothesis testing, the distance functions we use are symmetric, meaning that for a distance function Dist, Dist(p, q) = Dist(q, p). Therefore, we measure the distribution shifts using combinations without repetition. Additionally, we include the distance of each dataset to itself as a sanity check. This approach results in 66 unique combinations plus 12 self-comparisons, for a total of 78 pairs.
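For illustration, the pair enumeration can be sketched as follows (the dataset names are placeholders; this is not the paper's code):

```python
from itertools import combinations

datasets = [f"D{i}" for i in range(1, 13)]  # 12 placeholder dataset names

# 66 unordered pairs of distinct datasets, since Dist(p, q) = Dist(q, p)
cross_pairs = list(combinations(datasets, 2))
# 12 self-comparisons used as a sanity check
self_pairs = [(d, d) for d in datasets]

pairs = cross_pairs + self_pairs
print(len(cross_pairs), len(self_pairs), len(pairs))  # 66 12 78
```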

We now discuss the representations we use for the datasets and the corresponding statistical tests we employ.

3.2.1 Representation for input space

We investigate two different representations for the input space.

Word frequencies: in this setting, \({\varvec{x}}\) represents the frequency of each word. The underlying assumption is that the number of occurrences of a word within a dataset indicates how important that word is. The word frequency distribution over the vocabulary represents the dataset.
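As a minimal sketch (assuming tokenized sentences; the helper names are ours, not the paper's), the frequency representation can be built and aligned over a shared vocabulary as follows:

```python
from collections import Counter

def word_frequencies(sentences):
    """Count word occurrences over a tokenized corpus (a list of token lists)."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return counts

def aligned_counts(counts_p, counts_q):
    """Align two frequency distributions over the union of their vocabularies."""
    vocab = sorted(set(counts_p) | set(counts_q))
    obs_p = [counts_p.get(w, 0) for w in vocab]
    obs_q = [counts_q.get(w, 0) for w in vocab]
    return vocab, obs_p, obs_q
```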

Distributional representation: in this setting, each instance of \({\varvec{x}}\) is an n-dimensional vector representation of a sentence within a dataset. Sentence-BERT (Reimers & Gurevych, 2019) is used to encode each sentence. The idea behind Sentence-BERT is that semantically similar sentences are closer in vector space (Reimers & Gurevych, 2019). The data points in this n-dimensional space constitute the distribution for each dataset.
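A minimal encoding sketch is shown below; the sentence-transformers package and the particular checkpoint name are assumptions, since the paper only states that Sentence-BERT is used:

```python
from sentence_transformers import SentenceTransformer

# The checkpoint is an assumption; any Sentence-BERT model yields n-dimensional vectors.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def encode_dataset(sentences):
    """Map each sentence to a vector; the resulting point cloud is treated as the
    dataset's input distribution used by the MMD test described below."""
    return encoder.encode(sentences, convert_to_numpy=True)
```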

3.2.2 Representation for label space

Within the NER task, datasets from different domains have different types of entities. We use category counts as our label distribution \(p({\varvec{y}})\). Among different domains, the most general types include Person, Organization, Location and Miscellaneous. As mentioned in subsection 3.1, we unify labels with the same semantic meanings. We note that very field-specific datasets will have a different label space than more general datasets.

Following recent work on distribution shifts, for the label space we formulate the problem as one of unseen data shift, where some attribute values are unseen under p but are seen under q (Wiles et al., 2022). For example, the type Method might have zero observations in many datasets, such as CoNLL and Wikigold, but many observations in SciERC. This does not necessarily mean that there are no entities of type Method in the Wikigold dataset; rather, due to the specific data generation process, those entities are not annotated. We see this as an outcome of different sampling processes. We assume all datasets share a common label set \({\varvec{A}}^l\) and that some labels in the set are unseen in p but seen in q due to systematic sampling error.
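A minimal sketch of this construction, assuming a hypothetical unified label set, is:

```python
from collections import Counter

# Hypothetical common label set A^l after unification; the actual set is given in Appendix A.
COMMON_LABELS = ["Person", "Organization", "Location", "Miscellaneous", "Method"]

def label_distribution(entity_labels):
    """Count entity types over the shared label set; labels unseen in a dataset get zero."""
    counts = Counter(entity_labels)
    return [counts.get(label, 0) for label in COMMON_LABELS]
```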

3.2.3 Statistical hypothesis testing

For each type of representation, a different statistical test is necessary, which we now detail. Shift decisions are reported based on the significance level; by default, we use 0.05 as the significance threshold for all tests. Furthermore, we use statistical testing as a means to measure distribution shift and draw a connection between shift and performance.

Chi-Squared Test For the word frequency distribution of the input and the label count distribution, each sample \({\varvec{x}}_n\) is one categorical value representing a word (or label) occurrence in the domain. We adopt Pearson's Chi-Squared test, a non-parametric test for determining whether two frequency distributions are the same. The crucial underlying assumption is that a corpus is modelled as a sequence of independent Bernoulli trials. The relevant statistic \(\chi ^2\) is computed as:

$$\begin{aligned} \chi ^2 = \sum _{i=1}^2 \sum _{j=1}^C \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \end{aligned}$$

where \(O_{ij}\) is the observed count for category j in sample i and \(E_{ij}\) is the corresponding expected count. Words with fewer than 5 occurrences are filtered out.

There has been a long debate about whether the chi-squared test, or statistical testing in general, should be applied in corpus linguistics (Gries, 2005). However, it is still widely employed within the literature (Rabanser et al., 2019). Given that the distribution shift literature also employs chi-squared testing, we make use of it here as well.

We employ two data processing procedures when using Chi-Squared tests. First, before applying the test, we implement a normalization procedure to ensure that the observed and expected values are on the same scale (Underhill & Bradfield, 2013). This normalization makes the test more robust to differing sample sizes. Second, by design, the label distribution may contain a considerable number of zeros for certain categories. Since the Chi-Squared test is not viable when expected counts are zero, we add a small constant (\(1e-5\)) to each category so that we obtain results without changing their numerical meaning.
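A minimal sketch of this test is shown below; the simple rescaling stands in for the normalization procedure of Underhill & Bradfield (2013), and scipy's contingency-table test stands in for the two-sample formula above (both are assumptions on implementation details):

```python
import numpy as np
from scipy.stats import chi2_contingency

EPS = 1e-5     # small constant added to avoid zero counts
ALPHA = 0.05   # significance threshold

def chi_squared_shift(counts_p, counts_q, scale=1000.0):
    """Two-sample chi-squared test on aligned category (or word) counts."""
    p = np.asarray(counts_p, dtype=float)
    q = np.asarray(counts_q, dtype=float)
    # Rescale both samples to a common total and smooth zero categories.
    p = p / p.sum() * scale + EPS
    q = q / q.sum() * scale + EPS
    stat, p_value, _, _ = chi2_contingency(np.stack([p, q]))
    return stat, p_value, p_value < ALPHA  # statistic, p-value, shift decision
```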

Another potential test for this sort of distribution is the Kolmogorov-Smirnov (KS) two-sample test. However, this test compares cumulative distributions, which requires the values to be ordered, and sorting the items of a vocabulary is not meaningful.


Maximum Mean Discrepancy (MMD) For the multi-dimensional representations obtained from sentence-BERT, we employ MMD (Gretton et al., 2012), a non-parametric kernel-based two-sample test that determines whether two samples are drawn from different distributions p and q. MMD computes the \(L_2\) distance between the mean embeddings \(\mu _p\) and \(\mu _q\) of the distributions in a reproducing kernel Hilbert space \({\mathcal {F}}\):

$$\begin{aligned} \textbf{MMD}^2(P, Q) = \langle \mu _p, \mu _p\rangle - 2\langle \mu _p, \mu _q\rangle + \langle \mu _q, \mu _q\rangle . \end{aligned}$$

Empirically, we use the unbiased estimate of the squared MMD statistic:

$$\begin{aligned} \textbf{MMD}^2 = \frac{1}{m^2 - m} \sum _{i=1}^m \sum _{j\ne i}^m \kappa ({\varvec{x}}_i, {\varvec{x}}_j) + \frac{1}{n^2-n} \sum _{i=1}^n \sum _{j\ne i}^n \kappa ({\varvec{x}}'_i, {\varvec{x}}'_j) - \frac{2}{mn} \sum _{i=1}^m \sum _{j=1}^n \kappa ({\varvec{x}}_i, {\varvec{x}}'_j) \end{aligned}$$

where the kernel is a squared exponential function \(\kappa ({\varvec{x}}, \tilde{{\varvec{x}}})=e^{-\frac{1}{\sigma }\Vert {\varvec{x}} - \tilde{{\varvec{x}}}\Vert ^2}\) and \(\sigma\) is set to the median distance between points (Gretton et al., 2012).
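The unbiased estimator can be sketched as follows; the exact form of the median heuristic is an assumption:

```python
import numpy as np
from scipy.spatial.distance import cdist

def mmd2_unbiased(X, Y):
    """Unbiased estimate of squared MMD between two sets of sentence embeddings,
    using the kernel k(x, x') = exp(-||x - x'||^2 / sigma) with sigma set to the
    median pairwise distance over the pooled samples."""
    Z = np.vstack([X, Y])
    sq_dists = cdist(Z, Z, metric="sqeuclidean")
    sigma = np.median(np.sqrt(sq_dists[sq_dists > 0]))

    def k(A, B):
        return np.exp(-cdist(A, B, metric="sqeuclidean") / sigma)

    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    # Drop the diagonal (i = j) terms to obtain the unbiased estimate.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.sum() / (m * n)
```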

3.3 Impact on performance

The last step in our method is to determine how shift affects model performance. Our hypothesis is that as the degree of distribution shift increases, so does the likelihood that a model makes errors, and hence the degradation in performance. We use BERT (Devlin et al., 2019), one of the most widely used pre-trained language models, to measure performance. Specifically, we measure the effect on performance of shifts in p(x) and shifts in p(y).

We fine-tune the BERT model on one dataset and then test its performance across all datasets to identify any performance degradation. When adapting BERT for NER, we treat it as a sequence labelling task using the BIO tagging scheme, where each token in a sentence is tagged as Begin, Inside, or Outside to indicate named entity boundaries. We then add one fully-connected dense layer for predicting tags, and a cross-entropy loss is used to compare the predicted tag sequence with the gold tag sequence.
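For instance, under the BIO scheme a sentence might be tagged as follows (a hypothetical illustration; the tag inventory depends on each dataset's unified label set):

```python
# Hypothetical (token, BIO tag) pairs: "B-" opens an entity span,
# "I-" continues it, and "O" marks tokens outside any entity.
tagged_sentence = [
    ("EU", "B-Organization"),
    ("rejects", "O"),
    ("German", "B-Miscellaneous"),
    ("call", "O"),
]
```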

Specifically, we first split each dataset into a training set and an inference set, ensuring that every model trained on one dataset can also be tested on that same dataset. We follow the classic machine learning split ratio of 80:20, training on 80% of the dataset and testing on 20%. We then pair any two of the datasets and use the training set of the first as the source domain and the inference set of the second as the target domain. We fine-tune the original baseline model on the source training set and evaluate it on the target inference set. Fine-tuning is performed for 10 epochs. As in the original BERT paper, we use a batch size of 32 and a learning rate of 5e-5. To ensure robustness in our results, we report the average performance across five trials, each with a different random sampling. We train and test our models on a GeForce 1080Ti GPU with 11 GB of RAM. The fine-tuned BERT model's architecture consists of 12 layers of attention blocks, each with a hidden size of 768 in the embedding layers. We follow the standard configuration of the BERT-base model, which comprises approximately 110 million parameters (Devlin et al., 2019). When fine-tuning it for the NER task, we add one more fully-connected dense layer. The number of parameters for each output neuron is hidden size + bias (768 + 1) = 769, so the layer adds \(num\_tags\) * 769 \(\approx\) 15,380 parameters in total. Our code for both the hypothesis testing and the evaluation is available as supplementary material.
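A minimal fine-tuning sketch using the Hugging Face transformers library is given below; the checkpoint name and the use of the Trainer API are assumptions, while the epoch count, batch size, and learning rate follow the settings reported above:

```python
from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments)

def finetune_ner(train_dataset, eval_dataset, num_tags,
                 model_name="bert-base-cased"):
    """Fine-tune BERT for NER as token classification (minimal sketch).

    train_dataset / eval_dataset are assumed to be tokenized with BIO label ids
    aligned to sub-word tokens; their construction is omitted here.
    """
    # BERT encoder plus one fully-connected layer over the tag set,
    # trained with cross-entropy loss against the gold tags.
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=num_tags)
    args = TrainingArguments(
        output_dir="ner-finetune",
        num_train_epochs=10,               # training settings reported above
        per_device_train_batch_size=32,
        learning_rate=5e-5,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    return trainer.evaluate()
```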

To draw a connection between distribution shifts and performance degradation, we calculate the correlation between the measured shift and the performance difference \(perf_{ab}\) between any two datasets a and b.
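A minimal sketch of this step is shown below; Pearson's r is our assumption for the correlation coefficient, consistent with the fitted linear regressions and p-values reported in Figs. 7, 8 and 9:

```python
from scipy.stats import pearsonr

def shift_performance_correlation(shift_distances, perf_differences):
    """Correlate shift statistics with performance differences over dataset pairs."""
    r, p_value = pearsonr(shift_distances, perf_differences)
    return r, p_value
```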

3.4 Experimental setup

We conduct various experiments under different setups. For shift detection, we first verify the validity of the tests on samples drawn from the same corpus: if no shift is detected, the test behaves as expected when the two distributions are the same. We subsequently apply the tests to all pairs of datasets.

As illustrated in Table 2, the datasets vary in size. In general, ML models perform better when fine-tuned on more data. To mitigate potential biases arising from data size, we uniformly sample a subset of 948 samples from each dataset and perform all experiments on these subsets; 948 is the minimum number of samples among all datasets (from the Re3d dataset). In practice, engineers often fine-tune the model on the full dataset for better performance. To investigate the correlation under this scenario, we also apply identical tests and performance measures to the original full-sized datasets; these results are included in Appendix A.

For the performance measures, BERT is pre-trained on a particular scope of texts and may favour datasets from certain domains. To address this potential bias, we use both BERT-base and BioBERT-base (Lee et al., 2019) and compare their respective performance. The complete results are provided in Appendix B.

Fig. 3: GUM and Wikigold

Fig. 4: WNUT-17 and BTC

Fig. 5: SciERC, SEC and CEREC

Fig. 6: WNUT-17, BTC and CEREC

Table 3 MMD statistics for the input embedding distribution on full-sized datasets with different numbers of samples

4 Results and discussion

We now present the results of applying the method detailed above. We begin with an analysis of the input datasets to verify our hypothesis that shifts between distributions reflect shifts between domains.

4.1 Datasets analysis

Figures 3, 4, 5 and 6 show the word frequency plots for selected pairs of datasets. WNUT-17 and BTC both include text from Twitter, and we see that the word frequencies are similar across the two datasets. Conversely, for the SciERC, SEC and CEREC datasets, which more distinctly represent different domains, we observe greater dispersion among their respective word frequency distributions. Intuitively, these results suggest that word frequency distributions can serve as an indicator of a domain.

4.2 Hypothesis testing

Owing to space limitations, Tables 3, 4 and 5 display only a subset of the hypothesis testing results for the corresponding distributions. We selected two dataset pairs in which the source and target distributions come from the same dataset, followed by the top 5 and bottom 5 pairs in ascending order. All subsequent tables are constructed in the same way. All tests are conducted on both the sampled and the full datasets, and complete results for all dataset pairs are available in Appendix B.

4.2.1 Chi-squared testing for input distribution

Table 4 shows the chi-squared testing on both sampled dataset pairs and original-sized dataset pairs. For the Chi-squared tests, the shift decision is based on the p-value of the test: a common cutoff for rejecting the null hypothesis is a p-value below 0.05, indicating a statistically significant difference in distributions. The Chi-squared statistics serve as a proxy for the distribution differences; the higher the value, the greater the disparity. The distance between a dataset and itself is also reported as a sanity check. When the source and target distributions are identical, the test indicates that no shift is detected, showing that it is capable of identifying when two distributions are the same.

For the full-sized datasets, the left table in Table 4 reveals that among the 78 dataset pairs, 13 pairs are detected as shifted. Meanwhile, the right table indicates that for the sampled datasets, 22 of the same 78 pairs are detected as shifted. These results suggest that the test is more sensitive to shifts when the sample size is smaller. On closer inspection of the dataset pairs, we observe that of the 13 shift-detected full-sized pairs, 11 are also detected in the sampled datasets, an agreement of approximately 84.6%. Additionally, the full results in Table 4 reveal that a higher Chi-Squared value does not necessarily imply the detection of a shift. For instance, while the OntoNotes and i2b2 datasets have a high Chi-Squared value, no shift is detected. This outcome could arise from the data samples being non-representative of the full distribution, preventing the test from reaching a confident conclusion.

Analyzing these results, we note that the BTC and WNUT-17 datasets have the smallest distance, which is in line with the frequency plots in Fig. 4. On the other hand, the BTC dataset and the SEC finance dataset are the furthest apart, which, as expected, reflects that these two datasets have very different text styles. One surprising outcome is GUM and SciERC, which have a relatively small distance under this representation while appearing to come from different domains. These examples illustrate that this test can quantify the distance between datasets (Fig. 10).

Table 4 Chi-squared statistics for input frequency distribution on full-sized datasets (left) and sampled datasets (948 samples) (right) in ascending order

4.2.2 MMD testing for input distribution

For the distributional representations, we apply MMD with different numbers of samples (i.e. embedded sentences), n = 5, 50, 200, 500, 1000 and 2000. Table 3 shows the results of these tests, ordered by the scores obtained with 2000 samples. We measure the difference between a dataset and itself as a sanity check.

Intuitively, the MMD test evaluates whether there is a significant difference between two distributions: a higher MMD value suggests a greater disparity. The null hypothesis that the distributions are identical is rejected when the MMD statistic is significantly high. In our study, MMD values can appear negative due to estimation error with smaller samples or because of the kernel choice; in these cases the absolute value of MMD should be considered. Typically, a significance threshold is set above which the null hypothesis can be rejected; we use 0.05 as our threshold.

CoNLL, a widely used benchmark dataset for NER, is surprisingly far from the other datasets under the distance measured by the chi-squared test. With the MMD tests, however, the distance is fairly close. This is an indication that the sentence-level representation provides more information than the word-frequency representation.

4.2.3 Label distribution

To detect category shift in the label distribution, we used the Chi-squared test described in subsection 3.2.3. This test was performed on both the sampled and the full-sized datasets, and the results are presented in Table 5. The results reveal a significant difference between the datasets that share the same categories and those that have different categories.

Datasets focused on specialized fields typically contain more specific labels. Consequently, their dissimilarity to datasets from general domains is greater. For example, while the input shift between BTC and WNUT17 may be small, the label shift is relatively large due to their distinct label spaces. For the NER task, generalizing model performance to datasets with distinct categories is more challenging, as evidenced in the following section.

Table 5 Chi-squared testing for label distributions of full-sized datasets (left) and sampled datasets (948 samples) (right)
Table 6 Micro-average F1 score when the model is fine-tuned on the source dataset (the row) and tested on the target dataset (the column)
Table 7 F1 scores on full-sized datasets with BERT-base
Table 8 F1 scores on full-sized datasets with BioBERT-base
Fig. 7: Plots for Chi-squared measures with word frequency input distribution and performance difference. Linear regression model fitted

Fig. 8: Plots for MMD measures with sentence-level input distributions and performance difference. Linear regression model fitted

Fig. 9: Plots for Chi-squared measures with label distribution and performance difference. Linear regression model fitted

4.3 Performance measurement

As noted in subsection 3.4, we conducted four sets of experiments: fine-tuning models on both the original-sized and the sampled (948-sample) datasets using both BERT-base and BioBERT-base. All performance results are reported as the average of five trials with different random sampling. The complete results can be seen in Appendix B.

Tables 6, 7 and 8 present the micro-averaged F1 performance of models trained and tested on the sampled datasets with BERT, the full-sized datasets with BERT, and the full-sized datasets with BioBERT, respectively. Each row indicates the dataset on which the model is fine-tuned, and each column indicates the dataset on which the fine-tuned model is tested. All results are averages of 5 trials; the standard deviation ranges from 0.0 to 0.07. The last column reports the average F1 score across all test sets, which represents the generalization ability of the model when fine-tuned on a specific dataset. The rows are ordered by this average F1 score.

Table 7 reveals that even though BTC and WNUT17 contain texts from the same domain, the model’s generalization ability on WNUT17 decreases significantly when fine-tuned on BTC, as there is a significant label category shift between these two datasets.

Comparing Tables 6 and 7, we observe that when we control the number of data samples, the average F1 scores tend to decrease. However, the overall rankings of the datasets are similar between the two tables, except for WNUT17 and Re3d. For most datasets, using a subset of the original data reduces generalization ability significantly, indicating that the additional samples in the original dataset help improve generalization. Conversely, for Re3d, generalization ability increases as the number of samples decreases, indicating that the additional samples in that dataset harm generalization.

Due to mutually exclusive sets of categories, we encounter many zero F1 scores on the AnEM and SciERC datasets. Even though fine-tuning improves performance on the same dataset, the generalization ability is low, indicating that category shift has a significant impact on performance.

Comparing Tables 7 and 8, we observe that the fine-tuning performance is slightly impacted by the text on which these language models were pre-trained. However, the average F1 score rankings remain the same.

4.4 Correlation

To further investigate how shifts impact model performance, we report the correlation between the test statistics and the performance differences in Figs. 7, 8 and 9. Within each plot, the x-axis represents the distances calculated by the different hypothesis tests, and the y-axis represents the performance difference between a source dataset and a target dataset; the higher the performance difference, the worse the generalization ability of the model. Given datasets \(D_a\) and \(D_b\), \(Perf_{ab}\) denotes the performance difference between \(D_a\) and \(D_b\) when the model is fine-tuned on a source dataset \(D_s \in \{D_1, D_2, \ldots , D_{12}\}\), and \(Shift_{ab}\) is the distance between \(D_a\) and \(D_b\) under each statistical test. The correlation is calculated between \(Perf\) and \(Shift\).

Based on the presented plots, it is evident that the label category shift shows the most statistically significant correlation (p < 0.0001) with model performance. This finding suggests that category shift can serve as a reliable indicator of model performance in a supervised setting when evaluating on a new domain. With respect to input distribution shift, while the word frequency distribution's correlation with model performance is the lowest, it is still significant (p = 0.042). The MMD results reveal a moderately strong correlation (p = 0.002). This indicates that in an unsupervised setting, MMD testing on the sentence-level representation distribution can be used to estimate model performance when transferring between domains.

5 Conclusion

In this work, we investigated input data and label distribution shifts across 12 benchmark NER datasets. We compared two different types of representations for input shifts. We systematically measured the shifts using the lens of statistical testing. We measured performance differences by fine-tuning BERT models and calculating the correlation between shifts and performance.

The results show that both the word frequency distribution and the sentence-level distributional representation are useful for ascertaining shift: changing between domains results in measurable differences in distribution. The results also show that, for NER, label shift correlates more strongly with performance degradation than input shift, although input shift still correlates with performance degradation. Here, sentence-level representations provide a stronger signal for the relation between distribution shift and performance.

Based on these results, we believe that shift detection and the measurement of distribution shifts can play important roles in tackling NLP tasks, especially for new and low-resource domains. In particular, when applying a model to a new domain, or as data changes, the measures detailed above can help researchers and practitioners decide whether the expense of gathering new annotated data and subsequent fine-tuning is warranted. In the future, we hope that distribution shift measurement can become part of widely used NLP paradigms such as crowd-sourcing and active learning.