1 Introduction

Differences between training and inference distributions are common in Natural Language Processing (NLP). Such differences arise, for instance, when the input data changes over time or when a model is applied to data from a new domain. This is known as distribution shift (Quionero et al., 2009). Consider the following example from Named Entity Recognition (NER):

Fig. 1: An example of differences in sentences that would be affected by a shift in the distribution of the NER training data

The example shows two phenomena (Fig. 1). First, the entities in the training example tend to be relatively well-known (e.g. EU) and are therefore highly likely to appear in the data sources used by the pre-trained language models (Devlin et al., 2019; Lee et al., 2019) that are widely employed for NLP tasks. Conversely, the entities in the inference example are unique to a particular domain (e.g. IETF). Second, the labels in the training example differ from those in the inference example. This is because entities from different domains have different types, such as "Organization" versus "Protocol", and the same type may be labelled differently, such as "Location" versus "Place". These phenomena correspond to two common shifts in NLP: input distribution shifts and label distribution shifts (Quionero et al., 2009). We show how these two types of shift can affect the performance of an existing classifier with a toy example in Fig. 2. The example depicts situations in which the test distribution differs from the training distribution, often because the underlying relationship between the input x and the label y has changed. When shifts happen, the performance of the pre-trained classifier (shown as dashed lines) often no longer holds. In this work we primarily focus on category shift in the label space (Lekhtman et al., 2021). Despite the substantial body of literature on measuring domain similarity (Dai et al., 2019), detecting when a shift occurs remains a challenging task in the field. This task is known as shift detection.

Fig. 2: Toy examples of input shift and label shift. The dashed lines indicate an existing machine-learning classifier that performs well at training time. We show two possible scenarios when the relationship of x and y changes for each type of shift

A key area where shift detection is useful is domain adaptation, which aims to adapt a model in the presence of distribution shifts (Csurka, 2017). One common supervised approach to adaptation is fine-tuning deep neural networks (Vu et al., 2020). While fine-tuning can be effective, it comes at a cost, such as determining how much additional data is required. To inform this decision, shift detection methods are frequently employed in other areas that use machine learning (Rabanser et al., 2019; Cobb & Van Looveren, 2022; Kulinski et al., 2020). This line of work frequently adopts statistical hypothesis testing as a principled approach to the problem (Rabanser et al., 2019; Cobb & Van Looveren, 2022). Statistical two-sample testing is a methodology for determining whether the distribution of the training data p is equivalent to the distribution of the test data q. While this approach has been explored for computer vision tasks involving high-dimensional data, it has seen limited application to NLP.

Hence, to better inform these decisions and quantify the potential impact of distribution shifts, this paper performs a systematic investigation of shifts across benchmark corpora using statistical tests, which have been widely adopted for shift detection in other machine learning contexts. We specifically focus on the NER task and detect distribution shifts across 12 datasets that are representative of various domains. We use word frequencies and sentence-level representations to characterize input distributions, and label frequencies to characterize label distributions. Appropriate statistical tests are identified for each representation and employed to detect and quantify shifts. We then investigate the impact of domain shift in both the input and the label space on performance in the supervised setting, and establish a relationship between the shift distance and the performance degradation. These results provide insights into which statistical test one needs to perform to make such a determination.

Summarizing, the contributions of this paper are as follows:

  • The systematic measurement of distribution shift between 12 NER benchmark datasets covering multiple domains.

  • The systematic measurement of how much a distribution shift impacts performance for NER, a prototypical NLP task.

  • Evidence that sentence-based representations provide better information for shift detection for the NER task.

2 Related work

Distribution shifts are prominent in real-world applications (Specia et al., 2020; Michel, 2021; Recht et al., 2019; Engstrom et al., 2019), leading to growing interest in detecting them for machine learning tasks (Rabanser et al., 2019; Cobb & Van Looveren, 2022; Kulinski et al., 2020).


Shift types In the broader landscape of machine learning, Wiles et al. (2022) conducted a fine-grained analysis of distribution shifts, classifying them as spurious correlation, low data drift, and unseen data shift. Additionally, they evaluated 19 different methods on both synthetic and real-world datasets for vision tasks.


The Use of Statistical Tests The use of statistical tests for dataset shift detection was brought to the fore by Rabanser et al. (2019). In their work, they developed a dataset shift detection framework which contains a dimensionality reduction component and a two-sample testing component. They investigated multiple combinations of methods for each component, and tested on artificially generated covariate and label distribution shifts. Recently, building on two-sample tests for shift detection, Cobb & Van Looveren (2022) developed a general drift detection framework borrowing machinery from causal inference. The framework is designed for situations in which the inference data are not expected to form an i.i.d. sample from the historical data distribution.


Domain similarity Within the field of NLP, researchers have explored various methods for measuring domain similarity in the context of domain adaptation, including target vocabulary coverage rate and language model perplexity (Dai et al., 2019). However, these methods work well only under the assumption that there is sufficient data from the source and target distributions. Therefore, in our work we adopt a non-parametric statistical hypothesis testing framework to detect shift without knowing the actual parameters of the population.


Shift detection in NLP Within NLP, Arora et al. (2021) focused on out-of-distribution texts, categorizing shifts into background shift and semantic shift. They compared model calibration and density estimation methods for shift detection across 14 pairs of natural language understanding datasets. We investigate different types of shifts than these works.

Given the importance of shift detection, a number of benchmark datasets have been developed (Koh et al., 2021; Malinin et al., 2022); however, none target the widely used task of NER.

Our work adds to this existing literature. First, we employ widely used labelled NER datasets and compare not only changes in field (e.g. science to finance) but also changes in text style (e.g. news-style text to social-media-style text). Second, we test the impact of representation choice on shift detection. Lastly, we provide new evidence for the impact of distribution shifts on task performance.

Table 1 List of annotated datasets for English NER from different domains

3 Methodology

Our methodology consists of the following steps: data collection, representation choice, statistical hypothesis testing, and shift impact measurement. For data collection, we acquire datasets from different domains. Domains are characterized by their language usage, ranging from the style of writing employed to the vocabulary particular to a given field. For all datasets, both the space of input text and the space of labels are considered. In terms of representations, two types of representations are used for the input and one for the labels. Statistical hypothesis tests appropriate for each representation are used to detect distribution shifts, and the calculated statistics are then used to measure the extent of a shift. Lastly, the impact of each shift on model performance is determined. We now walk through each of these steps in detail.

3.1 Data collection

We collected 12 datasets from different domains covering news, social media, encyclopedic content, finance, science, emails, and business. Table 1 shows the list of datasets with the year of publication, document source, domain and number of entity types. Table 2 shows the list of datasets and their entity types. We group the datasets into five categories, which we now describe in turn.

Table 2 List of NER datasets with corresponding entity types

3.1.1 Wiki data

GUM (Zeldes, 2017) (the Georgetown University Multilayer Corpus) is collected and expanded as part of a course curriculum. The current corpus contains texts from public wikis (e.g. Wikinews, Wikivoyage, wikiHow, Wikipedia) as well as social media sites (e.g. Reddit, YouTube). Example types include event, time, animal and abstract. Wikigold (Balasuriya et al., 2009) is a gold-standard NER dataset sourced from Wikipedia. Wikigold has standard types such as person and organization.

3.1.2 Informal text

Formal texts such as news and Wikipedia articles are normally verified by multiple people, sometimes even experts. Hence, the majority of the text has correct grammar and spelling. In comparison, user-generated informal data such as social media text often contains less formal language usage characterized by slang, poor grammar, misspellings, the use of satire, etc. BTC (Derczynski et al., 2016) (Broad Twitter Corpus) is a NER dataset sourced from Twitter that includes tweets not only on general topics but also on specific topics such as disasters. BTC includes 3 types: person, location and organization. WNUT17 (Derczynski et al., 2017) is a NER dataset whose text sources are Reddit, Twitter, YouTube and StackExchange comments. WNUT17 focuses especially on emerging and rare entities. The dataset contains 6 types, including creative, corporation and product, besides common types. CEREC (Dakle et al., 2020) is a large-scale corpus for entity resolution in email conversations. The emails are taken from the first large public email corpus, the Enron Email Corpus (Klimt & Yang, 2004), which contains emails of 150 employees of the Enron Corporation. CEREC contains standard types such as person as well as a digits type.

3.1.3 Specific fields

AnEm (Ohta et al., 2012) is a corpus annotated with species-independent anatomical entity mentions. The texts are academic papers from the biomedical scientific literature. AnEm contains 11 domain-specific types such as organ, cell and organism_substance. i2b2 (Uzuner, Ö., Luo, Y., Szolovits, P., 2007) is a corpus that contains unstructured clinical notes from the Research Patient Data Registry at Partners Healthcare. The dataset consists of 8 types such as hospital, phone and doctor. SEC-filings (Salinas Alvarado et al., 2015) (U.S. Security and Exchange Commission filings) is a randomly selected and manually annotated finance dataset. The texts are from public-domain financial reports. The dataset includes the standard types from CoNLL, i.e. organization, person, location and misc. SciERC (Luan et al., 2018) is a dataset that includes annotations for scientific entities in 500 scientific abstracts from AI conferences and workshop proceedings. The dataset focuses on science-related types including material, method and task. Re3d (Dstl & Laboratory, n.d.) was constructed from documents relevant to the defence and security analysis domain, specifically focusing on the conflict in Syria and Iraq. It includes domain-specific types such as weapons and military platform.

3.1.4 News

CoNLL-03 (Tjong Kim Sang & De Meulder, 2003) is a dataset in which the texts are taken from Reuters news stories from 1996 and 1997. It contains the standard types: person, location, organization and misc.

3.1.5 General

OntoNotes (Weischedel et al., 2017) is a large annotated corpus that consists of various genres of text, including news, conversational telephone speech, weblogs, newsgroups, broadcast, and talk shows. OntoNotes includes a large variety of types (18), covering common types as well as less common ones such as money and percent.

Even though the datasets are grouped into five categories, there is still overlap. Wiki-based datasets and OntoNotes or CoNLL belong to different categories, but they might share many similar general entities. This is because common entities in the news are highly likely to have Wikipedia pages. Intuitively, the “similarity” between datasets in the wiki group should be larger. Conversely, the “similarity” between the domain-specific financial dataset SEC and the news dataset CoNLL should be smaller. We introduce methods to statistically quantify the distance between datasets in the following sections.

For all datasets, we preprocess as follows. Duplicates are removed to prevent overfitting. Labels are unified across datasets, as shown in Appendix A: different datasets use different labels to refer to the same type, so, to better compare performance, we unify labels with the same semantic meaning. For example, 'person' and 'PER' are unified under the same label.

3.2 Shift detection and measurement

We use statistical testing to determine and measure shifts between datasets. Formally, given labeled source domain data \(\{({\varvec{x}}_1, {\varvec{y}}_1),\ldots , ({\varvec{x}}_n, {\varvec{y}}_n)\} \sim p\) and labeled target domain data \(\{(\varvec{x'}_1, \varvec{y'}_1),\ldots , (\varvec{x'}_n, \varvec{y'}_n)\} \sim q\), shift detection determines whether p equals q. The null hypothesis is \(H_0: p = q\) and the alternative hypothesis is \(H_1: p \ne q\). The test statistics are used as shift measurements. We investigate shifts occurring in both the input distribution \(p({\varvec{x}})\) and the label distribution \(p({\varvec{y}})\).

When forming the dataset pairs for hypothesis testing, the distance functions we use are symmetric, meaning that for a distance function Dist, Dist(p, q) = Dist(q, p). Therefore, we measure the distribution shifts using combinations without repetition. Additionally, we include the distance of each dataset to itself as a sanity check. This approach results in 66 unique combinations plus 12 self-comparisons, for a total of 78 pairs.
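For illustration, the pair enumeration can be sketched as follows (the dataset names are placeholders; this is not the paper's code):

```python
from itertools import combinations

datasets = [f"D{i}" for i in range(1, 13)]  # 12 placeholder dataset names

# 66 unordered pairs of distinct datasets, since Dist(p, q) = Dist(q, p)
cross_pairs = list(combinations(datasets, 2))
# 12 self-comparisons used as a sanity check
self_pairs = [(d, d) for d in datasets]

pairs = cross_pairs + self_pairs
print(len(cross_pairs), len(self_pairs), len(pairs))  # 66 12 78
```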

We now discuss the representations we use for the datasets and the corresponding statistical tests we employ.

3.2.1 Representation for input space

We investigate two different representations for the input space.

Word frequencies: in this setting, \({\varvec{x}}\) represents the frequency of each word. The underlying assumption is that the number of occurrences of a word within a dataset indicates how important that word is. The word frequency distribution over the vocabulary represents the dataset.
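As a minimal sketch (assuming tokenized sentences; the helper names are ours, not the paper's), the frequency representation can be built and aligned over a shared vocabulary as follows:

```python
from collections import Counter

def word_frequencies(sentences):
    """Count word occurrences over a tokenized corpus (a list of token lists)."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return counts

def aligned_counts(counts_p, counts_q):
    """Align two frequency distributions over the union of their vocabularies."""
    vocab = sorted(set(counts_p) | set(counts_q))
    obs_p = [counts_p.get(w, 0) for w in vocab]
    obs_q = [counts_q.get(w, 0) for w in vocab]
    return vocab, obs_p, obs_q
```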

Distributional representation: in this setting, each instance of \({\varvec{x}}\) is an n-dimensional vector representation of a sentence within a dataset. Sentence-BERT (Reimers & Gurevych, 2019) is used to encode each sentence. The idea behind Sentence-BERT is that semantically similar sentences are closer in vector space (Reimers & Gurevych, 2019). The data points in this n-dimensional space constitute the distribution for each dataset.
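A minimal encoding sketch is shown below; the sentence-transformers package and the particular checkpoint name are assumptions, since the paper only states that Sentence-BERT is used:

```python
from sentence_transformers import SentenceTransformer

# The checkpoint is an assumption; any Sentence-BERT model yields n-dimensional vectors.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def encode_dataset(sentences):
    """Map each sentence to a vector; the resulting point cloud is treated as the
    dataset's input distribution used by the MMD test described below."""
    return encoder.encode(sentences, convert_to_numpy=True)
```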

3.2.2 Representation for label space

Within the NER task, datasets from different domains have different types of entities. We use category counts as our label distribution \(p({\varvec{y}})\). Among different domains, the most general types include Person, Organization, Location and Miscellaneous. As mentioned in subsection 3.1, we unify labels with the same semantic meanings. We note that very field-specific datasets will have a different label space than more general datasets.

Following recent work on distribution shifts, for the label space we formulate the problem as one of unseen data shift, where some attribute values are unseen under p but are seen under q (Wiles et al., 2022). For example, the type Method might have zero observations in many datasets, such as CoNLL and Wikigold, but many observations in SciERC. This does not necessarily mean that there are no entities of type Method in the Wikigold dataset; rather, due to the specific data generation process, those entities are not annotated. We see this as an outcome of different sampling processes. We assume all datasets share a common label set \({\varvec{A}}^l\) and that some labels in the set are unseen in p but seen in q due to systematic sampling error.
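A minimal sketch of this construction, assuming a hypothetical unified label set, is:

```python
from collections import Counter

# Hypothetical common label set A^l after unification; the actual set is given in Appendix A.
COMMON_LABELS = ["Person", "Organization", "Location", "Miscellaneous", "Method"]

def label_distribution(entity_labels):
    """Count entity types over the shared label set; labels unseen in a dataset get zero."""
    counts = Counter(entity_labels)
    return [counts.get(label, 0) for label in COMMON_LABELS]
```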

3.2.3 Statistical hypothesis testing

For each type of representation, a different statistical test is necessary, which we now detail. Shift decisions are reported based on the significance level; by default, we use 0.05 as the significance threshold for all tests. Furthermore, we use statistical testing as a means to measure distribution shift and draw a connection between shift and performance.

Chi-Squared Test For the word frequency distribution of the input and the label count distribution, each sample \({\varvec{x}}_n\) is one categorical value representing a word (or label) occurrence in the domain. We adopt Pearson's Chi-Squared test, a non-parametric test for determining whether two frequency distributions are the same. The crucial underlying assumption is that a corpus is modelled as a sequence of independent Bernoulli trials. The relevant statistic \(\chi ^2\) is computed as:

$$\begin{aligned} \chi ^2 = \sum _{i=1}^2 \sum _{j=1}^C \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \end{aligned}$$

where \(O_{ij}\) is the observed count for category j in sample i and \(E_{ij}\) is the corresponding expected count. Words with fewer than 5 occurrences are filtered out.

There has been a long debate about whether the chi-squared test, or statistical testing in general, should be applied in corpus linguistics (Gries, 2005). However, it is still widely employed within the literature (Rabanser et al., 2019). Given that the distribution shift literature also employs chi-squared testing, we make use of it here as well.

We employ two data processing procedures when using Chi-Squared tests. First, before applying the test, we implement a normalization procedure to ensure that the observed and expected values are on the same scale (Underhill & Bradfield, 2013). This normalization makes the test more robust to differing sample sizes. Second, by design, the label distribution may contain a considerable number of zeros for certain categories. Since the Chi-Squared test is not viable when expected counts are zero, we add a small constant (\(1e-5\)) to each category so that we obtain results without changing their numerical meaning.
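A minimal sketch of this test is shown below; the simple rescaling stands in for the normalization procedure of Underhill & Bradfield (2013), and scipy's contingency-table test stands in for the two-sample formula above (both are assumptions on implementation details):

```python
import numpy as np
from scipy.stats import chi2_contingency

EPS = 1e-5     # small constant added to avoid zero counts
ALPHA = 0.05   # significance threshold

def chi_squared_shift(counts_p, counts_q, scale=1000.0):
    """Two-sample chi-squared test on aligned category (or word) counts."""
    p = np.asarray(counts_p, dtype=float)
    q = np.asarray(counts_q, dtype=float)
    # Rescale both samples to a common total and smooth zero categories.
    p = p / p.sum() * scale + EPS
    q = q / q.sum() * scale + EPS
    stat, p_value, _, _ = chi2_contingency(np.stack([p, q]))
    return stat, p_value, p_value < ALPHA  # statistic, p-value, shift decision
```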

Another potential test for this sort of distribution is the Kolmogorov-Smirnov (KS) two-sample test. However, this test compares cumulative distributions, which requires the values to be ordered, and sorting the items of a vocabulary is not meaningful.


Maximum Mean Discrepancy (MMD) For the multi-dimensional representations obtained from sentence-BERT, we employ MMD (Gretton et al., 2012), a non-parametric kernel-based two-sample test that determines whether two samples are drawn from different distributions p and q. MMD computes the \(L_2\) distance between the mean embeddings \(\mu _p\) and \(\mu _q\) of the distributions in a reproducing kernel Hilbert space \({\mathcal {F}}\):

$$\begin{aligned} \textbf{MMD}^2(P, Q) = \langle \mu _p, \mu _p\rangle - 2\langle \mu _p, \mu _q\rangle + \langle \mu _q, \mu _q\rangle . \end{aligned}$$

Empirically, we use the unbiased estimate of the squared MMD statistic:

$$\begin{aligned} \textbf{MMD}^2 = \frac{1}{m^2 - m} \sum _{i=1}^m \sum _{j\ne i}^m \kappa ({\varvec{x}}_i, {\varvec{x}}_j) + \frac{1}{n^2-n} \sum _{i=1}^n \sum _{j\ne i}^n \kappa ({\varvec{x}}'_i, {\varvec{x}}'_j) - \frac{2}{mn} \sum _{i=1}^m \sum _{j=1}^n \kappa ({\varvec{x}}_i, {\varvec{x}}'_j) \end{aligned}$$

where the kernel is a squared exponential function \(\kappa ({\varvec{x}}, \tilde{{\varvec{x}}})=e^{-\frac{1}{\sigma }\Vert {\varvec{x}} - \tilde{{\varvec{x}}}\Vert ^2}\) and \(\sigma\) is set to the median distance between points (Gretton et al., 2012).
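The unbiased estimator can be sketched as follows; the exact form of the median heuristic is an assumption:

```python
import numpy as np
from scipy.spatial.distance import cdist

def mmd2_unbiased(X, Y):
    """Unbiased estimate of squared MMD between two sets of sentence embeddings,
    using the kernel k(x, x') = exp(-||x - x'||^2 / sigma) with sigma set to the
    median pairwise distance over the pooled samples."""
    Z = np.vstack([X, Y])
    sq_dists = cdist(Z, Z, metric="sqeuclidean")
    sigma = np.median(np.sqrt(sq_dists[sq_dists > 0]))

    def k(A, B):
        return np.exp(-cdist(A, B, metric="sqeuclidean") / sigma)

    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    # Drop the diagonal (i = j) terms to obtain the unbiased estimate.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.sum() / (m * n)
```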

3.3 Impact on performance

The last step in our method is to determine how shift affects model performance. Our hypothesis is that as the degree of distribution shift increases, so does the likelihood that a model makes errors, and hence the degradation in performance. We use BERT (Devlin et al., 2019), one of the most widely used pre-trained language models, to measure performance. Specifically, we measure the effect on performance of shifts in p(x) and shifts in p(y).

We fine-tune the BERT model on one dataset and then test its performance across all datasets to identify any performance degradation. When adapting BERT for NER, we treat it as a sequence labelling task using the BIO tagging scheme, where each token in a sentence is tagged as Begin, Inside, or Outside to indicate named entity boundaries. We then add one fully-connected dense layer for predicting tags, and a cross-entropy loss is used to compare the predicted tag sequence with the gold tag sequence.
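For instance, under the BIO scheme a sentence might be tagged as follows (a hypothetical illustration; the tag inventory depends on each dataset's unified label set):

```python
# Hypothetical (token, BIO tag) pairs: "B-" opens an entity span,
# "I-" continues it, and "O" marks tokens outside any entity.
tagged_sentence = [
    ("EU", "B-Organization"),
    ("rejects", "O"),
    ("German", "B-Miscellaneous"),
    ("call", "O"),
]
```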

Specifically, we first split each dataset into a training set and an inference set, ensuring that every model trained on one dataset can also be tested on that same dataset. We follow the classic machine learning split ratio of 80:20, training on 80% of the dataset and testing on 20%. We then pair any two of the datasets and use the training set of the first as the source domain and the inference set of the second as the target domain. We fine-tune the original baseline model on the source training set and evaluate it on the target inference set. Fine-tuning is performed for 10 epochs. As in the original BERT paper, we use a batch size of 32 and a learning rate of 5e-5. To ensure robustness in our results, we report the average performance across five trials, each with a different random sampling. We train and test our models on a GeForce 1080Ti GPU with 11 GB of RAM. The fine-tuned BERT model's architecture consists of 12 layers of attention blocks, each with a hidden size of 768 in the embedding layers. We follow the standard configuration of the BERT-base model, which comprises approximately 110 million parameters (Devlin et al., 2019). When fine-tuning it for the NER task, we add one more fully-connected dense layer. The number of parameters for each output neuron is hidden size + bias (768 + 1) = 769, so the layer adds \(num\_tags\) * 769 \(\approx\) 15,380 parameters in total. Our code for both the hypothesis testing and the evaluation is available as supplementary material.
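A minimal fine-tuning sketch using the Hugging Face transformers library is given below; the checkpoint name and the use of the Trainer API are assumptions, while the epoch count, batch size, and learning rate follow the settings reported above:

```python
from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments)

def finetune_ner(train_dataset, eval_dataset, num_tags,
                 model_name="bert-base-cased"):
    """Fine-tune BERT for NER as token classification (minimal sketch).

    train_dataset / eval_dataset are assumed to be tokenized with BIO label ids
    aligned to sub-word tokens; their construction is omitted here.
    """
    # BERT encoder plus one fully-connected layer over the tag set,
    # trained with cross-entropy loss against the gold tags.
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=num_tags)
    args = TrainingArguments(
        output_dir="ner-finetune",
        num_train_epochs=10,               # training settings reported above
        per_device_train_batch_size=32,
        learning_rate=5e-5,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    return trainer.evaluate()
```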

To draw a connection between distribution shifts and performance degradation, we calculate the correlation between the measured shift and the performance difference \(perf_{ab}\) between any two datasets a and b.
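A minimal sketch of this step is shown below; Pearson's r is our assumption for the correlation coefficient, consistent with the fitted linear regressions and p-values reported in Figs. 7, 8 and 9:

```python
from scipy.stats import pearsonr

def shift_performance_correlation(shift_distances, perf_differences):
    """Correlate shift statistics with performance differences over dataset pairs."""
    r, p_value = pearsonr(shift_distances, perf_differences)
    return r, p_value
```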

3.4 Experimental setup

We conduct various experiments under different setups. For shift detection, we first verify the validity of the tests on samples drawn from the same corpus: if no shift is detected, the test behaves as expected when the two distributions are the same. We subsequently apply the tests to all pairs of datasets.

As illustrated in Table 2, the datasets vary in size. In general, ML models perform better when fine-tuned on more data. To mitigate potential biases arising from data size, we uniformly sample a subset of 948 samples from each dataset and perform all experiments on these subsets; 948 is the minimum number of samples among all datasets (from the Re3d dataset). In practice, engineers often fine-tune the model on the full dataset for better performance. To investigate the correlation under this scenario, we also apply identical tests and performance measures to the original full-sized datasets; these results are included in Appendix A.

For the performance measures, BERT is pre-trained on a particular scope of texts and may favour datasets from certain domains. To address this potential bias, we use both BERT-base and BioBERT-base (Lee et al., 2019) and compare their respective performance. The complete results are provided in Appendix B.

Fig. 3: GUM and Wikigold

Fig. 4: WNUT-17 and BTC

Fig. 5: SciERC, SEC and CEREC

Fig. 6: WNUT-17, BTC and CEREC

Table 3 MMD statistics for the input embedding distribution on full-sized datasets with different numbers of samples

4 Results and discussion

We now present the results of applying the method detailed above. We begin with an analysis of the input datasets to verify our hypothesis that shifts between distributions reflect shifts between domains.

4.1 Datasets analysis

Figures 3, 4, 5 and 6 show the word frequency plots for selected pairs of datasets. WNUT-17 and BTC both include text from Twitter, and we see that the word frequencies are similar across the two datasets. Conversely, for the SciERC, SEC and CEREC datasets, which more distinctly represent different domains, we observe greater dispersion among their respective word frequency distributions. Intuitively, these results suggest that word frequency distributions can serve as an indicator of a domain.

4.2 Hypothesis testing

Owing to space limitations, Tables 3, 4 and 5 display only a subset of the hypothesis testing results for the corresponding distributions. We selected two dataset pairs in which the source and target distributions come from the same dataset, followed by the top 5 and bottom 5 pairs in ascending order. All subsequent tables are constructed in the same way. All tests are conducted on both the sampled and the full datasets, and complete results for all dataset pairs are available in Appendix B.

4.2.1 Chi-squared testing for input distribution

Table 4 shows the chi-squared testing on both sampled dataset pairs and original-sized dataset pairs. For the Chi-squared tests, the shift decision is based on the p-value of the test: a common cutoff for rejecting the null hypothesis is a p-value below 0.05, indicating a statistically significant difference in distributions. The Chi-squared statistics serve as a proxy for the distribution differences; the higher the value, the greater the disparity. The distance between a dataset and itself is also reported as a sanity check. When the source and target distributions are identical, the test indicates that no shift is detected, showing that it is capable of identifying when two distributions are the same.

For the full-sized datasets, the left table in Table 4 reveals that among the 78 dataset pairs, 13 pairs are detected as shifted. Meanwhile, the right table indicates that for the sampled datasets, 22 of the same 78 pairs are detected as shifted. These results suggest that the test is more sensitive to shifts when the sample size is smaller. On closer inspection of the dataset pairs, we observe that of the 13 shift-detected full-sized pairs, 11 are also detected in the sampled datasets, an agreement of approximately 84.6%. Additionally, the full results in Table 4 reveal that a higher Chi-Squared value does not necessarily imply the detection of a shift. For instance, while the OntoNotes and i2b2 datasets have a high Chi-Squared value, no shift is detected. This outcome could arise from the data samples being non-representative of the full distribution, preventing the test from reaching a confident conclusion.

Analyzing these results, we note that the BTC and WNUT-17 datasets have the smallest distance, which is in line with the frequency plots in Fig. 4. On the other hand, the BTC dataset and the SEC finance dataset are the furthest apart, which, as expected, reflects that these two datasets have very different text styles. One surprising outcome is GUM and SciERC, which have a relatively small distance under this representation while appearing to come from different domains. These examples illustrate that this test can quantify the distance between datasets (Fig. 10).

Table 4 Chi-squared statistics for input frequency distribution on full-sized datasets (left) and sampled datasets (948 samples) (right) in ascending order

4.2.2 MMD testing for input distribution

For the distributional representations, we apply MMD with different numbers of samples (i.e. embedded sentences), n = 5, 50, 200, 500, 1000 and 2000. Table 3 shows the results of these tests, ordered by the scores obtained with 2000 samples. We measure the difference between a dataset and itself as a sanity check.

Intuitively, the MMD test evaluates whether there is a significant difference between two distributions: a higher MMD value suggests a greater disparity. The null hypothesis that the distributions are identical is rejected when the MMD statistic is significantly high. In our study, MMD values can appear negative due to estimation error with smaller samples or because of the kernel choice; in these cases the absolute value of MMD should be considered. Typically, a significance threshold is set above which the null hypothesis can be rejected; we use 0.05 as our threshold.

CoNLL, a widely used benchmark dataset for NER, is surprisingly far from the other datasets under the distance measured by the chi-squared test. With the MMD tests, however, the distance is fairly close. This is an indication that the sentence-level representation provides more information than the word-frequency representation.

4.2.3 Label distribution

To detect category shift in the label distribution, we used the Chi-squared test described in subsection 3.2.3. This test was performed on both the sampled and the full-sized datasets, and the results are presented in Table 5. The results reveal a significant difference between the datasets that share the same categories and those that have different categories.

Datasets focused on specialized fields typically contain more specific labels. Consequently, their dissimilarity to datasets from general domains is greater. For example, while the input shift between BTC and WNUT17 may be small, the label shift is relatively large due to their distinct label spaces. For the NER task, generalizing model performance to datasets with distinct categories is more challenging, as evidenced in the following section.

Table 5 Chi-squared testing for label distributions of full-sized datasets (left) and sampled datasets (948 samples) (right)
Table 6 Micro-average F1 score when the model is fine-tuned on the source dataset (the row) and tested on the target dataset (the column)
Table 7 F1 scores on full-sized datasets with BERT-base
Table 8 F1 scores on full-sized datasets with BioBERT-base
Fig. 7: Plots for Chi-squared measures with word frequency input distribution and performance difference. Linear regression model fitted

Fig. 8: Plots for MMD measures with sentence-level input distributions and performance difference. Linear regression model fitted

Fig. 9: Plots for Chi-squared measures with label distribution and performance difference. Linear regression model fitted

4.3 Performance measurement

As noted in subsection 3.4, we conducted four sets of experiments: fine-tuning models on both the original-sized and the sampled (948-sample) datasets using both BERT-base and BioBERT-base. All performance results are reported as the average of five trials with different random sampling. The complete results can be seen in Appendix B.

Tables 6, 7 and 8 present the micro-averaged F1 performance of models trained and tested on the sampled datasets with BERT, the full-sized datasets with BERT, and the full-sized datasets with BioBERT, respectively. Each row indicates the dataset on which the model is fine-tuned, and each column indicates the dataset on which the fine-tuned model is tested. All results are averages of 5 trials; the standard deviation ranges from 0.0 to 0.07. The last column reports the average F1 score across all test sets, which represents the generalization ability of the model when fine-tuned on a specific dataset. The rows are ordered by this average F1 score.

Table 7 reveals that even though BTC and WNUT17 contain texts from the same domain, the model’s generalization ability on WNUT17 decreases significantly when fine-tuned on BTC, as there is a significant label category shift between these two datasets.

Comparing Tables 6 and 7, we observe that when we control the number of data samples, the average F1 scores tend to decrease. However, the overall rankings of the datasets are similar between the two tables, except for WNUT17 and Re3d. For most datasets, using a subset of the original data reduces generalization ability significantly, indicating that the additional samples in the original dataset help improve generalization. Conversely, for Re3d, generalization ability increases as the number of samples decreases, indicating that the additional samples in that dataset harm generalization.

Due to mutually exclusive sets of categories, we encounter many zero F1 scores on the AnEM and SciERC datasets. Even though fine-tuning improves performance on the same dataset, the generalization ability is low, indicating that category shift has a significant impact on performance.

Comparing Tables 7 and 8, we observe that the fine-tuning performance is slightly impacted by the text on which these language models were pre-trained. However, the average F1 score rankings remain the same.

4.4 Correlation

To further investigate how shifts impact model performance, we report the correlation between the test statistics and the performance differences in Figs. 7, 8 and 9. Within each plot, the x-axis represents the distances calculated by the different hypothesis tests, and the y-axis represents the performance difference between a source dataset and a target dataset; the higher the performance difference, the worse the generalization ability of the model. Given datasets \(D_a\) and \(D_b\), \(Perf_{ab}\) denotes the performance difference between \(D_a\) and \(D_b\) when the model is fine-tuned on a source dataset \(D_s \in \{D_1, D_2, \ldots , D_{12}\}\), and \(Shift_{ab}\) is the distance between \(D_a\) and \(D_b\) under each statistical test. The correlation is calculated between \(Perf\) and \(Shift\).

Based on the presented plots, it is evident that the label category shift shows the most statistically significant correlation (p < 0.0001) with model performance. This finding suggests that category shift can serve as a reliable indicator of model performance in a supervised setting when evaluating on a new domain. With respect to input distribution shift, while the word frequency distribution's correlation with model performance is the lowest, it is still significant (p = 0.042). The MMD results reveal a moderately strong correlation (p = 0.002). This indicates that in an unsupervised setting, MMD testing on the sentence-level representation distribution can be used to estimate model performance when transferring between domains.

5 Conclusion

In this work, we investigated input data and label distribution shifts across 12 benchmark NER datasets. We compared two different types of representations for input shifts. We systematically measured the shifts using the lens of statistical testing. We measured performance differences by fine-tuning BERT models and calculating the correlation between shifts and performance.

The results show that both the word frequency distribution and the sentence-level distributional representation are useful for ascertaining shift: changing between domains results in measurable differences in distribution. The results also show that, for NER, label shift correlates more strongly with performance degradation than input shift, although input shift still correlates with performance degradation. Here, sentence-level representations provide a stronger signal for the relation between distribution shift and performance.

Based on these results, we believe that shift detection and the measurement of distribution shifts can play important roles in tackling NLP tasks, especially for new and low-resource domains. In particular, when applying a model to a new domain, or as data changes, the measures detailed above can help researchers and practitioners decide whether the expense of gathering new annotated data and subsequent fine-tuning is warranted. In the future, we hope that distribution shift measurement can become part of widely used NLP paradigms such as crowd-sourcing and active learning.