1 Introduction

Predicting lexical complexity can enable systems to better guide a user to an appropriate text, or tailor it to their needs. The task of automatically identifying which words are likely to be considered complex by a given target population is known as Complex Word Identification (CWI) and it constitutes an important step in most lexical simplification pipelines (Paetzold & Specia, 2017).

The topic has gained significant attention in the last few years, particularly for English—which is also the focus of our study. A number of studies have been published on predicting complexity of both single words and multi-word expressions (MWEs) including two recent competitions organized on the topic, CWI 2016 and CWI 2018, discussed in detail in Sect. 2. The first shared task on CWI was organized at SemEval in 2016 (Paetzold & Specia, 2016a) providing participants with an English dataset in which words in context were annotated as non-complex (0) or complex (1) by a pool of human annotators. The goal was to predict this binary value for the target words in the test set. A post-competition analysis of the CWI 2016 results (Zampieri et al., 2017) examined the performance of the participating systems and evidenced how challenging CWI 2016 was with respect to the distribution (more testing than training instances) and annotation type.

The second edition of the CWI shared task was organized in 2018 at the BEA workshop (Yimam et al., 2018). CWI 2018 featured multilingual (English, Spanish, German, and French) and multi-domain datasets (Yimam et al., 2017). Unlike in CWI 2016, predictions were evaluated not only in a binary classification setting but also in terms of probabilistic classification in which systems were asked to assign the probability of the given target word in its particular context being complex. Although CWI 2018 provided an element of regression, the continuous complexity value of each word was calculated as the proportion of annotators that found a word complex. For example, if 5 out of 10 annotators labeled a word as complex then the word was given a score of 0.5. This measure relies on an aggregation of absolute binary judgments of complexity to give a continuous value.

Instead of using binary judgments, the CompLex dataset uses Likert-scale judgments (Shardlow et al., 2020), for which the specification is discussed in depth in Sect. 4. CompLex is a multi-domain English dataset annotated on a 5-point Likert scale (1–5) corresponding to the annotators’ comprehension of and familiarity with the words, in which 1 represents very easy and 5 represents very difficult. The CompLex dataset was used as the official dataset of SemEval-2021 Task 1: Lexical Complexity Prediction (LCP) (Shardlow et al., 2021). The goal of LCP 2021 was to predict this complexity score for each target word in context in the test set.

In this paper, we investigate properties of multiple annotated English lexical complexity datasets, such as the aforementioned CWI datasets and others from the literature (Maddela & Xu, 2018). We investigate the types of features that make words complex. We analyse the shortcomings of the previous CWI datasets and use this analysis to motivate the specification of a new type of CWI dataset, focusing not on complex word identification (CWI), but instead on lexical complexity prediction (LCP), that is, CWI in a continuous-label setting. We further develop a dataset based on adding additional annotations to the existing CompLex 1.0 to create our new dataset, CompLex 2.0, and use it to conduct experiments into the nature of lexical complexity.

The main contributions of this paper are:

  • A concise yet comprehensive survey of the two editions of the CWI shared tasks organized in 2016 and 2018;

  • An investigation into the types of features that correlate with lexical complexity;

  • A qualitative analysis of the CWI–2016 (Paetzold & Specia, 2016a), CWI–2018 (Yimam et al., 2018) and Maddela–2018 (Maddela & Xu, 2018) datasets, highlighting issues with the annotation protocols that were used;

  • The specification of a new annotation protocol for the CWI task;

  • An implementation of our specification, describing the annotation of a new dataset for CWI (CompLex 1.0 and 2.0);

  • Experiments comparing the features affecting lexical complexity in our dataset, as compared to others;

  • Experiments using our dataset, demonstrating the effects of genre on CWI.

The remainder of this paper is organized as follows. Section 2 provides an overview of the previous CWI shared tasks. Section 3 provides a preliminary investigation into the types of features that correlate with complexity labels in previous CWI datasets. Section 4 first discusses the datasets that have previously been used for CWI, highlighting issues in their annotation protocols in Sect. 4.1, and then proposes a new protocol for constructing CWI datasets in Sect. 4.2. Section 5 reports on the construction of a new dataset following the specification previously laid out. Section 6 compares the annotations in our new dataset to those of previous datasets by developing a categorical annotation scheme. Section 7 shows further experiments demonstrating how our new corpus can be used to investigate the nature of lexical complexity. Finally, a discussion of our main thesis and the conclusions of our work are presented in Sects. 8 and 9 respectively.

We have previously published the CompLex 1.0 data as a workshop paper (Shardlow et al., 2020). The CompLex 2.0 data was also described in the SemEval task description paper (Shardlow et al., 2021). In this paper, we seek to build upon these prior works to give an in-depth and rounded treatment of the lexical complexity problem.

2 Related work

There have been various studies which have both created datasets and explored computational models for CWI, particularly focusing on English texts (Shardlow, 2013a, b; Gooding & Kochmar, 2019; Finnimore et al., 2019). These studies have addressed CWI as a stand-alone task or as part of lexical simplification pipelines.

Given the direct application of CWI to lexical simplification systems, where the goal is to decide whether or not a word needs to be replaced with a simpler one, the clear majority of studies have addressed CWI as a binary classification task. That said, there have been multiple studies analyzing the shortcomings of approaching CWI as a binary classification task. Some studies have examined the relationship between classification performance and dataset annotation in an attempt to estimate the theoretical upper boundary of binary CWI systems (Zampieri et al., 2017), while others have investigated alternative ways to model the task. One study argued that comparative judgments are more consistent than binary classification for CWI (Gooding et al., 2019).

CWI is of direct interest to those working in lexical simplification as it forms the first part of the lexical simplification pipeline (Devlin & Tait, 1998). Before a word can be simplified, a decision must be made as to whether or not that word requires simplification. Simplification systems (Biran et al., 2011; Bott et al., 2012) then generate potential candidates for simplification and use a process similar to CWI to select the simplest candidate (Paetzold et al., 2017).

Comparative complexity is a task related to, but distinct from, Lexical Complexity Prediction. In this task, two words are taken and a judgment is given to determine which is the more complex. A recent study found that annotations for comparative complexity were more consistent than binary classification (Gooding et al., 2019). Nonetheless, we have not focussed on comparative complexity in this work, but rather on continuous complexity. We are most interested in the complexity of a word in its original context, rather than in relation to another word.

The increased interest from the research community in CWI was the primary motivation for the organisation of the two editions of the aforementioned CWI shared task in 2016 and 2018. These shared tasks have made important benchmark datasets available to the community that are widely used beyond these competitions. In the next sub-sections we provide an overview of these two editions: CWI–2016 organized at SemEval 2016 (Paetzold & Specia, 2016a) and CWI–2018 organized at the BEA workshop in 2018 (Yimam et al., 2018). We describe the task setup, present the datasets, and discuss the approaches and features used by the participating systems. Finally, we analyze the results obtained by the participants and the main challenges of each edition of the CWI Shared Task.

2.1 CWI–2016

The first shared task on CWI was organized as Task 11 at the International Workshop on Semantic Evaluation (SemEval) in 2016. CWI–2016 provided participants with a manually annotated dataset in which words in context were labeled as complex or non-complex, where complexity is interpreted as whether a word was understood or not by a pool of 400 non-native speakers of English. CWI–2016 was therefore modelled as a binary text classification task at the word level. Participants were required to build systems to predict lexical complexity in sentences of the unlabeled test set and assign label 0 to non-complex words and 1 to complex ones. Two examples from the CWI–2016 dataset are shown below:

  (1) A frenulum is a small fold of tissue that secures or restricts the motion of a mobile organ in the body.

  (2) The name ‘kangaroo mouse’ refers to the species’ extraordinary jumping ability, as well as its habit of bipedal locomotion.

The words in bold (frenulum, restricts, and motion in Example 1; extraordinary, bipedal, and locomotion in Example 2) were annotated by at least one of the annotators as complex and were thus labeled as such in the training set. Adjacent words such as bipedal locomotion do not represent multi-word expressions (MWEs); they were annotated in isolation because the task setup of CWI–2016 only considered single-word annotations. Whilst MWEs were not considered in CWI–2016, they were studied in CWI–2018 (see Sect. 2.2).

The dataset provided by the organizers of CWI–2016 contained a training set of 2,237 target words in 200 sentences. The training set was annotated by 20 annotators and a word was considered complex in the training set if at least one of the 20 annotators labeled it as such. The test set included 88,221 target words in 9,000 sentences and each word was annotated by only one annotator. Therefore, the ground truth label for each word in the test set was assigned based on a single complexity judgement. According to the organisers of CWI–2016, this setup was devised to imitate a realistic scenario where the goal was to predict the individual needs of a speaker based on the needs of the target group (Paetzold & Specia, 2016a). Finally, the data included in the CWI–2016 dataset comes from various sources such as the CW Corpus (Shardlow, 2013a), the LexMTurk Corpus (Horn et al., 2014), and Simple Wikipedia (Kauchak, 2013).

CWI–2016 attracted a large number of participants. A total of 21 teams submitted 42 systems to the competition. A wide range of features such as word embeddings, word and character n-grams, word frequency, Zipfian frequency-based features, word length, morphological, syntactic, semantic, and psycholinguistic features were used by participants. A number of different approaches to classification were tested, ranging from traditional machine learning classifiers such as support vector machines (SVM), decision trees, random forest, and maximum entropy classifiers to deep learning classifiers, such as recurrent neural networks. In Table 1, we list the approaches submitted to CWI–2016 by the 19 teams who wrote system description papers presented at SemEval.

Table 1 Systems submitted to CWI–2016, in alphabetical order. We include team names and a brief description of each system, including the features and classifiers used. A reference to each system description paper is provided for more information

In terms of performance, the top three systems were PLUJAGH (Wróbel, 2016), LTG (Malmasi et al., 2016), and MAZA (Malmasi & Zampieri, 2016), which obtained F1-scores of 0.353, 0.312, and 0.308 respectively. The three teams used rather simple probabilistic models trained on features such as n-grams, word frequency, word length, and the presence of words in vocabulary lists extracted from Simple Wikipedia, an approach introduced by PLUJAGH. The relatively low performance obtained by all teams, including the top three systems, evidences how challenging the CWI–2016 shared task was. Both the data annotation protocol and the training/test split, in which 40 times more testing data than training data was available, contributed to making CWI–2016 a difficult task.

A post-competition analysis was carried out using the output of all 42 systems submitted to CWI–2016 (Zampieri et al., 2017). Each system’s output for each test instance was used as a vote to build two ensemble models: a plurality-voting ensemble, which assigns to each instance the label receiving the highest number of votes, and an oracle, which assigns the correct label to an instance if at least one of the systems predicted the ground truth label for that instance. The plurality vote serves to better understand the performance of the systems using the same dataset, while the oracle is used to quantify the theoretical upper limit of performance on the dataset (Kuncheva et al., 2001). The study showed that the potential upper limit for the CWI–2016 dataset, considering the output of the participating systems, is an F1-score of 0.60 for the complex word class. This outcome confirms that the low performance of the systems is related to the way the data was annotated. Finally, this study also confirmed the relationship between word length and lexical complexity annotation in this dataset, a feature used by many of the teams participating in CWI–2016 as well as in our present work.
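To make the two ensemble constructions concrete, the sketch below computes a plurality-vote label and an oracle label from a matrix of binary system predictions. This is an illustrative reconstruction rather than the original analysis code, and the example system outputs are hypothetical.

```python
from collections import Counter

def plurality_vote(predictions):
    """Return the label that receives the most votes across systems."""
    return Counter(predictions).most_common(1)[0][0]

def oracle_label(predictions, gold):
    """Return the gold label if any system predicted it; otherwise the oracle
    cannot recover it, so (for binary labels) the incorrect label is returned."""
    return gold if gold in predictions else 1 - gold

# Hypothetical outputs of three systems on four test instances (0 = simple, 1 = complex).
system_outputs = [
    [0, 1, 0],  # instance 1
    [1, 1, 0],  # instance 2
    [0, 0, 0],  # instance 3
    [1, 0, 1],  # instance 4
]
gold_labels = [1, 1, 0, 0]

plurality = [plurality_vote(p) for p in system_outputs]
oracle = [oracle_label(p, g) for p, g in zip(system_outputs, gold_labels)]
print(plurality)  # [0, 1, 0, 1]
print(oracle)     # [1, 1, 0, 0]
```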

2.2 CWI–2018

Following the success of CWI–2016, the second edition, CWI–2018, was organized at the Workshop on the Innovative Use of NLP for Building Educational Applications (BEA) in 2018. Unlike CWI–2016, which focused only on English, CWI–2018 featured English, French, German, and Spanish datasets, opening new perspectives for research in this area.

A total of four tracks were available at CWI–2018: English, German, and Spanish, for which both training and test data were provided, and French, for which the organizers released a test set with no corresponding training set, with the goal of deriving models for French CWI from the English, Spanish, and German datasets. CWI–2018 featured two sub-tasks: (i) a binary classification task, similar to CWI–2016, where participants were asked to label the given target word in a particular context as complex or simple; and (ii) a probabilistic classification task, where participants were asked to give the probability of the given target word in a particular context being complex.

In terms of data, CWI–2018 used the CWIG3G2 dataset (Yimam et al., 2017) in English, German, and Spanish. The English dataset contains texts from three domains (News, WikiNews, and Wikipedia articles) and the evaluation was carried out per domain. To allow cross-lingual learning, a dataset for French was collected using the same methodology as the one used for the CWIG3G2 corpus. Another important difference between CWI–2016 and CWI–2018 is that the CWIG3G2 dataset featured annotations of both single words and MWEs, while the dataset used in CWI–2016 only considered single words.

In terms of participation, CWI–2018 attracted 12 teams in different task/track combinations. In Table 2, we list the approaches submitted to the English binary classification single word track by the 10 teams who wrote system description papers presented at BEA. Most teams tried multiple approaches and here we describe the teams’ best-performing ones according to their system description papers.

Table 2 Systems submitted to the CWI–2018 English binary classification single word track. We include team names and a brief description of each system including features and classifiers used. A reference to each system description paper is provided for more information

For the English binary classification single word track, the organizers reported the performance of all teams per domain. Team CAMB obtained the best performance for all three domains: 0.8736 F1-score on News, 0.8400 F1-score on WikiNews, and 0.8115 F1-score on Wikipedia. We observed that, for all teams, the performance on the News domain was generally substantially higher than the performance obtained on the other two domains. Several teams used the opportunity to compare multiple approaches for this task and many of them reported that traditional machine learning classifiers were more accurate than deep neural networks (Hartmann & dos Santos, 2018; Alfter & Pilán, 2018).

3 Analysis of features of complex words

Upon analysing the datasets and system features used in CWI–2016 and CWI–2018, we noticed several intuitive explanations as to why a word may be judged as complex, or not:

  • The word is archaic.

  • The word is a borrowing from another language or refers to a concept that is atypical in the culture of the reader.

  • The word is uncommon and many people are not generally exposed to it.

  • The word refers to a very specialised concept.

  • Although the word is common, it is being used with an uncommon meaning in the given context.

These possible characteristics motivated us to represent input words as sets of indicative linguistic features for the purpose of CWI. We used 378 features to represent words in our dataset. These include psycholinguistic features derived from the MRC database (Wilson, 1988), word embeddings, and several other features with the potential to capture our intuitions about lexical complexity.

Values of the psycholinguistic features of words were obtained using the API to the MRC database. Many of the resources included in the database were built before 1998. These were derived through rigorous psycholinguistic testing, and as a result are of restricted size (offering relatively poor coverage of current English vocabulary). For this reason, in addition to taking the values of these features directly from the database, we included binary features to indicate whether or not the word occurs in the MRC database.
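The sketch below illustrates how such features and their coverage indicators might be assembled for a word. The lookup table, feature names, and values here are hypothetical placeholders rather than the actual MRC interface we used.

```python
# Minimal sketch of building psycholinguistic features with coverage indicators.
# `mrc_lookup` is a hypothetical dict standing in for the MRC database API.
mrc_lookup = {
    "dog": {"familiarity": 632, "concreteness": 610, "imageability": 636},
    "frenulum": {},  # not covered by the database
}

FEATURE_NAMES = ["familiarity", "concreteness", "imageability"]

def mrc_features(word):
    """Return MRC feature values plus binary indicators of database coverage."""
    entry = mrc_lookup.get(word.lower(), {})
    features = {}
    for name in FEATURE_NAMES:
        value = entry.get(name)
        features[name] = float(value) if value is not None else 0.0
        features[f"in_mrc_{name}"] = 1 if value is not None else 0
    return features

print(mrc_features("dog"))       # values present, indicators set to 1
print(mrc_features("frenulum"))  # zero-filled values, indicators set to 0
```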

We used information about whether or not the Wikipedia entry for the word includes an infobox element to indicate its degree of specialisation. Wikipedia describes infoboxes as:

[…] a fixed-format table usually added to the top right-hand corner of articles to consistently present a summary of some unifying aspect that the articles share and sometimes to improve navigation to other interrelated articles. Many infoboxes also emit structured metadata which is sourced by DBpedia and other third party re-users. The generalized infobox feature grew out of the original taxoboxes (taxonomy infoboxes) that editors developed to visually express the scientific classification of organisms.

We observed that entries for specialised vocabulary (e.g. Gharial) frequently contain infobox elements of various types (e.g. biota). We extracted features encoding information about the occurrence and type of infobox element as an indicator of the level of specialisation of the word. We view this as a type of coarse-grained semantic information which is available for a relatively large proportion of words: more than 76% of those occurring in the CWI-2016 and CWI-2018 datasets.
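One possible way to approximate this feature is sketched below: it queries the public MediaWiki API for a page’s wikitext and checks for an infobox template. This is an illustration only and may differ from the extraction pipeline we used; pages whose summary box uses a template other than a plain ‘Infobox’ (e.g. ‘Speciesbox’) would not be matched by this simple pattern.

```python
import re
import requests

def infobox_type(title):
    """Return the infobox template name of an English Wikipedia page, or None."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "parse", "page": title, "prop": "wikitext", "format": "json"},
        timeout=10,
    )
    data = resp.json()
    if "parse" not in data:
        return None  # page does not exist
    wikitext = data["parse"]["wikitext"]["*"]
    match = re.search(r"\{\{\s*[Ii]nfobox\s+([^\n|}]+)", wikitext)
    return match.group(1).strip() if match else None

print(infobox_type("London"))  # expected to print an infobox type such as 'settlement'
```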

The full feature set is displayed in Tables 3 and 4. Given that it encodes well-motivated psycholinguistic information and includes features which capture our intuitions about lexical complexity, we consider this feature set to be suitable for use in the derivation of models for CWI. We processed the human-annotated CWI–2016 and CWI–2018 datasets to represent words as feature vectors using the features in these tables.

Table 3 Features (A–J) used to represent words
Table 4 Features (K–T) used to represent words

Features P and S (Table 4) can be categorised as high coverage (holding for more than two thirds of the tokens in the annotated corpora); features G, E, J, H, Q, and I (Tables 3 and 4) as medium coverage (holding for more than one third but less than two thirds of the tokens in the corpora); and features F, R, N, O, K, B, D, M, and L (Tables 3 and 4) as low coverage (holding for less than one third of the tokens in the corpora).

Considered individually, the great majority of features/feature sets listed in Table 3 have no linear relationship with the averaged human judgement of word complexity in the CWI 2016 and CWI 2018 datasets. The only exceptions are word length (feature group C) and the word’s frequency count in the London-Lund corpus (feature group H). As the distributions of these two features are non-normal, we measured correlation with the averaged complexity ratings of words using Spearman’s rho. We found that normalised word length has a low positive correlation (\(\rho (28\,677) = 0.435, p < 0.001\)) while the frequency of the word in the Brown corpus has a low negative correlation with word complexity (\(\rho (28\,677) = -0.354, p < 0.001\)). It is worth noting that MWEs in the CWI-2018 data are always complex and this may have influenced the results for word-length as MWEs are typically longer than single words.

There is no linear relationship between the values of features/feature sets listed in rows K–S of Table 4 and the averaged values of word complexity assigned by the annotators. In our experiments, we did not investigate the strength of correlations between individual word embedding features and average complexity ratings.

Given that the distributions of our features are non-normal, we used Levene’s test (Levene, 1960) to assess the homogeneity of variance between word feature values and complexity scores. In all cases, the Levene test statistic exceeded the critical values with \(p<0.01\), indicating that the variance of the feature values is not equal to that of the complexity scores.
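The two statistics used above can be computed with standard routines; the following sketch shows the form of the calculation on synthetic stand-in data rather than our actual feature matrix.

```python
import numpy as np
from scipy.stats import spearmanr, levene

rng = np.random.default_rng(0)

# Synthetic stand-ins for one feature column and the averaged complexity ratings.
word_length = rng.integers(3, 15, size=1000).astype(float)
complexity = 0.05 * word_length + rng.normal(0, 0.2, size=1000)

# Spearman's rho: rank correlation, appropriate for non-normal distributions.
rho, p_value = spearmanr(word_length, complexity)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")

# Levene's test: homogeneity of variance between the two samples.
stat, p_levene = levene(word_length, complexity)
print(f"Levene W = {stat:.3f}, p = {p_levene:.3g}")
```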

Clearly, this is a surprising result. Research in psycholinguistics indicates, for example, that the frequency of a given word (feature groups A, H, I, and J) affects its perception (Segui et al., 1982; Dupoux & Mehler, 1990; Marslen-Wilson, 1990), that word familiarity (feature group E) and frequency affect visual and auditory word recognition (Connine et al., 1990), and that word imageability (feature group G) significantly impacts word reading accuracy and rate of word learning among first and second graders at risk for reading disabilities (Steacy & Compton, 2019). Further, the “concreteness effect” (feature group F) is a well-established concept in psycholinguistics, referring to the tendency for words with tangible physical referents to be learned earlier, recognised faster, and recalled with less effort than words with abstract referents (Paivio, 1991; Schwanenflugel, 1991). Schwanenflugel et al. (1988) proposed that abstract words are more difficult to recognise because their interpretation is more reliant on context than is the case for concrete words. Word meaningfulness (feature groups K and L) has been observed to have a positive effect on word recognition (Leeds, 1976) and words with greater meaningfulness have been found to be easier to recall than words with less meaningfulness (Kinoshita, 1989). Finally, the age of acquisition of words (feature group M) has been reported to be a predictor of the speed of reading words aloud and of lexical decision tasks (in which participants are asked to judge whether particular sequences of characters are real words), with words acquired early in life being responded to more quickly than words acquired later in life (Morrison & Ellis, 2000). We would therefore expect to see more of our features correlating with complexity. This is likely to be a consequence of the annotation protocols used in the datasets we analysed and motivates our wider argument in this work that there is a need for new CWI datasets. The two features that we did identify as correlating with word complexity (length and frequency) are both features that are used in almost all of the systems submitted to the shared tasks CWI–2016 and CWI–2018, as shown in Tables 1 and 2 respectively. This indicates that these features are useful for complexity prediction, both in our correlation analysis and in the empirical results of the systems that used them. We include this analysis here to show the lack of correlation between otherwise sensible features and the complexity labels in those datasets. In the next section, we discuss the deficiencies of these datasets and propose our specification for an improved CWI dataset.

4 Specification for CWI data protocol

In the previous section we analysed features associated with lexical complexity. In this section, we first highlight some of the design decisions that were taken in the creation of prior CWI datasets. We continue by proposing a specification, based on our prior analysis, for a new CWI dataset that improves on prior work. Our specification is designed to enable CWI research in areas that have not previously been explored. As well as providing a specification, we also provide a list of features for future datasets to implement in Table 6.

4.1 Building on previous datasets

The previous datasets for CWI have interesting characteristics that make them useful for the CWI task. A quick overview of these datasets is presented in Table 5, where they are compared according to some of their basic features.

Table 5 CWI Datasets compared according to their features

The first dataset we have considered is the CWI–2016 dataset, which provides binary annotations on words in context. 9,200 sentences were selected and the annotation was performed as described by Paetzold and Specia (2016a):

Volunteers were instructed to annotate all words that they could not understand…A subset of 200 sentences was split into 20 sub-sets of 10 sentences, and each subset was annotated by a total of 20 volunteers. The remaining 9,000 sentences were split into 300 subsets of 30 sentences, each of which was annotated by a single volunteer.

The annotators were asked to identify any words for which they did not know the meaning. Each annotator had a different proficiency level and therefore would find different words more or less complex - giving rise to a varied dataset with different portions of the data reflecting differing complexity levels. Further, each instance in the training set was annotated by 20 annotators, whereas each instance in the test set was annotated by a single annotator. For the training set, any word which was annotated as complex by at least one annotator was marked complex (even if the other 19 annotators disagreed). This is problematic as the training data is not representative of the testing data, making it hard for supervised systems to do well on this task. Binary annotation of complexity requires an annotator to impose a subjective threshold on the level at which they transition from considering a word complex as opposed to simple. An annotator’s background, education, etc. may affect where this threshold between complex and simple terms is set. Further, it is likely that one annotator may find words difficult that another finds simple and vice-versa. Factors such as the annotator’s native language, educational background, dialect, etc. all affect the type of words they are familiar with. In the case of the training data, where 20 annotators have all annotated the same instance and any instance with at least one annotation is considered complex, it may be taken that the annotations represent some form of maximum complexity - i.e., that any word is above the lowest possible threshold of complexity. However, in the case of the test set, where each word is annotated by a single annotator, the annotations are harder to interpret. Each instance is personal, reflecting only a single annotator’s judgment.

Moving on from the CWI–2016 dataset, the CWI–2018 dataset also provides binary annotations, which were aggregated to give a ‘probabilistic’ measure of complexity. CWI–2018 invited participants to submit results in both the binary complexity annotation setting and the probabilistic annotation setting. To collect their data, the organisers of CWI–2018 followed a similar principle to CWI–2016. Sentences were presented to annotators and the annotators were asked to select any words or phrases that they found to be complex. As in CWI–2016, the annotation task in CWI–2018 was subjective, with potentially low agreement between annotators. In the probabilistic setting, at least 20 annotations were collected from native and non-native speakers and each word was given a score indicating the proportion of annotators that found that word to be complex (e.g., if 10 out of 20 annotators marked the word, then it would be given a score of 0.5). A useful property of this style of annotation is that words are placed on a probabilistic scale of complexity. However, the aggregation of binary annotations to give continuous annotations does not necessarily tell us about the complexity of the word itself. Instead it tells us about the annotators, and how many of them will consider a word complex. So, for example, a score of 0.5 does not indicate a median level of complexity (or some sort of neutrality between simple and complex), but instead should be interpreted as indicating that 50% of the annotator pool will consider this word complex.

The final dataset we have covered was published in 2018 by Maddela and Xu (2018). We refer to this as Maddela–2018 for brevity. In this dataset, 11 annotators who spoke English as a second language were each employed to annotate a portion of a set of 15,000 words on a 6-point Likert scale, with 5–7 annotations being collected for each vocabulary item. Words were presented without context, with the annotators guessing or making assumptions about the sense of the word at annotation time. Different annotators may have considered the word to have a different sense or to have been used in a different context. Almost all words are polysemous and the different senses of the words are likely to have different levels of complexity - particularly in a coarse-grained sense setting (e.g., mean average vs. a mean person). The main effect here is that the varied complexities of the multiple senses and usages of a word are conflated into a single annotation. There is no information as to which word sense the annotators were giving the annotations for, and as such the annotations may be unreliable in cases where a word is used in an uncommon sense. In the Likert-scale type of annotation, it is less of an issue that annotators’ opinions will vary than in the binary setting used in CWI–2016 and CWI–2018, as each annotator’s judgment is aggregated on a common continuous scale. This means that the final averaged annotation is reflective of the average complexity that a word might have in a general setting. This assumes that the annotations are normally distributed and that taking the mean is valid in this case. A normality test could be used to quantify whether instances are likely to have normal distributions; however, with only 5–7 annotations per instance, this may not be reliable.

So far, we have mainly considered complex words. However, the complexity of multi-word expressions is a valuable addition to the CWI literature. MWEs can be considered as compositional or non-compositional. Compositional MWEs (e.g., Christmas tree, notice board, golf cart, etc.) take their meaning from the constituent words in the MWE, whereas non-compositional MWEs do not (e.g., hot dog, red herring, reverse ferret, etc.). It is reasonable to assume that complexity will follow a similar pattern to semantics and that the complexity of compositional MWEs will be dependent on the constituent words, whereas the complexity of non-compositional MWEs will be independent of the constituent words. Of the previous datasets, only the CWI–2018 dataset asked annotators to highlight phrases as well as single words, imposing a limit of 50 characters to prevent annotators from selecting overly long spans. Participants in the task were asked to also give complexity annotations for the highlighted phrases. The system with the highest overall score reported that they found it easier to always consider MWEs as complex in the binary setting (Gooding & Kochmar, 2018). The work of Maddela and Xu (2018) also considers MWEs, although they do not annotate these directly; instead, they use average pooling to combine the embeddings of each token in a phrase into a single embedding, which is then processed in the same way as for single words. As described previously, this assumes compositionality, which will not always be the case.

Little treatment has been given to the variations in complexity between different parts of speech. None of the previous datasets annotate specifically for part of speech except for the CWI–2016 shared task data, which explicitly asks annotators to only highlight content words in the target sentences. Again, this is an important consideration as the roles of nouns, verbs, adjectives and adverbs are different in a sentence and considering them as different entities during annotation will help to better structure corpora. Developers of the existing corpora that span POS tags all suggest the use of POS as a feature for classification—demonstrating its importance in CWI.

All of the corpora recognise the importance of a diversity of reader backgrounds in their corpus construction. Native speakers of English might not realise that certain words they know well (depending on their socio-cultural biases) are not commonly known, or may falsely assume that they “find all words easy”. All three of the corpora that we have studied include annotations by non-native speakers. The CWI–2016 dataset used crowdsourcing to get annotations from 400 non-native speakers, while the CWI–2018 dataset used native and non-native speakers (collecting at least 10 annotations from each group for every instance). The Maddela–2018 data used 11 non-native speakers. The use of non-native speakers for CWI annotation may lead to models trained using these datasets being useful for identifying words which are complex to non-native speakers, but such models may not be applicable to other groups.

All the datasets are heavily biased towards text which has not been professionally edited. The CWI–2016 dataset compiles a number of sources taken from Wikipedia and Simple Wikipedia, and the CWI–2018 dataset uses Wikipedia, WikiNews and one formal set of news text sources. The Maddela–2018 dataset uses the Google Web1T (Brants & Franz, 2006), taken from a large web-crawl, to identify the most frequent 15,000 words in English and annotates each of these for complexity. Except for the news texts in the CWI–2018 data, all of these sources are written for informal purposes and will contain spelling mistakes, idioms, etc. There has been little prior work exploring cross-genre learning for CWI; however, it is unlikely that models trained on such informal text will be appropriate for identifying complexity in formal texts.

4.2 Specification

In the remainder of this section we will describe some of the qualities of an ideal dataset for CWI. Our recommendations are summarised at the end of this Section in Table 6. This specification is intended to give general purpose recommendations for anyone seeking to develop a new CWI dataset.

The key issue with the shared task datasets was the subjectivity that arose during the annotation process due to their treatment of complexity as a binary notion. When multiple annotators are asked to “mark any complex word” they will each draw on their subjective definition of complexity, and each will choose a different subset of words to be annotated as complex. The annotations that result from this are probabilistic in nature and tell us more about the annotators than the words themselves. Future datasets should consider providing measures which attempt to give more objectivity and move towards consensus between annotators. Of course, any complexity annotation involving human participants will always rely on the participants’ subjective knowledge and hence will be dependent on the participants. More objective measures of continuous complexity could be given by asking annotators to mark words on a Likert scale, as in Maddela–2018, or by looking at external measurements of people’s ability to read the words, such as lexical access time, eye tracking, etc.

There are two factors to be considered here when measuring word complexity. One is the perceived complexity of a word (how difficult an annotator estimates a word to be) and the other is the actual complexity of a word (how much difficulty that word presents to the reader) (Leroy et al., 2013). Clearly these are both important factors in estimating a word’s complexity and, although we may expect them to be correlated, there is no guarantee they will be aligned. Whereas perceived complexity affects how a user may prejudge a text, actual complexity determines the degree to which a reader is likely to struggle.

Of course, any measure of complexity which is derived by asking humans to give a subjective judgment of how difficult they find a word is bound to give a measure of perceived rather than actual complexity. In fact, measuring actual complexity would only be possible if the human was taken out of the loop altogether (even a setting where the reader doesn’t know they are being assessed would rely on a participant’s innately subjective assessment of each word). Any annotation scheme which focusses on continuous complexity judgments is still inviting perceived complexity assessment. By giving more levels to the assessment of complexity (i.e., through a Likert-scale assessment) the annotators are better able to record their perception of the complexity of the words being assessed.

The only previous dataset to present continuous annotations (Maddela–2018) did so in the absence of context. Context is key to determining the usage and meaning of a word, and the same word used in different contexts can vary greatly in both semantics and complexity. Indeed, a familiar word in an unfamiliar context may be just as jarring as a rare word for a reader, who is forced to quickly update their mental lexicon with the new sense of the word they have encountered (e.g. words like base, boss, and fanning in the domains of chemistry, architecture, and geology, respectively). Datasets should include context for any words that annotations are provided for. This will help systems to identify how contextual factors affect the complexity of a given instance. When presenting context, researchers may wish either to ask annotators to mark every word in a sentence according to some complexity judgment (dense annotation) or to pick a target word in a context and ask only for a judgment of the complexity of this word (sparse annotation). In the dense annotation setting, it is likely to be possible to get a much higher throughput of complexity annotations, as the reader only needs to read a sentence once to give multiple annotations; however, annotators are likely to be deeply influenced by the meaning of the sentence as a whole and may struggle to dissociate this from their annotation of the individual words. In the sparse annotation setting, more contexts are required to give a comparable number of instances; however, the annotation given is more likely to reflect the token itself rather than the sentence. Any such sparse annotation task should be set up to ensure that an annotator gives judgments based on the word in its context (i.e., that they read and understand the context), rather than just giving a judgment based on the word, as if no context were presented.

Given that we are recommending that the data is presented in context, there is a strong argument for presenting multiple instances of each word. If only one instance of a word were presented in context, then it may be the case that this word had a specific usage that was not representative of its general usage. Words are polysemous (Fellbaum, 2010) and this is true both at the coarse-grained level (tennis bat vs. fruit bat) and the fine-grained level (I love you vs. I love London). The coarse-grained level represents different meanings or etymologies, whereas the fine-grained level may represent a similar meaning but a different intensity (as in our example). The provision of multiple instances of a word allows both of these factors to be taken into account. This consideration should be held in balance with the need to have a diversity of tokens. If a dataset has N instances, constituting P occurrences each of R distinct words, then we suggest that \(R \gg P\); i.e., the number of distinct words should be much larger than the number of instances of each word. There is more to be gained in a dataset by having a diversity of tokens than by having many annotations on each token. An interesting separate task would be to annotate many instances of one word form for complexity and analyse how the context affects this. However, this is secondary to the task we are presenting here of assessing a word’s complexity.

Each instance in a new CWI dataset should be viewed and annotated by multiple people, ideally from a spectrum of ability levels. Multiple annotations have been a common theme of the previous CWI datasets we have discussed, with datasets using as many as 20 annotators per instance. All subsets (train, dev, test) of a dataset should be annotated by the same number of annotators, or at the very least by annotators drawn from the same distribution. This ensures that all subsets of the data are comparable. More annotations allow us to capture a wider array of viewpoints from annotators of varying ability levels. If the annotators are carefully selected to ensure they represent a mixture of ability levels then this will lead to annotations that are representative. Consider the case where all annotators are of low ability, or of high ability. The resulting annotations may lead to all words being assigned to the most or least complex categories respectively. This may be desirable in user- or genre-specific settings, but is not desirable for general-purpose LCP. There are two potential approaches to selecting a pool of annotators and distributing annotations between them. Firstly, a researcher may choose to use a fixed number of annotators, such that each annotator views every data instance once. In this setting, each data instance receives N annotations, where N is the number of annotators chosen. Secondly, the annotations may be distributed across a wider pool of annotators, where, given N annotators, each sees a randomised subset of the data. In this setting, a researcher may choose to control how many instances each annotator sees, ensuring an even distribution of annotators across the data instances. The second approach is more appropriate in a crowd-sourcing setting, where a researcher has diminished ability to control who takes on which job.

Previous CWI datasets for English have placed a strong focus on non-native speakers, as discussed above. Non-native speakers have learnt English as a foreign language and the assumption in using them for CWI research is that they will have only learnt a simple subset of English that allows them to get by in daily tasks. However, a non-native speaker may range from a new immigrant who has recently arrived in an English-speaking country to someone who has lived there for decades. Further, both native and non-native speakers may simultaneously be specialists in some domains and novices in others. Non-natives may be specialists in domains where natives are not, and vice versa, influencing their complexity judgments. We would suggest that, whilst non-native speakers should not be excluded from the CWI annotation process, they should not be relied upon either. Instead, the pool of annotators should be selected for their general ability in English, not for their mother tongue. Indeed, when selecting non-native speakers it may be worth considering a variety of mother tongues, as different languages or language families have cognates and near-cognates with English, making it easier for non-native speakers from certain backgrounds to understand English words with roots in their mother tongue.

Allowing for multiple genres gives more diversity in the type of text studied and allows systems that are trained on it to generalise better to unseen texts. This prevents overfitting to one text type, leading to results that are more reliable and hence more interpretable, and ultimately leads to the creation of useful models that can be applied across genres. CWI resources should name the source genres that their texts are taken from and comply with the licences placed on those sources. Whilst informal, or amateur, text is in abundance (e.g., Twitter or Wikipedia), formal texts should also be considered for CWI, such as professionally written news, scientific articles, parliamentary proceedings, legal texts or any other such texts that are written for a professional audience. These texts provide well-structured language, which is typically targeted at a specific audience and is difficult for those outside that audience. These texts contain a higher density of complex words and as such are useful examples of the types of text that might need interventions to improve their readability for a lay reader.

As discussed previously, MWEs are an important element in complexity as previous studies have shown that MWEs are generally considered more complex by a user than individual words (Gooding & Kochmar, 2018). Any new CWI dataset should consider incorporating MWEs as they will certainly be useful for future CWI research. When we consider that MWEs can range from simple collocations (White House), to verbal phrases (pick up) and may span 2 or more words, across parts of speech—including phrasal MWEs (it’s raining cats and dogs)—it is clear that the number of potential MWEs to consider is much wider than the number of single tokens. How do we select appropriate MWEs to cover? There is no particular advantage to CWI in selecting one category of MWE over another, but we suggest that any dataset covering MWEs explicitly names the types of MWE that it has covered. By incorporating MWEs, a dataset may be used to investigate both the nature of complexity in those MWEs and in the constituent tokens. Strategies for identifying MWEs, as well as the different types of MWEs are beyond the scope of this work and we would direct the reader to the MWE literature (Sag et al., 2002; Schneider et al., 2014) for a more comprehensive treatment of this problem.

Table 6 A list of recommended features for future CWI dataset development

5 CompLex 2.0

In this Section we describe a new CWI dataset that we have collected. Our new dataset, dubbed ‘CompLex 2.0’, builds on prior work (CompLex 1.0; Shardlow et al., 2020), in which we collected and annotated tokens in context for complexity. We describe the data collection process for CompLex 1.0 below, followed by the annotation process that we undertook to extend this data to CompLex 2.0. CompLex 2.0 covers more instances than CompLex 1.0 and, crucially, has more annotations per instance than CompLex 1.0, making it more reliable. We present statistics on our new dataset and describe how it fits the recommendations we have made in our specification for new CWI datasets above. CompLex 2.0 was used as the dataset for the SemEval Shared Task on Lexical Complexity Prediction in 2021.

5.1 Data collection

The first challenge in dataset creation is the collection of appropriate source texts. We have followed our specification above and selected three sources that provide a sufficient level of complexity. We aimed to select sources that were sufficiently different from one another to prevent trained models from overfitting to any one source text. The sources that we used are described below.

  • Bible: We selected the World English Bible translation (Christodouloupoulos & Steedman, 2015). This is a modern translation, so it does not contain archaic words (thee, thou, etc.), but it still contains religious language that may be complex. The inclusion of this text provides narrative and poetic language that is typically familiar to a reader, yet interspersed with unfamiliar named entities and terms with specific religious meanings (propitiation, atonement, etc.).

  • Europarl: We used the English portion of the European Parliament proceedings selected from europarl (Koehn, 2005). This is a very varied corpus concerning a wide range of issues related to European policy. As this is speech transcription, it is often dialogical in nature, in contrast to our other two corpora. Again, the style of text is generally familiar as it consists of transcriptions of debates. However, technical terminology relating to the topics of discussion is present, raising the difficulty level of this text for a reader.

  • Biomedical: We selected articles from the CRAFT corpus (Bada et al., 2012), which are all in the biomedical domain. These present a very specialised type of language that will be unfamiliar to non-domain experts. Academic articles present a classic challenge in understanding for a reader and are typically written for a very narrow audience. We expect these texts to be particularly dense with complex words.

In addition to single words, we also selected targets containing two tokens. We used syntactic patterns to identify these MWEs, selecting for adjective-noun or noun-noun patterns. We discounted any syntactic pattern that was followed by a further noun to avoid splitting complex noun phrases (e.g., noun-noun-noun, or adjective-noun-noun). We used the StanfordCoreNLP tagger (Manning et al., 2014) to get part-of-speech tags for each sentence and then applied our syntactic patterns to identify candidate MWEs.
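The sketch below illustrates this pattern-matching step. It uses NLTK’s tokenizer and POS tagger as a stand-in for the StanfordCoreNLP tagger that we actually used, so the tags (and therefore the matched candidates) may differ slightly.

```python
from nltk import pos_tag, word_tokenize
# Requires the NLTK 'punkt' and 'averaged_perceptron_tagger' resources to be downloaded.

def candidate_mwes(sentence):
    """Return adjective-noun or noun-noun pairs not followed by a further noun."""
    tagged = pos_tag(word_tokenize(sentence))
    mwes = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        next_tag = tagged[i + 2][1] if i + 2 < len(tagged) else ""
        # Adjective-noun or noun-noun, discarded if a further noun follows.
        if t1.startswith(("JJ", "NN")) and t2.startswith("NN") and not next_tag.startswith("NN"):
            mwes.append(f"{w1} {w2}")
    return mwes

print(candidate_mwes("He parked the golf cart beside the notice board."))
```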

Clearly this approach does not capture the full variation of MWEs. It limits the length of each MWE to two tokens and only identifies compound or described nouns. Some examples of the types of MWE that we identify with this scheme are given in Table 7. Whilst this limits the scope of MWEs that are present in our corpus, it does allow us to make a focused investigation of these types of MWE. Notably, the types of MWE that we have identified are those that are the most common (compound nouns, described nouns, compositional, non-compositional and named entities). The investigation of other types of MWEs may be addressed by other, more targeted studies following our recommendations for CWI annotation.

Table 7 The varied types of MWEs that can be captured by our syntactic pattern matching. NN indicates a Noun-Noun pattern, whereas JN indicates an Adjective-Noun pattern

For each corpus we selected words using frequency bands, ensuring that the words in our corpus were distributed across the range of low to high frequency. We selected the following eight frequency bands according to the SUBTLEX frequencies, in order of least to most frequent (i.e., most to least complex): 2–4, 5–10, 11–50, 51–250, 251–500, 501–1400, 1401–3100, 3101–10,000. We excluded the rarest words (those with a frequency of only 1) as well as the most frequent (those above 10,000) in order to ensure that our instances were well-attested content words. As frequency is correlated with complexity (Brysbaert et al., 2011), this ensures that our final corpus has a range of high and low complexity targets. We chose to select 3,000 single words and 600 MWEs from each corpus to give a total of 10,800 instances in our corpus. We selected a representative number of instances from each frequency band to give the desired total number of instances in each corpus. We automatically annotated each sentence with POS tags and only selected nouns as our targets, in keeping with our MWE selection strategy. We allowed a maximum of 5 instances of a token to be selected in each genre (ensuring that the contexts were different). This maximises the total number of examples of each instance, whilst still allowing some variation in the selection of tokens. There is a theoretical minimum of 600 unique single words and 120 unique MWEs that could occur in our corpus (each with 5 occurrences in each of the three genres). Table 11 shows that the number of repeated instances is much lower. This is a consequence of the stochastic selection procedure that we employed. We have included examples of the contexts and target words in Table 8.
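A minimal sketch of the band-assignment step is shown below; the `subtlex_counts` dictionary is a hypothetical stand-in for the SUBTLEX frequency list, and the example counts are invented for illustration.

```python
# Frequency bands in order of least to most frequent, as described above.
BANDS = [(2, 4), (5, 10), (11, 50), (51, 250), (251, 500),
         (501, 1400), (1401, 3100), (3101, 10000)]

def frequency_band(word, subtlex_counts):
    """Return the index of the band a word falls into, or None if excluded."""
    count = subtlex_counts.get(word.lower())
    if count is None:
        return None
    for i, (low, high) in enumerate(BANDS):
        if low <= count <= high:
            return i
    return None  # excluded: hapaxes (count == 1) or very frequent words (> 10,000)

subtlex_counts = {"frenulum": 3, "locomotion": 180, "house": 25000}  # invented counts
for w in ["frenulum", "locomotion", "house"]:
    print(w, frequency_band(w, subtlex_counts))  # bands 0 and 3; 'house' is excluded
```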

Table 8 Examples from our corpus, the target word is highlighted in bold text

5.2 Data labelling

As has been previously mentioned, prior datasets have focused on either (a) binary complexity or (b) probabilistic complexity, neither of which gives a true representation of the complexity of a word. In our annotation we chose to annotate each word on a 5-point Likert scale, where each point was given the following descriptor:

  1. Very Easy: Words which were very familiar to an annotator.

  2. Easy: Words for which an annotator was aware of the meaning.

  3. Neutral: A word which was neither difficult nor easy.

  4. Difficult: Words for which an annotator was unclear of the meaning, but may have been able to infer the meaning from the sentence.

  5. Very Difficult: Words that an annotator had never seen before, or were very unclear.

We used the following key to transform the numerical labels to a 0-1 range when aggregating the annotations: \(1 \rightarrow 0\), \(2 \rightarrow 0.25\), \(3 \rightarrow 0.5\), \(4 \rightarrow 0.75\), \(5 \rightarrow 1\). This allowed us to ensure that our complexity labels were normalised in the range 0–1.
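A minimal sketch of this mapping and of the aggregation by mean is shown below (the example annotations are hypothetical).

```python
# Map Likert points to the 0-1 range, as described above.
LIKERT_TO_SCORE = {1: 0.0, 2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0}

def aggregate(annotations):
    """Average the mapped Likert judgments collected for one instance."""
    scores = [LIKERT_TO_SCORE[a] for a in annotations]
    return sum(scores) / len(scores)

print(aggregate([2, 3, 2, 4, 2]))  # 0.4
```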

We initially employed crowd workers through the Figure Eight platform (formerly CrowdFlower), requesting 20 annotations per data instance and paying $0.03 per annotation. We selected annotators from English-speaking countries (UK, USA and Australia). In addition, we used the annotation platform’s in-built quality control metrics to filter out annotators who failed pre-set test questions, or who answered a set of questions too quickly.

After we had collected these results, we further analysed the data to detect instances where annotators had not fully participated in the task. We specifically analysed cases where an annotator had given the exact same annotation for every instance they saw (usually these were all ’Neutral’) and discarded these annotations from our data. We retained any data instance that had at least 4 valid annotations in our final dataset.
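The filtering logic can be sketched as follows; the data structures and threshold names here are illustrative rather than the exact implementation we used.

```python
from collections import defaultdict

def filter_annotations(annotations, min_valid=4):
    """Drop annotators who gave one constant label throughout, then keep only
    instances with at least `min_valid` remaining annotations.

    `annotations` maps annotator_id -> {instance_id: likert_label}.
    """
    # Discard annotators whose labels never vary (e.g. all 'Neutral').
    valid = {a: labels for a, labels in annotations.items()
             if len(set(labels.values())) > 1}

    per_instance = defaultdict(list)
    for labels in valid.values():
        for instance_id, label in labels.items():
            per_instance[instance_id].append(label)

    return {i: labs for i, labs in per_instance.items() if len(labs) >= min_valid}
```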

This led to the version of the dataset we described as CompLex 1.0. Whilst this dataset evidenced the trends we expected to see, the conclusions we were able to draw from it were weaker than we hoped (Shardlow et al., 2020). The median number of annotators was 7 per instance, and we identified this as an area for improvement. The involvement of more annotators would allow more opinions to be expressed, leading to better average judgments.

For the second round of annotations we used the Amazon Mechanical Turk platform. We used exactly the same data as in the original annotation of CompLex 1.0 and requested new annotations for each instance. We gave the same instructions to annotators regarding the Likert-scale points. As there is no in-built quality control in Mechanical Turk, we opted to release the data in batches (1,200 instances at a time). We asked for a further 10 annotations per instance and paid at a rate of $0.03 per annotation. We reviewed the annotators’ work in between batches, rejecting accounts which submitted annotations too quickly, or whose annotations did not correlate with the other annotators’ judgments. We also measured the correlation with lexical frequency to ensure that the annotations we were receiving were in the range we expected.

This allowed us to gather a further 108,000 annotations on the CompLex data. These new judgments were aggregated with those from CompLex 1.0 to give a new dataset—CompLex 2.0. We used this data to run a shared task on Lexical Complexity Prediction at SemEval 2021 (Shardlow et al., 2021).

5.3 Corpus statistics

The first round of annotations led to an initial version of the corpus (CompLex 1.0), for which we show the originally reported statistics in Table 9. Due to the quality control that we employed for this round of annotation, we discarded a large portion of our original judgments and only kept instances with four or more annotations. This is evident in the fact that only 9,476 instances out of our original 10,800 are present in this iteration of the corpus. Additionally, the median number of annotators was 7 across our corpus (with the range being from 4 to 20). Retaining only the annotations whose quality we could be certain of was a difficult choice, as it reduced the amount of data available. However, the mean complexities of the sub-corpora were in line with our expectations, with Biomedical text being on average more complex than the other two genres.

Table 9 The statistics for CompLex 1.0. We report on the entire corpus and also present a breakdown of statistics by Genre

This led us to undertake our second round of annotation in order to develop CompLex 2.0 ready for the SemEval shared task. We have included statistics on the annotations aggregated from both rounds in Table 10. 513 separate annotators viewed our data, with each annotator seeing on average 542 instances across all rounds of annotation (around 5% of our corpus). We gathered a total of 278,093 annotations, paying $0.03 per annotation. The average time spent per annotation was 21.61 seconds, which means that we paid our workers at an average rate of 5 US dollars per hour. The task received reviews indicating that annotators found it to be well paid in comparison to other tasks on the platform. We gathered an average of 25.75 annotations per instance, an increase over CompLex 1.0, which had on average only 7 annotations per instance. We expect that by having more annotations per instance, we will have more reliable average estimates of the complexity of each word.

Table 10 Statistics on the round of annotation undertaken with Mechanical Turk

We report detailed statistics on our new dataset, CompLex 2.0, in Table 11. In total, 5,617 unique tokens covering single words and multi-word expressions are distributed across 10,800 contexts. Whilst the contexts are split evenly between the genres (3,600 each), the number of repeated words is higher in the Biomed and Bible corpora, with more distinct words occurring in the Europarl corpus. The mean complexity annotation is low at 0.321 for the entire corpus, indicating that the average complexity of words is somewhere between points 2 (0.25, a word whose meaning the annotator was aware of) and 3 (0.5, a word which was neither difficult nor easy) on our Likert scale. This indicates that annotators generally understood the words in our dataset. The annotations did, however, use the full range of our Likert scale, and the dataset contains words of all complexities. We can see from the data that the Biomedical genre was on average more difficult to understand (0.353) than the other genres (0.303 for Europarl and 0.307 for the Bible). Multi-word expressions are markedly more complex (0.419) than single words (0.302), with the same genre distinctions as in the full data.

Table 11 The statistics for CompLex 2.0
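
Statistics of this kind can be reproduced with a few lines of pandas; the toy rows and column names below are illustrative assumptions rather than the exact schema of the released file:

```python
import pandas as pd

# Toy rows mirroring the fields needed for these statistics.
df = pd.DataFrame({
    "corpus": ["biomed", "biomed", "europarl", "bible"],
    "token": ["granules", "insulin receptor", "motion", "cubit"],
    "complexity": [0.35, 0.55, 0.25, 0.45],
})

print(df.groupby("corpus")["complexity"].mean())   # per-genre mean complexity
df["is_mwe"] = df["token"].str.contains(" ")        # crude single-word vs MWE split
print(df.groupby("is_mwe")["complexity"].mean())
```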

5.4 Inter-annotator agreement

Achieving strict adherence to annotation guidelines is difficult in the crowd-sourcing setting as there is little time to train, test or survey annotators. As a result, inter-annotator agreement tends to be lower in this context. We provided some controls, as outlined above, to ensure that annotators were fully participating in the task and that their annotations aligned with those of other annotators. In our setting, we do not necessarily expect annotators to agree in every case, as one may legitimately consider a word to be complex whilst another considers it to be simple. A reasonable expectation is that annotators will provide similar annotations to each other, and that the annotations will mostly fall into one category. We expect the annotations for a given instance to be approximately normally distributed; we have already made this assumption, as we take the mean to give the average complexity.

To test this, we used the Shapiro-Wilk test (Shapiro & Wilk, 1965), whose test statistic lies in the range 0–1 and indicates how closely a sample follows a normal distribution. For each of our instances, we perform the test on the annotations for that instance. A higher value indicates that the instance has annotations which are more likely to be normally distributed, whereas a low value indicates a non-Gaussian distribution, such as a multi-modal distribution. A histogram of this data is displayed in Fig. 1. This shows that the majority of our data obtains a score between 0.7 and 0.9 according to the Shapiro-Wilk test, with a peak around 0.85. This indicates that our data is generally normally distributed, and hence that annotators generally gave annotations centred around a mean value.

Fig. 1 A histogram of Shapiro-Wilk test statistics, demonstrating the likelihood for each instance that the annotations are normally distributed
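
The per-instance test and histogram can be reproduced along the following lines; the synthetic annotations are a stand-in for the real per-instance score lists:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import shapiro

# Synthetic stand-in for the per-instance annotation lists (scores already
# mapped to [0, 1]); the real data would come from the aggregation step above.
rng = np.random.default_rng(0)
annotations_by_instance = {
    i: np.clip(rng.normal(loc=0.3, scale=0.15, size=20), 0, 1)
    for i in range(500)
}

# Shapiro-Wilk W statistic per instance (SciPy requires at least 3 samples).
w_stats = [shapiro(scores).statistic
           for scores in annotations_by_instance.values() if len(scores) >= 3]

plt.hist(w_stats, bins=20)
plt.xlabel("Shapiro-Wilk W statistic")
plt.ylabel("Number of instances")
plt.show()
```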

In Table 12 we show a number of examples from our corpus that do not follow the distribution we might have expected. These were infrequent in our corpus, but are displayed here to help the reader understand where annotators may have disagreed. In example 1 the simple word 'heaven' was given to annotators, most of whom assigned it to the Very Easy category. However, 3 annotators disagreed, assigning it to the Neutral category. Possibly, these annotators found the word itself easy, but its metaphorical usage harder to grasp. Example 2 shows a similar disagreement, albeit around a more difficult word. 'Election' is a word that most people living in a democracy will have encountered, yet 5 people felt it was neither easy nor difficult, placing it in the Neutral category. Our third example, taken from the Biomedical genre, demonstrates a word ('Granules') which is considered Easy by 14 annotators, yet Difficult by 4 annotators. Whilst 'Granules' is not a particularly rare word, it may be considered complex by some in this instance due to its contextual usage in the biomedical literature. Example 4 shows a word which is specific to biblical language ('Cubit'). Although its annotations were reasonably Gaussian (0.848 according to the Shapiro-Wilk statistic), they were split over all 5 potential categories. This is an example of the effect of annotators' prior familiarity with the text: those who know that a cubit is an ancient measure of length score it on the easier side of the Likert scale, whereas those who have not seen the word before score it as more difficult. The remaining three examples (5–7) all score similarly highly on the Shapiro-Wilk test; however, they have a wide spread of annotations. Again, this is likely due to the annotators' varying familiarity with each word.

Table 12 Examples of annotations with interesting distributions indicating disagreement among annotators

5.5 CompLex 2.0 features

We have presented a corpus that was developed according to the recommendations that we have set out earlier in this work (see Sect. 4.1). Whilst we have made every effort to follow these, practical concerns have led us to pragmatic design decisions that made the development of our corpus feasible. In the following list, we itemise the design decisions that were made during the construction of our corpus and show how these link to the recommendations from Table 6.

  1. Continuous annotations: We have implemented this using a Likert scale as described above. Unlike Maddela and Xu (2018), who used a 4-point Likert scale, we chose a 5-point Likert scale to allow annotators to give a neutral judgment. To give final complexity values we took the mean of these annotations, after transforming the labels into the range 0–1.

  2. Context: We presented each target word in context to the annotators and explicitly asked them to judge the word based on its contextual usage (but not on the context itself). Following Peirce (1906), we distinguish between word types (the distinct words used in a text, which comprise its vocabulary) and word tokens (the individual occurrences of those words throughout the text). There are clear variations in the complexity of different tokens sharing the same word type. For example, the word 'table' receives a higher complexity rating in the less common sense of 'table a motion' than in the more frequent sense of something being 'on the table'.

  3. Multiple tokens: We presented a maximum of 5 tokens per word type, per genre. This led to 5,617 word types across 10,800 tokens and contexts, giving an average density of 1.92 contexts per word type. Although some word types do appear in multiple contexts, 3,423 word types appear in only a single context, and 671 word types feature 5 or more tokens (and contexts). This is a compromise between our desire to include a wide variety of word types in the dataset and to include multiple tokens of each type. A dataset featuring a more rigorous treatment of contexts may reveal the role of context in complexity estimation in a way that our data is not able to.

  4. Multiple token annotations: We have described our process of gathering an average of 25.75 annotations per token. We could have chosen to collect fewer annotations in favour of annotating more tokens; however, we prioritised having a large number of judgments per token to give a more consistent and representative averaged annotation.

  5. Diverse annotators: We did not place many restrictions on our annotators, nor did we record demographic information about them. Doing so may have helped us to better understand the makeup of our annotations and to identify potential biases. We did not record this information due to the crowd-sourcing setting that we used. This is something for future LCP annotation efforts to consider.

  6. Multiple genres: We have selected three diverse genres with a potential for complex language. We deliberately avoided the use of Wiki text as this has been studied widely already in CWI.

  7. Multi-word expressions: We have included these in a limited form as part of our corpus; MWEs make up 16.66% of our corpus. We included them as an interesting area to study and we hope that their inclusion will shed light on the complexity of MWEs. Further studies could focus on specific types of MWE, extending our research.

The CompLex 2.0 corpus is designed largely according to the recommendations we have set out. One notable exception is that we did not record demographic information on our participants and as such cannot make strong claims about the diversity of our annotators. Our corpus is intended as a starting point for future LCP researchers to build on. Using the methodology described in this section, further datasets encoding information about complex words can be annotated, focusing on the remaining open research questions in lexical complexity prediction.

6 Predicting categorical complexity

We represented words and multiword units in the CWI–2016 (Paetzold & Specia, 2016a) and CWI–2018 (Yimam et al., 2018) datasets, as well as in the new CompLex 2.0 single-word and multi-word datasets, using features which we consider likely to be predictive of their complexity (Sect. 3), selected on the basis of previous work in lexical simplification (Paetzold, 2016), text readability (Yaneva et al., 2017; Deutsch et al., 2020), psycholinguistics/neuroscience (Yonelinas et al., 2005), and our inspection of the annotated data.

We used the trees.RandomForest method distributed with Weka (Hall et al., 2009) to build baseline lexical complexity prediction models exploiting the features presented in Sect. 3. In the experiments described in this section, we framed the prediction as a classification task with continuous complexity scores mapped to a 5-point scale. The points on this scale denote the proportion of annotators who consider the word complex (c): few (\(0 \le c < 0.2\)), some (\(0.2 \le c < 0.4\)), half (\(0.4 \le c < 0.6\)), most (\(0.6 \le c < 0.8\)), and all (\(0.8 \le c \le 1\)).
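
The mapping from continuous complexity scores onto these five bands can be expressed as follows (a small sketch matching the thresholds above):

```python
def to_band(c):
    """Map a complexity score c in [0, 1] onto the 5-point scale used for the
    classification experiments."""
    if c < 0.2:
        return "few"
    elif c < 0.4:
        return "some"
    elif c < 0.6:
        return "half"
    elif c < 0.8:
        return "most"
    return "all"

assert to_band(0.321) == "some"
assert to_band(1.0) == "all"
```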

Table 13 displays weighted average F-scores and mean absolute error (MAE) scores obtained by the baseline models in the ten-fold cross validation setting. This table includes statistics on the number of instances to be classified in each dataset.

Table 13 Evaluation results of the baseline trees.RandomForest classifier

Table 14 displays the results of an ablation study performed to assess the contribution of various groups of features to the word complexity prediction task on the four datasets: CWI–2016, CWI–2018, CompLex (single words), and CompLex (multi-words). The feature sets are those studied previously in Sect. 3. In the table, negative values of \(\Delta \)MAE indicate that the features are helpful, reducing the mean absolute error of the classifier; the reverse is true of positive values.

Table 14 Results of feature ablation

Our results indicate that for prediction of lexical complexity in the CWI–2016 dataset, five of the ablated feature groups are useful. Features encoding information about word length and the regularity of the singular/plural forms of nouns, the typical age of acquisition of the words, and the broad syntactic categories of the words improve the accuracy of the classifier, as do word embeddings.

For words in the CWI–2018 dataset, no feature group was found to be particularly useful for prediction of lexical complexity, though a simple model based only on word length information outperformed the default baseline exploiting all features. Again, this may be due to the typically longer MWEs present in the CWI–2018 dataset, which are exclusively labelled as complex.

When predicting the lexical complexity of individual words in the CompLex 2.0 data, features encoding information about whether or not the word was archaic, about the regularity of the singular/plural forms of nouns, and about the stress patterns of the words were all found to be useful. When considering multiword units (bigrams), a far larger proportion of the feature groups was observed to be useful for lexical complexity prediction. In our ablation study of bigrams, we assigned each bigram the average value of each feature (all of the features were represented numerically, including the binary and one-hot representations, and none of the features were symbolic). We found that features encoding information about word frequency, whether or not the words were archaic, word length, the regularity of singular/plural forms, standard age of acquisition, broad syntactic category, the word's status as either archaic, alien, obsolete, colloquial, rare, or standard, the stress pattern of the word, and the occurrence of an INFOBOX element in the Wikipedia entry for the word were all useful predictors of lexical complexity. Averaged word embeddings also improved the accuracy of predictions made by the trees.RandomForest classifier in the CompLex (multi) dataset.
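
Assuming the average is taken element-wise over the two component words (one plausible reading of the procedure above), the bigram representation can be sketched as:

```python
import numpy as np

def bigram_features(word_features, bigram):
    """Represent a bigram MWE by the element-wise mean of its two components'
    feature vectors (possible because every feature, including the binary and
    one-hot ones, is numeric)."""
    w1, w2 = bigram.split()
    return np.mean([word_features[w1], word_features[w2]], axis=0)

# Toy vectors: [log-frequency, length, is_archaic]
word_features = {"insulin": np.array([3.2, 7.0, 0.0]),
                 "receptor": np.array([3.5, 8.0, 0.0])}
print(bigram_features(word_features, "insulin receptor"))  # [3.35 7.5  0.  ]
```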

In the CWI–2016 and CWI–2018 datasets, we applied Weka's attribute (feature) ranking method with the unsupervised Principal Components Attribute Transformer evaluator to the 378 numerical features described previously in Tables 3 and 4 (Sect. 3). Table 15 displays the ten top-ranked groups of features for the four datasets. The main observation to be drawn from the feature selection study is the usefulness, in all datasets, of information related to word familiarity, concreteness, and imageability, as well as of information from the vector representations of words derived using GloVe (Pennington et al., 2014). These features occur in the systems that participated in the CWI shared tasks, as shown in Tables 1 and 2, which corroborates our findings and is in line with previous work.

Table 15 Results of feature selection (PrincipalComponents)
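
A rough scikit-learn analogue of this unsupervised, PCA-based attribute ranking (not identical to Weka's PrincipalComponents evaluator) is sketched below; features are ranked by the magnitude of their loading on the first principal component, and the random matrix is a placeholder for the real feature table:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def rank_features_by_pca(X, feature_names):
    """Rank features by the magnitude of their loading on the first principal
    component of the standardised feature matrix."""
    X_std = StandardScaler().fit_transform(X)
    pca = PCA(n_components=1).fit(X_std)
    loadings = np.abs(pca.components_[0])
    order = np.argsort(loadings)[::-1]
    return [(feature_names[i], round(float(loadings[i]), 3)) for i in order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # placeholder feature matrix
print(rank_features_by_pca(X, ["frequency", "length", "AoA", "concreteness"]))
```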

Interestingly, whereas the correlation analysis presented previously did not find the psycholinguistic features (Groups E, F, G, K) to correlate with complexity, the principal component analysis indicates that these features are in fact likely to be useful for prediction in these datasets.

These results demonstrate that by using our new data from CompLex 2.0, the features that we expect to correlate well with complexity judgments are more likely to be effective features for classification than when annotations are done in a binary setting as in the CWI–2016 and CWI–2018 datasets.

7 Predicting continuous complexity

In this final experimental section, we use the data we have collected to discuss the nature of complex words from a different perspective than in Sect. 6. Whereas in the previous section we converted all labels into a categorical format to allow comparison, in this section we use the labels assigned to CompLex 2.0 to discuss factors affecting the nature of lexical complexity and its prediction. We first look at the effects of genre on lexical complexity prediction. We then study the distribution of annotations to determine how and when annotators agree on the complexity of a word.

7.1 Prediction of complexity across genres

To better understand the effect of text genre on the LCP task, we designed the experiments described in this section. For these, we employed a simple linear regression with the features described previously in Sect. 3. We used the single words in the corpus and split the data into training and test portions, with 90% of the data in the training portion and 10% in the test portion. We first created our linear regression using all the available training data and evaluated it using Pearson's correlation, with the labels obtained during the annotation round undertaken to create CompLex 2.0. The prediction model based on linear regression achieved a score of 0.771, indicating a reasonably high level of correlation between its predictions and the labels of the test set.
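
The experimental setup can be sketched as follows; the synthetic X and y below are placeholders standing in for the Sect. 3 feature matrix and the CompLex 2.0 labels:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic placeholders for the feature matrix and the [0, 1] complexity
# labels; substitute the real data here.
rng = np.random.default_rng(0)
X = rng.normal(size=(9000, 20))
y = np.clip(0.1 * X[:, 0] + 0.3 + rng.normal(scale=0.05, size=9000), 0, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,
                                                    random_state=0)
model = LinearRegression().fit(X_train, y_train)
r, _ = pearsonr(model.predict(X_test), y_test)
print(f"Pearson's r on the held-out 10%: {r:.3f}")
```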

This result is recorded in Table 16, where we also show the results for each genre. In each case, we have selected only data from a given genre and followed the same procedure as above, splitting into train and test and evaluating using Pearson’s correlation. The linear regression model is less closely correlated when making lexical complexity predictions in the Europarl (0.724) and in the Bible data (0.735). This is expected, given the reduction in size of the training data. It is surprising to see that the linear regression model worked better for the Biomedical data than for any other subset (0.784). This may indicate that simple and complex words are more distinct in this corpus and that this distinction can be learnt from a more focused training set.

Table 16 Results of training a linear regression on all the data, and on each genre

To further determine the effects of genre on lexical complexity prediction, we constructed a new linear regression model that was trained and tested using specific genres selected from our corpus. We trained on single genres and tested on each of the other 2 genres, as well as training on a combined subset of 2 genres and testing on the remaining genre. The results for this experiment are shown in Table 17. We were able to build a reliable predictive model for cross-genre complexity prediction in each case.
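
A sketch of this cross-genre setup is shown below; as before, the synthetic X, y and genre arrays are placeholders for the real corpus:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

def cross_genre_score(X, y, genre, train_genres, test_genre):
    """Train a linear regression on the instances whose genre is in
    `train_genres` and report Pearson's r on the held-out genre."""
    train = np.isin(genre, train_genres)
    test = genre == test_genre
    model = LinearRegression().fit(X[train], y[train])
    return pearsonr(model.predict(X[test]), y[test])[0]

# Synthetic placeholders; in practice X, y and `genre` come from the corpus.
rng = np.random.default_rng(0)
X = rng.normal(size=(9000, 20))
y = np.clip(0.1 * X[:, 0] + 0.3 + rng.normal(scale=0.05, size=9000), 0, 1)
genre = rng.choice(["europarl", "biomed", "bible"], size=9000)

for test_genre in ["europarl", "biomed", "bible"]:
    others = [g for g in ["europarl", "biomed", "bible"] if g != test_genre]
    for combo in ([others[0]], [others[1]], others):
        r = cross_genre_score(X, y, genre, combo, test_genre)
        print(f"{combo} -> {test_genre}: {r:.3f}")
```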

Our results show that there is a drop in performance when training on out-of-domain data, compared to training on in-domain data. This is true across all genres, where a reduction of between 0.119 and 0.297 can be observed in Pearson's correlation. In each genre, the scores improve when training on the other two genres, rather than just on one. This may be because exposure to multiple genres helps the linear regression to generalise to global complexity effects, rather than overfitting to complexity features specific to one genre. If we were to test on an additional genre or domain, we might expect that training on all three genres present in our corpus would yield even more generalised results.

Table 17 Results of training a linear regression on one genre, or pair of genres and testing on a different genre

7.2 Subjectivity

We previously used a Shapiro-Wilk test to demonstrate that our annotations are generally normally distributed. We obtained the mean of each annotation distribution to give a complexity score for each instance in our dataset. An interesting question to ask is how representative these means are of the true complexity of a word. One word may be considered easy by one annotator, yet difficult by another. Factors such as age, education and background may well affect which words a reader is familiar with. We can use the normally distributed annotations to understand this phenomenon by investigating the standard deviations of the annotations for each instance.

We have provided examples from our corpus in Table 18, with both the mean complexity and the standard deviation (\(\sigma \)) of the annotations. The top three rows show examples of high standard deviation, whereas the bottom three rows show examples of low standard deviation. It is clear from the table that annotators generally agree more about words which are less complex, with disagreements tending to happen around the more difficult words. An analysis of the mean complexity and the standard deviation of the annotations yields a Pearson's correlation of 0.621, indicating that the two are moderately correlated (disagreement tends to increase with complexity).

Table 18 Examples of instances with subjective (wide standard deviation) and certain (narrow standard deviation) annotations
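
The mean versus standard-deviation analysis amounts to a few lines of code; the synthetic per-instance annotations below stand in for the real corpus, as in the Shapiro-Wilk sketch above:

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic per-instance annotation lists (scores in [0, 1]).
rng = np.random.default_rng(0)
annotations_by_instance = {
    i: np.clip(rng.normal(loc=rng.uniform(0.1, 0.7), scale=0.1, size=20), 0, 1)
    for i in range(500)
}

means = np.array([np.mean(v) for v in annotations_by_instance.values()])
stds = np.array([np.std(v) for v in annotations_by_instance.values()])

r, _ = pearsonr(means, stds)
print(f"Correlation between mean complexity and disagreement: {r:.3f}")
```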

8 Discussion

Our work has sought to introduce a new definition of lexical complexity to the research community. Whereas previous treatments of lexical complexity have considered it a binary phenomenon in the Complex Word Identification (CWI) setting, we have extended this definition to lexical complexity prediction (LCP), considering complexity as a continuous value associated with a word. This new definition asks the question of ‘how complex is a word’ rather than ‘is this word complex or not?’. This question allows us to give each token a complexity rating on a continuous scale, rather than a binary judgment. If binary judgments were required, it would be easy to create them using our dataset by imposing a threshold at some point in the data. By imposing thresholds at different points, binary labels can be obtained to suit different subjective definitions of complexity. Further, by implementing multiple thresholds, multiple categorical labels can be recovered from the data.

In Sect. 3 we showed that the types of features we would typically expect to correlate with word complexity did not show any correlation with the CWI–2016 and CWI–2018 datasets. This motivated our analysis of the protocol underlying the annotation of these datasets and our development of a new protocol for CWI annotation. In Sect. 6, we were able to show through feature ablation experiments that more of the feature sets we used were relevant to the classification of CompLex 2.0 than to that of CWI–2016 or CWI–2018. This implies that the annotations in our new dataset are more reflective of traditional measures of complexity.

We discussed the existing CWI datasets at length (Sect. 3), culminating in our new specification for LCP datasets in Sect. 4. Whilst we have gone on to develop our own dataset (CompLex 2.0), we also hope to see future work developing new CWI datasets following the principles that we have laid out. Future datasets could focus on multilinguality, multi-word expressions, further genres, or simply extending our analysis to further tokens and contexts. Certainly, we do not see the production of CompLex 2.0 as an end point in LCP research, but rather a starting point for other researchers to build from. This is why we have included our protocol in detail—in order to ensure the replicability of our work in future research.

In moving from binary annotations to Likert-scale annotations, we have provided a new dataset which gives continuous annotations based on a more objective measure of complexity. The binary setting could also be improved if more objective guidelines were provided to the annotators (e.g., instructions such as "identify words that are appropriate for an adult" or "identify words that are specific to a domain", as opposed to "identify words that you find difficult"). In our comparison, we are comparing a subjective binary dataset to a (more) objective continuous dataset (of course, our dataset still relies on some degree of annotator interpretation of the Likert-scale labels). We do not have the ability to compare an objective binary dataset to our data, as to the best of the authors' knowledge no such dataset exists; however, doing so would likely yield further interesting insights into the differences between continuous and binary lexical complexity.

We implemented our specification for a new LCP dataset, following the recommendations established in Sect. 3. This led to the creation of CompLex 2.0. In Sect. 5.5 we have explicitly compared our dataset to the recommendations we made in Table 6, and we would encourage the creators of future LCP datasets to do the same. This will ensure that datasets can be easily evaluated and compared at a feature level. The CompLex 2.0 dataset is available via GitHub (see Footnote 5). We have made this data available under a CC-BY licence, facilitating its reuse and reproducibility outside of our work.

Our new LCP dataset is the first to provide continuous complexity annotations for words in context. The role of context in lexical complexity has not been widely studied, and we hope that this dataset will go some way towards allowing researchers to work on this topic. Indeed, the evidence from our annotations shows that for a single word type occurring in multiple contexts, the complexity annotation does vary across those contexts. Further work is needed to show that this variation is an effect of the contextual occurrence or difference in sense, and not due to the stochastic nature of annotations resulting from crowdsourcing.

Although we gave annotators in our task a 5-point scale ranging from Very Easy to Very Difficult, we chose to aggregate the annotations by taking the mean for each instance. This makes the fundamental assumption that the distance in continuous complexity space between each point on the Likert scale is constant. Obviously, there is no guarantee that this assumption holds. The danger is that annotations may be falsely biased towards one end of the scale. For instance, if the distance between Very Easy and Easy is shorter than the distance between Easy and Neutral, then treating these as the same distance will falsely inflate complexity ratings. Another strategy would have been to take the median or mode of the complexity annotations to give a final value. The disadvantage of that approach would be that every instance would receive an ordinal categorical label instead of the continuous label we have advocated for. This would be a different problem to the one we have explored, and is left to future research.
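
The difference between the aggregation strategies is easy to see on a toy set of mapped annotations:

```python
import statistics

scores = [0.0, 0.25, 0.25, 0.5, 0.75]  # a toy set of mapped annotations

print(statistics.mean(scores))    # 0.35 -> continuous label (our choice)
print(statistics.median(scores))  # 0.25 -> snaps back to a Likert point
print(statistics.mode(scores))    # 0.25 -> ordinal/categorical label
```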

We used categorical complexity to provide a feature analysis of our dataset and the prior CWI datasets in Sect. 6. We observed that a number of features were identified as useful for the prediction task, indicating that complexity is a matter of many factors and no single factor can be used to determine a word's complexity. Interestingly, in Table 13 we showed that in the categorical setting the classifiers trained on the CWI–2018 and CWI–2016 datasets both outperformed the classifier trained on the CompLex 2.0 dataset. We are not trying to use the dataset here to demonstrate superior performance, but rather to present a comparative analysis of the features that are useful for complexity prediction. This may indicate that systems wishing to return a categorical label (such as those used in Sect. 6) could use probabilistic or categorical data for training and get better results than when using our data. Our continuous labels allow us to perform further interesting analyses into the nature of complex words, as presented in Sect. 7.

We were able to use our data to show that complexity can be predicted across genres. This is encouraging as our dataset contains three diverse genres, and we can expect that the complexity annotations we have identified will generalise well to other genres. A model trained on all three genres will learn features of complexity that are common to all genres, rather than to any one specific genre. We also demonstrated that our instances vary in subjectivity of complexity, with those rated as more complex typically being more subjective. Identifying the factors that make a word subjectively complex would be an interesting line of study, but is left for future research.

The ambiguity of a word is likely to play a role in its complexity. Words which are often mistaken for others are more likely to be confused and hence are likely to be rated as more difficult to understand by a reader. Conversely, however, there is a well-documented direct correlation between polysemy and frequency (i.e., infrequent words are typically monosemes, whereas frequent words have many senses; see the WordNet entries for 'run', 'bat', 'cat', etc.). It may be hypothesised that ambiguity and frequency need to be jointly taken into account when investigating lexical complexity, with a likely ordering from least to most complex being: (high-frequency, monosemous), (high-frequency, polysemous), (low-frequency, monosemous), (low-frequency, polysemous). Prior efforts have been undertaken to create sense-annotated complexity datasets (Strohmaier et al., 2020), and building upon our research with sense annotations, following the specification given in Sect. 4.2, is likely to lead to fruitful research outcomes.
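
As a quick illustration of the frequency and polysemy relationship, the sketch below uses NLTK's WordNet interface and the wordfreq package; both are external resources used here only for illustration, not part of our pipeline:

```python
# Requires: nltk (with nltk.download("wordnet")) and wordfreq.
from nltk.corpus import wordnet as wn
from wordfreq import zipf_frequency

for word in ["run", "bat", "cat", "cubit"]:
    senses = len(wn.synsets(word))      # WordNet sense count as a polysemy proxy
    freq = zipf_frequency(word, "en")   # corpus frequency on the Zipf scale
    print(f"{word}: {senses} WordNet senses, Zipf frequency {freq:.2f}")
```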

9 Conclusion

We have demonstrated that previous datasets are insufficient for the task of Complex Word Identification. In fact, the very definition of the task, identifying complex words in a subjective binary setting rather than on an objective continuous scale, is at fault. We have advocated for a generalisation of this task to Lexical Complexity Prediction and we have provided recommendations for datasets approaching this task. Further to this, we have provided a new dataset, CompLex 2.0, which is the first publicly available dataset to provide continuous complexity annotations for words in context. We release the data in full to allow future researchers to join us in this exciting task of Lexical Complexity Prediction.