1 Introduction

Lexical simplification (LS) improves text readability by replacing complex words with simpler alternatives. Complex words are words that a target population finds difficult to read or understand. Various user groups benefit from LS. Previous LS systems have been designed for children (Kajiwara et al., 2013), second language learners (Lee & Yeung, 2018b), individuals with reading disabilities (Rello et al., 2013; Devlin & Tait, 1998; Carroll et al., 1998) or low literacy (Watanabe et al., 2009; Gasperin et al., 2009), and sign language speakers (Alonzo et al., 2022a, b). LS provides a degree of personalization that is unattainable through approaches that focus on sentence rather than word-level simplification (Yeung & Lee, 2018; North & Zampieri, 2023).

Fig. 1 LS Pipeline. SG, SS, and SR are the main sub-tasks of LS discussed throughout this survey. Figure adapted from Paetzold & Specia (2015)

The introduction of deep learning, and more recently, LLMs and prompt engineering, has significantly changed the way we approach many NLP tasks, including LS. Previous LS systems relied upon statistical, n-gram, lexical, rule-based, and word embedding models to identify complex words and then replace them with simpler alternatives (Paetzold & Specia, 2017b). These approaches would identify a complex word, for example, “rogue”, as being in need of simplification, and would suggest “thief” as a suitable alternative (Fig. 1), hereafter referred to as a candidate substitution. Transformer-based models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), and the latest generation of LLMs with billions of parameters, such as GPT-3 and GPT-4 (Brown et al., 2020), Llama 2 and Llama 3 (Touvron et al., 2023), Mistral (Jiang et al., 2023), and others, automatically generate, select, and rank candidate substitutions. Furthermore, recent shared-task results (Shardlow et al., 2024) have confirmed that LLMs deliver performance superior to traditional approaches, resulting in an important paradigm shift within LS that motivates the present survey. This corroborates the findings of multiple recent NLP studies showing that the latest generation of LLMs delivers state-of-the-art performance across various tasks (Minaee et al., 2024).

With the introduction of the aforementioned new deep learning models and LLMs, the field of NLP has seen the arrival of new interested parties: companies, academics, and individuals who may be unfamiliar with prior LS research (Shardlow et al., 2021; Saggion et al., 2022). For this reason, we believe that this is the perfect moment to provide a survey of recent deep learning approaches to LS with the goal of bridging the gap between established researchers and those new to the field. To the best of our knowledge, this is the first survey on deep learning approaches for LS. The paper by Paetzold & Specia (2017b) is the most recent survey on LS, but it was published before the studies that demonstrate the headway made by state-of-the-art deep learning models. A broader survey on text simplification (TS) (Al-Thanyyan & Azmi, 2021), published a few years later, also does not cover recent advances in the field, nor does it focus specifically on LS. Our paper, therefore, fills an important gap in the LS literature by providing the community with the first survey on deep learning approaches to LS and its sub-tasks of substitute generation (SG), substitute selection (SS), and substitute ranking (SR).

2 Pipeline

We organize this survey around the main components of the LS pipeline, with SS and SR described together due to the similarity of their deep learning approaches (Section 3). We discuss recent works collected from prominent LS workshops, shared-tasks, conference proceedings, and journals published in three major repositories, namely the ACL Anthology, the ACM Digital Library, and IEEE Xplore. In addition, we provide an overview of recent datasets (Section 4), and detail open challenges and unanswered research questions in LS (Section 5.1). Normally, an LS pipeline starts with lexical complexity prediction (LCP), also known as complex word identification (CWI). However, since LCP is often treated as a standalone task, we refer the reader to North et al. (2022b) for a detailed survey of LCP methods.

Substitute Generation (SG)

The goal of SG is to produce a number k of candidate substitutions that are viable replacements for a complex word. Typically, an LS system outputs candidate substitutions with k = 1, 3, 5, or 10, with top-k referring to the most suitable candidates. These candidate substitutions need to be easier to understand or read than the complex word. Candidate substitutions likewise need to preserve the complex word’s meaning, especially in its provided context (Table 1). For example, given the sentence “They caught the rogue that stole the gold.” and the target word “rogue”, substitute generation would produce k candidate substitutions, such as “villain”, “mugger”, “thief”, and so on.

Table 1 Example of candidate substitution generated via SG

Multiple approaches have been used to generate viable candidate substitutions for a given complex word, ranging from the use of pre-existing lexicons and masked language modeling to the use of recent LLMs and prompt learning. These approaches are described in more detail in Section 3.1. However, SG is not without its limitations. SG systems have been found to produce candidate substitutions that are unsuitable for a target word’s context, and SG systems reliant on multilingual models trained on datasets spanning multiple languages often produce candidate substitutions that are not in the desired target language of the complex word. Moreover, in some instances, candidate substitutions are produced that are more difficult to understand than the original complex word. In these instances, additional filtering is conducted via substitute selection or ranking, as described in the following sections. For example, an LCP regressor may be trained to predict the lexical complexity of generated candidates on a scale from 0 (easy) through 0.5 (neutral) to 1 (difficult); we refer the reader again to North et al. (2022b) for a detailed survey on LCP methods. The approaches described in Section 3.1 attempt to overcome these issues in numerous ways.

Substitute Selection (SS)

The aim of SS is to remove generated top-k candidate substitutions that are not suitable. At this stage, candidate substitutions that are not synonymous or that are more complex than the original complex word are removed (Table 2). For instance, SS would remove the generated candidates “mugger”, “burglar”, and “poacher”, since they are either more complex, semantically dissimilar, or do not fit into the provided context: “They caught the rogue that stole the gold.”

Table 2 Example of candidate substitutes removed during SS

Common approaches to SS include comparing cosine similarities between word embeddings, training independent models for candidate selection, or using prompt learning. These approaches are described in Section 3.2. The main challenge of SS is to avoid removing correct simplifications. A valid simplification may, at times, have a low similarity with the original complex word, i.e., it may be less synonymous with the original complex word than other alternatives. However, the same simplification may better fit the original context and therefore be a superior simplification. The approaches outlined in Section 3.2 have used various methods to minimize the likelihood of correct simplifications being removed from the pool of candidate substitutions.

Substitute Ranking (SR)

The purpose of SR is to rank the remaining top-k candidate substitutions from the most to the least suitable simplification. The original complex word is then substituted with the most viable candidate substitution. The example shown in Table 3 ranks “thief” as a more appropriate simplification than “criminal”, “villain”, or “bandit” for the target word “rogue”. This may, in part, be due to “thief” having a higher frequency within a reference corpus or being more frequent within a training set. Alternatively, “thief” may have a lower age of acquisition, a higher familiarity score, or a higher concreteness (as opposed to abstractness) rating.

Table 3 Example of candidate substitutions ranked via SR

Approaches used for SS are also frequently employed for SR. Candidate substitutions have been ranked by cosine similarity between embeddings, by BERTScore, or through prompt learning. These approaches are discussed in Section 3.2 in conjunction with SS approaches. However, ranking candidate substitutions is not an easy task. Perceptions of word complexity differ from individual to individual, and therefore the target demographic needs to be taken into consideration when deciding which candidate substitution should replace the original complex word. Unique approaches have been created in this endeavor (outlined in Section 3.2).

3 Deep Learning Approaches

We start our survey of the LS pipeline at the SG phase (Section 3.1). We then move on to SS and SR (Section 3.2). Within these sections, we provide an overview of deep learning approaches for LS and make reference to common evaluation metrics used to assess all LS sub-tasks: precision, recall, F1-score, accuracy (ACC), potential, and mean average precision (MAP) at top-k, as defined below.

Potential@k

The proportion of instances for which at least one of the top-k predicted candidate substitutions is found within the gold labels, as shown in equation (1).

$$\begin{aligned} Potential = \frac{m}{n} \end{aligned}$$
(1)

where m is the number of instances for which at least one of the top-k predicted candidate substitutions is found within the gold labels, and n is the total number of instances taken into consideration.

MAP@k

The ratio of returned top-k candidate substitutions that are equal to the gold labels and have the same positional rank. It is calculated using the following equation (2):

$$\begin{aligned} MAP = \frac{1}{n}\sum _{i=1}^{n}AP_i \end{aligned}$$
(2)

where AP_i is the average precision for instance i, and n is the number of instances taken into consideration.
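To make these definitions concrete, the sketch below shows one common way of computing Potential@k and MAP@k over a list of instances in Python; the exact normalization of average precision varies slightly between shared-task implementations, so the function names and this particular formulation are illustrative rather than the official evaluation scripts.

```python
def potential_at_k(predictions: list[list[str]], gold: list[list[str]], k: int) -> float:
    """Share of instances where at least one top-k candidate appears in the gold labels."""
    hits = sum(
        1 for preds, labels in zip(predictions, gold)
        if any(cand in labels for cand in preds[:k])
    )
    return hits / len(gold)


def map_at_k(predictions: list[list[str]], gold: list[list[str]], k: int) -> float:
    """Mean of per-instance average precision over the top-k ranked candidates."""
    ap_scores = []
    for preds, labels in zip(predictions, gold):
        hits, precision_sum = 0, 0.0
        for rank, cand in enumerate(preds[:k], start=1):
            if cand in labels:
                hits += 1
                precision_sum += hits / rank
        ap_scores.append(precision_sum / min(len(labels), k) if labels else 0.0)
    return sum(ap_scores) / len(ap_scores)


# Single instance: predicted candidates for "rogue" vs. gold labels
preds = [["thief", "villain", "mugger"]]
gold = [["thief", "criminal"]]
print(potential_at_k(preds, gold, k=3))  # 1.0 (at least one hit)
print(map_at_k(preds, gold, k=3))        # 0.5 (only the first-ranked candidate matches)
```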

3.1 Substitute Generation

Word embedding models were frequently used for SG. Word embedding models, such as Word2Vec (Mikolov et al., 2013), were combined with more traditional approaches, such as querying a lexicon or generating candidate substitutions based on certain rules (Paetzold & Specia, 2017b). Word embedding models performed SG by converting potential candidate substitutions into vectors (word embeddings), selecting those with the highest cosine similarity (or lowest cosine distance) to the vector representing the target complex word, and mapping the selected vectors back to their word forms to constitute the top-k candidate substitutions.
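A minimal sketch of this embedding-based SG step, assuming pre-trained Word2Vec-style vectors loaded with gensim (the file name is an illustrative placeholder):

```python
from gensim.models import KeyedVectors

# Illustrative file name; any pre-trained Word2Vec-style vectors can be used.
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def generate_candidates(complex_word: str, k: int = 10) -> list[str]:
    """Return the k nearest neighbours of the complex word by cosine similarity."""
    if complex_word not in vectors:
        return []
    return [word for word, _score in vectors.most_similar(complex_word, topn=k)]

print(generate_candidates("rogue", k=5))
```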

Word Embeddings + Transformers

Word embedding models continued to be used for SG after 2017. However, they were now used alongside word embeddings produced by transformers or alongside a model’s prediction scores. Alarcón et al. (2021a) utilized BERT and various word embedding models, such as Sense2Vec (Trask et al., 2015), Word2Vec, and FastText (Bojanowski et al., 2017), for generating Spanish candidate substitutions. It was found that a more traditional approach that generated candidate substitutions by querying a pre-existing lexicon outperformed these word embedding models. Their traditional approach achieved a potential of 0.898, a recall of 0.597, and a precision of 0.043 on the EASIER dataset (Alarcón et al., 2021b). In contrast, the highest-performing embedding model, Sense2Vec, scored lower, with a potential, recall, and precision of 0.506, 0.282, and 0.056, respectively. Interestingly, this went against the assumption that word embedding models would achieve higher performance given their state-of-the-art reputation (Paetzold & Specia, 2017a). During error analysis, it was found that word embedding models often proposed antonyms of the complex word as potential replacements, thereby hindering LS performance (Alarcón et al., 2021a).

Seneviratne et al. (2022) used a word embedding model and a pre-trained transformer, XLNet (Yang et al., 2019), to provide an embedding similarity score and a prediction score for SG. They took inspiration from a similar approach by Arefyev et al. (2020), who utilized context2vec (Melamud et al., 2016) and ELMo (Peters et al., 2018) to encode the context of the target complex word and obtain a probability distribution over the words fitting that particular context. They utilized this probability distribution to gauge the likelihood or appropriateness of a potential candidate substitution being an effective replacement for the target complex word. This score was combined with a prediction score from either BERT, RoBERTa, or XLNet to generate a final list of top-k candidate substitutions. The combined approach of employing a word embedding model alongside a prediction score was found to underperform compared to utilizing a single pre-trained transformer (Seneviratne et al., 2022; Arefyev et al., 2020). For instance, Seneviratne et al. (2022) reported inferior performance compared to North et al. (2022a) on the TSAR-2022 dataset.

Masked Language Modeling

The arrival of pre-trained transformers also saw the introduction of masked language modeling (MLM) for SG. In MLM, words in a sentence are masked or hidden, and the model is tasked with predicting the masked tokens based on the context provided by the rest of the sentence. MLM is therefore well suited for SG. Przybyła & Shardlow (2020) used BERT-based models trained with an MLM objective for multi-word LS, whereas Qiang et al. (2020) were among the first to utilize MLM for SG. MLM has emerged as a prevalent approach in SG, with 7 out of the 11 systems submitted to TSAR-2022 incorporating an MLM objective (Saggion et al., 2022).

Qiang et al. (2020) created LSBert, a pre-trained BERT-based model for LS. Extracts, in the form of sentences, were taken from the LS datasets LexMTurk (Horn et al., 2014), BenchLS (Paetzold & Specia, 2016b), and NNSeval (Paetzold & Specia, 2016c). Two versions of each sentence were then concatenated, separated by the [SEP] special token, and fed into the model. The first sentence mirrored the one extracted from the datasets, while the second had its complex word replaced with the [MASK] special token. LSBert then predicted the word hidden by the [MASK] token by analyzing its context, including both the preceding and succeeding text of the target word, alongside the original sentence. In this way, LSBert outputted candidate substitutions with the highest probability (highest prediction score) of fitting into the surrounding context that were also similar to the target complex word in the original sentence. For the top-k=1 candidate substitution, LSBert achieved F1-scores for SG of 0.259, 0.272, and 0.218 on LexMTurk, BenchLS, and NNSeval, respectively. These performances surpassed those of all prior approaches (Paetzold & Specia, 2017b). The increase in performance is attributable to the MLM approach’s ability to take into consideration context before and after the target word and to the use of an overall larger BERT-based model. The previous highest F1-scores were achieved by a word-embedding model that lacked the same contextual understanding, producing F1-scores of 0.195, 0.236, and 0.218 on the same datasets, respectively (Paetzold & Specia, 2017a).
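The sentence-pair masking strategy described above can be sketched with the Hugging Face transformers library as follows; the model name, k, and the lack of any post-processing are assumptions for illustration and do not reproduce the exact LSBert configuration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative model; LSBert uses its own BERT configuration and post-processing.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def mlm_candidates(sentence: str, complex_word: str, k: int = 10) -> list[str]:
    """Generate top-k substitutes by masking the complex word in a copy of the sentence."""
    masked = sentence.replace(complex_word, tokenizer.mask_token, 1)
    # Original sentence and masked copy form a sentence pair separated by [SEP].
    inputs = tokenizer(sentence, masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    top_ids = logits[0, mask_pos[0]].topk(k).indices
    return tokenizer.convert_ids_to_tokens([int(i) for i in top_ids])

print(mlm_candidates("They caught the rogue that stole the gold.", "rogue"))
```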

Prior to the release of the TSAR-2022 shared-task (Saggion et al., 2022), Ferres & Saggion (2022) created a new dataset, ALEXSIS (TSAR-2022 ES), which would later become (together with additional English and Portuguese datasets) the TSAR-2022 dataset (Saggion et al., 2022). Using their new dataset, they experimented with a number of monolingual and multilingual transformers. Ferres & Saggion (2022) adopted the MLM approach used by LSBert. They used the Spanish pre-trained models BETO (Cañete et al., 2020), BERTIN (De la Rosa & Fernández, 2022), RoBERTa-base-BNE, and RoBERTa-large-BNE (Fandiño et al., 2022) for SG. They found that their largest pre-trained Spanish model, RoBERTa-large-BNE, attained the best SG performance after also omitting candidate substitutions that were identical to the complex word (regardless of capitalization or accentuation) or shorter than two characters.

North et al. (2022a) were motivated by the success of the monolingual models shown by Ferres & Saggion (2022). They likewise tested a range of pre-trained transformers for SG with an MLM objective, including the multilingual models mBERT and XLM-R (Conneau et al., 2020), and several monolingual models, including Electra for English (Clark et al., 2020), RoBERTa-large-BNE for Spanish, and BERTimbau (Souza et al., 2020) for Portuguese. Their monolingual models achieved ACC@1 scores of 0.517, 0.353, and 0.481 on the English, Spanish, and Portuguese TSAR-2022 datasets, respectively. Whistely et al. (2022) likewise used similar monolingual models for SG. They experimented with BERT for English, BETO for Spanish, and BERTimbau for Portuguese. Surprisingly, their models’ performances were lower than those of North et al. (2022a), despite their Portuguese LS system consisting of the same pre-trained model. Whistely et al. (2022) produced ACC@1 scores of 0.378, 0.250, and 0.3074 for English, Spanish, and Portuguese, respectively. This is likely due to the additional selection and ranking steps implemented by Whistely et al. (2022), which were absent from the LS system of North et al. (2022a) (Section 3.2).

Wilkens et al. (2022) likewise experimented with a range of monolingual transformers for SG. They employed an ensemble of BERT-like models with three distinct masking strategies: 1) copy, 2) query expansion, and 3) paraphrase. In the copy strategy, similar to LSBert’s approach (Qiang et al., 2020), two sentences were fed into a pre-trained model, concatenated with the [SEP] special token; the first sentence remained unchanged, while the second had its complex word replaced with the [MASK] token. For the query expansion strategy, FastText was utilized to generate the five related words with the highest cosine similarity to the target complex word. In iteration 2a) of this strategy, the first sentence remained unaltered, the second substituted the complex word with one of the similar words recommended by FastText, and the third sentence was the masked version. Iteration 2b) replicated 2a), but the second sentence now comprised all five suggested words. Lastly, the paraphrase strategy generated 10 new contexts for each complex word, consisting of paraphrases of the original sentence, each with a maximum of 512 tokens. These ensembles encompassed BERT and RoBERTa for English, several BETO-based models for Spanish, and several BERTimbau-based models for Portuguese. The paraphrase strategy showed the worst performance, with a joint MAP/Potential@1 score of 0.217, whereas the query expansion strategy obtained MAP/Potential@1 scores of 0.528, 0.477, and 0.476 for English, Spanish, and Portuguese, respectively. This outperformed the paraphrase strategy and the original copy strategy used by LSBert, regardless of model.

Prompt Learning

Prompt learning is currently the best performing approach for SG as a result of the utilization of larger and more recent LLMs (Table 4). Prompt learning involves inputting into an LLM a string crafted to describe the task and elicit the desired output. Prompt learning, otherwise referred to as prompt engineering, also entails the optimization of these prompts to achieve the best SG performance. This is done either by trial and error or by employing prompting strategies such as chain-of-thought prompting. LLMs may also be fine-tuned for SG by being provided with a dataset that includes example prompts, instructions, and their corresponding outputs. However, little research has been conducted on LLM fine-tuning for SG within the LS research community, which has preferred zero-shot experimentation. Zero-shot refers to an LLM not being exposed to any training material or example instances, with 1-shot, 2-shot, and so on referring to the number of example instances shown to the LLM.

PromptLS (Vásquez-Rodríguez et al., 2022) is one of the only examples of prompt learning and LLM fine-tuning applied to SG. PromptLS contains a variety of pre-trained models fine-tuned on several LS datasets. These fine-tuned models were fed four types of prompts: a) “a easier word for rogue is”, b) “a simple word for rogue is”, c) “a easier synonym for rogue is”, and d) “a simple synonym for rogue is”. These prompts were fed into a RoBERTa model fine-tuned on all of the English data extracted from the NNSeval (Paetzold & Specia, 2016c), LexMTurk (Horn et al., 2014), CEFR-LS (Uchida et al., 2018), and BenchLS (Paetzold & Specia, 2016b) datasets. They were also translated and inputted into BERTIN, fine-tuned on the Spanish data obtained from EASIER, and BR-BERTo, fine-tuned on all of the Portuguese data taken from SIMPLEX-PB (Hartmann & Aluísio, 2020). Vásquez-Rodríguez et al. (2022) likewise experimented with these prompts in a zero-shot condition. It was discovered that the fine-tuned models outperformed the zero-shot models in all conditions, with an average increase in performance of between 0.3 and 0.4 across all metrics: ACC@1, ACC@3, MAP@3, and Precision@3. The prompt combinations that produced the best candidate substitutions were “easier word” for English, “palabra simple” and “palabra fácil” for Spanish, and “palavra simples” and “sinônimo simples” for Portuguese.
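For illustration, the zero-shot variant of these prompt templates can be reproduced with a fill-mask pipeline; the model name and top_k value are assumptions and this does not replicate the fine-tuned PromptLS models:

```python
from transformers import pipeline

# Illustrative model; PromptLS fine-tunes language-specific models on LS data.
fill_mask = pipeline("fill-mask", model="roberta-base")

templates = [
    "a easier word for {word} is <mask>",
    "a simple word for {word} is <mask>",
    "a easier synonym for {word} is <mask>",
    "a simple synonym for {word} is <mask>",
]

for template in templates:
    prompt = template.format(word="rogue")
    candidates = [pred["token_str"].strip() for pred in fill_mask(prompt, top_k=3)]
    print(prompt, "->", candidates)
```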

Table 4 Top approaches for substitute generation (SG), selection and ranking (SS & SR) on TSAR-2022 datasets. Pot. stands for potential

Prompt learning has also been applied to more recent LLMs for SG, namely GPT-3 models. Aumiller & Gertz (2022) experimented with a variety of prompts, which they inputted into GPT-3. These prompts included: 1) zero-shot with context, 2) single-shot with context, 3) two-shot with context, 4) zero-shot without context, and 5) single-shot without context. In this instance, the size of each shot, n, refers to how many times a prompt is inputted into GPT-3. For instance, the shots with context would input a given sentence and then ask the question, “Given the above context, list ten alternative words for <complex word> that are easier to understand.”, n times. Those without context would instead input, n times, the following: “Give me ten simplified synonyms for the following word: <complex word>”. Aumiller & Gertz (2022) also combined all types of prompts in an ensemble, generating candidate substitutions from each prompt type and then deciding upon the final candidate substitutions through plurality voting and additional selection and ranking steps (Section 3.2). Their ensemble approach outperformed all other prompt types and SG models submitted to TSAR-2022 (Saggion et al., 2022). Their performance is a result of GPT-3 being substantially larger than all other models submitted to the shared-task.
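A sketch of the with-context and without-context templates and the plurality vote over their outputs is shown below; call_llm is a hypothetical helper standing in for the GPT-3 API call and returning a list of candidate words, and is not part of the original system:

```python
from collections import Counter

# `call_llm` is a hypothetical helper returning a list of candidate words from
# whichever LLM API is used; prompt wording mirrors the templates described above.
def with_context_prompt(complex_word: str, context: str) -> str:
    return (f"{context}\nGiven the above context, list ten alternative words for "
            f"{complex_word} that are easier to understand.")

def without_context_prompt(complex_word: str) -> str:
    return f"Give me ten simplified synonyms for the following word: {complex_word}"

def ensemble_candidates(complex_word: str, context: str, call_llm) -> list[str]:
    """Plurality vote over the candidates returned by each prompt type."""
    votes = Counter()
    for prompt in (with_context_prompt(complex_word, context),
                   without_context_prompt(complex_word)):
        for candidate in call_llm(prompt):
            votes[candidate.lower()] += 1
    return [word for word, _count in votes.most_common(10)]
```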

3.2 Substitute Selection and Ranking

Traditional SS approaches are still applied after SG. Methods such as POS-tag and antonym filtering, as well as semantic thresholds, have been used to omit inappropriate candidate substitutions generated by the above deep learning approaches (Saggion et al., 2022). However, most modern deep learning approaches perform minimal SS, with SS often being conducted at the same time as generation or ranking. For example, the metric used to generate the top-k candidate substitutions, such as similarity between word embeddings or a model’s prediction score, tends not to suggest candidate substitutions that are deemed inappropriate by other SS methods. Furthermore, SR techniques that order candidate substitutions by their appropriateness will, in turn, move unsuitable simplifications further down the list of top-k candidate substitutions to the point that they are no longer considered. For this reason, we have combined SS and SR into one section and describe new deep learning approaches below.

Word Embeddings

Word embedding models continue to play a role in SS. For instance, Song et al. (2020) developed a novel LS system that filtered candidate substitutions based on a semantic similarity threshold. They selected only those candidate substitutions sharing the same POS tag as the target complex word, assessed contextual relevance (a measure of the reasonableness and fluency of a sentence after replacing the complex word), and ranked candidate substitutions by applying cosine similarity between word embeddings. They produced word embeddings with Word2Vec and evaluated their model’s performance on the LS-2007 dataset (McCarthy & Navigli, 2007). It was discovered that the use of Word2Vec enhanced their model’s performance, achieving an ACC@1 of 0.269 compared to a previous score of 0.218.
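A minimal sketch of this style of POS-tag and similarity-threshold filtering, assuming spaCy for tagging and gensim word vectors (the threshold value and file names are illustrative assumptions):

```python
import spacy
from gensim.models import KeyedVectors

# Illustrative resources; the threshold value and file names are assumptions.
nlp = spacy.load("en_core_web_sm")
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def select_candidates(complex_word: str, candidates: list[str],
                      threshold: float = 0.4) -> list[str]:
    """Keep candidates that share the complex word's POS tag and clear a similarity threshold."""
    target_pos = nlp(complex_word)[0].pos_
    kept = []
    for cand in candidates:
        if nlp(cand)[0].pos_ != target_pos:
            continue  # different part of speech
        if (complex_word in vectors and cand in vectors
                and vectors.similarity(complex_word, cand) < threshold):
            continue  # below the semantic-similarity threshold
        kept.append(cand)
    return kept

print(select_candidates("rogue", ["thief", "villain", "honest", "stealing"]))
```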

Neural Regression

Maddela & Xu (2018) introduced the Neural Readability Ranker (NRR) to arrange candidate substitutions based on their complexity. NRR employs regression, trained on the Word Complexity Lexicon (WCL), alongside various features and character n-grams transformed into Gaussian vectors. It assigns a value between 0 and 1 representing the complexity of any given word. Through pairwise aggregation, the model predicts values indicating the relative complexity between pairs of candidate substitutions. A positive value suggests that the first candidate substitution is more complex than the second, while a negative value indicates the opposite. This process is repeated for all combinations of candidate substitutions for a complex word. Subsequently, each candidate substitution is ranked based on its comparative complexity with the others. Applying their NRR model to the LS-2012 dataset, Maddela & Xu (2018) surpassed previous word embedding techniques for SR, achieving a Prec@1 of 0.673 compared to 0.656. Their approach therefore benefited from regression fine-tuning on a new human-annotated dataset.
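The pairwise aggregation step can be sketched as follows; pairwise_model is a hypothetical stand-in for the trained NRR regressor, returning a positive value when its first argument is judged more complex than its second:

```python
from itertools import combinations

# `pairwise_model(a, b)` is a hypothetical stand-in for the trained regressor,
# returning a positive value when candidate `a` is more complex than `b`.
def rank_by_pairwise_complexity(candidates: list[str], pairwise_model) -> list[str]:
    wins = {cand: 0 for cand in candidates}
    for a, b in combinations(candidates, 2):
        # the candidate judged simpler "wins" the comparison
        simpler = b if pairwise_model(a, b) > 0 else a
        wins[simpler] += 1
    # most wins = judged simplest most often = ranked first
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

# Toy pairwise model for illustration: longer words are treated as more complex.
ranked = rank_by_pairwise_complexity(
    ["thief", "criminal", "villain", "bandit"],
    pairwise_model=lambda a, b: len(a) - len(b),
)
print(ranked)  # ['thief', 'bandit', 'villain', 'criminal']
```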

Word Embeddings + Transformers

A popular approach to SS and SR entails the use of word embeddings and transformers. Seneviratne et al. (2022) filtered and ranked candidate substitutions using the same combined score that they used for SG. Their filter consisted of their MLM model’s prediction score for the generated candidate together with the inner product of the target word’s embedding and the embedding of the potential candidate substitution. The returned candidate substitutions were then subject to one of three additional ranking metrics. The first ranking metric arranged candidate substitutions based on the cosine similarity between the original sentence and a modified version in which the candidate substitution replaced the complex word. The second and third ranking metrics utilized dictionary definitions of the target complex word and its candidate substitutions. They computed the cosine similarity between each embedding of the definitions and the embedding of the sentence containing the target complex word. Those with the highest cosine similarities between either a) the definition of the target complex word and the definition of the candidate substitution, or b) the definition of the target complex word and the word embedding of the original sentence with the candidate substitution replacing its complex word, determined the rank of each candidate substitution. Their analysis revealed similar performances across all three metrics on the TSAR-2022 dataset, with the three metrics achieving ACC@1 scores of 0.375, 0.380, and 0.386.

Li et al. (2022) created what they refer to as an equivalence score for selection and ranking. The equivalence score determines the semantic similarity between a candidate substitution and the complex word in a way that is more expressive than cosine similarity between word embeddings. To calculate the equivalence score, they used a RoBERTa-based model trained for natural language inference (NLI), which outputs the likelihood of one sentence entailing another. The equivalence score is the product of the entailment likelihoods computed in both directions, i.e., between the sentence containing the candidate substitution and the original sentence, and vice versa. Li et al. (2022), employing an SG method akin to LSBert’s but with a transition to RoBERTa, attributed their system’s improved performance primarily to its distinctive SR. They achieved an ACC@1 of 0.659, surpassing LSBert’s ACC@1 of 0.598 on the TSAR-2022 dataset.
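A sketch of such a bidirectional-entailment equivalence score using an off-the-shelf NLI model is given below; the model choice and label handling are assumptions and not necessarily the exact setup of Li et al. (2022):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative NLI model; label handling follows roberta-large-mnli conventions.
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
ENTAILMENT = model.config.label2id.get("ENTAILMENT", 2)

def entailment_prob(premise: str, hypothesis: str) -> float:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    return probs[0, ENTAILMENT].item()

def equivalence_score(sentence: str, complex_word: str, candidate: str) -> float:
    """Product of entailment probabilities in both directions between the two sentences."""
    simplified = sentence.replace(complex_word, candidate, 1)
    return entailment_prob(sentence, simplified) * entailment_prob(simplified, sentence)

sentence = "They caught the rogue that stole the gold."
print(equivalence_score(sentence, "rogue", "thief"))
```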

Aleksandrova & Brochu Dufour (2022) employed three metrics to rank candidate substitutions: a) grammaticality, b) meaning preservation, and c) simplicity. Grammaticality was assessed by checking whether the candidate substitution had the same POS tag concerning person, number, mood, tense, etc. If the candidate substitution matched in all POS-tag categories, it was assigned a value of 1; otherwise, it received a value of 0. Meaning preservation was determined by utilizing BERTScore to compute cosine similarities between the embeddings of the original sentence and those of an altered sentence where the target complex word was replaced with the candidate substitution. Lastly, simplicity was gauged using a CEFR vocabulary classifier trained on data from the English Vocabulary Profile (EVP). This classifier was trained by first masking the data and inputting it into a pre-trained BERT-based model. The resulting encodings were then used to train an SVM model, yielding the CEFR classifier. Despite their efforts, their model failed to outperform the baseline LSBert model at TSAR-2022.

LS systems have also relied solely on MLM prediction scores for SS and SR. North et al. (2022a) and Vásquez-Rodríguez et al. (2022) used this approach. They have no extra SR steps and sort their candidate substitutions using their generated MLM prediction scores. That being said, they do apply some basic filtering, with both studies omitting duplicates and candidate substitutions that were the same as the complex word. Interestingly, this minimal SR outperforms other more technical approaches (Table 4). North et al. (2022a) attained state-of-the-art performance on the TSAR-2022 Portuguese dataset, whereas Vásquez-Rodríguez et al. (2022) consistently produced high performances across the TSAR-2022 datasets.

Prompt Learning

Only North et al. (2023) have experimented with prompt learning for SS and SR. They created a unique selection and ranking pipeline (shown in Fig. 2) that removed candidate substitutions with a low cosine similarity between their own BERT embedding and that of the complex word. The remaining candidates were then filtered by GPT-3.5 after being fed a prompt with one of the following adjectives, a) simplest, b) best, or c) most similar: “What word is the [adjective] replacement for complex word in this list?”. GPT-3.5 was then presented with a final prompt that selected the best simplification by assessing each candidate’s suitability in the original context: “Given the above context, what is the best replacement for complex word in this list?”. This unique filter increased overall performance from an ACC@1 of 0.484 to 0.495 on the Portuguese TSAR-2022 dataset, demonstrating the advantages of incorporating prompt learning within an SS and SR pipeline.
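This two-step prompt filter can be sketched as follows; ask_llm is a hypothetical helper wrapping the chat-model call and returning a single word, and the prompt wording follows the description above:

```python
# `ask_llm` is a hypothetical helper that sends a prompt to the chat model
# (e.g. GPT-3.5) and returns a single word; it is not part of the original system.
def prompt_filter(context: str, complex_word: str, candidates: list[str], ask_llm) -> str:
    shortlist = set()
    for adjective in ("simplest", "best", "most similar"):
        prompt = (f"What word is the {adjective} replacement for {complex_word} "
                  f"in this list? {', '.join(candidates)}")
        shortlist.add(ask_llm(prompt))
    final_prompt = (f"{context}\nGiven the above context, what is the best replacement "
                    f"for {complex_word} in this list? {', '.join(sorted(shortlist))}")
    return ask_llm(final_prompt)
```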

Fig. 2 The selection and ranking pipeline introduced by North et al. (2023)

4 Resources

LS datasets (post-2017) exist for every LS sub-task or for a specific purpose (Appendix, Table 5). In addition, shared-tasks (\(^{SharedT.}\)) often provide their own LS datasets. Resources for LS are available for several languages, as shown in the following sections.

Table 5 Datasets that can be used for LS arranged in chronological order

4.1 English

Personalized-LS

Lee & Yeung (2018b) introduced a dataset containing 12,000 English words ranked on a five-point Likert scale for personalized LS. Fifteen native Japanese speakers were asked to rate the complexity of each word. The complexity ratings were applied to BenchLS, personalizing the dataset for Japanese speakers.

WCL

Maddela & Xu (2018) created the Word Complexity Lexicon (WCL). The WCL has 15,000 English words annotated by 11 non-native English speakers using a six-point Likert scale.

LCP-2021\(^{SharedT.}\)

The dataset provided at the LCP-2021 shared-task (CompLex) (Shardlow et al., 2020) was crowd-sourced in the UK, Australia, and the US. A total of 10,800 complex words in context were extracted from three corpora: the Bible, biomedical articles, and Euro-Parliamentary proceedings. Complexity was annotated on a five-point Likert scale.

SimpleText-2021\(^{SharedT.}\)

The SimpleText-2021 shared-task (Ermakova et al., 2021) designed three pilot tasks: 1) to identify passages to be simplified, 2) to select complex concepts within these passages, and 3) to simplify the complex concepts to produce an easier-to-understand passage. They provided their participants with multiple sources of data: the DBLP+Citation, Citation Network Dataset, ACM Citation network, and The Guardian newspaper with manually annotated keywords.

TSAR-2022\(^{SharedT.}\)

TSAR-2022 (Saggion et al., 2022) supplied datasets in English, Spanish, and Portuguese. These datasets contained target words in contexts taken from Wikipedia articles and journalistic texts, together with 10 candidate substitutions (approx. 20 in the raw data) produced by crowd-sourced annotators from Spain, Brazil, and the UK. The candidate substitutions were ranked by their suggestion frequency. The Spanish dataset contained 381 instances, and the English and Portuguese datasets were of a similar size.

WCL-DHH

Alonzo et al. (2022a) aimed to provide reading assistance to Deaf and Hard-of-Hearing (DHH) adults. They annotated the original 15,000 English words of the WCL dataset (Maddela & Xu, 2018) with lexical complexity ratings provided by 11 DHH annotators. Annotation was done on a six-point Likert scale.

4.2 Datasets in Other Languages

Spanish

The ALexS-2020 shared-task (Zambrano & Ráez, 2020) produced a Spanish dataset containing 723 complex words from recorded transcripts. Merejildo (2021) provided the Spanish CWI corpus (ES-CWI). A group of 40 native-speaking Spanish annotators selected complex words within 3,887 sentences taken from academic texts. The EASIER corpus (Alarcón et al., 2021b) contains 5,310 Spanish complex words in sentences taken from newspapers with a total of 7,892 candidate substitutions. EASIER-500 is a smaller version of this dataset.

Portuguese

The PorSimples dataset (Aluísio & Gasperin, 2010) houses sentences taken from Brazilian newspapers. The dataset contains nine sub-corpora separated by degree of simplification and text genre. The PorSimplesSent dataset (Leal et al., 2018) was created from the earlier PorSimples dataset. It contains strong and natural simplifications of PorSimples’s original sentences. SIMPLEX-PB (Hartmann & Aluísio, 2020) provides features for 730 complex words in context.

French

ReSyf contains French synonyms that have been ranked by an SVM according to their reading difficulty (Billami et al., 2018). It contains 57,589 instances amounting to 148,648 candidate substitutions. FrenchLys is an LS tool designed by Rolin et al. (2021). It has its own dataset, with sentences sampled from French schoolbooks and a French TS dataset, ALECTOR. Substitute candidates were suggested by 20 French-speaking annotators.

Japanese

The Japanese Lexical Substitution (JLS) dataset (Kajiwara & Yamamoto, 2015) has 243 target words, each shown in 10 different sentences. Crowd-sourced annotators suggested and ranked candidate substitutions. The JLS Balanced Dataset (Kodaira et al., 2016) extended the prior JLS dataset to make it more representative of multiple genres and has 2,010 generalized instances. Nishihara & Kajiwara (2020) created a new dataset (JWCL & JSSL) that expands the Japanese Education Vocabulary List (JEV). It contains 18,000 Japanese words separated into three levels of difficulty: easy, medium, or difficult.

Chinese

Personalized-ZH (Lee & Yeung, 2018a) provides 600 Chinese words ranked by eight learners of Chinese using a five-point Likert scale. HanLS was introduced by Qiang et al. (2021). It has 534 Chinese complex words. Five native-speaking annotators provided and ranked candidate substitutions. On average, each complex word has 8 candidate substitutions.

Russian

Abramov & Ivanov (2022) provide a parallel translation of the Bible instances found within the original English CompLex dataset (Shardlow et al., 2020). This new Russian LCP dataset consists of 931 distinct words shown in 3,364 different contexts. Annotation was conducted by 10 crowd-sourced annotators located in Russia using a 5-point Likert scale. A later study by Abramov et al. (2023) expanded this dataset by adding several features to each complex word.

5 Discussion and Conclusion

Since the 2017 survey on LS (Paetzold & Specia, 2017b), deep learning approaches have made new headway in LS. MLM became a popular SG method, with the clear majority of LS studies employing an MLM objective. However, LLMs, such as GPT-3, now surpass the performance of all other approaches when fed a series of prompts, especially when using an ensemble of prompts. LS systems that employ minimal selection and ranking, apart from ranking candidates by their model’s prediction scores, have outperformed more technical and feature-oriented ranking methods (Table 4). However, an exception is made with regard to the equivalence score (Li et al., 2022), which has been shown to be effective at SR.

Recent advances in deep learning, such as the most recent generation of LLMs, will be incorporated into future LS systems. Prompt learning and LLMs, such as Llama 3 (Touvron et al., 2023), Mistral (Jiang et al., 2023), and others, have proven to deliver state-of-the-art performance and are becoming increasingly popular in NLP. Using an ensemble of various prompts for selection and ranking has also been shown to further advance LS performance (North et al., 2023). New metrics similar to the equivalence score will undoubtedly also be beneficial.

5.1 Concluding Remarks and Open Challenges in LS

The advent of deep learning, along with the recent advancements in LLMs and prompt engineering, has greatly transformed the way we approach LS. Past LS systems depended on statistical methods, n-gram models, lexical approaches, rule-based techniques, and word embeddings to identify complex words and replace them with simpler alternatives (Paetzold & Specia, 2017b). Now, deep learning approaches automatically generate, select, and rank candidate substitutions using the methods described throughout this survey. However, there are various open research questions that have yet to be explored in LS research. In this section, we conclude this survey by outlining key areas for the future development of LS systems.

Evaluation

The automatic evaluation metrics that are used to evaluate LS are not perfect (see Section 3 for definitions). Potential@k and MAP@k are both lenient metrics designed to indicate whether a system may have the capacity to simplify something. Potential@k indicates whether any returned simplifications are correct regardless of their ranking, and MAP@k compares the ranking of a set of substitutions to the gold ranking. This is helpful for tracking the progress of systems and identifying whether one system or approach may be better suited to a scenario than another. However, in a simplification pipeline, the only important candidate is the top-ranked candidate. If an unsuitable simplification is ranked as the top candidate (leading to it being used as a replacement for the original word), this will be detrimental to the overall quality of the final text, but will not be penalized by the metrics we have discussed. Automated metrics that aim to capture quality using a single numerical score often do not correlate with human judgments. We believe that exploring more faithful resources and metrics, as well as directly evaluating LS systems with the intended user groups, is a promising avenue for future work. This can be done by considering variation in data annotation instead of the aggregated labels produced by multiple annotators, as in most LS datasets currently available.

Wider research conducted on text simplification has begun to combine annotation done by subject matter experts and crowd-sourced annotators, instead of fully relying on more generalized annotation protocols. Rahman et al. (2024) employed two nurse oncologists and a nurse practitioner as subject matter experts to provide highly representative sentence-level simplifications of healthcare-related material. In addition, the quality of gold labels provided by text simplification datasets is now being assessed through human evaluation. Gala et al. (2020) conducted several reading comprehension tests by asking dyslexic readers to recall elements of simplified and non-simplified texts. Results showed that their gold sentence-level simplifications were less likely to result in reading errors, verifying that they were easier to read for the intended target demographic. We encourage LS researchers to adopt similar practices.

Explainability

Lexical simplifications are inherently more explainable than sentence simplifications, as the operations are applied at the word level. However, the decision process about which words should be simplified is often hidden behind a black-box model. Research aiming to improve the explainability and interpretability of these decisions will allow researchers to better understand the challenges and opportunities of modern NLP techniques applied to LS. An example direction for future research may entail embedding features traditionally associated with lexical complexity within prompts to better understand the decision-making process behind LLMs for LS. Features such as word frequency, familiarity, concreteness, and others may be used to generate potential simplifications. Correlations between these features and the quality of the produced simplifications may in turn shed light on which features LLMs consider when determining viable simplifications for a given target word.

Personalization

Different target populations have different simplification needs; thus, one model does not fit all. The simplification needs of a second language learner, a stroke victim, and a child are all very different (Gooding & Tragut, 2022). As shown in previous research (North & Zampieri, 2023), English L2 speakers have simplification needs that are different from those of English L1 speakers and generally dependent on their own L1. For example, speakers of Portuguese or Spanish would have no issue with the word necessitate in English, as a word with the same meaning, necessitar, exists in their language. However, necessitate would be considered more complex than its synonym need by an English LS system due to the fact that it is longer and has a lower frequency than need. This would trigger an unnecessary substitution for L1 speakers of Portuguese or Spanish that does not help readability and could even hinder their text comprehension, given that need is not a word of Romance origin. Modeling these needs and using them to personalize LS systems will allow for personalized simplifications that can adequately meet the needs of specific user groups.

Personalized systems are already being developed to identify complex words for the precursor task of LCP before LS (Fig. 1). Lee & Yeung (2018a) created a personalized LCP system for identifying complex words for Chinese-as-a-foreign-language learners. Koptient & Grabar (2022) implemented an LCP system to rate the complexity of medical jargon for non-expert patients. Ortiz Zambrano et al. (2019) and Ortiz Zambrano & Montejo-Ráez (2021) published a new resource and built a new LCP system for identifying complex words spoken during university lectures for students in Ecuador. However, standalone personalized systems that generate candidate substitutes rather than identify complex words have yet to be developed. Moreover, many demographics and domains still lack their own personalized system, regardless of whether that system is designed for LCP, LS, or text simplification in general. This necessitates further research into personalization.

Perspectivism

Even within a target population, each individual will bring a unique perspective on what needs to be simplified. Systems that are able to tailor their outputs to each user’s individual needs will provide adaptive simplifications of potentially higher quality. This will, in turn, improve the evaluation of LS models as discussed in this section and throughout the survey.

Real-time or adaptive machine learning has already been applied within several intelligent tutoring systems found throughout educational platforms (Kabudi et al., 2021). These systems tailor user-content in real-time to provide users with a highly personalized learning experience. Hampton et al. (2018) created the Personal Assistant for Life-Long Learning (PAL3) that uses adaptive machine learning to prevent users from forgetting learned material. Hssina & Erritali (2019) employed real-time machine learning to automatically change lesson content based on the student’s profile. Troussas & Virvou (2020) implemented an adaptive recommendation system that suggests varying activities depending on the user’s needs and preferences. Despite these use cases, real-time or adaptive lexical simplification has not been adopted within live intelligent tutoring services.

Integration

LS is one part of the simplification puzzle, which, in turn, is part of a wider effort of improving readability. Integrating LS systems with explanation generation, text summarization (Peal et al., 2022; Xie et al., 2022), redundancy removal, and sentence splitting will further accelerate the adoption of automated simplification models. This will allow LS technology to reach a wider audience.

At present, LS has only been used in a handful of use cases. LS was originally introduced to aid machine translation by reducing the ambiguity of input texts, giving a machine translation system a higher likelihood of finding a suitable translation in the target language (North et al., 2022b). More recent use cases of LS can be seen throughout research aimed at improving the accessibility of medical documents (Koptient & Grabar, 2022) as well as throughout educational technology (Rets & Rogaten, 2020; Zaman et al., 2020). However, real-world deployment of LS technologies remains limited. We therefore encourage the adoption of LS within future technologies to increase the popularity of this field of research and to provide headway in the aforementioned research areas.