1 Introduction

Data augmentation refers to techniques used to enlarge human-authored datasets by automatically generating additional instances that are similar to the original data. In natural language processing (NLP), the augmentation of text is a challenging task because of the discrete, symbolic nature of text data. Despite the challenges, however, it provides a way to improve machine learning models in situations where human-annotated data is scarce (Şahin, 2022). In this work, we demonstrate how text augmentation by means of lexical substitution can be used to enrich representations of semantic frames.

Fig. 1 An example sentence with its color-encoded frame annotations taken from FrameNet. Red indicates the lexical unit, and blue indicates the semantic roles. (Color figure online)

A semantic frame is a linguistic structure used to formally describe the meaning of a situation, action or event (Fillmore, 1982). A frame annotation for a sentence provides (i) a set of target words that evoke frames in this sentence, (ii) the respective frame for each of the targets, and (iii) a set of arguments for each of the frames in the sentence. An example sentence is given in Fig. 1 along with two frame annotations taken from FrameNet (Baker et al., 1998), a widely used, publicly available resource of frame annotations. The example sentence contains two targets: help, which evokes the frame ‘Assistance’, and hope, which evokes the frame ‘Desiring’. The corresponding entries help.v and hope.v, consisting of the target word’s lemma and a part-of-speech tag, are called lexical units (LUs) or frame-evoking elements (FEEs) in FrameNet. The arguments represent semantic roles or frame elements (FEs) that act as participants of the situation described by the frame.

Semantic frames have been used in a wide range of applications, such as question answering (Shen & Lapata, 2007; Berant & Liang, 2014; Khashabi et al., 2018), machine translation (Gao & Vogel, 2011; Zhai et al., 2013), and semantic role labeling (Do et al., 2017; Swayamdipta et al., 2018). However, their impact is restricted by the limited availability of annotated resources. Although there are some publicly available resources like FrameNet (Baker et al., 1998) and PropBank (Palmer et al., 2005), for many languages and domains no specialized resources exist. Besides, due to the inherent vagueness of frame definitions, the annotation task is challenging and requires well-trained annotators or very complex crowd-sourcing setups (Fossati et al., 2013).

In this work, we suggest a different approach to the problem: augmenting the FrameNet resource automatically by generating more synthetic examples of existing frame annotations in context via lexical substitution. In this way, we obtain additional lexical representations of semantic frames (i.e. synonyms of words describing semantic frames). The goal of lexical substitution (McCarthy & Navigli, 2009) is to replace a given word in a particular context with other words that are semantically similar or related to the original word. The concept is similar in nature to set expansion, which refers to expanding a small set of seed entities into a larger set by acquiring new entities that belong to the same semantic class (Wang & Cohen, 2007). We posit that, given a small set of seed sentences with their frame annotations, we can expand these annotations by substituting the targets and arguments of those sentences and aggregating possible substitutions into an induced semantic-frame resource. Table 1 shows one such induced example. To generate these substitutes, we experimented with non-contextualized word embeddings, i.e. fastText (Bojanowski et al., 2017), GloVe (Pennington et al., 2014), and word2vec (Mikolov et al., 2013), and with distributional thesauri from JoBimText (Biemann & Riedl, 2013), and compared their results to pre-trained Transformer-based contextualized models such as BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019). To complete the comparison, we also include the lexical substitution model of Melamud et al. (2015), which uses dependency-based word and context embeddings and produces context-sensitive lexical substitutes.

Table 1 Lexical representations of the Assistance FrameNet frame retrieved using lexical substitutes from a single seed sentence with the BERT model

To generate substitutes, we solve two sub-tasks:

  • Lexical unit expansion: Given a sentence and its target word, the task is to generate frame-meaning-preserving substitutes for this word. The target word can be a verb or a noun. The gold substitutes are the lexical items specified by FrameNet. We aim at mining synonyms that fit the semantics of the original FrameNet frame definition.

  • Semantic role expansion: Given a sentence and an argument, the task is to generate frame-meaning-preserving substitutes for this argument. The gold substitutes are concrete realizations of the frame's roles in text. We aim at mining their synonyms and other realizations that fit the role semantics given in the original FrameNet role definition.

Table 1 presents the top substitutes produced by BERT for each highlighted word. These substitutes can replace the highlighted words of the seed sentence to generate new sentences, which augments the original set of sentences without manual annotation. To assess the quality of these substitutes and their effectiveness for semantic frame expansion and dataset augmentation, we performed three types of evaluation:

  1. Intrinsic evaluation: we evaluate the quality of the substitutes by comparing them to the gold standard FrameNet lexicon, while the performance is reported in terms of precision.

  2. Manual evaluation: for a small dataset, we evaluate the quality of the substitutes using human intuition, as the gold standard dataset can be incomplete.

  3. Extrinsic evaluation: we conduct an extensive empirical study using the semantic parsers of Swayamdipta et al. (2017) and Shi and Lin (2019). We compare the performance of these parsers on a number of small seed datasets and their augmented versions.

The main contributions of our work are:

  • A one-shot method for inducing frame-semantic structures using lexical substitution on frame-annotated sentences.

  • A comparative evaluation for various models including simple non-contextualized word embeddings and Transformer-based models for lexical substitution on the ground truth from FrameNet.

  • We show that combining the outputs of individual models can substantially improve the quality of the final substitutes compared to their individual performance.

  • A manual evaluation of substitutes, compared against the automatic evaluation with the FrameNet gold dataset.

  • We empirically demonstrate that the dataset augmentation procedure based on lexical substitution improves the performance of frame-semantic parsers. For both parsers, Swayamdipta et al. (2017) and Shi and Lin (2019), we see statistically significant improvements in argument identification performance.

The code and datasets are made available online for better reproducibility of our results.

The remainder of this article is organized as follows: Sect. 2 provides an overview of related work on the semantic frame induction task. Section 3 describes the models used for lexical substitution. Section 4 describes the lexical unit expansion and semantic role expansion experiments. Section 5 describes the frame-semantic parsing experiments. Finally, Sects. 6 and 7 summarize the overall findings of this work and discuss possible future directions.

2 Related work

Many data-driven approaches to frame-semantic parsing that take advantage of annotated resources, such as FrameNet, have been proposed in the literature (Das et al., 2010; Oepen et al., 2016; Yang & Mitchell, 2017; Peng et al., 2018), with SEMAFOR (Das et al., 2014) being the most widely known system for extracting complete frame structures, including target identification, frame identification, argument identification, and argument labeling. Some works focus only on a single parsing step, e.g. frame identification (Hermann et al., 2014; Hartmann et al., 2017; Sikos & Padó, 2019), argument labeling with frame identification (Swayamdipta et al., 2017; Yang & Mitchell, 2017), or just argument labeling (Kshirsagar et al., 2015; Roth & Lapata, 2015; Swayamdipta et al., 2018), which can be considered very similar to PropBank-style (Palmer et al., 2005) semantic role labeling, albeit more challenging because of the high granularity of semantic roles for frames. FrameNet-like resources are available only for very few languages and cover only a few domains. In this article, we venture into the more challenging problem of training a model for frame parsing on merely a very small amount of annotated data. This is similar to the idea of Pennacchiotti et al. (2008), who investigate the utility of semantic spaces and WordNet-based methods to automatically induce new lexical units, evaluating on FrameNet. Resource scarceness is the typical case here, as some NLP applications might require frames not covered by FrameNet, the granularity of available frames might not match the task, or the parser may need to be constructed for a low-resource language.

Several unsupervised semantic frame induction methods have been proposed in the literature. They extract clusters of words from text, which are then dubbed semantic frames. These methods are based on hard or probabilistic (soft) clustering of the input, commonly represented in the form of dependency trees. Lang and Lapata (2010) perform clustering of verb arguments based on syntactic dependencies. A latent variable probabilistic model is used in Modi et al. (2012) and Titov and Klementiev (2012). Materna (2012, 2013) also clusters subject-verb-object (SVO) triples with a similar model based on LDA (Blei et al., 2003). Kawahara et al. (2014) apply Chinese Restaurant Process clustering to a collection of verbal predicates and their argument instances. Ustalov et al. (2018) use tri-clustering on SVO triples to jointly induce both lexical units and their arguments. The downside of unsupervised frame induction is the lack of control over the semantics of the obtained word clusters and the frame granularity; due to this, such methods are not widely applied.

Our approach conceptually differs from these frame induction methods. We consider the effort of labeling one or a few sentences with frames as tolerable. This enables us to guide the construction of the FrameNet resource with the desired properties. Our experiments show that this minimal supervision can be used to produce the majority of LUs of semantic frames defined in FrameNet and generate meaningful semantic roles. However, since our method uses some training data, it is not directly comparable to these completely unsupervised approaches.

A few recent works use pre-trained language models for lexical substitution. Our method is directly motivated by the works of Amrami and Goldberg (2018) and Arefyev et al. (2019a). Amrami and Goldberg (2018) suggest predicting substitute vectors for target words using pre-trained ELMo (Peters et al., 2018) and dynamic symmetric patterns. Arefyev et al. (2019a) use the same idea of substitute vectors for the SemEval 2019 (QasemiZadeh et al., 2019) frame induction task, but replace ELMo with BERT (Devlin et al., 2019) for improved performance. Zhou et al. (2019) propose a method for lexical substitution with BERT. A more recent work by Arefyev et al. (2020) shows that injecting information about the target word into state-of-the-art language models can significantly improve their performance on lexical substitution. The resurgence of lexical substitution arises from the fact that it has a wide range of applications in NLP tasks such as word sense induction (Amrami & Goldberg, 2018; Arefyev et al., 2019b, 2020) and paraphrasing or text simplification (Kriz et al., 2018; Lee & Yeung, 2019). It is also used for quality assessment of semantic distributional models (Buljan et al., 2018). We are, to our knowledge, the first to employ lexical substitution for the expansion of semantic-frame resources and the first to show that it improves the performance of frame parsers. This work is a direct extension of our previous preliminary work (Anwar et al., 2020) with more advanced lexical substitution methods from Arefyev et al. (2020) and with experiments on the frame-semantic parsing task for extrinsic evaluation of the proposed approach.

3 Inducing lexical representations of frames

We experiment with two groups of lexical substitution models: non-contextualized and contextualized models. Regarding non-contextualized models, we report experiments with static embeddings from word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017); further, we utilize distributional thesauri constructed with JoBimText (Biemann & Riedl, 2013).

For contextualized models, we use two pre-trained Transformer-based models BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019), as well as the lexical substitution model of  Melamud et al. (2015).

3.1 Non-contextualized models

In this section, we describe common approaches to representing the meaning of individual words independently of their context.

3.1.1 Non-contextualized word embeddings

Non-contextualized word embeddings are vector representations of words constructed in such a way that words occurring in similar contexts are expected to have similar vectors. To produce substitutes for a target word, we take the 200 nearest neighbors of the target word according to the cosine similarity between non-contextualized embeddings. We use the following pre-trained embeddings: fastText trained on the Common Crawl corpus, GloVe trained on the Common Crawl corpus, and word2vec trained on Google News. The embeddings from all of these models have a dimensionality of 300.
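For illustration, a minimal sketch of this nearest-neighbor lookup with the gensim library is shown below. The vector file name is a placeholder for the pre-trained embeddings listed above, and the function names are illustrative.

```python
# Sketch: retrieve substitute candidates from static embeddings (e.g. fastText)
# via cosine similarity; the model file name is a placeholder.
from gensim.models import KeyedVectors

# Load 300-dimensional pre-trained vectors in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("crawl-300d-2M.vec", binary=False)

def substitutes(target_word, topn=200):
    """Return the top-n nearest neighbors of the target word as substitute candidates."""
    if target_word not in vectors:
        return []
    return [word for word, _ in vectors.most_similar(target_word, topn=topn)]

print(substitutes("help")[:10])
```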

3.1.2 Distributional thesauri

In contrast to the standard word embeddings, distributional thesauri (DT) can capture word similarities using simple n-gram context features and more complex linguistic context features (Lin, 1998), e.g. dependency relations. Grammatical features provide a more refined set of similar terms as compared to bag-of-words-based word embeddings, but their representations are sparser. JoBimText (Biemann & Riedl, 2013) is a framework that offers many DTs constructed using various corpora. Context features for each word are ranked using the lexicographer’s mutual information (LMI) score (Kilgarriff et al., 2004) and used to compute word similarity by feature overlap. We extract 200 nearest neighbors for the target word. In the experiments, we use two JoBimText DTs: (i) DT built on Wikipedia with n-grams as contexts and (ii) DT built on the combination of Wikipedia, Gigaword (Parker et al., 2009), ukWaC (Ferraresi et al., 2008), and LCC (Goldhahn et al., 2012) (59 GB in total) using dependency relations as context.

3.2 Contextualized models

While non-contextualized models are computationally efficient, they cannot handle polysemous words. This drawback is addressed by context-aware models, which can produce different word representations depending on the context. Therefore, they can also be used to generate different substitutes for a target word depending on its context.

3.2.1 Melamud’s lexical substitution model

The method proposed by Melamud et al. (2015) uses syntax-based skip-gram embeddings of Levy and Goldberg (2014) for a word and its context to produce context-sensitive lexical substitutes, where the context of a word is represented using its dependency relations. We use the embeddings from Melamud et al. (2015), which were trained on the ukWaC (Ferraresi et al., 2008) corpus. To find dependency relations, we use the Stanford Parser (Chen & Manning, 2014) (version 4.0.0) and collapse dependencies that include prepositions. Top k substitutes are produced only when both the target word and some of its context words are present in the vocabulary of the model.

The following cosine-similarity-based measures are proposed in Melamud et al. (2015) to compute suitability of a substitute s for a given target word t in a given context C:

$$add = \frac{cos(s,t) + \sum _{c \in C} cos(s,c)}{\left| C \right| + 1},$$
(1)
$$balAdd = \frac{\left| C \right| \cdot cos(s,t) + \sum _{c \in C} cos(s,c)}{2\cdot \left| C \right| },$$
(2)
$$mult = \left( \frac{cos(s,t)}{2}\cdot \prod _{c\in C}cos(s,c)\right) ^{\frac{1}{\left| C \right| +1}},$$
(3)
$$balMult = \left( \left( \frac{cos(s,t)+1}{2}\right) ^{\left| C \right| }\cdot \prod _{c\in C}\frac{cos(s,c)+1}{2}\right) ^{\frac{1}{2\left| C \right| }}.$$
(4)

Two of these measures (mult and balMult) use the geometric mean to produce high scores when the target word and the context words are all similar to a substitute word, whereas the other two (add and balAdd) use the arithmetic mean and can achieve high scores even if some of them are not similar. The balAdd and balMult measures place more emphasis on the similarity of substitutes to the target word.
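As an illustration, a minimal NumPy sketch of Eqs. (1)-(4) is given below, assuming the substitute, target, and context vectors have already been looked up; variable names are illustrative.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def melamud_scores(s, t, C):
    """s, t: substitute and target word vectors; C: list of context vectors.
    Returns the add, balAdd, mult and balMult scores of Eqs. (1)-(4)."""
    n = len(C)
    cos_st = cos(s, t)
    cos_sc = [cos(s, c) for c in C]
    add = (cos_st + sum(cos_sc)) / (n + 1)
    bal_add = (n * cos_st + sum(cos_sc)) / (2 * n)
    # As written in Eq. (3), mult assumes the product of cosines is non-negative.
    mult = (cos_st / 2 * np.prod(cos_sc)) ** (1 / (n + 1))
    pcos = lambda x: (x + 1) / 2  # shift cosines to [0, 1] for the balanced variant
    bal_mult = (pcos(cos_st) ** n * np.prod([pcos(x) for x in cos_sc])) ** (1 / (2 * n))
    return add, bal_add, mult, bal_mult
```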

3.2.2 Pre-trained transformer-based models

Transformer-based models pre-trained on various language modelling objectives can predict the distribution of substitutes for a target word in a given context. In this work, we use two Transformer-based models: BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019). The BERT model is the encoder part of the Transformer model (Vaswani et al., 2017) that was pre-trained with the masked language modeling (MLM) objective. In a nutshell, some randomly selected tokens in the training corpus are replaced by a special [MASK] token, and the objective is to restore those tokens based on the remaining left and right context. The XLNet model is an autoregressive language model pre-trained with the permutation language modelling (PLM) objective. In this objective, a new random permutation of tokens is generated for each example in each epoch. The model learns to predict each word given all preceding words in this permutation. Simple autoregressive models learn to predict words one by one either from left to right, or vice versa. In contrast, XLNet learns to predict words in any order and to take advantage of both left and right context of each word. Pre-trained Transformer-based models have effectively outperformed the previous state of the art in many downstream tasks. For our experiments, we use BERT large-cased and XLNet large-cased as implemented in the HuggingFace library (Wolf et al., 2019).

There are several ways to provide information about the target word to the Transformer-based substitution models. We use three options: keeping the original target word in place, using dynamic patterns, and combining conditional probabilities of substitutes given the context and the target word.

Original input: Although BERT and XLNet were both trained to guess a word they do not observe from its context, a substitution model can produce better substitutes if it not only sees the context but also has some information about the target word (Arefyev et al., 2020). The simplest way to introduce information about the target word is to feed the original example without any masking and take predictions for the position of the first subword of the target word. In the case of XLNet, we generate an attention mask such that all tokens attend to all other tokens in the content stream. Even though the contextualized embedding for the target position comes from the query stream, it still depends on the target word indirectly through the contextualized embeddings of all other tokens.

Dynamic patterns: Amrami and Goldberg (2018, 2019) use dynamic patterns to inject information about the target word (T) when generating substitutes with ELMo and BERT for the word sense induction task. These patterns are similar to Hearst patterns (Roller et al., 2018) and replace the target word (T) with a coordinate structure (e.g. “T and -”) to extract better substitutes. For example, after applying the pattern “T and -” to the target word “sold”, the sentence “Rob sold his car to Miller” is transformed into “Rob sold and - his car to Miller”. Substitutes are then generated for the token “-” instead of the original target word “sold”. Arefyev et al. (2019a, b, 2020) also use these patterns to generate substitutes for solving the lexical frame induction and word sense induction tasks. For our experiments with BERT, we try the patterns “T and -” and “T and T” (where the target word is duplicated), as illustrated in the sketch below.
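The following sketch applies the “T and -” pattern and queries a masked language model through the HuggingFace transformers library; it is a simplified illustration, and the actual generation and filtering pipeline used in our experiments is more involved.

```python
# Sketch: generate substitutes for a target word with BERT and the "T and -" pattern.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-large-cased")
model.eval()

def pattern_substitutes(sentence, target, topn=10):
    # Apply the "T and -" pattern: the placeholder "-" becomes a [MASK] token.
    patched = sentence.replace(target, f"{target} and {tokenizer.mask_token}", 1)
    inputs = tokenizer(patched, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    top_ids = logits[0, mask_pos].topk(topn).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)

# "Rob sold his car to Miller" -> substitutes predicted at the masked position
print(pattern_substitutes("Rob sold his car to Miller", "sold"))
```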

+embs: Arefyev et al. (2020) proposed a method that combines the probability of a potential substitute occurring in a given context P(s|C) with the probability reflecting distributional similarity of this substitute to the target word P(s|T):

$$\begin{aligned} P(s|C, T) \propto \frac{P(s|C)P(s|T)}{P(s)^\beta }. \end{aligned}$$
(5)

The probability P(s|C) is directly estimated by a language model, while P(s|T) is calculated by applying the temperature softmax over the inner product of their non-contextualized embeddings:

$$P(s|T) \propto \exp \Big (\frac{\langle \textbf{v}_{s}, \textbf{v}_{T}\rangle }{\tau }\Big ),$$
(6)

where \(\textbf{v}_{s}\) and \(\textbf{v}_{T}\) are the embeddings of the corresponding words, and \(\tau \) is a temperature hyperparameter used to balance the closeness of substitutes to the target word against their fitness to the given context. The hyperparameter \(\beta \) can be tuned to promote or penalize frequent words as substitutes. Prior word probabilities P(s) are obtained from the wordfreq library. The optimal values of \(\tau \) and \(\beta \) are selected using the development dataset. In the experiments with XLNet, we selected these values separately for lexical unit expansion and for semantic role expansion.
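A simplified sketch of this combination is shown below: the language model's distribution over the vocabulary is re-weighted by the target-similarity distribution of Eq. (6) and the prior. The variable names and the way the inputs are obtained are illustrative assumptions.

```python
import numpy as np

def combine_embs(log_p_s_given_C, emb_matrix, target_vec, prior_p, tau=0.1, beta=1.0):
    """Implements Eqs. (5)-(6): P(s|C,T) is proportional to P(s|C) * P(s|T) / P(s)^beta.
    log_p_s_given_C: log-probabilities from the language model, shape (V,)
    emb_matrix: non-contextualized embeddings of all vocabulary words, shape (V, d)
    target_vec: non-contextualized embedding of the target word, shape (d,)
    prior_p: prior word probabilities, shape (V,)"""
    # P(s|T): temperature softmax over inner products (Eq. 6)
    logits_t = emb_matrix @ target_vec / tau
    logits_t -= logits_t.max()                       # numerical stability
    p_s_given_T = np.exp(logits_t) / np.exp(logits_t).sum()
    # Eq. (5), computed in log space
    log_score = log_p_s_given_C + np.log(p_s_given_T) - beta * np.log(prior_p)
    return np.argsort(-log_score)                    # vocabulary indices ranked as substitutes
```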

3.3 Combination of models

To combine the advantages of several models, we also ensemble the predictions of the best-performing models. Since different models produce scores that are not directly comparable, we consider only substitute ranks, i.e. their positions after ordering substitutes according to their scores obtained from each model. We compute the combined rank as:

$$\begin{aligned} Combined\,Rank(w) = \frac{1}{L} \sum _{i=1}^{L} rank_i(w). \end{aligned}$$
(7)

where \(rank_i(w)\) is the rank of w among the substitutes predicted by the i-th model if it is predicted by the i-th model, and 1000 otherwise. Each model predicts at most 200 substitutes and the value 1000 is used to penalize the combined rank of a substitute that is not predicted by all models in the ensemble. The goal of this penalization is to rank words that are predicted by N models higher than words that are predicted by \(N-1\) models.
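A minimal sketch of this rank aggregation (Eq. 7) might look as follows, assuming each model's output is an ordered list of substitutes.

```python
def combined_rank(substitute_lists, penalty=1000):
    """substitute_lists: one ranked list of substitutes per model.
    Returns substitutes sorted by the average rank of Eq. (7); a word missing
    from a model's list receives the penalty rank of 1000."""
    words = {w for lst in substitute_lists for w in lst}
    def avg_rank(w):
        ranks = [lst.index(w) + 1 if w in lst else penalty for lst in substitute_lists]
        return sum(ranks) / len(substitute_lists)
    return sorted(words, key=avg_rank)

# Example: three models, two of which predict "aid" near the top.
print(combined_rank([["aid", "support"], ["assist", "aid"], ["support"]]))
```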

4 Intrinsic evaluation: augmenting lexical descriptions in FrameNet

In this section, we show how lexical substitution can be used to fill gaps in a lexical resource. Namely, given partially completed descriptions of lexical-semantic frames from the FrameNet resource, one can reconstruct the missing semantic roles and lexical units using our approach.

4.1 Experimental setup

4.1.1 Datasets

We use FrameNet (Baker et al., 1998) version 1.7 to generate our evaluation datasets. The combined data from fulltext and exemplars annotations of FrameNet contains around 170k sentences with 1014 frames, 7828 types of semantic roles, and 10,340 unique lexical units. Table 2 describes more characteristics of these datasets. The datasets for evaluation were derived automatically. Semantic roles and lexical units can consist of single or multiple tokens. For this work, we have only considered single-token substitution.

Table 2 Statistics of evaluation datasets for verb lexical units, noun lexical units, and semantic role expansion tasks derived from FrameNet-1.7

Single-token lexical unit and semantic role expansion: In order to create evaluation data for the LU expansion tasks, for each sentence containing an annotated LU we consider other LUs of the corresponding semantic frame as gold substitutes. We keep only LUs marked as verbs and nouns in FrameNet. FrameNet annotations contain 10 different types of lexical units based on their part-of-speech tags, but verbs and nouns cover about 79% of annotations. We created two separate datasets for the verb lexical unit expansion task and the noun lexical unit expansion task. To construct the evaluation dataset for the semantic role expansion task, for each of the sentences that contain an annotation of a given semantic role we consider all the single-word annotations from the rest of the corpus marked with the same role and related to the same frame as the gold substitutes.

An example frame with its full set of lexical units and two semantic roles is shown in Fig. 2. It contains some example sentences annotated with lexical units and semantic roles. To illustrate how we generated the evaluation datasets, Table 3 shows evaluation data that could have been generated based on the annotations in Fig. 2. However, the final datasets for the experiments were generated using all data from the fulltext and exemplars annotations of the FrameNet resource, but not the example sentences from the frame description files. The resulting datasets contain 79,584 records for verb LUs, 76,229 for noun LUs, and 191,252 records for role expansion. We use 10% of the examples as development sets for tuning the hyperparameters \(\tau \) and \(\beta \) of BERT+embs and XLNet+embs.

Fig. 2 The frame Arrest from FrameNet, simplified for illustrative purposes. It contains a frame definition and an example sentence, as well as names, descriptions and examples for a few semantic roles (FEs), and finally, a set of lexical units associated with this frame

Table 3 Evaluation data generated from the FrameNet descriptions shown in Fig. 2

Hyperparameters for +embs: Following Arefyev et al. (2020), we set the default values of the hyperparameters \(\beta = 1\) and \(\tau =0.1\). For XLNet+embs we additionally selected the optimal values of these hyperparameters based on the development subsets of all three datasets. The following values were selected: \(\beta = 0.5\) and \(\tau =0.05\) for the verb lexical unit expansion task, \(\beta = -\,1\) and \(\tau =0.07\) for the noun lexical unit expansion task, and \(\beta = -\,0.5\) and \(\tau =0.15\) for the semantic role expansion task. Negative values of \(\beta \) mean that it is beneficial to promote more frequent words as substitutes for the latter two tasks.

4.1.2 Evaluation measures

To evaluate the quality of generated substitutes, we use the standard ranking metric precision at k (p@k), where k represents the number of the highest ranked substitutes to be considered. While p@k measures the correctness of the first k substitutes, to evaluate the quality of the entire list of generated substitutes, we use mean average precision at level k (MAP@k):

$$\begin{aligned} MAP@k = \frac{1}{N} \sum _{i=1}^{N} AP^i@k, \end{aligned}$$
(8)

where

$$\begin{aligned} AP^i@k = \frac{1}{min(k, R^i)} \sum _{l=1}^{k} r_l^i \cdot p^i@l. \end{aligned}$$

Here, N is the total number of examples in the dataset; \(R^i\) is the number of possible correct answers for the i-th example; \(r_l^i\) equals 1 if the l-th predicted substitute for the i-th example is correct and 0 otherwise. We report p@k at k = 1 and k = 5, as well as MAP@50. Sometimes the post-processing procedure leaves fewer than k substitutes; we count the absence of a substitute at a position as a wrong answer of the model.
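For clarity, the sketch below computes p@k and AP@k exactly as defined above for a single ranked list of substitutes and a gold set; MAP@k is then the mean of AP@k over all examples.

```python
def precision_at_k(predicted, gold, k):
    """Fraction of the first k predicted substitutes that are in the gold set;
    missing positions count as wrong answers."""
    hits = sum(1 for w in predicted[:k] if w in gold)
    return hits / k

def average_precision_at_k(predicted, gold, k):
    """AP@k as defined above, normalized by min(k, |gold|)."""
    score = 0.0
    for l in range(1, k + 1):
        if l <= len(predicted) and predicted[l - 1] in gold:
            score += precision_at_k(predicted, gold, l)
    return score / min(k, len(gold))

# MAP@k is the mean of average_precision_at_k over all N examples in the dataset.
```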

4.1.3 Text pre-processing

For non-contextualized embeddings, we tried generating substitutes for the target word both with and without lemmatization and found that lemmatizing the target word has no positive effect on model performance. We assume that the grammatical form of the target word carries some information about its context, which can help models that do not have direct access to the context generate better substitutes. Thus, we do not use lemmatization for this kind of model. For DTs, lemmatization produced better results, mainly because the corpora were lemmatized before building the DTs; therefore, we employ lemmatization for the DT models. For all contextualized models, the context is tokenized on whitespace and we do not apply any pre-processing to the original sentences extracted from the FrameNet-annotated corpus.

For BERT, we remove all subwords of the target word except the first subword during pre-processing, which results in better substitutes; see Table 4 for an illustration of this effect. For XLNet, we follow Arefyev et al. (2020) and prepend a fixed prompt, i.e. a ‘warming up’ text fragment ending with the end-of-document token, as the initial context for each example.

Table 4 Pre-processing target words with multiple subwords

4.1.4 Post-processing of substitutes

Lexical substitutes can contain noisy tokens, such as numbers, individual symbols, model specific special tokens, e.g. [UNK], or sub-words marked with the ## prefix. In post-processing, we remove all such non-words from the list of generated substitutes. Substitutes often contain different forms of the same word, especially when static word embeddings are employed. Therefore, we lemmatize the generated substitutes using the Pattern library (Smedt & Daelemans, 2012) and remove duplicated lemmas. For the verb lexical unit expansion task, we drop all substitutes that are not verbs. For this purpose, we use a dictionary of verbs composed of verb lexicons taken from Pattern, WordNet (Miller, 1995), and FreeLing (Padró & Stanilovsky, 2012). For the noun lexical unit expansion task, we remove stopwords, apply POS tagging, and retain only nouns in the final output.
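The sketch below illustrates this post-processing for the verb LU task. It uses NLTK's WordNet lemmatizer and verb lexicon as a stand-in for the combination of Pattern, WordNet, and FreeLing described above, so it is an approximation of the actual pipeline.

```python
# Sketch: clean, lemmatize, deduplicate, and filter substitutes for verb LUs.
# Requires: nltk.download('wordnet')
import re
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def postprocess_verb_substitutes(substitutes):
    """Drop non-words (numbers, [UNK], ##subwords), lemmatize, remove duplicates,
    and keep only substitutes known as verbs."""
    seen, result = set(), []
    for s in substitutes:
        if not re.fullmatch(r"[A-Za-z]+", s):
            continue
        lemma = lemmatizer.lemmatize(s.lower(), pos="v")
        if lemma in seen:
            continue
        seen.add(lemma)
        if wn.synsets(lemma, pos=wn.VERB):   # keep only words attested as verbs
            result.append(lemma)
    return result
```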

4.1.5 Combination of models

For model combinations, we consider the best-performing individual models according to the mean average precision from the following categories: non-contextualized embeddings (nc-emb), distributional thesaurus based models (DT), the contextualized models of Melamud et al. (2015), contextualized models based on pre-trained Transformer-based models, and the Transformer models with the embeddings of target words (+embs).

4.2 Results

4.2.1 Lexical unit expansion task

The results for the lexical unit expansion tasks are presented in Table 5.

Table 5 Evaluation of lexical substitutes for lexical unit and semantic role expansion

Verb lexical unit: In the verb lexical unit expansion task, the best performance among non-contextualized models was achieved by fastText (\(p@1=0.388\) and \(MAP@50=0.156\)), closely followed by word2vec (\(p@1=0.388\) and \(MAP@50=0.151\)). The DTs considered in our experiments perform worse than the embedding-based models word2vec and fastText. Among Melamud’s models, the best performance was achieved by balAdd (Melamud et al., 2015) with \(p@1=0.393\) and \(MAP@50=0.156\), whereas balMult performed slightly worse: balAdd can produce substitutes even if the context has no similarity to the substitute, which is only useful for monosemous words. Even though the simple BERT and XLNet models (without masking) performed comparably, they could not outperform fastText and word2vec. However, a closer examination of some examples shows that contextualized models do make a difference when the target word is polysemous, see Table 6.

Applying the dynamic patterns helped to improve the performance of BERT. While BERT with the pattern “T and -” is substantially worse than just using the vanilla BERT model without masking, the second pattern “T and T” yields the best results for the BERT models. These experiments confirm that such dynamic patterns can help to better capture the semantics of a target word and produce better substitutes. BERT with the pattern “T and T ” outperforms all other models in terms of precision at \(k=1\), but distinctively falls behind fastText, balAdd, and balMult (Melamud et al., 2015) on the higher levels of precision and in terms of the MAP score.

Using the +embs method proved to be a better approach to target word injection compared to the dynamic patterns. With this approach both BERT and XLNet have outperformed all other models. For XLNet, using +embs with the optimal hyperparameters has achieved the best performance overall with \(p@1=0.504\) and \(MAP@50=0.199\). Even the model with the default hyperparameters has obtained better performance than all other models (\(p@1=0.487\) and \(MAP@50=0.189\)).

For the combined models we considered: (1) fastText as nc-emb, (2) DT 59g as DT, (3) balAdd for Melamud et al. (2015), (4) XLNet, and (5) XLNet+embs with optimal hyperparameters. Combining substitutes predicted by individual models has a mixed effect, and the combined scores are sensitive to the individual performance of the participating models. Overall, the highest MAP score is achieved by combining XLNet+embs with balAdd (Melamud et al., 2015) and DT (\(MAP=0.201\)). For the combinations based on XLNet+embs, the precision scores slightly decrease in comparison to its individual performance, but all other combinations obtain higher precision scores than their individual counterparts. In particular, the tri-model combinations based on the simple XLNet model closely matched the performance of the best XLNet+embs model (\(MAP = 0.199\) and \(MAP = 0.198\)).

Table 6 contains example sentences with highlighted target words and the top 10 substitutes generated by all models (along with the ground truth FrameNet annotations). The first example presents a LU that is associated with only one frame and is unambiguous; all models produced many matching substitutes. The other two examples illustrate that LUs can have several senses, each associated with a different frame. The non-contextualized models, except GloVe and fastText, predicted at least one valid substitute for the first frame Departing, but most of them failed to produce any substitutes in the top 10 for the Causation frame. In contrast, BERT and XLNet successfully generated several matching substitutes for both cases; in particular, XLNet predicted more matching substitutes than BERT.

Noun lexical unit: Among non-contextualized models, the best performance was achieved by DT 59g (\(p@1=0.398\) and \(MAP@50=0.145\)). Unlike for verbs, where the embedding-based models outperformed DTs, for nouns the DTs perform better than embeddings. The contextualized models of Melamud et al. (2015) also have significantly lower performance compared to DTs. BERT and XLNet perform comparably to DTs. Dynamic patterns do not improve BERT’s performance, but the +embs method improves the results significantly for both BERT and XLNet. The XLNet+embs model with the optimal hyperparameters achieved the best performance with \(p@1=0.499\) and \(MAP@50=0.171\).

For the combination, the following models were taken: (1) GloVe as nc-emb, (2) DT 59g as DT, (3) balAdd for Melamud et al. (2015), (4) XLNet, and (5) XLNet+embs with the optimal hyperparameters. The highest MAP score was again achieved by combining XLNet+embs with balAdd (Melamud et al., 2015) and DT (\(MAP=0.189\)). Unlike for verbs, for nouns suitable model combinations improve the results compared to the best individual model. Since the embedding-based models perform worst for nouns, the combinations with nc-emb have the lowest MAP scores among all combinations.

Table 6 Examples for the verb lexical unit expansion task

As mentioned in Sect. 4.1.4, we use the Pattern library for lemmatization and POS tagging of nouns. To investigate the effect of POS tagging on model performance, we compare the results produced with the Pattern library and with the lemminflect library. The former returns only the most suitable POS tag, while the latter returns all possible POS tags, which may work better for words that can have multiple tags. Table 24 shows the results with lemminflect used for lemmatization and POS tagging. All embedding-based models improve significantly, which shows that correct POS tagging is crucial for nouns; otherwise, good candidates may be dropped from the final output.

4.2.2 Semantic role expansion task

The evaluation results for the semantic role expansion task are presented in Table 5. For the role expansion experiments, the non-contextualized models and the models of Melamud et al. (2015) are outperformed by BERT and XLNet by a significant margin, with \(p@1=0.471\) and \(MAP@50=0.118\) for BERT and \(p@1=0.513\) and \(MAP@50=0.144\) for XLNet. The DTs performed substantially better than the word embedding models and comparably to the models of Melamud et al. (2015); the better score is achieved by the DT trained on Wikipedia. The performance of static word embeddings dropped: fastText, which was the best model in the previous experiment, is now the worst among all models. In contrast to the previous experiments, the performance of the Melamud et al. (2015) models also dropped significantly in comparison to BERT and XLNet.

XLNet performed better than BERT in all settings, with \(p@1=0.513\) and \(MAP@50=0.144\) for the simple model without masking and \(p@1=0.522\) and \(MAP@50=0.159\) with the +embs method and default hyperparameters. Selecting optimal hyperparameters further improved its performance to \(p@1=0.542\) and \(MAP@50=0.161\), making it the best model overall. The dynamic patterns did not help to improve the performance of the BERT model for this particular task, most probably because these patterns are not suitable for the semantic role expansion task. Although without the +embs method BERT and XLNet were outperformed by several non-contextualized models in the LU expansion task, in this experiment they obtained superior performance compared to all these models. This reflects the importance of the context for making reasonable substitutions of words that bear semantic roles. Another reason is that their fixed-size vocabulary covers frequent words such as verbs better than the nouns that realize role arguments.

Combining substitutes predicted by multiple models substantially improves the scores of the models that performed worst individually, but shows a mixed effect for combinations where one model was significantly better than the others. For the semantic role expansion task, the models we considered are (1) GloVe as nc-emb, (2) DT wiki as DT, (3) balAdd for Melamud et al. (2015), (4) XLNet, and (5) XLNet+embs with optimal hyperparameters. The highest MAP score was achieved by combining XLNet+embs with balAdd (Melamud et al., 2015) and DT, but the highest precision was achieved by the combination of balAdd with XLNet (\(p@1 = 0.574\)) and with XLNet+embs (\(p@1 = 0.563\)). Both of these combinations obtained better precision scores for smaller values of k than the single best model, XLNet+embs with optimal hyperparameters, which obtained the highest MAP score (\(MAP=0.161\)). Overall, with p@1 approaching 55% and p@5 approaching 47%, and given that our gold standard is necessarily incomplete, this paves the way to fully automatic expansion of semantic role resources.

Table 7 contains three example sentences with highlighted arguments for semantic roles and the top 10 substitutes generated by all models (along with the ground truth FrameNet annotations). The first example yields several valid matching substitutes, because vehicle is the most common sense of “car”. The other two examples present the argument “bank”, which has multiple associated semantic roles. Again, BERT and XLNet were able to distinguish both senses of “bank” and produced several valid substitutes.

For roles, we also produce results with stopword removal to see how it affects the performance. The results are reported in Table 25. Comparing these scores to Table 5, we can see that for the non-contextualized models and the models of Melamud et al. (2015) there is no meaningful difference, which suggests that these models rarely produce such words in their output. For the Transformer-based models, the results improve substantially: since these models predict a word given its context, there is a high likelihood that, based on the position of words and their context, some bad candidate words are produced. As these individual models improve further, the combinations are slightly negatively affected, and the difference between their scores and those of the individual models increases. Overall, the XLNet+embs model yields the highest scores (\(p@1=0.581\) and \(MAP@50=0.176\)).

4.2.3 Effect of gold set size

The results reported in Table 5 are generated using the whole datasets, without any filtering on the size of the gold sets. We report MAP at \(k=50\), but there are many instances in these datasets where the size of the gold set is very small: the average size is 22 for verbs, 27 for nouns, and 73 for roles, whereas the minimum size is 1 for all three. For smaller sets, it is much harder to predict the candidates, especially if the gold members are rare words. In a number of situations, even if these members are produced by the model, they may not be ranked high in the list of potential substitutes. Figure 3 shows the performance of the XLNet+embs model for all three datasets, filtered by a minimum gold set size of 5, 10, and 15. Precision at all values of k increases if we filter out smaller sets, but not by a large margin. This suggests that the ranking of candidates needs to be further investigated.

Fig. 3 Precision@k curve for the XLNet+embs (optimal) model for all three datasets of verbs, nouns, and roles. Here, mgs means minimum gold set size

4.3 Examples of induced lexical semantic frame representations

This section contains a qualitative analysis of lexical expansion examples for a few semantic frames across all lexical substitution models, along with the ground truth from FrameNet. Each example sentence represents a specific frame and a single target word labeled either as a lexical unit or as a semantic role. For each model, the top 10 final substitutes are given. Examples of semantic role expansion are presented in Table 7, and examples of lexical unit expansion in Table 6. Each table contains examples of ambiguous and unambiguous words to compare the substitutes in each use case.

In summary, it is evident that for unambiguous words most models produce several valid substitutes, whereas for polysemous words most of the non-contextualized models either were unable to produce any valid substitutes or produced a few good substitutes for one sense only. In contrast, contextualized models produce valid substitutes in most situations. A deeper analysis of these examples provides some key insights into the intrinsic evaluation framework. For instance, some substitutes may be semantically valid but absent from the FrameNet lexicon and hence not counted as correct. Conversely, a substitute that is present in the FrameNet lexicon can actually be a wrong fit, because it may change the meaning of the sentence or make it grammatically incorrect. We dive deeper into this issue in the following section.

4.4 Manual evaluation of lexical substitutes

4.4.1 Problems with automatic evaluation of lexical substitutes

As discussed in Sect. 4.3, automatic evaluation of lexical substitutes using the current gold datasets faces two problems, related to FrameNet coverage and to the semantics of a substitute in its context.

Table 7 Examples for the semantic role expansion task

Scenario A—substitute fits the context, but is not present in the gold dataset: A substitute may be a good candidate to replace the target word within the context, but is considered wrong because it is not present in the gold dataset. FrameNet provides a gold set of all lexical units for each frame. Normally, we can expect this set to be more complete for verb predicates than for noun predicates: noun predicates usually do not have a strictly closed set of suitable variants, so it is highly likely that a given substitute is not covered in the gold dataset (this also happens with verbs, albeit to a lesser extent, since the number of verbs is generally lower than the number of nouns). The problem is aggravated for semantic roles, as the roles can in theory have countless valid arguments; this works for semantic parsing but becomes an issue for augmentation tasks if a gold dataset is used for evaluation. In our experiments, the gold dataset for roles is extracted from the available sentence annotations, so it cannot be considered a proper gold set for evaluation. See Table 8 for illustration: for the given sentences and lists of substitutes, there are multiple correct answers that are not present in the gold dataset.

Table 8 Examples of substitutes for scenario A; the substitutes that are marked as correct by the annotator but are not present in the gold dataset are highlighted

Scenario B—substitute is present in the gold dataset, but does not fit the context: A given substitute is present in the gold dataset, but may not fit the given context or may change the meaning of the sentence altogether. See Table 9 for examples of such substitutes. The second example describes a situation where a body part is involved, but not all body parts can be folded as per the context. In the last example, the target word is a semantic role of type Speaker, which can be a pronoun or a person’s name, but not all pronouns fit this context.

Table 9 Examples of substitutes for scenario B; the substitutes that are present in the gold dataset but do not fit the context are highlighted
Table 10 Examples with their target words highlighted and the list of top 10 final substitutes

4.4.2 Evaluation framework

To manually analyse the appropriateness of a given substitute in the scenarios discussed in Sect. 4.4.1, we define the following evaluation rules:

  • It does not fit the context. The objective is to maintain the sentence’s meaning; this category also covers substitutes that would make the sentence grammatically incorrect.

  • It fits the context, but not the frame. The sentence is still meaningful but does not preserve the frame meaning. We use the formal descriptions of frames to decide whether a sentence represents the frame. Additionally, for semantic roles, we also consider the semantic role definition, because the frame description alone is not sufficient to evaluate semantic roles.

  • It fits both the context and the frame. These substitutes are the ideal replacements for the target word, as the main motivation of this work is to preserve the original frame.

Table 10 contains examples for each use case. It also includes the list of substitutes matched with the gold dataset.

4.4.3 Datasets and substitution model

We randomly sampled 50 annotations for each type of target word (verb, noun, and semantic role). Table 11 shows the statistics of these datasets. For each annotated instance, the top 10 substitutes are evaluated; in summary, the annotator has to evaluate 500 substitutes per dataset. As the substitution model, we choose the best-performing single model, i.e. XLNet+embs.

Table 11 Statistics of three datasets for manual evaluation sampled randomly from datasets used in automatic evaluation

4.4.4 Results

Table 12 shows the results for all datasets against each evaluation use case. For all datasets, there is a significant improvement in the use case where the substitute fits both the context and the frame, and precision improves consistently for all values of k. The precision values for the does-not-fit-the-context case support our first concern about automatic evaluation: even though a substitute may be present in the gold dataset, it may not fit the context and hence should be ignored. For example, in Table 10, the substitute body does not make sense as a replacement of the original word skin, but it is present in the gold dataset. Among the substitutes that fit the context, there are still cases where they do not preserve the frame: in Table 10, the substitutes blood, fur, and cloth fit the context very well, but since they are not body parts, they do not maintain the frame and cannot be accepted as correct. Not surprisingly, the numerical scores are higher than in the automatic evaluation, as the manual judgements are not prone to the incompleteness of lexical-semantic resources.

Table 12 Manual evaluation of lexical substitutes for sampled datasets of 500 annotations (50 contexts, 10 substitutions)

5 Extrinsic evaluation: frame-semantic parsing with lexically expanded FrameNet

To evaluate the quality of automatically constructed frame structures, we conducted extensive experiments using two frame-semantic parsers. Our goal was to determine whether these induced frame structures can improve parsing performance in situations where annotated data is scarce. We select a small sample from the FrameNet dataset with original annotations as a seed dataset. Then we augment it by incorporating new sentences constructed using our lexical substitution approach, while keeping the annotations the same, which results in a larger training dataset. We compare the performance of the parsers trained on the augmented dataset and on the seed dataset; the test and development (dev) sets remain unchanged.

5.1 Experimental setup

In this section, we describe the choice of models for lexical substitution, the details of the semantic parsers used in our experiments, and the procedure for the construction of training datasets, including the pre- and post-processing steps.

5.1.1 Lexical substitution models

For extrinsic evaluation, we select the two substitution models with the best performance in the intrinsic evaluation (Table 5). The first model is XLNet+embs with optimal hyperparameters, which demonstrates the best results for both the lexical unit expansion and the semantic role expansion tasks. The second model is BERT without dynamic patterns and without the +embs extension. We choose the standard BERT model without any extensions in order to determine whether the performance differences observed between these two models in the intrinsic evaluation are also reflected in a presumably less sensitive extrinsic evaluation.

5.1.2 Frame-semantic parsers

We conduct experiments with: (1) open-SESAME (SEmi-markov Softmax-margin ArguMEnt)—a neural network-based frame-semantic parser by Swayamdipta et al. (2017), and (2) a BERT-based parser for relation extraction and semantic role labeling inspired by Shi and Lin (2019).

Open-SESAME parser (Swayamdipta et al., 2017): The Open-SESAME parser decomposes the task of frame-semantic parsing into three sub-tasks and implements an independently trained model for each sub-task: (1) ArgId: to identify and label semantic arguments, (2) FrameId: to identify frames using gold targets, and (3) TargetId: to identify target predicates using lexical units of FrameNet (this model is not discussed in the original publication). The objective of the argument identification model is to identify argument spans and their labels. It uses a softmax-margin segmental recurrent neural network as a baseline syntax-free model and adds several modifications to further improve the performance, in particular syntactic information and syntactic scaffolding. For our experiments, we only used the baseline syntax-free model. The model accepts as input a sentence in the form of a token sequence, token part-of-speech tags, a target span, and an associated lexical unit with its frame, and outputs a list of possible labeled segments with their start and end positions in the input sentence. The labels are either semantic roles or “null”. The ArgId model only handles non-overlapping segments, and segmentation is only produced for the input frame and its target. The maximum length of a span to be considered can be specified as a hyperparameter. The frame identification model is a syntax-free bidirectional LSTM that takes the same input as the argument identification model, except the frame, and identifies the frame evoked by the target. It cannot predict frames for targets that are not present in the FrameNet lexicon. The target identification model is also based on a bidirectional LSTM. It takes as input a sequence of tokens from a given sentence, their part-of-speech tags, and lemmas, and for each token it outputs a binary label indicating whether it is a target or not. The list of possible targets is available through the FrameNet lexicon of lexical units. In our experiments, we use the official, publicly available implementation of the Open-SESAME parser.

BERT parser (Shi & Lin, 2019): The BERT-based semantic parser was originally designed for PropBank-style arguments and, unlike open-SESAME, it does not perform target and sense (frame) identification as separate independent tasks. For argument identification and labeling, it can perform sense disambiguation for targets before argument identification (end-to-end); therefore, it can work with only a sentence and the target predicate, with the target frame as optional input. For the target sense disambiguation task, it takes a sentence as input and formulates the task as a sequence labeling problem, where each token is assigned a label: the target token is assigned the sense (frame) label and all remaining tokens are assigned either the label ‘X’ (non-target tokens) or ‘O’ (sub-tokens of any non-target token). This sequence of tokens is passed through the BERT encoder to obtain contextualized embeddings. The predicate tokens are distinguished by concatenating these contextualized embeddings with ‘predicate indicator’ embeddings before making a final prediction using a one-hidden-layer multi-layer perceptron (MLP). For the argument identification and labeling task, it takes as input a pair of a sentence and its target predicate; argument spans are predicted as BIO (Beginning, Inside, Outside) labels for all tokens. The target predicate is paired with the sentence and passed through the BERT encoder to make the sentence embeddings target-aware. These contextualized sentence embeddings are concatenated with ‘predicate indicator’ embeddings and passed to a one-layer BiLSTM to obtain hidden states for each token. The hidden state of the predicate token is concatenated to the hidden state of each token and passed to the MLP to obtain the probability distribution over the label set. We use the implementation provided by the AllenNLP library and conduct experiments both with and without gold frames.

Note that the open-SESAME parser does not leverage pre-training, but it uses syntactic information (part-of-speech tags) for parsing. In contrast, the BERT-based parser (Shi & Lin, 2019) takes advantage of pre-training while avoiding any syntactic information. We consider it interesting to investigate the effect of lexical expansion on two such conceptually different semantic parsers.

5.1.3 Seed datasets

We use scripts from the open-SESAME parser (Swayamdipta et al., 2017) to split the full-text annotations of FrameNet-1.7 into train, test, and dev splits. The test set is similar to that of previous studies (Das et al., 2014); it contains 16 documents, while 8 documents are used for the dev set. The statistics for all three splits are given in Table 13. For our experiments, we generate two sets of splits: (a) with verbs as lexical units; (b) with nouns as lexical units. To enable comparative experiments after lexical expansion, all other train datasets were sampled from the train sets of these two datasets while keeping their respective test and dev sets the same. Seed training datasets were constructed by randomly sampling one frame annotation per sentence. This strategy yields a train dataset for verbs with 2746 annotations in total (7 annotations per frame on average) and a train dataset for nouns with 9293 annotations in total (8 annotations per frame on average).

Table 13 Statistics for data splits for FrameNet-1.7 fulltext annotations and the seed datasets

5.1.4 Dataset expansions

Each annotation of the seed dataset was augmented using three types of words simultaneously. The first two types are based on FrameNet annotations of the sentence tokens:

  • lexical unit: a single-token lexical unit, which can be either a verb or a noun

  • role: all single-token roles.

The third word type is based on the POS tags of the sentence tokens:

  • noun: any word that is a noun or part of a noun phrase but is neither a lexical unit nor a single-token role. The reason for selecting such nouns for expansion comes from the semantics of roles: a major portion of a sentence is usually covered by semantic roles, which are mostly multi-token, so very few words remain to be substituted as single-token roles. This configuration substitutes all noun tokens except those that have already been substituted as roles. To determine whether a word is a noun, we used the part-of-speech tags predicted during the pre-processing phase of the parser. We augmented only a fraction of the sentence tokens as nouns, experimenting with 10%, 30%, and 50% of sentence tokens.

For all train datasets, each annotation of the seed dataset was augmented with two more annotations (\(k=2\)) unless mentioned otherwise, yielding an augmented training dataset approximately three times larger. See Tables 14 and 15 for statistics of the augmented train datasets under various configurations using BERT as the substitution model. We constructed datasets with expansion of a single word type (lexical unit, role, or noun) and then with all types combined. When all three word types are expanded, the order of expansion is always the lexical unit, followed by the role and then the noun; this ensures that a given word is augmented only once even if it belongs to multiple word types (a code sketch of this procedure is given below).
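The sketch below illustrates this expansion step under stated assumptions: token indices for the lexical unit, single-token roles, and candidate nouns are assumed to be precomputed, and `substitute()` stands in for any of the lexical substitution models compared in this work.

```python
# Hedged sketch: collect substitution targets in the fixed order
# lexical unit -> role -> noun (each token at most once), then emit k augmented copies.
def collect_targets(tokens, lu_idx, role_idxs, noun_idxs, noun_fraction=0.5, rng=None):
    targets, seen = [], set()
    for kind, idxs in (("lexical_unit", [lu_idx]),
                       ("role", role_idxs),
                       ("noun", noun_idxs)):
        if kind == "noun":
            # only a fraction of the sentence tokens are expanded as nouns
            k = int(noun_fraction * len(tokens))
            idxs = rng.sample(idxs, min(k, len(idxs))) if rng else idxs[:k]
        for i in idxs:
            if i not in seen:               # a token is substituted only once
                seen.add(i)
                targets.append((i, kind))
    return targets

def expand(tokens, targets, substitute, k=2):
    copies = [list(tokens) for _ in range(k)]
    for i, kind in targets:
        candidates = substitute(tokens, i, kind)   # post-processed substitute list
        # the j-th best substitute goes into the j-th copy; copies keep the
        # original word if fewer than k candidates survive post-processing
        for cand, copy_ in zip(candidates[:k], copies):
            copy_[i] = cand
    return copies
```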

Table 14 Statistics of the train datasets augmented from seed dataset AnnotationPerSentence-Verbs, for different configurations
Table 15 Statistics of the train datasets augmented from the seed dataset AnnotationPerSentence-Nouns, for different configurations

5.1.5 Post-processing

The list of substitutes produced by the lexical substitution model was post-processed before the final augmentation. Some of these post-processing steps are common to all word types, such as the removal of noisy words, duplicates, and seed words, while the word-type-specific steps are as follows (a code sketch of these filters is given after the list):

  • lexical unit: substitutes for lexical units were filtered according to their gold annotations (the frame parser cannot predict a frame for a target not present in the FrameNet lexicon). The final substitutes were lemmatized and then inflected to match the tense of the substituted lexical unit. We use the lemminflectFootnote 9 library as the inflection engine.

  • role: substitutes for roles were also filtered according to their gold annotations and against a basic list of stop-words.

  • noun: substitutes for nouns were filtered against a basic list of stop-words, with digits removed and a minimum length of two characters enforced. A final filtering based on part-of-speech tags retained only nouns. The resulting list was lemmatized and inflected to match the singular or plural form of the substituted noun. For lemmatization and part-of-speech tagging, we used the NLTK library.
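A minimal sketch of these per-word-type filters is shown below, assuming hypothetical helper names; `substitutes` is the raw candidate list, `gold_lus` the set of lemmas allowed for the annotated frame, and the relevant NLTK data (stopwords, POS tagger) is assumed to be downloaded.

```python
# Illustrative post-processing for lexical-unit and noun substitutes.
import nltk
from nltk.stem import WordNetLemmatizer
from lemminflect import getLemma, getInflection

STOPWORDS = set(nltk.corpus.stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def postprocess_lu(substitutes, original_tag, gold_lus):
    out = []
    for w in substitutes:
        lemma = getLemma(w, upos="VERB")[0]
        if lemma not in gold_lus:                 # keep only LUs known to FrameNet
            continue
        inflected = getInflection(lemma, tag=original_tag)   # match the original tense
        if inflected:
            out.append(inflected[0])
    return out

def postprocess_noun(substitutes, plural=False):
    out = []
    for w in substitutes:
        if w.lower() in STOPWORDS or w.isdigit() or len(w) < 2:
            continue
        if not nltk.pos_tag([w])[0][1].startswith("NN"):     # keep nouns only
            continue
        lemma = lemmatizer.lemmatize(w, pos="n")
        forms = getInflection(lemma, tag="NNS" if plural else "NN")
        out.append(forms[0] if forms else lemma)
    return out
```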

After substituting all target words, the augmented sentence was parsed again for part-of-speech tags. Tables 14 and 15 provide the total number of annotations for the different configurations for both seed datasets of verbs and nouns. As expected, the datasets augmented with just lexical units and roles are the smallest, because for lexical units the list of final substitutes can be empty if no substitute matches the gold set of the annotated frame, and for roles a sentence may contain no single-token roles. The datasets where all nouns were augmented are larger than all other configurations. A few examples of sentences taken from AnnotationPerSentence-Verbs and augmented using one of these configurations are given in Table 16.

Table 16 Examples of expansions using the configuration of lexical unit-roles-nouns-50pc for XLNet+embs and BERT as lexical substitution models

5.2 Examples of augmented sentences

Table 16 shows a few examples of augmentation results along with the original seed sentences; the seed dataset here is AnnotationPerSentence-Verbs. In each sentence, the substituted words and phrases are highlighted according to their word type: lexical unit, role, or noun. As mentioned previously, only single-token roles are augmented. For each seed sentence, we produce augmentations using the top two substitutes from the final post-processed list; these examples were produced with the configuration lexical unit-roles-nouns-50pc. In some cases, the quality of substitutes for roles and nouns is less reliable with respect to the overall semantics of the sentence. This is especially true for roles, since their gold dataset is limited to FrameNet annotations and, unlike lexical units, the elements of these gold sets can be semantically very different from each other; one symptom is pronouns being predicted as substitutes for nouns. Nevertheless, substitutes for lexical units and nouns are plausible in most cases and preserve the meaning of the sentence.

5.3 Results with the open-SESAME parser

Hyperparameters: Optimal hyperparameters for the argument identification model are presented in Swayamdipta et al. (2017). However, the hyperparameters for the frame and target identification models are omitted there. In our experiments, we used the default values defined in the source code of the parser for everything except the maximum number of epochs. For target and frame identification, we use 100 epochs with an early-stopping patience of 25 epochs; for argument identification, we use 10 epochs with an early-stopping patience of 3 epochs. We use these default values to determine the total number of training steps for the seed datasets. Since the augmented datasets are three times larger than the seed datasets, we train on them for the same number of steps as on the corresponding seed dataset and model, keeping the training time comparable across all of them. For the seed datasets of AnnotationPerSentence-Verbs, this gives 274,600 steps for the target and frame identification models and 27,460 steps for the argument identification models; for the seed datasets of AnnotationPerSentence-Nouns, it gives 299,600 and 29,960 steps, respectively. This reduces any bias in model performance that would stem from the larger dataset size and additional training iterations. The final model was selected according to the best \(F_{1}\) score on the dev dataset during training. To compensate for variance in model performance due to random weight initialization, all experiments were run 10 times, and the mean and standard deviation of the \(F_{1}\) score are reported for both BERT and XLNet+embs. In addition, we calculate p values for a paired Student's t-test to determine the statistical significance of performance differences between models trained on augmented datasets and models trained on the seed datasets. The null hypothesis assumes that both models perform similarly and that any difference in their mean performance is not supported statistically. P values are compared against a 99% confidence level: a p value below 0.01 supports the alternative hypothesis that the models perform differently and that this difference is statistically significant.
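The sketch below illustrates this significance test on the 10 per-run \(F_{1}\) scores of a seed-trained and an augmentation-trained model; the score values are made-up placeholders.

```python
# Paired Student's t-test over per-run F1 scores, rejecting the null at p < 0.01.
from scipy.stats import ttest_rel

seed_f1      = [48.1, 47.5, 49.0, 48.3, 47.9, 48.8, 48.0, 47.7, 48.5, 48.2]
augmented_f1 = [50.2, 49.6, 50.8, 49.9, 50.1, 50.5, 49.8, 50.0, 50.3, 49.7]

t_stat, p_value = ttest_rel(augmented_f1, seed_f1)
# Null hypothesis: both models perform the same on average.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant = {p_value < 0.01}")
```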

Table 17 The performance of the frame-semantic parser by Swayamdipta et al. (2017) for the target identification model in terms of the \(F_{1}\) score: TargetId – Verbs
Table 18 The performance of the frame-semantic parser by Swayamdipta et al. (2017) for the target identification model in terms of the \(F_{1}\) score: TargetId – Nouns
Table 19 The performance of the frame-semantic parser by Swayamdipta et al. (2017) for the frame identification model in terms of the \(F_{1}\) score: FrameId – Verbs
Table 20 The performance of the frame-semantic parser by Swayamdipta et al. (2017) for the frame identification model in terms of the \(F_{1}\) score: FrameId – Nouns

Tables 17 and 18 summarize the results of the target identification models, and Tables 19 and 20 summarize the performance of the frame identification models for the training datasets reported in Tables 14 and 15, respectively. For the AnnotationPerSentence-Nouns dataset, the TargetId model improved in multiple settings for BERT, with the highest gain where \(30\%\) of nouns were expanded (\(F_{1}\) \(=41.96\)). For the AnnotationPerSentence-Verbs dataset, it also scored better in multiple settings, with the highest gain where \(50\%\) of nouns were expanded (\(F_{1}\) \(=61.67\)). However, these differences in mean scores are not statistically significant (p > 0.01). We assume that the base dataset already contains sufficient examples per target on average, so further expansion does not help and even decreases performance in some cases. The high standard deviation also suggests that the original hyperparameters, such as the learning rate and dropout rate, are less than optimal for these datasets and should be tuned before drawing final conclusions. For the FrameId model, performance did not improve for any dataset augmented from AnnotationPerSentence-Verbs. For the datasets augmented from AnnotationPerSentence-Nouns, it is better in multiple cases for both BERT and XLNet+embs, but not statistically significantly so. The datasets where the lexical unit is not augmented performed better than those where it was augmented; in the latter case, \(F_{1}\) decreased. This is most probably because the augmented datasets only add new targets with the same frame: since no new frame is added to the training data, each new target receives just one example of its frame, which affects performance negatively. This drop in performance is statistically significant. Contrary to target identification, the standard deviation remained low (less than 1.5) for all models, which suggests that the hyperparameters are good enough to yield robust results.

The results for the ArgId model are reported in Tables 21 and 22. The \(F_{1}\) of the model on the augmented datasets improves for many of the configurations. For the verbs dataset, the highest \(F_{1}\) score is achieved with the expansion configurations lexical unit-roles-nouns-30pc for BERT (\(F_{1}\) = 50.00) and nouns-30pc for XLNet+embs (\(F_{1}\) = 49.47). For the nouns dataset, the highest \(F_{1}\) score is achieved with lexical unit-roles-nouns-50pc for both BERT (\(F_{1}\) = 65.10) and XLNet+embs (\(F_{1}\) = 65.31). The difference in performance on the augmented datasets is also statistically supported, with p values below 0.01, particularly for the datasets where all three types of words were augmented. Overall, expansion configurations comprising nouns performed better, as they produce more diverse sentences for training than the other configurations.

Table 21 The performance of the frame-semantic parser by Swayamdipta et al. (2017) for argument identification and labeling in terms of the \(F_{1}\) score: ArgId – Verbs
Table 22 The performance of the frame-semantic parser by Swayamdipta et al. (2017) for argument identification and labeling in terms of the \(F_{1}\) score: ArgId – Nouns

The negative results for target and frame identification indicate that using data augmentation to generate more training data is not always useful; it depends on the nature of the data and the task itself. Since we sample data per sentence, which is better suited to argument identification because each sentence occurs once in the seed dataset, this strategy does not seem useful for frame and target identification, where the seed data already contains a sufficient average number of annotations per instance (see Tables 14, 15). Had the data instead been sampled per frame and target, augmentation would have helped these tasks, which is what we observed in our initial set of experiments: with that sampling strategy, frame and target identification does benefit from data augmentation during frame parsing. Dementieva et al. (2020) reported similar findings for the task of propaganda detection. Similar to our choice of word types, Dementieva et al. (2020) augmented nouns, adjectives, adverbs, and verbs using GloVe, fastText, and BERT as substitution models to generate more training sentences. Their experiments with many different settings showed a slight shift in precision and recall, while the \(F_{1}\) score did not improve except very slightly in two cases. Another work, by Fenogenova (2021), used a fine-tuned mT5 (Xue et al., 2021) model for paraphrasing to generate augmented data for sentiment analysis, textual entailment, and question answering in Russian. They likewise reported that for all three tasks the model performance remained nearly the same with the original and the augmented training datasets.

5.3.1 Effect of train dataset size over model performance

To further validate the performance of all models against any bias in the seed dataset construction, and to see the effect of seed dataset size on model performance, we trained the two best models on multiple seed datasets. All seed datasets were constructed by randomly sampling N% of the training examples from the verbs and nouns datasets, with N set to 10, 20, 30, 40, 50, and 100. Each seed dataset was further augmented into two datasets using BERT and XLNet+embs as lexical substitution models. The two best expansion configurations were selected: lexical unit-roles-nouns-50pc and nouns-50pc. Models trained on the seed datasets use the same number of epochs as discussed in Sect. 5.3. For each model trained on an augmented dataset, the number of training steps was determined by the size of the corresponding seed dataset and model. As in the previous experiments, each experiment was run 10 times with different random seeds to obtain the mean and standard deviation for the curve.
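A hedged sketch of this learning-curve protocol is given below; `augment` and `train_and_eval` are hypothetical placeholders for the expansion step and the parser training/evaluation routine.

```python
# Sketch: for each sample size, draw N% of the train set, build its augmented
# counterpart, and train both under the same step budget derived from the seed size.
import random

SAMPLE_SIZES = [0.10, 0.20, 0.30, 0.40, 0.50, 1.00]
RUNS = 10

def run_learning_curve(train_annotations, epochs, augment, train_and_eval):
    results = {}
    for frac in SAMPLE_SIZES:
        scores_seed, scores_aug = [], []
        for run in range(RUNS):
            rng = random.Random(run)
            seed_subset = rng.sample(train_annotations,
                                     int(frac * len(train_annotations)))
            steps = epochs * len(seed_subset)      # same budget for both conditions
            augmented = seed_subset + augment(seed_subset, k=2)
            scores_seed.append(train_and_eval(seed_subset, max_steps=steps, seed=run))
            scores_aug.append(train_and_eval(augmented, max_steps=steps, seed=run))
        results[frac] = (scores_seed, scores_aug)
    return results
```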

Fig. 4

Evaluation of lexical expansion for the ArgId model over increasing size of the seed training dataset. The shaded region represents the standard deviation based on 10 runs of the model. The x-axis is in log scale. Source dataset: Verbs

Fig. 5

Evaluation of lexical expansion for the ArgId model over increasing size of the seed training dataset. The shaded region represents standard deviation based on 10 runs of the model. The x-axis is in log scale. Source dataset: Nouns

The learning curves are shown in Figs. 4 and 5. Both augmented datasets consistently improve model performance over their seed datasets, with an average gain of 2–3% in \(F_{1}\). For the datasets sampled from verbs, the difference in model performance between the seed and augmented datasets remains consistent and statistically significant for sample sizes larger than 10%, and this holds for both models regardless of the expansion configuration. For the datasets sampled from nouns, only the expansion configuration lexical unit-roles-nouns-50pc shows consistent gains across all sample sizes. This behavior can also be observed in Table 22, where the configurations with lexical unit-roles-nouns consistently outperform the ones where only nouns were expanded, whereas for verbs both types of configuration perform well overall. This provides an interesting insight into the behavior of verb and noun predicates when choosing the optimal expansion configuration for each. We can conclude that, as opposed to targets and frames, semantic roles form a more diverse set of words and hence are an ideal candidate for augmentation when data is insufficient.

5.4 Results with the BERT-based parser

Hyperparameters: As this model was originally designed for verb-type predicates, we only report results for the verbs-based datasets here. For the seed datasets, the model was trained for 50 epochs, while for the augmented datasets it was trained for 17 epochs, so that the number of training steps is the same for both. We used BERT-large-cased with a batch size of 8 and a learning rate of 2e−5. All models were run 10 times to obtain mean and standard deviation values.

For the BERT-based parser, we present learning curves and use both BERT and XLNet+embs as lexical substitution models for the augmented datasets. The learning curves for both models are shown in Fig. 6: the first row shows the performance with gold frames and the second row the performance without gold frames. The curves confirm that lexical expansion indeed yields performance gains when the number of annotations is insufficient. However, the gain starts to diminish as the seed dataset size increases towards the right of the x-axis; this is also supported by p values that are consistently below 0.01 for sample sizes of 10 to 30%. The gain in performance shows similar patterns with and without gold frame information, although using gold frames yields significantly higher scores, with \(F_{1}\) going above 70.0 for all models and datasets, compared to predicted frames, where it remains close to 65.0. These scores are considerably higher than those of the open-SESAME parser on the same datasets and show the advantage of pre-trained Transformer models, which learn the syntax and semantics of a sentence, over syntactic features such as part-of-speech tags. There is no clear winner in the comparison of lexical substitution models: BERT and XLNet+embs perform similarly.

Fig. 6

Evaluation of lexical expansion for the BERT-based semantic role parser for the ArgId model over increasing size of the seed training dataset. The first row shows the performance using gold frames and the second row shows the combined performance where the first step is to predict the frames and then do the argument identification. The shaded region represents the standard deviation based on 10 runs of the model. The x-axis is in log scale. Source dataset: Verbs

For nouns, an extensive set of experiments could not reproduce the results obtained for verbs. In general, the augmented datasets brought no improvement, and where improvements did appear, they were not consistent and the variance between multiple runs of the model was excessive.

6 Conclusion

In this work, we performed a study of text augmentation methods for semantic frame processing based on (i) non-contextualized distributional models such as word2vec and syntax-based distributional thesauri, and (ii) contextualized lexical substitution methods based on neural language models, such as BERT and XLNet. We tested these methods in two extensive experimental setups.

In the first set of experiments, we generated lexical representations of semantic frames. We demonstrated that a single frame-annotated example can be used to bootstrap a fully-fledged lexical representation of FrameNet-style linguistic structures. Non-contextualized models proved to be strong baselines but failed to produce good substitutes for polysemous words (the same word evoking different semantic frames), whereas the contextualized models BERT and XLNet produced competitive substitutes, especially when information about the target word is injected effectively. Additionally, our experiments show that combining individual models to generate lexical substitutes can sometimes significantly improve over their individual performance.

Since automatic evaluation of lexical substitution is sensitive to the completeness of the lexical resource itself, we also manually evaluated these substitutes on small datasets to further analyse the effectiveness of our method. This evaluation showed that, on the one hand, suitable lexical substitutes are sometimes absent from the gold datasets, while on the other hand, the substitutes that are present are not always good candidates for lexical substitution, since they can alter the sentence semantics.

In our second set of experiments, we deal with two neural FrameNet parsers, by Swayamdipta et al. (2017) and Shi and Lin (2019). We demonstrate that text augmentation can be used to build more training samples from a few seed sentences, and that these new frame representations help to improve the performance of semantic parsers on the semantic role identification and labeling tasks. These experiments suggest that expansion of roles (usually represented by nouns and noun phrases) and of other nouns occurring in the text significantly improves the performance of semantic parsing, while expansion of verbs, which is arguably harder because verbs do not have as many close co-hyponyms and synonyms, does not improve parsing results.

Overall, our results suggest that augmentation of lexical units can be of great use (i) for expanding lexical representations of semantic frames and (ii) for building semantic parsers that perform role identification in text, especially in situations where the number of training texts is small.

7 Future work

Going forward, we can expect further improvements from large foundation models like T5 (Raffel et al., 2020), BART (Lewis et al., 2020), FlanT5 (Longpre et al., 2023) and other pre-trained seq2seq Transformers, especially those pre-trained with multi-word masking objectives, which could help restore multiword expressions accurately. Experimenting with further contextualized lexical substitution methods, such as nPIC/PIC (Roller & Erk, 2016), may also yield improvements in the combined methods.

While large pre-trained language models are increasingly able to perform tasks in an end-to-end fashion, seemingly removing the need to explicitly expand lexical-semantic resources for natural language understanding and generation tasks, there are still fields where lexical resources, with examples and their sources, are key to answering research questions or to productive work, e.g. the study of the structure of semantics, the creation of dictionaries, and controlled experiments in psycholinguistics and other fields. With our automatic expansion approach, we provide a method to aid the quicker development of such lexical resources in these situations, especially for under-resourced languages and domains.