1 Introduction

Terms are textual expressions that denote concepts in a specific field of expertise. They are beneficial for several terminographical tasks performed by linguists (e.g., the construction of specialized terminological dictionaries (Le Serrec et al., 2010)). Moreover, terms can also support and improve several downstream natural language processing (NLP) tasks (e.g., topic detection (ElKishky et al., 2014), information retrieval (Lingpeng et al., 2005), machine translation (Wolf et al., 2011)). To reduce the time and effort needed to manually identify terms in domain-specific corpora, automatic term extraction (ATE) approaches have been proposed.

The TermEval 2020 shared task, organized as part of the CompuTerm workshop (Rigouts et al., 2020a), presented one of the first opportunities to systematically study and compare several ATE architectures with the introduction of the Annotated Corpora for Term Extraction Research (ACTER) dataset (Rigouts et al., 2020a, b). While the workshop was a significant step forward in systematic comparison, less-resourced languages (e.g., Slovenian) have not yet been sufficiently explored and remain a research gap. Furthermore, there is still room for improvement in performance. In our previous study (Tran et al., 2022a), the error analysis showed that the two most common errors made by the tested classifiers were predicting a shorter term nested inside a ground-truth term and, vice versa, predicting a term not covered by the ground truth that contains a nested ground-truth term. This insight leads to the hypothesis that the widely used BIO labeling regime (Hazem et al., 2020) is insufficient, since it does not allow labeling nested terms and thus deprives the model of the information needed to avoid these mistakes.

Inspired by the success of Transformers (Hazem et al., 2020) and the rise of cross-lingual learning (Lang et al., 2021), our research investigates the effectiveness of XLMR (Conneau et al., 2020) in multilingual and cross-lingual scenarios. First, having a single model that works across several languages is important, as it can also be used for languages not seen during training. Instead of constructing language-specific models, multilingual and cross-lingual models can be directly applied to any new language supported by XLMR. In addition, for languages where data is available, a single model is a much simpler solution than many language-specific models, and it can also make the models less dataset-specific.

Our approach frames the ATE task as a sequence-labeling problem, as this strategy has proven successful in various NLP tasks such as Named Entity Recognition (NER) (Lample et al., 2016; Tran et al., 2021) and Keyword Extraction (Martinc et al., 2021). Additionally, we extend our previous work (Tran et al., 2022a) by introducing a novel nested term labeling mechanism, incorporating two extra labels for single nested terms, and rigorously evaluating the model’s performance in cross-lingual and multilingual settings. This comprehensive exploration showcases the ability of a multilingual pretrained language model in cross-lingual and multilingual settings to capture diverse linguistic nuances. The experiments are conducted in the cross-domain setting on the ACTER dataset containing texts in four domains (Corruption, Wind energy, Equitation, and Heart failure) and three languages (English, French, and Dutch), and on the RSDO5 corpus (Jemec Tomazin et al., 2021) containing Slovenian texts from four domains (Biomechanics, Chemistry, Veterinary, and Linguistics).

The main contributions of this paper can be summarized as follows:

  • We propose a new NOBI annotation mechanism to better capture single nested terms. When a dataset contains a substantial proportion of nested terms, the new labeling regime improves the Recall of the models by a large margin, which also leads to further improvements in the F1-score. This is also the main novelty compared to the shorter conference version (Tran et al., 2022a) of this paper.

  • We systematically evaluate the performance of XLMR on the cross-domain term extraction task on two datasets covering English, French, Dutch, and the less-resourced Slovenian, in both the standard BIO and the novel NOBI scheme.

  • We compare the performance of cross-lingual, multilingual, and monolingual approaches to determine the general applicability of multilingual language models for sequence labeling in both rich- and less-resourced languages. Both the BIO and NOBI annotation regimes are considered.

2 Related work

The history of ATE dates back to the 1990s with the research of Damerau (1990) and Justeson and Katz (1995). ATE systems usually employ a two-step procedure: (1) extracting a list of candidate terms, and (2) determining which candidate terms are correct.

2.1 Approaches based on term characteristics

Traditional approaches relied on distinctive linguistic aspects of terms to extract possible candidates. Several NLP tools (e.g., tokenization, lemmatization, stemming, PoS tagging) are employed to obtain linguistic profiles of term candidates. Since this approach is heavily language-dependent, the quality of linguistic methods depends directly on the quality of the pre-processing tools (e.g., FLAIR (Akbik et al., 2019), Stanza (Qi et al., 2020)). More recent studies preferred the statistical approach, which commonly relies on the assumption that a higher candidate term frequency in a domain-specific corpus implies a higher likelihood that a candidate is an actual term. Measures relying on this assumption include termhood (Vintar, 2010), unithood (Daille et al., 1994), and C-value (Frantzi et al., 1998). More advanced statistical approaches also compared the frequency of the words inside a term to the frequency of the term itself, in order to identify rare terms and filter out frequent words. Many current systems still apply variations of this approach, or hybrid mechanisms that combine linguistic and statistical information (Kessler et al., 2019; Repar et al., 2019).

2.2 Approaches based on machine learning and deep learning

Recent advances in representation learning and deep neural networks have also influenced term extraction. Several embedding techniques have been investigated for the task at hand, e.g., uni-gram (Amjadian et al., 2016), non-contextual (Zhang et al., 2018), contextual (Kucza et al., 2018) word embeddings, and hybrid ones (Gao & Yuan, 2019). The first use of language models for the ATE task was in TermEval 2020 (Rigouts et al., 2020a), where the winning approach on the Dutch corpus used a BiLSTM-based neural architecture with GloVe embeddings, while the winning solution on the English corpus (Hazem et al., 2020) extracted all possible n-gram combinations and fed them into a BERT binary classifier that determines, for each n-gram inside a sentence, whether it is a term. In addition, several Transformer-based variants have also been investigated (e.g., RoBERTa, CamemBERT (Hazem et al., 2020)). Further work includes HAMLET (Terryn et al., 2021), a hybrid adaptable machine learning classifier that combines linguistic and statistical clues to detect terms.

Recently, sequence-labeling and cross-lingual approaches toward ATE have been gaining traction. Kucza et al. (2018) were among the first to model term extraction as a sequence-labeling task. Cross-lingual sequence labeling was, on the other hand, explored by Conneau et al. (2020), Lang et al. (2021), Hazem et al. (2022), and Tran et al. (2022a), who took advantage of XLMR, the model we also employ in this work. Lang et al. (2021) compared different cross-lingual approaches, including a sequence classifier and a token classifier for this sequence-labeling task, and further proposed a sequence-to-sequence (seq2seq) approach, which used mBART (Liu et al., 2020) to generate sequences of comma-separated terms from the input. The results demonstrate the capability of multilingual models to outperform monolingual ones in some specific scenarios and the potential of cross-lingual learning.

Finally, in our conference paper (Tran et al., 2022a), which we extend in this journal paper, we leveraged the multilingual setup by fine-tuning the model on training datasets from several languages and then applying it to each language’s test set separately. By doing so, we examined whether adding data from other languages to the training set that matches the target language in the test set improves the model’s predictive performance. After adding the Slovenian corpus to the ACTER training set, our multilingual model demonstrated a significant improvement in Recall across all test languages compared to the monolingual one.

2.3 Approaches for Slovenian term extraction

For Slovenian, the language used in our study, and for less-resourced languages in general, research is still hindered by the lack of gold standard corpora and the limited use of neural methods. Things are nevertheless slowly improving. In recent years, the Slovenian KAS corpus was compiled (Erjavec et al., 2021), quickly followed by another corpus designed for term extraction, the RSDO5 corpus. Regarding the methods, Vintar (2010) was one of the first to propose statistical approaches for Slovenian ATE. After that, Ljubešić et al. (2019) introduced a hybrid approach, in which the initial candidate terms are extracted using the CollTerm tool (Pinnis et al., 2019), a rule-based system employing a complex language-specific set of term patterns from the Slovenian SketchEngine (Fišer et al., 2016). Meanwhile, Repar et al. (2019) focus on term extraction and alignment, where the novelty is an evolutionary algorithm for term alignment.

Deep neural approaches have not yet been sufficiently explored for Slovenian data. The only neural approach towards Slovenian ATE was proposed in our recent study (Tran et al., 2022b). There, we implemented the Transformer-based sequence-labeling approach, which we extend in this study with a cross-lingual and multilingual evaluation. Another problem is that no open-source code is available for most current benchmark systems, hindering their reproducibility (for Slovenian, only the code for the methods of Ljubešić et al. (2019) and Tran et al. (2022b) is available).

2.4 Extraction of nested terms

In many practical applications, terms commonly have a nested structure, where a term can contain other terms or be part of other terms. Vintar (2004) first suggested ranking and/or discarding nested terms using the C-value, but the results were unsatisfactory. Marciniak and Mykowiecka (2015) later identified them by combining grammatical correctness and normalized pointwise mutual information (NPMI) based on bigrams in a corpus. However, this method’s efficiency relies heavily on corpus features (e.g., size, thematic homogeneity, and phrase frequency). More recently, Gao and Yuan (2019) proposed an end-to-end architecture that employs classification and ranking of n-gram candidates in text sequences. Nonetheless, this method suffers from reduced Recall due to ranking, and its threshold-based output does not transfer to new, unseen domains. Since then, no further methods have been proposed, leaving a gap in nested term extraction.

Regarding other downstream NLP tasks that share the same mechanism (e.g., NER, Keyword Extraction), besides the common sequence tagging schemes for both flat and nested entities (e.g., BIO (Ramshaw & Marcus, 1999), IOBES (Lester, 2020), BMEWO (Ratinov & Roth, 2009), BILOU (Ratinov & Roth, 2009)), the methods for capturing nested entities can be categorized into four main types: (1) sequence labeling, (2) hypergraph-based, (3) sequence-to-sequence (Seq2Seq), and (4) span-based methods. However, apart from the BIO regime in the sequence-labeling approach, none of them has yet been applied to term extraction.

3 Methodology

Section 3.1 presents a brief description of our chosen datasets. We describe the general methodology, experimental setup, and implementation details in Sects. 3.2 and 3.3. Finally, in Sect. 3.4, we present our chosen evaluation metrics.

3.1 Datasets

The experiments were conducted on ACTER (Rigouts et al., 2020a) and RSDO5 version 1.1 (Jemec Tomazin et al., 2021), both comprising texts from diverse languages and domains. The ACTER dataset is a collection of 12 corpora covering four domains (Corruption (corp), Equitation (equi), Wind energy (wind), and Heart failure (htfl)) in three languages (English (en), French (fr), and Dutch (nl)). The dataset has two types of gold standard annotations: one containing both terms and named entities (NES), and one containing only terms (ANN). The second dataset is RSDO5 version 1.1 (Jemec Tomazin et al., 2021), which contains texts in Slovenian (sl), a less-resourced Slavic language with rich morphology. The corpus contains 12 documents collected from 2000 to 2019, covering the domains of Biomechanics (bim), Chemistry (kem), Veterinary (vet), and Linguistics (ling). The data analysis is presented in Figs. 14, 15, 16, 17 and 18 and Table 7 in the Appendix.

3.2 Experimental setup

We consider ATE as a sequence-labeling task where the model returns a label for each token in a text sequence, using two different labeling regimes: the benchmark BIO labeling scheme (Lang et al., 2021; Rigouts et al., 2021) and our novel annotation scheme called NOBI. In the BIO regime, B marks the beginning word of a term, I marks a word inside a term, and O marks a word that is not part of any term. The terms from a gold standard list are first mapped to the tokens in the raw text, and each word inside the text sequence is annotated with one of the three labels (see the upper example in Fig. 1). However, this scheme is not suited to nested term extraction. Thus, we propose NOBI, an annotation regime with two additional labels, BN and IN, where N marks nested single-word terms that appear at the beginning (BN) or inside (IN) a longer term (see the lower example in Fig. 1).

In Fig. 1, the gold standard contains the following terms: “stent”, “bottleneck stent”, “myocardial”, “infarction”, “myocardial infarction”, etc. In the BIO regime, we ignore single nested terms; thus, we only mark “bottleneck” as the beginning (B) and “stent” as the inside (I) of the full term “bottleneck stent”. Similarly, “myocardial” is the beginning (B), and “infarction” is the inside (I) of the full term “myocardial infarction”. However, in the NOBI regime, we consider “bottleneck stent” and “stent” as two different terms, where “stent” is a term nested inside “bottleneck stent”, in contrast to the BIO scheme, where the model extracts only “bottleneck stent” as a term. Similarly, “myocardial” and “infarction” are two separate terms nested in “myocardial infarction”. Therefore, an additional label N is added to the labels of “stent”, “myocardial”, and “infarction”.
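To make the labeling procedure concrete, the following minimal sketch assigns NOBI labels to a tokenized sentence given a gold-standard term list, using greedy longest matching and the single-word nested-term rule described above. It is an illustrative Python sketch, not the released implementation; the function name and the matching strategy are our assumptions.

```python
# Minimal sketch of NOBI labelling for one tokenized sentence.
# `gold_terms` is a lower-cased gold-standard term list; helper names are ours.

def label_nobi(tokens, gold_terms, max_len=6):
    tokens_lc = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)

    i = 0
    while i < len(tokens):
        # 1) Greedy longest-match BIO labelling against the gold standard.
        match = 0
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens_lc[i:i + n]) in gold_terms:
                match = n
                break
        if match:
            labels[i] = "B"
            for j in range(i + 1, i + match):
                labels[j] = "I"
            # 2) Single-word gold terms nested in a multi-word term: B->BN, I->IN.
            for j in range(i, i + match):
                if match > 1 and tokens_lc[j] in gold_terms:
                    labels[j] += "N"
            i += match
        else:
            i += 1
    return labels

tokens = ["bottleneck", "stent", "was", "implanted"]
gold = {"bottleneck stent", "stent"}
print(label_nobi(tokens, gold))   # ['B', 'IN', 'O', 'O']
```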

Fig. 1 An example of BIO and NOBI annotation regimes in the ACTER corpus

We do not consider multi-word nested terms or terms nested in other nested terms, i.e., nested terms on the second or higher levels, due to their rarity in the corpora and gold standards (see the nested term frequencies in the gold standards in Figs. 16 and 18 in the Appendix). Despite the difference in the number of terms per language and domain, the percentage of unique nested terms is fairly consistent across all languages and domains, at around one-third of the total unique terms in the gold standards. However, terms nested in other nested terms account for only about one-tenth and one-twelfth of all unique terms in the two corpora, respectively, and the proportions are even smaller when computed per nesting level (e.g., second level, third level). Table 7 in the Appendix also shows the proportion of nested terms with different word lengths k, where \(k \in \{1, 2, 3, 4, \ge 5\}\), for each domain and language of both corpora. The rightmost column gives the percentage of single-word nested terms among all first-level nested terms. On average, single-word nested terms account for 78.06% of all first-level nested terms in the corpora. Therefore, we only label single-word nested terms on the first level.
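As an illustration of how such statistics can be derived from a gold-standard term list, the sketch below counts first-level nested terms (terms that occur as whole-word substrings of longer gold terms) and the share of single-word terms among them. The helper name and whitespace tokenization are our assumptions; the published counts in Table 7 may have been computed differently.

```python
# Sketch: share of single-word nested terms among first-level nested terms
# in a gold-standard list (whitespace-tokenized term strings); names are ours.

def nested_stats(gold_terms):
    terms = sorted({t.lower() for t in gold_terms})
    nested = [t for t in terms
              if any(t != other and f" {t} " in f" {other} " for other in terms)]
    single_word = [t for t in nested if len(t.split()) == 1]
    share = len(single_word) / len(nested) if nested else 0.0
    return len(nested), len(single_word), share

gold = ["stent", "bottleneck stent", "myocardial", "infarction",
        "myocardial infarction", "heart failure"]
print(nested_stats(gold))   # (3, 3, 1.0): stent, myocardial, infarction are nested
```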

For both labeling regimes, we experiment with XLMR, a Transformer-based model pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages. The model is first trained to predict a label for each token in the input text sequence (i.e., we model the task as token classification) and then applied to the unseen text (test data). Finally, from the tokens or token sequences labeled as terms, the final candidate term list for the test data is composed. Note that when the NOBI annotation regime is used, the terms labeled with BN and IN are added to the final term list separately, together with the terms in which they are nested.
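The step from per-word predictions to the candidate term list can be sketched as follows; under the NOBI regime, every word carrying an N label is additionally emitted as a single-word candidate, alongside the longer term it is nested in. The decoding function is our illustration of this procedure, not the released code.

```python
# Sketch: turn per-word NOBI predictions into a candidate-term set.
# Works for BIO as well (no BN/IN labels then); helper names are ours.

def decode_terms(tokens, labels):
    candidates, current = set(), []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B"):            # B or BN starts a new term
            if current:
                candidates.add(" ".join(current).lower())
            current = [tok]
        elif lab.startswith("I"):          # I or IN continues the current term
            current.append(tok)
        else:                              # O closes the current term, if any
            if current:
                candidates.add(" ".join(current).lower())
            current = []
        if lab.endswith("N"):              # nested single-word term, added separately
            candidates.add(tok.lower())
    if current:
        candidates.add(" ".join(current).lower())
    return candidates

print(decode_terms(["bottleneck", "stent", "was", "implanted"],
                   ["B", "IN", "O", "O"]))
# {'bottleneck stent', 'stent'}
```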

We evaluate the cross-domain performance of XLMR in monolingual, cross-lingual, and multilingual settings. Altogether, 78 different scenarios per annotation regime are tested. The distinct settings are described below.

  1. Monolingual setup. We evaluate how well the model performs when a language-specific training corpus is available and the language of the train set matches the language of the test set. For better comparison with other existing approaches, we apply the same configuration as in the TermEval 2020 shared task, where the Heart failure domain of each language serves as the test set. Thus, we fine-tune our model on a single language, i.e., we train three monolingual models for the three languages (English, French, Dutch) and test each model on the same language, for each annotation regime. In addition, for Slovenian we train 12 monolingual models per annotation regime, corresponding to the 12 different train-validation-test domain combinations.

  2. Cross-lingual setup. We evaluate the capability of the model to apply the knowledge learned in one or more languages to ATE in another, unseen language. Therefore, we fine-tune the ATE model on one or more languages (e.g., English and Dutch) and test it on another language not appearing in the train set (e.g., French). In this scenario, we examine how well the model performs without a language-specific training corpus and how well knowledge transfers between different languages.

  3. Multilingual setup. We fine-tune our model using (a) training datasets from the languages in the ACTER dataset (English, French, and Dutch), or (b) training datasets from the languages in the ACTER dataset plus the Slovenian training dataset from the RSDO5 corpus, and then apply the model to the test sets of all languages in the ACTER dataset. By doing so, we examine whether adding data from other languages to the training set, in addition to the data in the target language, improves the predictive performance of the model.

All three settings are applied in a cross-domain evaluation scenario, where we use two domains for training, another domain for validation, and the remaining domain for testing. One exception is the multilingual and cross-lingual settings with the additional Slovenian corpus in the training set, where we use two domains from the ACTER corpora and all domains from the RSDO5 corpus for training. This way, we can evaluate the model’s ability to generalize knowledge from one or more domains to a new, unseen domain, which makes the approach much more useful in practice. In the ACTER dataset, we use the Corruption and Wind energy domains for training, the Equitation domain for validation, and the Heart failure domain for testing, in order to allow for a direct comparison with other benchmark approaches from the related work, which employ the same train-validation-test setting (Lang et al., 2021). Meanwhile, in the RSDO5 corpus, we explore different train-validation-test combinations.
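To make the splits concrete, the illustrative configuration below spells out the ACTER cross-domain split used here (Corruption and Wind energy for training, Equitation for validation, Heart failure for testing) and the optional addition of the four Slovenian RSDO5 domains to the training set in the multilingual setting with Slovenian. The data structures and helper names are our assumptions, not part of the released code.

```python
# Illustrative cross-domain split for the ACTER experiments; the Slovenian
# RSDO5 domains are appended to the training set only in the multilingual
# setting with Slovenian. Names are placeholders, not the released code.

ACTER_SPLIT = {
    "train":      [(lang, dom) for lang in ("en", "fr", "nl")
                   for dom in ("corp", "wind")],
    "validation": [(lang, "equi") for lang in ("en", "fr", "nl")],
    "test":       [(lang, "htfl") for lang in ("en", "fr", "nl")],
}

RSDO5_EXTRA_TRAIN = [("sl", dom) for dom in ("bim", "kem", "vet", "ling")]

def training_domains(add_slovenian=False):
    """Return the (language, domain) pairs used for fine-tuning."""
    train = list(ACTER_SPLIT["train"])
    if add_slovenian:
        train += RSDO5_EXTRA_TRAIN
    return train
```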

We divide the dataset into train-validation-test splits. The model is fine-tuned on the training set to predict, for each word in a word sequence, whether it is part of a term (B, I), a nested term (BN for nested terms at the beginning of a multi-word term, IN for nested terms at non-beginning positions of a multi-word term), or not part of a term (O). To do that, an additional token classification head containing a feed-forward layer with a softmax activation is added on top of each model.
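A minimal sketch of this setup with the Hugging Face Transformers library is shown below, mapping the five NOBI labels (the BIO setting uses only the first three) to class ids and instantiating XLMR with a token classification head. We use the xlm-roberta-base checkpoint purely for illustration; the label names follow the scheme above, while the variable names are ours.

```python
# Minimal token-classification setup for the NOBI scheme with XLM-R,
# built on Hugging Face Transformers (a sketch, not the authors' code).

from transformers import AutoTokenizer, AutoModelForTokenClassification

NOBI_LABELS = ["O", "B", "I", "BN", "IN"]          # BIO uses only the first three
label2id = {label: i for i, label in enumerate(NOBI_LABELS)}
id2label = {i: label for label, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",                            # model size assumed for illustration
    num_labels=len(NOBI_LABELS),
    id2label=id2label,
    label2id=label2id,
)   # adds a linear classification head over the per-token hidden states
```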

3.3 Implementation details

We employ the XLMR token classifier from Huggingface. We fine-tune the model for up to 20 epochs (with early stopping based on the validation set) using a learning rate of 2e-05, a training and evaluation batch size of 32, and a sequence length of 512 tokens, since this hyperparameter configuration performed best on the validation set. First, the documents are split into sentences. Then, the sentences with more than 512 tokens are truncated, while those with fewer than 512 tokens are padded with a special \(<PAD>\) token at the end.
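The configuration described above can be sketched with the Hugging Face Trainer API as follows. The hyperparameters (20 epochs, learning rate 2e-05, batch size 32) come from the text; the early-stopping patience, the monitored metric (validation loss), and the prepared datasets are our assumptions, and the model is the token classifier from the previous sketch.

```python
# Sketch of the fine-tuning loop with early stopping (Hugging Face Trainer).
# `model` is the token classifier from the previous sketch; `train_dataset`
# and `val_dataset` are assumed to be tokenized with per-token label ids.

from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="xlmr-ate",
    num_train_epochs=20,                 # upper bound; early stopping may end sooner
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # keep the best checkpoint on validation
    metric_for_best_model="eval_loss",   # assumed; the paper does not name the metric
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience assumed
)
trainer.train()
```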

During fine-tuning, the model is evaluated on the validation set after each training epoch, and the best-performing model is applied to the test set. Note that the model trained with the BIO annotation regime predicts whether a word is part of a term (B, I) or not (O), while the model trained with the NOBI regime predicts in the same manner, with the additional information on first-level nested terms (BN, IN), where each word with an N label is also considered an individual single-word candidate term. The sequences identified as terms are extracted from the text and put into a set of all predicted candidate terms. A post-processing step lowercasing all candidate terms is applied before we compare the derived candidate list with the gold standard.

3.4 Evaluation metrics

We evaluate the performance of the ATE systems by comparing the candidate list extracted from the test set with the manually annotated gold standard term list for that specific test set. We use exact string matching to compare the retrieved terms to the ones in the gold standard and calculate Precision (P), Recall (R), and F1-score (F1). These evaluation metrics have also been used in related work (Hazem et al., 2020; Lang et al., 2021; Rigouts et al., 2020a; Ljubešić et al., 2019); therefore, our results are directly comparable to those benchmarks.
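The evaluation reduces to set operations over lower-cased term strings, as in the small sketch below; the function name and the example terms are ours.

```python
# Sketch of the exact-string-match evaluation against the gold standard.

def precision_recall_f1(candidates, gold):
    candidates = {c.lower() for c in candidates}
    gold = {g.lower() for g in gold}
    tp = len(candidates & gold)                       # exact matches
    p = tp / len(candidates) if candidates else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(precision_recall_f1({"stent", "heart failure", "device"},
                          {"stent", "heart failure", "myocardial infarction"}))
# roughly (0.667, 0.667, 0.667)
```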

4 Results

In this section, we assess the predictive power of monolingual, cross-lingual, and multilingual learning on the ACTER and RSDO5 test sets and compare the results of our proposed approaches to the state of the art (SOTA) from the related work.

4.1 Results on the ACTER test set

The performance of the XLMR classifier in terms of P, R, and F1 on the ACTER test set using the BIO and NOBI annotation regimes is presented in Table 1. The comparison between BIO and NOBI is indicated with arrows, where \(\uparrow\) marks better and \(\downarrow\) lower performance of NOBI in the same setting. Regardless of the annotation scheme, the results indicate that the cross-lingual and multilingual models tend to surpass the monolingual ones on both versions of the test data, the one excluding named entities (ANN) and the one including them (NES), according to all evaluation metrics. The exceptions are the Precision obtained by the French monolingual model on the French test set when the BIO scheme is used and by the Dutch monolingual model on the Dutch test set when the NOBI scheme is used.

Table 1 Evaluation on the ACTER dataset with Heart failure as the test set. Bold indicates the best model in terms of P, R, and F1 for each test set (ANN and NES) and each annotation scheme (BIO and NOBI) separately. The arrows compare NOBI to BIO for each setting, where \(\uparrow\) denotes better and \(\downarrow\) lower performance of NOBI compared to BIO. Blue indicates the best model in terms of F1 for each test set

Multilingual models tend to outperform cross-lingual ones in F1. However, multilingual models tend to lose Precision compared to monolingual and cross-lingual ones. After adding the Slovenian corpus with its four domains to the training set, the multilingual model demonstrates a significant improvement in Recall across all test languages compared to the monolingual setting. It also outperforms the other models in F1 on all three test sets and in both annotation schemes. However, this improvement comes at the cost of Precision.

Comparing the two annotation regimes, using the NOBI annotations improves the Recall of the model in many cases. This is especially visible in the monolingual and multilingual settings (see Figs. 2, 3, and 4), in which the training data includes the language of the test set, and in the cross-lingual settings, in which the models are trained on just one language and applied to the others, with the exception of the French test set. A substantial increase in Recall also tends to improve the overall F1.

Fig. 2 Parallel Coordinates Plot in performance of XLMR classifier for the English test set

Fig. 3 Parallel Coordinates Plot in performance of XLMR classifier for the French test set

Fig. 4 Parallel Coordinates Plot in performance of XLMR classifier for the Dutch test set

The best models across our combinations are the following: (1) for the English and French test sets, the best results were obtained with English, French, and Slovenian training data; and (2) for the Dutch test set, the best results were obtained with the multilingual classifiers trained on all four languages. In other words, the best-performing models on the ACTER dataset in both annotation regimes are multilingual XLMR classifiers trained on at least three languages, including Slovenian and the language of the test set. This showcases the advantage of a multilingual pretrained language model in a multilingual setting, using (1) English, French, and Slovenian or (2) all four languages as the training set, over a monolingual one in capturing diverse linguistic nuances. Additionally, the NOBI regime outperforms the BIO regime in most of the testing scenarios.

Table 2 F1 comparison between our XLMR classifier in multilingual settings and related work in ACTER corpora

In addition, we compare our results with the benchmarks in Table 2. For comparison, we include the solutions of the winning teams in the competition (TALN-LS2N (Hazem et al., 2020) won on the English and French test sets, while NLPLab UQAM (Le & Sadat, 2021) won on the Dutch test set) and other methods (Rigouts et al., 2021; Lang et al., 2021) described in Sect. 2. Note that all the approaches from the related work are (1) cross-domain and (2) use the Heart failure domain as the test set, which matches the setup of our experiments.

Our proposed classifiers, trained using either the BIO or the NOBI annotation regime, outperform the previously described benchmark approaches, showing significant performance gains as measured by F1. When comparing the two annotation schemes, the classifiers using the BIO regime achieve a higher F1 on the English NES gold standard, which includes named entities. However, the classifiers using the NOBI regime surpass all existing state-of-the-art (SOTA) models, including our BIO classifiers, across the languages in both the ANN and NES versions, with the exception of the aforementioned English NES corpus.

Furthermore, we conduct a multilingual evaluation to examine the impact of adding additional languages to the training set. In contrast to the findings of Lang et al. (2021), we observe that incorporating other languages generally leads to only marginal improvements in model performance.

4.2 Evaluation on the RSDO5 test set

We also apply the monolingual and multilingual cross-domain approaches to the Slovenian RSDO5 dataset. The results, grouped by the test domain and covering both the BIO and NOBI annotation regimes, are presented in Tables 3 and 4. For each annotation regime, we evaluate the monolingual setting and the multilingual setting, in which the ANN or NES version of the ACTER dataset is added to the training set of the RSDO5 corpus.

Table 3 Evaluation on the RSDO5 corpus with each domain as a test set in the monolingual setting. Bold indicates the best result for each test set. The comparison between BIO and NOBI, as well as the best model in terms of F1, are marked in the same way as in Table 1
Table 4 Evaluation on the RSDO5 corpus with each domain as a test set in the multilingual setting. In this setting, in addition to the Slovenian training data, the data from ACTER in en, fr, and nl is used, and the ANN and NES training sets are compared

The monolingual approach, where we use two domains from the RSDO5 corpus for training, validate on the third domain, and test on the last domain, shows relatively consistent performance across all combinations in both annotation regimes. For both regimes, we achieve a Precision of more than 61%, a Recall of no less than 55%, and an F1 above 57%. Furthermore, the models perform slightly better in the Linguistics and Veterinary domains than in Biomechanics and Chemistry. The differences in the number and length of terms per domain, pointed out in Sect. 3.1, might be among the factors contributing to this behavior. Moreover, a significant performance boost can be observed for the Veterinary domain when the model is trained on the Biomechanics and Linguistics domains, and for the Linguistics domain when the Veterinary domain is included in the training set, in both annotation regimes. Among these settings, the classifier with the BIO regime achieves up to 68.9% F1 on the Linguistics test set, which surpasses the other domains in the same regime and also outperforms all monolingual classifiers with the NOBI regime.

We also explore the performance of the multilingual approach on the RSDO5 test sets. We train the model using the ANN or NES labels from all domains of the ACTER dataset together with two domains from the RSDO5 dataset, validate on the third RSDO5 domain, and test on the last domain. Tables 3 and 4 present the comparative performance of the multilingual and monolingual approaches. The results show that the performance gains differ considerably across the different combinations of training, validation, and test sets. This suggests that transfer learning for ATE is domain-sensitive; a careful choice of the domains in the training set is therefore necessary for boosting the classifier’s performance.

We also compare the two annotation regimes by evaluating the performance of classifiers across the different training, validation, and test combinations for each regime. Despite the consistent predictive power in the monolingual and multilingual settings, the classifiers with the NOBI annotation perform worse on the Slovenian RSDO5 corpus than those with the BIO regime. This is because the proportion of nested terms in RSDO5 is too small for the classifier to learn nested terms properly, as visualized by the proportions of unique nested terms and of terms nested in other nested terms in Figs. 16, 17 and 18.

Table 5 Comparison between our performance and the SOTA on the RSDO5 dataset

In Table 5, we present the results from the related work on the RSDO5 dataset compared to our proposed monolingual and multilingual approaches. The result for the method of Ljubešić et al. (2019), which was re-implemented on the same RSDO5 corpus as used in our studies, is taken from Tran et al. (2022b). In general, our approach outperforms that of Ljubešić et al. (2019) by a large margin on all domains and according to all evaluation metrics, especially Recall. We achieve roughly twice the F1-score of the Ljubešić et al. (2019) approach for all test domains, in both monolingual and multilingual learning. One should note that their method was primarily meant for extracting terms from Ph.D. theses, i.e., documents significantly longer than those available in our training data, which explains its low Recall. However, this result clearly identifies a significant strength of the sequence-labeling approach: it does not rely on the frequency of term occurrences, which makes it more robust, as shown in this comparison. In our case, we show that the multilingual experiments do in several cases improve our monolingual results (Tran et al., 2022b), but not systematically.

5 Error analysis

In order to determine whether the term length affects the models’ performance, we calculate Precision and Recall for terms of length \(k \in \{1, 2, 3, 4, \ge 5\}\) as predicted by our classifiers on the test set. The number of predicted candidate terms (Preds), the number of ground truth terms (GTs), the number of correct predictions (TPs), Precision (P), and Recall (R) for the different term lengths k and test domains in the ACTER and RSDO5 corpora are presented in Tables 9 and 10 (in the Appendix), and the Precision and Recall for each scenario are visualized below.
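The per-length breakdown can be computed by bucketing candidates and gold terms by word count (with five or more words collapsed into one bucket) and scoring each bucket separately, as in the sketch below; since exact string matching preserves word count, the per-bucket counts are consistent. The bucketing helper and names are ours, and the published tables may have been produced differently.

```python
# Sketch: Preds, GTs, TPs, Precision and Recall per term-length bucket
# (1, 2, 3, 4, and >=5 words), using exact lower-cased string matching.

from collections import defaultdict

def per_length_scores(candidates, gold, max_k=5):
    bucket = lambda term: min(len(term.split()), max_k)
    preds, gts = defaultdict(set), defaultdict(set)
    for c in candidates:
        preds[bucket(c)].add(c.lower())
    for g in gold:
        gts[bucket(g)].add(g.lower())
    scores = {}
    for k in sorted(set(preds) | set(gts)):
        tp = len(preds[k] & gts[k])
        scores[k] = {
            "Preds": len(preds[k]),
            "GTs": len(gts[k]),
            "TPs": tp,
            "P": tp / len(preds[k]) if preds[k] else 0.0,
            "R": tp / len(gts[k]) if gts[k] else 0.0,
        }
    return scores
```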

5.1 The ACTER dataset

The results for ACTER’s dataset (Table 9) were obtained by employing the best performing model for a specific language in terms of F1 on the Heart failure test set for the most cases (which is the combination of English, French, and Slovenian as the training set).

Fig. 5 Performance in P and R per term length per domain in English ACTER test set

Fig. 6 Performance in P and R per term length per domain in French ACTER test set

Fig. 7 Performance in P and R per term length per domain in Dutch ACTER test set

As demonstrated in Figs. 5, 6, and 7, when using the BIO scheme, the best model proved to be good at predicting terms containing up to four words for English and Dutch and up to three words for French in the ACTER corpora. We also found a strong correspondence between the F1 and the number of predicted candidate terms, where the number of predicted candidate terms likely reflects the distribution in the training data (see Table 9 in the Appendix).

The best models trained using the NOBI annotation scheme demonstrated the same behavior as those trained using the BIO annotation regime. They performed well at predicting terms containing up to four words for English and Dutch and up to three words for French in the ACTER corpora. While our expectation was that the NOBI annotation scheme should mainly benefit the model’s ability to predict short single-word nested terms, the classifiers trained using NOBI annotations also show better performance than those using the BIO regime on multi-word terms, as long as nested terms make up a sufficient proportion of the data, as in the ACTER corpora. Recall therefore generally improves for terms of all lengths, even for terms containing five words or more. There seems to be some signal in the occurrence of nested terms inside multi-word terms, which leads the model to better identify longer terms as well. Our current hypothesis is that this effect is a combination of (1) improved single-word term identification due to a larger effective training set (both nested and independent single-word terms) and (2) nested terms acting as a sort of anchor exploited by the model to more easily identify the multi-word terms around them. Further experiments and analyses should be conducted to fully understand this phenomenon.

Furthermore, a trend noticeable across the majority of scenarios is that the NOBI regime reduces Precision compared to the BIO regime. This seems to be related to the number of predicted terms: Precision often drops when the number of predicted terms is higher, e.g., on the English dataset the BIO regime predicts 1,009 single-word terms with a Precision of 63.3%, while the NOBI regime predicts 1,341 terms with a Precision of 59.2%. In a similar but reversed trend, the Dutch NOBI regime produces 1,738 terms with a Precision of 73.5%, whereas the BIO regime produces 2,005 terms with a Precision of 64.4% (see Table 9 for the statistics).

Table 6 A comparison of the performance between the BIO and NOBI schemes on the entire dataset, single-word terms (SWU), and multi-word terms (MWU)

We performed an additional detailed comparison of the BIO and NOBI monolingual results on the English dataset (i.e., the results from the first line of Table 1) in Table 6. The NOBI scheme produces a marginal improvement in terms of F1 and Recall but has slightly lower Precision. Overall, the algorithm predicted 1,956 candidates when using the BIO scheme and 1,996 when using the NOBI scheme. Out of these, the BIO scheme resulted in 751 single-word terms (SWU) and 1,205 multi-word terms (MWU), while the NOBI scheme produced 889 single-word terms and 1,107 multi-word terms. Looking at the performance in Table 6, NOBI results in a better Recall of single-word terms (51.5 vs. 45.9), which leads to an overall improvement of the F1 (52.7 vs. 52.6). It does not improve the Precision of SWU terms but, perhaps surprisingly, delivers higher Precision on MWU terms, which could be because the NOBI regime prefers single-word terms (due to their higher proportion in the training set), resulting in a smaller number of higher-quality MWU terms being predicted.

5.2 The RSDO5 dataset

The results for the RSDO5 dataset (Table 10 in the Appendix and Figs. 8, 9, 10 and 11) were obtained by employing the best-performing model in terms of F1 for each specific test domain, for both annotation regimes: (1) training on Veterinary and Chemistry, validating on Biomechanics, and testing on the Linguistics domain; (2) training on Linguistics and Biomechanics, validating on Chemistry, and testing on the Veterinary domain; (3) training on Linguistics and Veterinary, validating on Biomechanics, and testing on the Chemistry domain; and (4) training on Linguistics and Chemistry, validating on Veterinary, and testing on the Biomechanics domain.

Fig. 8 Performance in P and R per term length per domain in RSDO Linguistics test set

Fig. 9 Performance in P and R per term length per domain in RSDO Veterinary test set

Fig. 10 Performance in P and R per term length per domain in RSDO Biomechanics test set

These results are similar to those for the ACTER corpora, showing that the models are good at predicting short terms containing up to three words in all four domains of the Slovenian corpus. The best model applied to the Linguistics test domain also shows relatively good performance in predicting longer terms, achieving 75.0% Precision and a decent 31.0% Recall for terms with at least five words. Despite the relatively high Precision for long terms in the Veterinary and Biomechanics test domains, the Recall is quite low, most likely due to the small number of longer terms in the data on which the models are trained. In the Chemistry domain, there are no correct predictions of terms with five or more words.

Fig. 11 Performance in P and R per term length per domain in RSDO Chemistry test set

The NOBI regime often results in a lower Precision compared to the BIO one. Similar to our findings on the ACTER dataset, this seems to be related to the number of terms being predicted. In general, the higher the number of predictions, the lower the Precision (provided the number of predicted terms is high enough; this trend is less noticeable for longer terms, of which there are few in the corpus). There are some exceptions, such as the Chemistry domain, where the NOBI regime results in 909 predicted single-word terms with a Precision of 61.4% compared to 943 terms with a Precision of 61.5% for the BIO regime, and the Veterinary domain, where the NOBI regime predicted 2,111 two-word terms (k=2) with a Precision of 70.3%, while the BIO regime predicted 2,062 terms with a Precision of 70.2%.

As mentioned above, and as observed in previous work (Tran et al., 2022b) for the BIO regime, since the corpus contains nested terms, a very common mistake made by both the BIO and NOBI models is to incorrectly predict a shorter term nested in a correct gold-standard term. Vice versa, the models sometimes generate incorrect predictions that contain correct nested terms. However, the NOBI annotation partially reduces the effect of these two error patterns and improves the overall Recall in comparison to the benchmark BIO scheme.

6 Conclusion

In summary, we demonstrated the benefits of cross-lingual and multilingual learning over the monolingual setting for boosting the predictive performance of cross-domain sequence-labeling term extraction, via experiments conducted on two multi-domain corpora, the ACTER and RSDO5 datasets. In addition, we presented the positive impact of cross-lingual and multilingual models both when trained on the ACTER corpora alone and when the texts from the Slovenian RSDO5 corpus are added to the training set. Furthermore, we examined the cross-lingual effect of rich-resourced training languages on less-resourced test languages such as Slovenian. Last but not least, we proposed a new NOBI annotation regime that boosts the predictive power of the classifiers in comparison to the classical BIO mechanism, as shown on the ACTER corpus, in which the number of nested terms is large enough. The improvements brought by the NOBI annotation regime are visible even in multi-word term identification, quite likely because it improves single-word term extraction and exploits single-word terms as anchors for correctly identifying multi-word terms. The results demonstrate the potential of the new annotation scheme to enhance nested term extraction and the promising impact of cross-lingual and multilingual cross-domain learning when transferring from rich- to less-resourced languages.

In future work, we will test the potential of our proposed NOBI mechanism in similar sequence-labeling extraction tasks in other areas (e.g., Named Entity Recognition). In addition, we plan to investigate the integration of active learning into our current approach to improve the output of the automated method by dynamically adapting to human feedback.