Annotation scheme
To make the dataset fit for different applications of ATE, yet also domain- and language-independent, an annotation scheme was developed with three term labels, based on two parameters. In the pragmatic school of terminology, “two broad classes of distinctions are made, the first using the criterion of known/unknown and the second distinguishing between subject-specific and non-subject-specific terms” (Pearson 1998, 1:21). However, Pearson rejects both classes as too vague and does not define different term categories, believing that users will be more interested in identifying terms than in distinguishing between different types of terms. Nevertheless, such domain- and language-independent distinctions between terms may prove helpful for more application-oriented evaluations, since it has long been argued that different users require different terms. For instance, Estopà (2001) had four groups of professionals annotate terms in a medical corpus: doctors, archivists, translators and terminologists. She found great differences between their annotations: terminologists annotated the most terms, while translators annotated far fewer (only the ones that did not belong to their general vocabulary and might present translation difficulties). Warburton (2013) similarly remarks that translators are not interested in terms that belong to the general lexicon. In this sense, having different term labels based on the two classes mentioned by Pearson (1998) might improve customisation options for different applications. The first parameter will be called domain-specificity and represents the degree to which a term is related to the researched domain. This has been mentioned before, e.g. “All specialized languages show a gradient of domain-specificity” (Loginova et al. 2012), and the TermITH project guidelines (Projet TermITH 2014) specify that terms from the transdisciplinary domain and from a different domain should be rejected. The second parameter will be called lexicon-specificity, i.e. the degree to which terms are part of the specialised lexicon used only by domain experts rather than of general language. Drouin (2003) mentioned this notion in earlier work and, in a more recent paper (Drouin et al. 2018b), a scale of four degrees of lexicon-specificity is presented: from topic-specific, to subject-specific (in their case: environmental), to transdisciplinary, to general lexicon. By combining lexicon- and domain-specificity in a matrix, as shown in Fig. 1, three term categories can be defined.
The three categories of terms are labelled Specific Terms, Out-of-Domain (OOD) Terms and Common Terms. Examples in the domain of heart failure are shown in Fig. 1. Specific Terms are both lexicon- and domain-specific and are terms according to the strictest definitions of the concept. An example in the domain of heart failure would be ejection fraction, which is not part of general language and whose meaning laypeople probably do not know. At the same time, it is strongly related to heart failure, having something to do with the volume of blood pumped with each heartbeat. OOD Terms are lexicon-specific, but not domain-specific. For instance, in the corpus about heart failure, some of the medical abstracts contained terminology related to statistics, such as p value, which is not part of the general lexicon, but not very specific to the domain of heart failure either. This category contains, among others, what Hoffmann (1985) called “allgemeinwissenschaftlicher Wortschatz”, which can be translated as “non-subject-specific terms” (Pearson 1998). The final label, Common Terms, is meant for the opposite case, when terms are strongly related to the domain but are not very lexicon-specific, such as heart in the domain of heart failure. This may be related to what Hazem and Morin mean when they describe “technical terms that have a common meaning in the general domain” (Hazem and Morin 2016a, 3406). While we do not deny that domain experts will have a much more intricate idea of the concepts behind Common Terms like heart and blood, laypeople generally have at least a basic idea of the concept and know the words. These categories can be used to customise the data to the application, so that, for instance, translators could ignore any terms that are not lexicon-specific, since such terms would likely be part of the translator’s known vocabulary.
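To make the logic behind the matrix in Fig. 1 explicit, the sketch below (a hypothetical simplification, not part of the annotation tooling) shows how the two parameters combine into the three labels; in reality, both parameters are gradual judgements rather than booleans.

```python
from typing import Optional

def term_label(domain_specific: bool, lexicon_specific: bool) -> Optional[str]:
    """Map the two annotation parameters onto one of the three term labels."""
    if domain_specific and lexicon_specific:
        return "Specific Term"   # e.g. "ejection fraction" in a heart failure corpus
    if lexicon_specific:
        return "OOD Term"        # e.g. "p value": lexicon-specific, but out of domain
    if domain_specific:
        return "Common Term"     # e.g. "heart": domain-specific, but general language
    return None                  # neither parameter applies: not annotated as a term
```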
An additional label was included for Named Entities (NEs), since they can be very closely related to terms, as shown by the fact that they are often mentioned in term annotation guidelines with specific instructions (e.g. Projet TermITH 2014; Schumann and Fischer 2016). Another problem that has been mentioned in much related research (Hätty et al. 2017b; Bernier-Colborne and Drouin 2014; Kim et al. 2003) is that of Split Terms, i.e. terms that are interrupted by other words or characters. Two common causes are abbreviations (e.g. left ventricular (LV) hypertrophy, which contains two split terms: left ventricular hypertrophy and LV hypertrophy) and coordinating conjunctions (e.g. left and right ventricle, where both left ventricle and right ventricle are terms). This was solved by creating Part-of term labels, which could be connected to each other, as shown in Fig. 2. All annotations were made in the BRAT online annotation tool (Stenetorp et al. 2011).
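As an illustration of how such connected Part-of fragments can be merged back into full terms, for instance for evaluation purposes, the sketch below assumes a hypothetical representation of each fragment as a (term id, character offset, text) tuple; this is not the format used by BRAT or by the dataset itself.

```python
from itertools import groupby
from operator import itemgetter

# Fragments of the two split terms in "left ventricular (LV) hypertrophy".
fragments = [
    ("t1", 0, "left ventricular"), ("t1", 22, "hypertrophy"),
    ("t2", 18, "LV"),              ("t2", 22, "hypertrophy"),
]

def merge_split_terms(fragments):
    """Concatenate the fragments of each split term in text order."""
    merged = {}
    for term_id, group in groupby(sorted(fragments), key=itemgetter(0)):
        parts = sorted(group, key=itemgetter(1))
        merged[term_id] = " ".join(text for _, _, text in parts)
    return merged

print(merge_split_terms(fragments))
# {'t1': 'left ventricular hypertrophy', 't2': 'LV hypertrophy'}
```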
The annotation scheme is accompanied by elaborate guidelines, which were constructed during the annotation process and after discussions between annotators. They contain instructions on as many recurring problems as possible. Like the annotation scheme itself, the guidelines are language- and domain-independent, though examples are cited from the corpora discussed here. Since the complete guidelines are freely available online, only a sample of the most important instructions will be discussed here. The annotation scheme only provides a basis for deciding whether or not to annotate a term and with which label, so many of the instructions in the guidelines concern term boundaries, viz. which span should be annotated. The first important rule is that each occurrence of every term must be annotated, even if it is embedded within a longer term. This can be seen in Fig. 2, where both the complex (multiword) term right ventricle and the simple (single-word) term ventricle are annotated. Moreover, there is no minimum or maximum term length and all content words may be annotated: nouns and noun phrases, but also adjectives, adverbs and verbs. Another notable issue concerns the distinction between the different labels, since “decisions on the ‘generalness’ of a term candidate are somewhat subjective” (Warburton 2013, 99). One strategy for distinguishing Common Terms from Specific Terms is to check whether the term is used in publications addressed to a large, non-expert audience, such as tabloids. If the term is used in such a source without any further explanation, it is safe to assume it is part of the general lexicon and therefore more likely a Common Term than a Specific Term. An example is the term heart failure. There is no doubt about this term being domain-specific enough, since it was literally the subject of the corpus. Its lexicon-specificity, however, is more difficult to judge. Is heart failure part of general vocabulary or not? Intuitively, one could assume that many people have at least heard of the term and have some basic understanding of what it means, but is that only because the term is so descriptive? To decide, we looked at occurrences of the term in a Google News search. Since it appeared regularly and without further explanation in newspapers and magazines aimed at a very large, general audience, it was decided that heart failure would receive the label Common Term. While this method is certainly not perfect, it provides a somewhat objective strategy in case of doubt. More examples and strategies can be found in the guidelines online.
A final consideration was that annotators were instructed to annotate terms as they appeared in the text, irrespective of whether the terms were accepted in the field or spelled according to the latest conventions. As long as they were used as terms in that text, they should be annotated as such. Since the annotators were not domain experts, this was the most manageable approach. It is also the most logical one if the purpose is to compare human performance against ATE performance, since the annotators only identified the terms present in the data, without reference to an external ontology to which an ATE system might not have access. Indeed, one of the primary uses of ATE is identifying terms that are not in any database yet, so it is important to identify terms as they appear in the texts.
Inter-annotator agreement
Pilot study
In a preliminary pilot study, inter-annotator agreement was calculated between three annotators, who each annotated around 3 k tokens per language in the corpora about corruption, heart failure and wind energy (total ± 40 k tokens). All possible aids could be used, especially since the annotators were not domain experts. They were, however, all fluent in the three languages. Similar to the procedure followed during the annotation of the ACL RD-TEC 2.0 (Qasemizadeh and Schumann 2016), there were two annotation rounds, with a discussion of the results between the two rounds. First, F-score was calculated to test agreement on term span annotations, without taking the assigned label into account, where:
$$ \begin{aligned} \text{Precision of Annotator A versus B} &= \frac{|\text{Annotator A} \cap \text{Annotator B}|}{|\text{Annotator A}|} \\ \text{Recall of Annotator A versus B} &= \frac{|\text{Annotator A} \cap \text{Annotator B}|}{|\text{Annotator B}|} \\ \text{F-score of Annotator A versus B} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \end{aligned} $$
Agreement was calculated on types, not tokens. Consequently, when an annotator gave a label to a certain term but forgot to assign the same label to a later occurrence of the same term, agreement did not decrease. The average F-score after the first iteration was 0.641, which was already good considering the task, but not great. 4207 unique annotations were found in this first round: only 33% were annotated by all three annotators, 26% by two and 41% by a single annotator. These results are similar to those reported by Vivaldi and Rodríguez (2007). Discussing annotations in detail, improving the guidelines and then returning (separately) for the second iteration resulted in a drastic improvement to an average F-score of 0.895. To determine agreement on the labels, Cohen’s kappa was calculated on all shared term annotations. These results were already very promising after the first iteration, with an agreement of 0.749, and improved to 0.927 after the second iteration. While this was a good indication of the validity of the procedure and a great way to optimise the guidelines, the methodology was imperfect, since specific cases were discussed in detail between rounds and the same dataset was re-used. Consequently, more rigorous experiments were organised.
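As a minimal illustration of how this type-level span agreement can be computed, the sketch below reduces each annotator to the set of unique term types they annotated (the example annotations are invented):

```python
def pairwise_f_score(annotator_a: set, annotator_b: set) -> float:
    """F-score between two annotators on unique term types, ignoring labels."""
    shared = annotator_a & annotator_b
    precision = len(shared) / len(annotator_a)
    recall = len(shared) / len(annotator_b)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

a = {"ejection fraction", "heart failure", "p value", "heart"}
b = {"ejection fraction", "heart failure", "ventricle"}
print(round(pairwise_f_score(a, b), 3))  # 0.571
```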
Inter-annotator agreement evaluation
The purpose of this experiment was to see if the proposed annotation scheme and guidelines improved inter-annotator agreement. For this purpose, annotators were asked to annotate terms in a part of the heart failure corpus in three tasks with different instructions:
Test group:
Task 1: single label (Term) annotation with only term definitions from literature (e.g. Cabré 1999; Faber and Rodríguez 2012) as guidelines.
Task 2: term annotation with the four labels as specified above and with an explanation of the annotation scheme, but no further guidelines.
Task 3: term annotation with the four labels like in task 2, but with the full guidelines.
Control group:

Tasks 1–3: annotation of the same texts as in the test group, but with identical instructions for all three tasks (see below).
Two different abstracts were chosen for each task, all with a similar word count (so six different texts in total). Texts without any Split Terms were chosen, to avoid the added difficulty. Moreover, readability statistics (De Clercq and Hoste 2016) were calculated to ensure that the texts were all of comparable difficulty. The annotators all came from different backgrounds and the only requirement was that they knew English well enough to read a scientific text. Due to the diverse annotator profiles, we expected agreement scores to be much lower than would normally be desirable, but the main goal of this experiment was to compare agreement with and without the annotation scheme and guidelines. Therefore, in a control group, annotators were asked to annotate the same texts, but all with the same instructions.
There were 8 annotators in the test group and 6 annotators in the control group. All annotators were between 20 and 30 years of age and knew English well enough to understand the texts. Other than that, there were few similarities between the annotators. Seven of them were students with a language-related degree, but the others came from very different backgrounds, including a medical student, a music teacher and an engineering student. While there are many other possible patterns in these data, the analysis in this contribution focuses only on the validation of the annotation scheme and guidelines.
Agreement was calculated between all annotator pairs as described in Sect. 4.2.1: first F-score, then Cohen’s kappa. Since chance-corrected agreement scores like kappa can only be calculated when the total number of annotatable items is known (which is impossible for term annotation in running text), they are usually calculated only on the intersection of the annotations made by both annotators (Qasemizadeh and Schumann 2016). However, this would mean having to exclude the first task from the comparison, since only one label was used in that task. Similar to the methodology proposed by Vivaldi and Rodríguez (2007), we instead take the union of all terms annotated by either annotator as an approximation. Still, comparisons between the first task and the other two have to be interpreted carefully, since the kappa score was calculated on a different number of categories (two categories in task 1: term or not-term; versus five categories in tasks 2 and 3: Specific Term, Common Term, OOD Term, Named Entity or not-term).
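The sketch below illustrates this union-based approximation with invented annotations: every term type annotated by either annotator becomes an item, and a type that one annotator did not annotate counts as "not-term" for that annotator.

```python
from collections import Counter

def cohens_kappa(labels_a: dict, labels_b: dict) -> float:
    """Cohen's kappa over the union of both annotators' term types."""
    items = sorted(set(labels_a) | set(labels_b))
    a = [labels_a.get(t, "not-term") for t in items]
    b = [labels_b.get(t, "not-term") for t in items]
    n = len(items)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in set(a) | set(b)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 1.0

ann_a = {"ejection fraction": "Specific Term", "heart": "Common Term", "p value": "OOD Term"}
ann_b = {"ejection fraction": "Specific Term", "heart": "Specific Term"}
print(round(cohens_kappa(ann_a, ann_b), 3))  # 0.143
```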
In Table 2, it can be observed that the agreement scores, especially the kappa scores, are low, as expected. However, a first indication in favour of the annotation scheme and guidelines is that agreement increases per task in the test group. While the difference is small, the results are further validated by the fact that agreement in the control group stays roughly the same across tasks and even decreases slightly. The difference in agreement between the second and third task is very small, which may be because the guidelines are too elaborate to be helpful to inexperienced annotators for such a small annotation task. It can even be seen as a sign in favour of the annotation scheme, i.e. that it works well on its own, even without elaborate guidelines. Since the improvement in agreement can be seen for both F-scores and kappa scores, we carefully conclude that (1) the annotation scheme improves consistent term annotation when compared to annotation based on no more than term definitions, (2) the guidelines may be a further help to annotators, and (3) including multiple labels does not decrease agreement. Finally, while agreement is expectedly low among annotators with such diverse profiles, we are optimistic that experienced annotators/terminologists can be more consistent, as indicated by the pilot study.
Table 2 Average inter-annotator agreement scores per group and per task

A final remark concerning inter-annotator agreement is that, as mentioned before, the final annotations were all made or at least checked by one experienced annotator and terminologist, to improve consistency. Additionally, other semi-automatic checks were performed to ensure the annotations would be as consistent as possible. For instance, when the same word(s) received a different label at different instances, the annotator double-checked whether this was an inconsistent annotation or a polysemous term.
Results and analysis
Around 50 k tokens were manually annotated per domain and language, leading to a total of 596,058 annotated tokens across three languages and four domains, as represented in Table 3. For the corpus on corruption, only the parallel part was annotated, not the comparable part. This resulted in 103,140 annotations in total and 17,758 unique annotations (= 17.2 unique annotations per 100 tokens). For comparison: in the ACL RD-TEC 2.0 (Qasemizadeh and Schumann 2016), 33,216 tokens were annotated, resulting in 4849 unique terms (= 14.6 unique annotations per 100 tokens). Since only nominal terms were annotated for the ACL RD-TEC 2.0, this difference was to be expected.
Table 3 Number of tokens annotated per domain and language

The first observation concerns the number of annotations per language, domain and category, shown in Fig. 3 for tokens and Fig. 4 for types. As can be seen in these graphs, the largest differences are between corpora in different domains. Within each domain, the total number of annotations is reasonably similar across the languages, as is the distribution over the different term categories. This is encouraging, since the corpora should be as comparable as possible. Of course, since the corpus on corruption is a parallel corpus, the differences there are smallest. The fact that more tokens were annotated in the English corpus on heart failure than in the other languages, despite it being the smallest in number of tokens, may be related to the predominance of English in this type of literature. Perhaps terms are coined more easily in English, or perhaps, due to less available data in French and Dutch, more abstracts from the alpha sciences (e.g. regarding patient care and quality of life) were included for those languages. Such abstracts may, in this context, contain less terminology than the medical abstracts from the beta sciences, though this is no more than a hypothesis. The corpus on dressage is one of the most focussed corpora in terms of subject, which would explain why there are quite a lot of annotations when counting tokens but fewer when counting types: the corpus contains many terms, but many of them recur. The same seems to be true for the French corpus on wind energy. Otherwise, both views lead to roughly the same conclusions.
Concerning the distributions over the different term categories, one corpus stands out, namely the one about corruption. In this corpus, there are many NEs and very few Specific Terms compared to the other corpora. This can be explained by the fact that (1) the legal texts often contain many person and place names, in addition to titles of laws, etc., and (2) juridical terms are more likely to find their way into general language. Juridical proceedings are often reported in the news and many people are confronted with legal jargon when, for example, buying or renting a house, paying taxes or signing any type of contract. The percentage of OOD Terms in the medical corpora can be explained by the prevalence of statistical terms. Since statistics are often required in scientific research, such terms appear in the abstracts, even though they are not directly related to heart failure. There are relatively more Common Terms when looking at tokens than at types, since there are often only a few general language terms that are related enough to the domain to be included, but these occur quite often, e.g. heart and blood in the domain of heart failure. The opposite is true for NEs, which do not occur very often and are rarely repeated, so their type counts are relatively higher.
The next analysis concerns term length, as shown in the graph in Fig. 5. While the differences are slightly less extreme when counting per token, the general conclusions remain the same. While there are some differences between the domains as well, most differences are between languages. A first conclusion is that terms are generally quite short, with few exceeding a length of five tokens. The longest term, in the French corpus on heart failure, was ten tokens long: inhibiteurs de l’enzyme de conversion de l’angiotensine II. There are more single-word terms than two-word terms in all languages, even though the difference is very small for English. There are exceptions, e.g. the English corpus on wind energy has more two-word terms than single-word terms. Still, these findings are surprising when compared to some other research. In the ACL RD-TEC (Qasemizadeh and Handschuh 2014), there are many more two- and even three-word terms than single-word terms. In earlier work (Justeson and Katz 1995), two-word terms are also found to be much more common than single-word terms, except in the medical domain. However, there are also findings that are more similar to ours. Estopà (2001) finds that 42.91% of the terms in her medical corpus are simple noun terms. A German annotation experiment (Hätty and Schulte im Walde 2018) arrived at similar conclusions, with 46.7% single-word terms. One potential explanation for the differences is the inclusion of terms other than nouns and noun phrases. The corpus itself may also have a considerable influence. A final observation is that there are many more single-word terms in Dutch, which is easily explained by the pervasiveness of single-word compounds in Dutch.
The final part of this discussion focuses on the part-of-speech patterns found in the corpora, as shown in Fig. 6. All corpora were automatically tagged using LeTs Preprocess (van de Kauter et al. 2013). These results have been discussed in more detail in Rigouts Terryn et al. (2018). The main conclusions were that (1) nouns and noun phrases are important, but adjectives and even verbs are not uncommon; (2) there are a few common patterns, as can be seen in the graph, but 10–30% of the annotations have other, often quite complicated part-of-speech patterns; and (3) the patterns vary considerably per language and domain.
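As a simple illustration of how such pattern statistics can be obtained, the sketch below counts the part-of-speech patterns of annotated terms; the (token, tag) representation and the tag set are assumptions for the example and do not correspond to the actual LeTs Preprocess output format.

```python
from collections import Counter

# Each annotated term is represented as a list of (token, POS tag) pairs.
annotated_terms = [
    [("ejection", "NOUN"), ("fraction", "NOUN")],
    [("left", "ADJ"), ("ventricle", "NOUN")],
    [("heart", "NOUN")],
    [("chronic", "ADJ"), ("heart", "NOUN"), ("failure", "NOUN")],
]

pattern_counts = Counter(" ".join(tag for _, tag in term) for term in annotated_terms)
for pattern, count in pattern_counts.most_common():
    print(f"{pattern}: {count}")
# NOUN NOUN: 1, ADJ NOUN: 1, NOUN: 1, ADJ NOUN NOUN: 1
```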
Use case with TExSIS
In the previous section, a sample was presented of the type of information that can be gained from the dataset. In this section, the practical use of the dataset as a gold standard is illustrated by means of a use case with the monolingual pipeline of the hybrid ATE system TExSIS (Macken et al. 2013). For this experiment, the threshold cut-off values of TExSIS were set very low, so there was a clear focus on recall over precision. Moreover, TExSIS currently only extracts nouns and noun phrases, which will impact recall, since the gold standard does include other part-of-speech patterns. NEs were included in all analyses, since TExSIS includes a named entity recognition (NER) module. Split Terms, however, were excluded, since TExSIS cannot handle interrupted occurrences of terms. It should also be noted that it is not the aim of the use case to provide an elaborate evaluation of TExSIS, but rather to illustrate the usefulness of the dataset for evaluation purposes.
First, precision, recall and F-scores were calculated for all corpora, for which the results are presented in Fig. 7. As expected, recall is much higher than precision. There are also considerable differences between the different domains and languages. For instance, the three corpora with the worst F-scores are all French, though the French corpus on heart failure scores fourth best. This may be because the system seemed to work best on this domain: the three corpora on heart failure have the best (Dutch), third best (English) and fourth best (French) F-scores. In all domains, the French corpora score worse than their counterparts in English and Dutch.
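A minimal sketch of this kind of evaluation is given below, comparing unique lowercased term types from the ATE output against the gold standard; the file names and one-term-per-line format are assumptions for the example.

```python
def load_terms(path: str) -> set:
    """Read one term per line and return the set of unique lowercased types."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def evaluate(extracted: set, gold: set) -> tuple:
    true_positives = extracted & gold
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = load_terms("gold_standard_heart_failure_en.txt")       # hypothetical file name
extracted = load_terms("texsis_output_heart_failure_en.txt")  # hypothetical file name
print("P = %.3f, R = %.3f, F1 = %.3f" % evaluate(extracted, gold))
```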
Next, the impact of the different termhood measures was analysed, namely Vintar’s (2010) termhood measure and the log-likelihood ratio (llr). For the correctly extracted terms, the overall average termhood score was 12.13 and the average llr was 56.21. For the incorrectly extracted terms (noise), these averages were only 4.23 and 6.84 respectively, which means a difference of 7.89 points for the termhood measure and 49.37 points for llr. These findings confirm that both measures are informative of termhood. However, since TExSIS ranks candidates by Vintar’s termhood measure rather than by llr, it was surprising that llr appeared so much more informative in this comparison. To examine this in more detail, the evolution of precision, recall and F-score was plotted for the best-ranked terms, first when sorted by Vintar’s termhood measure, then when sorted by llr. Since the smallest number of extracted terms for any corpus was 3884, we looked at the 3884 best-ranked terms for each corpus. The results are presented in Figs. 8 and 9; precision, recall and F-score were averaged over all corpora. A logical pattern would be for precision to start very high and recall very low, with the two crossing at some point. While this is true for recall, precision does not start high in either case and varies only slightly throughout; it even starts to increase slightly towards the end. Precision for the termhood measure starts with at least a very small peak, but overall, performance is slightly better when term candidates are sorted by llr. The fact that even the highest-ranked term candidates do not reach a higher precision indicates that these statistical measures fail to capture important term characteristics.
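Curves like those in Figs. 8 and 9 can be reproduced with a sketch like the one below, which computes precision and recall at each rank k for a candidate list sorted by one of the measures; the data in the example are invented.

```python
def precision_recall_at_k(candidates, gold, max_k):
    """candidates: list of (term, score) pairs; gold: set of gold-standard terms."""
    ranked = [term for term, _ in sorted(candidates, key=lambda x: x[1], reverse=True)]
    hits, curve = 0, []
    for k, term in enumerate(ranked[:max_k], start=1):
        hits += term.lower() in gold
        curve.append((k, hits / k, hits / len(gold)))  # (rank, precision@k, recall@k)
    return curve

candidates = [("ejection fraction", 14.2), ("p value", 8.7), ("the patient", 3.1)]
gold = {"ejection fraction", "p value", "heart failure"}
for k, p, r in precision_recall_at_k(candidates, gold, max_k=3):
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```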
Next, the distribution of the different term labels in the gold standard is compared to the distributions among the correctly extracted terms and among the terms that should have been extracted but were not (the silence). The greatest difference was found for the NEs. On average (across all languages and domains), 21% of all unique annotations in the gold standard were NEs. Among the correctly extracted terms, this was only 17%, versus 28% on average for the silence, indicating that TExSIS is worse at identifying NEs than other terms. This is hardly surprising, since TExSIS was mainly designed for ATE and the NER module was not the focus of the tool. Conversely, TExSIS does seem to perform well for the Specific Terms and Common Terms, with larger proportions of each among the correctly extracted terms than among the silence (an average difference of 3% for each).
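Such a label-level analysis can be sketched as follows, assuming the gold standard is available as a mapping from term to label and the ATE output as a set of terms (both hypothetical formats):

```python
from collections import Counter

def label_distribution(terms, gold_labels):
    """Relative frequency of each gold-standard label within a set of terms."""
    counts = Counter(gold_labels[t] for t in terms)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

gold_labels = {"ejection fraction": "Specific Term", "heart": "Common Term",
               "p value": "OOD Term", "NYHA": "Named Entity"}
extracted = {"ejection fraction", "heart", "p value"}

correct = extracted & set(gold_labels)   # correctly extracted gold-standard terms
silence = set(gold_labels) - extracted   # gold-standard terms that were missed
print("correct:", label_distribution(correct, gold_labels))
print("silence:", label_distribution(silence, gold_labels))
```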
Many other automatic evaluations could be performed by comparing the ATE output to the dataset, such as evaluations of the number of terms extracted, the term length, the part-of-speech patterns or variations between the different domains and languages. However, the analyses presented suffice to show the practical usefulness of the datasets as gold standards for the evaluation of ATE.