Introduction

The usefulness of terminology extraction from corpora is widely acknowledged, as evidenced by the great deal of research and discussion it has generated. This well-established process is used in natural language processing and has led to the development of several tailored tools such as TBXTools [31], TermSuite [9], BioTex [22], etc.

Building on [22], our proposal deals with domain-based terminology extraction from heterogeneous corpora and with how to efficiently generate a quantitative and qualitative analysis. To this end, we propose a generic methodology hinging on a combination of extraction and analysis strategies. Term extraction strategies are based on combinations of linguistic and statistical measures and corpus segmentation approaches, while analysis strategies are based on combinations of extracted terms.

Based on the combined strategies, ITEXT-BIO aims to extract: (1) representative terms, (2) discriminant or relevant terms, and (3) new relevant terms from a corpus or corpora. These strategies are specifically useful for dedicated tasks, such as corpus analysis, specific domain monitoring (e.g. epidemiology) or scientific research monitoring.

This paper is organized as follows. In Section Related work, we briefly present the state-of-the-art related to terminology extraction. Section Dataset description details the dataset dedicated to scientific papers. Sections Methodology and Experiments respectively provide an overview of our proposal and the experiments. In Section Case study: epidemic intelligence, we illustrate the genericity of the proposal by presenting a case study of an implementation of the combined strategies for epidemiological intelligence analysis. We conclude in Section Conclusion by presenting some perspectives for future studies.

Related work

Domain terminology extraction is a major focus of interest and discussion in natural language processing (NLP) research. It has prompted several proposals of methodologies [20, 32, 34, 36] geared towards effective extraction of terms within a given corpus. Also known as automatic term extraction (ATE), this task is considered in various NLP applications, such as in information retrieval [2, 4, 11, 37], topic modeling [15, 42], domain-based monitoring [1, 19, 27], keyword extraction [7] and summarization [2], ontology acquisition, thesaurus construction, etc.

According to [23], term extraction techniques can be categorized under four approaches: linguistic, statistical, machine learning and hybrid.

Overall, linguistic approaches take morphosyntactic part-of-speech (POS) rules into account to describe terms with common structures [5]. Statistical approaches use statistical measures such as term frequency [35, 43], or term co-occurrence between words and phrases, like Chi-square [26]. Machine learning approaches use statistical measures and are mainly focused on term extraction [7, 8, 12], classification [41] and summarization [2]. They combine linguistic and statistical features to extract terms from textual data in order to build machine learning models. In [7], the authors highlighted that most of these tasks are tackled with unsupervised learning algorithms. Hybrid approaches include, for instance, the C_Value [34] and C/NC_Value [13] methods, which combine statistical measures and linguistic rules to extract multi-word and nested terms. In [6, 30], the authors combine rule-based methods and dictionaries to extract terms from Spanish biomedical texts and specialised Arabic texts, respectively.

Studies such as [18, 21] related to these latter approaches have revealed the effectiveness and high performance of hybrid term extraction approaches.
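To make the hybrid family of approaches concrete, the sketch below combines a linguistic step (POS-pattern filtering) with a statistical step (frequency ranking). The two bigram patterns and the pre-tagged input are illustrative assumptions of ours, not the rule set of any of the cited tools.

```python
from collections import Counter

# Hypothetical POS patterns for candidate bigrams: adjective+noun, noun+noun.
PATTERNS = {("JJ", "NN"), ("NN", "NN")}

def extract_candidates(tagged_sentences):
    """Keep bigrams whose POS tags match a pattern, then rank by frequency."""
    counts = Counter()
    for sent in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            if (t1, t2) in PATTERNS:
                counts[f"{w1} {w2}"] += 1
    return counts.most_common()

sents = [
    [("acute", "JJ"), ("infection", "NN")],
    [("influenza", "NN"), ("virus", "NN")],
    [("the", "DT"), ("acute", "JJ"), ("infection", "NN")],
]
print(extract_candidates(sents))  # [('acute infection', 2), ('influenza virus', 1)]
```

In a real pipeline, the tagged input would come from a POS tagger, and the raw frequency score would be replaced by a measure such as C_Value.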

The proposed methodologies apply to several domains. In [22], the authors proposed BioTex, a linguistic and statistical measure-based tool to extract terms related to the biomedical domain. The same approach was used in [1] to detect terms or signals for infectious disease monitoring on the web. In [28], a hybrid methodology was proposed to extract terminology for electronic heath records. This hybrid approach was also adapted by [44] to extract concepts related to Chinese culture.

Overall, these related studies have focused on techniques and methods for term extraction, mainly from corpora. Building on existing methodologies, we oriented our study towards developing an efficient approach for term extraction from heterogeneous corpora, along with a set of combined strategies to analyze these terms in the biomedical domain. Our methodology combines and tailors linguistic and statistical criteria associated with structural information in texts in order to highlight relevant terms therein. The presented strategies also aim to overcome the time-consuming issues related to machine learning methods, which require manually annotated or partially annotated data.

Dataset description

Our study focused on the COVID-19 Open Research DatasetFootnote 1 [40] which contains scientific papers on COVID-19 and related historical coronavirus research. Throughout this study, we refer to the dataset as COVID19-MOOD-data.

The COVID19-MOOD-data dataset is divided into two main corpora, named Papers1 and Papers2. Papers1 contains the commercial use subset (which includes PubMed Central content), while Papers2 contains the commercial use subset, the non-commercial use subset (both of which include PubMed Central content) and the custom license subset.

Three data pre-processing operations are performed per corpus (Papers1, Papers2) in order to create three corpora according to the title, abstract and content:

  • Title represents the corpus that contains only paper titles;

  • Abstract represents the corpus that contains only paper abstracts;

  • Content represents the corpus that contains only paper contents.

We named them PapersX-title, PapersX-abstract and PapersX-content, respectively. See Table 1 for further details and Table 2 for the acronym definitions.
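The pre-processing above amounts to splitting each paper record into three corpora; a minimal sketch, where the field names `title`, `abstract` and `content` are assumptions for illustration rather than the actual dataset schema:

```python
def build_corpora(papers):
    """Split paper records into Title, Abstract and Content corpora."""
    corpora = {"title": [], "abstract": [], "content": []}
    for paper in papers:
        for part in corpora:
            text = paper.get(part, "").strip()
            if text:  # skip papers missing this part
                corpora[part].append(text)
    return corpora

papers = [
    {"title": "A coronavirus study", "abstract": "We analyse ...", "content": "Full text ..."},
    {"title": "Influenza dynamics"},  # no abstract or content available
]
corpora = build_corpora(papers)
print(len(corpora["title"]), len(corpora["abstract"]), len(corpora["content"]))  # 2 1 1
```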

Table 1 Statistics related to the COVID19-MOOD-data dataset
Table 2 Table legend

Methodology

Here we outline two complementary term extraction and analysis approaches: the free term extraction approach and the driven term extraction approach. The first one is based on a combination of the type of corpus and the statistical measures, while the second is based on a combination of the type of corpus and the morphosyntactic variation rules.

The free term extraction approach

The free term extraction approach seeks to ensure that users will be able to extract significant terms related to a specific domain from a given corpus. As we mentioned in Section Related work, existing tools have been proposed for term and concept extraction. We opted for the BioTex tool to support the free term extraction mode for several reasons:

  • BioTex was initially built for medical domain term extraction.

  • BioTex uses hybrid measures (linguistic and several statistical measures) for the term extraction process.

  • Most existing tools (e.g. Maui-indexerFootnote 2, Topia TermextractFootnote 3, KEAFootnote 4, etc.) are designed for keyword extraction within single documents, and they only function for English language documents, while BioTex is tailored for terminology extraction and supports sets of documents (corpora) and multi-language use.

Three essential parameters related to the BioTex tool are defined below:

  • a corpus: this is the data source from which terms are extracted;

  • a statistical measure: as mentioned above, the BioTex processing approach is based on linguistic and statistical measures. The linguistic parameter is defined by default, but the user must choose one of the several available statistical measures in order to run the term extraction process;

  • the number of words to be extracted per concept: these so-called n-grams define the length of the extracted terms, which ranges from 1 to 4 words in BioTex.

In addition to these parameters, the number of linguistic patterns (like NN NN, JJ NP NP, NN NP NP, etc.) to be used can be set; it is preset at 20 by default in BioTex. BioTex also includes patterns for verb terms, such as NN VBD NN NN, NP NN VBD NN NP, etc. Figure 1 outlines the overall three-step process for free term extraction.
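The n-gram length parameter can be pictured as enumerating every candidate sequence of one to four words before pattern filtering and ranking. This generic sketch is ours, not BioTex code:

```python
def ngrams(tokens, max_n=4):
    """Enumerate all word n-grams of length 1..max_n from a token list."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

print(ngrams(["severe", "acute", "respiratory", "syndrome"], max_n=2))
# ['severe', 'acute', 'respiratory', 'syndrome',
#  'severe acute', 'acute respiratory', 'respiratory syndrome']
```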

At the end of the BioTex process, extracted terms are classified in two sets: TermSet, which only contains single word terms (SWT), and MultiTermSet, which contains multi-word terms (MWT). By using the Driven Extraction process (with FASTR), we can capture the entire term for a given incomplete one obtained during the first step (Free Extraction). For example, if the terms “higher risk acute” or “higher risk area” are extracted in the Free Extraction step, the entire term “higher risk acute care area” will be obtained during the Driven Extraction process.
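The split into TermSet and MultiTermSet reduces to partitioning the extracted terms by word count; a minimal sketch:

```python
def split_terms(terms):
    """Partition terms into single-word (SWT) and multi-word (MWT) lists."""
    swt = [t for t in terms if len(t.split()) == 1]
    mwt = [t for t in terms if len(t.split()) > 1]
    return swt, mwt

swt, mwt = split_terms(["virus", "public health", "influenza virus", "infection"])
print(swt)  # ['virus', 'infection']
print(mwt)  # ['public health', 'influenza virus']
```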

The driven term extraction approach

This extraction approach seeks to ensure that the terms extracted using BioTex could be used to improve the domain terminology. From a given term, the process aims to extract some variations of this term that exist in the corpus.

The overall processing under this approach is handled with FASTR [17], a rule-based linguistic tool that generates morphosyntactic variants of terms. We respectively note NN, NNS, NNP, NNPS for noun patterns, VB, VBD, VBG, VBN, VBP, VBZ for verbs, RB, RBR, RBS for adverbs and JJ, JJR, JJS for adjectives. FASTR enables extraction of variants of a given term in full-text documents: for a given term, it helps extract nearby or longer terms that contain the initial one. Figure 1 illustrates the two steps (4 and 5) of the driven term extraction approach.

The driven process has the advantage of extracting relevant new terms that BioTex cannot extract from the corpus.
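A rough surface-level approximation of this behaviour is to search the corpus for expressions that extend a given term. The regex sketch below only captures right-extensions, whereas the real FASTR tool applies full morphosyntactic variation rules.

```python
import re

def find_extensions(term, text, window=3):
    """Capture occurrences of `term` extended by up to `window` following words."""
    pattern = re.escape(term) + r"(?:\s+\w+){1,%d}" % window
    return sorted({m.group(0) for m in re.finditer(pattern, text)})

text = "patients in the higher risk acute care area were monitored"
print(find_extensions("higher risk", text))  # ['higher risk acute care area']
```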

Fig. 1
figure 1

The Free and Driven process for term extraction using BioTex and FASTR

Proposed combination for term extraction

Based on the elements given in Sections The free term extraction approach and The driven term extraction approach, we propose a workflow for term extraction and analysis dedicated to scientific papers (Fig. 2). We outline this workflow according to the type of corpus, the measures, and the approach:

  • The type of corpus: as described in the data section, for a given paper, we considered three parts to build the corresponding corpora, i.e. the Title (T), Abstract (A) and Content (C);

  • The measures: BioTex integrates several statistical measures, each of which uses a specific strategy to compute the term score. In this case, we selected the two measures C_Value and F-TFIDF-C_M. C_Value indicates the importance of terms that appear most frequently in a document, based on the idea that the frequency of appearance of a term in the document reflects its importance in the document. Moreover, based on frequency criteria, C_Value favors multi-word term extraction by taking nested terms (e.g. virus) in multi-word terms (e.g. influenza virus) into account [13]. F-TFIDF-C_M represents the harmonic mean of the C_Value and TF-IDF values; it ranks terms by weight according to their relevance in the document while taking the whole corpus into account [24]. C_Value and F-TFIDF-C_M are complementary, as the first favors relevant MWT extraction while the second gives weight to discriminant terms. For each measure, the aim is to organise the extracted terms into five sets: (1) terms corresponding to the Title corpus, Set(T); (2) terms corresponding to the Abstract corpus, Set(A); (3) terms corresponding to the Content corpus, Set(C); (4) terms that intersect within the Title and the Abstract corpora, Set(TA); and (5) terms that intersect within the Title and the Content corpora, Set(TC).

  • The approach: terms could be extracted using both a given corpus and a specific statistical measure in a free extraction approach. Moreover, for the driven process, term variations are extracted by using both a given corpus and specific set of terms. The set of terms could be defined from the output of the previous approach.
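The five term sets used in the workflow reduce to simple set operations on the per-corpus term lists (for one statistical measure); a minimal sketch:

```python
def organise_sets(title_terms, abstract_terms, content_terms):
    """Build the five term sets of the workflow for one measure."""
    T, A, C = set(title_terms), set(abstract_terms), set(content_terms)
    return {
        "Set(T)": T, "Set(A)": A, "Set(C)": C,
        "Set(TA)": T & A,  # terms shared by Title and Abstract
        "Set(TC)": T & C,  # terms shared by Title and Content
    }

sets = organise_sets(
    ["public health", "influenza virus"],
    ["public health", "immune responses"],
    ["influenza virus", "gene expression"],
)
print(sorted(sets["Set(TA)"]))  # ['public health']
print(sorted(sets["Set(TC)"]))  # ['influenza virus']
```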

Fig. 2
figure 2

Proposed combination for term extraction

Experiments

To set the parameters, throughout our study we used C_Value and F-TFIDF-C_M as statistical measures, 50 different patterns or term extraction rules, and a number of words per term ranging from 1 to 4 (\(n = 1,2,3,4\)). These parameters were applied to the corpora described in Section Dataset description. The choice of C_Value and F-TFIDF-C_M is based on the findings of previous studies [13, 24], which showed that both allow efficient SWT and MWT extraction.

Before applying BioTex, some specific pre-processing was applied to the Papers1-content and Papers2-content corpora due to their size. Papers1-content was divided into 9 sub-corpora (8 corpora of 1000 documents each and 1 corpus of 1315 documents) and Papers2-content into 32 sub-corpora (31 corpora of 1000 documents each, and 1 corpus of 1332 documents). Each corpus was partitioned into smaller units to enhance scalability. The results obtained from the smaller units were then merged by computing average rank values: the final rank for a given term was equal to the average of its rank values over all sub-corpora in which it was present. The final result gave a set of terms, listed in ascending order according to the ranking values.
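The merging of sub-corpus results can be sketched as follows: each term's final rank is the average of its rank values over the sub-corpora in which it appears, and the merged list is sorted in ascending order of that average.

```python
from collections import defaultdict

def merge_rankings(sub_rankings):
    """sub_rankings: one {term: rank} dict per sub-corpus (lower rank = better)."""
    ranks = defaultdict(list)
    for ranking in sub_rankings:
        for term, rank in ranking.items():
            ranks[term].append(rank)
    averaged = {term: sum(r) / len(r) for term, r in ranks.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1])  # ascending order

sub1 = {"influenza virus": 1, "public health": 3}
sub2 = {"influenza virus": 2, "gene expression": 1}
print(merge_rankings([sub1, sub2]))
# [('gene expression', 1.0), ('influenza virus', 1.5), ('public health', 3.0)]
```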

Table 3 shows an example of the MWT set obtained using BioTex. The Terms column contains the extracted terms, the in_umls column indicates whether the corresponding term is available in the Unified Medical Language System (UMLS) Metathesaurus [3], and rank shows the significance of the term based on statistical measures in the whole list of terms for a given corpus. In our study, we used the UMLS Metathesaurus as a reference for the extracted terms, as our study is linked to biomedical terminology analysis. This comparison aimed to single out new terms, i.e. terms not yet listed in the Metathesaurus.

Table 3 Example of BioTex output

The free term extraction approach

We used BioTex, as outlined in Section The free term extraction approach, to extract terms from corpora in free mode. Several analyses of the obtained results are presented below. To this end, we conducted experiments to address three main questions: (1) for each corpus, what are the most representative terms or domain concepts (terms that summarize the main content of the corpus) per statistical measure? (2) for each corpus, what are the most representative concepts for both measures? and (3) what are the discriminant and common concepts across the overall corpora?

For each case, we determined if the extracted terms exist or not in the UMLS Metathesaurus.

Corpus representative terms

In this section, we illustrate how representative terms can be extracted from different datasets. Based on the BioTex ranking measures, a term is considered more important than another in a given corpus if it has a better ranking.

Figure 3 shows representative terms for the Title, Abstract and Content corpora with the corresponding statistical measures (see Tables 7 and 8 for more details).

This figure highlights which terms are important in each part of the papers. Note that the extracted terms differ per measure and sub-corpus, but some are common to both measures. For example, terms like public health and immune responses are extracted by both measures from the Abstract corpus.

In order to quantitatively display the number of representative intersecting terms from different corpora, we show common terms between the Title vs Abstract and Title vs Content corpora for the Papers2 corpus in Fig. 4. For both measures, Title terms are more representative in the Abstract than in the Content of papers, i.e. 57% and 27% compared to 28% and 5%, respectively, for Title vs Abstract and Title vs Content. However, we noted that C_Value generated more common terms than F-TFIDF-C_M. The common terms are those extracted simultaneously from the Title, Abstract and Content corpora for each measure.

Fig. 3
figure 3

Representative terms from Papers1

Fig. 4
figure 4

Common terms in Papers2

As indicated, extracted terms were compared with the UMLS Metathesaurus. Table 4 shows the TOP@20 terms extracted for the Papers1-content corpus using C_Value and F-TFIDF-C_M measures. Bold terms are not in the UMLS Metathesaurus.

Table 4 TOP@20 terms extracted from Paper1-content using C_Value and F-TFIDF-C_M - SWTs vs MWTs

According to these TOP@20 terms, we can see that:

  • the majority of the SWTs are in the UMLS Metathesaurus for both statistical measures (C_Value or F-TFIDF-C_M);

  • for MWTs, several terms are not in the UMLS Metathesaurus. These terms can be categorized as:

  • UMLS sub-terms: these are terms that do not exactly match those present in the UMLS Metathesaurus but could be part of them. For example, health emergency is part of terms like Emergency Health Services in the UMLS Metathesaurus;

  • New terms: these terms are not in the UMLS Metathesaurus but may be meaningful in the COVID-19 context. For example, terms like close contact relate to the COVID-19 contagion mode.

Figures 5 and 6 illustrate the number of terms out of the TOP@100 terms (in percentage) for each measure (C_Value, F-TFIDF-C_M) and dataset (Papers1-title, Papers2-title):

  • In_UMLS: the number of terms in the UMLS Metathesaurus;

  • Not_In_UMLS_V: the number of terms that do not exactly match the UMLS terms, but have some variants or are part of the UMLS terms;

  • Not_In_UMLS: the number of terms that do not match the UMLS terms at all. We refer to these as new terms, i.e. terms which are not in the UMLS Metathesaurus but which could have greater meaning in the study context or could be added to the UMLS Metathesaurus.

According to these statistics, we first note that the C_Value and F-TFIDF-C_M measures enable extraction of more conventional terms or terms in the UMLS Metathesaurus regardless of the corpus. Secondly, we note that F-TFIDF-C_M generates more new terms (Not In UMLS) than C_Value regardless of the corpus. Finally, the number of new terms is more substantial with MWTs (Fig. 6) than SWTs (Fig. 5) regardless of the measure.
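The three categories can be approximated with a lookup against a set of reference terms standing in for the UMLS Metathesaurus; the substring test below is a deliberate simplification of the variant matching actually used.

```python
def categorise(term, umls):
    """Classify a term against a reference vocabulary (toy stand-in for UMLS)."""
    t = term.lower()
    if t in umls:
        return "In_UMLS"          # exact match
    if any(t in u or u in t for u in umls):
        return "Not_In_UMLS_V"    # variant / partial overlap with a UMLS term
    return "Not_In_UMLS"          # new term

umls = {"emergency health services", "influenza virus"}
print(categorise("influenza virus", umls))  # In_UMLS
print(categorise("health services", umls))  # Not_In_UMLS_V
print(categorise("close contact", umls))    # Not_In_UMLS
```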

Fig. 5
figure 5

C_Value vs F-TFIDF-C_M SWTs

Fig. 6
figure 6

C_Value vs F-TFIDF-C_M MWTs

Relevant term extraction from corpora for both measures

This involves quantitative and qualitative analysis of the terms extracted within each corpus, while taking both measures (C_Value and F-TFIDF-C_M) into account. In other words, it consists of analysing terms obtained for both measures, i.e. terms detected at the same time, and also terms specific to each of them.

The quantitative analysis aims to highlight, for each dataset, the number of terms obtained by each measure, the number of terms obtained by both measures, and which of them are available or not in the UMLS Metathesaurus. The qualitative analysis aims to highlight, in each case, how important the obtained terms are with regard to the study domain.

For the data representation, we take advantage of a Venn diagram [16]; Appendix Fig. 10 shows the distribution of the Papers2-title corpus terms. Terms are organised in different sections. For example, gene expression, human coronavirus, case report, public health, respiratory syncytial virus, etc. are available in the UMLS Metathesaurus and are recognized by both measures (C_Value and F-TFIDF-C_M). According to the study domain, these terms will tend to be more representative and important in the whole corpus. Moreover, for each measure there are new terms which are not in the UMLS Metathesaurus.

Discriminant and common term extraction from corpora

In this case, term analysis is performed per dataset or by jointly considering multiple corpora, i.e. between Title, Abstract and Content corpora. Appendix Fig. 11 corresponds to discriminant and common term extraction from Papers1-title, Papers1-abstract and Papers1-content.

There are terms common to the overall corpora, such as gene expression, virus replication, influenza virus, etc. These terms tend to be relevant in the Title, Content and Abstract corpora.

Moreover, [respiratory infection, acute respiratory infection, etc.], [innate immune response, endoplasmic reticulum, etc.], and [nucleotide sequences, room temperature, etc.] are discriminant terms in the Title, Abstract and Content corpora, respectively.

The driven term extraction process

We performed a driven term extraction strategy using FASTR. Our proposal addresses two main questions: (1) for a given set of terms, how can new and relevant term variants be extracted from a corpus? (2) do some of the new terms exist in the UMLS Metathesaurus? In our experiment, we used the common terms extracted in Section Discriminant and common term extraction from corpora, based on the fact that they were more representative and relevant throughout the corpora.

Figure 7 shows an example of the variants extracted for the term infectious disease. Among these variants, we only show those which are not in the UMLS Metathesaurus, since they are new and might be more informative.

Fig. 7
figure 7

Example of term variants

Table 5 contains the list of TOP@10 variants extracted from six initial terms. Among them, we highlight (in bold) terms matching terms in the UMLS Metathesaurus. Like free mode extraction, the term variant mode may thus be used to extract useful terms.

Table 5 Term extraction variations using FASTR

Combined strategies for term analysis

Combined strategies for term analysis concern two levels: (1) Intra-corpus term extraction, and (2) Inter-corpus term extraction.

  • Combined intra-corpus term extraction strategies: these are geared towards extracting common or discriminant terms from a given corpus. To this end, the terms extracted with both measures are compared. We show the process in Fig. 8, where the sets of terms Set(Cp) extracted from the corpus Cp (Title, Abstract or Content) using each measure (C_Value, F-TFIDF-C_M) are jointly compared with the UMLS Metathesaurus terms. Set A represents corpus terms specifically extracted with C_Value, set B represents terms that are specific to F-TFIDF-C_M, while set C represents terms common to both measures and to the UMLS Metathesaurus elements. We consider sets A and B to contain discriminant terms of the corpus according to the measures, while set C contains the common terms, i.e. the most representative terms of the corpus. The new term extraction process with FASTR is run with one of the combined sets (discriminant or common) and the corpus.

  • Combined inter-corpus term extraction strategies: these are geared towards extracting common and discriminant terms, while taking several corpora into account for a given measure. As illustrated in Fig. 9, for each measure (C_Value or F-TFIDF-C_M), the sets of terms Set(Cp1), Set(Cp2), Set(Cp3) are extracted respectively from corpus Cp1 (Title), Cp2 (Abstract), and Cp3 (Content). These sets are compared in order to compute the common term set D for both corpora, and discriminant term sets A, B, C, respectively, for corpora Cp2, Cp1 and Cp3. In this context, new terms are extracted using one of the combined sets with one corpus (Cpx).
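Both combination levels come down to set intersections and differences, with set names matching Figs. 8 and 9; a minimal sketch:

```python
def intra_corpus(cvalue_terms, ftfidf_terms):
    """Sets A, B (discriminant per measure) and C (common) for one corpus."""
    A = cvalue_terms - ftfidf_terms   # specific to C_Value
    B = ftfidf_terms - cvalue_terms   # specific to F-TFIDF-C_M
    C = cvalue_terms & ftfidf_terms   # common to both measures
    return A, B, C

def inter_corpus(title, abstract, content):
    """Sets A, B, C (discriminant per corpus) and D (common) for one measure."""
    D = title & abstract & content    # common to all corpora
    A = abstract - title - content    # discriminant for Abstract
    B = title - abstract - content    # discriminant for Title
    C = content - title - abstract    # discriminant for Content
    return A, B, C, D

A, B, C = intra_corpus({"virus", "close contact"}, {"virus", "incubation period"})
print(sorted(C))  # ['virus']
```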

Fig. 8
figure 8

Combined intra-corpus term extraction strategies

Fig. 9
figure 9

Combined inter-corpus term extraction strategies

Case study: epidemic intelligence

Epidemic intelligence (EI) aims to detect, investigate and monitor potential health threats in a timely manner [33]. In addition to conventional surveillance system monitoring, such as outbreak notifications from the World Organisation for Animal Health (OIE), the EI process increasingly mainstreams unstructured data from informal sources such as online news. Several web-based surveillance systems have been developed and used to support public health and animal health surveillance (ProMED [25], HealthMap [14], GPHIN [29], PADI-web [38], etc.). In this case study, we focused on the choice of keywords for the PADI-Web system for COVID-19 surveillance (i.e. driven surveillance) and for monitoring unknown diseases (i.e. syndromic surveillance).

The Platform for Automated extraction of Disease Information from the web (PADI-webFootnote 5) is an automated surveillance system for monitoring the emergence of animal infectious diseases, including zoonoses [1, 38]. PADI-web monitors Google News through specific really simple syndication (RSS) feeds targeting diseases of interest (e.g. African swine fever, avian influenza, etc.). PADI-web also uses unspecific RSS feeds, consisting of combinations of symptoms and hosts (i.e. species), thus allowing syndromic surveillance and detection of unusual disease events. RSS feeds consist of combinations of different categories of terms (i.e. keywords), including symptoms, disease names and species.

PADI-Web has been used for monitoring COVID-19 disease [39]. In this context, the choice of COVID-19 surveillance terms is crucial.

In the following subsections, we discuss the choice of terms given by ITEXT-BIO to use in the PADI-Web system [38] and other web-based surveillance systems [14, 25, 29] for COVID-19 and syndromic surveillance. This enables evaluation of the relevance of terms generated by our approach for a dedicated task, i.e. web-based health surveillance.

Relevant term extraction

We compared the relevance of the top 10 terms extracted from the Papers2 corpora with either C_Value or F-TFIDF-C_M (Table 6). Table 9 gives more details on these terms. The relevance was assessed by classifying the terms in one or more of the following categories:

  • COVID-19 surveillance: epidemiological terms specific to COVID-19 (e.g. coronavirus spike).

  • Syndromic surveillance: epidemiological terms not specific to a particular disease (e.g. infectious bronchitis).

  • Domain relevant: terms related to health, i.e. either to specific diseases (e.g. porcine epidemic diarrhoea) or unspecific (e.g. virus infections). The Domain relevant category thus includes the two previous categories, plus diseases other than COVID-19.

  • Part of disease multiword expression (MWE): part of a multiword expression corresponding to a disease name (e.g. East respiratory syndrome as part of Middle East respiratory syndrome coronavirus).

Among the terms extracted with C_Value from Titles, Abstracts or Titles and Abstracts, six to seven were parts of disease MWEs. Only one term extracted with F-TFIDF-C_M was part of a disease MWE. C_Value could thus be of particular interest for extracting disease name variants, even if they are incomplete. For domain relevant, COVID-19 surveillance and syndromic surveillance terms, F-TFIDF-C_M obtained better results than C_Value, even when the frequency of relevant terms was low (from one to five out of ten terms). No common terms were extracted from (Title + Abstract) or from (Title + Content) using F-TFIDF-C_M. Using C_Value, only three common terms were extracted from Title + Content. Among the top 10 terms extracted from Title + Abstract with these metrics, seven were parts of disease MWEs. Regardless of the term category, we extracted more relevant terms from Titles and Abstracts than from Contents. This is in line with the fact that Titles and Abstracts are richer in key information and relevant terms due to their length limitations.

Table 6 Relevance of terms extracted from Papers2 depending on the metrics (C_Value or F-TFIDF-C_M)

Driven term extraction

We selected the terms extracted in Section Relevant term extraction: respiratory tract, viral infections, SARS coronavirus, incubation period, influenza virus, respiratory infections and infectious diseases. We extracted their variants with FASTR (Section The driven term extraction approach). An epidemiologist manually evaluated the relevance of 10 randomly selected variants per term. Among the 60 evaluated terms (see Table 10), 72% (43/60) were relevant and 7% (4/60) were irrelevant. For 13 variants (22%), the relevance could not be assessed because the expression was truncated and ambiguous, such as “disease has an infectious” for the term “infectious diseases”. FASTR thus seems to be an effective tool for generating term variants efficiently. However, we noted that FASTR generated up to 774 variants for a single term. Thus, to avoid random selection of terms, it would be interesting to compute a relevance index that could be used to rank the proposed variants. Besides, several extracted variants were fragments of expressions that could not be evaluated. This issue could be overcome by displaying the variant context (i.e. the sentence in which each variant appears).
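As a hypothetical baseline for the relevance index suggested above (an assumption of ours, not the method used in the study), variants could be ranked by their corpus frequency so that frequent variants are reviewed first:

```python
from collections import Counter

def rank_variants(variants, corpus_text):
    """Order variants by how often they occur in the corpus (most frequent first)."""
    freq = Counter({v: corpus_text.count(v) for v in variants})
    return [v for v, _ in freq.most_common()]

corpus = ("emerging infectious diseases are rising; emerging infectious diseases "
          "are monitored; disease has an infectious origin")
variants = ["emerging infectious diseases", "disease has an infectious"]
print(rank_variants(variants, corpus))  # ['emerging infectious diseases', 'disease has an infectious']
```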

Conclusion

In this paper we describe ITEXT-BIO, a generic methodology for biomedical term extraction. We show how it allows users to extract terms (or concepts) from different types of textual data using several combined strategies. The free term extraction approach extracts terms from corpora, while the driven term extraction approach extracts, from a corpus and a set of terms, a set of variations of these terms.

We illustrate that the proposed combined strategies based on statistical measures and textual segments help efficiently extract and categorize terms (representative, discriminant and new terms) from a corpus or corpora. We also quantitatively and qualitatively analysed the extracted terms to determine those related to the study domain and those that could be considered as emerging terminology for disease monitoring.

Our future studies will focus on term extraction and analysis by: (i) taking different sections of papers into account and applying the methodology to different types of corpora derived from newspapers or social media such as Twitter, (ii) considering combinations of tools other than BioTex, and (iii) introducing word embedding strategies like BERT [10] to capture semantic aspects of the extracted terms in order to reduce context ambiguity.