ITEXT-BIO: Intelligent Term EXTraction for BIOmedical analysis

Abstract

Here, we introduce ITEXT-BIO, an intelligent process for biomedical domain terminology extraction from textual documents and subsequent analysis. The proposed methodology consists of two complementary approaches, including free and driven term extraction. The first is based on term extraction with statistical measures, while the second considers morphosyntactic variation rules to extract term variants from the corpus. The combination of two term extraction and analysis strategies is the keystone of ITEXT-BIO. These include combined intra-corpus strategies that enable term extraction and analysis either from a single corpus (intra), or from corpora (inter). We assessed the two approaches, the corpus or corpora to be analysed and the type of statistical measures used. Our experimental findings revealed that the proposed methodology could be used: (1) to efficiently extract representative, discriminant and new terms from a given corpus or corpora, and (2) to provide quantitative and qualitative analyses on these terms regarding the study domain.

Introduction

The usefulness of terminology extraction from corpora is clearly acknowledged as it has generated a great deal of research and discussion. This well-established process is used in natural language processing and has led to the development of several tailored tools such as TBXTools [31], TermSuite [9], BioTex [22], etc.

Based on [22], our proposal deals with domain-based terminology extraction from heterogeneous corpora, and how to efficiently generate a quantitative and qualitative analysis. To this end, we propose a generic methodology hinged on a combination of extraction and analysis strategies. Term extraction strategies are based on combinations of linguistic, statistical measures, and corpus segmentation approaches, while analysis strategies are based on combinations of extracted terms.

Based on the combined strategies, ITEXT-BIO aims to extract: (1) representative terms, (2) discriminant or relevant terms, and (3) new relevant terms from a corpus or corpora. These strategies are specifically useful for dedicated tasks, such as corpus analysis, specific domain monitoring (e.g. epidemiology) or scientific research monitoring.

This paper is organized as follows. In Section Related work, we briefly present the state-of-the-art related to terminology extraction. Section Dataset description details the dataset dedicated to scientific papers. Sections Methodology and Experiments respectively provide an overview of our proposal and the experiments. In Section Case study: epidemic intelligence, we illustrate the genericity of the proposal by presenting a case study of an implementation of the combined strategies for epidemiological intelligence analysis. We conclude in Section Conclusion by presenting some perspectives for future studies.

Related work

Domain terminology extraction is a major focus of interest and discussion in natural language processing (NLP) research. It has prompted several proposals of methodologies [20, 32, 34, 36] geared towards effective extraction of terms within a given corpus. Also known as automatic term extraction (ATE), this task is considered in various NLP applications, such as in information retrieval [2, 4, 11, 37], topic modeling [15, 42], domain-based monitoring [1, 19, 27], keyword extraction [7] and summarization [2], ontology acquisition, thesaurus construction, etc.

According to [23], term extraction techniques can be categorized under four approaches: linguistic, statistical, machine learning and hybrid.

Overall, linguistic approaches take morphosyntactic part-of-speach (POS) rules into account to describe terms with common structures [5]. Statistical approaches use statistical measures such as term frequency [35, 43], or term co-occurrence between words and phrases like Chi-square [26]. Machine learning approaches use statistical measures and are mainly jointly focused on term extraction [7, 8, 12], classification [41] and summarization [2]. They combine linguistic and statistic approaches to extract terms from textual data in order to build machine learning models. In [7], the authors highlighted that most of these tasks are tackled with unsupervised learning algorithms. Hybrid approaches include, for instance, C_Value [34], C/NC_Value [13] methods, which combine statistical measures and linguistic based rules to extract multi-word and nested terms. In [6, 30], the authors combine rule-based methods and dictionaries to extract terms from Spanish biomedical texts and specialised Arabic texts respectively.

Studies such as [18, 21] related to these latter approaches have revealed the effectiveness and high performance of hybrid term extraction approaches.

The proposed methodologies apply to several domains. In [22], the authors proposed BioTex, a linguistic and statistical measure-based tool to extract terms related to the biomedical domain. The same approach was used in [1] to detect terms or signals for infectious disease monitoring on the web. In [28], a hybrid methodology was proposed to extract terminology for electronic heath records. This hybrid approach was also adapted by [44] to extract concepts related to Chinese culture.

The overall related studies have focused on techniques and methods for term extraction mainly from corpora. Based on existing methodologies, we oriented our study to develop an efficient approach for term extraction from heterogeneous corpora, along with a set of combined strategies to analyze these terms in the biomedical domain. Our methodology combines and tailors linguistic and statistic criteria associated with structural information in texts in order to highlight relevant terms therein. The presented strategies also aim to overcome the time-consuming issues related to machine learning methods which require manually annotated or partially annotated data.

Dataset description

Our study focused on the COVID-19 Open Research DatasetFootnote 1 [40] which contains scientific papers on COVID-19 and related historical coronavirus research. Throughout this study, we refer to the dataset as COVID19-MOOD-data.

The COVID19-MOOD-data dataset is divided into two main corpora, respectively named Papers1 and Papers2. Papers1 contains the commercial use subset (includes PubMed Central content), while Papers2 contains the commercial use subset (includes PubMed Central content), the non-commercial use subset (includes PubMed Central content) and the custom license subset.

Three data pre-processing operations are performed per corpus (Papers1, Papers2) in order to create three corpora according to the title, abstract and content:

  • Title represents the corpus that contains only paper titles;

  • Abstract represents the corpus that contains only paper abstracts;

  • Content represents the corpus that contains only paper contents.

We named them PapersX-title, PapersX-abstract and PapersX-content, respectively. See Table 1 for further details and Table 2 for the acronym definitions.

Table 1 Statistics related to the COVID19-MOOD-data dataset
Table 2 Table legend

Methodology

Here we outline two complementary term extraction and analysis approaches: the free term extraction approach and the driven term extraction approach. The first one is based on a combination of the type of corpus and the statistical measures, while the second is based on a combination of the type of corpus and the morphosyntactic variation rules.

The free term extraction approach

The free term extraction approach seeks to ensure that users will be able to extract significant terms related to a specific domain from a given corpus. As we mentioned in Section Related work, existing tools have been proposed for term and concept extraction. We opted for the BioTex tool to support the free term extraction mode for several reasons:

  • BioTex was initially built for medical domain term extraction.

  • BioTex uses hybrid measures (linguistic and several statistical measures) for the term extraction process.

  • Most existing tools (e.g. Maui-indexerFootnote 2, Topia TermextractFootnote 3, KEAFootnote 4, etc.) are designed for keyword extraction within single documents, and they only function for English language documents, while BioTex is tailored for terminology extraction and supports sets of documents (corpora) and multi-language use.

Three essential parameters related to the BioTex tool are defined below:

  • a corpus: this is the data source from which terms are extracted;

  • a statistical measure: as mentioned above, the BioText processing approach is based on linguistic and statistical measures. The linguistic parameter is defined by default, but the user must define the statistical parameter, as several exist, in order to run the term extraction process;

  • the number of words to be extracted per concept: so called n-grams, this concerns the length of the extracted terms and ranges from 1 to 4_g for BioTex.

In addition to these parameters, there is the number of linguistic patterns (like NN NN, JJ NP NP, NN NP NP, etc.) that can be associated, but this is preset at 20 by default in BioTex. BioTex also includes patterns for verb terms, such as: NN VBD NN NN, NP NN VBD NN NP, etc. Figure 1 outlines the overall three-step process for free term extraction.

At the end of the BioTex process, extracted terms are classified in two sets: TermSet, which only contains single word terms (SWT), and MultiTermSet, which contains multi-word terms (MWT). By using the Driven Extraction process (with FASTR), we can capture the entire term for a given incomplete one obtained during the first step (Free Extraction). The Driven Extraction process step uses incomplete terms to capture the entire terms in the document. For example, if “higher risk acute” or “higher risk area” terms are extracted in the Free Extraction process step, an entire term which could be“higher risk acute care area” will be obtained during the Driven Extraction process.

The driven term extraction approach

This extraction approach seeks to ensure that the terms extracted using BioTex could be used to improve the domain terminology. From a given term, the process aims to extract some variations of this term that exist in the corpus.

The overall processing under this approach is handled with FASTR [17]. FASTR is a rule-based linguistic tool that generates morphosyntactic variants of terms. We respectively note NN, NNS, NNP, NNPS for noun paterns, VB, VBD, VBG, VBN, VBP, VBZ for verbs, RB, RBR, RBS for adverbs and finally JJ, JJR, JJS for adjectives. It enables extraction of variants of a given term in full-text documents. For a given term, FASTR helps extract nearby or long terms that contain the initial term. Figure 1 illustrates the two steps (4 and 5) of the driven term extraction approach. For a given term, FASTR helps extract nearby or long terms that contain the initial one.

The driven process has the advantage of extracting relevant new terms that BioTex cannot extract from the corpus.

Fig. 1
figure1

The Free and Driven process for term extraction using BioTex and FASTR

Proposed combination for term extraction

Based on the elements given in Sects. The free term extraction approach and The driven term extraction approach, we propose a workflow in Fig. 2 for term extraction and analysis dedicated to scientific papers. We outline this workflow according to the type of corpus, measure, and approach:

  • The type of corpus: as described in the data section, for a given paper, we considered three parts to build the corresponding corpora, i.e. the Title (T), Abstract (A) and Content (C);

  • The measures: BioTex integrates several statistical measures, each of which uses a specific strategy to compute the term score. In this case, we selected the two measures C_Value and F-TFIDF-C_M. C_Value indicates the importance of terms that appear most frequently in a document, based on the idea that the frequency of appearance of a term in the document reflects its importance in the document. Moreover, based on frequency criteria, C_Value favors multi-word term extraction by taking into account nested terms (e.g. virus) in multi-word terms (e.g. influenza virus) [13]. F-TFIDF-C_M represents the harmonic mean of the two C_Value and TF-IDF values, which ranks terms by weight according to their relevance in the document while taking the whole corpus into account [24]. C_Value and F-TFIDF-C_M are complementary, as the first favors relevant MWT extraction while the second gives weight to discriminant terms. For each measure, the aim is to organise the extracted terms in to five sets. (1) Terms corresponding to the Title corpus Set(T), (2) terms corresponding to the Abstract corpus Set(A), (3) terms corresponding to the Content corpus Set(C), (4) terms that intersect within the Title and the Abstract corpus Set(TA), and (5) terms that intersect within the Title and the Content corpus Set(TC).

  • The approach: terms could be extracted using both a given corpus and a specific statistical measure in a free extraction approach. Moreover, for the driven process, term variations are extracted by using both a given corpus and specific set of terms. The set of terms could be defined from the output of the previous approach.

Fig. 2
figure2

Proposed combination for term extraction

Experiments

To set the parameters, throughout our study we used C_Value and F-TFIDF-C_M as statistical measures, 50 different patterns or term extraction rules, and a number of words ranging from 1 to 4-g (\(n = 1,2,3,4\)). These parameters are applied for corpora described in section 3. The choice of C_Value and F-TFIDF-C_M is based on the findings of previous studies [13, 24] which showed that both allow efficient SWT and MWT extraction.

Before applying BioTex, some specific pre-processes were applied for the Papers1-content and Papers2-content corpora due to their size. Papers1-content was divided into 09 sub-corpora (8 corpora of 1000 documents each and 1 corpus of 1315 documents) and Papers2-content into 32 sub-corpora (31 corpora of 1000 documents each, and 1 corpus of 1332 documents). Each corpus was partitioned into smaller units to enhance scalability. The results obtained from the smaller units were then composed by computing the average ranked values. The final rank for a given term was thus equal to the average of its ranked values in all sub-corpora in which it was present. The final result gave a set of terms, listed in ascending order according to the ranking values.

Table 3 shows an example of the MWT set obtained using BioTex. The Terms column contains the extracted terms, the in_umls column indicates if the corresponding term is available in the Unified Medical Language System (UMLS) Metathesaurus [3] or not, and rank shows the significance of the term based on statistical measures in the whole list of terms for a given corpus. In our study, we used the UMLS Metathesaurus as reference for the extracted terms as our study is linked to a biomedical terminology analysis. This comparison aimed to separate new terminologies or terminologies that were not yet listed in the Metathesaurus.

Table 3 Example of BioTex ouput

The free term extraction approach

We used BioTex, as outlined in Section The free term extraction approach, to extract terms from corpora in free mode. Several analyses are performed below on the obtained results. To this end, we conducted the experiments to address three main questions: (1) for each corpus, what are the most representative terms or domain concepts (terms that summarize the main content of the corpus) per statistical measure? (2) for each corpus, what are the most representative concepts for both measures? and (3) what are the discriminant and common concepts of the overall corpus?

For each case, we determined if the extracted terms exist or not in the UMLS Metathesaurus.

Corpus representative terms

In this section, we illustrate how representative terms can be extracted from different datasets. Based on the BioTex ranking measures, a term is more important than another one in a given corpus if it has a higher ranking than the other term.

Figure 3 shows representative terms for the Title, Abstract and Content corpora with the corresponding statistical measures (see Tables 7 and 8 for more details).

This figure highlights which terms are important in each part of the Papers. Note that the extracted terms are different for each measure and sub-corpus, but some of them are similar for both. For example, terms like public health, immune responses are extracted using both measures from the Abstract corpus.

In order to quantitatively display the number of representative intersecting terms from different corpora, we show common terms between Title vs Abstract, and Title vs Content corpora for the Papers2 corpus in Fig. 4. For both measures, Title terms are more representative in the Abstract than in the Content of Papers, i.e. 57% and 27% compared to 28% and 5%, respectively, for Title vs Abstract and Title vs Content. However, we noted that terms extracted with C_Value generated more common terms than those extracted with F-TFIDF-C_M. The common terms represent terms extracted at once in the Title, Abstract and Content corpus for each measure.

Fig. 3
figure3

Representative terms from Papers1

Fig. 4
figure4

Common terms in Papers2

As indicated, extracted terms were compared with the UMLS Metathesaurus. Table 4 shows the TOP@20 terms extracted for the Papers1-content corpus using C_Value and F-TFIDF-C_M measures. Bold terms are not in the UMLS Metathesaurus.

Table 4 TOP@20 terms extracted from Paper1-content using C_Value and F-TFIDF-C_M - SWTs vs MWTs

According to these TOP@20 terms, we can see that:

  • the majority of the SWTs are in the UMLS Metathesaurus for both statistical measures (C_Value or F-TFIDF-C_M);

  • for MWTs, several terms are not in the UMLS Metathesaurus. These terms can be categorized as:

  • UMLS sub-terms these are terms that do not exactly match to those present in the UMLS Metathesaurus but could be part of them. For example, health emergency is part of terms like Emergency Health Services in the UMLS Metathesaurus;

  • New terms these terms are not in the UMLS Metathesaurus, but are meaningful (or not) in the COVID-19 context. For example, terms like close contact relate to the COVID-19 contagion mode.

Figures 5 and 6 illustrate the number of terms out of the TOP@100 terms (in percentage) for each measure (C_Value, F-TFIDF-C_M) and dataset (Papers1-title, Papers2-title):

  • In_UMLS: the number of terms in the UMLS Metathesaurus;

  • Not_In_UMLS_V: the number of terms that do not exactly match the UMLS terms, but have some variants or are part of the UMLS terms;

  • Not_In_UMLS: the number of terms that do not match the UMLS terms at all. We indicate these as new terms. Terms which are not in the UMLS Metathesaurus but which could have greater meaning in the study context or which could be added to the UMLS Metathesaurus.

According to these statistics, we first note that the C_Value and F-TFIDF-C_M measures enable extraction of more conventional terms or terms in the UMLS Metathesaurus regardless of the corpus. Secondly, we note that F-TFIDF-C_M generates more new terms (Not In UMLS) than C_Value regardless of the corpus. Finally, the number of new terms is more substantial with MWTs (Fig. 6) than SWTs (Fig. 5) regardless of the measure.

Fig. 5
figure5

C_Value vs F-TFIDF-C_M SWTs

Fig. 6
figure6

C_Value vs F-TFIDF-C_M MWTs

Relevant term extraction from corpora for both measures

This involves quantitative and qualitative analysis of the terms extracted within each corpus, while taking both measures (C_Value and F-TFIDF-C_M) into account. In other words, it consists of analysing terms obtained for both measures, i.e. terms detected at the same time, and also terms specific to each of them.

The quantitative analysis aims to highlight, for each dataset, the number of terms obtained by each measure, the number of terms obtained for both measures, and which are available or not in the UMLS Metathesaurus. While the qualitative measure aims to highlight, in each case, how the terms obtained are important or not regarding the study domain.

For the data representation, we take advantages of Venn Diagram [16], see in Appendix Fig. 10 the distribution of the Papers2-title corpus terms. Terms are organised in different sections. For example, gene expression, human coronavirus, case report, public health, respiratory syncytial virus, etc. are available in UMLS Metathesaurus and are recognized by both measures (C_Value and F-TFIDF-C_M). According to the study domain, these terms will tend to be more representative and important in the whole corpus. Moreover, for each measure there are new terms which are not in the UMLS Metathesaurus.

Discriminant and common term extraction from corpora

In this case, term analysis is performed per dataset or by jointly considering multiple corpora, i.e. between Title, Abstract and Content corpora. Appendix Fig. 11 corresponds to discriminant and common term extraction from Papers1-title, Papers1-abstract and Papers1-content.

There are common terms in the overall corpus such as gene expression, virus replication, influenza virus, etc.. These terms tend to be relevant in the Title, Content and Abstract corpora.

Moreover, [respiratory infection, acute respiratory infection, etc.], [innate immune response, endoplasmic reticulum, etc.], and [nucleotide sequences, room temperature, etc.] are discriminant terms in the Title, Abstract and Content corpora.

The driven term extraction process

We performed a driven term extraction strategy using FASTR. Our proposal addresses two main questions: (1) For a given set of terms, how can new and relevant terms variants be extracted from a corpus based on the terms? (2) Do some of the new terms exist in the UMLS Metathesaurus? In our experiment, we used the common terms extracted in section 5.1.3 based on the fact that they were more representative and relevant throughout the corpora.

Figure 7 shows an example of variant terms extracted with the term infectious disease. Among these variants, we only show those which are not in the UMLS Metathesaurus since they are new and might be more informative.

Fig. 7
figure7

Example of term variants

Table 5 contains a list of TOP@10 variants extracted with six initial terms. Among them, we highlighted (in bold) terms matching terms in the UMLS Metathesaurus. Like free mode extraction, term variant mode may be used to extract useful terms.

Table 5 Term extraction variations using FASTR

Combined strategies for term analysis

Combined strategies for term analysis concern two levels: (1) Intra-corpus term extraction, and (2) Inter-corpus term extraction.

  • Combined intra-corpus term extraction strategies: these are geared towards extracting common or discriminant terms from a given corpus. To this end, extracted terms from both measures are compared. We show the process in Fig. 8, where the set of terms Set(Cp) extracted from the corpus Cp (Title, Abstract or Content) using each measure (C_Value, F-TFIDF-C_M) are jointly compared with the UMLS Metathesaurus terms. Set A represents corpus terms specifically extracted with C_Value, set B represents terms that are specific to F-TFIDF-C_M, while set C represents common terms from both measures and UMLS Metathesaurus elements. We consider that sets A and B are discriminant terms of the corpus according to the measures, and otherwise set C is considered as containing common terms or the most representative terms of the corpus. The new term extraction process with FASTR is run with one of the combined sets (discriminant or common) and the corpus.

  • Combined inter-corpus term extraction strategies: these are geared towards extracting common and discriminant terms, while taking several corpora into account for a given measure. As illustrated in Fig. 9, for each measure (C_Value or F-TFIDF-C_M), the sets of terms Set(Cp1), Set(Cp2), Set(Cp3) are extracted respectively from corpus Cp1 (Title), Cp2 (Abstract), and Cp3 (Content). These sets are compared in order to compute the common term set D for both corpora, and discriminant term sets A, B, C, respectively, for corpora Cp2, Cp1 and Cp3. In this context, new terms are extracted using one of the combined sets with one corpus (Cpx).

Fig. 8
figure8

Combined intra-corpus term extraction strategies

Fig. 9
figure9

Combined inter-corpus term extraction strategies

Case study: epidemic intelligence

Epidemic intelligence (EI) aims to detect, investigate and monitor potential health threats in a timely manner [33]. In addition to conventional surveillance system monitoring, such as outbreak notifications from the World Organisation for Animal Health (OIE), the EI process increasingly mainstreams unstructured data from informal sources such as online news. Several web-based surveillance systems have been developed and used to support public health and animal health surveillance (ProMED [25], HealthMap [14], GPHIN [29], PADI-web [38], etc.). In this case study, we focused on the choice keywords with the PADI-Web system for COVID-19 surveillance (i.e. driven surveillance) and for monitoring unknown diseases (i.e. syndromic surveillance).

The Platform for Automated extraction of Disease Information from the web (PADI-webFootnote 5) is an automated surveillance system for monitoring the emergence of animal infectious diseases, including zoonoses [1, 38]. PADI-web monitors Google News through specific really simple syndication (RSS) feeds, targeting diseases of interest (e.g. African swine fever, avian influenza, etc.). PADI-web also uses unspecific RSS feeds, consisting of combinations of symptoms and hosts (i.e. species), thus allowing syndromic surveillance and detection of unusual disease events. RSS feeds consists of combinations of different categories of terms (i.e. keywords) including symptoms, disease names and species.

PADI-Web has been used for monitoring COVID-19 disease [39]. In this context, the choice of COVID-19 surveillance terms is crucial.

In the following subsections, we discuss the choice of terms given by ITEXT-BIO to use in the PADI-Web system [38] and other web-based surveillance systems [14, 25, 29] for COVID-19 and syndromic surveillance. This enables evaluation of the relevance of terms generated by our approach for a dedicated task, i.e. web-based health surveillance.

Relevant term extraction

We compared the relevance of the top 10 terms extracted from Papers2 corpora with either C_Value or F-TFIDF-C (Table 6). Table 9 gives more details on these terms. The relevance was assessed by classifying the terms in one or more of the following categories:

  • COVID-19 surveillance: epidemiological terms specific to COVID-19 (e.g. coronavirus spike).

  • Syndromic surveillance: epidemiological terms not specific to a particular disease (e.g. infectious bronchitis).

  • Domain relevant: terms related to health, i.e. either to specific diseases (e.g. porcine epidemic diarrhoea) or unspecific (e.g. virus infections). The Domain relevant category thus includes the two previous categories, plus diseases other than COVID-19.

  • Part of disease multiword expression (MWE): part of a multiword expression corresponding to a disease name (e.g. East respiratory syndrome for Middle East syndrome coronavirus).

Among the terms extracted with C_Value from Titles, Abstracts or Titles and Abstracts, six to seven were parts of disease MWE. Only one term extracted with F-TFIDF-C_M was a part of disease MWE. C_Value could thus be of particular interest for extracting disease name variants, even if they are incomplete. For domain relevant COVID-19 surveillance and syndromic surveillance terms, F-TFIDF-C_M obtained better results than C_Value, even when the frequency of relevant terms was low (from one to five out of ten terms). No common terms were extracted from (Title + Abstract) or from (Title + Content) using F-TFIDF-C_M. Using C_Value, only three common terms were extracted from Title + Content. Among the top 10 terms extracted from Title + Abstract with these metrics, seven were parts of disease MWE. Regardless of the term category, we extracted more relevant terms from Titles and Abstracts than from Contents. This is in line with the fact that Title and Abstracts are more rich in key information and relevant terms due to their length limitation.

Table 6 Relevance of terms extracted from Papers2 depending on the metrics (C_Value or F-TFIDF-C_M)

Driven term extraction

We selected terms extracted in Section 6.1: respiratory tract, viral infections, SARS coronavirus, incubation period, influenza virus, respiratory infections and infectious diseases. We randomly extracted the variants with FASTR (Section The driven term extraction approach). An epidemiologist manually evaluated the relevance of 10 randomly selected variants per term. Among the 60 evaluated terms (see Table 10), 72% (43/60) were relevant and 7% (4/60) were irrelevant. For 13 variants (22%), the relevance could not be assessed because the expression was truncated and ambiguous, such as “disease has an infectious” for the term “infectious diseases”. FASTR thus seems to be an effective tool for generating term variants efficiently. However, we noted that FASTR generated up to 774 variants for a single term. Thus, to avoid random selection of terms, it would be interesting to compute a relevance index that could be used to rank the proposed variants. Besides, several extracted variants were fragments of expressions that could not be evaluated. This issue could be overcome by displaying the variant context (i.e. the sentence in which the variants appeared).

Conclusion

In this paper we describe ITEXT-BIO, a generic methodology for biomedical term extraction. We show how it allows users to extract terms (or concepts) from different types of textual data using several combined strategies. The free term extraction approach extracts terms from corpora, while the driven term extraction approach extracts, from a corpus and a set of terms, a set of variations of these terms.

We illustrate that the proposed combined strategies based on statistical measures and textual segments help efficiently extract and categorize terms (representative, discriminant and new terms) from a corpus or corpora. We also quantitatively and qualitatively analysed the extracted terms to determine those related to the study domain and those that could be considered as emerging terminology for disease monitoring.

Our future studies will focus on term extraction and analysis by: (i) taking different sections of papers into account and applying the methodology to different types of corpora derived from newspapers or social media such as Twitter, (ii) considering combinations of tools other than BioTex, and (iii) introducing word embedding strategies like BERT [10] to capture semantic aspects of the extracted terms in order to reduce context ambiguity.

Notes

  1. 1.

    https://www.semanticscholar.org/cord19/download

  2. 2.

    https://code.google.com/p/maui-indexer/.

  3. 3.

    https://pypi.python.org/pypi/topia.termextract/1.1.0.

  4. 4.

    http://www.nzdl.org/Kea/.

  5. 5.

    https://padi-web.cirad.fr/en/.

References

  1. 1.

    Arsevska E, Valentin S, Rabatel J, de Goër de Hervé J, Falala S, Lancelot R, Roche M. Web monitoring of emerging animal infectious diseases integrated in the French Animal Health Epidemic Intelligence System. PLOS ONE. 2018;13(8):e0199960. https://doi.org/10.1371/journal.pone.0199960.

    Article  Google Scholar 

  2. 2.

    Azarafza M, Feizi-Derakhshi MR, Shendi MB. Textrank-based microblogs keyword extraction method for Persian language. Conference: 3rd International Congress on Science and Engineering, Hamburg, Germany, 2020

  3. 3.

    Bodenreider O. The unified medical language system (umls): integrating biomedical terminology. Nucleic Acids Res. 2004;32(suppl 1):D267–70.

    Article  Google Scholar 

  4. 4.

    Bracewell DB, Ren F, Kuriowa S. Multilingual single document keyword extraction for information retrieval. In: 2005 International Conference on Natural Language Processing and Knowledge Engineering, IEEE, 2005, pp 517–522

  5. 5.

    Brill E (1992) A simple rule-based part of speech tagger. In: Proceedings of the third conference on applied natural language processing, Association for Computational Linguistics, USA, ANLC ’92, pp. 152–155. https://doi.org/10.3115/974499.974526

  6. 6.

    Campillos Llanos L, Sandoval AM, Guirao J. An automatic term extractor for biomedical terms in Spanish. In: Proceedings of the 5th international symposium on languages in biology and medicine (LBM 2013), 2013

  7. 7.

    Campos R, Mangaravite V, Pasquali A, Jorge A, Nunes C, Jatowt A. Yake! Keyword extraction from single documents using multiple local features. Inf Sci. 2020;509:257–89. https://doi.org/10.1016/j.ins.2019.09.013.

    Article  Google Scholar 

  8. 8.

    Conrado M, Pardo T, Rezende SO. A machine learning approach to automatic term extraction using a rich feature set. In: Proceedings of the 2013 NAACL HLT student research workshop, 2013; pp. 16–23

  9. 9.

    Cram D, Daille B. Terminology extraction with term variant detection. In: Proceedings of ACL-2016 system demonstrations, pp. 13–18, 2016

  10. 10.

    Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805, 2018

  11. 11.

    Duari S, Bhatnagar V. Complex network based supervised keyword extractor. Expert Syst Appl. 2020;140:112876.

    Article  Google Scholar 

  12. 12.

    Foo J. Term extraction using machine learning. Linkoping: Linkoping University; 2009.

    Google Scholar 

  13. 13.

    Frantzi K, Ananiadou S, Mima H. Automatic recognition of multi-word terms: the c-value/nc-value method. Int J Digit Lib. 2000;3(2):115–30.

    Article  Google Scholar 

  14. 14.

    Freifeld CC, Mandl KD, Reis BY, Brownstein JS. HealthMap: global infectious disease monitoring through automated classification and visualization of internet media reports. J Am Med Inf Assoc. 2008;15(2):150–7. https://doi.org/10.1197/jamia.M2544.

    Article  Google Scholar 

  15. 15.

    Habibi M, Popescu-Belis A. Keyword extraction and clustering for document recommendation in conversations. IEEE/ACM Trans Audio Speech Lang Process. 2015;23(4):746–59.

    Article  Google Scholar 

  16. 16.

    Ho SY, Tan S, Sze CC, Wong L, Goh WWB. What can Venn diagrams teach us about doing data science better? Int J Data Sci Anal. 2021;11(1):1–10.

    Article  Google Scholar 

  17. 17.

    Jacquemin C. Fastr: a unification-based front-end to automatic indexing. In: Intelligent multimedia information retrieval systems and management - Volume 1, Le centre de hautes études internationales d’informatique documentaire, Paris, FRA, RIAO ’94, pp. 34–47, 1994

  18. 18.

    Ji L, Sum M, Lu Q, Li W, Chen Y. Chinese terminology extraction using window-based contextual information. International conference on intelligent text processing and computational linguistics. Berlin: Springer; 2007. p. 62–74.

    Chapter  Google Scholar 

  19. 19.

    Joung J, Kim K. Monitoring emerging technologies for technology planning using technical keyword based analysis from patent data. Technol Forecast Soc Change. 2017;114:281–92.

    Article  Google Scholar 

  20. 20.

    Kageura K, Umino B. Methods of automatic term recognition: a review. Terminol Int J Theoret Appl Issues Special Commun. 1996;3(2):259–89.

    Article  Google Scholar 

  21. 21.

    Lossio-Ventura JA, Jonquet C, Roche M, Teisseire M. Biomedical terminology extraction: a new combination of statistical and web mining approaches. In: JADT: Journées d’Analyse statistique des Données Textuelles, pp. 421–432, 2014a

  22. 22.

    Lossio-Ventura JA, Jonquet C, Roche M, Teisseire M. Biotex: a system for biomedical terminology extraction, ranking, and validation. In: Proceedings of the 2014 International Conference on Posters & Demonstrations Track - Volume 1272, CEUR-WS.org, Aachen, DEU, ISWC-PD’14, pp. 157–160, 2014b

  23. 23.

    Lossio-Ventura JA, Jonquet C, Roche M, Teisseire M. Yet another ranking function for automatic multiword term extraction. International conference on natural language processing. Berlin: Springer; 2014c. p. 52–64.

    Google Scholar 

  24. 24.

    Lossio-Ventura JA, Jonquet C, Roche M, Teisseire M. Biomedical term extraction: overview and a new methodology. Inf Retriev J. 2016;19(1–2):59–99.

    Article  Google Scholar 

  25. 25.

    Madoff LC. ProMED-mail: an early warning system for emerging diseases. Clin Infect Dis. 2004;39(2):227–32.

    Article  Google Scholar 

  26. 26.

    Matsuo Y, Ishizuka M. Keyword extraction from a single document using word co-occurrence statistical information. Int J Artif Intell Tools. 2004;13(01):157–69.

    Article  Google Scholar 

  27. 27.

    Maynard D, Yankova M, Kourakis A, Kokossis A. Ontology-based information extraction for market monitoring and technology watch. In: ESWC workshop end user aspects of the semantic web, Heraklion, Crete, 2005

  28. 28.

    Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF. Extracting information from textual documents in the electronic health record: a review of recent research. Yearbook Med Inf. 2008;17(01):128–44.

    Article  Google Scholar 

  29. 29.

    Mykhalovskiy E, Weir L. The Global Public Health Intelligence Network and early warning outbreak detection: a Canadian contribution to global public health. Can J Pub Health. 2006;97(1):42–4.

    Article  Google Scholar 

  30. 30.

    Neifar W, Hamon T, Zweigenbaum P, Khemakhem ME, Belguith LH. Adaptation of a term extractor to Arabic specialised texts: first experiments and limits. International conference on intelligent text processing and computational linguistics. Berlin: Springer; 2016. p. 242–53.

    Google Scholar 

  31. 31.

    Oliver A, Vàzquez M. Tbxtools: a free, fast and flexible tool for automatic terminology extraction. In: Proceedings of the international conference recent advances in natural language processing, pp. 473–479, 2015

  32. 32.

    Pais V, Ion R. Termeval 2020: Racai’s automatic term extraction system. In: COMPUTERM, 2020

  33. 33.

    Paquet C, Coulombier D, Kaiser R, Ciotti M. Epidemic intelligence: a new framework for strengthening disease surveillance in Europe. Eurosurveillance. 2006;11(12):5–6. https://doi.org/10.2807/esm.11.12.00665-en.

    Article  Google Scholar 

  34. 34.

    Pazienza MT, Pennacchiotti M, Zanzotto FM. Terminology extraction: an analysis of linguistic and statistical approaches. Knowledge mining. Berlin: Springer; 2005. p. 255–79.

    Chapter  Google Scholar 

  35. 35.

    Ramos J, et al. Using tf-idf to determine word relevance in document queries. Proc First Instruct Conf Mach Learn. 2003;242:133–42.

    Google Scholar 

  36. 36.

    Rigouts Terryn A, Hoste V, Drouin P, Lefever E. Termeval 2020: Shared task on automatic term extraction using the annotated corpora for term extraction research (acter) dataset. In: 6th international workshop on computational terminology (COMPUTERM 2020), European Language Resources Association (ELRA), pp. 85–94, 2020

  37. 37.

    Shah H, Khan MU, Fränti P. H-rank: a keywords extraction method from web pages using POS tags. In: 2019 IEEE 17th international conference on industrial informatics (INDIN), IEEE, vol 1, pp. 264–269, 2019

  38. 38.

    Valentin S, Arsevska E, Falala S, de Goër J, Lancelot R, Mercier A, Rabatel J, Roche M. PADI-web: a multilingual event-based surveillance system for monitoring animal infectious diseases. Comput Electron Agric. 2020a;169:105163. https://doi.org/10.1016/j.compag.2019.105163.

    Article  Google Scholar 

  39. 39.

    Valentin S, Mercier A, Lancelot R, Roche M, Arsevska E. Monitoring online media reports for early detection of unknown diseases: Insight from a retrospective study of covid-19 emergence. Transboundary and emerging diseases. 2020b.

  40. 40.

    Wang LL, Lo K, Chandrasekhar Y, Reas R, Yang J, Eide D, Funk K, Kinney RM, Liu Z, Merrill W, Mooney P, Murdick DA, Rishi D, Sheehan J, Shen Z, Stilson B, Wade AD, Wang K, Wilhelm C, Xie B, Raymond DM, Weld DS, Etzioni O, Kohlmeier S. Cord-19: the covid-19 open research dataset. ArXiv abs/2004.10706. 2020a.

  41. 41.

    Wang R, Liu W, McDonald C. Featureless domain-specific term extraction with minimal labelled data. Proc Austral Lang Technol Assoc Workshop. 2016;2016:103–12.

    Google Scholar 

  42. 42.

    Wang X, Zhang L, Klabjan D. Keyword-based topic modeling and keyword selection. arXiv preprint arXiv:200107866. 2020b.

  43. 43.

    Whissell JS, Clarke CL. Improving document clustering using okapi bm25 feature weighting. Inf Retriev. 2011;14(5):466–87.

    Article  Google Scholar 

  44. 44.

    Yao Xm, GAN Jh, Jian X. Concept extraction based on hybrid approach combined with semantic analysis. DEStech Transactions on Engineering and Technology Research, 2017.

Download references

Acknowledgements

This study was partially funded by EU grant 874850 MOOD and is catalogued as MOOD 003. The contents of this publication are the sole responsibility of the authors and do not necessarily reflect the views of the European Commission. This study was partially funded by the French Agricultural Research Centre for International Development (CIRAD) the French General Directorate for Food (DGAL), and the SONGES Project (FEDER and Occitanie Region). The research was also supported by the French National Research Agency (ANR) under the Investments for the Future Program, referred to as ANR-16-CONV-0004, #DigitAg.

Author information

Affiliations

Authors

Corresponding author

Correspondence to Maguelonne Teisseire.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

See Tables 7, 8, 9, and 10.

Table 7 Best ranked terms extracted from Paper1 using F-TFIDF-C_M
Table 8 Best ranked terms extracted from Paper1 using C-Value
Table 9 Expanded terms from Table 6
Table 10 60 terms randomly selected from FASTR variants (Section The driven term extraction approach)

Appendix B

See Figs. 10 and 11.

Fig. 10
figure10

Distribution of concepts according to the measures and their presence in the UMLS Metathesaurus: from Papers2-title corpus

Fig. 11
figure11

Distribution of representative concepts when taking multiple corpora into account using C_Value: Papers1 corpora

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Kafando, R., Decoupes, R., Valentin, S. et al. ITEXT-BIO: Intelligent Term EXTraction for BIOmedical analysis. Health Inf Sci Syst 9, 29 (2021). https://doi.org/10.1007/s13755-021-00156-6

Download citation

Keywords

  • Biomedical terminology
  • Terminology extraction
  • Intelligent analysis