1 Introduction

In the field of Natural Language Processing (NLP), Named Entity Recognition (NER) is not a new notion. Since its introduction at MUC-6 in 1995, it has been a subtask of Information Extraction with substantial research reported. Named Entities (NEs) are textual items of interest, with people, organisations, locations, and numbers being common examples. The aim of NER is to recognise and categorise different types of named entities in structured or unstructured text. The domain has attracted considerable interest: starting with rule-based systems (Appelt et al., 1995; Weischedel, 1995), which required groups of experts to write rules by hand, such as orthographic patterns, followed by machine learning models (Bam & Shahi, 2014; Borthwick, 1999) and deep learning models (Lin et al., 2017; Chiu & Nichols, 2016; Devlin et al., 2019), the development of systems capable of recognising NEs has been considerable.

In the general domain, state-of-the-art models produce excellent results. However, research in specialised domains involving different NE classes has not grown at the same rate. The biomedical domain, for example, is expanding as medical records become more computerised and online biomedical research becomes more accessible. According to Microsoft (n.d.), PubMed adds two biomedical papers every minute, thousands every day, and over a million every year. NER is critical for NLP since NEs serve both as a referential base for finding information in texts and as important pieces of information in their own right. In the general domain, a news article could be summarised by extracting the five Ws (who, what, when, where, and why), each of which often corresponds to a NE (Zhang et al., 2004, p. 1). Similarly, the biomedical domain makes use of NER since denominations for genes, proteins, and diseases, among other things, are crucial pieces of information for researchers and biomedical experts in scenarios such as literature-based discovery and relation extraction. Because biomedical information is primarily published in English, data in other languages, especially NER datasets, are sparse. To fill this gap, this research seeks to overcome the lack of data in other languages by investigating and analysing the best opportunities for developing a bilingual model that can be used in both Spanish and English.

The available datasets for training NER and other NLP systems appear insufficient to handle all new information as the diversity and size of online data grow. Manually annotating data for the biomedical domain is costly due to its level of speciality. Moreover, biomedical NER faces challenges such as spelling variations, synonyms, and unknown vocabulary, which slow the development of new systems. In this paper we explore two data augmentation techniques to increase the size of the dataset: (a) translation of the dataset using a commercial machine translation system to create a dataset in another language; and (b) entity replacement, in which a new dataset is constructed by replacing part of the entities in the original dataset (Liu et al., 2020).

Additionally, in order to create a cross-lingual NER model, we propose the use of transfer learning, the process of training a model using previously learnt parameters from a pre-trained model (Hira et al., 2019), i.e., using the parameters of a model pre-trained on domain X to initialise a model for domain Y. Continuous training is defined as the sequential training of a system using previously obtained knowledge from one or more data sources. Transfer learning is commonly used to fine-tune general models on new domains or languages, with promising results using transformers. Saunders et al. (2019) employed this strategy across three domains: biology, health, and a general biomedical corpus. By first training a base model and then fine-tuning it on the other two domains, they demonstrated an increase of +7 BLEU points for English on the combination of biology + biomed.

The original methodology that we put forward in this paper will make it possible for additional languages to benefit from biomedical datasets for NER. Our novel approach is portable to other languages and, to the best of our knowledge, has not been proposed in the biomedical domain.

The following are this work’s main contributions:

  • The creation of a synthetic Spanish version of the CRAFT corpus for the biomedical domain, using a cheap translation approach inspired by back-translation. Using entity replacement, a separate version of the original English CRAFT was also produced.

  • The first bilingual NER system (ES-EN) for the biomedical domain, which achieved the second-highest F1 score compared to results reported in the literature for systems trained on the monolingual CRAFT dataset.

The rest of the paper is structured as follows: Section 2 surveys related work in the field and Sect. 3 presents our methodology, providing details on the experiments conducted. Section 4 discusses the evaluation results and finally Sect. 5 summarises the conclusions of this research.

2 Related work

Although biomedical NER is not a new concept, there is no global agreement on entity classes or annotation criteria, which has led to a handful of datasets with different entity classes and different annotation guidelines, creating inconsistency in the task. For instance, chemical entities are represented in the CHEMDNER dataset, which includes the tags Abbreviation, Family, Formula, Identifier, Multiple, No Class, Systematic, and Trivial (Krallinger et al., 2015). Another chemical dataset is BC5CDR, which also includes diseases (Li et al., 2016). The NCBI Disease corpus was created from 793 PubMed abstracts (Doğan et al., 2014). For protein/gene identification, the GENIA dataset has a total of 23,996 tagged genes/proteins (Tanabe et al., 2005). The CRAFT dataset also contains genes and proteins but is tagged following instructions from ontologies, which may differ from the annotation of other datasets. By combining the datasets provided by the MEDIQA challenge with a subset of the MedQuAD dataset, Lamurias & Couto (2019) presented research on data augmentation for the Question Answering task, reporting an increase of 0.015 in accuracy on the test set and a decrease of 0.02 on the development set.

The Spanish language has not seen the same progress in annotation. The first biomedical NER task in Spanish was PharmaCoNER, based on a chemical dataset containing four tags: No normalizables, Normalizables, Proteinas, and Unclear (Gonzalez-Agirre et al., 2019). Focusing on cancer vocabulary, Miranda-Escalada et al. (2020) created the CANTEMIST task for named entity recognition. This corpus contains the following entity types: Disease, Drugs, Unit of Measurement, Excipient, Chemical Composition, Pharmaceutical Form, Medicament, Food, Route, and Therapeutic Action, with a total of 2241 entities.

The lack of standard global guidelines has impeded a unified effort such as that seen in the general domain. As shown above, the entity types differ greatly across datasets in Spanish and English. Since the objective of this study is to build a single efficient bilingual NER system, using a separate dataset for each language in a single model is not viable, as it could be in the general domain with the CONLL task's datasets for Spanish and English. Rather, this would introduce inconsistency and poor performance into the NER system.

Early Named Entity Recognition systems used rule-based methods relying on techniques such as transducers (Appelt et al., 1995), pattern matching on dictionaries (Grishman, 1995), and lexical pattern matching (Weischedel, 1995), achieving F1 scores as high as 94. Although these systems performed well, considerable human effort was required to write and maintain the rules. They were followed by machine learning systems that used supervised methods. For instance, an HMM model was applied to the entity classes of proteins, RNA, DNA, and cells (Ponomareva et al., 2007), while a semi-Markov model demonstrated the capacity of such models to integrate information across all tokens in a segment (Leaman & Lu, 2016). Unsupervised methods required less manually annotated data. Representative research includes "KNOWITALL" by Etzioni et al. (2005), which uses bootstrapping to learn patterns from little annotated data; a model based on gazetteer creation (Nadeau et al., 2006); and Habib and van Keulen (2012), who presented a lookup strategy against knowledge bases such as Yago and DBPedia, along with an unsupervised method for disambiguation, showing the usefulness of online resources for obtaining and extracting information.

Although these early systems achieved competitive performance in the general domain, they were unable to cope with the dynamics of language and frequently performed poorly on unseen data. To tackle this, deep learning methods based on neural networks (NNs) were employed. Such NNs can learn character- and word-level representations, as well as sentence-level relationships, within their learning pipeline. For a diseases dataset, Sahu and Anand (2016) obtained an F1 score of 76.39 with a Convolutional Neural Network (CNN) plus character embeddings, while a bi-directional long short-term memory (BiLSTM) character embedding approach achieved an F1 score of 76.94. As in the general domain, BiLSTM-CRF (Conditional Random Field) models are popular in the biomedical domain (Cho & Lee, 2019; Li et al., 2019; Wang et al., 2019), with different combinations of character and word embeddings and features such as POS tags.

Contextualised word representations based on transformers are the state of the art for many NLP tasks, and biomedical NER is no exception. Lee et al. (2020) introduced BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), based on BERT and trained on PubMed abstracts and PMC full-text articles. Beltagy et al. (2019) re-trained BERT on biomedical data from Semantic Scholar, a corpus of 1.14 million full papers, to create SciBERT. Carrino et al. (2021) trained a RoBERTa-based transformer for Spanish using data from Scielo, Wikipedia, patents, EMEA, and PubMed, among others. Finally, Jofche et al. (2022) presented a platform for analysing pharmaceutical documents that uses a transformer-based architecture to perform NER and coreference resolution on the BC5CDR and BioNLP15CG datasets in English.

Biomedical NER faces a host of issues: the style and speciality of biomedical data; ambiguity, as some names can refer to more than one entity class; and the constant discovery and coining of new terms, which creates the problem of unknown words, among other challenges. For Zhao et al. (2021, p. 5), the difficulties of biomedical NER stem from the many variations of biomedical terms: (1) small variations such as typos, hyphens, or capitalisation, e.g., ‘FOXP2’ and ‘FOX-P2’; (2) synonyms and abbreviations; and (3) unseen entities. Additionally, for Cho and Lee (2019, p. 5), the difficulties are entity boundaries, compound noun phrases, bracket-enclosed entities, nested entities, and corpus annotation inconsistency. These variations complicate both the recognition and the normalisation of such entities. Another challenge in biomedical NER is the so-called “long-tail” NEs, i.e. “named entities that are rare, often relevant only in specific knowledge domains, yet important for retrieval and exploration purposes” (Liu et al., 2020b, p. 79).

On the basis of the above discussion, one might conclude that biomedical NER is far from solved, even with state-of-the-art NLP systems. Efficient NLP techniques are needed to automate the extraction and analysis of biomedical data and would help synchronise efforts across languages.

Previous studies have tested the ability of transformer systems to generalise and provide competitive results when trained in another language. Sun and Yang (2019) tested the usability of mBERT (without biomedical training) and BioBERT (without training on Spanish data) on the PharmaCoNER dataset. They reported competitive F1 scores for both systems (89.24 and 89.02, respectively), demonstrating the success of current systems in zero-shot transfer. Hakala and Pyysalo (2019) presented an approach using mBERT for Spanish biomedical named entity recognition without further training, achieving an F1 score of 87 on the PharmaCoNER test set.

Mueller et al. (2020) used diverse data in multiple languages to examine the polyglot capabilities of transformers. They showed that multilingual models share a large number of parameters, that language-specific training draws on those shared parameters, and that these models preserve the top 5% of weights for each language. This work indicates that transfer learning is a viable way to create cross-lingual NLP systems. Saunders et al. (2019) also tested transfer learning for cross-lingual Machine Translation (MT) models, improving the BLEU score by 7 points by starting with a general-domain MT system and training it into a biomedical MT system.

Mayhew et al. (2017) used dictionaries and language resources such as PANLEX to translate an English dataset into several languages, creating a phrase translation table in which the labels are copied from the source to the target sentence. They also tested translating a dataset with Google Translate but reported difficulties with the alignment and projection of entities into the new language, which, according to the authors, resulted in a noisy dataset with incorrect entity tags. Finally, Li et al. (2020) created a model that labels a bilingual corpus and performs NER, then uses GIZA++ for alignment and extracts NE translation pairs, which are ranked by calculating their mutual information (MI) value.

3 Methodology and experiments

The selection of the dataset(s) is one of the most important factors to consider when training a system for NER (or any other task). A dataset with wider coverage of NEs would be more beneficial to Spanish users, especially if the NEs it contains have never been explored before. In the general domain there are datasets for English and Spanish with the same number of entity tags, the same annotation guidelines, and the same class names, such as the CONLL NER task. This is not the case in the biomedical domain, where most datasets contain only a small number of NE classes based on the task at hand. As previously stated, using a pair of gold-annotated datasets from the same family, one per language, is therefore ruled out.

This study uses the CRAFT corpus to train the NER systems, as it is one of the datasets with the largest number of entity tags. The CRAFT corpus is a manually annotated collection of 97 full-text articles from the biomedical domain (version 1 includes 67 articles), with around 100,000 annotations drawn from nine ontologies: the Cell Type Ontology (CL), the Chemical Entities of Biological Interest (CHEBI) ontology, the NCBI Taxonomy (Taxon), the Protein Ontology (PR), the Sequence Ontology (SO), entries of the Entrez Gene database, and three subontologies of the Gene Ontology (GO) (Bada et al., 2012). Our goal is to develop a system able to transfer the knowledge of one of the biggest and most diverse datasets for biomedical NLP from English to Spanish.

The entities contained in the CRAFT corpus are described by Bada et al. (ibid. pp. 11–17) as:

  • Chemical: Chemical Entities of Biological Interest refer to atoms, biomedical roles and applications, subatomic particles, molecules, and polyatomic entities.

  • Cell: All cell mentions except the types of cell line cells.

  • Gene: Biological processes, including those at the molecular level, and subcellular structures; also cellular components representing subcellular structures, both intracellular and extracellular, as well as macromolecular complexes.

  • Taxon: Biological taxa and their corresponding organisms.

  • Protein: Based on the Protein Ontology, without regard to sequence type.

  • Sequence: Biomacromolecular sequences, attributes, and processes.

We normalised the annotation in the dataset: instead of three-letter codes for entities, we used the full name of each entity class: Protein, Cell, Taxon, Sequence, Chemical, and Gene. The dataset uses the IOB annotation scheme: B- for the beginning of an entity, I- for tokens inside it, and O for tokens not corresponding to any tag, e.g., B-Protein and I-Taxon. The dataset is divided into three parts: a training set (10,875 elements), a test set (7425 elements), and a validation set (3730 elements).
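To make the normalisation and IOB scheme concrete, the short Python sketch below (ours, not part of the CRAFT distribution; the code-to-name mapping is illustrative only) renames tag suffixes while preserving the B-/I- prefixes:

```python
# Minimal sketch of the tag normalisation described above: entity codes are
# mapped to full class names while the IOB prefixes are preserved.
# The code->name mapping is an assumption for illustration; the actual codes
# depend on the corpus files.
CODE_TO_NAME = {
    "PR": "Protein", "CL": "Cell", "SO": "Sequence",      # assumed codes
    "CHEBI": "Chemical", "GO": "Gene", "TAX": "Taxon",    # assumed codes
}

def normalise_tag(tag: str) -> str:
    """Map e.g. 'B-PR' -> 'B-Protein'; leave 'O' untouched."""
    if tag == "O":
        return tag
    prefix, code = tag.split("-", 1)        # 'B' or 'I', then the code
    return f"{prefix}-{CODE_TO_NAME.get(code, code)}"

# One CONLL-style (token, tag) fragment before and after normalisation:
fragment = [("chromaffin", "B-GO"), ("granules", "I-GO"), ("are", "O")]
print([(tok, normalise_tag(tag)) for tok, tag in fragment])
# [('chromaffin', 'B-Gene'), ('granules', 'I-Gene'), ('are', 'O')]
```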

One important point to note is that the CRAFT dataset has no Spanish equivalent. Given the scarcity of Spanish datasets for the biomedical domain, we would like to offer new ways to improve biomedical systems for Spanish users. As a result, we used data augmentation techniques to boost the dataset's size.

3.1 Data augmentation

Data augmentation is a strategy to increase a system’s performance by generating more training data. Two strategies for constructing the datasets used here are presented next: cheap translation and entity replacement.

3.1.1 Cheap translation

This initial data augmentation strategy for training a bilingual NER system was inspired by the success of back-translation in NMT, i.e., the production of a synthetic dataset via machine translation. According to Edunov et al. (2018, p. 1), “The result is a parallel corpus where the source side is synthetic machine translation output while the target is genuine text written by humans.” This method has been shown to improve MT system performance and introduce previously unseen words and entities. The authors also report “that synthetic data can achieve up to 83% of the performance attainable with real bitext”, especially if the data matches the domain of the model (Edunov et al., ibid., p. 497).

Google Translate is one of the most widely used machine translation technologies in the world: it supports over 100 languages, and its mobile app has been downloaded over 1 billion times (Pitman, 2021). Caswell and Liang (2020) reported a change to the transformer architecture of Google’s MT system, which resulted in an average increase of +5 BLEU for high-resource languages and +7 BLEU for low-resource languages. This system was chosen to translate the CRAFT dataset into Spanish because of its constant development, widespread use, and M4 modelling for multilingual transfer learning. This technique is, as Mayhew et al. (2017, p. 3) called it, a “cheap translation” method for obtaining more data (even more cheaply than back-translation itself), as no Spanish monolingual data was back-translated to create an English system, and no parallel data was used. The resulting synthetic dataset is used to train NER systems with different methods. NMT systems, especially those that are not domain-specific, such as Google’s, are prone to errors, so this MT-translated dataset is not intended to be a gold-standard corpus for developing Spanish-only systems.

Mayhew et al. (ibid.) used dictionaries and Google to translate an English dataset. Their dataset was constructed by translating sentences one at a time, using fast_align to obtain alignments, and then projecting the tags, which, according to the authors, resulted in errors in the tag projection and noise. We used a similar process to create a new dataset for Spanish, but without dictionaries. Because the CRAFT dataset is preformatted in the CONLL format, we reconstructed real sentences from all of the words with an “O” tag, as well as the multi-word NEs. We fed Google Translate with the sentences separated from the NEs, whether single words or multi-word expressions, and then mapped the tags back to avoid sentence misalignment errors. The translation process is as follows:

  • Concatenate words with the same tag to create sentences, as NMT systems tend to create better output when a sentence is provided instead of a single word.

  • Preserve a unique tag per sentence for future mapping. Step 2 in Fig. 1 shows that by grouping tokens with the same tag and recreating sentences, a single tag is kept, removing the “B” or “I” prefix; thus “chromaffin granules”, tagged as [chromaffin B-Gene, granules I-Gene], keeps only the Gene tag. Since translation might change the order of the words, the beginning of the entity, and hence the “B” prefix, might move. In this example, the Spanish translation “gránulos de cromafina” is later reconstructed by assigning the beginning tag to “gránulos” instead of “cromafina” as in the English sentence.

  • Pass the datasets (training, test, and development) to Google Translate in an .xlsx format to preserve the order of the sentences and ensure the tags match.

  • Collect the output from Google Translate.

  • Assign the labels matching the original annotation previously stored. Since the reconstruction created sentences, a single tag is assigned per sentence.

  • Tokenise each sentence and assign the proper tag using the BIO scheme. All “O” sentences keep only the “O” tag, whereas the entities get “B” for the beginning of the entity and “I” for the rest of the tokens in composed entities.
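To make the pipeline concrete, the following Python sketch implements the grouping and re-tagging logic under the assumption that sentences are lists of (token, tag) pairs; the translate() function is a placeholder for the external MT step, which in our setup was Google Translate fed via an .xlsx file:

```python
# Minimal sketch of the cheap-translation steps above. Assumptions: CONLL
# sentences are lists of (token, tag) pairs; translate() stands in for the
# commercial MT call.
from itertools import groupby

def group_by_tag(sentence):
    """Steps 1-2: concatenate adjacent tokens sharing a tag (B-/I- prefix
    stripped), keeping a single tag per segment for later mapping."""
    segments = []
    for tag, run in groupby(sentence, key=lambda t: t[1].split("-")[-1]):
        segments.append((" ".join(tok for tok, _ in run), tag))
    return segments

def retag(translated_text, tag):
    """Steps 5-6: tokenise the MT output and reassign BIO tags; the 'B'
    prefix goes to the first token regardless of the original word order."""
    tokens = translated_text.split()
    if tag == "O":
        return [(tok, "O") for tok in tokens]
    return [(tok, ("B-" if i == 0 else "I-") + tag) for i, tok in enumerate(tokens)]

def translate(text):   # placeholder for the commercial MT system
    return {"chromaffin granules": "gránulos de cromafina"}.get(text, text)

sentence = [("chromaffin", "B-Gene"), ("granules", "I-Gene")]
for segment, tag in group_by_tag(sentence):
    print(retag(translate(segment), tag))
# [('gránulos', 'B-Gene'), ('de', 'I-Gene'), ('cromafina', 'I-Gene')]
```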

Fig. 1 Steps involved in the translation process of a real example taken from the CRAFT corpus

Figure 1 shows the translation process of a real sentence from the development subset of the CRAFT corpus.

This process creates a new, translated Spanish CRAFT dataset. According to Mueller et al. (ibid.), polyglot systems tend to preserve essential weights for each language as well as sharing parameters. The construction of this Spanish dataset is therefore a strategy to improve performance in a transfer learning setting. Table 1 shows an example of the final output of a sentence in the original English dataset and the output obtained from the translation system.

Table 1 Comparison of an original sentence from the CRAFT corpus and the output of the same sentence after using a commercial MT system

Due to the sentence grouping and reconstruction process, we anticipate some translation errors. Because sentences are split before and after an entity, errors in gender, number, and word order might arise. The emphasis is placed on the accurate reconstruction of multi-word entities and the precise alignment of the original dataset with the synthetic one, to ensure confidence in the tag projection from English to Spanish. Although the translation technique is not fully automated, it guarantees that the noise caused by aligners and tag projection is completely avoided, resulting in a significantly more reliable synthetic dataset. Further testing will try to mitigate the remaining noise by employing pre-trained models and transfer learning.

3.1.2 Entity replacement

In order to increase system performance and create a robust cross-lingual NER system, we follow Liu et al.’s (2020) approach to data augmentation: entity replacement. Existing entities are randomly replaced with unseen entities to create new datasets. We compiled a set of entities based on the official ontologies used to annotate the entities in the original CRAFT corpus, and cross-checked this list against the original dataset’s vocabulary to ensure that all items retrieved from the ontologies are in fact new.

In total, 8,846 new entities were collected. The data gathered from the official ontologies is open-source and available for download from the ontologies’ websites.

To mirror the original format of the corpus, the list of monolingual English entities was formatted into the CONLL format with the BIO scheme: [ENTITY, TAG]. 20% of the entities in the training, test, and evaluation sets were replaced at random from the list corresponding to each tag. We attempted to replace the same number of entities for each tag so as to include a balanced number of new entities. 7,161 entities were replaced in the training set, 5,941 in the test set, and 2,267 in the evaluation set. This was saved as a new dataset for use in the experiments.
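The replacement step itself can be expressed compactly. The sketch below is ours (the new_entities mapping from tag to ontology-derived entity lists is assumed to have been built as described above); it replaces a random 20% of the entity mentions per tag and re-emits BIO tags for multi-word replacements:

```python
import random

def replace_entities(sentences, new_entities, ratio=0.2, seed=13):
    """Replace `ratio` of the entity mentions of each tag with unseen
    entities from ontology-derived lists (new_entities: tag -> list of
    strings). `sentences` are lists of (token, tag) pairs in BIO format."""
    rng = random.Random(seed)
    mentions = {}                  # tag -> [(sentence_idx, start, end), ...]
    for s_idx, sent in enumerate(sentences):
        i = 0
        while i < len(sent):
            tag = sent[i][1]
            if tag.startswith("B-"):
                j = i + 1
                while j < len(sent) and sent[j][1] == "I-" + tag[2:]:
                    j += 1
                mentions.setdefault(tag[2:], []).append((s_idx, i, j))
                i = j
            else:
                i += 1
    chosen = []                    # sample per tag to keep classes balanced
    for tag, spans in mentions.items():
        for span in rng.sample(spans, int(len(spans) * ratio)):
            chosen.append((tag, span))
    # Apply right-to-left within each sentence so earlier offsets stay valid.
    for tag, (s_idx, start, end) in sorted(chosen, key=lambda c: (c[1][0], -c[1][1])):
        tokens = rng.choice(new_entities[tag]).split()
        sentences[s_idx][start:end] = [
            (tok, ("B-" if k == 0 else "I-") + tag) for k, tok in enumerate(tokens)
        ]
    return sentences
```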

We concatenated some of the datasets to create different training instances that can be used to compare continuous training against dataset concatenation.

We have a total of five datasets at this point:

1. CRAFT English (original)

2. CRAFT Spanish (MT-translated)

3. CRAFT EN + ES (concatenated)

4. CRAFT EN + EN Augmented (concatenated)

5. CRAFT EN + ES + EN Augmented (concatenated)

The CRAFT EN + ES dataset combines the English and Spanish data and contains 3,596 distinct entities. CRAFT EN + EN Augmented is the concatenation of the original dataset with the augmented version obtained by entity replacement; it has 4,876 distinct entities. The final and largest dataset, CRAFT EN + ES + EN Augmented, concatenates the original English dataset, the Spanish version, and the augmented English version, for a total of 6,131 distinct entities.
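For reference, distinct-entity counts such as those above can be recomputed directly from the concatenated CONLL files; a minimal sketch (assuming ‘TOKEN TAG’ lines with blank lines between sentences; the file names are hypothetical) follows:

```python
def distinct_entities(conll_paths):
    """Count distinct entity surface forms across one or more CONLL files
    (assumed format: 'TOKEN TAG' per line, blank line between sentences)."""
    entities, current = set(), []
    for path in conll_paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                parts = line.split()
                if len(parts) == 2 and parts[1] != "O":
                    if parts[1].startswith("B-") and current:
                        entities.add(" ".join(current))   # flush previous span
                        current = []
                    current.append(parts[0])
                else:                                      # 'O' or sentence break
                    if current:
                        entities.add(" ".join(current))
                        current = []
    if current:
        entities.add(" ".join(current))
    return len(entities)

# e.g. distinct_entities(["craft_en.conll", "craft_es.conll"])  # CRAFT EN + ES
```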

3.2 Named entity recognition systems training

Inspired by the success of evaluating a monolingual system’s capabilities in another language (Sun & Yang, 2019; Hakala & Pyysalo, 2019), we employed the pre-trained BioBERT, a variant of the well-known BERT pre-trained on English biomedical texts. The second transformer is “roberta-base-biomedical-clinical-es”, a RoBERTa-based transformer pre-trained on Spanish domain-specific corpora. Even though the literature has shown promise for zero-shot transfer, we opted to investigate transfer learning approaches to develop more robust systems and a single consolidated bilingual system. This study aims to verify whether the efficacy demonstrated by back-translation, coupled with the advantages of transfer learning, extends to NER.

The presence of noise in synthetic datasets is an inherent challenge. As indicated in prior studies, transfer learning allows systems to retain weights from both languages (and datasets), share weights between languages, and handle both languages more efficiently without resorting to two separate systems (Mueller et al., ibid., p. 8101). This study tests this strategy in NER. On the one hand, we test whether the noise and errors produced by the cheap translation affect the F-scores of the final NER models, or whether pre-trained large language models combined with transfer learning can use their learnt knowledge to prevail over the noise. On the other hand, since one of the main issues of biomedical NER is unknown words, we introduce a larger vocabulary through augmentation. These systems can also benefit from learning words from two different languages in different training instances.

We conducted two types of fine-tuning: direct fine-tuning and continuous training. Direct fine-tuning uses the concatenated data from the previously constructed augmented datasets, such as the CRAFT EN + ES dataset; this strategy employs one system and one dataset to produce a single output, so no additional training or fine-tuning is required. Continuous training, on the other hand, fine-tunes a model on an initial dataset, e.g. CRAFT EN, and then fine-tunes the resulting system on a new dataset, e.g. CRAFT ES. Because, to the best of our knowledge, augmentation methods have not been employed in the biomedical NER task, we were motivated to evaluate both training strategies.

Since transfer learning is prone to catastrophic forgetting, in which a portion of the original weights is overwritten by the new training, we chose to test both training strategies and compare the performance of the different datasets under each type of training. We trained fourteen systems: two base systems on the original English-only dataset, six systems using direct fine-tuning with the concatenated datasets, and the remaining six using the continuous training approach for English plus Spanish and entity replacement with the augmented datasets (EN Augmented).

All systems were fine-tuned with the following hyperparameters: learning rate 3e-05, training batch size 8, optimiser Adam (betas 0.9 and 0.99, epsilon 1e-08), and 4 epochs; a training sketch is given after the list below. The following is a list of the names that will be used to refer to the systems.

Base models:

  1. BioBERT EN

  2. RoBERTa EN

Direct fine-tuning:

  1. EN + ES (concatenated)

  2. EN + EN Augmented (concatenated)

  3. EN + ES + EN Augmented (concatenated)

Continuous training:

  1. EN + ES

  2. EN Augmented + EN

  3. EN Augmented + ES

For each transformer, each training approach was used once. The results of the training will be discussed in Sect. 4.
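As a concrete reference, the sketch below shows how one continuous-training stage can be set up with the HuggingFace transformers Trainer, using the hyperparameters listed above. The checkpoint identifier, dataset objects, and output paths are assumptions for illustration, and the tokenisation/label-alignment preprocessing is omitted; this is not the exact script used in our experiments:

```python
from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments)

# Label set after normalisation (Sect. 3): six classes in the BIO scheme.
LABELS = ["O"] + [p + c for c in
                  ["Protein", "Cell", "Taxon", "Sequence", "Chemical", "Gene"]
                  for p in ("B-", "I-")]

def finetune(checkpoint, train_dataset, out_dir):
    """One fine-tuning stage with the hyperparameters reported above."""
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=len(LABELS))
    args = TrainingArguments(
        output_dir=out_dir,
        learning_rate=3e-5,
        per_device_train_batch_size=8,
        num_train_epochs=4,
        adam_beta1=0.9, adam_beta2=0.99, adam_epsilon=1e-8,
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()
    trainer.save_model(out_dir)   # this checkpoint seeds the next stage
    return out_dir

# Continuous training EN -> ES: the second stage starts from the first's
# weights ("dmis-lab/biobert-v1.1" is an assumed hub id; craft_en/craft_es
# are preprocessed token-classification datasets).
# stage1 = finetune("dmis-lab/biobert-v1.1", craft_en, "biobert-craft-en")
# stage2 = finetune(stage1, craft_es, "biobert-craft-en-es")
```

Direct fine-tuning corresponds to a single finetune() call on the concatenated dataset instead of two sequential calls.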

4 Results

NER training was carried out with various combinations of dataset and fine-tuning approach. We describe the two training approaches: direct fine-tuning and continuous training. The implemented systems are available on HuggingFace.

Direct fine-tuning feeds concatenated datasets to the transformer and performs all the training in the specified number of epochs, without using one system’s trained weights to initialise the next. Six systems were trained on the datasets EN + ES, EN + EN Augmented (concatenated), and EN + ES + EN Augmented (concatenated): for each dataset combination, one BioBERT model and one RoBERTa-ES model.

Continuous training uses a pre-trained system as the starting point for fine-tuning another, making use of its learnt weights. We first fine-tuned the models on the first dataset and then, using the weights of the trained model as the initialiser, fine-tuned them on the second dataset. The two base models at the bottom of Table 2 are the starting point for continuous training with the different datasets. Six systems were trained: EN + ES, EN Augmented + EN, and EN Augmented + ES; for each dataset combination, one BioBERT model and one RoBERTa-ES model.

Table 2 Results of the different training methods for the NER systems, showing both the continuous training and the direct fine-tuning models

Data augmentation increased the performance of all systems; it is therefore safe to say that it is beneficial for training. The success of continuous training over direct fine-tuning can be attributed to the additional fine-tuning phase, which the direct fine-tuning approach lacks. Note that catastrophic forgetting, and strategies to prevent it such as Elastic Weight Consolidation (EWC), are outside the scope of this paper and will be pursued in future work.

Concatenating the datasets gives competitive results, but not the highest scores. The best concatenation scores were obtained without further data augmentation (i.e., adding only the Spanish dataset). This suggests that concatenation with entity-replacement-augmented data, at least with the proposed technique, does not yield satisfactory results and can even perform below the base models. The basic concatenated models (EN + ES) kept the top scores in the direct fine-tuning category; however, their fine-tuning times were double those of continuous training, which makes continuous training more suitable for local training.

Translating the dataset to create a Spanish version proved beneficial, enhancing the model’s knowledge and yielding higher F1 scores. Notably, the two systems with the highest scores, 86.39 and 86.25, were both trained on the Spanish dataset. Surprisingly, concatenating the dataset with the augmented datasets showed that performance figures do not rise but rather drop under concatenation. Table 2 shows the results of all training instances with the different datasets. Both base models at the bottom of the table are monolingual. Results are reported on the evaluation set.

Most existing research on Named Entity Recognition (NER) with the CRAFT dataset makes use of OGER (OntoGene’s Entity Recogniser), a library implementing dictionary-based NER given a terminology list, ontology, or dictionary. Rinaldi, Basaldella, and Furrer have conducted most of the research using CRAFT and OGER in various settings. Rinaldi et al. (2017) used the same ontologies as CRAFT to build a dictionary and employed OGER to match the entities in the text, with a neural network (NN) acting as a post-filter for the dictionary annotation. Subsequently, Basaldella et al. (2017) adopted the same architecture and added a distiller to filter relevant entities before passing them to the NN; the authors pointed out that "Since the NN output depends heavily on OGER’s input, many of its mistakes are caused by the quality of the dictionary matching." OGER++ employs a disambiguation filter by training an NN on the CRAFT corpus and, in addition to acting as a filter, provides a probability distribution over all entity types (Furrer et al., 2019a). Furrer et al. (2019b) revisited the NER task on CRAFT with BioBERT and BiLSTM architectures, predicting not only entities but also identifiers and combinations of entity predictions with OGER annotations; the best results were obtained with BioBERT. Finally, Furrer et al. (2021) tested parallel NER + NEN (Named Entity Normalisation) training on a BioBERT architecture, using OGER as before, and tested harmonisation techniques to combine the predictions of two classifiers into one.

Crichton et al. (2017) used a multi-task learning method over 15 datasets, with each dataset representing a task, including AnatEM, BC2GM, BioNLP09, and others. The goal was to exploit transfer learning by sharing layers of a Convolutional Neural Network (CNN) among datasets, similar to the transfer learning capabilities of transformer architectures such as the BioBERT used in our research. They reported F1 score increases of the multi-task models over the single-task (single-dataset) models of 6.3% for some datasets and 1.1% for others. Lastly, the popular spaCy Python library provides a NER model based on a transition system with a chunking model and reports findings on several datasets, offering two models that, while not outperforming the models in the literature, are competitive baselines for 5 of the 9 datasets.

Like spaCy, our research diverges from using external resources during training and inference. Research based on OGER requires constructing a dictionary or ontology to perform dictionary matching. While we extracted lists of entities from the corresponding ontologies to perform data augmentation, our model requires neither dictionary matching nor the preprocessing steps associated with it. Leveraging the power of transformer architectures, data augmentation, and transfer learning, we aimed to train a model capable of recognising the patterns by which entities present themselves in texts, not in one language but in two.

As shown in Table 3, our top system outperforms the second-best system reported in the literature for the same dataset (Crichton et al., 2017) by +7.33 F1 points. The top-performing system, by Furrer et al. (2021), is a combination of BioBERT and OGER that uses a dictionary-based approach and fares better than our best-performing system by only 0.46 F1 points. Given that our system also performs in another language and does not employ dictionary-based techniques, these competitive results hold promise (Fig. 2).

Table 3 F1 scores reported in the literature compared to the best performer obtained in this study. All models were evaluated on the CRAFT corpus
Fig. 2 F1 scores of the 14 NER systems trained with different fine-tuning methods and different dataset combinations

5 Conclusions

One of the main contributions of this study is the use of an English biomedical dataset to create a bilingual system. This NER system not only identifies NEs in both languages, but does so using one of the datasets with the highest number of NE classes (6), whereas most datasets contain only a small number of NE classes based on the task at hand, such as diseases and chemicals (BC5CDR) or genes and proteins (GENIA). Even fewer NE classes are available in Spanish datasets, such as chemicals and proteins (PharmaCoNER) and cancer concepts (CANTEMIST), both of which follow significantly different annotation schemes. Furthermore, the NEs contained in CRAFT have not previously been annotated in Spanish texts. The synthetic Spanish dataset was created using a “cheap translation” approach inspired by the success of back-translation for NMT. As shown in Table 3, the difference from the best performer is just 0.46 F1 points, surpassing most systems in the literature and placing second, which shows that the NER system’s training was successful. In contrast to the best performer reported in the literature, our system recognises NEs in two languages, EN and ES, and is independent of dictionaries and external information.

Entity replacement was used as a second augmentation technique. Entities were replaced with same-tag data extracted from the official ontologies employed by the CRAFT annotators, since one of the identified issues was OOV words, which are common in the biomedical domain due to the constant coining of new terms. The result was a different CRAFT corpus with 20% of its entities replaced. In the future, alternative replacement percentages might be evaluated and reported, as well as an attempt to replace the entire dataset.

In total, fourteen NER systems were trained with a variety of dataset combinations, including the original dataset plus the Spanish version or the new English CRAFT. We employed either direct fine-tuning, by concatenating the datasets, or continuous fine-tuning, by training on the separate datasets sequentially and exploiting the transfer learning strengths of transformers. Systems trained via transfer learning performed best, while systems trained on concatenated datasets showed a downward trend in F-score. As in Lamurias & Couto (2019), system performance did improve by combining datasets, but the transfer learning technique proved better. The scarcity of annotated biomedical data in other languages can be addressed by combining data augmentation with transfer learning on state-of-the-art systems. This method could be applied to other languages to improve results using in-domain data, without relying entirely on the transformers’ zero-shot capabilities, and also to languages where zero-shot is not an option.

Our novel methodology makes it possible for another language to benefit from one of the biggest biomedical datasets for NER. This dataset and methodology can benefit researchers and future studies on other datasets covering more NE classes. Translations of the dataset into other languages can be produced either from English to the target language or by leveraging the already translated Spanish dataset to obtain a version in a closely related (Romance) language. The proposed method is applicable to large language model (LLM) architectures and is compatible with current quantisation and training enhancement techniques. The approach is portable to any language supported by a modern MT system. Incorporating EWC into the training process to prevent catastrophic forgetting and to increase the learning of new weights in the NER system is another potential future project.