Enhancing Cross-lingual Biomedical Concept Normalization Using Deep Neural Network Pretrained Language Models

In this study, we propose a new approach for cross-lingual biomedical concept normalization, the process of mapping text in non-English documents to English concepts of a knowledge base. The resulting mappings, named as semantic annotations, enhance data integration and interoperability of documents in different languages. The US FDA (Food and Drug Administration), therefore, requires all submitted medical forms to be semantically annotated. These standardized medical forms are used in health care practice and biomedical research and are translated/adapted into various languages. Mapping them to the same concepts (normally in English) facilitates the comparison of multiple medical studies even cross-lingually. However, the translation and adaptation of these forms can cause them to deviate from its original text syntactically and in wording. This leads the conventional string matching methods to produce low-quality annotation results. Therefore, our new approach incorporates semantics into the cross-lingual concept normalization process. This is done using sentence embeddings generated by BERT-based pretrained language models. We evaluate the new approach by annotating entire questions of German medical forms with concepts in English, as required by the FDA. The new approach achieves an improvement of 136% in recall, 52% in precision and 66% in F-measure compared to the conventional string matching methods.


Introduction
Concept normalization, also named as semantic annotation or entity linking, aims to map a sequence of text to a concept of a given knowledge base, such as an ontology, taxonomy or thesaurus. Those mappings or annotations have been applied to enhance search engines, data integration or drug discovery. For example, the MEDLINE database contains journal citations and abstracts for biomedical literature. The data in MEDLINE are annotated using the MeSH (Medical Subject Headings) vocabulary. PubMED, 1 the search engine accessing the MEDLINE database, uses these annotations to improve retrieval speed and quality.
Semantic annotations enhance interoperability of the documents and facilitates data integration. The CDISC Standards, jointly developed by the US Food and Drug Administration (FDA) and the Clinical Data Interchange Standards Consortium (CDISC), define the baselines of the interchange format of medical research data. Since 2016, regulatory submissions to the FDA such as new drug applications have to comply with those standards, that incorporate semantic annotation of any submitted medical form. These study data standards ensure the FDA to process the submissions more efficiently. Furthermore, they also facilitate the FDA to solve research questions that need to integrate data from multiple studies. The vocabulary used for the annotations are defined in the Study Data Tabulation Model Controlled Terminology This article is part of the topical collection "Biomedical Engineering Systems and Technologies" guest edited by Hugo Gamboa and Ana Fred.
(SDTM-CT), which is maintained and distributed as part of the NCI Thesaurus. This terminology covers a large set of medical forms, clinical studies and questionnaires, for instance, the Epworth Sleepiness Scale (ESS) Questionnaire and the Hamilton Depression Rating Scale (HAMD). Here, an entire question is assigned to a unique corresponding concept of the ontology. In this study, we focus on such type of annotatiossns.
Cross-lingual concept normalization denotes the process of annotating non-English documents using English concepts. This process is needed because the portion of English concepts still dominates most of the knowledge bases. For instance, one of the largest biomedical ontology sources, the Unified Medical Language System (UMLS) Metathesaurus, 2 contains more than 16.1 million terms in its current version, 2021AA. Thereof, 71% are in English, followed by 10% in Spanish and only 3% of all terms are in French or Portuguese, respectively. To augment the interoperability of non-English documents, cross-lingual concept normalization is indispensable. It is especially a necessity for finding the corresponding concepts of entire question as such concepts are not available in non-English languages.
It is common that medical forms are translated into other languages for the application in non-English speaking regions, such as for clinical or epidemiological studies. Annotating these non-English forms using the same English concepts is not only a requirement of the FDA but also enables the comparison between multiple studies carried out in different languages. Figure 1 presents examples of such cross-lingual semantic annotations. The same standardized forms in various languages shall retain conceptually equivalent meaning. Hence, many of these forms have not only been translated into a new language but also gone through some cultural adaptation and validation processes. For example, the GAD-7 (Generalized Anxiety Disorder-7) form was first published in English in 2006 [2]. It has been translated and adapted/validated into Portuguese [3], into German [4] and into Spanish [5]. The adaptation might result in further modifications on the question text, which can complicate the cross-lingual concept normalization process.
Mapping questions to concepts in the same language (normally in English) is a trivial task as the concepts are mostly syntactically identical to the question, since the concepts are derived from standardized forms. In fact, our previous studies [1,6] show that the conventional string matching methods can already deliver good results. On the contrary, such methods perform poorly in a crosslingual context due to text deviation caused by translation and adaptation. However, no matter cultural adaption or (machine/manual) translation, the semantics of the questions shall still be preserved. As a consequence, we proposed the idea of using deep neural network models to generate sentence embeddings as semantic representations of the questions and the concepts [1]. We achieved a substantial improvement of the annotation quality and proved that semantic embedding methods are superior to string matching based methods in a cross-lingual setting.
In this work, we expand our previous work [1] and aim to further improve the annotation quality by three means: (1) by applying new encoders (2) by injecting UMLS into new models and (3) by refining the post-processing through re-ranking annotation candidates. This study has the following main contributions: (1) We refine the workflows of using deep network sentence encoders for cross-lingual biomedical concept normalization. (2) We investigate the annotation quality using Biomedical Pretrained Language Models (BPLMs) as encoders. (3) We include more state-of-the-art (SOTA) Sentence BERT (SBERT) encoders. (4) We perform UMLS injection into the SBERT encoders and evaluate their performance. (5) We apply candidate re-ranking using C r o s s -Encoder and test its impact on the annotation quality.

Background and Related Work
In this section, we briefly describe the recent development of the pretrained language models with the main focus on BERT (Bidirectional Encoder Representations from Transformers, [7]) and its derivatives. The BERT-models have achieved many SOTA results in various natural language processing tasks (examples see GLUE 3 and SuperGLUE 4 benchmarks). This also motivates us to integrate some of these models in our workflows for solving the concept normalization problem. BERT consists of multi-layer bidirectional Transformer encoder based on the Transformer implementation in [8]. It was released as two sizes: BERT base consists of 12 Transformer layers and BERT large has 24 layers. BERT is trained using two unsupervised tasks: (1) masked language model (MLM) objective and (2) next sentence prediction (NSP). With the MLM, a certain percentage (usually 15%) of the input tokens are masked at random and BERT learns to predict those masked tokens. With the NSP task, BERT is trained to understand the relationship between sentences such as in Question Answering (QA) and Natural Language Inference (NLI) tasks. The initial BERT is pretrained on the BooksCorpus (800 M words) [9] and the English Wikipedia (2500 M words).
Liu et al. [10] modified the BERT's pretraining approach and proposed RoBERTa (robustly optimized BERT approach). Here, they remove the NSP objective, use dynamic masking for the MLM and increase the mini-batch size. In addition, RoBERTa is trained for more steps and with much more data (use 160 GB instead of 13 GB). These approaches have advanced BERT to a better performing model.
Since BERT is relatively resource intensive to apply, Sanh et al. [11] developed a light-weighted version of BERT, the DistilBERT. The model is compressed using the so-called knowledge distillation [12,13], where a compact modelthe student-is trained to reproduce the behavior of a more complex model-the teacher-by minimizing the differences between the model features. The DistilBERT comprises only 6 Transformer layers and has 40% fewer parameters. Nevertheless, it is 60% faster and still retains roughly 97% of BERT's performance on the GLUE benchmark.
MiniLM [14] is another light-weighted variant of BERT. The compression method of MiniLM, termed as deep selfattention distillation, is also based on knowledge distillation principles but with some modifications. The approach distills the self-attention distribution and self-attention value relation of the last Transformer layer of the teacher model. In addition, it also incorporates an intermediate-size student model, named as teacher assistant [15]. The teacher assistant distills the teacher model first and is subsequently used as the teacher to guide the training of the final student model. MiniLM outperforms DistilBERT in the majority of GLUE benchmark tasks and achieves a slightly lower average GLUE score compared to BERT base [14].
MPNet (masked and permuted language modeling [16]) was proposed to overcome two problems. The first problem is that the MLM of BERT ignores a potential dependency of the masked tokens. To address this disadvantage, XLNet [17] was introduced that uses permuted language modeling (PLM) as pretraining method. PLM inherits the benefits of autoregressive modeling but also allows the model to be trained in a bidirectional manner. However, it suffers from position discrepancy between pretraining and fine-tuning, which evokes the second problem. With MLM, BERT captures the position information and sees 85% of the input (if 15% of the tokens are masked). On the other hand, PLM does not have any position information, as the input sequence is presented in a permuted manner and the model only sees the preceding tokens of the to-predict token. This leads inevitably to the above-mentioned discrepancy between pretraining and fine-tuning of downstream tasks, where the model can see the entire input sequence. Consequently, MPNet introduces position compensation to PLM and alleviates the previously mentioned issues [16].
Sentence-BERT (SBERT) and SBERT-WK In this study, we incorporate many pretrained SBERT models [18] as our sentence encoders. Our concept normalization task involves finding the most similar pair of sentences in a large dataset. Using BERT for such type of comparison is computationally expensive as it requires each sentence pair to be input into the network separately. For a comparison of 10,000 sentences, BERT needs 50 million inference computations ( ∼ 65 h, [18]). It is infeasible for us as the ontologies we use contain over 1 million entries. Hence, Reimers et al. [18] proposed SBERT to overcome such inefficiency. SBERT uses the above-mentioned BERT variants as backbone and adds a pooling operation (generally the mean pooling) to generate a fixed-sized sentence embedding. The models are trained using Siamese or triplet networks. The generated embeddings can be compared using similarity measures such as cosine similarity.
The SBERT-WK 5 [19] aims to refine the sentence embeddings generated by SBERT. It modifies the SBERT word embeddings based on how informative/important the word is. The importance of a word is defined by its neighboring words of the same layer and the changes of its cosine similarities through layers. When a word aligns well with its neighboring word vectors, it is less informative. Similarly, a word which evolves faster across layers (larger variance of the pair-wise cosine similarity), it is more important. Since this pooling strategy only alters the already generated embeddings, no further training is needed.
BERT-based biomedical pretrained language models (BPLMs) Since the publication of the BERT model in 2018 [7], various efforts have been made to adapt it for the biomedical domain. We name these as BERT-based Biomedical Pretrained Language Models (BPLMs). The earliest BPLM is BioBERT [20]. It uses the original pretrained BERT (pretrained on BooksCorpus and English Wikipedia) as base model and is further trained with PubMed abstracts and PubMed Central full-text articles (PMC). A few months later, Alsentzer et al. [21] published Clinical BERT. One of its best performing variants uses BioBERT as base model and is trained with approximately 2 million MIMIC-III v1.4 clinical notes [22]. The BlueBERT [23] can be understood as a combination of BioBERT and Clinical BERT. It has four variants depending on base model size (either BERT base or BERT large ) and used training corpus (trained on the PubMed corpus solely or additional with the MIMIC-III corpus). Interestingly, the large-models do not perform better than the base-models. The BERT base -variant trained solely on the PubMed corpus is analogous to BioBERT, yet the BlueBERT variant is trained for more steps (5 M steps instead of 0.2 M steps). Experiments on various NLP tasks show that this increase in training steps does improve the results.
The above-mentioned BPLM models are all derivatives of BERT which is already pretrained on BooksCorpus and English Wikipedia. Gu et al. [24] challenge such continual pretraining and argue that training BERT from scratch using domain-specific corpora is more beneficial when dealing with domain-specific tasks. They pretrained the BERT model from scratch using the PubMed corpus and name their model as PubMedBERT. In addition to PubMedBERT, the authors also create a new benchmark, the Biomedical Language Understanding and Reasoning Benchmark (BLURB), which comprises biomedical NLP tasks focusing on Pub-Med-based applications. PubMedBERT outperforms the above-mentioned models in almost every BLURB task (only BioBERT is better in 2 of the 13 tasks). Hence, they conclude that to solve domain-specific tasks, it is better to use models entirely pretrained on domain-specific corpora than to use models that have already been trained with outdomain corpora.

UMLS injected BPLMs
Various studies have shown that for named-entity recognition or concept normalization tasks extra training of the language models on a given knowledge base is beneficial [16,[25][26][27][28]. Thus, we incorporate two such models in our workflows: CODER [27] and SapBERT [26]. Both approaches propose pretraining using UMLS synonyms, referred to as UMLS injection. In addition, CODER also embeds the relationships between the concepts into the vector representation. CODER has two versions. The English version, CODER ENG , uses PubMedBERT as base model and the multilingual version, CODER ALL , uses multilingual BERT as base model. Both versions differ also in training corpus: CODER ENG utilizes only the English concepts in the UMLS while CODER ALL is trained on concepts of all languages. CODER applies contrastive learning. Here, term representations are learnt by maximizing cosine similarity between positive term-term pairs (i.e., between synonyms of a given concept) and term-relation-term pairs.
The SapBERT achieves many SOTA results on the medical entity linking (MEL) benchmark. The Self-Alignment Pretraining (SAP) is a procedure that learns to self-align synonyms in the UMLS and can also be used for fine-tuning on task-specific datasets. During pretraining, an online hard triplet mining is necessary to locate the most informative training examples. With each mini-batch, all possible triplets for all terms are constructed. A triplet (x a , x p , x n ) contains an anchor x a , an arbitrary term in the mini-batch and the x p and x n denote each either a positive or a negative match of the x a . Only the triplets are retained for pretraining if they satisfy the following constraint : where f is modeled by a BERT model and is a predefined margin. In other words, only triplets with negative samples that are very similar (in the paper they use cosine similarity) to the positive sample by a margin of are kept for pretraining. They use Multi-Similarity loss function [29] as learning objective that leverages the similarities among and between positive and negative pairs.
Multilingual pretrained language models In the current study we apply several multilingual pretrained language models. We choose the pretrained models developed by the same authors of SBERT. Further, we also include the multilingual versions of CODER (described previously) and SapBERT in our workflows.
Reimers et al. [30] proposed multilingual knowledge distillation that seeks to reinforce better alignment of the multilingual sentence embeddings, i.e., the sentence embeddings of different languages shall be mapped to the same vector space if they are semantically equivalent. Through the distillation, the student model M , generally (but not restricted to) a smaller multilingual pretrained model, learns the behavior of the teacher model M, generally an intensively trained monolingual (English) model. The pretraining requires a set of parallel (translated) sentences ((s 1 , t 1 ), … , (s n , t n )) where t i is the translation of s i . The learning objective is to minimize the mean squared loss so that M (s i ) ≈ M(s i ) and M (t i ) ≈ M(s i ).
The multilingual version of SapBERT, later referred to as SapBERT-XLMR, differs from the English version in two folds. Firstly, it is trained with UMLS terms of all languages. Secondly, the pretraining also incorporates general-domain translation data, including "muse" word translations [31] and parallel Wikipedia article titles. The original and the translated sequences are considered as synonyms for the SAP training process.

Corpus and Ontology
This study uses the same 21 German medical forms and the 497 questions as in [1]. Many of the forms are utilized in the LIFE 6 Adult Study [32], a large scale cohort study investigating the factors leading to civilization diseases, such as vascular disease, heart function, allergies and depression. Examples of the included medical forms are the Patient Health Questionnaire (PHQ, [33]) and the GAD-7.
The UMLS Metathesaurus is one of the largest biomedical ontology sources by far. We consequently choose UMLS so that we can maximize the semantic interoperability for our corpus. Since some of the pretrained models that are applied in this study (namely CODER and SapBERT) use the UMLS version 2020AA for concept injection, we also limited ourselves to the same version for our annotation task for a fair comparison. The UMLS version 2020AA integrates 214 source vocabularies and contains approximately 4.28 million concepts. To improve annotation efficiency and since not all ontologies in the UMLS are relevant, we selected three source ontologies from the UMLS that still cover 99.1% of the GSC annotations [1]. The selected subset contains all concepts from (1) the NCI Thesaurus, (2) the LOINC, and (3) the Consumer Health Vocabulary. In total, the subset includes 1,115,090 terms belonging to 399,758 concepts.
In order to evaluate the annotation quality, we manually annotated the medical forms using the selected UMLS subset and built a Gold Standard Corpus (GSC) [1]. Overall, we identified 1105 GSC annotations. Their frequency distribution of number of annotations per question is shown in Fig. 2. In the GSC, most of the questions have up to 2 annotations and about 10% of the questions have 3 or 4 annotations. There are only a few questions being mapped to more than 5 UMLS concepts. From our observations, the duplication is mainly due to (1) same question of a form might be given multiple CUIs in the UMLS or (2) the same question occurs in different forms and hence has different CUIs. Figure 1 shows such examples.

Annotation Workflows
We design two workflows: (1) Workflow-Multi and (2) Workflow-MT to tackle the cross-lingual concept normalization problem (Fig. 3). In Workflow-Multi we input the German forms directly into a given multilingual sentence encoder to generate sentence embeddings. We use the same encoder to encode the embeddings for the English concepts in the UMLS (Fig. 3a). In Workflow-MT (MT stands for Machine Translation), we first translate the German forms into English using three machine translators (DeepL, 7 Microsoft Translator 8 and Google Translate 9 ) (Fig. 3b). We then generate the embeddings of the translated questions and the English UMLS concepts using a given sentence encoder. The sentence encoders we used in Workflow-MT are not limited to English encoders but also include multilingual ones. In a preliminary study we observed that multilingual encoders we selected to generate English sentence embeddings can also achieve good annotation quality. After the encoding process, cosine similarity is computed between each pair of question and a candidate concept embeddings. These mappings are ranked and the Top k results are retained for evaluation, where k ∈ {1, 2, 3, 5} . We apply the metrics precision, recall and F-measure to evaluate our results. We also use Workflow-MT to annotate the original English corpus for the reference comparison.
There are four optional components in the Workflow-MT, which are presented in dashed lines in Fig. 3. First, the UMLS injection indicates that we train the sentence encoders using concepts in the UMLS to refine the sentence encoders. The methods and encoders used for training are detailed in "UMLS Injected SBERTv2 Models (MG SapFull and MG SapSubset )". Second, we incorporate the SBERT-WK [19] as we observed that applying SBERT-WK to the English embeddings generated by SBERT models does improve the annotation quality significantly [1]. The third and forth optional components in the Workflow-MT are extra postprocessing steps. The Cross-Encoder is used to rerank the candidates. In the combination step, set operations are applied to the result sets generated by different translated corpora. See "Post-processing" for more details about these post-processing methods.

Baseline: AnnoMap
AnnoMap [34,35] is a conventional string matcher that generates candidates using three string similarity functions: TF/IDF, Trigram and LCS (longest common substring). After candidate generation, an optional group-based selection can be applied to improve precision. AnnoMap retains candidates whose similarity scores are above a given threshold . We set two thresholds ∈ {0.6, 0.7} that generally generate the best F-measures. We also retain the same result sizes k as in workflows using language pretrained models, i.e., k ∈ {1, 2, 3, 5} . To be able to obtain the desired result set sizes, we did not apply group-based selection in this study because it might return fewer candidates than a given k. We annotate the same three translated corpora as in the Workflow-MT and also the original English corpus as reference.

Model-Groups
In total, we applied 53 English and multilingual BERTbased pretrained language models that are grouped into five Model-Groups: MG SBERTv1 , MG BPLM , MG SBERTv2 , MG SapFull and MG SapSubset . First group, MG SBERTv1 , includes the ten English SBERT models we used in our previous study [1] as reference. These models are selected from

Biomedical Pretrained Language Models ( MG BPLM )
We select nine BERT-based BPLMs from BioBERT, Pub-MedBERT, SapBERT and CODER ( Since the publication of our previous study [1], various new models are trained into Sentence-BERT and are included in the SentenceTransformers-v2. 10 Among them, we selected ten English models and four multilingual models that have shown to yield good results in NLP tasks and vary in efficiency (Table 3).
English encoders The selected ten English SBERTv2 models are mainly derived from MPNet, RoBERTa, DistilRoBERTa and MiniLM. They are fine-tuned on three different training sets: NLI + STSb, paraphrase and ALL. The NLI + STSb training set includes the NLI datasets 11 and the STSb dataset [38]. Additional corpora 12 are used to train the models using the paraphrase training set. The ALL corpus further expands the paraphrase training set into a dataset including one billion sentence pairs from various sources. 13 We also apply the optional SBERT-WK to the new SBERTv2 English encoders.
Multilingual encoders We choose three new multilingual models from the SentenceTransformers-v2, which are listed as M2-M4 in Table 3. We retain the best performing model in our previous study (M1) for comparison. Unlike the models in MG BPLM , we apply only the English encoders in MG SBERTv2 for Workflow-MT as, according to our preliminary study, the SBERTv2 multilingual encoders do not generate good annotation results using Workflow-MT.

UMLS Injected SBERTv2 Models ( MG SapFull and MG SapSubset )
Among the BPLM models, CODER and SapBERT are both with UMLS injection. Based on our research results, Sap-BERT models perform significantly better than CODER (see Table 8). Consequently, we use the method proposed in SapBERT [26] 14 to inject UMLS 2020AA into the English models of MG SBERTv2 . For the injection, we use either the full version of the UMLS ( MG SapFull ) or the selected subset of the UMLS ( MG SapSubset , subset selection see "Corpus and Ontology"). Since these UMLS injected models are SBERTbased, we are also able to apply SBERT-WK to them.
In the Workflow-MT, a configuration, denoted as config in the following text, is determined by a given model, with or without SBERT-WK, different translated corpora and the various result sizes. Table 4 shows the number of models of each Model-Group and the corresponding number of configs used to generate annotation results.

Post-processing
Combination using set operations Our previous studies [1,6,39] show that combining annotation results using set operations can further improve annotation quality. We also conclude that combining result sets of the three different translated corpora deliver the best quality [1]. Hence, within  Translator) using intersection, union and 2-vote-agreement (an annotation is considered as correct by at least two of the three configs) in this study.

Cross-Encoders
For finding the most similar sentence pair, if two sentences are passed into the encoder-network simultaneously, such a network is named Cross-Encoder [40]. Thakur et al. [41] show that a fine-tuned Cross-Encoder (BERT) delivers better results for the STS Benchmark than a fine-tuned Bi-Encoder (SBERT). However, since the sentences are passed to the network in pairs, using Cross-Encoder for finding most similar sentences is computationally expensive as it demands quadratic time complexity. To overcome this inefficiency and still taking the advantage of the better result quality of the Cross-Encoder, we apply the Cross-Encoder on a limited candidate list. We first reduce the search space by generating a short list of candidates using the standard cosine similarity ranking. We then use the Cross-Encoder to rerank these candidates and evaluate the Topk results accordingly. We utilize the implementation of Cross-Encoder in the Sen-tenceTransformers-v2. 15 We select the Cross-Encoder model stsb-roberta-large as it delivers the best results for the STS benchmark. The Cross-Encoder returns a score for each given sentence pair. Given an encoder config, a sentence pair comprises a question of the given translated corpus (the reference sentence) and one of the candidates found by that config. For each single config, we retain the best 50 candidates for reranking. For each combination, we first rerank the 50 candidates generated by each single config and then apply the set operations to combine the reranked candidates to obtain the final annotation results.

Evaluation
In this section, we first present the annotation quality of AnnoMap, the conventional string matching method. We then report the results of the proposed workflows: Workflow-Multi and Workflow-MT. In addition to annotation quality, we also investigate the computation efficiency of the models used in Workflow-MT and the combination results. Further, we give a reflection on the relationship between recall and result size. At the end of the section, we summarize our main findings.

Baseline: AnnoMap
The best results of the conventional string matching method, AnnoMap, are shown in Table 5. When using the original English corpus, AnnoMap obtain the best precision of 93.6%, best recall of 86% and best F-measure of 79.49%. However, when annotating translated corpora, the AnnoMap performs far less well: 32% reduction in precision, 50.4% in recall and 41.13% in F-measure. The large drop in recall indicates that the paraphrase of the questions after translation/cultural adaptation prevents the conventional string matching method from finding the correct annotations, especially in recall. The results of AnnoMap also show that the most suitable machine translator is Google Translate while the annotation quality using DeepL and Microsoft Translator are worse. The last row of each metric in grey is the best corresponding result obtained using the original English corpus (OE) GO: Google Translate

Workflow-Multi
The Workflow-Multi has the advantage of not requiring machine translators but uses the German questions as input. We applied the three multilingual encoders in the MG BPLM (Table 2) and the four multilingual encoders in the MG SBERTv2 (Table 3) for this workflow. Table 6 presents the averaged annotation quality of these multilingual models. Among the MG BPLM models, SapBERT models perform significantly better than the CODER ALL . Actually, CODER ALL is the worst performing multilingual model among all. Among the MG SBERTv2 models, M3 performs best and is better than the best model we tested in our previous study [1]. The best multilingual encoder is the SapBERT-XLMR large . It gains approximately 8% more than its base model ( SapBERT-XLMR base ) in every averaged metric and also outperforms the best MG SBERTv2 multilingual model (M3). When comparing results of single configs, the superiority of the model over other multilingual encoders can be seen again (Table 7). Using Workflow-Multi, SapBERT-XLMR large generates 60.16% as the best precision, 68.69% as best recall and 52.33% as best F-measure.

Workflow-MT
The following presents the results of Workflow-MT. We test if integrating machine translators into cross-lingual biomedical concept normalization workflow improves the annotation quality. For this workflow, we also use multilingual encoders in MG BPLM to encode English corpora (original English and the three translated English corpora). BPLM models Table 8 presents the performance of the nine models in the MG BPLM , including 6 English encoders and 3 multilingual encoders. Interestingly, the best two models are the multilingual models ( SapBERT-XLMR large and SapBERT-XLMR base ). The models with UMLS injection (SapBERT and CODER models) outperform the models without UMLS injection (PubMedBERT and BioBERT) remarkably. All SapBERT models exceed CODER models. Notably, the two multilingual SapBERT models ( SapBERT-XLMR large and SapBERT-XLMR base ) achieve better results than the two English SapBERT models ( SapBERT mean and SapBERT CLS ). Similarly, the multilingual CODER ( CODER ALL ) is also better than the English CODER ( CODER ENG ).
When comparing the averaged results of the same multilingual models under different workflows (Tables 6 and 8), we observe that the multilingual models deliver better results using Workflow-MT than Workflow-Multi. SapBERT-XLMR large improves by 3.37% in F-measure using Workflow-MT (from 45.85 to 49.22%) and SapBERT-XLMR base gains an even larger increase of 8.38% in F-measure (from 38.34 to 46.72%). CODER ALL performs dramatically differently depending on the workflow: in Workflow-Multi it only reaches an averaged F-measure of 14.13%, while using Workflow-MT it achieves an averaged F-measure of 38.10%.

UMLS injection of SBERTv2 models
Table 9 presents the averaged annotation quality of the models without UMLS injection (MG SBERTv2), those injected with the 2020AA UMLS full version (MG SapFull) and those injected with the selected subset (MG SapSubset). UMLS injection is beneficial for 8 of the 10 models (the exceptions being MiniLM(L12)-ALL and MiniLM(L6)-ALL). We also observe that UMLS injection improves different models to varying degrees; the most significant improvement is seen for DistilRoBERTa-ALL. Before UMLS injection, RoBERTa-STSb delivers the best averaged annotation quality (colored in blue). After UMLS injection (with either the full version or the subset), DistilRoBERTa-ALL becomes the best model. The second best model, MPNet-ALL, also outperforms RoBERTa-STSb after UMLS injection. To test the effect of UMLS injection statistically, we conduct pairwise t-tests comparing the annotation metrics of the same model between different Model-Groups. Each comparison is done between the identical configs of the same model in two Model-Groups: MG SBERTv2 against MG SapFull, MG SBERTv2 against MG SapSubset and MG SapFull against MG SapSubset. The results are shown as superscripts in Table 9. Only one model, MiniLM(L6)-ALL, performs statistically better without UMLS injection (p-value < 0.01).

Among the eight models that benefit from UMLS injection, five (of MG SapFull) and six (of MG SapSubset) perform significantly better than the uninjected models (denoted with ** in Table 9). [Table 9 legend (models of Table 3): the best models of each Model-Group are in blue; * indicates the MG SBERTv2 model is significantly better than those in both MG SapFull and MG SapSubset (pairwise t-test, p-value < 0.01); ** indicates the MG SapFull and MG SapSubset models are statistically better than the MG SBERTv2 model; + indicates the MG SBERTv2 model is better than the MG SapSubset model; bold marks the better model when comparing the metrics of MG SapFull and MG SapSubset.] When comparing injection using the full version (MG SapFull) with the selected subset (MG SapSubset), the differences are mostly insignificant: merely two models are better using the selected subset (MiniLM(L12)-Paraphrase and MiniLM(L6)-Paraphrase, in bold), while one model (MPNet-Paraphrase) is better using the full version. Hence, we conclude that UMLS injection into SBERTv2 models is generally beneficial for our biomedical concept normalization task, though with varying effectiveness. Moreover, injection using a relevant subset is sufficient and also more efficient than injecting the full version of the UMLS.
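The pairwise t-tests described above compare identical configs of one model between two Model-Groups. A minimal sketch of the underlying paired t-statistic follows; the function and the sample F-measures are illustrative only (in practice a library routine such as scipy.stats.ttest_rel would be used):

```python
import math

def paired_t(xs, ys):
    """Paired t-statistic for matched samples, e.g. the F-measures of the
    identical configs of one model in two Model-Groups."""
    assert len(xs) == len(ys) and len(xs) > 1
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)  # compare against t(n-1) for a p-value

# Hypothetical F-measures of the same five configs before/after UMLS injection.
before = [40.1, 42.3, 39.8, 41.0, 40.5]
after  = [44.2, 45.1, 43.9, 44.8, 44.0]
print(round(paired_t(after, before), 2))
```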
Best configs of Workflow-MT
The best performing single configs in precision, recall and F-measure are shown in Tables 10, 11 and 12, respectively. We present the best 5 results within each Model-Group. The last row of each Model-Group, in gray, is the best result obtained using the original English corpus (OE) as an indication of the upper bound. [Legend of Tables 10-12: in column SBERT-WK, "n.a." indicates that SBERT-WK is not applicable to BPLM models; the best result of all configs on the translated corpora and OE is in bold; GO: Google Translate, DL: DeepL, MS: Microsoft Translator, P: precision, R: recall, F: F-measure.] The first three metric columns are the results using the standard workflow, i.e., the candidates are ranked using the cosine similarities of the mappings. The last three metric columns show the results after reranking with the Cross-Encoder. Overall, we exceed our previous results in [1] (comparable results shown as MG SBERTv1 models in Tables 10, 11 and 12) in all metrics. The best annotation quality generated by Workflow-MT on the original English corpus also exceeds that of conventional string matching. We achieve the best precision of 71.23% with both standard ranking and reranking with the Cross-Encoder (Table 10).
Reranking using the Cross-Encoder improves the precision results for almost all 25 configs (only 4 cases in MG SapFull and MG SapSubset are exceptions). The best recall with standard ranking is 73.67%, achieved by the best BPLM model (SapBERT-XLMR large with the Google Translate corpus, Table 11). Using the Cross-Encoder for reranking, we can improve the best recall further to 74.84% with RoBERTa-STSb using Google Translate and SBERT-WK. In fact, this config delivers the best recalls within each of the SBERTv2 Model-Groups. As with the best precision results, reranking using the Cross-Encoder improves most best recall results, except for three configs (two in MG BPLM and one in MG SBERTv2). The best F-measure, 61.90%, is delivered by DistilRoBERTa-ALL using Google Translate with SBERT-WK in MG SapSubset (Table 12). The best F-measure using the Cross-Encoder does not exceed this result, but in general, reranking using the Cross-Encoder is also beneficial for the F-measure: only 4 configs (two in MG SapFull and two in MG SapSubset) are not improved by reranking. The best performing model in MG BPLM is SapBERT-XLMR large. It achieves the best 3 results in every metric using the 3 translated corpora, in the order of Google Translate, DeepL and Microsoft Translator (Tables 10, 11 and 12). Interestingly, since this model is a multilingual encoder, it is also applied in Workflow-Multi. Comparing its best results using Workflow-Multi (Table 7) with those using Workflow-MT, using translated corpora delivers even better results. When inputting the German forms directly into SapBERT-XLMR large, the best precision, recall and F-measure are 60.16, 68.69 and 52.33%; using the Google Translate translated corpus as input, it achieves 67.61% in precision, 73.69% in recall and 58.37% in F-measure.
This implies that the model aligns multilingual sentence pairs less well than it aligns English sentences alone.
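The reranking step can be sketched as follows: the bi-encoder's top candidates are rescored by a Cross-Encoder that reads the question and the concept label jointly, and the list is re-sorted by the new scores. In this sketch a simple token-overlap function stands in for a real Cross-Encoder model; all names and data are illustrative, not the study's implementation (punctuation handling is omitted for brevity).

```python
def rerank(question, candidates, cross_score, keep=5):
    """Rerank bi-encoder candidates with a cross-encoder score.

    candidates:  list of (concept_label, cosine_sim) from the bi-encoder
    cross_score: callable scoring a (question, label) pair jointly
    """
    rescored = [(label, cross_score(question, label)) for label, _ in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:keep]

# Stub scorer: token overlap stands in for a real Cross-Encoder model.
def overlap_score(question, label):
    q, l = set(question.lower().split()), set(label.lower().split())
    return len(q & l) / max(len(l), 1)

cands = [("headache", 0.82), ("pain intensity", 0.81), ("severity of pain", 0.80)]
print(rerank("How severe is your pain", cands, overlap_score, keep=2))
```

Note how the top candidate changes after rescoring, which is exactly the effect the reranking step exploits.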
Notably, DistilRoBERTa-ALL of MG SapSubset, with SBERT-WK and the Google Translate corpus, delivers the best precision and F-measure of all single configs. Without UMLS injection (MG SBERTv2), RoBERTa-STSb using Google Translate with SBERT-WK delivers the best precision, recall and F-measure. After UMLS injection, however, DistilRoBERTa-ALL outperforms RoBERTa-STSb in best precision and F-measure. RoBERTa-STSb delivers not only the best recalls among all SBERTv2 models; with the help of the Cross-Encoder, it achieves the best recall of all models compared in this study. All the best metrics of the SBERT-based models (models in MG SBERTv1, MG SBERTv2, MG SapFull and MG SapSubset) are produced by configs with SBERT-WK. This is consistent with our previous observation that adding SBERT-WK improves annotation quality [1]. However, we also observe that many configs without SBERT-WK perform well in recall and F-measure (Tables 11 and 12). Furthermore, these best configs all achieve their best results using the Google Translate corpora, with only one exception (the best precision of MG SapFull). Consistent with the AnnoMap results, Google Translate is also the most suitable machine translator in the pretrained language model workflow.
As expected, the best precisions are delivered by configs with a result size of Top1 and the best recalls by configs with a result size of Top5. The best F-measures are generated by configs with a result size of Top2. Only 2 of the 75 configs in Tables 10-12 are exceptions: the config SapBERT-XLMR large using Google Translate ranked 5th in both the precision table (result size = 2 instead of 1) and the F-measure table (result size = 3 instead of 2). These exceptions arise mainly because SapBERT-XLMR large performs best in MG BPLM and therefore, even with a sub-optimal result size, still outperforms other configs. That configs with a Top2 result size generate the best F-measure can be explained by the fact that most questions in our corpus have 2 GSC annotations (as shown in Fig. 2).
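The interaction between result size and the three metrics can be sketched as micro-averaged precision, recall and F-measure over the per-question Top-k candidate sets against the gold-standard (GSC) annotations. The function and the tiny data set below are hypothetical, not the study's evaluation code:

```python
def micro_prf(predicted, gold, k):
    """Micro-averaged precision/recall/F-measure over all questions,
    keeping the top-k candidates per question.

    predicted: {question_id: ranked list of concept ids}
    gold:      {question_id: set of GSC concept ids}
    """
    tp = pred_total = gold_total = 0
    for qid, gold_set in gold.items():
        top = set(predicted.get(qid, [])[:k])
        tp += len(top & gold_set)          # correct annotations found
        pred_total += len(top)
        gold_total += len(gold_set)
    p = tp / pred_total if pred_total else 0.0
    r = tp / gold_total if gold_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

preds = {"q1": ["C1", "C7", "C2"], "q2": ["C3", "C4", "C9"]}
gsc   = {"q1": {"C1", "C2"},       "q2": {"C4", "C5"}}
print(micro_prf(preds, gsc, k=2))  # Top2, matching the typical 2 GSC annotations
```

With a small k, the denominator of precision shrinks while recall loses candidates, which is why Top1 favours precision, Top5 favours recall and Top2 balances them in our setting.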
Computation efficiency
We used two NVIDIA V100 Tensor Core GPUs to encode the questions and the UMLS concepts. Table 13 presents the computation time of the models in MG SBERTv1, MG BPLM and MG SBERTv2. Since UMLS injection using SAP does not change the model structure, the encoding time of the models in MG SapFull and MG SapSubset remains the same as that of the models in MG SBERTv2. Similarly, the same model pretrained on different corpora (e.g., MPNet-STSb and MPNet-ALL) also has the same computation time and is therefore not shown separately in the table. We conclude that the newly selected SBERTv2 models not only outperform the SBERTv1 models in annotation quality but are also more efficient. MiniLM(L6) and DistilRoBERTa are the fastest models. Applying SBERT-WK drastically increases the computation time because it relies on the CPU. Since all BPLM models are direct derivatives of the initial BERT, their efficiency is similar: they are approximately 10% faster than the fastest SBERTv2 models with SBERT-WK.
Combination of results
In our previous study [1] we showed that combining the result sets of different translated corpora using set operations can further improve annotation quality. Therefore, we applied these combinations and obtained the best precisions by intersecting the result sets (Table 14), the best recalls by union (Table 15) and the best F-measures by 2-vote agreement (Table 16). In each table, we present the best combination result of each Model-Group on the given metric. The last three columns of these tables also show the results of reranking the candidates using the Cross-Encoder before combination.
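The three combination strategies map directly onto set operations over the per-translator result sets for one question. A minimal sketch with illustrative data (function name and concept ids are our own):

```python
from collections import Counter

def combine(result_sets, mode):
    """Combine per-translator annotation result sets (e.g. Google Translate,
    DeepL, Microsoft Translator) for one question.

    mode: "intersection" favours precision, "union" favours recall,
          "2-vote" keeps concepts proposed by at least two result sets.
    """
    if mode == "union":
        return set().union(*result_sets)
    if mode == "intersection":
        return set.intersection(*map(set, result_sets))
    if mode == "2-vote":
        votes = Counter(c for s in result_sets for c in set(s))
        return {c for c, n in votes.items() if n >= 2}
    raise ValueError(mode)

go, dl, ms = {"C1", "C2"}, {"C1", "C3"}, {"C1", "C2", "C4"}
print(combine([go, dl, ms], "intersection"))  # only unanimous concepts
print(combine([go, dl, ms], "2-vote"))        # concepts with at least 2 votes
```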
Overall, we improve the best precision on translated corpora to 93.46% by combining the results of the MG SBERTv1 models (Table 14), an improvement of 22.23% over the best single config result (71.23%, Table 10). Combining three models in MG SapSubset achieves the best recall of 85.25% (Table 15), an increase of 11.58% compared to the best single config (73.67%, Table 11). On the other hand, combination raises the best F-measure by only 1.84% compared to the best single config (from 61.90 to 63.74%). As for the best recall, this best F-measure is delivered by combining models in MG SapSubset. Unlike the enhancement seen in the single config results, reranking using the Cross-Encoder cannot improve the combination results further but rather worsens them.
Recall vs result size
It is clear that applying union to three config result sets achieves a higher recall than a single config, since after union the result size increases by at most a factor of three. Hence, we ask: given the same result size, which delivers better recall, the union of three configs or a single config? To answer this question, we plot the change of recall over increasing result sizes up to 150, considering only the best model of each Model-Group with respect to recall (Fig. 4). We observe that the increase in recall flattens at a result size of approximately 13 when annotating the original English corpus (Fig. 4a). When annotating a translated corpus, in contrast, the recall keeps increasing even up to a result size of 140, though the rate of increase mostly saturates at result sizes between 55 and 75 (Fig. 4b). Moreover, the single configs deliver higher recalls than the combination for result sizes smaller than approximately 30; with larger result sizes, the recalls of the combination overtake those of the single configs. This shows that combination does raise the overall recall limit compared to single configs. These plots also reveal the potential maximum recalls that can be reached when retaining the best 150 candidates: by combining the MG SapSubset models, a recall of 94.48% is possible, and with a single config using the best BPLM model (i.e., SapBERT-XLMR large), a recall of 93.48% is attainable.
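The recall-vs-result-size curves can be computed as recall@k for increasing k over the per-question candidate lists. A minimal sketch with hypothetical data (not the study's plotting code):

```python
def recall_at_k(predicted, gold, ks):
    """Recall over increasing result sizes, as in a recall-vs-result-size plot.

    predicted: {question_id: ranked list of concept ids}
    gold:      {question_id: set of gold concept ids}
    """
    total = sum(len(g) for g in gold.values())
    curve = {}
    for k in ks:
        hits = sum(len(set(predicted.get(q, [])[:k]) & g)
                   for q, g in gold.items())
        curve[k] = hits / total
    return curve

ranked = {"q1": ["C1", "C9", "C2", "C8"], "q2": ["C7", "C3", "C6", "C4"]}
gold   = {"q1": {"C1", "C2"}, "q2": {"C3", "C4"}}
print(recall_at_k(ranked, gold, ks=[1, 2, 4]))
```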

Result Summary
We compile the best results generated by each approach and show them in Fig. 5. Notably, even if the task is not cross-lingual but annotating the original English forms, the proposed Workflow-MT still outperforms the traditional string matching method, although the questions and the corresponding concepts are syntactically identical. Using the sentence encoder workflows, we gain a large improvement in recall and F-measure. This indicates that the use of sentence embeddings as semantic representations does help to find many more correct concepts. Incorporating machine translators into the workflow (Workflow-MT) produces better results than using the original German forms as input (Workflow-Multi). This observation still holds true when the same encoder is applied in different workflows, as we have seen with SapBERT-XLMR large. Hence, we conclude that using a machine translator is still inevitable until better aligned multilingual sentence encoders are available.
The best annotation quality we achieve with a single config is 71.23% in precision, 74.84% in recall and 61.90% in F-measure. All these best results are generated using UMLS-subset-injected SBERTv2 models, i.e., DistilRoBERTa-ALL for precision and F-measure and RoBERTa-STSb for recall. Both models are enhanced by SBERT-WK and use the Google Translate corpus as input; in addition, RoBERTa-STSb obtains the best recall with Cross-Encoder reranking. Figure 5 also shows that among the three metrics, precision benefits most from combination. The best precision on translated corpora using combination, 93.46%, is almost equivalent to the result of using AnnoMap to annotate the original English corpus (93.6%). Similarly, the best recall of 85.25% is comparable to that of AnnoMap on the original English corpus (86%). However, there is still room for improvement in terms of F-measure (the best F-measure is 63.74%). Overall, we achieved an improvement of 136% in recall (from 36.11% with AnnoMap to 85.25% with combination), 52% in precision (from 61.60% with AnnoMap to 93.46% with combination) and 66% in F-measure (AnnoMap: 38.36%, combination: 63.74%). We set our maximum result size to Top5 so that the system can provide a reasonably short list of candidates for further manual verification (semi-automatic annotation). In this case, assuming manual verification yields a precision of 100%, the best recall of 85.25% corresponds to a plausible F-measure of 92.04%.

Conclusion
In this study, we apply BERT-based pretrained language models to generate sentence embeddings for the cross-lingual biomedical concept normalization problem. We show that the annotation quality can be improved significantly compared to the conventional string matching tool. In future work, we aim to apply these techniques to other types of annotations (e.g., biomedical named entities) or to other domains.
We select current SOTA models that are specifically pretrained on biomedical corpora (the BPLM models) or only pretrained on plain English text (the SBERT models without UMLS injection). The results show that the best performance of these two types of models is similar. This may be because many of the questions in our medical forms are in colloquial language, as they are designed to interview the general public; therefore, extra pretraining on biomedical corpora does not benefit the annotation results. Furthermore, we show that we can further enhance the annotation quality of the SBERTv2 models using UMLS injection and outperform the best BPLM model (which is already UMLS-injected). We also discover that UMLS injection using only the relevant subset is sufficient to produce comparable (or even slightly better) results than using the full version of the UMLS. This observation is similar to the finding of PubMedBERT [24] that more pretraining on out-of-domain corpora is not necessarily beneficial for solving domain-specific tasks.
We tested two post-processing strategies in this study. Combination can improve annotation quality significantly and also raises the recall upper bound compared to a single config. Reranking with the Cross-Encoder benefits the results of single configs but does not improve the combination results further. However, as Fig. 4 shows, with a result size of 150, we have the potential of finding up to 94.48% of the correct annotations. Hence, we plan to develop better post-processing approaches that rerank the candidates so that the correct annotations are included in the Top5 result sets.