Improving unified named entity recognition by incorporating mention relevance

Named entity recognition (NER) is a fundamental task in natural language processing that aims to detect mentions of real-world entities in text and classify them into predefined types. Recently, research on overlapped and discontinuous named entity recognition has received increasing attention. However, few studies have considered overlapped and discontinuous entities together. In this paper, we propose a novel sequence-to-sequence model, based on machine reading comprehension, that is capable of recognizing both overlapped and discontinuous entities. The model uses a machine reading comprehension formulation to encode significant prior information about the entity category. The input sequence then passes through a question-answering model that predicts the mention relevance of the given source sentence to the query. Finally, we incorporate the mention relevance into a BART-based generation model. We conduct experiments on three types of NER datasets to show the generality of our model. The experimental results demonstrate that our model outperforms almost all current top-performing baselines and improves over current SOTA models on overlapped and discontinuous NER datasets.


Introduction
Named entity recognition (NER) is a fundamental task for many natural language processing applications, including information extraction, question answering systems, syntactic analysis and machine translation. It aims to identify mention spans in the input text according to predefined entity categories, such as location, person, and organization [1,2]. With the popularity of deep learning, NER methods have been extensively investigated, and the current state-of-the-art NER models generally treat the task as a sequence labeling problem. However, this framework has difficulty handling complex scenarios such as overlapped and discontinuous NER.
Traditionally, many approaches have regarded NER as a sequence labeling problem [3][4][5] that assigns a tag to each token in the sentence. Their underlying assumption is that an entity mention is a short span of text and that mentions do not overlap with each other. However, overlapped and discontinuous entities appear in corpora from many domains, especially in clinical corpora. Traditional approaches are unfortunately incapable of handling these special types of entities. For example, Fig. 1 shows an example that contains flat, overlapped and discontinuous entities. For overlapped NER, if a token belongs to multiple entities, it needs to be assigned to multiple categories. For discontinuous entities, if several spans belong to one mention, a method is needed to connect these spans.
Recently, various approaches for overlapped or discontinuous NER have been proposed. Quite a few researchers revised sequence models to support overlapped entities using different strategies [6][7][8]. Some works adopt span-level methods [9][10][11][12][13] that enumerate all possible spans and classify each span. However, these methods suffer from serious weaknesses. On the one hand, enumeration produces an excessive number of invalid spans. On the other hand, the pipeline strategy leads to error accumulation, since incorrectly segmented entity boundaries lead to classification errors. Most importantly, these methods cannot recognize overlapped and discontinuous entities at the same time.
Different from the above studies, Li et al. [14,15] first treated NER as a machine reading comprehension (MRC) task and proposed a new framework that is capable of handling both flat and overlapped NER. Though their models achieve state-of-the-art performance on multiple datasets, they still fail to recognize discontinuous entities. Recently, numerous studies have focused on sequence generation models and have achieved significant progress. Phan et al. [16] proposed a method named NER2QUES that uses NER types to automatically generate questions for a low-resource language such as Vietnamese. Cui et al. [17] treated NER as a sequence-to-sequence language modeling problem and presented a template-based NER method using BART. Yan et al. [18] utilized a novel and simple seq2seq framework with the pointer mechanism to generate the entity sequence directly. Though this framework is capable of handling both overlapped and discontinuous NER tasks, the categories, which partially match the entities, are not effectively utilized.
After carefully rethinking the common characteristics of all three types of NER, we find that the bottleneck of sequence-generation-based unified NER lies mainly in the lack of prior information in the generation phase. Inspired by work on MRC and query focused summarization (QFS) [19], we propose a sequence-to-sequence model based on mention relevance attention (MRA) that is capable of recognizing both overlapped and discontinuous entities. The model uses a machine reading comprehension (MRC) formulation to encode significant prior information about the entity category, passes the input through a state-of-the-art QA model [20] to predict the mention relevance of the given source sentence to the query, and then incorporates the mention relevance into the BART-based generation model. We conduct experiments on three types of NER datasets to show the generality of our model. The experimental results demonstrate that our model can effectively recognize not only flat entities but also overlapped and discontinuous entities. Our contributions in this work can be summarized as follows:
• We propose a novel sequence-to-sequence model that is capable of recognizing both overlapped and discontinuous entities at the same time.
• We propose an effective method to calculate the mention relevance score of the given source sentence to the query and incorporate it into the encoder-decoder attention of pre-trained language models, which provides more prior information. To the best of our knowledge, we are the first to propose a mention relevance score calculation method for the NER task.

Related works
Named entity recognition plays a key role in natural language processing (NLP). It allows a computer to understand the contents of a given text by detecting mentions of real-world entities and classifying them into predefined types. There are three common families of named entity recognition methods: sequence labeling methods, span-based classification and Seq2Seq-based methods.

Sequence labeling methods
In the field of natural language processing, named entity recognition is usually treated as a sequence labeling task [3-5, 14, 21, 22], which assigns a predefined label to each token and decodes entities from the labeled sequence. With well-designed features, machine learning algorithms such as hidden Markov models (HMMs) [23] and conditional random fields (CRFs) [5] can achieve excellent performance. Recently, the development of pre-trained language models has brought research in the field of NLP to a new level. Contextualized word embeddings such as BERT [24] and ELMo [25] further enhanced the performance of NER, yielding state-of-the-art results [22,25]. However, overlapped entities were noticed as early as 2003 [26]. Straková et al. [27] applied the sequence labeling method to recognize overlapped entities by concatenating labels. The resulting labels are difficult to predict because their number increases exponentially.
Fig. 1 An example involving flat, overlapped and discontinuous entities. Entities are marked by different colors. E1 is a flat entity; E2 and E3 are discontinuous entities. At the same time, E1 contains a span of E2 and E3

Span-based classification
In order to identify overlapped and discontinuous entities in text, many researchers abandoned sequence labeling and turned to span-level classification. These methods enumerate all possible spans and determine whether they are valid mentions and what their types are. Li et al. [13] traversed all possible text spans to recognize entity fragments, then performed relation classification to judge whether a given pair of entity fragments is overlapping or in succession. Wang et al. [11] decomposed the extraction of discontinuous entities into subtasks. Two neural components are designed for these subtasks, respectively, and they are learned jointly using a shared text encoder. Li et al. [14] regarded NER as an MRC task and extracted entities as answer spans. Luan et al. [10] proposed DyGIE, a framework for overlapped entity recognition that uses a dynamically constructed span graph to share span representations. However, span-based classification methods usually focus on identifying the boundary information of entities, and because they enumerate spans exhaustively, they are less capable of handling entities with long spans.

Sequence generation methods
Gillick et al. [28] proposed a sequence-to-sequence model that predicts each entity's start, span length and label for the NER task. Straková et al. [27] proposed a Seq2Seq model for overlapped NER, in which the decoder predicts labels one by one for each token until it outputs the "<eow>" (end of word) label and moves to the next token. Yan et al. [18] utilized a novel and simple seq2seq framework with the pointer mechanism to generate the entity sequence directly. Inspired by Yan et al., Chen et al. [29] proposed LightNER, a lightweight tuning paradigm for few-shot NER via pluggable prompting. Cui et al. [17] treated NER as a language model ranking problem in a Seq2Seq framework and proposed a template-based method in which the target sequences are templates filled with candidate named entities. Fei et al. [30] presented a model for irregular (e.g., overlapped or discontinuous) NER based on pointer networks, in which the pointer simultaneously decides whether the token at each decoding step constitutes part of an entity mention and where the next constituent token is.

Other methods
Different from the above approaches, Dai et al. proposed a transition-based neural model [31] for discontinuous NER. Wang et al. proposed a hyper-graph model [32] and utilized deep neural networks to enhance it. Li et al. [33] proposed W2NER, which simultaneously solves the three named entity recognition subtasks by classifying the relationships between words.
Seq2Seq-based methods have achieved the best results and are currently the most popular methods in the NER community. However, existing methods pay too much attention to the boundary information of entities while ignoring the deep semantic relationship between entities and labels. In this paper, we incorporate mention relevance attention into the BART-based generation model. To the best of our knowledge, we are the first to apply mention relevance to an MRC-based generative NER task. In addition, our approach can effectively recognize both overlapped and discontinuous entities.

Unified NER framework
In this part, we first give the formal definition of generative named entity recognition. Then we introduce the method of generating mention relevance scores and present our approach for incorporating mention relevance attention into the transformer-based model, as shown in Fig. 2. Finally, we present the overall framework of our proposed model in Fig. 3 and introduce the details of the model.

Task formalization
Given an input sequence $X = \{x_1, x_2, \ldots, x_n\}$, where $n$ denotes the length of the sequence, we follow Li et al. [14] and associate each tag $t \in T$ with a natural language query $q = \{q_1, q_2, \ldots, q_m\} \in Q$, where $m$ is the length of the query, $T$ is a predefined list of all possible tag types and $Q$ is the set of natural language queries associated with the tags in $T$. Our goal is to find every entity of tag $t$ according to query $q$. In order to formulate the three kinds of named entities uniformly, we define the output sequence as $Y = \{s_{11}, e_{11}, \ldots, s_{1j}, e_{1j}, t, \ldots, s_{i1}, e_{i1}, \ldots, s_{ij}, e_{ij}, t\}$, where $s$ and $e$ are the start and end indexes of a span. Since an entity may contain one or more spans, each entity is represented as $[s_{i1}, e_{i1}, \ldots, s_{ij}, e_{ij}]$.
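As a rough illustration of this output format, the sketch below flattens the entities found for one tag into a target sequence of span indexes followed by the tag. The function name `linearize_entities`, the 0-based inclusive indexing and the example indexes are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple, Union

# (start, end) token indexes of one span; inclusive, 0-based (assumed convention)
Span = Tuple[int, int]


def linearize_entities(entities: List[List[Span]], tag: str) -> List[Union[int, str]]:
    """Flatten entities of a single tag into [s11, e11, ..., s1j, e1j, tag, ...].

    Each entity is a list of one or more spans, so flat, overlapped and
    discontinuous entities share the same representation.
    """
    target: List[Union[int, str]] = []
    for spans in entities:
        if not spans:
            raise ValueError("an entity needs at least one span")
        for start, end in spans:
            if start > end:
                raise ValueError(f"invalid span ({start}, {end})")
            target.extend([start, end])
        target.append(tag)  # the tag closes each entity
    return target


if __name__ == "__main__":
    # Hypothetical indexes: one single-span entity and one two-span
    # (discontinuous) entity that shares token 0 with the first one.
    print(linearize_entities([[(0, 1)], [(0, 0), (3, 3)]], "ADE"))
```

In this representation an overlapped entity simply reuses indexes that already appear in another entity, and a discontinuous entity contributes more than one span before its tag.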

Mention relevance attention
In this section, we describe our approach to calculating the mention relevance score of the given source sentence to the query. BERT-MRC [14] explored different query construction methods and found that Annotation Guideline Notes achieved the highest F1-measure. Therefore, we follow them and generate the query using Annotation Guideline Notes. For example, the query for tag ORG is "find organizations including companies, agencies and institutions." The original sentence is concatenated with the query to form the input sequence. Then, we follow [20] to pretrain a QA model on multiple NER datasets and generate the mention relevance score with this model. Given a sentence that contains $n$ tokens and a query that contains $m$ tokens, the model outputs, for each token, a probability $s \in (0, 1)$ of being the start of a mention and a probability $e \in (0, 1)$ of being the end of a mention. The mention relevance score $r$ of each token is the average of the two probabilities:
$$r = \frac{s + e}{2},$$
where $r \in (0, 1)$.
The Transformer [34], a sequence-to-sequence model introduced in 2017 by a team at Google Brain, is increasingly the model of choice for NLP problems. It contains three attention mechanisms, namely encoder self-attention, masked decoder self-attention and encoder-decoder attention. Its core component is scaled dot-product attention:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,$$
where $d$ is the dimension of the key matrix $K$. Attention was first proposed by Bahdanau et al. [35] for neural machine translation. The mechanism is particularly useful for machine translation, as the most relevant words for the output often occur at similar positions in the input sequence. The Transformer encoder layer consists of self-attention layers, where all keys, values and queries come from the original input sentence. This makes every token in the input attend to all other tokens. The Transformer decoder layer differs from the encoder layer in that it has an additional encoder-decoder attention layer. In the encoder-decoder attention layer, queries come from the decoder's self-attention layer, and all keys and values come from the output of the encoder layer. This allows every generated token to attend not only to tokens in the input sequence but also to tokens in the generated output sequence.
In this paper, we propose mention relevance attention (MRA) to incorporate the token-level mention relevance score into the transformer decoder. Given an input sentence with $n$ tokens, we generate a mention sequence with a maximum length of $t$ tokens. Let $x^l \in \mathbb{R}^{n \times d}$ denote the output of the $l$-th transformer encoder layer and $y^l \in \mathbb{R}^{t \times d}$ denote the output of the self-attention layer of the $l$-th transformer decoder layer. The encoder-decoder attention $\alpha^l \in \mathbb{R}^{t \times n}$ is computed from $y^l$ and $x^l$, projected by the parameter weights $W_Q$ and $W_K \in \mathbb{R}^{d_k \times d_k}$, together with our mention relevance score $A_{ar} \in \mathbb{R}^{t \times n}$. Since the original mention relevance score is an $n$-dimensional vector, we repeat it $t$ times to generate a $t \times n$ attention matrix, which means that the same mention relevance attention is applied to every generated token.
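The following minimal sketch shows one way the pieces described above could fit together: the per-token relevance score as the average of start and end probabilities, and encoder-decoder attention weights biased by that score. How the relevance matrix is combined with the attention logits is not recoverable from the text, so adding it to the scaled dot-product logits before the softmax is an assumption here; the learned projections are also omitted, and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def relevance_scores(start_probs: List[float], end_probs: List[float]) -> List[float]:
    """Per-token mention relevance: the average of start and end probabilities."""
    if len(start_probs) != len(end_probs):
        raise ValueError("start and end probabilities must have the same length")
    return [(s + e) / 2.0 for s, e in zip(start_probs, end_probs)]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def mra_attention(decoder_states: Matrix, encoder_states: Matrix,
                  relevance: List[float]) -> Matrix:
    """Return t x n attention weights with a relevance bias.

    ASSUMPTION: the relevance vector is added to the scaled dot-product
    logits; the learned projections W_Q and W_K are left out for brevity.
    Reusing `relevance` for every decoder row corresponds to repeating the
    n-dimensional score vector t times.
    """
    d = len(encoder_states[0])
    weights = []
    for y in decoder_states:  # one row per generated token
        logits = [
            sum(a * b for a, b in zip(y, x)) / math.sqrt(d) + r
            for x, r in zip(encoder_states, relevance)
        ]
        weights.append(softmax(logits))
    return weights


if __name__ == "__main__":
    r = relevance_scores([0.8, 0.1], [0.6, 0.3])
    print(mra_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], r))
```

With an uninformative decoder state, the attention weights follow the relevance scores alone, which is the intended effect of the bias: tokens that the QA model considers likely mention boundaries receive more attention.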

MRA-based sequence-to-sequence NER
We consider NER as a natural language generation (NLG) task under the sequence-to-sequence framework. Generative pre-trained models have shown remarkable performance in NLG, including text summarization. We choose to combine our mention relevance attention with BART, a denoising autoencoder built with a Seq2Seq model, for two reasons: (1) [17,18] demonstrated the effectiveness of BART on NER tasks, achieving state-of-the-art performance on several irregular NER datasets. (2) BART follows the standard transformer encoder-decoder architecture, so we can easily add the mention relevance as explicit attention in the encoder-decoder attention layers. We incorporate the same mention relevance attention into all transformer decoder layers. The details of the overall architecture are shown in Fig. 3. Given a sentence $X$ and the query $Q$, in order to make the model capture the relevance between tokens and the query, the input sequence is formed by concatenating the query to the end of the sentence. Then, the input sequence is fed into the BART encoder, a transformer-based model, to get word representations. The mention relevance score for the sentence is generated by the mention relevance attention. After that, the decoder uses the pointer mechanism to generate indexes of the original sentence and tags.
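To make the query-augmented input concrete, the sketch below builds one input sequence per entity type by appending the query to the sentence. The separator token `</s>`, the whitespace tokenization and the function name `build_inputs` are assumptions for illustration; the exact input format used by the model is not shown in the text.

```python
from typing import Dict, List, Tuple

SEP = "</s>"  # assumed separator token between sentence and query


def build_inputs(sentence: List[str], queries: Dict[str, str]) -> List[Tuple[str, List[str]]]:
    """Return one (tag, input tokens) pair per entity type.

    The sentence comes first, so pointer indexes into the sentence are the
    same for every query; the query is appended after the separator.
    """
    examples = []
    for tag, query in queries.items():
        examples.append((tag, sentence + [SEP] + query.split()))
    return examples


if __name__ == "__main__":
    sentence = ["Apple", "hired", "Tim"]  # hypothetical sentence
    queries = {
        # ORG query as quoted in the text; a real setup has one query per tag
        "ORG": "find organizations including companies, agencies and institutions.",
    }
    for tag, tokens in build_inputs(sentence, queries):
        print(tag, tokens)
```

Because every sentence is paired with every query, a dataset with several entity types grows by that factor, and many of the resulting pairs contain no mention for the queried type.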
Since we formulate the NER task in a generative way, we can view the NER task as the following equation:
$$P(Y \mid X) = \prod_{t=1}^{|Y|} P(y_t \mid X, Y_{<t}),$$
where $y_0$ is the special "start of sentence" control token. We use the Seq2Seq framework with the pointer mechanism to tackle this task. Therefore, our model consists of two components:

Encoder
In our approach, we concatenate the external query $Q$ at the end of the input sentence to form the input sequence $X$. Each word $x_i$ is represented by adding a word embedding $x_i^w$ and a position embedding $x_i^p$. The encoder encodes the input sequence into vectors $H^e$, formulated as follows:
$$H^e = \mathrm{Encoder}(X),$$
where $H^e \in \mathbb{R}^{n \times d}$ and $d$ is the hidden dimension. In the encoder, bidirectional attention layers are used to enable interaction between every pair of tokens and produce the encoding of the context.

Decoder
The generation process models the conditional probability of selecting a new token given the previous tokens and the input to the encoder. The decoder produces the index probability distribution for each step, $P_t = P(y_t \mid X, Y_{<t})$. However, since $Y_{<t}$ contains pointer and tag indexes, it cannot be fed directly into the decoder. We follow [18] and use the Index2Token conversion to convert indexes into tokens: a pointer index is mapped to the input token it points to, and a tag index is mapped to the corresponding element of $G$, where $G = [g_1, g_2, \ldots, g_m]$ is the set of entity categories (such as "Person" and "Organization"), which are answer words corresponding to the entity categories. After converting each $y_t$ this way, we can get the last hidden state $h_t^d \in \mathbb{R}^d$ with $\hat{Y}_{<t} = [\hat{y}_1, \ldots, \hat{y}_{t-1}]$ as follows:
$$h_t^d = \mathrm{Decoder}(H^e; \hat{Y}_{<t}).$$
Then, we use a pointer-generator mechanism to obtain the index probability distribution $P_t$. Our mechanism follows the pointer-generator network proposed by [36], as it allows both copying words via pointing and generating words from a fixed tag list. We define a hyper-parameter $\alpha \in \mathbb{R}$ to choose between generating a word from the tag list by sampling from $P_{tag}$ and copying a word from the input sequence by sampling from the encoder distribution. The resulting probability distribution is defined over the extended vocabulary, where TokenEmbed denotes the embeddings shared between the encoder and decoder; $H^e \in \mathbb{R}^{n \times d}$; $G^d \in \mathbb{R}^{l \times d}$; $[\cdot\,;\cdot]$ denotes concatenation in the first dimension; and the product operator denotes the dot product. During the training phase, we use the negative log-likelihood loss and the teacher forcing method. During inference, we generate the target sequence in an auto-regressive manner.
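The decoding step can be sketched as follows, under stated assumptions: indexes `0..n-1` point to input tokens and indexes `n..n+l-1` select tags (the paper's exact offsets are not shown), and the distribution is a softmax over the dot products of the decoder hidden state with the token representations and tag embeddings, concatenated along the first dimension. The mixing controlled by the hyper-parameter α is left out, and all names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def index2token(index: int, tokens: Sequence[str], tags: Sequence[str]) -> str:
    """Map a pointer index to an input token, or a tag index to a tag word.

    ASSUMED convention: indexes below len(tokens) are pointers; the
    remaining indexes select entries of the tag list.
    """
    n = len(tokens)
    if 0 <= index < n:
        return tokens[index]
    if n <= index < n + len(tags):
        return tags[index - n]
    raise IndexError(f"index {index} is outside the extended vocabulary")


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def pointer_distribution(hidden: Vector, token_reprs: Sequence[Vector],
                         tag_embeds: Sequence[Vector]) -> List[float]:
    """Softmax over [token scores; tag scores] for one decoding step."""
    scores = [dot(x, hidden) for x in token_reprs]
    scores += [dot(g, hidden) for g in tag_embeds]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    tokens = ["Barack", "Obama", "spoke"]  # hypothetical input
    tags = ["Person", "Organization"]
    probs = pointer_distribution([1.0, 0.0],
                                 [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
                                 [[1.0, 0.0], [0.0, 0.0]])
    best = max(range(len(probs)), key=lambda i: probs[i])
    print(best, index2token(best, tokens, tags))
```

At inference time the selected index would be converted with `index2token` and fed back into the decoder, which is what makes the generation auto-regressive.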

Datasets
We show the statistics of the datasets in Tables 1, 2 and 3. The corpus consists of various types of data annotated with entities and relations. It mainly contains 7 entity types, and the sentences containing overlapped named entities account for about 30%. For ACE2004 and ACE2005, we split the data into train, development and test sets with a ratio of 8:1:1. For the GENIA dataset, we use GENIA corpus 3.02p. Following the protocols in [37], we use five types of entities and split the data into train/dev/test sets with a ratio of 8.1:0.9:1.0.

Flat NER Datasets

Discontinuous NER Datasets
For discontinuous NER, we conduct experiments on three benchmark datasets from the biomedical domain: CADEC, ShARe13 and ShARe14. CADEC is derived from AskaPatient, a forum where patients can discuss their medication experience. The entity types include adverse drug events (ADEs), diseases and symptoms. Since only ADE entities contain discontinuous mentions, only these entities are considered in this paper, which also allows us to directly compare our results with those of previous models. ShARe13 and ShARe14 focus on disease identification in clinical records, including discharge summaries, ECGs, echocardiograms, and radiology reports. Although the three datasets come from similar domains, the focus of CADEC is very different from that of the ShARe datasets. In general, laymen (i.e., CADEC) tend to use idioms to describe their feelings, while professional practitioners (i.e., ShARe) tend to use concise terms to communicate effectively. This also leads to different features of discontinuous mentions between these datasets.

Baseline methods
In order to validate whether the MRA-based generative named entity recognition model can extract entities effectively, we compare the proposed model with several baseline models.

Sequence Labeling Models
Traditional sequence labeling-based models, which assign a predefined label to each token, are usually used to identify flat entities. We selected typical BIO or BIOES benchmark models, such as LSTM-CRF [3] and CNN-BiLSTM-CRF [5].

Span-based Models
These models enumerate all possible spans and determine whether they are valid mentions and what their types are. Their variations can handle overlapped and discontinuous entity identification. Span-based BERT [13] performs relation classification to judge whether a given pair of entity spans is overlapping or in succession. MRC-BERT [14] formulates the NER task as a machine reading comprehension task. Biaffine-BERT [38] uses a biaffine model to rank all spans in a sentence in terms of their start and end token pairs.

Sequence Generation Models
These models generate entity sequences on the decoder side. BART-NER [18] formulates the NER subtasks as an entity span sequence generation task. Template-based NER [17] treats NER as a language model ranking problem in a sequence-to-sequence framework. Pointer network [30] employs Seq2Seq with a pointer network for discontinuous NER.

Other Models
These models differ from the methods above and include transition-based [31], hyper-graph [32] and W2NER [33] methods.

Evaluation metrics
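In terms of evaluation metrics, we adopt the precision (P), recall (R) and F1-measure (F1) used in prior works [11,14,18]. A predicted entity is counted as a true positive if its boundary and type match those of a gold entity. For a discontinuous entity, each span should match a span of the gold entity. The metrics are calculated as follows:
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot P \cdot R}{P + R},$$
where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively.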
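A minimal sketch of this exact-match evaluation is given below. Representing each entity as a pair of (tuple of spans, type) and the function name `prf1` are assumptions for illustration; an entity counts as correct only if all of its spans and its type match a gold entity.

```python
from typing import Set, Tuple

# An entity is (spans, type); spans is a tuple of (start, end) pairs so that
# discontinuous entities with several spans can be compared exactly.
Entity = Tuple[Tuple[Tuple[int, int], ...], str]


def prf1(gold: Set[Entity], pred: Set[Entity]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 with exact span and type match."""
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = {(((0, 1),), "PER"), (((3, 3), (5, 6)), "ADE")}
    # The second prediction misses one span of the discontinuous entity,
    # so it does not count as a true positive.
    pred = {(((0, 1),), "PER"), (((3, 3),), "ADE")}
    print(prf1(gold, pred))
```

Treating a partially recovered discontinuous entity as an error reflects the requirement that each predicted span matches a span of the gold entity.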

Implementation details
For all the experiments, we use the BART-large version to implement our models. Its encoder and decoder each have 12 layers. The network parameters are optimized by AdamW [39] with a learning rate of 1e-5. The batch size is fixed to 16. All the hyper-parameters are tuned on the dev set. We run our experiments on an NVIDIA GeForce RTX 3090 GPU for at most 50 epochs and choose the model with the best performance on the dev set to produce results on the test set. We report the test score of the run with the median dev score among 5 randomly initialized runs. We report the span-level F1.

Main results
In this section, we present the main results of the proposed model on the main datasets. The model that performs best on the development set is evaluated on the test set. In the tables, we mark the best results for each dataset in bold.

Results for flat NER
Table 4 shows the results for the flat NER datasets. Flat entity recognition is the most classic NER task, and the compared methods cover three different technical routes. As seen, Huang et al. [3], Ma et al. [5] and Li et al. [14] adopted BIOES-based end-to-end models. These sequence labeling models do not perform better than the recently proposed sequence generation model [18]. In contrast, on the CoNLL-2003 and English OntoNotes 5.0 datasets, our model achieves the best performance, with F1 scores of 94.95% and 91.34%, respectively. In particular, our model outperforms the previous unified NER framework of Yan et al. [18] by +1.71% on the CoNLL-2003 dataset, which demonstrates that MRA plays an important role in our model.

Results for overlapped NER
Table 5 presents the results for the overlapped NER datasets. The BERT-MRC result comes from Li et al. [14], and the W2NER result comes from our re-implementation using their code. As seen, our model outperforms previous works, including sequence labeling models [14], span-based models [13] and sequence generation models [18], and achieves SOTA F1 performance, with gains of +0.89% and +0.07% on the ACE2004 and ACE2005 datasets, respectively. It also achieves a competitive F1 on GENIA.

Results for discontinuous NER
We evaluate our model on three discontinuous NER datasets. Table 6 presents the comparisons between our model and other baselines. As shown in Table 3, only around 10% of the mentions are discontinuous in all datasets, far fewer than the continuous entity mentions. Therefore, we report the results on sentences that include at least one discontinuous mention. The results show that our model outperforms the previous best model [33] by +0.08%, +0.11% and +0.35% on CADEC, ShARe13 and ShARe14, respectively. This demonstrates that our model again surpasses the baseline models in terms of F1.

Effectiveness of mention relevance attention
In this work, we follow [14] and concatenate the external query at the end of the input sentence. This means that the datasets are expanded by a factor of n, where n is the number of entity categories. This results in many negative examples, in which no mention in the sentence corresponds to the query. To show the effectiveness of prior knowledge and mention relevance attention (MRA), we ablate each part of our model on the CoNLL-2003, ACE2004 and CADEC datasets.
Results in Table 7 show that the external knowledge and MRA help to improve the F1. We find that: (1) The proportion of negative examples greatly affects the training results. When the proportion of negative samples is too high, the decoder tends to predict negatives, which means that the recall of negative samples is high while the recall of positive samples is low. With an appropriate proportion of negative examples added, the external context improves the F1 measure by 0.94%, 0.19% and 0.06% for CoNLL-2003, ACE2004 and CADEC, respectively. This is because it provides prior knowledge for the BART encoders. (2) On the premise of adding a suitable proportion of negative examples, MRA improves the F1 measure by 0.26%, 1.23% and 0.76% for CoNLL-2003, ACE2004 and CADEC, respectively. This is because it captures the correlation between the original input sentences and the tags, which helps detect the words of an entity more accurately. We plot an example of the mention relevance scores in Fig. 4. As can be seen, the relevance between the query and each word of the given sentence is captured by mention relevance attention.

Inference efficiency
The sequence-to-sequence architecture is based on generation. In the inference step, we use beam search to improve performance. Unfortunately, this may cause a decoding efficiency problem. In this section, we compare the training and inference time of our proposed model with that of other baseline models, including sequence labeling, span-based and sequence generation models. We use the BART-base version and measure the seconds needed to iterate one epoch (one epoch means iterating over the whole training set) and the seconds needed to evaluate the development set. The comparison is presented in Table 8.
As can be seen, compared with the sequence labeling method, our model has to generate tokens autoregressively during the evaluation phase, which makes inference slow. Therefore, further work, such as the use of a non-autoregressive method, can be studied to speed up decoding [40].

Conclusion and future work
In this paper, we reformulate the NER task as a sequence generation problem. This formulation comes with three key advantages: (1) it is capable of addressing overlapping or discontinuous entities; (2) the query encodes significant prior knowledge about the entity category to extract; (3) the mention relevance score relates each token in the sentence to the query, which enhances the decoding accuracy. The proposed method obtains SOTA results on both overlapped and discontinuous NER datasets, which indicates its effectiveness. In the future, we would like to improve the performance by exploring variants of the model architecture, improving the inference efficiency and addressing the problems in Chinese NER. In addition, the proposed approach can be extended to other NLP tasks such as relation extraction and event extraction.

Fig. 2 The mention relevance attention in the sequence-to-sequence model. Queries come from the decoder's self-attention layer, and all keys and values come from the output of the encoder layer

Fig. 3 The architecture of our framework. An input sentence is fed into an HLTC-MRQA model to get the mention relevance score $A_r$. It is also encoded by the BART encoders to get the key matrix K and value matrix V. The results are then fed into the BART decoders

Fig. 4 An example of the mention relevance attention score between the query and each word of the given sentence

Table 8 The training and inference time comparison between three models. The results are averaged over at least five runs

For flat NER, we conduct experiments on two English datasets, CoNLL-2003 and OntoNotes 5.0. CoNLL-2003 is one of the most classic named entity recognition datasets. The dataset consists of Reuters news stories from August 1996 to August 1997 and is annotated with place names, person names, organization names and other entities. In this paper, the training set and validation set are combined to train the proposed model. The OntoNotes 5.0 dataset, a collaborative project between BBN Technologies, the University of Colorado, the University of Pennsylvania, and the Information Sciences Institute at the University of Southern California, annotates a large corpus of various types of text in three languages (English, Chinese, and Arabic) with structural information and shallow semantics. We use the English dataset and remove the parts that do not contain entities.

Table 1 Flat NER datasets. Statistics of the dataset sentences and mentions

Table 4 Results for the flat NER datasets
Ours 72.18 73.20 72.69 84.41 79.62 81.95 79.62 84.75 82.10
* means the result is taken from another paper or from our re-implementation using their code

Table 7 Model ablation studies. w/o means this part is removed from the model