1 Introduction

Named entity recognition (NER), which aims to detect spans in unstructured text and classify them according to their semantic meanings, is a fundamental and essential natural language processing (NLP) task. A typical example of an NER task is the identification of entities such as dates, locations, and organizations in a sentence. The extracted entities can be used for information extraction [1, 2], building knowledge graphs [3, 4], chatbots [5, 6], and question answering systems [7, 8].

Numerous NER models have been proposed so far, and most traditional approaches treat NER as a sequence labeling problem, that is, each token in a sentence is assigned exactly one tag. Such tasks are usually solved with models based on recurrent neural networks (RNN) or conditional random fields (CRF) [9,10,11]. These methods rest on the assumption that entity spans in the text are not nested within each other. However, nested entities are very common in practice [12]. In the GENIA corpus [13], approximately \(18\%\) of entities contain other entities or are nested within other entities. An example of a nested situation is shown in Fig. 1. As can be seen from the figure, “IFN-gamma” is itself an entity of type Protein, and yet it is part of the RNA entity “IFN-gamma cytoplasmic mRNA.” Such nested structures cannot be handled directly by the predominant sequence labeling methods, since tokens inside nested entities would need to carry multiple tags. Taking these nested entities into account can benefit many downstream NLP tasks.

Fig. 1 An example of nested entities from the GENIA corpus

Many approaches have been proposed over the past few years to address the nested NER problem. One representative category is the hypergraph-based approach [14, 15]. However, this approach requires defining graph nodes, edges, and transformation actions, which results in elaborate tagging schemas, and it suffers from unsatisfactory performance. Another common way of representing nested named entities is the layered model [16, 17], which annotates each layer according to the level of nesting. In this way, multiple flat NER layers can be stacked to address the nested NER problem. Unfortunately, this approach suffers from layer disorientation [18] and error propagation [17]. The former refers to the fact that the right span and classification may be output from the wrong layer, resulting in over-estimated loss, and the latter is the propagation of errors from earlier layers to later ones. To solve these issues, Wang et al. proposed the pyramid model [18]. Between the output features of every two consecutive layers in their decoder, the two adjacent hidden states of the lower layer are embedded into the higher layer using a block consisting of a convolutional layer [19] with a kernel of two and a bidirectional long short-term memory (Bi-LSTM) layer [20]. Consequently, entities of length l are predicted at the l-th layer, which solves the layer disorientation problem, and the prediction of each layer does not depend on the predictions of the other layers, which mitigates the error propagation problem to a certain extent. In their decoder, two adjacent hidden states are aggregated with the convolutional layer, and contextual information is captured with the Bi-LSTM layer. Nevertheless, using convolutional layers for aggregation does not consider the dependencies between the adjacent inputs. Furthermore, they stack multiple layers of Bi-LSTMs, which cannot process sequential data in parallel, leading to comparatively slower training and inference. Also, in the resulting span representation, intermediate words contribute a higher proportion of information than the words on either side. For example, as shown in Fig. 1, when the pyramid model identifies the span “IFN-gamma cytoplasmic mRNA” of length 3, it aggregates the hidden states of “IFN-gamma cytoplasmic” and “cytoplasmic mRNA.” In the representation of this span, “cytoplasmic” contributes more than the words on either side, and this imbalance becomes more severe as the number of layers increases.

To address the above problems, we propose a novel Multi-Head Adjacent Attention-based Pyramid Layered model. When we represent a span of length l (with l greater than 1), we aggregate the hidden states of the two adjacent spans of length \(l-1\). Inspired by the self-attention mechanism [21], we project these span hidden states into queries, keys, and values. The difference is that we compute the attention score only for every two adjacent hidden states, and the weighted sum is the output of the layer. Unlike self-attention, our proposed attention mechanism does not compute each query against all keys in the sequence, since we aim to aggregate two inputs into one output; this also reduces the computational cost compared to self-attention. In this way, not only is internal dependency taken into account when representing a span, but each layer can also output the whole sequence in parallel via matrix operations. In addition, before prediction and before being fed to the next layer, the output of each decoder layer is fused with the encoder hidden states of the head and tail words of the corresponding span, which alleviates the over-weighting of intermediate words.

Our main contributions are as follows:

  • We propose a novel Adjacent Attention mechanism for fusing information from two adjacent inputs. This fusion takes into account the dependencies between the inputs.

  • We design a Multi-Head Adjacent Attention-based module for extracting nested entities based on Adjacent Attention mechanism. Compared with the pyramid model, our module not only takes into account the dependency between inputs, but also allows parallel computation of sequential outputs by matrix operations. Moreover, we add information of head and tail representation to span representation to mitigate the imbalanced contribution problem.

  • Experimental results on three nested datasets illustrate that our model outperforms recently proposed nested NER models.

The remainder of this paper is organized as follows. Firstly, Sect. 2 presents related work on NER. Then, our proposed Multi-Head Adjacent Attention-based Pyramid Layered model for NER is presented in Sect. 3. Next, in Sect. 4 we present and discuss the experimental results. Finally, conclusions are drawn in Sect. 5.

2 Related work

NER has been extensively studied because of its frequent use in downstream NLP tasks, and the majority of the research converts NER into a sequence labeling problem. Before deep learning became popular in various fields, probabilistic graphical models, such as hidden Markov models (HMM) [22, 23] and CRF [24, 25], were commonly applied to the flat NER task. Recently, many deep learning models have achieved excellent results in both computer vision (CV) and NLP, and therefore many researchers have started to introduce deep learning into NER. To the best of our knowledge, Hammerton [26] was the first to use Long Short-Term Memory (LSTM), a typical deep learning model for sequential data, to extract named entities. Collobert et al. [27] proposed a model based on convolutional neural networks (CNN), which first encodes tokens with a convolutional layer and then classifies them using a CRF layer. Subsequently, the combination of Bi-LSTM and CRF was widely utilized for such sequence labeling tasks. These methods use hand-crafted spelling features [9], CNN-based character embeddings [11, 28], and Bi-LSTM-based character embeddings [10] for character-level representations of words. However, these sequence labeling models cannot handle the nested NER task, as they assign only one label to each token.

Over the past few years, an increasing number of studies have focused on nested NER. Early solutions were based on hybrid methods combining supervised learning with manual rules [29,30,31]. These works first used an HMM to extract the innermost entities and then applied rule-based post-processing to obtain the outer entities. The problem with this kind of approach is that a great deal of effort is required to observe the data in order to design reasonable rules. A more popular approach is to design proprietary structures to capture nested entities. Finkel and Manning [12] used a constituency tree to represent a sentence and detected nested entities with a CRF-based parser. However, its time complexity is cubic in the sentence length, making it incapable of handling long sentences. Another proprietary structure is the hypergraph. Lu and Roth [14] were the first to introduce hypergraph-based methods to nested NER. This approach allows edges to connect to multiple nodes to represent nested entities. Wang and Lu [15] proposed a segmentation hypergraph to represent all possible combinations of tokens. Katiyar and Cardie [32] proposed a different hypergraph-based approach that learns structures in a greedy manner with an LSTM; it requires a complex decoding process for detecting entities. Another solution to nested NER is the span-based method, which first extracts candidate spans from the sentences and then classifies them. Sohrab and Miwa [33] proposed the exhaustive model, which enumerates every possible span within a limited length and classifies it. Luan et al. [34] represented spans using dynamic span graphs, which leverage coreference and relation type confidences. A drawback of the span-based approach is that it usually performs poorly in determining entity boundaries [35]. To alleviate this problem, Tan et al. [35] and Xu et al. [36] added a boundary detection component to facilitate the detection of entities. Layered models [16, 17] provide better supervision of boundaries due to their tagging approach but suffer from the layer disorientation and error propagation problems. Wang et al. [18] proposed a pyramid structure-based model to alleviate these problems to some extent; to the best of our knowledge, this is the first use of a pyramid structure for the NER task. Pyramid structures are commonly used for object detection in computer vision [37, 38], where they allow detection at multiple scales. Our approach is also based on a pyramid-layered structure, so it can enumerate almost all spans, and we design an Adjacent Attention mechanism-based decoder that takes internal dependency into account when representing spans. Furthermore, we also enhance the representation of the head and tail words of a span, leading to better performance.

Pre-trained models proposed in recent years have been broadly used in NLP tasks due to their generalization capability. Among them, word embedding methods such as word2vec [39] and GloVe [40] can be regarded as distributed representations of words. Others, called language model embeddings, such as ELMo [41], Flair [42], and BERT [43], can capture the semantics of a word in different contexts. These have been shown in past work to enhance model performance; hence, we adopt both word embeddings and language model embeddings in our model.

3 Methodology

In this section, we introduce the proposed Multi-Head Adjacent Attention-based Pyramid Layered model for nested NER, which consists of two principal components: an encoder and a decoder. We first use the encoder to obtain contextualized word representations and then add a Bi-LSTM layer to represent text spans of length 1. This representation is taken as input to the decoder, where a pyramid structure is applied to represent and classify all possible spans of each length. Figure 2 illustrates the overall architecture of the model. In particular, the tagging approach and part of the encoder follow the work of [18], and we propose a new Multi-Head Adjacent Attention module for decoding. We describe the details of our model in the following subsections.

Fig. 2 The architecture of our proposed Multi-Head Adjacent Attention-based Pyramid Layered model. a The encoder obtains the word representations and adds a Bi-LSTM layer to produce the representations of length-1 spans. b The decoder enumerates and predicts spans of length less than or equal to L by stacking Multi-Head Adjacent Attention modules

3.1 Tagging scheme

First, the input of our model is a T-word sentence. After encoding, the spans of length 1 are classified first, where we adopt the IOB2 tagging format [44], i.e., B-C denotes the beginning of an entity mention of category C, I-C denotes the inside of an entity mention, and O denotes that the token is outside any entity mention. Then, text spans of each length can be enumerated with the stacked Multi-Head Adjacent Attention modules. Specifically, the representations of spans of length l are output by the \((l-1)\)-th attention block. If we intend to exhaust all possible spans of length less than or equal to L, we need to stack \(L-1\) attention blocks. It is worth noting that the B-C tag is assigned to the complete span representation; in this way, each entity of length not greater than L has a single label and is not tagged in other layers, so only the labels B-C and O are used except in the topmost layer of the decoder. For entities longer than L, I-C is supplemented in the topmost layer to facilitate annotation. A small illustration of the layered labels is given below; for more details about this tagging scheme, please refer to [18].
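To make the scheme concrete, the following minimal sketch shows how the layered labels might look for the Fig. 1 fragment, assuming the maximum individual tag length \(L \ge 3\); the tokens and entity types follow Fig. 1, while the surrounding Python code is purely illustrative and not part of our implementation.

```python
# Layered labels for the Fig. 1 fragment "IFN-gamma cytoplasmic mRNA",
# assuming L >= 3 (illustrative sketch only, not the actual training code).
tokens = ["IFN-gamma", "cytoplasmic", "mRNA"]

# Layer l labels the T - l + 1 spans of length l. An entity of length l
# receives a single B-C tag at layer l; every other span is tagged O.
layered_labels = {
    1: ["B-Protein", "O", "O"],  # spans: "IFN-gamma", "cytoplasmic", "mRNA"
    2: ["O", "O"],               # spans: "IFN-gamma cytoplasmic", "cytoplasmic mRNA"
    3: ["B-RNA"],                # span: "IFN-gamma cytoplasmic mRNA"
}

for l, labels in layered_labels.items():
    assert len(labels) == len(tokens) - l + 1  # one label per span of length l
```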

3.2 Encoder

We first employ word embedding and character-level word embedding to represent the words in a sentence so that they are semantically meaningful. In this work, word vectors pre-trained on a large corpus are utilized for the word embedding. The character-level word embedding is generated using a Bi-LSTM with the same settings as [10], which allows the model to alleviate the out-of-vocabulary (OOV) problem. Next, the two embedding results are concatenated and fed into the Bi-LSTM-based sequence encoding layer to further exploit contextual information. Thus, for a sentence \(\mathbf {x}=[x_1,x_2,\dots ,x_T]\), the output of sequence encoding \(\mathbf {x}^{\mathrm{se}} \in \mathbb {R}^{T\times d_{\mathrm{se}}}\) is:

$$\begin{aligned} \mathbf {x}^{\mathrm{se}}=\text {BiLSTM}^{\mathrm{se}}([\text {Emb}^{\mathrm{word}}(\mathbf {x});\text {Emb}^{\mathrm{char}}(\mathbf {x})]) \end{aligned}$$
(1)

where \(\text {Emb}^{\mathrm{word}}(\mathbf {x})\in \mathbb {R}^{T\times d_{\mathrm{word}}}\) denotes the word embedding, \(\text {Emb}^{\mathrm{char}}(\mathbf {x})\in \mathbb {R}^{T\times d_{\mathrm{char}}}\) denotes the character-level word embedding, and [; ] denotes concatenation. The pre-trained word embedding assigns a single vector to each word, so it can only be regarded as a distributed representation of the word. In contrast, a pre-trained language model reflects contextual information when representing words. Consequently, language model embeddings are added, and a linear layer is used to reduce the dimension:

$$\begin{aligned} \mathbf {x}^{\mathrm{rd}} = \text {Linear}^{\mathrm{rd}}([\mathbf {x}^{\mathrm{se}};\text {LM}(\mathbf {x})]) \end{aligned}$$
(2)

where \(\text {LM}(\mathbf {x})\in \mathbb {R}^{T\times d_{LM}}\) denotes the pre-trained language model embedding.

Then, another Bi-LSTM layer is added after the word representations are obtained, and the output of this layer contributes to several parts of the decoder. For example, it is sent directly to the logits layer to predict entities of length 1; its enhanced contextual representation allows the decoder to focus on the span representation; and the head and tail information appended in the decoder also uses the output of this layer. Thus, the final encoder output is:

$$\begin{aligned} {\tilde{\varvec{x}}}^1 = \text {BiLSTM}^{ee}(\mathbf {x}^{\mathrm{rd}}) \end{aligned}$$
(3)

where the superscript 1 of \({\tilde{\varvec{x}}}^1\) indicates that the output features of this layer can be regarded as representations of length-1 text spans.
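A minimal PyTorch sketch of Eqs. (1)–(3) is given below. The module names and dimensions (d_word, d_char, d_lm, d_se, d_out) are placeholders we introduce here, and the character, word, and language model embeddings are assumed to be pre-computed tensors passed in from outside; this is an illustrative sketch rather than our exact implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of Eqs. (1)-(3): sequence encoding, dimension reduction, entity encoding."""
    def __init__(self, d_word, d_char, d_lm, d_se, d_out):
        super().__init__()
        # BiLSTM^se over the concatenated [word; char] embeddings (Eq. 1)
        self.bilstm_se = nn.LSTM(d_word + d_char, d_se // 2,
                                 batch_first=True, bidirectional=True)
        # Linear^rd reduces the dimension of [x^se; LM(x)] (Eq. 2)
        self.linear_rd = nn.Linear(d_se + d_lm, d_out)
        # BiLSTM^ee produces the length-1 span representations (Eq. 3)
        self.bilstm_ee = nn.LSTM(d_out, d_out // 2,
                                 batch_first=True, bidirectional=True)

    def forward(self, word_emb, char_emb, lm_emb):
        # word_emb: (B, T, d_word), char_emb: (B, T, d_char), lm_emb: (B, T, d_lm)
        x_se, _ = self.bilstm_se(torch.cat([word_emb, char_emb], dim=-1))  # Eq. 1
        x_rd = self.linear_rd(torch.cat([x_se, lm_emb], dim=-1))           # Eq. 2
        x1, _ = self.bilstm_ee(x_rd)                                       # Eq. 3
        return x1  # \tilde{x}^1: representations of length-1 spans
```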

3.3 Multi-head attention decoder

The decoder receives the outputs of the encoder and enumerates spans of each length from the bottom up. Each attention-based layer in the decoder acts like a sliding window of size 2, fusing the information of every two adjacent inputs. As explained in Sect. 3.1, we need to set a maximum individual tag length L, which controls how many layers we stack (the decoder has \(L-1\) layers). Figure 2b shows an example of our decoder with \(L=3\). More precisely, the encoder output \({\tilde{\varvec{x}}}^1\) of length T is first sent to two different layers: a logits layer that classifies spans of length 1, and our proposed Multi-Head Adjacent Attention module. The output of the first attention layer can thus be considered an exhaustive enumeration of spans of length 2 and is denoted \({\tilde{\varvec{x}}}^2\); the same operation is then repeated. For the output of each layer, we use a linear logits layer and a softmax function for classification:

$$\begin{aligned} {\hat{\varvec{y}}}^l = \text {Softmax}(\text {Linear}^{\mathrm{logits}}({\tilde{\varvec{x}}}^l)) \end{aligned}$$
(4)

where \({\hat{\varvec{y}}}^l\in \mathbb {R}^{(T-l+1)\times E}\) is the predicted probability distribution of l-length spans and E is the number of tag types. When training the model, we use the cross-entropy loss for multi-class classification, and the final loss function is defined as:

$$\begin{aligned} \text {Loss} = -\sum _{i=1}^{N}\sum _{l=1}^{L}\sum _{t=1}^{T-l+1}\sum _{e=1}^{E}\mathbf {y}_t^l(e)\log ({\hat{\varvec{y}}}_t^l(e)) \end{aligned}$$
(5)

where \({\hat{\varvec{y}}}_t^l(e)\in [0,1]\) is the predicted probability of t-th span of length l along the e-th class, and \(\mathbf {y}_t^l(e)\in \{0,1\}\) is the corresponding ground truth.
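The per-layer classification and the summed cross-entropy of Eqs. (4)–(5) can be sketched as follows. Here `span_reprs`, `gold_labels`, and `logits_layer` are names we introduce for illustration; gold labels are assumed to be tag indices (equivalent to the one-hot \(\mathbf {y}_t^l\) of Eq. 5), and the sum reduction mirrors the summation in Eq. (5).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pyramid_loss(span_reprs, gold_labels, logits_layer):
    """Sketch of Eqs. (4)-(5).
    span_reprs[l-1]:  (B, T-l+1, d) decoder output for length-l spans.
    gold_labels[l-1]: (B, T-l+1) gold tag indices for those spans.
    logits_layer:     shared nn.Linear(d, E) producing tag logits (Eq. 4)."""
    loss = torch.zeros(())
    for x_l, y_l in zip(span_reprs, gold_labels):
        logits = logits_layer(x_l)  # (B, T-l+1, E); softmax is applied inside cross_entropy
        # Cross-entropy summed over all spans of this length (Eq. 5)
        loss = loss + F.cross_entropy(logits.flatten(0, 1), y_l.flatten(),
                                      reduction="sum")
    return loss
```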

Our proposed Multi-Head Adjacent Attention module is presented in Fig. 3. This module includes two main components, namely Multi-Head Adjacent Attention and the head and tail representation. First, the module receives sequential inputs. The decoder stacks \(L-1\) such modules, so the input distribution can shift between layers; we use layer normalization [45] to mitigate this problem. Then, we use the Multi-Head Adjacent Attention layer to compute the outputs: the outputs of the individual heads are concatenated and fed into a linear layer. This stacking naturally forms a pyramid structure, but it also causes the information of intermediate words to dominate the representation of a span. We believe that the head and tail words are essential for representing spans, and many span-based models fuse the hidden states of the head and tail words as the span representation. Therefore, we append the head and tail representation to the proposed attention module to mitigate this problem. The output of the attention layer is concatenated with the head and tail representation and then passed through a feed-forward neural network (FFNN), consisting of a linear transformation and a ReLU activation, to obtain the final output of the module. Additionally, weights are shared among all stacked Multi-Head Adjacent Attention modules.

Fig. 3 Proposed Multi-Head Adjacent Attention module

3.3.1 Multi-Head Adjacent Attention

Our proposed module aims to combine two adjacent hidden states and represent them with features of the same size. Previous work used convolutional layers to achieve this goal; however, that method does not consider the dependency between the two adjacent inputs. We propose a novel attention mechanism-based approach that considers their correlation when combining two adjacent inputs. The weighted sum of the values of the adjacent inputs, weighted by the attention scores, is then computed as the fused representation.


Adjacent Attention We denote the input of this layer as \(\mathbf {a}=[\mathbf {a}_1, \mathbf {a}_2, \dots , \mathbf {a}_T]\). Figure 4 shows an example of the Adjacent Attention layer for \(T=4\). Inspired by the self-attention mechanism, we first map each input \(\mathbf {a}_i\in \mathbb {R}^M\) into query, key, and value vectors. The transformation is defined as:

$$\begin{aligned} \mathbf {q}_i&= \text {ReLU}(\mathbf {W}_q\mathbf {a}_i) \end{aligned}$$
(6)
$$\begin{aligned} \mathbf {k}_i&= \text {ReLU}(\mathbf {W}_k\mathbf {a}_i) \end{aligned}$$
(7)
$$\begin{aligned} \mathbf {v}_i&= \text {ReLU}(\mathbf {W}_v\mathbf {a}_i) \end{aligned}$$
(8)

where \(\mathbf {q}_i,\mathbf {k}_i, \mathbf {v}_i\in \mathbb {R}^K\) denote the query, key, and value of \(\mathbf {a}_i\), respectively, and \(\mathbf {W}_q, \mathbf {W}_k, \mathbf {W}_v\in \mathbb {R}^{K\times M}\) are trainable parameters. In general, linear projections are used for the query, key, and value transformations. The ReLU function used here is a piecewise function that prunes negative values to zero and retains positive values, so the activations after it are sparse [46]. The purpose of exploiting the natural sparsity of ReLU is to prevent over-fitting and reduce training time [47]. The query generated from an input is used to compute attention scores with the keys of adjacent inputs. For an input \(\mathbf {a}_i\) (\(2\le i\le T-1\)), the attention scores with the left and right neighboring inputs are computed as:

$$\begin{aligned} \beta _{i, i-1}&= \tanh (\mathbf {q}_i\cdot \mathbf {k}_{i-1}) \end{aligned}$$
(9)
$$\begin{aligned} \beta _{i, i+1}&= \tanh (\mathbf {q}_i\cdot \mathbf {k}_{i+1}) \end{aligned}$$
(10)

where \(\cdot\) denotes the dot product of two vectors. In particular, only the attention score with the right neighbor is computed when \(i=1\), and only the attention score with the left neighbor is computed when \(i=T\). Instead of the commonly used softmax function, we choose the tanh function here. Our intention is to represent the strength of the dependence between two adjacent inputs, whereas the output of a softmax sums to 1, which would force the attention scores between two inputs to sum to 1 even when they are not dependent on each other. Next, we obtain the output using the computed attention scores and the values generated from the inputs at each time step. Since we aim to fuse two adjacent inputs, the i-th output fuses the information of the i-th and \((i+1)\)-th inputs. Therefore, the output \(\mathbf {b}_i\in \mathbb {R}^K\) is:

$$\begin{aligned} \mathbf {b}_i = \beta _{i, i+1}\mathbf {v}_{i+1} + \beta _{i+1, i}\mathbf {v}_{i}, \quad 1\le i \le T-1 \end{aligned}$$
(11)

The output is one element shorter than the input, which naturally allows stacking to form a pyramid structure.
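The sketch below shows one way a single Adjacent Attention head following Eqs. (6)–(11) could be implemented in PyTorch; the class and parameter names are ours, and the batched, vectorized formulation is an assumption about how the per-position equations would be realized in practice.

```python
import torch
import torch.nn as nn

class AdjacentAttention(nn.Module):
    """Single-head sketch of Eqs. (6)-(11): fuses every two adjacent inputs."""
    def __init__(self, d_model, d_k):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)  # W_q
        self.w_k = nn.Linear(d_model, d_k, bias=False)  # W_k
        self.w_v = nn.Linear(d_model, d_k, bias=False)  # W_v

    def forward(self, a):
        # a: (B, T, d_model)
        q = torch.relu(self.w_q(a))  # Eq. 6
        k = torch.relu(self.w_k(a))  # Eq. 7
        v = torch.relu(self.w_v(a))  # Eq. 8
        # beta_{i,i+1} = tanh(q_i . k_{i+1})  (Eq. 10, for i = 1..T-1)
        beta_right = torch.tanh((q[:, :-1] * k[:, 1:]).sum(-1, keepdim=True))
        # beta_{i+1,i} = tanh(q_{i+1} . k_i)  (Eq. 9, applied at position i+1)
        beta_left = torch.tanh((q[:, 1:] * k[:, :-1]).sum(-1, keepdim=True))
        # b_i = beta_{i,i+1} v_{i+1} + beta_{i+1,i} v_i  (Eq. 11)
        return beta_right * v[:, 1:] + beta_left * v[:, :-1]  # (B, T-1, d_k)
```

All positions are computed with elementwise tensor operations, which is what allows each decoder layer to produce its whole output sequence in parallel.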

Fig. 4 Proposed adjacent attention layer


Multi-Head Attention Inspired by [21], which claims that multi-head attention allows the model to jointly attend to information from different perspectives, we adopt the multi-head attention mechanism here as well. It can be formulated as follows:

$$\begin{aligned} \text {MultiHead}(\mathbf {a}) = [\mathbf {b}^1;\mathbf {b}^2;\dots ;\mathbf {b}^H]\mathbf {W}^o \end{aligned}$$
(12)

where \(\mathbf {b}^h\in \mathbb {R}^{(T-1)\times K}\) is the output of the h-th head and \(\mathbf {W}^o \in \mathbb {R}^{HK\times d_{\mathrm{module}}}\) contains parameters to learn. In this work, we set \(d_{\mathrm{module}}=HK\); hence, the dimension of each head is reduced and the total computational cost is similar to that of single-head attention with full dimensionality.
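Building on the AdjacentAttention sketch above, Eq. (12) could be realized as follows; again, the class structure and use of an nn.ModuleList are our illustrative choices under the stated assumption that \(d_{\mathrm{module}}=HK\).

```python
import torch
import torch.nn as nn

class MultiHeadAdjacentAttention(nn.Module):
    """Sketch of Eq. (12): H adjacent-attention heads, concatenated and projected by W^o."""
    def __init__(self, d_module, num_heads):
        super().__init__()
        assert d_module % num_heads == 0   # d_module = H * K
        d_k = d_module // num_heads
        self.heads = nn.ModuleList(
            [AdjacentAttention(d_module, d_k) for _ in range(num_heads)])
        self.w_o = nn.Linear(d_module, d_module, bias=False)

    def forward(self, a):
        # a: (B, T, d_module) -> fused adjacent representations: (B, T-1, d_module)
        return self.w_o(torch.cat([head(a) for head in self.heads], dim=-1))
```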

3.3.2 Head and tail representation

From Sect. 3.3.1, we can see that the outputs of Adjacent Attention are fusions of every two adjacent inputs. However, in the representation of a text span longer than 2, the proportion of information from intermediate words may be higher than that from the words on both sides, and this imbalance becomes increasingly serious as the number of layers increases. Taking Fig. 4 as an example, the output \(\mathbf {b}_1\) of this layer contains the information of \(\mathbf {a}_1\) and \(\mathbf {a}_2\), and similarly \(\mathbf {b}_2\) contains the information of \(\mathbf {a}_2\) and \(\mathbf {a}_3\). If another layer is added on top, with output \(\mathbf {c}=[\mathbf {c}_1, \mathbf {c}_2,\dots ,\mathbf {c}_{T-2}]\), then \(\mathbf {c}_1\) contains the information of \(\mathbf {b}_1\) and \(\mathbf {b}_2\). However, both \(\mathbf {b}_1\) and \(\mathbf {b}_2\) contain the information of \(\mathbf {a}_2\) to some extent, so \(\mathbf {a}_2\) contributes more to the representation of \(\mathbf {c}_1\). To alleviate this problem, we fuse the output of the multi-head attention with the information of the head and tail words of the span before sending it to the next layer. Therefore, when we want to represent entity mentions of length l, i.e., when \({\tilde{\varvec{x}}}^l=[{\tilde{\varvec{x}}}^l_1, {\tilde{\varvec{x}}}^l_2,\dots , {\tilde{\varvec{x}}}^l_{T-l+1}]\) is to be output, the representation of the head and tail words is computed as:

$$\begin{aligned} \text {R}^l_{i} = \text {MaxPool}([{\tilde{\varvec{x}}}^1_i;{\tilde{\varvec{x}}}^1_{i+l-1}]) \end{aligned}$$
(13)

where MaxPool denotes a max pooling operation with a stride and sliding window size of 2.
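A minimal sketch of Eq. (13) is given below, reading the equation literally: the head and tail vectors of each length-l span are concatenated and then max-pooled along the feature dimension with a window and stride of 2. The function name is ours, and per Fig. 3 the result would be concatenated with the multi-head attention output and passed through the FFNN.

```python
import torch
import torch.nn.functional as F

def head_tail_repr(x1, l):
    """Sketch of Eq. (13): head/tail representation for spans of length l.
    x1: (B, T, d) encoder output \tilde{x}^1 (length-1 span representations)."""
    T = x1.size(1)
    head = x1[:, : T - l + 1]                 # \tilde{x}^1_i
    tail = x1[:, l - 1:]                      # \tilde{x}^1_{i+l-1}
    concat = torch.cat([head, tail], dim=-1)  # (B, T-l+1, 2d)
    # Max pooling with window 2 and stride 2 over the concatenated feature dimension.
    return F.max_pool1d(concat, kernel_size=2, stride=2)  # (B, T-l+1, d)
```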

4 Experiments

4.1 Datasets

To illustrate the effectiveness of our proposed model, we conduct experiments on three benchmark nested NER datasets: GENIA, SciERC [48], and ADE [49]. Details of the data statistics for these datasets are summarized in Table 1.

Table 1 Statistics of the datasets used in the experiments
  • GENIA is a biology domain dataset built on the GENIA v3.0.2 corpus. We follow previous works [12, 14] to preprocess the data: the subtypes of “DNA,” “RNA,” and “Protein” are collapsed into “DNA,” “RNA,” and “Protein,” respectively; “Cell-Line” and “Cell-Type” are kept, and all other entity types are removed, resulting in 5 entity types. The dataset is divided into training, development, and test sets in the ratio of 8.1:0.9:1.

  • SciERC is collected from the abstracts of 500 papers from conferences in the field of artificial intelligence (AI). This dataset includes 6 entity types: “Task,” “Method,” “Metric,” “Material,” “Other-Scientific-Term,” and “Generic.” We follow the work of [48] to pre-process the dataset and split it into training, development, and test sets with a ratio of 6.8:1:2.

  • ADE is extracted from medical reports containing descriptions of adverse effects caused by drug use. There are two entity types, “Adverse-Effect” and “Drug.” Since this dataset does not have an official training, development, and test split, we follow previous works [50, 51] to conduct tenfold cross-validation.

4.2 Evaluation metrics

The NER task involves two basic steps: detecting entity boundaries and determining entity classes. The widely used evaluation schemes are exact match and relaxed match. Exact match considers a named entity to be correctly identified only when both the detected boundary and the category are consistent with the manual annotation. Relaxed match is scored from two perspectives: a correct category is recorded if the predicted category matches the true category and the predicted boundary overlaps the annotation, regardless of the exact boundary; and a correct boundary is recorded regardless of the category assignment. In this work, we choose exact match evaluation. Previous work has not evaluated these datasets in a uniform way, so for ease of comparison we follow the evaluation approach used by previous work on each dataset. For the GENIA and SciERC datasets, we run the model five times with different initial random seeds and report the average micro precision, recall, and F1. For the ADE dataset, we use tenfold cross-validation and report the average macro precision, recall, and F1. All reported results take nested entities into account.
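For reference, exact-match precision, recall, and F1 can be computed from predicted and gold (start, end, type) triples as in the generic sketch below; this is an illustration of the metric definition, not the exact evaluation script used in this work or the cited baselines.

```python
def exact_match_prf(pred_spans, gold_spans):
    """pred_spans, gold_spans: sets of (start, end, entity_type) triples.
    A prediction counts as correct only if its boundary and type both match a gold span."""
    tp = len(pred_spans & gold_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For corpus-level micro scores, the true-positive, predicted, and gold counts would be accumulated over all sentences before computing the ratios.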

4.3 Implementation details

We implement our proposed model with the PyTorch library and conduct our experiments on an NVIDIA RTX 3090 GPU. Since all three datasets were extracted from domain-specific corpora, we choose in-domain pre-trained models for the word embedding and language model embedding of each dataset, which has proved to be more effective than BERT [52]. Both the GENIA and ADE datasets belong to the biomedical domain; therefore, we adopt 200-dimensional word vectors pre-trained on MEDLINE abstracts to initialize the word embedding [53]. To the best of our knowledge, there are no pre-trained word vectors for the AI domain, so we choose the commonly used 200-dimensional GloVe embedding for the SciERC dataset. The initialized word embeddings are frozen during training and testing. In addition, for all datasets, the character-based embedding is generated by a Bi-LSTM, and we follow the work of [18] in setting its dimension to 60. For the GENIA and ADE datasets, we choose Flair [42] and BioBERT [54] as the language model embeddings. Flair is a contextual character-level language model embedding that can better handle rare words; here, we use PubMed-forward and PubMed-backward with dimension 4096, which are trained on \(5\%\) of PubMed abstracts. BioBERT has the same architecture as BERT and is pre-trained on a sizable biomedical corpus; we utilize BioBERT-Base v1.1, which is based on BERT-base-cased with dimension 768. For the SciERC dataset, we choose SciBERT [55] and CsRoBERTa.Footnote 1 SciBERT is also based on BERT and pre-trained on a corpus of scientific publications, while CsRoBERTa is pre-trained on papers from the computer science domain; we adopt SciBERT-scivocab-uncased and CsRoBERTa-base with dimension 768. We extract the hidden states of the last four layers of these pre-trained language models, compute their average as the embedding, and do not fine-tune the language models during training. It is worth noting that we adopt two different language models for embedding, and differences in their tokenizer vocabularies lead to inconsistencies in the lengths of the embedded sequences; we therefore average over the sub-words when representing a word. We set the hidden state size of the sequence encoding layer to vary among \(\{128, 256\}\). As the outputs of both the entity encoding layer and the attention decoder are sent to a logits layer with shared weights, the dimensions of these two layers must be consistent and are also chosen from \(\{128, 256\}\). The number of heads of the Adjacent Attention layer is chosen from \(\{4, 8, 16, 32\}\). For the GENIA dataset, we follow previous work in setting the batch size to 64; for the other datasets it is 32. In addition, we choose the dropout rate from \(\{0.3, 0.4, 0.5\}\) and the maximum individual tag length L from \(\{4, 8, 12, 16\}\). We follow the training recipe of [18], using an SGD optimizer with a learning rate of 0.01 and gradient clipping of 5. To determine the best hyper-parameter combination, we use grid search. For the GENIA and SciERC datasets, we use the hyper-parameter combination that performs best on the development set, while for the ADE dataset we compare the results of tenfold cross-validation. The optimal combination of hyper-parameters for each dataset is shown in Table 2.

Table 2 Hyper-parameters settings

4.4 Comparison with the pyramid model

Our model and the pyramid model are both built on a pyramid structure. In contrast to the pyramid model, which uses a combination of convolutional and Bi-LSTM layers, we design a new decoder based on an attention mechanism to capture the dependencies between inputs and improve performance. Therefore, to validate the effectiveness of our decoder, we compare our model with the pyramid model on the three datasets. For the GENIA dataset, we compare the average results of five runs of our model with the average results reported in their paper. For a fair comparison, we adopt the same embedding settings as the pyramid model: a Bi-LSTM with dimension 60 as the character-level word embedding, pre-trained word vectors with dimension 200 as the word embedding, and Flair and BioBERT as the language model embeddings. For the SciERC and ADE datasets, we run experiments using their publicly available code with the same embedding settings as our model. For the SciERC dataset, we compare the average micro results over five runs, while for the ADE dataset we compare the macro results of tenfold cross-validation. For the pyramid model, we set the hidden state search space to \(\{100, 128, 200, 256\}\); the reason for this setting is that \(\{128, 256\}\) is our search space and \(\{100, 200\}\) is the space of hidden state sizes in their work. The other hyper-parameters are the same as in our model. The optimal hidden state sizes of the pyramid model on the SciERC and ADE datasets are 128 and 256, respectively. The performance of our proposed model compared with the pyramid model on the three datasets is tabulated in Table 3. From the table, we can see that our model improves the F1 score by \(0.55\%\), \(0.78\%\), and \(0.43\%\) on the GENIA, SciERC, and ADE datasets, respectively. In addition, both precision and recall are higher than those of the pyramid model on all three datasets. These results demonstrate the effectiveness of our decoder. We believe that the performance improvement comes from our proposed Multi-Head Adjacent Attention module, which considers internal dependencies in the representation of spans and appends head and tail word representations.

Table 3 Performance of our proposed model and pyramid model on three datasets

4.5 Comparison with baselines

We compare our proposed model with several state-of-the-art models proposed in recent years on GENIA, SciERC, and ADE datasets, respectively:

  • BiFlaG [56] is a bipartite flat-graph network with two interactive subgraph modules corresponding to the outermost entity and the inner entity.

  • HIT [57] expresses the nested entities by using two parts: a head and tail detector and a token interaction tagger.

  • Second-best [58] designs an objective function that takes the extracted entities as parent spans and extracts nested entities by leveraging the second-best path within them.

  • LogSumExpDecoder [59] extends second-best path identification by excluding the effect of the best path, which is achieved by selecting and removing chunks at each level to construct different potential functions.

  • BartNER [60] utilizes the pre-trained Seq2Seq model BART and generates indexes at the decoder with a pointer mechanism.

  • DYGIE [34] constructs a dynamic span graph and refines it with relation type confidences and coreference.

  • SPE [61] proposes span encoder and span pair encoder which can import inter-span and intra-span information into the pre-trained model.

  • UniRE [62] constructs a table containing all possible entity pairs and applies a unified classifier to predict each cell.

  • PURE [63] first obtains the contextual representation of each token with a pre-trained model and then concatenates the hidden states of the head and tail tokens and a width embedding as the representation of a span.

  • SpERT [50] fuses the token embeddings of a span with max pooling as its representation. Moreover, a width embedding is added to enhance the span representation.

  • SPAN-MultiHead [51] introduces an attention-based semantic representation into the extraction framework. In particular, attention is used to compute both span-specific and contextual semantic representations.

  • SpERT.PL [64] is based on SpERT with the addition of part-of-speech (POS) embeddings to enrich the span representation.

The overall performance of our proposed model and previous work on GENIA, SciERC, and ADE is shown in Tables 4, 5, and 6, respectively. Our model outperforms the best baseline model on all three nested datasets. Specifically, the F1 scores of our proposed model improve over the previous best model by \(0.56\%\), \(0.95\%\), and \(0.42\%\) on GENIA, SciERC, and ADE, respectively. On the GENIA dataset, BartNER obtains a slightly higher recall than our model, but its precision is much lower. Similarly, on the SciERC dataset, UniRE also obtains a high recall, but its precision is low, resulting in a much lower final F1 score than our model. Although BartNER and UniRE employ different methods to extract entities, they have one thing in common: they only use the hidden states of the head and tail words when representing spans and ignore the information of intermediate words. Therefore, we believe that our recall is not as high as theirs because they are likely to judge a span as an entity based on the learned pattern of head and tail word pairs, even if the intermediate words are replaced with others that change the semantics of the sentence. This causes many non-entities to be recognized as entities, which lowers precision. In contrast, our span representation treats the contribution of each word in the span as fairly as possible, which substantially improves precision compared to these methods. On the ADE dataset, both the precision and recall of our model are higher than those of the other models, indicating that our model can not only extract more entities but also filter out non-entities. These results demonstrate that our model is very effective for nested named entity extraction. The main improvement of the proposed model comes from the fact that we utilize a pyramid structure to enumerate entity mentions of each length and consider internal dependency when representing spans, which results in better span representations and ultimately better performance.

Table 4 Performances of our model and baseline models on GENIA dataset
Table 5 Performances of our model and baseline models on SciERC dataset
Table 6 Performances of our model and baseline models on ADE dataset

4.6 Performance of our model with different L

One important hyper-parameter of our model is the maximum individual tag length L, which determines how many attention modules we need to stack. It also determines that all possible spans of length less than or equal to L will be classified using a single representation. Therefore, the setting of L is quite important for the performance of our model. To observe its effect, we conduct experiments on the GENIA, SciERC, and ADE datasets. With the other hyper-parameters following the optimal settings of the model, we let L take values from \(\{4, 8, 12, 16\}\). Table 7 presents the performance of our model on the three datasets with different L. For GENIA, SciERC, and ADE, the optimal L is 16, 12, and 12, respectively; in each case, the optimal L is close to the maximum entity length in the dataset. In addition, the table shows that the F1 score is lowest on all three datasets when \(L=4\). For the GENIA dataset, the F1 score tends to increase as L increases. For the SciERC and ADE datasets, the performance of our model is similar when \(L=8\) and \(L=16\); although there is a decrease, it remains within an acceptable range. Therefore, although the performance of our model depends on the setting of L, named entities can be extracted effectively as long as L lies within a certain range around the maximum entity length.

Table 7 Performance of our model with different L on GENIA, SciERC, and ADE datasets

4.7 Ablation study

To verify the effectiveness of the components in our model, we select the GENIA dataset and remove one component at a time to analyze its impact on performance. The performance of the variants of our model is presented in Table 8, where w/o means without.


Character Embedding First, we remove the character embedding. As we can see from the table, the F1 score of our model drops by \(0.29\%\) after removing this component. This indicates that, as a complement to the pre-trained language models, the character embedding alleviates the OOV problem to some extent.


BioBERT Embedding Then, we observe the effect of BioBERT on our model. From the results, we can see that BioBERT contributes more to the model than the character embedding. We believe that such large pre-trained models can effectively represent tokens and prevent overfitting when applied to downstream tasks with relatively small datasets.


Flair Embedding From the tabulated results, we can see that removing this language model embedding also decreases the F1 score more than removing the character embedding does, which shows that the contribution of the language model embedding to the model is greater than that of the character embedding.

Language Model Embedding The F1 score of our model decreases by \(1.98\%\) after we remove both pre-trained language models. This experiment shows that the pre-trained language models have a great impact on our model, further verifying that they help our model learn contextual information that is difficult to capture on small datasets.


Head and Tail Representation When we remove the head and tail representation, the F1 score drops by \(0.24\%\). This means that adding head and tail word information to the representation of entity mentions yields a better representation and improves performance.

ReLU Operation in Eqs. 6, 7, and 8 To verify the effectiveness of using ReLU when mapping the input to queries, keys, and values, we remove ReLU in this experiment. After removing ReLU, training with SGD at a learning rate of 0.01 fails to converge. Therefore, we gradually reduce the learning rate and also try the Adam optimizer; the model finally starts to converge when using Adam with a learning rate of 0.001 and an exponential learning rate scheduler. The loss curves are shown in Fig. 5. As can be seen from the figure, the model with ReLU removed converges more slowly than our model. Although it eventually converges to the same loss level as our model, its F1 score on the test set decreases by \(0.89\%\), as shown in Table 8. This demonstrates that the addition of ReLU helps prevent overfitting.

Table 8 Ablation study on GENIA dataset
Fig. 5 Training loss

4.8 Inference speed

Our model and the pyramid model are both based on the pyramid structure; therefore, we have the same time complexity as the pyramid model, which is O(TL), where T is the length of the sentence and L is the maximum individual tag length. However, our decoder is based on the attention mechanism and can compute the sequential output in parallel, whereas the Bi-LSTM used by the pyramid model cannot. As a result, our model should have faster inference than the pyramid model despite the identical time complexity. To verify this, we compare the inference speed of our model with that of the pyramid model on the three datasets. Both models are implemented with the PyTorch library and run on a single RTX 3090 GPU. The hyper-parameters of each model are those that perform best on each dataset. The results of the comparison are shown in Fig. 6. As can be seen from the figure, the average inference speed of our model is about \(15\%\) faster than that of the pyramid model for various batch sizes on all three datasets.

Fig. 6 Inference speed of our model and pyramid model on GENIA, SciERC, and ADE datasets

5 Conclusion

In this paper, we propose a Multi-Head Adjacent Attention-based Pyramid Layered model for nested named entity extraction. We consider the dependencies between two adjacent hidden states when fusing them, in order to better represent text spans with a pyramid structure and ultimately improve extraction performance. The speed of extracting entities is also improved to some extent, as our proposed attention mechanism can perform parallel computation on sequential inputs. In addition, to mitigate the issue of intermediate words dominating the span representation, we add the head and tail word representations to the attention module and experimentally demonstrate the effectiveness of this solution. The experimental results demonstrate that our model outperforms other baseline models on three nested NER datasets.

In future work, we will continue to investigate how to represent spans more simply and efficiently to improve the performance of extracting nested entities. In addition, we will try to address discontinuous NER, which has received increasing attention from researchers in the past few years.