Introduction

Named entity (NE) recognition is often modelled as a sequence labelling task, where a sequence model (e.g. a conditional random field (CRF) [1] or a long short-term memory (LSTM) network [2]) is adopted to output the most likely labelling sequence. Because sequence models are effective in encoding the semantic dependencies of a sentence and constraining the structure of a labelling sequence, they have achieved great success in NE recognition. However, sequence models assume a flattened structure for each input sentence. They are not effective for finding nested NEs in a sentence. For example, “Guizhou University” is an organization NE, and “Guizhou” is an NE that indicates the location of the university. In this case, the output of a label sequence cannot resolve the nested structure. Because nested structures are effective in representing the semantic relationships of entities (e.g. affiliation, ownership, and hyponymy), they are widely used in natural languages. For example, in the GENIA corpus and ACE corpus, the nesting ratio is 35.27% and 33.90%, respectively [3, 4].

Span classification is an effective method to recognize nested NEs. NE spans in a sentence are generated in a process known as region proposal, and a class label is then output for each possible NE span. This approach has two advantages for nested NE recognition. First, a nested NE structure can be resolved into separate NE spans. Second, classification can be implemented on a span representation, where global semantics relevant to a predicted NE are encoded. Span classification has therefore received a great deal of attention. However, many span classification models enumerate all possible NE spans in a sentence, resulting in high computational complexity and a data imbalance problem. Therefore, many related works only verify possible NE spans up to a certain length (e.g. [5,6,7]), or filter NE spans with predefined thresholds [8] or NE boundary cues [9, 10]. For example, Chen et al. [9] verified NE spans combined with detected NE boundaries. Lin et al. [11] applied a point network to recognize span boundaries relevant to the head word of a mentioned NE. The main problem for span classification is that, due to computational complexity and data imbalance, it is difficult to enumerate all NE candidates in a sentence. If an NE is not enumerated in the region proposal process, it cannot be recognized by span classification.

In our study, we have found that there are some similarities between named entity recognition and object detection. For example, named entities in a sentence and objects in an image have similar spatial structures, e.g. flattened, nested and discontinuous [12]. The task of recognizing them can be modelled as a span (or region) classification problem. The main difference between named entities and objects is that the former uses a discrete position representation. Because a deep neural network has the ability to transform multimodal signals into an abstract semantic space [13], in this paper, we represent the positions of NEs as continuous values. Then, a regression operation can be introduced to regress the boundaries of NEs, locating NEs in a sentence in the same way as objects are detected in an image.

In this paper, motivated by techniques developed in object detection for computer vision, we generate abstract NE representations from sentences in a region proposal process. The generated NE representations are named textual bounding boxes (bounding boxes). Every bounding box is annotated with three parameters to indicate its entity type, start position and length in a sentence. Then, rather than only predicting the class of an NE candidate, a boundary regression operation is also applied to refine its spatial location in a sentence. Based on boundary regression, we design a boundary regression (BR) model to support nested NE recognition. It is a multiobjective learning framework composed of a basic network module, a region proposal module, a classification module and a regression module. The basic network transforms each sentence into an abstract representation. Then, a region proposal network is applied to generate bounding boxes. They are fed into the classification module for class prediction and the regression module for boundary regression. In the training process, in addition to maximizing the confidence scores of an NE, a linear layer is used to minimize its location offset relative to a true NE.

The BR model simultaneously predicts the classification score of an NE candidate and refines its spatial location in a sentence. The contributions of this paper include the following:

  1. The positions of NEs in a sentence are represented as continuous values to support NE boundary regression. It has the advantage of resolving nested NEs and locating NEs with boundary regression operations.

  2. A bounding box–based multiobjective learning model is designed to support nested NE recognition. It supports simultaneously predicting the class probability and refining the spatial locations of NEs in a sentence.

The structure of this paper is organized as follows. The “Related Work” section introduces related work. Before discussing the details of the model, our motivation is first presented in the “Motivation” section. In the “Model” section, the definition of boundary regression and the architecture of the BR model are presented. Experiments are reported in the “Experiments” section, where several issues concerning the BR model are discussed. The conclusion of this paper is given in the “Conclusion and Future Work” section.

Related Work

Because the BR model is motivated by object detection for computer vision, in the following, we divide the related work into two parts: object detection and NE recognition.

In its early stage, object detection was implemented as a multistage pipeline. A typical object detection model is often composed of three stages: segmentation, feature extraction and classification. Segmentation is implemented to generate possible object locations for prediction. Generic algorithms (e.g. selective search) are often adopted to avoid exhaustive searching. Feature extraction is implemented to extract higher-order abstract features from raw input images. The output of this process is often denoted as feature maps. The feature extraction process can be encapsulated as a basic network truncated from a standard architecture for high-quality image classification, such as the VGG-16 network [14] or GoogLeNet [15]. Finally, an output layer (e.g. a linear SVM or a softmax layer) is used to predict confidence scores for each proposed region.

End-to-end object detection models can be globally optimized and share computation between inputs. These models are often similar in the feature extraction layer and output layer, where a basic network is adopted to generate conv feature maps, and two fully connected layers synchronously output class probabilities and object locations. The main difference is the strategy used to generate region proposals. For example, Faster R-CNN adopts anchor boxes to generate region proposals per feature map [16]. Erhan et al. [17] used a single deep neural network to generate a small number of bounding boxes. Redmon et al. [18] divided an image into grids associated with a number of bounding boxes. Liu et al. [19] used a basic network that maps an image into multiple feature maps to generate default boxes with different aspect ratios and scales.

In the field of NE recognition, neural networks have also received great attention. Early models usually adopted a sequence model to output flattened NEs (e.g. LSTM, Bi-LSTM (bidirectional LSTM) or Bi-LSTM-CNN). To handle the nesting problem, the sequence model has been redesigned into three variants: the layering, cascading and joint models [20]. Parsing trees are also widely used to represent nested NEs in a tree structure [21]. For example, Finkel and Manning [22] used internal and structural information of parsing trees to flatten nested NEs. Zhang et al. [23] adopted a transition-based parser. Jie et al. [24] attempted to capture the global dependency of a parsing tree. To make use of all information available in a network for classification, capsule networks have also been adopted for named entity recognition; they group neurons into capsules to detect specific features of an entity [25,26,27]. Because named entity recognition suffers from a serious feature sparsity problem, masked entity language modelling (MELM) has been proposed to generate augmented data from external resources [28, 29]. To avoid the tagging ambiguity problem in conventional tagging schemes, a constituent-based tagging scheme was defined to label tokens for named entity recognition [30, 31].

Recently, many models have been designed to recognize nested NEs directly. Lu and Roth [32] resolved nested NEs into a hypergraph representation. Xu and Jiang [7] and Sohrab and Miwa [5] verified every possible fragment up to a certain length. Wang et al. [33] mapped a sentence with nested mentions to a designated forest. Ju et al. [34] proposed an iterative method that implements a sequence model on the output of a previous model. Lin et al. [11] proposed a head-driven structure. Li et al. [35] combined the output of a Bi-LSTM-CRF network with that of another Bi-LSTM network. Strakova et al. [36] proposed a sequence-to-sequence model. Zheng et al. [10] proposed an end-to-end boundary-aware neural model. In Chen et al. [37], a boundary assembly (BA) model was designed to recognize nested NEs. The BA model identifies NE boundaries, assembles them into NE candidates, and selects the most likely ones. For a broad understanding of the NER problem, the interested reader can refer to the survey paper [38] on deep neural network–based NE recognition.

Motivation

The motivation of the BR model is inspired by techniques developed for object detection in computer vision. From our understanding, a sentence is a one-dimensional linear textual stream, and an image is a two-dimensional pixel patch. They are completely different in their external representations. However, in recent years, based on deep neural networks, language and vision have been embedded into a distributed representation and mapped into an abstract semantic space [39]. Therefore, the combination of natural language processing and computer vision has become popular in research communities, e.g. the text retrieval approach in videos [40] and multimodal deep learning [13].

As shown in Fig. 1, the spatial patterns of entities in sentences and objects in images have similar structures.

Fig. 1 Similarity between entities and objects

The structures between objects (or entities) can be roughly divided into three categories: flattened, nested and discontinuous [12]. Flattened entities (or objects) are spatially separated from each other. In nested structures, two or more entities (objects) overlap with each other. Discontinuous structures refer to disconnected objects (or entities). For example, “HEL, KU812 and K562 cells” contains three entities: “HEL cells”, “KU812 cells” and “K562 cells”. The first two NEs are discontinuous. A discontinuous structure can be transformed into a nested structure [41]. For example, the above example can be processed as three nested NEs: “HEL, KU812 and K562 cells”, “KU812 and K562 cells” and “K562 cells”. In this paper, we only consider flattened and nested structures.

In object detection, which is a fundamental task in computer vision, regions of an image are classified to predict the locations of objects. In the early stage, this task was implemented in a multistage pipeline, where a region proposal step is applied to select coarse proposals. Another trend in object detection is the adoption of end-to-end architectures. A deep neural network is first adopted to map an input image into abstract representations known as conv feature maps. Then, proposals are generated from these feature maps. A proposal is an abstract representation of a possible object, with parameters defined to indicate its location and shape in an image. Finally, a multiobjective learning framework is designed to simultaneously locate objects and predict the class probability.

Motivated by techniques developed in computer vision, we adopt a region proposal network to generate abstract NE representations (referred to as bounding boxes). Every bounding box is annotated with three parameters to indicate its entity type, start position and length in a sentence. Then, the spatial locations of NEs are represented as real values. This enables locating NEs in a sentence using a regression operation. The concept of regressing boundaries for NE recognition is visualized in Fig. 2.

Fig. 2 Boundary regression

As shown in Fig. 2, an input sentence is first mapped into recu feature maps by a deep neural network. The feature maps can be seen as an abstract representation of the input sentence. Every feature map denotes a representation of an NE boundary, which can be combined with others to generate bounding boxes (e.g. \(d_i\) in Fig. 2). Every bounding box has two parameters denoting its position and shape in a sentence. If a bounding box correctly matches a true NE, the box is referred to as a “ground truth box” (or truth box) (e.g. \(g_j\) in Fig. 2).

Every box has two parameters to indicate its position (\(s_i\)) and shape (or length) (\(l_i\))Footnote 1. The regression operation predicts the position offset and shape offset (\(\Delta s_i\) and \(\Delta l_i\), respectively) relative to a truth box. Finally, in the output, the locations of the recognized NEs are updated as \(\tilde{d}_i=\{s_i+\Delta s_i, l_i+\Delta l_i\}\). Because the outputs of a regression operation are continuous values, they are rounded to the nearest word boundary locations.
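
As a minimal illustration of this decoding step (our own sketch with hypothetical variable names and toy numbers, not the authors' implementation), the predicted offsets can be applied to a box and the refined boundaries rounded back to word positions as follows:

```python
def decode_box(s_i, l_i, delta_s, delta_l, sentence_len):
    """Apply predicted offsets to a box (s_i, l_i) and round the refined,
    continuous boundaries to the nearest word positions."""
    start = s_i + delta_s                  # refined (continuous) start
    end = s_i + delta_s + l_i + delta_l    # refined (continuous) end
    start = min(max(round(start), 0), sentence_len)
    end = min(max(round(end), start), sentence_len)
    return start, end

# a box starting at word 3 with length 2, nudged slightly to the right
# and stretched by roughly one word
print(decode_box(3.0, 2.0, 0.4, 0.8, sentence_len=10))   # -> (3, 6)
```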

To design an end-to-end multiobjective learning architecture for boundary regression, we carefully take the following four issues into consideration:

  1. Representation: object detection usually uses stacked convolutional layers to map an image into conv feature maps. In language processing, a recurrent (or attention) neural network is more effective for capturing the semantic dependencies in a sentence. In this paper, the abstract representations of input sentences are referred to as recu feature maps.

  2. Region proposal: in the recu feature map, a feature map position can be considered an abstract representation of a possible entity boundary. It can be combined with other feature maps to generate bounding boxes. Every bounding box is an abstract representation of an NE candidate labelled with its location information and class category.

  3. Multiobjective learning: because the recurrent neural network can learn the semantic dependency, a bounding box contains semantic information about the whole sentence. In addition to predicting conditional class probabilities on a bounding box, a linear layer can be stacked to predict its location in a sentence.

  4. Maximum of overlapping neighbourhoods: in the prediction process, every bounding box approaches a true bounding box. They overlap in the neighbourhood of a true bounding box. It is necessary to collect the most likely matched bounding boxes from overlapped bounding boxes.

According to the above discussion, we have designed an end-to-end multiobjective learning architecture for boundary regression. The architecture of the BR model is given in the following section.

Model

In this paper, instead of modelling the NE recognition task as a classification problem, we frame this task as a multiobjective optimization process. In this framework, in addition to outputting discretized entity categories, a regression operation is integrated into a deep network for predicting the location offset of an NE candidate relative to a true NE in a sentence. The structure of the BR model is shown in Fig. 3.

Fig. 3 An input sentence is input into a basic network, which maps the sentence into a recu feature map. In the region proposal network, bounding boxes are generated from the recu feature map. Then, the model is trained to satisfy the training objective. Nonmaximum suppression is adopted to produce a final decision

As Fig. 3 shows, six specific areas (A to F) are highlighted in the BR model. They are discussed as follows.

Basic Network

In a neural network model, the values of the inputs represent the intensity of signals. Therefore, words in a sentence were traditionally represented as high-dimensional one-hot vectors. At present, deep neural networks have two advantages to support the automatic extraction of semantic features from raw inputs: word embedding and feature transformation. Word embedding is applied to map every word into a distributed representation, which encodes semantic information learned from external resources. In feature transformation, many types of neural layers (CNN [42], LSTM [2] or attention [43] layers) can be stacked to support the designed feature transformation for capturing syntactic and semantic features of a sentence, avoiding the need for manual feature engineering.

For feature extraction, there are two differences between object detection and NE recognition. First, the detection of objects is mainly based on the internal features of objects. Therefore, an object moving within an image exerts less influence on object detection. However, entities have strong semantic dependencies in a sentence. Second, in nested objects, features of the bottom object are blocked by upper objects, which makes bottom object detection challenging. Nested NEs, in contrast, share the same context in a sentence, so it is important to learn dependent features that are relevant to the considered NEs. Therefore, compared with object detection, encoding semantic dependencies in a sentence is more important for NE recognition. In deep architectures, a recurrent neural network or an attention network is helpful for capturing the semantic dependency between words.

In our BR model, we adopt a basic network consisting of an embedding layer and a Bi-LSTM layer. The embedding layer is adopted to map a sentence into a distributed representation, where words (or characters) are embedded into vectors by a lookup table pretrained with unsupervised algorithms. Then, a Bi-LSTM is implemented to encode the semantic dependencies in a sentence. To simplify the region proposal step, we set the length of the input sentence as a fixed number, denoted as L. Longer or shorter sentences are trimmed or padded, respectively.
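
A minimal PyTorch sketch of such a basic network is given below (our own illustration; the vocabulary size, embedding dimension and hidden dimension are placeholder values, and a plain lookup table stands in for the BERT-initialized embeddings used in the experiments):

```python
import torch
import torch.nn as nn

class BasicNetwork(nn.Module):
    """Embedding layer + Bi-LSTM that maps a fixed-length sentence
    (L tokens) into 'recu' feature maps, one vector per position."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128, pad_idx=0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, token_ids):          # token_ids: (batch, L)
        x = self.embed(token_ids)          # (batch, L, emb_dim)
        feature_maps, _ = self.bilstm(x)   # (batch, L, 2 * hidden_dim)
        return feature_maps

# sentences longer than L are trimmed, shorter ones padded to L
L = 50
net = BasicNetwork(vocab_size=10000)
tokens = torch.randint(1, 10000, (2, L))   # a toy batch of 2 sentences
print(net(tokens).shape)                   # torch.Size([2, 50, 256])
```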

Feature Map

The output of the basic network is denoted as recu feature maps, where a feature map is an abstract token representation of the input. In the BR model, it also denotes the feature map layer, which represents high-order abstract features of a sentence integrated with dependent semantics between words. In computer vision, images have an invariance property under zooming operations. Object detection can benefit from multigranularity representations, where region proposal can be implemented on multiscale feature maps to generate bounding boxes with different scales. In natural language processing, it is difficult to condense a textual sequence into multigranular representations. At present, for each input sentence, we generate a single recu feature map layer, which can be seen as a high-order abstract representation of the input sentence. Each feature map position corresponds to a possible entity boundary. Because we adopt a continuous location representation, the positions of feature maps are normalized into the interval [0, 1] to support the regression operation.

The feature map layer is mainly applied to support region proposal for generating abstract NE representations. Instead of directly generating NE candidates from a sentence (e.g. Sohrab and Miwa [5]), by generating abstract NEs from the feature map layer, parameters in the bottom network can be shared. This reduces the computational complexity and enables more potent nonlinear function approximators to enhance model discriminability.

Region Proposal

A feature map corresponds to an abstract representation of a possible NE boundary in a sentence. Each feature map can be set as a start position and combined with feature maps to its right to generate bounding boxes with different lengths. In this paper, for every feature map, we enumerate K bounding boxes from left to right. The value K is a predefined parameter indicating the longest NE candidate. This is similar to an exhaustive enumeration method, which verifies every possible NE candidate up to a certain length (e.g. Sohrab and Miwa [5]). The difference is that bounding boxes are referred to by their spatial locations in a sentence, which can be used to filter bounding boxes that are unlikely to be truth boxes (discussed in the “Bounding Boxes” section). This reduces the computational complexity and decreases the influence of negative examples.

To show the potential ability of the BR model to locate NEs that are missed in the region proposal process, in this experiment, we also propose an interval enumeration strategy. For every feature map, we enumerate bounding boxes from left to right with lengths [1, 3, 5, 7, 11, 15, 20]Footnote 2. In the training process, all ground truth boxes in the training data are included to train the classifier. In the testing process, only bounding boxes with lengths [1, 3, 5, 7, 11, 15, 20] are verified. Compared with exhaustive enumeration with lengths from 1 to 20, interval enumeration reduces the computational overhead by approximately 65%. For convenience, we refer to the BR model with interval enumeration as “BR\(_{\text {int}}\)”. The BR model implemented with exhaustive enumeration is referred to as “BR\(_{\text {exh}}\)”.
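
The two enumeration strategies can be sketched as follows (a simplified illustration of our own; spans are returned as token-level (start, length) pairs, and the length limits are parameters rather than the exact values used in every experiment):

```python
def propose_boxes(sent_len, lengths):
    """Enumerate candidate bounding boxes as (start, length) pairs:
    every position is a potential start, paired with each allowed length."""
    boxes = []
    for start in range(sent_len):
        for length in lengths:
            if start + length <= sent_len:
                boxes.append((start, length))
    return boxes

# exhaustive enumeration (BR_exh): all lengths up to a maximum (6 in our experiments)
exhaustive = propose_boxes(50, lengths=range(1, 7))
# interval enumeration (BR_int): only a sparse set of lengths
interval = propose_boxes(50, lengths=[1, 3, 5, 7, 11, 15, 20])
print(len(exhaustive), len(interval))
```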

Bounding Boxes

A bounding box is a high-order abstract representation of a possible NE generated from feature maps by region proposal. Because feature maps are produced by a basic network that consists of a recurrent neural network or an attention network, each bounding box contains contextual features about a possible NE. Using the class labels and location parameters of bounding boxes, a softmax layer and a linear layer can be set to predict their class probabilities and learn the location offsets relative to a ground truth box. In the following, we give formal definitions of the bounding box.

Let \(\mathbf {D}=\{d_1,d_2,\cdots ,d_M\}\) denote a bounding box set generated from an input sentence S. M is the size of \(\mathbf {D}\). Each bounding box \(d_i \in \mathbf {D}\) has 3 parameters: \(d_i^s\), \(d_i^l\) and \(\mathbf {c}_i\). Parameters \(d_i^s\) and \(d_i^l\) are two real numbers denoting the start position and length of \(d_i\) in a sentence, respectively. The end position of \(d_i\) can be computed as \(d_i^s + d_i^l\). Parameter \(\mathbf {c}_i=(c_i^1, c_i^2,\cdots ,c_i^Z)\) \((c_i^z \in \{0,1\})\) is a one-hot vector representing the entity type of \(d_i\), where Z is the number of entity types. Therefore, a bounding box \(d_i\) can be referred to as a three-tuple \(d_i=\langle d_i^s, d_i^l, \mathbf {c}_i \rangle\). If a bounding box corresponds to a true NE, it is referred to as a ground truth box and represented as \(g_j=\langle g_j^s, g_j^l, \mathbf {c}_j \rangle\).

Bounding boxes are labelled with location parameters. Borrowing the intersection over union (IoU) metric developed in computer vision [44], the overlapping ratio between two bounding boxes can be measured by the IoU value.

Let \(d_i\) and \(g_j\) be a bounding box and a ground truth box, respectively. The IoU value between them is computed as:

$$\begin{aligned} IoU(d_i,g_j)=\frac{span(d_i) \cap span(g_j)}{span(d_i) \cup span(g_j)} \end{aligned}$$
(1)

where the function \(span(d_i)\) represents the range of a bounding box in feature maps. If a bounding box has a large IoU value, it is highly overlapped with a ground truth box. A high overlapping ratio indicates that a bounding box contains adequate contextual features about a true NE, which guarantees learning of the location offset relevant to a truth box. Otherwise, if the IoU value of a bounding box is lower than a predefined threshold, it denotes a false NE. This is used to train a classifier for identifying false NEs.
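
In one dimension, Eq. 1 reduces to a simple interval-overlap computation. A minimal sketch (our own, with boxes given as (start, length) pairs) is:

```python
def iou_1d(box_a, box_b):
    """IoU of two one-dimensional spans given as (start, length) pairs (Eq. 1)."""
    a_start, a_len = box_a
    b_start, b_len = box_b
    inter = max(0.0, min(a_start + a_len, b_start + b_len)
                - max(a_start, b_start))
    union = a_len + b_len - inter
    return inter / union if union > 0 else 0.0

# a candidate of length 3 against a truth box of length 4 sharing the same start
print(iou_1d((11, 3), (11, 4)))   # 0.75
```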

Let \(\mathbf {D}_G\) represent the set of all ground truth boxes in \(\mathbf {D}\). We define two sets as follows:

$$\begin{aligned} \begin{aligned}&\mathbf {D}_p=\{d_i|d_i \in \mathbf {D}, \exists g_j\in \mathbf {D}_G(IoU(d_i,g_j)\geqslant \gamma ) \}\\&\mathbf {D}_n= \{d_i| d_i\in \mathbf {D}, \forall g_j\in \mathbf {D}_G(IoU(d_i,g_j) < \gamma ) \} \end{aligned} \end{aligned}$$
(2)

where \(\gamma\) is a predefined threshold. \(\mathbf {D}_G\) is a subset of \(\mathbf {D}_p\), where \(\gamma\) is equal to 1.

In this paper, \(\mathbf {D}_p\) is referred to as the positive bounding box set, and \(\mathbf {D}_n\) is referred to as the negative bounding box set. In region proposal, a large number of negative bounding boxes will be generated, which leads to a significant data imbalance problem. This is also computationally expensive. In the training process, we collect \(\mathbf {D}_p\) and \(\mathbf {D}_n\) at a ratio of 1:3 to balance the positive and negative samples. This guarantees faster optimization and a stable training process.
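
The following sketch (our own; it reuses iou_1d from the previous sketch) illustrates how Eq. 2 and the 1:3 sampling could be realized:

```python
import random

def partition_boxes(boxes, truth_boxes, gamma):
    """Split proposed boxes into the positive set D_p and the negative set D_n (Eq. 2)."""
    positives, negatives = [], []
    for d in boxes:
        if any(iou_1d(d, g) >= gamma for g in truth_boxes):
            positives.append(d)
        else:
            negatives.append(d)
    return positives, negatives

def sample_training_boxes(positives, negatives, neg_ratio=3):
    """Subsample negatives so that positives and negatives appear at a 1:3 ratio."""
    k = min(len(negatives), neg_ratio * len(positives))
    return positives, random.sample(negatives, k)
```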

Given a bounding box \(d_i\), its relative ground truth box is identified as:

$$\begin{aligned} g_j=\mathop {\arg \max }_{g\in \mathbf {D}_G} IoU(d_i,g) \end{aligned}$$
(3)

Given a ground truth box \(g_j\), all bounding boxes of \(\mathbf {D}_p\) satisfying Eq. 3 are referred to as \(\mathbf {D}_{g_j}\). They are the neighbourhoods of \(g_j\). This is formalized as:

$$\begin{aligned} \mathbf {D}_{g_j}=\{ d_{i} | d_{i} \ \text {is a neighbourhood of}\ g_j.\} \end{aligned}$$
(4)

It is important to note that all bounding boxes in \(\mathbf {D}_{g_j}\) in the training data are labelled with the same positive class tag as \(g_j\). This labelling strategy is different from the traditional method, in which, if the start and end boundaries of an NE candidate do not precisely match a true NE, it is labelled with a negative class tag. The reason for this will be discussed in detail in the “Training Objective” section. For consistency, in this paper, we use the term “positive box” to refer to a bounding box whose IoU value relative to a ground truth box is larger than \(\gamma\). The term “truth box” refers to a bounding box whose location precisely matches a real NE.

Based on \(\mathbf {D}_G\) and Eq. 3, \(\mathbf {D}_p\) can be partitioned into a set \(\mathcal {D}_p=\{\mathbf {D}_{g_1},\mathbf {D}_{g_2},\cdots \}\). Therefore, \(\mathcal {D}_p\) is a partition of \(\mathbf {D}_p\). If \(i \ne j\), then \(\mathbf {D}_{g_i} \cap \mathbf {D}_{g_j} = \emptyset\). Every bounding box \(d_i \in \mathbf {D}_p\) belongs to a \(\mathbf {D}_{g_j} \in \mathcal {D}_p\).

For convenience, Table 1 lists the definitions of different bounding box sets. Their roles in supporting boundary regression will be discussed in the following subsection.

Table 1 Bounding box sets of different types

Training Objective

The BR model simultaneously predicts the classification score of an NE candidate and refines its spatial location in a sentence. It is a multiobjective learning framework that involves two objective functions: a location loss function and a confidence loss function. In the training process, we optimize the BR model by reducing the total loss of the location offset and class prediction. The training objective is formalized as follows.

Let \(\hat{d}_{ij}^s = (g_j^s - d_i^s)/g_j^l\) and \(\hat{d}_{ij}^l=\log (g_j^l/d_i^l)\) be the normalized position offset and shape offset between \(d_i\) and \(g_j\). Given a bounding box \(d_i\), the BR model outputs 3 parameters: \(\Delta d_i^s\), \(\Delta d_i^l\) and \(\tilde{\mathbf {c}}_i\). \(\Delta d_i^s\) and \(\Delta d_i^l\) denote the predicted position offset and shape offset of \(d_i\) relative to \(g_j\). \(\tilde{\mathbf {c}}_i\) is a vector of confidence scores that reflects the confidence that the box contains a true NE of each type. As Fig. 3 shows, \(\Delta d_i^s\) and \(\Delta d_i^l\) are regressed by a linear layer, while the classification confidence score \(\tilde{\mathbf {c}}_i=(\tilde{c}_{i}^0, \tilde{c}_{i}^1,\cdots ,\tilde{c}_{i}^Z)\) is predicted by a softmax layer.

For every \(d_i \in \mathbf {D}_p\), the location offset of \(d_i\) relative to a ground truth box is predicted by a linear layer. A characteristic function \(E_{ij}^z \in \{0,1\}\) is defined to indicate whether a default box \(d_i\) is matched to the ground truth box \(g_j\) selected by Eq. 3. In the training process, the regression operation updates \(\Delta d_i^s\) and \(\Delta d_i^l\) to approach \(\hat{d}_{ij}^s\) and \(\hat{d}_{ij}^l\), respectively. The location loss can be computed as:

$$\begin{aligned} L_{loc}(x, s, l) = \sum _{g_j \in \mathbf {D}_G} \frac{1}{N} \left( \sum _{d_i \in \mathbf {D}_{g_j}} \sum _{h \in \{s, l\}} E_{ij}^z \text{Smooth}_{L_1}(\Delta d_i^h - \hat{d}_{ij}^h) \right) \end{aligned}$$
(5)

where \(N=|\mathbf {D}_{g_j}|\) is the cardinality of \(\mathbf {D}_{g_j}\). It is used to normalize the weight between \(g_j \in \mathbf {D}_G\). \(\text {Smooth}_{L_1}\) is a robust \(L_1\) loss that quantifies the dissimilarity between \(d_i\) and \(g_j\). It is less sensitive to outliers [45].

Equation 5 shows that only the positive bounding box set \(\mathbf {D}_p\) is adopted to compute the location loss. For every ground truth box \(g_j \in \mathbf {D}_G\), its neighbourhood set \(\mathbf {D}_{g_j}\) is used to generate the location loss. This setting is natural because neighbourhoods contain sufficient contextual features about an NE to support boundary regression. In contrast, negative bounding boxes are farther away from a ground truth box. Because of the vanishing gradient problem, it is difficult to precisely regress their location offsets.

When minimizing the location loss, bounding boxes belonging to \(\mathbf {D}_{g_j}\) will approach the ground truth box \(g_j\). Therefore, all bounding boxes in \(\mathbf {D}_{g_j}\) are given a class tag that is the same as the ground truth box \(g_j\).
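
A hedged PyTorch sketch of the per-truth-box term of Eq. 5 is given below (our own rendering; the tensor shapes and names are assumptions, and the boxes are assumed to already be matched to this truth box, i.e. \(E_{ij}^z=1\)):

```python
import torch
import torch.nn.functional as F

def location_loss_for_truth_box(pred_offsets, boxes, truth_box):
    """Smooth-L1 location loss for the neighbourhood D_{g_j} of one ground
    truth box (the inner term of Eq. 5), normalized by N = |D_{g_j}|.
    pred_offsets: (N, 2) predicted (Δs, Δl); boxes: (N, 2) as (s, l);
    truth_box: tensor of shape (2,) as (s, l)."""
    g_s, g_l = truth_box[0], truth_box[1]
    d_s, d_l = boxes[:, 0], boxes[:, 1]
    target_s = (g_s - d_s) / g_l           # \hat{d}^s_{ij}
    target_l = torch.log(g_l / d_l)        # \hat{d}^l_{ij}
    targets = torch.stack([target_s, target_l], dim=1)
    loss = F.smooth_l1_loss(pred_offsets, targets, reduction="sum")
    return loss / boxes.shape[0]
```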

Confidence loss is a softmax loss over multiple class confidences. It is given as follows:

$$\begin{aligned} L_{con}(x,c)= -\sum _{d_i \in \mathbf {D}_p} E_{ij}^z log(\tilde{c}_i^z) - \sum _{d_i \in \mathbf {D}_n} log(\tilde{c}_i^0) \end{aligned}$$
(6)

where \(\tilde{c}_i^z=\exp (\tilde{c}_i^z)/\sum _{z'=0}^{Z}\exp (\tilde{c}_i^{z'})\) is the softmax-normalized confidence, and \(\tilde{c}_i^0\) is the confidence score indicating that an example is negative. A key issue for the classification is that the confidence score should be estimated based on NE representations with refined spatial locations in a sentence.

The total loss function combines the location loss and confidence loss:

$$\begin{aligned} L(x, s, l, c) = L_{loc}(x, s, l) + \alpha L_{con}(x,c) \end{aligned}$$
(7)

where \(\alpha\) is a predefined parameter balancing the weight between the location loss and confidence loss. The training objective is to reduce the total loss of the location offset and class prediction. In the training process, we optimize their locations to improve their matching degree and maximize their confidences.
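
Equations 6 and 7 can be sketched in the same style (again our own illustration, reusing the torch imports from the previous sketch; the logits are assumed to be raw scores over the Z + 1 classes, with index 0 reserved for the negative class):

```python
def confidence_loss(pos_logits, pos_labels, neg_logits):
    """Softmax confidence loss (Eq. 6): positive boxes are pushed toward the
    class of their matched truth box, negative boxes toward class 0."""
    loss_pos = F.cross_entropy(pos_logits, pos_labels, reduction="sum")
    neg_labels = torch.zeros(neg_logits.shape[0], dtype=torch.long)
    loss_neg = F.cross_entropy(neg_logits, neg_labels, reduction="sum")
    return loss_pos + loss_neg

def total_loss(loc_loss, con_loss, alpha=1.0):
    """Total training objective (Eq. 7): location loss plus weighted confidence loss."""
    return loc_loss + alpha * con_loss
```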

Non-maximum Suppression

In the prediction process, the BR model outputs a set of bounding boxes for each input sentence, referred to as \(\mathbf {D}=\{d_1,d_2, \cdots ,d_M\}\). Every box \(d_i\in \mathbf {D}\) has 3 outputs: \(\Delta d_i^s\), \(\Delta d_i^l\) and \(\tilde{\mathbf {c}}_i\), indicating the position offset, shape offset and class probability of \(d_i\) relative to a truth box, respectively. After \(\Delta d_i^s\) and \(\Delta d_i^l\) are resized as \(\Delta s_i\) and \(\Delta l_i\), respectively, a predicted NE can be located as [\(s_i+\Delta s_i\), \(l_i+\Delta l_i\), \(\tilde{c_i}\)]Footnote 3.

The output \(\mathbf {D}\) contains a large number of boxes, but many of them overlap. Nonmaximum suppression (NMS), which is implemented in the prediction process to produce the final decision, selects truth boxes from overlapped neighbourhoods. The NMS algorithm is shown in Table 2.

Table 2 The NMS algorithm of the BR model

The NMS algorithm is a one-dimensional algorithm that selects nested NEs from overlapped positive boxes. It searches for local maxima among overlapping neighbourhoods, from which a smaller number of high-confidence boxes are collected. The threshold \(\lambda\) is adopted to control the overlapping ratio between neighbourhoods. In our experiments, the value of \(\lambda\) is set as 0.6.
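
The one-dimensional NMS step can be sketched as follows (our own rendering of the procedure summarized in Table 2, reusing iou_1d from the “Bounding Boxes” section; it is not the authors' exact algorithm):

```python
def nms_1d(boxes, scores, lam=0.6):
    """Greedy one-dimensional non-maximum suppression: repeatedly keep the
    highest-scoring box and discard remaining boxes whose overlap with it
    exceeds the threshold lam. Boxes are (start, length) pairs."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou_1d(boxes[i], boxes[best]) <= lam]
    return [boxes[i] for i in keep]
```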

Experiments

In our experiments, the ACE 2005 corpus [3] and the GENIA corpus [46] are adopted to evaluate the BR model. To show the performance of the BR model to recognize flattened NE structures, in the “Ablation Study” section, the BR model is also evaluated on the OntoNotes 5.0 [47] and CoNLL 2003 English [48] corpora.

The ACE 2005 corpus is collected from broadcasts, newswires and weblogs. It is the most popular source of evaluation data for NE recognition. The corpus contains three datasets: Chinese, English and Arabic. In this paper, the BR model is mainly evaluated on the ACE Chinese corpus. To show the extensibility of the BR model regarding other languages, it is also evaluated on the ACE English corpus and the GENIA corpus.

The GENIA corpus is collected from the biomedical literature. It contains 2000 MEDLINE abstracts retrieved from PubMed with three medical subject heading (MeSH) terms: human, blood cells and transcription factors. This dataset contains 36 fine-grained entity categories. In the GENIA corpus, many NEs have discontinuous structures. They are transformed into nested structures by treating the discontinuous NE as a single mention.

In the ACE Chinese dataset, there are 33,238 NEs in total. The number of NEs in the ACE English dataset is 40,122. The GENIA corpus is annotated with 91,125 NEs. The distributions of NE lengths in the three corpora are shown in Fig. 4.

Fig. 4 Distributions of NE lengths

In the basic network, the default length of sentence L is 50. Sentences with longer or shorter lengths are trimmed or padded, respectively. In the total loss function, \(\alpha =1\) is used. Two BERT\(_{BASE}\) [49] models are tuned by implementing the innermost and outermost NE recognition tasks. Then, every sentence is encoded into two sequences of vectors by two tuned BERT models, where every word in a sentence is encoded as a concatenated \(2\times 768\) dimensional vector. It is fed into a Bi-LSTM layer, which outputs a \(2\times 128\) dimensional recu feature map. In the training process, word representations are fixed and not subject to further tuning.

In region proposal, two strategies have been introduced: exhaustive enumeration and interval enumeration, which correspond to two BR models referred to as “BR\(_{\text {exh}}\)” and “BR\(_{\text {int}}\)”. The BR\(_{\text {exh}}\) model exhaustively enumerates all NEs with lengths up to 6. We ignore NEs with lengths larger than 6. In the BR\(_{\text {int}}\) model, we intermittently enumerate bounding boxes from left to right with lengths [1, 3, 5, 7, 11, 15, 20]. To collect the positive bounding box set \(\mathbf {D}_p\) to train the linear layer, \(\gamma\) is set as 0.7 and 0.6 for BR\(_{\text {exh}}\) and BR\(_{\text {int}}\), respectively. The quantitative test to set \(\gamma\) is discussed in the “Influence of IoU” section.

In the output layer, a correct NE requires that the start and end boundaries of an NE are precisely identified. Because the BR model uses a regression operation to predict the spatial locations of NEs in a sentence, all entity locations are mapped into the interval [0, 1] for a smooth learning gradient. Therefore, the output of the BR model is rounded to the nearest character location.

Comparison with Related Work

To show the superiority of our model, we first compare the BR model with related work. The BR model is first evaluated on the Chinese corpus. Then, to show the extensibility of the model, it is also applied to the English corpora for further assessment.

Evaluation on the Chinese Corpus

On the Chinese corpus, we first implement a popular sequence model (Bi-LSTM-CRF) [50]. It consists of an embedding layer, a Bi-LSTM layer, an MLP (multilayer perceptron) layer and a CRF layer. The embedding layer and Bi-LSTM layer have the same settings as the basic network of the BR model. We adopt cascading and layering strategies to solve the nesting problem [20]. The layering model proposed by Lin et al. [11] is adopted for comparison.

On the Chinese ACE corpus, BA is a pipeline framework for nested NE recognition that has achieved state-of-the-art performance [9]. The original BA is a “Shallow” model, which uses a CRF model to identify NE boundaries and a maximum entropy model to classify NE candidates. NNBA is a neural network version, where the LSTM-CRF model is adopted to identify NE boundaries, and a multi-LSTM model is adopted to filter NE candidates.

In this experiment, the “Adam” optimizer is adopted. The learning rate, weight decay rate and batch size are set as 0.00005, 0.01 and 30, respectively. Shallow models refer to CRF-based models. In the BR model, we use the same settings as those used by Chen et al. [9] to configure the basic neural network, where the BERT is adopted to initialize word embeddings. These models are implemented with the same data and settings as Chen et al. [9]. The results are shown in Table 3.

Table 3 Evaluation in the Chinese corpus
Table 4 Evaluation in the English corpus

In Table 3, all deep models outperform shallow models because neural networks can effectively utilize external resources by using a pretrained lookup table and have the advantage of learning abstract features from raw input. In deep models, the performances of the innermost and outermost models are heavily influenced by a lower recall rate, which is caused by ignoring nested NEs. The deep cascading model also suffers from poor performance because predicting every entity type by an independent classifier does not make full use of the annotated data. The deep layering model is impressive. This model is produced by implementing two independent classifiers that separately recognize the innermost and outermost NEs. It offers higher performance, even outperforming the NNBA model. The reason for the improvement is that, in our experiments, entities with lengths exceeding 6 are ignored, which decreases the nesting ratio. Most of the nested NEs have two layers, which can be handled appropriately by the layering model. In Table 3, the BR model exhibits the best performance.

The Chinese language is logographic. It contains very little morphological information (e.g. capitalization) to indicate word usage. Because there is a lack of delimitation between words, it is difficult to distinguish monosyllabic words from monosyllabic morphemes. However, the Chinese language has two distinctive characteristics. First, Chinese characters are similarly shaped squares. They are known as square-shaped characters, and their locations are uniform. Second, because the meaning of a Chinese word is usually derived from the characters it contains, every character is informative. Therefore, character representation can effectively capture the syntactic and semantic information of a sentence. The BR model works well on the Chinese corpus.

Evaluation on the English Corpus

On the ACE English corpus and the GENIA corpus, we adopt the same settings as those by Lu and Roth [32] to evaluate the BR model, where the evaluation data are divided according to the proportion 8:1:1 for training, developing and testing. On the GENIA corpus, researchers often report the performance with respect to five NE types (DNA, RNA, protein, cell line and cell type). To compare with existing methods, we generate results for the five NE types.

As shown in Table 4, Lu and Roth [32] and Katiyar and Cardie [51] represented nested NEs as hypergraphs. Ju et al. [34] fed the output of a BiLSTM-CRF model to another BiLSTM-CRF model. This strategy generates layered labelling sequences. The stack-LSTM [33] uses a forest structure to model nested NEs. Then, a stack-LSTM is implemented to output a set of nested NEs. Sequence-to-nuggets [11] first identifies whether a word is an anchor word of an NE with specific types. Then, a region recognizer is implemented to recognize the range of the NE relative to the anchor word. Xia et al. [6] and Fisher and Vlachos [52] are pipeline frameworks. They first generate NE candidates. Then, all candidates are further assessed by a classifier. Shibuya and Hovy [53] iteratively extracted the entities outermost to innermost. Strakova et al. [36] encode an input sentence into a vector representation. Then, a label sequence is directly generated from the sentence representation. Wang et al. [54] used a CNN to condense a sentence into a stacked hidden representation with a pyramid shape, where a layer represents NE candidate representations with different granularities. Shen et al. [56] proposed a two-stage method, where NE candidates are located by a linear layer, then fed into a classifier for prediction. These models are all nesting-oriented models. Their performances are listed in Table 4.

Table 4 shows that the performance on the GENIA corpus is lower than that on the ACE corpus. There are three reasons for this phenomenon. First, the GENIA corpus was annotated using discontinuous NEs. For the example mentioned in the “Motivation” section, “HEL, KU812 and K562 cells” contains two discontinuous NEs. Second, in the GENIA corpus, nested NEs may occur in a single word. For example, “TCR-ligand” is annotated as an “other_name” entity, and it is nested with a “TCR” protein. Third, a large number of abbreviations are annotated in the GENIA corpus, which brings about a serious feature sparsity problem. Therefore, the performance is lower on the GENIA corpus.

In related work, many models, e.g. Xu and Jiang [7], Sohrab and Miwa [5], Xia et al. [6] and Tan et al. [8] also exhaustively verify every possible NE candidate with length up to 6, because limiting the length of NEs can reduce the influence caused by negative instances. As a result, these models achieve higher performance. In comparison, the BR\(_{\text {int}}\) model achieves a significantly improved performance. Using the testing data, the ratios of NEs with lengths [1, 3, 5, 7, 11, 15, 20] on the ACE English and the GENIA corpora are 79.47% and 58.34%, respectively (the ratio in the ACE Chinese corpus is 39.89%). Therefore, the BR\(_{\text {int}}\) model also achieves competitive performance on the ACE English and GENIA corpora.

In Table 4, all neural network-based models exhibit higher performance. Especially for the BERT-based models, the performance is improved considerably. Li et al. [55] presented a model based on machine reading comprehension, where manually designed questions are required to encode NE representations. It achieves higher performance on the GENIA corpus. However, because this model benefits from prior knowledge and experience, which essentially introduce descriptive information about the categories, it is rarely used for comparison with related work. In comparison with related work on the English corpora, the BR model also shows competitive performance.

Table 5 Feasibility of boundary regression

Ablation Study

In natural language processing, continuous location representation, which denotes the positions of linguistic units in a sentence, has not been widely used. Therefore, the regression operation is rarely used to support information extraction. To the best of our knowledge, the BR model represents the first attempt to locate linguistic units in a sentence by a regression operation. To analyse the mechanism of boundary regression for nested NE recognition, we design a traditional NE classification model named the bounding box classifier (BBC) for comparison. It is generated by omitting the linear layer from the deep architecture in Fig. 3. In the output, only a softmax layer is adopted to predict the class probability for every bounding box.

In this section, three experiments are conducted to show the usefulness of the regression operation. We first conduct two ablation studies to show the feasibility of boundary regression. In the first experiment, exhaustive enumeration is adopted in region proposal. The BBC model and the BR model are compared to show the ability of the BR model to refine the spatial locations of NEs in a sentence. In the second experiment, the BR model is implemented on intermittently enumerated bounding boxes. The experiment shows the ability of the BR model to locate true NEs from mismatched NE candidates. The BR model is mainly designed to support nested NE recognition. It can also be used to recognize flattened NEs. Therefore, in the third experiment, we evaluate the BR model on flattened NE recognition. The first experiment and the second experiment are conducted on the ACE Chinese corpus. The third experiment is implemented on two English corpora with flattened NE annotations: the OntoNotes 5.0 [47] corpus and the CoNLL 2003 [48] corpus.

Performance with Exhaustive Enumeration

In this experiment, we compare the BR\(_{\text {exh}}\) model with two BBC models: BBC (0.7) and BBC (1.0). The BBC (0.7) model is implemented on the same evaluation data as the BR model, with \(\gamma =0.7\) used to collect positive bounding boxes. In the BBC (1.0) model, \(\gamma =1.0\) is applied. Under this setting, the positive bounding box set and the negative bounding box set can be denoted as \(\mathbf {D}_G\) and \((\mathbf {D}_p \cup \mathbf {D}_n)-\mathbf {D}_G\), respectively. This means that every positive bounding box is precisely matched to a ground truth box. Therefore, the BBC (1.0) model is a traditional classifier implemented on precisely annotated evaluation data.

We implement the BBC model and the BR\(_{\text {exh}}\) model with the same data and settings. The results are shown in Table 5, where the “number” column refers to the number of annotated NEs in the corpus. The performance is reported with respect to 7 true entity types. The “total” column denotes the microaverage of all entity types.

In NE recognition, a correct output requires that both the start and end boundaries be precisely matched to a manually annotated NE. Because the BBC model is a traditional classifier that cannot regress mismatched boundaries, as Table 5 shows, it suffers from significantly diminished precision caused by mismatched NE boundaries. The BBC (1.0) model is implemented on the evaluation data with \(\gamma =1.0\), where boundaries of positive bounding boxes are precisely matched to true NEs. The result in Table 5 shows that, in comparison with the BBC (0.7) model, BBC (1.0) achieves higher performance.

In the BR\(_{\text {exh}}\) model, because bounding boxes in \(\mathbf {D}_p\) have a high overlapping ratio relevant to a ground truth box, they have sufficient semantic features with respect to a true NE for supporting boundary regression. In the prediction process, mismatched boundaries of bounding boxes can approach a ground truth box through the regression operation. In comparison with the BBC (0.7) model, mismatched boundaries can be revised, which considerably improves its performance. This result indicates that the regression operation truly regresses boundaries and locates NEs in a sentence.

Comparing the BR\(_{\text {exh}}\) model with the BBC (1.0) model, all NEs with lengths up to 6 are enumerated and verified. Under this condition, in the prediction process of the BR\(_{\text {exh}}\) model, approaching an NE that has already been verified is less helpful to improve the performance. However, because the BR\(_{\text {exh}}\) model can refine the spatial locations of bounding boxes in \(\mathbf {D}_p\) and share model parameters in the bottom network, a higher recall ratio can be achieved in the BR\(_{\text {exh}}\) model, which improves its final performance.

Performance of Interval Enumeration

In the second experiment, the BBC (1.0) model is compared with the BR\(_{\text {int}}\) model, which only verifies bounding boxes with lengths [1, 3, 5, 7, 11, 15, 20] in the testing dataset. The results are listed in Table 6.

Table 6 Superiority of boundary regression

Because the BBC is a traditional classifier that only assigns a class tag to every NE candidate, it cannot regress NE boundaries to locate possible NEs. Therefore, if a true NE is not enumerated in the testing data, it will be missed by the traditional classifier, which leads to greatly reduced recall. For example, in the ACE Chinese corpus, the sentence “中国要把广西发展为连接西部地区和东南亚的桥梁” (China wants to develop Guangxi into a bridge connecting the western region and Southeast Asia) contains five NEs: “中国” (China), “广西” (Guangxi), “西部地区” (the western region), “西部” (the west), and “东南亚” (Southeast Asia), which correspond to five ground truth boxes: [0, 2, GPE]Footnote 4, [4, 2, GPE], [11, 4, LOC], [11, 2, LOC], and [16, 3, GPE]. In the BBC model, only “东南亚” can be enumerated and verified, which considerably worsens its performance.

In the BR\(_{\text {int}}\) model, suppose that a true NE is missed in the region proposal process. If it overlaps with one or more bounding boxes, a regression operation can be implemented to refine their spatial locations in a sentence, enabling these boxes to approach the missed true NE. As in the previous example, in the BR\(_{\text {int}}\) model, “西部地区” cannot be enumerated in the testing data. However, it is overlapped by at least two bounding boxes: [11, 3, ?] (“西部地”) and [11, 5, ?] (“西部地区和”), where “?” means that the class tag is unknown. Because their IoU values with the truth box [11, 4, LOC] are larger than 0.7 (and thus larger than \(\gamma\)), they contain sufficient semantic information about “西部地区”, and the softmax layer outputs a high confidence score on “LOC”. Moreover, the offsets relative to the truth box [11, 4, LOC] are learned, which enables the NE “西部地区” to be correctly recognized.

Evaluation on the Flattened Corpus

In this section, the OntoNotes 5.0 [47] and CoNLL 2003 English [48] corpora are employed to evaluate the performance of the BR model to recognize NEs with flattened structures. The OntoNotes corpus is collected from a wide variety of sources, e.g. magazines, telephone conversations, newswires. It contains 76,714 sentences and is annotated with 18 entity types. The CoNLL corpus consists of 22,137 sentences collected from Reuters newswire articles. It is divided into 14,987, 3466 and 3684 sentences for training, developing and testing.

The BR model is compared with several SOTA models conducted on the OntoNotes and CoNLL corpora. Ma and Hovy [57] used a BiLSTM-CNN-CRF model that automatically encodes semantic features from words and characters. Ghaddar and Langlais [58] also used a BiLSTM-CRF model to learn lexical features from word and entity type representations. Devlin et al. [49] used the BERT framework, which is effective in learning semantic features from external resources. Li et al. [55] proposed a model based on machine reading comprehension. Yu et al. [59] used a biaffine model to encode dependency trees of sentences. Luo et al. [60] also used a Bi-LSTM model based on hierarchical contextualized representations. The results are shown in Table 7.

Table 7 Evaluation in the flattened corpus

In Table 7, the compared models are all sequence models. Three of these models are directly based on the BiLSTM network. The other three models (BERT, MRC and the biaffine model) also apply a Bi-LSTM as an inner structure for capturing the semantic dependencies in a sentence. Because sequence models output a maximized labelling sequence for each input sentence, they are effective in encoding syntactic and semantic structures in a sentence. Therefore, in flattened NE recognition, sequence models achieve the best performance.

Compared with sequence models, the BR model can be seen as a span classification model that applies a regression operation to refine the spatial locations of NEs in a sentence. Because classification is based on enumerated spans and suffers from the vanishing gradient problem, the BR model is weaker at encoding long-distance semantic dependencies in a sentence for flattened NEs. Nevertheless, as Table 7 shows, the BR model also achieves competitive performance in flattened NE recognition.

Influence of Model Parameters

Because the IoU value and the NMS algorithm are influential on the BR model, in this section, we conduct experiments to analyse the influences of IoU and NMS.

Influence of IoU

In Eq. 2, a predefined parameter \(\gamma\) is adopted to divide the training data into a positive bounding box set \(\mathbf {D}_p\) and a negative bounding box set \(\mathbf {D}_n\). Every bounding box in \(\mathbf {D}_p\) has a high overlapping ratio with a true bounding box. This overlap enables each bounding box to contain semantic features about a truth box, which are used to train the linear layer. This is the key to supporting boundary regression. As Eq. 5 shows, the location loss, which is computed from \(\mathbf {D}_p\), aggregates all position offsets between each bounding box and its relevant ground truth box. Therefore, the IoU value directly determines the number of bounding boxes used for computing the location loss.

This experiment is conducted to analyse the influence of the IoU value \(\gamma\) on the final performance. Because \(\gamma =0.0\) cannot be used to collect positive bounding boxes, the value is initialized from 0.1 to 1.0 with a step size of 0.1. The result is shown in Fig. 5.

Fig. 5 Influence of IoU values

In both the BR\(_{\text {exh}}\) model and the BR\(_{\text {int}}\) model, if \(\gamma\) has a small value, \(\mathbf {D}_p\) contains many bounding boxes with small overlapping ratios relevant to a true NE. In these bounding boxes, there are insufficient semantic features with respect to a true NE. The regression operation cannot guarantee appropriate learning of the location offset, which worsens the performance. The result indicates that a bounding box that is farther away from any true box is less helpful for boundary regression.

The BR\(_{\text {exh}}\) model achieves high performance when \(\gamma\) is approximately 0.7. When \(\gamma > 0.7\), the output of the BR\(_{\text {exh}}\) model exhibits stable performance. The reason for this phenomenon is that, when the value of \(\gamma\) is large enough, \(\mathbf {D}_p\) contains almost exclusively enumerated ground truth boxes. As Eq. 5 reveals, the influence of regression is then weakened. Because the BR\(_{\text {exh}}\) model verifies every NE candidate with lengths from 1 to 6, in this condition, the BR\(_{\text {exh}}\) model almost degenerates into a traditional classification model. Its performance is heavily dependent on the output of the softmax layer.

In the BR\(_{\text {int}}\) model, the highest performance is achieved when \(\gamma\) is approximately 0.6. When the value of \(\gamma\) exceeds 0.6, the performance is considerably diminished. In the BR\(_{\text {int}}\) model, a large \(\gamma\) means that \(\mathbf {D}_p\) contains a smaller number of positive bounding boxes, whose boundaries almost precisely match ground truth boxes. In particular, when \(\gamma =1.0\), \(\mathbf {D}_p\) only contains ground truth boxes. As Eq. 5 shows, the location loss is then always zero in the training process. Therefore, the linear layer cannot be trained appropriately.

Influence of NMS

In the testing process, the BR model adopts a one-dimensional NMS algorithm to select true bounding boxes from the output (as Table 2 shows). The NMS algorithm was originally designed to support object detection in computer vision, where a rectangle is adopted to frame an object. One difference between object detection and entity recognition is that, when detecting an object, a rectangle is permitted to only partially overlap the reference object. In contrast, recognizing an NE requires that both the start and end boundaries of the NE be precisely matched. In this experiment, we study the influence of NMS on nested NE recognition. The result is shown in Fig. 6.

Fig. 6 Influence of NMS values

The results show that when \(\lambda =0.0\), the lowest recall is obtained because many bounding boxes are discarded. When \(\lambda >0.1\), increasing \(\lambda\) slowly improves the performance. Because bounding boxes belonging to a true NE overlap closely, if \(\lambda\) is not large enough, increasing \(\lambda\) exerts little influence on the performance. Therefore, stable performance is achieved when \(\lambda\) takes a value from 0.1 to 0.6. The BR models achieve the best performance at approximately \(\lambda =0.6\). Comparing the BR\(_{\text {int}}\) model with the BR\(_{\text {exh}}\) model, the performance of the BR\(_{\text {int}}\) model decreases when \(\lambda > 0.6\). The reason for this is that the BR\(_{\text {exh}}\) model exhaustively enumerates all NE candidates with lengths up to 6, so its output contains a large number of bounding boxes that have precisely matched boundaries.

In the NE recognition task, identifying an NE heavily depends on its contextual features. Therefore, highly overlapped bounding boxes may refer to different true NEs, which will be discarded when \(\lambda \ge 0.6\). This problem can be avoided by setting \(\lambda = 1.0\). In the BR model, the performance at \(\lambda =1.0\) is the same as that of disabling the NMS algorithm. In this setting, only fully overlapped bounding boxes are considered redundant and removed. Because many bounding boxes are retained even when they have a high overlapping ratio, this setting achieves higher recall. However, it worsens the precision. As shown in Fig. 6b, \(\lambda = 1.0\) leads to a poor F1 score.

Time Complexity of Boundary Regression

In object detection for computer vision, compared with multistage pipeline models (e.g. R-CNN [61]), an end-to-end framework (e.g. Faster R-CNN [16]) is employed due to its superior speed. The reason for this is that the background of an image is learned in a single pass in the training process and is shared by all proposed regions in the image.

To show the time complexity of boundary regression, in this experiment, we compare our model with those of Zheng et al. [10] and Wang et al. [54]. Zheng et al. [10] presented a boundary-aware neural model that detects entity boundaries. Boundary-relevant regions are then utilized to predict entity categorical labels. The boundary detection and region prediction share the same bidirectional LSTM for feature extraction. Wang et al. [54] presented a pyramid-shaped model stacked with neural layers. This model directly implements NE span prediction. Therefore, it has a higher speed.

In this experiment, we implement these models on the ACE English corpus with the same data split, settings and GPU platform. The times required to train these models are shown in Fig. 7, where the height of the histogram represents the time cost in seconds.

Fig. 7 Time complexity comparison

In comparison with the two models described above, boundary regression has the lowest time complexity. The BR model has two characteristics that support high-speed recognition. First, feature maps are generated by a basic network and shared by all bounding boxes in a sentence. In fact, all bounding boxes mutually overlap; they are parts of the feature maps, which considerably reduces the number of model parameters. Second, every bounding box has location parameters. Therefore, in the learning process, the IoU value can be adopted to filter negative bounding boxes. This strategy is effective in reducing the time complexity.

Visualization of Boundary Regression

For a better understanding of boundary regression and to investigate more details of the BR model, in the following, we present a visualization of boundary regression.

The sentence “埃及是中东地区最重要的国家”Footnote 5 is selected from the testing data. It contains four nested NEs: “埃及” (Egypt, GPE), “中东地区最重要的国家” (the most important country in the Middle East area, GPE), “中东地区” (the Middle East area, LOC), and “中东” (the Middle East, GPE). A bounding box is denoted by 3 parameters \(s_i\), \(l_i\) and \(c_i\), which represent the starting position of the box, the length of the box and the class probability of the box, respectively. To visualize a bounding box, it is drawn as a rectangle. The horizontal ordinate represents the boundary positions of the bounding boxes in a sentence, which are normalized to [0, 1]. The vertical coordinate represents the classification confidence score. The colours of the bounding box represent NE types. To generate bounding boxes, the selected sentence is predicted by a pretrained BR model. All output bounding boxes are collected and drawn with respect to the sentence. The result is shown in Fig. 8.
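This drawing convention can be reproduced with a few lines of matplotlib (our own sketch; the boxes below are illustrative values, not the model's actual outputs):

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

def draw_boxes(boxes, colours):
    """Draw each bounding box (s, l, confidence) as a rectangle spanning
    [s, s + l] horizontally, with its height set to the confidence score."""
    fig, ax = plt.subplots()
    for (s, l, conf), colour in zip(boxes, colours):
        ax.add_patch(Rectangle((s, 0.0), l, conf, fill=False, edgecolor=colour))
    ax.set_xlim(0.0, 1.0)
    ax.set_ylim(0.0, 1.0)
    ax.set_xlabel("normalized position in the sentence")
    ax.set_ylabel("classification confidence")
    plt.show()

# purely illustrative boxes: (start, length, confidence)
draw_boxes([(0.0, 0.15, 0.9), (0.2, 0.7, 0.7), (0.2, 0.3, 0.8)],
           ["red", "blue", "green"])
```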

Fig. 8 Visualization of the bounding box regression

In Fig. 8a, bounding boxes are predicted by the BR model without training (0 iterations). From Fig. 8b to f, the BR model is trained with different rounds (denoted in the titles of the subfigures). Because the regression operation may output negative values for parameters \(s_i\) and \(l_i\), we filter bounding boxes with \(l_i \le 0\) or \(s_i+l_i > 1\) (beyond the sentence range).

In Fig. 8a, there is no tendency among the bounding boxes. They are distributed evenly across the whole sentence and all NE types. In Fig. 8b, the BR model is trained on the training data for only one round. One interesting phenomenon is that the red bounding boxes and blue bounding boxes are quickly grouped around NEs. Furthermore, bounding boxes of other entity types are appropriately reduced. From Fig. 8c to f, as the number of iterations increases, there are two tendencies with respect to the bounding boxes. First, the BR model becomes more confident in the entity type prediction, which increases the classification confidence of the bounding boxes. Second, the locations of the bounding boxes approach the true NEs. This indicates that using the regression operation to locate NEs is feasible.

Overlapped bounding boxes are the key to solving the nested NE problem. Figure 8f shows that nested NEs are distinguished appropriately. We have tracked several bounding boxes and found that they do not smoothly or directly approach true NEs; there are some oscillations. In the training process, a bounding box may perfectly match the true NE and then move away from it in the next iteration. However, as the number of training steps increases, these oscillations tend towards stability.

Conclusion and Future Work

In this paper, we proposed a boundary regression model for nested NE recognition. The BR model can be seen as a framework to support nested NE recognition. In the “Feature Map” section, we divided the BR model into two modules: the perceptional module and the cognitive module. In the perceptional module, various deep architectures can be designed to extract high-order abstract features from raw inputs. In the cognitive module, instead of bounding boxes, abstract NE representations can be defined with other position and shape parameters. In addition, new strategies can be designed to support region proposal for enumerating NE candidates. These issues will be addressed in our future work, and they are also open to researchers who are interested in this work.