1 Introduction

Relation extraction is an important task in information extraction that generates relation triples of the form \(< entity \; e_{1}, entity \; e_{2}, relation\; r>\) from unstructured textual data [1]. Fully supervised relation extraction achieves high precision, but training such models requires a large amount of annotated data, and manual annotation consumes considerable time and labor. Distant supervision therefore offers a way to obtain annotated data quickly: annotations are generated automatically by aligning a knowledge base with large-scale text.

Distant supervision saves a great deal of time and manpower, but its assumption is too strong: the automatically generated annotations contain a large amount of noisy data, which has a strong negative impact on model training. To alleviate this problem, Riedel et al. [2] weakened the hypothesis, proposing that "if there is a relation between two entities, at least one sentence in the sentence set containing these two entities describes this relation". Their model selects the single sentence most likely to express the relation for training, which filters out much of the noisy data and improves training. Hoffmann et al. [3] assumed instead that more than one sentence may express the relation, and their model selects multiple candidate sentences for training.

Given the effectiveness of the attention mechanism in processing feature information, many researchers have introduced it into relation extraction. Lin et al. [4] assigned different weights to different sentences to suppress noisy sentences and improve performance. Ye et al. [5] learned sentence weights from noise distribution information, improving the precision and robustness of relation extraction. Qu et al. [6] proposed word-level attention and used the semantic information of entity word vectors as supplementary features for relation extraction. Zheng et al. [7] proposed a residual attention mechanism that strengthens the weight of keywords while preserving the transmission of semantic information.

At the same time, traditional relation extraction models do not make full use of the supervision information in correctly labeled instances, which limits performance. The soft-label method uses correctly labeled instances to dynamically correct wrong labels during training, which greatly improves the precision of relation extraction, so many researchers have introduced it into relation extraction. Liu et al. [8] combined the relation score of the entity-pair representation with the confidence of the noisy label through a joint scoring function to obtain corrected labels for specific entity pairs. Li et al. [9] integrated a gating mechanism with soft labels and adopted a sentence-level attention mechanism to improve model performance.

However, the soft-label method relies too heavily on correctly labeled instances and ignores the valid supervision information hidden in incorrectly labeled instances, which leads to a loss of supervision. To address this, we propose a distantly supervised relation extraction model based on Residual Attention and Self-Learning (RASL). The model integrates a self-learning strategy with a residual attention mechanism: the residual attention mechanism strengthens the weight of keywords while preserving the transmission of semantic information, and the self-learning strategy, together with a clustering method, selects class prototypes to generate corrected labels for training. Self-learning here refers to a self-supervised learning strategy, that is, constructing corrected labels from the structure or characteristics of the data itself and using them for training. Introducing corrected labels not only reduces the mislabeling caused by distant supervision but also makes the model independent of any additional supervision, so that the supervision information available in the data is retained as far as possible and data utilization is improved.

The contributions of our paper can be summarized as follows:

(1) We propose a distantly supervised relation extraction model based on residual attention and self-learning, which uses the residual attention mechanism to enhance the transmission of semantic information and generates corrected labels through a self-learning strategy, thereby reducing the impact of noisy labels.

(2) We select class prototypes by clustering to generate corrected labels for model training. Compared with most existing models trained with corrected labels, the proposed model needs no extra supervision information and makes full use of the supervision information available in the data, improving data utilization.

(3) We compare the proposed model with other relation extraction models that use corrected labels, which demonstrates the effectiveness of the proposed model.

The rest of this paper is structured as follows. We review related work in Sect. 2, which mainly covers neural network-based relation extraction methods. In Sect. 3 we describe the proposed RASL model. In Sect. 4 we design experiments to demonstrate the effectiveness of RASL and report the detailed results. Finally, we conclude in Sect. 5.

2 Related Work

Mintz et al. [10] proposed the idea of distant supervision: assuming two entities \(e_{1}\) and \(e_{2}\) have relation r in the knowledge base, all sentences containing \(e_{1}\) and \(e_{2}\) are assumed to express the relation r. For example, if the relation triple \(< Steve\; Jobs, Apple, Founder>\) exists in a knowledge base, every sentence containing "Steve Jobs" and "Apple" is considered to express the "founder" relation. However, a sentence mentioning "Steve Jobs" and "Apple" may actually express a different relation. Therefore, many improved distantly supervised relation extraction methods have been proposed.

Zeng et al. [11] were the first to apply deep learning to distantly supervised relation extraction, proposing the PCNN model, which combines word embeddings with position features as input and adopts piecewise pooling to capture finer-grained information. Jiang et al. [12] proposed a multi-instance multi-label convolutional neural network for relation extraction, which handles the case where an entity pair \(e_{1}\), \(e_{2}\) corresponds to multiple relations and makes full use of the information shared across sentences. Huang et al. [13] first applied a 9-layer deep residual network to distantly supervised relation extraction and achieved good results. Zhu et al. [14] combined deep learning with reinforcement learning to address the insufficient modeling of instance-label correlations. Qu et al. [15] developed value-based reinforcement learning to simultaneously solve the multi-instance and multi-label problems in neural relation extraction. Yang et al. [16] proposed an entity-concept-enhanced relation extraction method for the few-shot learning setting within the long-tail problem; by introducing the inherent concepts of entities, more information is provided for relation extraction, improving precision. Han et al. [17] designed a Recursive Hierarchy-Interactive Attention network (RHIA) that further handles long-tail relations by modeling the heuristic effect between relation levels.

To better extract semantic features, many researchers have introduced the attention mechanism into relation extraction. Lin et al. [4] assigned different weights to different sentences to suppress noisy sentences and improve performance. Zhou et al. [18] assigned higher weights to all correctly labeled instances in a bag according to the correlation between instances in the bag, so that the model can focus on the correctly labeled instances and improve precision. Shang et al. [19] proposed a distantly supervised relation extraction model with a specifically designed pattern-aware self-attention network that automatically discovers relational patterns for pre-trained Transformers in an end-to-end manner, identifying various forms of relation patterns without losing global dependencies. Li et al. [20] incorporated an entity-aware embedding module and a self-attention enhanced selective gate mechanism to integrate task-specific entity information into word embeddings and generate a complementary context-enriched representation for PCNN. Wang et al. [21] used a position-feature attention mechanism that considers all position combinations of repeated target entity pairs to handle repeated entities in a sentence. Li et al. [22] proposed a position attention mechanism that models the distance between words with a Gaussian function to reduce the influence of noisy words. Zheng et al. [7] proposed a residual attention mechanism that strengthens the weight of keywords while preserving the transmission of semantic information.

The soft-label method uses correctly labeled instances to dynamically correct wrong labels during training, which greatly improves the precision of relation extraction, and many researchers have therefore introduced it into relation extraction. Shang et al. [23] assigned credible labels to noisy sentences and transformed them into useful training data to improve model performance. Liu et al. [8] combined the relation score of the entity-pair representation with the confidence of the noisy label through a joint scoring function to obtain corrected labels for specific entity pairs. Li et al. [9] integrated a gating mechanism with soft labels and adopted a sentence-level attention mechanism to improve model performance.

Different from these methods, we propose a corrected-label supervision model based on self-learning and residual attention, which does not rely on any additional supervision information and makes full use of the supervision information in the instances themselves. It enhances the transmission of semantic information and reduces the influence of noisy labels. Specifically, the model first uses a feature extractor based on the residual attention mechanism to extract features from the training data; it then clusters by feature similarity to select representative sentences as class prototypes for each class, and generates the corresponding corrected labels by comparing each sentence with the class prototypes; finally, the corrected labels are added to training as a supervision signal, and the loss based on the corrected labels and the loss based on the noisy labels are weighted and summed into a joint loss that guides the parameter update, reducing the possibility of optimizing the model in the wrong direction. Similar methods have achieved good results in image classification tasks [24].

3 Model

The RASL model proposed in this paper consists of a trunk branch and a label self-learning branch; the overall structure is shown in Fig. 1. A feature extractor based on residual attention extracts the feature vectors of sentences and generates the sentence vector E. The label self-learning branch generates corrected labels: it clusters sentences by feature similarity, selects representative sentences as class prototypes for each class, and produces the corrected label \(y^{*}\) for each sentence by comparing its similarity with the class prototypes; \(y^{*}\) is then added to the training of the trunk branch as a supervision signal to reduce the influence of noisy labels. The trunk branch optimizes the model parameters, including the neural network parameters and the loss weights, to minimize the loss: it forms a joint loss as the weighted sum of the loss computed with the corrected labels and the loss computed with the noisy labels, which guides the parameter update and reduces the possibility of optimizing the model in the wrong direction. The feature extractor is introduced in Sect. 3.1, the trunk branch in Sect. 3.2, the label self-learning branch in Sect. 3.3, and the training procedure in Sect. 3.4.

Fig. 1 RASL model structure

3.1 Feature Extraction

We adopt the residual attention module proposed by Zheng et al. [7] for text feature extraction; its structure is shown in Fig. 2. First, the vectorization layer represents the input sentences as vectors so that they can be processed. The residual attention layer then extracts sentence features, and a max pooling layer reduces the feature dimension and helps prevent overfitting. Finally, a sentence-level attention mechanism in the classification layer assigns weights to sentences and generates the sentence vectors.

Fig. 2 Residual attention model structure

3.1.1 Vectorization

Neural networks cannot directly process natural language text, so we need to represent the sentence as a vector before feature extraction. A sentence vector consists of a word vector and a position vector.

For word vectors, we use the open-source tool Word2Vec [25] to transform each word \(w_{i}\) in a sentence \(s=\left\{ w_{1}, w_{2}, \ldots , w_{n}\right\} \) into a k-dimensional real-valued vector; the word-vector representation of the input sentence is shown in (1).

$$\begin{aligned} S^{*}=\left( w_{1}^{*}, w_{2}^{*}, \ldots , w_{n}^{*}\right) \end{aligned}$$
(1)

where n denotes the number of words in sentence s, and \(w_{i}^{*}\) denotes the k-dimensional real-valued vector corresponding to word \(w_{i}\), \(0<i \leqslant n\).

Although word vectors reflect the semantic information of words well, they cannot capture the structural information of sentences. We therefore use position features to complement the structural information of the sentences [11]: the relative distances from each word \(w_{i}\) to the two entities are mapped into d-dimensional position vectors \(w_{i}^{a}\) and \(w_{i}^{b}\), which are concatenated with the word vector \(w_{i}^{*}\). For example, in the sentence "\(\left\langle e_{1}\right\rangle \) Tom \(\left\langle /e_{1}\right\rangle \) was born in the \(\left\langle e_{2}\right\rangle \) New York \(\left\langle /e_{2}\right\rangle \)", the relative distance from the word "born" to \(e_{1}\) is 2 and the relative distance to \(e_{2}\) is -3. The two relative distances are vectorized to form the position vectors.

Let the input sentence be \(s=\left\{ w_{1},w_{2}, \ldots ,w_{n}\right\} \), the output of vectorization layer is shown in (2).

$$\begin{aligned} X(s)=\left( x_{1}, x_{2}, \ldots , x_{n}\right) \end{aligned}$$
(2)

where \(X \in R^{n \times m}\), n represents the number of words in sentence s, and m is the dimension after concatenating the word vector and the position vectors, \(m = k + 2 \times d\). \(x_{i}=\left( w_{i}^{*}, w_{i}^{a}, w_{i}^{b}\right) \) is the full vector representation of the word \(w_{i}\), \(0<i \leqslant n\).
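The vectorization layer can be sketched as follows. This is a minimal illustration under our own assumptions: a trainable word-embedding table (which would be initialized from Word2Vec in practice), two trainable position-embedding tables, and a distance clipping bound max_dist; all variable names are ours and not from the paper.

```python
import torch
import torch.nn as nn

class Vectorization(nn.Module):
    """Builds x_i = (w_i*, w_i^a, w_i^b): word vector plus two position vectors."""
    def __init__(self, vocab_size, k=50, d=5, max_dist=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, k)         # w_i*, k-dimensional (Word2Vec-initialized)
        self.pos_emb_a = nn.Embedding(2 * max_dist + 1, d)  # embedding of the distance to entity e1
        self.pos_emb_b = nn.Embedding(2 * max_dist + 1, d)  # embedding of the distance to entity e2
        self.max_dist = max_dist

    def forward(self, word_ids, dist_e1, dist_e2):
        # word_ids, dist_e1, dist_e2: LongTensors of shape (batch, n)
        d1 = dist_e1.clamp(-self.max_dist, self.max_dist) + self.max_dist
        d2 = dist_e2.clamp(-self.max_dist, self.max_dist) + self.max_dist
        x = torch.cat([self.word_emb(word_ids),
                       self.pos_emb_a(d1),
                       self.pos_emb_b(d2)], dim=-1)         # (batch, n, k + 2d)
        return x
```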

3.1.2 Residual Attention

We convolve and activate the output of the vectorization layer to extract preliminary sentence features. We then use a trunk branch and a mask branch for feature extraction and attention-feature learning, respectively; the two branches are constructed from residual units together with down-sampling and up-sampling operations. Finally, we combine the output features of the two branches with a structure similar to a residual network to obtain the final sentence features.

(1) Residual unit

The structure of the residual unit is shown in Fig. 3. We obtain \(\textrm{F}(\gamma )\) by applying two convolution operations to the input \(\gamma \), and then combine \(\gamma \), as a more complete reference, with \(\textrm{F}(\gamma )\) to form the output of the unit. This skip structure not only reduces the training burden of deep neural networks but also alleviates the problem of low-level features gradually vanishing during transmission.

Fig. 3 Residual unit structure
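As a concrete illustration, the residual unit of Fig. 3 can be sketched in PyTorch as below. This is a minimal sketch under our own assumptions: 1-D convolutions over the word dimension, an illustrative kernel size of 3, and a ReLU between the two convolutions; the paper does not specify these details.

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Output = gamma + F(gamma), where F is two stacked convolutions (Fig. 3)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.relu = nn.ReLU()

    def forward(self, gamma):                          # gamma: (batch, channels, n)
        f = self.conv2(self.relu(self.conv1(gamma)))   # F(gamma): two convolution operations
        return gamma + f                               # skip connection
```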

(2) Down-sampling and up-sampling

The mask branch learns attention features of the same size as the output of the trunk branch through down-sampling and up-sampling. Down-sampling reduces the dimension of the input and enlarges the receptive field of the convolutional features, so that the positions of key semantic features in the sentence can be inferred more effectively. Up-sampling by interpolation then expands the dimension back while exploiting the similarity among neighboring words to better locate detailed features of the sentence. In addition, a skip connection between the down-sampling and up-sampling paths captures information at different scales.
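One possible form of the mask branch is sketched below under stated assumptions: max pooling for down-sampling, linear interpolation for up-sampling, one skip connection, and a sigmoid that produces attention values in (0, 1). The exact depth, layer types, and activation are not given in the paper; ResidualUnit refers to the sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskBranch(nn.Module):
    """Down-sample -> residual unit -> up-sample, with one skip connection (assumed layout)."""
    def __init__(self, channels):
        super().__init__()
        self.down = nn.MaxPool1d(kernel_size=2, stride=2)   # enlarge the receptive field
        self.mid = ResidualUnit(channels)
        self.skip = ResidualUnit(channels)                  # skip connection between the two paths

    def forward(self, gamma):                               # gamma: (batch, channels, n)
        n = gamma.size(-1)
        low = self.mid(self.down(gamma))                    # coarse features after down-sampling
        up = F.interpolate(low, size=n, mode='linear', align_corners=False)
        return torch.sigmoid(up + self.skip(gamma))         # attention mask M(gamma) in (0, 1)
```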

(3) Feature combination

The output of the residual attention layer is obtained by combining the outputs of the two branches. Let the input be \(\gamma \), and let the outputs of the trunk branch and the mask branch be \(T(\gamma )\) and \(M(\gamma )\), respectively. \(M(\gamma )\) acts as a control gate on the neurons and is combined with \(T(\gamma )\), as shown in (3).

$$\begin{aligned} P(\gamma )=T(\gamma ) * M(\gamma ) \end{aligned}$$
(3)

Since the outputs of the two branches are combined directly by multiplication, the effective features in \(T(\gamma )\) may be destroyed by \(M(\gamma )\), so we merge the outputs of the two branches in a form similar to the residual network structure, as shown in (4).

$$\begin{aligned} P(\gamma )=T(\gamma ) *(1+M(\gamma )) \end{aligned}$$
(4)

This form of attention combination is called the residual attention mechanism. It ensures that \(M(\gamma )\) can suppress the noisy features in \(T(\gamma )\), while allowing \(T(\gamma )\) to bypass the mask branch, which weakens the feature-selection pressure of \(M(\gamma )\) and preserves the propagation of effective features.
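Putting the pieces together, Eq. (4) can be written directly, as in the sketch below. The composition of the trunk branch (two stacked residual units here) is our assumption; MaskBranch and ResidualUnit refer to the sketches above.

```python
import torch.nn as nn

class ResidualAttentionLayer(nn.Module):
    """P(gamma) = T(gamma) * (1 + M(gamma)), Eq. (4)."""
    def __init__(self, channels):
        super().__init__()
        self.trunk = nn.Sequential(ResidualUnit(channels), ResidualUnit(channels))
        self.mask = MaskBranch(channels)

    def forward(self, gamma):
        t = self.trunk(gamma)          # T(gamma): feature extraction
        m = self.mask(gamma)           # M(gamma): attention values in (0, 1)
        return t * (1.0 + m)           # residual attention combination
```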

3.1.3 Max Pooling

We adopt the maximum pooling method to reduce the dimension of the output \(P(\gamma )=\left( p_{1}, p_{2}, \ldots , p_{l}\right) \) of the residual attention layer, as shown in (5).

$$\begin{aligned} p_{i}^{*}=\max \left( p_{i}\right) (0<i \le l) \end{aligned}$$
(5)

The output of the max pooling layer is \( P^{*}(\gamma )=\left( p_{1}^{*}, p_{2}^{*}, \ldots , p_{l}^{*}\right) \), where l is the number of convolution kernels; each kernel yields one value for the sentence, so the sentence obtains an l-dimensional feature vector \( P^{*}(\gamma )\) under the l kernels.
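In code, Eq. (5) is simply a maximum over the word dimension; the shapes below are illustrative, not taken from the paper.

```python
import torch

p = torch.randn(2, 230, 120)     # toy feature map P(gamma): (batch, l kernels, n words)
p_star = p.max(dim=-1).values    # Eq. (5): l-dimensional feature vector per sentence
```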

3.1.4 Sentence Level Attention

The model uses a sentence-level attention mechanism to reduce the impact of mislabeled sentences on relation prediction. Suppose the set of m sentences containing the same entity pair is \(S=\left\{ s_{1}, s_{2}, \ldots , s_{m}\right\} \). The sentence-level attention mechanism assigns a weight to each sentence in S: sentences highly correlated with the relation vector receive higher weights, and vice versa. The weighted sum of the sentence feature vectors then gives the feature vector of S, as shown in (6).

$$\begin{aligned} R=\sum _{j=1}^{m} \omega _{j} P^{*}(\gamma )_{j} \end{aligned}$$
(6)

where \(\omega _{j}\) is the weight of each sentence in S relative to the predicted relation and \(\sum _{j} \omega _{j}=1\). The formula for calculating \(\omega _{j}\) is shown in (7).

$$\begin{aligned} \omega _{j}={\text {softmax}}\left( P^{*}(\gamma )_{j} A v_{r}\right) \end{aligned}$$
(7)

where A is the randomly initialized weighted diagonal matrix and \(v_{r}\) is the vector representation of the relation r.
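Eqs. (6) and (7) can be sketched as follows. Representing the diagonal matrix A by its diagonal vector is our implementation choice, and names such as SentenceAttention and rel_emb are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceAttention(nn.Module):
    """Bag representation R = sum_j w_j * P*(gamma)_j, Eqs. (6)-(7)."""
    def __init__(self, feat_dim, num_relations):
        super().__init__()
        self.A = nn.Parameter(torch.randn(feat_dim))          # diagonal of the weighting matrix A
        self.rel_emb = nn.Embedding(num_relations, feat_dim)  # relation vectors v_r

    def forward(self, sent_feats, rel_id):
        # sent_feats: (m, feat_dim) features of the m sentences in one bag; rel_id: scalar LongTensor
        v_r = self.rel_emb(rel_id)                            # (feat_dim,)
        scores = (sent_feats * self.A) @ v_r                  # P*_j A v_r for each sentence
        w = F.softmax(scores, dim=0)                          # attention weights, sum to 1
        return (w.unsqueeze(-1) * sent_feats).sum(dim=0)      # bag feature vector R
```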

3.2 Trunk Branch

The structure of the trunk branch is shown in Fig. 1a, where FC is a fully connected classifier used to predict the relation of entity pairs, and its output is the probability distribution of the relation category of entity pairs.

The goal of the trunk branch is to train the model parameters so that the model achieves better performance; in general, a well-performing model has a smaller loss. The loss measures the inconsistency between the model prediction f(x) and the label y. Cross-entropy is a common loss function, given in (8).

$$\begin{aligned} l(f(x), y)=-\frac{1}{n} \sum _{i=1}^{n} y_{i} \log \left( f\left( x_{i}\right) \right) \end{aligned}$$
(8)

Here, n represents the batch size and \(y_{i}\) represents the label corresponding to the sentence \(x_{i}\). Because of noisy labels, we introduce corrected labels, generated by the label self-learning branch, into training as supervision signals. Since the correctness of both the original noisy label and the corrected label is unknown, we use an adaptive weight-update method to generate the weights in the joint loss; the model loss function is defined as:

$$\begin{aligned} l_{\text{ total } }=\frac{1}{2 \alpha ^{2}} l( f(x), y)+\frac{1}{2 \beta ^{2}} l\left( f(x), y^{*}\right) +\log (\alpha * \beta ) \end{aligned}$$
(9)

where l is the loss computed by Eq. (8), y is the noisy label, and \(y^{*}\) is the corrected label. \(\alpha \) and \(\beta \) are the weights of the two losses in Eq. (9) and are learned directly by the network. Since \(l_{total}\) is to be minimized, the network tends to make \(\alpha \) and \(\beta \) as large as possible; to prevent this degeneration, \(\log (\alpha * \beta )\) is used as a regularization term that limits the unbounded growth of \(\alpha \) and \(\beta \).
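The joint loss of Eq. (9) can be implemented as below. Parameterizing the weights through their logarithms to keep them positive is our choice; the paper only states that \(\alpha \) and \(\beta \) are learned directly by the network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLoss(nn.Module):
    """l_total = l(f(x), y) / (2 alpha^2) + l(f(x), y*) / (2 beta^2) + log(alpha * beta), Eq. (9)."""
    def __init__(self):
        super().__init__()
        # learn log(alpha) and log(beta) so that the weights stay positive
        self.log_alpha = nn.Parameter(torch.zeros(1))
        self.log_beta = nn.Parameter(torch.zeros(1))

    def forward(self, logits, y_noisy, y_corrected):
        l_noisy = F.cross_entropy(logits, y_noisy)      # loss against the distant-supervision label y
        l_corr = F.cross_entropy(logits, y_corrected)   # loss against the corrected label y*
        alpha, beta = self.log_alpha.exp(), self.log_beta.exp()
        return (l_noisy / (2 * alpha ** 2)
                + l_corr / (2 * beta ** 2)
                + torch.log(alpha * beta))              # regularization term
```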

3.3 Label Self-Learning Branch

This branch follows reference [24], which introduces self-supervised learning into the training process. Through self-supervised learning, a corrected label is generated for each sentence in the training set; the corrected labels provide supervision signals during training to reduce the impact of noisy labels on model performance. The branch consists of the following two parts.

3.3.1 Class Prototype Selection

Class prototypes are representative training samples of each class. The similarity between a sample and the class prototypes can be used to judge whether the sample's label is correct. Since a single prototype has limited ability to describe a class, the model selects multiple prototypes for each class to enrich its textual features.

First, we use the feature extractor E to map sample features into a low-dimensional space. We then compute the cosine similarity \(S_{i j}\) between every pair of samples and construct the similarity matrix \(S \in R^{m \times m}\), where m is the number of samples with label c randomly selected from the training set and \(S_{i j} \in S\). The calculation of \(S_{i j}\) is shown in (10).

$$\begin{aligned} S_{i j}=\frac{E\left( x_{i}\right) ^{T} E\left( x_{j}\right) }{\left\| E\left( x_{i}\right) \right\| _{2}\left\| E\left( x_{j}\right) \right\| _{2}} \end{aligned}$$
(10)

where \(E\left( x_{i}\right) \) represents the feature of sentence \(x_{i}\), and \(S_{i j}\) represents the similarity between sentences \(x_{i}\) and \(x_{j}\); a larger \(S_{i j}\) indicates a higher similarity. To select appropriate prototypes, the model introduces a density \(p_{i}\) for each sentence \(x_{i}\) as the selection basis. In a noisy dataset, the larger the density \(p_{i}\) of sentence \(x_{i}\), the higher the probability that its label is correct, so sentences with higher density are selected as class prototypes. The density is calculated as in (11).

$$\begin{aligned} p_{i}=\sum _{j=1}^{m} {\text {sign}}\left( S_{i j}-C\right) \end{aligned}$$
(11)

where sign(x) is the sign function and C is a constant that can take any value in S; since we only compare the relative densities of sentences, the value of C has no effect on prototype selection. To prevent the multiple class prototypes from degenerating into a single prototype, we introduce a second criterion \(\delta \), so that the selected prototypes not only have high density but are also clearly distinct from each other. \(\delta \) is defined in (12).

$$\begin{aligned} \delta _{i}=\left\{ \begin{array}{c} \min _{j} S_{i j}, \quad p_{i}=p_{\max } \\ \max _{j, p_{j}>p_{i}} S_{i j}, \quad p_{i}<p_{\max } \end{array}\right. \end{aligned}$$
(12)

where \(p_{\max }=\max \left\{ p_{1}, \ldots , p_{m}\right\} \).
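A sketch of the prototype selection of Eqs. (10)-(12) is given below. The final ranking rule combining the density and \(\delta \) (here, density minus \(\delta \)) is our assumption, since the paper states only that prototypes should have both high density and distinct characteristics; the default C = 0.5 is likewise illustrative.

```python
import numpy as np

def select_prototypes(feats, num_prototypes, C=0.5):
    """Pick class prototypes for one relation class from m sampled sentence features.

    feats: (m, dim) array of E(x_i); C is the density threshold of Eq. (11).
    Returns indices of the sentences chosen as prototypes (Eqs. 10-12).
    """
    norm = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    S = norm @ norm.T                                    # cosine similarity matrix, Eq. (10)
    density = np.sign(S - C).sum(axis=1)                 # p_i, Eq. (11)

    m = len(feats)
    delta = np.empty(m)
    p_max = density.max()
    for i in range(m):
        if density[i] == p_max:
            delta[i] = S[i].min()                        # Eq. (12), highest-density case
        else:
            delta[i] = S[i][density > density[i]].max()  # Eq. (12), general case
    # assumed ranking: high density and low delta (dense yet mutually distinct)
    score = density - delta
    return np.argsort(-score)[:num_prototypes]
```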

3.3.2 Corrected Label Generation

In this paper, we obtain the corrected label by comparing the feature similarity between a sentence x and the class prototypes. Let the class prototype feature set be \(\left\{ E\left( X_{1}\right) , \ldots , E\left( X_{t}\right) , \ldots , E\left( X_{K}\right) \right\} \), where \(X_{t}=\left\{ x_{t 1}, x_{t 2}, \ldots , x_{t r}\right\} \) is the set of class prototypes selected for class t and r is the number of prototypes per class. Given a sentence x, its similarity to the prototypes of class t is computed as follows:

$$\begin{aligned} \partial _{t}=\frac{1}{r} \sum _{l=1}^{r} \cos \left( E(x), E\left( x_{t l}\right) \right) , \quad t=1 \ldots K \end{aligned}$$
(13)

where \(E\left( x_{t l}\right) \) denotes the l-th class prototype of class t. The corrected label is then generated as follows.

$$\begin{aligned} y^{*}={\text {argmax}}_{t}\; \partial _{t}, \quad t=1 \ldots K \end{aligned}$$
(14)

The obtained corrected label \(y^{*}\) is added to the training of the model as a complementary supervision signal.
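Eqs. (13) and (14) then reduce to an average cosine similarity followed by an argmax, as in the following sketch (function and variable names are ours).

```python
import numpy as np

def corrected_label(x_feat, prototype_feats):
    """Return y* = argmax_t of the mean cosine similarity to the prototypes of class t.

    x_feat: (dim,) feature E(x) of one sentence.
    prototype_feats: list of K arrays, each of shape (r, dim), the prototype features of one class.
    Implements Eqs. (13)-(14).
    """
    x = x_feat / np.linalg.norm(x_feat)
    sims = []
    for protos in prototype_feats:
        p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
        sims.append(float((p @ x).mean()))   # average cosine similarity to class t, Eq. (13)
    return int(np.argmax(sims))              # corrected label y*, Eq. (14)
```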

3.4 Model Training

The model extracts features through neural networks, and the network parameters are shared between the training stage and the label correction stage. In the training stage, the joint loss guides the optimization of the model parameters; it consists of \(l\left( f(x), y \right) \), computed with the distant supervision label, and \(l\left( f(x), y^{*}\right) \), computed with the corrected label. The training procedure is given in Algorithm 1.

Algorithm 1 Model training
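A high-level sketch of this training procedure is given below; the per-epoch refresh of the prototypes and the helpers sample_class_feats, extract_features, and classify are assumptions standing in for steps that Algorithm 1 specifies precisely. select_prototypes, corrected_label, and JointLoss refer to the sketches above.

```python
import torch

def train(model, joint_loss, loader, optimizer, num_epochs, num_classes, r):
    """Sketch of the training loop of Sects. 3.2-3.4 (the authoritative steps are in Algorithm 1)."""
    for epoch in range(num_epochs):
        # label self-learning branch: refresh class prototypes (once per epoch, assumed)
        prototypes = []
        for c in range(num_classes):
            class_feats = sample_class_feats(model, loader, c)            # (m, dim) numpy array
            prototypes.append(class_feats[select_prototypes(class_feats, r)])
        # trunk branch: optimize with the joint loss of Eq. (9)
        for sents, y_noisy in loader:
            feats = model.extract_features(sents)                         # shared extractor E
            y_star = torch.tensor([corrected_label(f, prototypes)
                                   for f in feats.detach().cpu().numpy()])
            loss = joint_loss(model.classify(feats), y_noisy, y_star)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```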

4 Experiment

4.1 Hyperparameter Settings

We follow the experimental parameters of reference [4]; the hyperparameter settings of the neural network are shown in Table 1.

Table 1 Hyperparameter settings

4.2 Dataset and Evaluation Metrics

Our experiments use the FreeBase + NYT dataset and the Wiki-KBP dataset. The FreeBase + NYT dataset was proposed by Riedel et al. [2] in 2010 and has served as a benchmark for distantly supervised relation extraction. It is generated by aligning the FreeBase knowledge base with New York Times text through distant supervision and is divided into training and test data, with 522,611 training sentences and 172,448 test sentences covering 53 relation types (NA means the two entities have no relation). The Wiki-KBP dataset, proposed by Ling et al. [26] in 2012, is generated by aligning FreeBase with Wikipedia articles; it contains 23,111 training sentences and 15,874 test sentences covering 7 relation types (NA means the two entities have no relation).

In this paper, we adopt the same evaluation criteria as references [10] and [4]: the model is evaluated with the precision-recall (P-R) curve, the area under the P-R curve (AUC), and the top-N precision (P@N). Precision and recall are computed by Eqs. (15) and (16), respectively, and P@N by Eq. (17).

$$\begin{aligned} precision= & {} \frac{right}{ all } \end{aligned}$$
(15)
$$\begin{aligned} recall= & {} \frac{ right }{test } \end{aligned}$$
(16)
$$\begin{aligned} P @ N= & {} \frac{ right_{N}}{N} \end{aligned}$$
(17)

where right denotes the number of correctly predicted relation instances in the output; all denotes the total number of relation instances in the output; test denotes the number of relation instances in the test set; P@N denotes the precision of the top N instances after sorting the predicted relation instances by prediction probability; and \(right_{N}\) denotes the number of correctly classified instances among the top N.
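For example, P@N in Eq. (17) can be computed as follows; this is a small self-contained sketch, and the variable names are ours.

```python
import numpy as np

def precision_at_n(scores, correct, n):
    """P@N: precision of the N predictions with the highest confidence, Eq. (17).

    scores: predicted probabilities of the extracted relation instances;
    correct: boolean array, True where the predicted relation matches the gold label.
    """
    top = np.argsort(-np.asarray(scores))[:n]     # indices of the N most confident predictions
    return np.asarray(correct)[top].mean()        # fraction of them that are correct
```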

4.3 Experiment Results

In this paper, we select the following methods as the baseline comparison methods:

(1) CNN-ATT was proposed by Lin et al. [4], which uses a sentence-level attention mechanism to learn sentence weights; convolutional and piecewise convolutional neural networks are used for feature extraction.

(2) PCNN-ONE + soft-label and PCNN-ATT + soft-label were proposed by Liu et al. [8] in 2017. Through a joint scoring function, the models combine the relation score of the entity-pair representation with the confidence of the noisy label to obtain corrected labels for specific entity pairs, and the corrected labels are used to improve relation extraction performance. PCNN-ONE + soft-label is based on the at-least-one assumption, and PCNN-ATT + soft-label is based on selective attention.

(3) GPCNNs was proposed by Li et al. [9] in 2020. GPCNNs integrated gating mechanism and soft label, and adopted sentence-level attention mechanism to improve model performance.

(4) RERAN was proposed by Zheng et al. [7] in 2022. The model adopted residual structure and attention mechanism to strengthen the weight of keywords while ensuring the transmission of semantic information.

The above models are compared with the model proposed in this paper; the resulting P-R curves are shown in Fig. 4.

Fig. 4 P-R curves of different algorithms on the FreeBase + NYT dataset

Figure 4 compares the P-R curves of the RASL model with those of the other models on the FreeBase + NYT dataset, showing that the overall performance of RASL is better than that of the other baseline models.

Table 2 reports the P@N results of the RASL model and the other models on the FreeBase + NYT dataset. The results show that the average precision of RASL improves by about 5% to 12% over PCNN-ONE + soft-label, PCNN-ATT + soft-label, and GPCNNs, which also use corrected labels, and by about 3% over RERAN, which also uses the residual attention mechanism.

Table 2 P@N evaluation results of different models on the FreeBase + NYT dataset

The P-R curves of RASL and the other models on the Wiki-KBP dataset are shown in Fig. 5, and the P@N results are shown in Table 3; both show that RASL outperforms the other comparison models.

Fig. 5 P-R curves of different algorithms on the Wiki-KBP dataset

Table 3 P@N evaluation results of different models on the Wiki-KBP dataset

From the P-R curves and P@N results on the two datasets, the following conclusions can be drawn.

The performance of RASL is better than that of PCNN-ONE + soft-label, PCNN-ATT + soft-label, and GPCNNs, indicating that the combination of residual attention and the self-learning strategy fully retains and uses the labeled data while learning more effective features, which yields better relation extraction results.

RASL also outperforms RERAN, indicating that the corrected labels generated by the self-learning strategy provide richer supervision information and effectively alleviate the impact of noisy labels on model performance.

To verify the effectiveness of the self-learning strategy for distantly supervised relation extraction, we apply it to the mainstream baseline model CNN_ATT and run the same experiment (denoted CNN_ATT_S); the results are shown in Fig. 6.

Fig. 6 Experimental results of P-R comparison of CNN_ATT model with self-learning strategy

In Fig. 6, CNN_ATT_S denotes the CNN_ATT model with the self-learning strategy; the experiments use the FreeBase + NYT dataset. Compared with the baseline, the model with the self-learning strategy achieves better performance. Meanwhile, CNN_ATT_S still performs worse than RASL, indicating that combining residual attention with the self-learning strategy is more conducive to improving relation extraction performance.

To further illustrate that the corrected labels generated by the self-learning strategy provide effective supervision information for model training, we sample the corrected labels generated during RASL training, as shown in Table 4.

As can be seen from Table 4, the RASL model assigns the label /people/person/place_of_birth to the first sentence, indicating that even for infrequent relations the model can identify related instances and generate corresponding corrected labels, reducing the impact of noisy labels on relation extraction performance. The RASL model generates the corrected label NA for the second sentence, which likewise reduces the influence of noisy annotation on relation extraction performance.

Table 4 Example of corrected labels generated by RASL model

5 Conclusion

In this paper, we introduce the idea of self-learning into distantly supervised relation extraction: a clustering method generates corrected labels for the noisy training data, and these corrected labels provide supervision for model training to prevent erroneous parameter updates caused by noisy labels. The experimental results show that the average precision of the proposed RASL model is 82.4%, an improvement of about 5% to 12% over PCNN-ONE + soft-label, PCNN-ATT + soft-label, and GPCNNs, which also adopt corrected labels, and of about 3% over RERAN, which also uses the residual attention mechanism.

In future work, we will focus on feature extraction and corrected-label recognition. In relation extraction, the extracted features should cover as much semantic information as possible, so introducing Transformer-based encoders is one major direction. Another direction is to use generative adversarial networks, in which two sub-models compete, to improve the precision with which the model identifies correct labels.