1 Introduction

A knowledge graph (KG) is a large-scale semantic network used to describe the relationships between entities or concepts. KGs have been widely applied in Natural Language Processing (NLP) [1], improving the accuracy and efficiency of many tasks. In recommendation systems [2], for instance, matching users' historical behaviors against entities in the knowledge graph allows relevant entities or information to be recommended accurately. In question answering systems [3], KGs help identify and extract the entities and relations relevant to a question and generate answers from their semantic information. In information retrieval tasks [4], KGs can expand query semantics by supplying the entity relationships associated with the query terms.

However, most KGs suffer from incompleteness [5], lacking a large number of objectively true triples [6, 7]. This incompleteness creates a risk of missing critical information, which can hinder the understanding and inference of entity relationships and lead to incorrect answers to system queries. It can also cause certain information to be overlooked or overemphasized, introducing biases and errors that further degrade the accuracy of analysis and decision-making. Continuous refinement is therefore necessary to improve the completeness of existing KGs. However, given the explosive growth of data, manual construction does not scale. It is therefore crucial to use automatic KG completion methods [8], which draw on machine learning and deep learning techniques to enhance the usefulness of KGs across fields. This has far-reaching implications for the development of internet technology.

A graph is a structure equivalent to a set of objects in which some pairs of objects are related in some sense [9]. Early researchers represented graph elements as low-dimensional vectors by exploiting the structured knowledge in KGs, and then predicted the relations between entities by computing the spatial information of the elements in each triple. A triple is a structured statement with defined rules and rich semantic information, so some recent studies have exploited the textual semantics of entities and relations in KGs to enrich knowledge representations. However, these two mainstream lines of knowledge completion each use only one aspect of the knowledge, and they are evaluated only on artificially constructed, densely linked datasets, so they fit complex data poorly. This paper aims to combine the complementary advantages of the two inference strategies by exploiting both aspects of the triple. Two parallel BERT [10] models are used to introduce the semantic information of knowledge and accelerate inference, followed by a deep neural network that extracts semantic features and computes triple scores. In addition, a re-ranking scheme introduces triple structure information and reorders promising entities from the initial ranking. Finally, the effectiveness of the proposed model is demonstrated on a standard dataset and two sparsely connected datasets constructed in-house. The main contributions of this paper are summarized as follows:

  • To improve the computational efficiency of the model, this study proposes a dual-stream BERT embedding method that splits a complete triple into two parts: the head entity and relation, and the tail entity. Two parallel BERT models are utilized to perform semantic embedding of the knowledge, and a candidate set of entities is constructed to avoid redundant embedding and save computation time, while retaining some contextual information of the knowledge.

  • This study developed the TSTR method for KGC, which utilizes pre-trained language models and deep convolutional architecture. The approach demonstrated strong performance across multiple datasets, particularly in challenging environments with sparse connections.

  • In this paper, a re-ranking scheme, TSTR(Ensemble), is developed that combines RotatE [11] and TSTR, applying triple structure information and semantic information at the same time. Realizing the complementary advantages of the two approaches has clear research value. The scheme achieved the best link prediction results on FB15k-237s and WN18RRs, two datasets with even sparser connections.

2 Related Work

In many earlier studies, researchers focused on graph embedding approaches to KGC. Entities and relations in the knowledge graph are mapped into low-dimensional dense vectors, and the spatial relations of triples are used to learn the structural information between entities. Typical graph embedding approaches fall into two subclasses: translation-based approaches and semantic matching approaches. (1) Translation-based approaches regard the relation as a translation from head entity to tail entity. TransE [12] represents each entity and relation as a low-dimensional vector and judges the validity of a triple by computing the distance between the sum of the head entity and relation vectors and the tail entity vector; however, because TransE represents each entity as a single fixed vector, it cannot distinguish different meanings of the same entity. TransR [13] improves performance by mapping entities and relations to different spaces, which helps to better separate the different meanings of an entity. RotatE [11] is an efficient and highly scalable model that predicts the tail entity by applying a complex rotation to the head entity and relation vectors, and it can effectively handle symmetric and antisymmetric relations. HAKE [14] uses hierarchical embeddings to represent the hierarchical structure of entities and relations, which exploits the hierarchy in the knowledge graph and improves model performance. (2) Among semantic matching approaches, DistMult [15] is a tensor decomposition-based model with few parameters and high computational efficiency, but because its scoring function is symmetric it cannot model antisymmetric relations. ComplEx [16] represents entities and relations as complex vectors; compared with DistMult, it can effectively handle symmetric and antisymmetric relations, but it requires more parameters and computational resources. TuckER [17] embeds entities and relations in a three-dimensional tensor and decomposes it into a product of low-rank matrices; it achieves excellent performance on several KG reasoning tasks with faster training and fewer parameters than competing models, but it performs poorly on large-scale KGs. Although these approaches successfully exploit the structural information of triples, they completely ignore the semantic and contextual information of knowledge [18], so entities and relations not seen during training cannot be evaluated [19]. These shortcomings seriously weaken their prediction quality. Because graph embedding approaches learn structural knowledge by explicitly modeling the triple itself, this family of methods is not abandoned in this paper: in the re-ranking stage, a reordering model is built on RotatE.

To increase network depth and improve parameter utilization, Dettmers et al. proposed ConvE [20], which combines the head entity and relation, convolves them with multiple filters, and then takes a dot product with the tail entity vector to obtain a score for judging the authenticity of a triple. ConvE was the first application of a CNN [21] in KGC research. However, CNNs have drawbacks such as limited spatial invariance and low coding efficiency [22]. CapsE [23] was therefore proposed to replace the CNN with a capsule network, which benefits KG reasoning tasks and improves triple classification accuracy. DSKG [24] adopts a sampling method specifically designed for training on KGs, using a two-layer model that processes entities and relations with different convolutional neural units to enhance entity representations; it achieves high accuracy on specific datasets. In machine vision, ResNet [25] attracted extensive discussion as soon as it was proposed. Compared with traditional convolutional neural networks (CNNs), the performance of ResNet does not degrade as network depth increases, a property that also makes ResNet suitable for modeling complex relations in KGs.

The development of pre-trained language models such as ELMo [26], BERT [10], RoBERTa [27], and XLNet [28] has significantly improved performance on many NLP tasks. BERT [10], built on the Transformer architecture, has been applied in various domains with remarkable results. For instance, Lin et al. [29] achieved state-of-the-art text classification results by combining BERT with a GCN. Song et al. [30] employed BERT for dialogue generation with great success. Le et al. [31] combined BERT with a 2D CNN for DNA sequence recognition and obtained remarkable results in comparative experiments. BERT is pre-trained with the masked language modeling (MLM) and next sentence prediction (NSP) tasks: MLM masks tokens and predicts their values, while NSP predicts whether two input sentences are consecutive. These pre-training tasks give BERT strong representation capabilities. Embedding knowledge contextually in KGs with BERT yields deep semantic and contextual information, so exploring its effect on KGC tasks is essential.

In recent studies, researchers have applied pre-trained models to knowledge graph completion, such as KG-BERT [32]. Its authors feed the entire triple into BERT, making it capable of KGC and achieving state-of-the-art results on multiple datasets. However, the biggest problem with KG-BERT is its high computational cost, especially in the knowledge embedding stage: inputting the entire triple causes the same entities and relations to be embedded repeatedly, wasting substantial time. KEPLER [33] also builds on BERT to combine factual knowledge with language representation, enhancing the representation of knowledge, but like KG-BERT it suffers from computational efficiency issues. In this work, to avoid such costs, each triple is divided into two parts and two parallel BERT models are used for knowledge embedding, while a candidate entity set is built to avoid embedding the same entity or relation more than once. A deep convolutional architecture is then applied for feature extraction. This design greatly improves computational efficiency, and the benefit grows as the number of entities increases, making the method suitable for large-scale KGs.

3 Method

This section first presents the knowledge completion approach TSTR proposed in this paper, as shown in Fig. 1. Its three components are then introduced in detail: (1) dual-stream knowledge embedding using BERT; (2) the deep convolutional architecture and scoring function; (3) the re-ranking scheme that integrates the semantic and structural information of triples.

Fig. 1

The KGC method TSTR uses two parallel BERT models for knowledge embedding and ResNet to capture the deep semantic features of knowledge; the extracted features are fed into a fully connected layer with a softmax on top to obtain the initial triple score. Finally, the top K entities are re-ranked by applying RotatE [11] in combination with the preceding deep convolutional architecture

3.1 Overview of TSTR

In the embedding stage, two parallel BERT [10] models embed the two parts of the knowledge: the head entity with its relation, and the tail entity. To address the low complexity and poor performance of existing models, a deep convolutional network is built on top of an existing architecture to fully extract the deep semantic information of the triple, which improves performance when dealing with sparse connections. Finally, considering the importance of triple structure information in knowledge graph completion, an ensemble model is designed to re-rank triples, scoring the top K entities together with the deep convolutional architecture; a variable parameter assigns the weights of the two models.

3.2 Text Knowledge Representations

BERT is a large-scale pre-trained language model with a very large parameter count; for example, BERT-base contains 110 M parameters and BERT-large an astonishing 340 M, so a huge computational cost is inevitable. KG-BERT [32] was the first application of BERT to knowledge graph completion. Although BERT's power has produced very good results in many experiments, this success does not extend to practical applications: the biggest problem with KG-BERT is its excessive computational cost, especially in link prediction evaluation. Reference [34] reports that the inference time of KG-BERT using only the small-scale BERT-base on the WN18RR test set is as long as 32 h. Because KG-BERT uses the semantic information of knowledge, it must replace the original entities or relations with their external textual descriptions. In practice, the same entities or relations in a KG are reused many times, so the corresponding text replacement and embedding computation are repeated as well, which wastes a great deal of computing resources.

Analysis of KG-BERT shows that, although the scale of BERT itself cannot be reduced directly, avoiding repeated text replacement and embedding of the same entity also saves computing cost. Inspired by the 1-N scoring in ConvE [20], each triple is divided into two parts for embedding and a candidate entity set is constructed; the parts are recombined into a complete triple only at evaluation time, which resolves the problem of excessive computational cost in this paper.

A triple is a structured sentence that contains rich semantic and contextual information. In the embedding phase, the goal is to fully integrate this semantic information into the embedding vectors. In a KG, the number of relations is much smaller than the number of entities, relation semantics are more distinctive than entity semantics, and relation descriptions are usually not expanded. Therefore, this study incorporates the relation into the embedding of the head entity. Although this sacrifices the independence of the relation's semantic information, it preserves the contextual information between entity and relation. A triple \((h,r,t)\) is split into two parts, the head entity with its relation \((h,r)\) and the tail entity \(t\), and the entities and relations are represented by their names or descriptions. Following the knowledge embedding approach of KG-BERT, the embedding process is written as:

$$h^{\prime} = [[{\text{CLS}}],{\text{Tok}}_{i}^{h} ,{\text{[SEP}}],{\text{Tok}}_{i}^{r} ,{\text{[SEP}}]],$$
(1)
$$e^{h} = {\text{dropout(BERT - Embedding(}}h^{\prime}{)),}$$
(2)

where token \([{\text{CLS}}]\) is always the first token of the input sequence and token \([{\text{SEP}}]\) is used to separate different elements; both are special tokens defined in BERT. \({\text{BERT - Embedding(}} \cdot {)}\) denotes contextualized embedding using BERT, whose hidden output at the \({\text{[CLS}}]\) token is collected, and \({\text{dropout(}} \cdot {)}\) applies dropout to it. The length of \(e^{h}\) is \(d\) with one channel.

Similarly, the text embedding of tail entity \(t\) is represented as:

$$e^{t} = {\text{dropout(BERT - Embedding(}}t{)),}$$
(3)
$$where,t = [[{\text{CLS}}],{\text{Tok}}_{i}^{t} ,{\text{[SEP}}]].$$
(4)
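For concreteness, the sketch below illustrates the dual-stream embedding of Eqs. (1)-(4) with the HuggingFace transformers library; the model checkpoints, example texts and helper names (encode_pair, encode_tail) are illustrative assumptions rather than the authors' released code.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_hr = BertModel.from_pretrained("bert-base-uncased")   # stream for (h, r)
bert_t = BertModel.from_pretrained("bert-base-uncased")    # stream for t
dropout = torch.nn.Dropout(p=0.1)

def encode_pair(head_text: str, rel_text: str) -> torch.Tensor:
    """Embed (h, r) as [CLS] Tok^h [SEP] Tok^r [SEP], cf. Eqs. (1)-(2)."""
    enc = tokenizer(head_text, rel_text, return_tensors="pt")
    out = bert_hr(**enc)
    return dropout(out.last_hidden_state[:, 0])  # hidden state at [CLS], shape (1, d)

def encode_tail(tail_text: str) -> torch.Tensor:
    """Embed t as [CLS] Tok^t [SEP], cf. Eqs. (3)-(4)."""
    enc = tokenizer(tail_text, return_tensors="pt")
    out = bert_t(**enc)
    return dropout(out.last_hidden_state[:, 0])

e_h = encode_pair("barack obama", "place of birth")  # hypothetical entity/relation texts
e_t = encode_tail("honolulu")
```

In this reading, the tail-entity stream would be run once per entity and cached in the candidate set, so repeated entities and relations are not re-embedded.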

3.3 Deep Convolutional Architecture

Compared with fully connected neural networks, a CNN [21] captures complex relationships by learning nonlinear features with far fewer parameters. Considering that a CNN can increase model complexity without adding parameters, researchers proposed ConvE [20], the first approach to apply a CNN to KGC tasks. Notably, ConvE is limited by its network depth and is not strong at abstracting triples in complex environments.

Inspired by recent work in machine vision, ResNet [25] is applied in this KGC research. ResNet solves, through residual learning, the degradation problem that appears as network depth increases. For a stack of layers with input \(x\), if the desired mapping is \(H(x)\), the layers instead learn the residual \(F(x) = H(x) - x\). When the residual \(F(x) = 0\), the stacked layers simply perform an identity mapping, ensuring that network performance does not degrade. In practice the residual is not zero, so the stacked layers learn new features on top of the input features and thus achieve better performance. Exploiting this property of ResNet, a deeper network can be trained to fully express the semantic information of knowledge.

The head entity embedding \(e^{h} \in {\mathbb{R}}^{1 \times d}\) and the tail entity embedding \(e^{t} \in {\mathbb{R}}^{1 \times d}\) are concatenated along one dimension:

$$v = [e^{h} :e^{t} ],v \in {\mathbb{R}}^{2 \times d}$$
(5)

\(2n^{2}\) convolutions of size 1 are slid along the length of the feature vector \(v \in {\mathbb{R}}^{2 \times d}\) to produce \(2n^{2}\) feature maps; each feature map \(m_{i}\) is then projected onto a 2D plane of size \(2n \times n\), so \(M \in {\mathbb{R}}^{2n \times n}\) represents a complete feature map of the triple and can be fed directly into ResNet.


\(N\) bottleneck blocks are applied to the feature map, where \(N\) takes different values for different datasets. As shown in Fig. 2, every bottleneck block consists of a 1 × 1 convolution, a 3 × 3 convolution and another 1 × 1 convolution, where the 1 × 1 layers reduce and then restore the dimensions, leaving the 3 × 3 layer as a bottleneck with fewer parameters [25]. This design increases model depth without significantly increasing its parameters, improving the model's ability to fit complex relationships in the training data.

Fig. 2

A deeper residual function: \(F(x)\) represents the application of the bottleneck convolutions, and the output of the bottleneck block is \(H(x) = F(x) + x\). ReLU, the rectified linear activation function, is applied before each convolutional layer. Explanation: the drawing follows reference [25]
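A minimal sketch of one bottleneck block as described above and in Fig. 2 (1 × 1 reduce, 3 × 3, 1 × 1 restore, ReLU before each convolution, identity shortcut); channel widths and the number of stacked blocks are illustrative, and batch normalization is omitted for brevity.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Pre-activation bottleneck: 1x1 reduce -> 3x3 -> 1x1 restore, plus identity."""
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)
        self.conv2 = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False)
        self.conv3 = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(self.relu(x))   # ReLU before each convolution, as in Fig. 2
        out = self.conv2(self.relu(out))
        out = self.conv3(self.relu(out))
        return out + x                   # H(x) = F(x) + x

# N stacked bottleneck blocks over the triple feature map (channel sizes assumed)
blocks = nn.Sequential(*[Bottleneck(channels=64, reduced=16) for _ in range(3)])
```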

The calculation of triple scores is similar to KG-BERT(a). The feature map \(M^{\prime}\) generated by the deep convolution contains the semantic relationship between the two parts of the triple; it is reduced to a one-dimensional vector \(C\) by max pooling and flattening. Then a multilayer perceptron with a softmax on top yields the probability that each triple is positive or negative. The scoring function is:

$$p = {\text{sigmoid}}({\text{MLP}}(CW^{T} )),$$
(6)

where \({\text{MLP}}( \cdot )\) denotes a multilayer perceptron and \(W\) is a learnable parameter matrix.
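A compact sketch of the scoring head of Eq. (6): the deep feature map \(M^{\prime}\) is max-pooled, flattened into a vector \(C\), and passed through an MLP whose output is squashed to a probability. Pool size, hidden width and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ScoringHead(nn.Module):
    """Pool and flatten the deep feature map M', then score the triple (cf. Eq. 6)."""
    def __init__(self, in_features: int, hidden: int = 256):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d((4, 4))   # max pooling over M'
        self.mlp = nn.Sequential(
            nn.Flatten(),                          # feature map -> vector C
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(self.pool(feature_map)))  # p in (0, 1)

# e.g. a batch of 8 feature maps with 64 channels (shapes assumed for illustration)
head = ScoringHead(in_features=64 * 4 * 4)
p = head(torch.randn(8, 64, 12, 6))   # shape (8, 1)
```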

Since the evaluation of triples is cast as a binary classification problem of judging whether a triple is a positive example, binary cross-entropy is chosen as the loss function, described as:

$$L = - \sum\limits_{{T \in tp \cup tp^{\prime}}} {(y_{T} \log (p^{\prime}) + (1 - y_{T} )\log (p))} ,$$
(7)

where \(T\) is a triple \((h,r,t)\) and \(y_{T}\) is the label of \(T\); \(tp\) denotes the positive triple set and \(tp^{\prime}\) the negative triple set; \(p,p^{\prime} \in [0,1]\) and \(p + p^{\prime} = 1\).
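As a minimal illustration of the objective in Eq. (7), the binary cross-entropy over a small batch of positive and corrupted triples can be computed directly; the example labels, scores and the standard label-index convention are illustrative.

```python
import torch
import torch.nn.functional as F

# y_T = 1 for triples from tp (positives), 0 for corrupted triples from tp'.
labels = torch.tensor([1.0, 0.0, 0.0, 1.0])
# p: probability that each triple is positive, as output by the scoring head.
scores = torch.tensor([0.9, 0.2, 0.4, 0.7])
loss = F.binary_cross_entropy(scores, labels)  # standard BCE form of Eq. (7)
print(loss.item())
```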

3.4 Re-ranking Ensemble Scheme

A triple is a structured statement. Although TSTR makes full use of the textual information of a triple to avoid ranking the correct triple too low, it does not model the triple itself and therefore does not exploit its structural information. As a result, entity ambiguity arises and it is difficult to rank entities accurately. To address this, a re-ranking KGC approach, TSTR(Ensemble), is developed by combining an existing graph embedding approach with TSTR. Introducing triple structure information at the evaluation stage, this simple combination proves very effective in the experiments. The process can be understood as using the triple structure information to correct the output of TSTR.

TSTR is the KG completion model proposed in this paper, which relies on semantic information for knowledge inference to address sparse connections. RotatE is a knowledge representation learning model that maps knowledge onto the complex plane through rotation operations, thereby modeling the information encoded in triple structures; it has been shown to offer high predictive accuracy and scalability on KGs. This study connects the trained TSTR and RotatE models in parallel to build the TSTR(Ensemble) model, integrating the two sources of information through simple weighted fusion. During testing, a weight parameter \(y\) balances the two models, which score the input data separately. Softmax then maps the scores of both models to a common range, and the final predicted score is the weighted sum of the two scores. This approach addresses the sparse-connection problem in KGs and achieves high predictive accuracy (Table 1).

Table 1 Statistics of all datasets we used

Given an incomplete triple, e.g., \((h,r,?)\), the goal of training is to find a variable parameter \(y\) that adjusts the weights of the two models so that the correct candidate entity receives the highest possible score. The top K entities in the initial ranking are sent to the combination of TSTR and RotatE [11] for reordering, and a variable parameter adjusts the weight of the two models. TSTR(Ensemble) can be written as:

$$s_{Ensemble} = yp + (1 - y)d,y \in [0,1],$$
(8)

where \(p\) and \(d\) represent the entity's score under TSTR and under RotatE, respectively. The value of \(y\) is set to 0 when the predicted entity does not appear during training; otherwise \(y\) takes the predefined value.
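A sketch of the re-ranking rule in Eq. (8), following the description above (softmax normalization of both models' scores, \(y = 0.65\) as selected in Fig. 3, and \(y = 0\) for entities unseen in training); the value of K and the example scores are made up for illustration.

```python
import torch

def ensemble_score(p: torch.Tensor, d: torch.Tensor, y: float = 0.65,
                   seen_in_training: bool = True) -> torch.Tensor:
    """Weighted re-ranking score of Eq. (8): s = y * p + (1 - y) * d.

    p: TSTR (semantic) scores over the top-K candidate entities.
    d: RotatE (structural) scores over the same candidates.
    y is set to 0 for entities unseen during training, so only RotatE is used.
    """
    if not seen_in_training:
        y = 0.0
    p = torch.softmax(p, dim=-1)   # map both score sets to a comparable range
    d = torch.softmax(d, dim=-1)
    return y * p + (1 - y) * d

# Re-rank K = 5 candidate entities (toy scores)
tstr_scores = torch.tensor([2.1, 0.3, 1.7, 0.9, 0.2])
rotate_scores = torch.tensor([5.0, 4.2, 6.1, 1.0, 0.5])
reranked = torch.argsort(ensemble_score(tstr_scores, rotate_scores), descending=True)
```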

The value of \(y\) directly affects the prediction results of the ensemble model, so a high-quality \(y\) must be obtained through multiple tests. To select the most suitable value, \(y \in \{ 0.4,0.45,0.5,0.55,0.6,0.65,0.7,0.75\}\) was tested on two datasets, FB15k-237 [20] and WN18RR [35]. The experimental results are shown in Fig. 3: the ensemble model obtains the highest MRR when \(y\) is 0.65.

Fig. 3

MRR performance of the ensemble model for different values of \(y\)

4 Experimental Results

This chapter evaluates the performance of TSTR through experiments. Specifically, the experimental objectives include the following points:

  • Validate the effectiveness of the ensemble scheme through link prediction experiments;

  • Evaluate the TSTR model on two sparsely connected datasets, and further verify that the deep convolutional architecture has better robustness under complex data conditions by comparing it with other approaches;

  • Record the training time of each epoch for TSTR and KG-BERT under the same experimental conditions, and use a line graph to show intuitively how much the dual-stream knowledge embedding mechanism proposed in this paper improves TSTR's training efficiency.

4.1 Datasets

As research on KGC continues to grow, many benchmark datasets have been developed, such as WN18RR [35], FB15k [12], FB15k-237 [20] and YAGO3-10 [36]. However, to simplify the task, their creators deliberately introduced dense links that do not reflect real KGs; for example, YAGO3-10 retains only entities with at least 10 relations, and in FB15k each entity has at least 100 relations. In this paper, to simulate the performance of KGC methods on real KGs, some dense connections in FB15k-237 and WN18RR are removed to obtain the sparsely connected datasets FB15k-237s and WN18RRs.

FB15k-237 is drawn from part of the data in Freebase, covering knowledge about humanities, geography, technology and other areas, with 14,541 entities and 237 relation types, and its connections are very dense: each entity in the training set is connected by an average of 6.78 relations. Such dense connections are inconsistent with real KGs, so random down-sampling is used to reduce the density by randomly removing 30% of the connections in the training data. The dataset obtained by down-sampling the training data of FB15k-237 is named FB15k-237s.

As a subset of WN18, WN18RR retains the symmetric, asymmetric and composition relations of the original dataset while removing the inverse relations. It describes the associations between English words and contains 40,943 entities and 11 relations. The same down-sampling approach is used to randomly remove 25% of the connections in the training data, giving the sparse dataset WN18RRs.
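The paper does not give the exact sampling procedure, so the sketch below is one plausible reading of the down-sampling step: each training connection is dropped independently with the stated probability (0.30 for FB15k-237s, 0.25 for WN18RRs); the function name and seed are illustrative.

```python
import random

def downsample_triples(train_triples, remove_frac, seed=0):
    """Randomly remove a fraction of training connections to build a sparser split.

    remove_frac = 0.30 would yield FB15k-237s from FB15k-237,
    remove_frac = 0.25 would yield WN18RRs from WN18RR (assumed per-triple sampling).
    """
    rng = random.Random(seed)
    return [t for t in train_triples if rng.random() >= remove_frac]

# Example usage with toy (h, r, t) triples
train = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r1", "d")]
sparse_train = downsample_triples(train, remove_frac=0.30)
```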

In this paper, extensive experiments are conducted on six datasets: UMLS [20], WN11 [5], FB15k-237, WN18RR, FB15k-237s and WN18RRs.

WN11 and WN18RR are two subsets of WordNet, where each entity is a synset consisting of several words and corresponds to a distinct sense. FB15k-237 is a subset of Freebase that contains many real-world facts. UMLS is a medical-domain dataset that builds a system of medical terminology. FB15k-237s and WN18RRs are sparsely linked datasets down-sampled from FB15k-237 and WN18RR, respectively.

The test sets of UMLS, FB15k-237, FB15k-237s, WN18RR and WN18RRs contain only positive triples \(tp\); error triples \(tp^{\prime}\) are generated by randomly replacing head or tail entities. This process can be written as:

$$\begin{gathered} tp^{\prime} = \{ (h,r,t^{\prime}) \mid t^{\prime} \in E \wedge t^{\prime} \ne t \wedge (h,r,t^{\prime}) \notin G\} \hfill \\ \cup \{ (h^{\prime},r,t) \mid h^{\prime} \in E \wedge h^{\prime} \ne h \wedge (h^{\prime},r,t) \notin G\} . \hfill \\ \end{gathered}$$
(9)

\(E\) represents the set of all entities in the knowledge graph \(G\).
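A minimal sketch of the corruption rule in Eq. (9): the head or tail of each positive triple is replaced by a random entity, and any candidate already present in \(G\) is rejected. The equal head/tail replacement probability and the per_positive count (cf. \(N(tp^{\prime})\) in Table 2) are assumptions for illustration.

```python
import random

def corrupt_triples(positives, entities, known, per_positive=1, seed=0):
    """Generate negative triples tp' per Eq. (9) by replacing the head or the tail
    with a random entity, discarding corruptions that already exist in the graph G."""
    rng = random.Random(seed)
    negatives = []
    for h, r, t in positives:
        for _ in range(per_positive):
            while True:  # resample until a triple outside G is found
                e = rng.choice(entities)
                cand = (h, r, e) if rng.random() < 0.5 else (e, r, t)
                if cand not in known and cand != (h, r, t):
                    negatives.append(cand)
                    break
    return negatives
```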

4.2 Baselines

The TSTR proposed in this paper is compared with the following state-of-the-art KGC approaches: translation-based approaches, i.e., TransE [12], TransH [37], TransR [13] and RotatE [11]; approaches based on semantic matching, i.e., DistMult [15] and ComplEx [16]; and approaches based on deep learning, i.e., ConvE [20], ConvKB [38], CapsE [23] and KG-BERT [32].

TransE [12] is the pioneering work on KG embedding; it treats relations as translations from the head entity to the tail entity in a low-dimensional vector space, but its structure is too simple to handle complex relations. TransH [37] defines a hyperplane for each relation and projects the two entities onto it via a relation-specific mapping matrix, which lets it deal with 1-n, n-1 and n-n relation patterns. The innovation of TransR [13] is to project each relation into an independent space, which alleviates entity ambiguity, although the overall effect is not significantly improved. RotatE [11] takes a different view and regards the relation as a rotation from the head entity to the tail entity, allowing it to reason over various relation patterns.

DistMult [15] uses vectors to represent entities and diagonal matrices to represent relations, and measures the probability of a triple by matching the latent semantics of entities and relations in the embedding space. ComplEx [16] introduces complex-valued embeddings on top of DistMult, so entities and relations are embedded in complex rather than real-valued space. Both types of approaches are limited by the scale of their parameters and capture semantic information insufficiently.

ConvE [20], ConvKB [38] and CapsE [23] apply convolutional neural networks and their variants to KGC tasks: the combined vector of the head entity and relation is used as model input, multiple feature maps are generated by convolution, and the feature maps are finally projected into a one-dimensional vector whose dot product with the tail entity gives the triple score. CNNs can handle large-scale KGs while maintaining network depth and model complexity, so these approaches are strongly competitive on datasets with large data volumes.

KG-BERT [32] is the first application of BERT [10] to KGs; it obtains deep semantic information of knowledge from the descriptions of entities and relations, performs outstandingly on link prediction and triple classification tasks, and has attracted extensive attention in recent years.

4.3 Training Settings

In the training phase, up to 20 epochs are trained using the Adam optimizer, and training stops when the validation results do not improve for 3 epochs. To reduce computational overhead, only the BERT-base pretrained model is used in the experiments, and the top \(K\) candidate entities are taken in TSTR(Ensemble). The hyperparameter settings for the different datasets are shown in Table 2.

Table 2 Optimal hyperparameter settings on different datasets in the training phase. Here, \(N(tp^{\prime})\) denotes the number of negative samples drawn for each positive sample, and \(N({\text{Bottleneck}})\) is the number of bottleneck convolutions

4.4 Link Prediction

The link prediction task is: given \((h,r,?)\), predict the tail entity, or given \((?,r,t)\), predict the head entity, where \(?\) denotes the missing entity.

In the experiments, different entities replace the head or tail entity of a triple, and the ranking of the triple correctness scores is used as the ranking of the entities. Mean rank (MR), mean reciprocal rank (MRR), and the proportion of correct entities ranked in the top \(N\) (Hits@\(N\)) are used as evaluation metrics: lower is better for MR, while higher is better for MRR and Hits@\(N\).
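The evaluation metrics can be computed directly from the rank of the correct entity in each query, as in the small sketch below; the example ranks are arbitrary.

```python
def ranking_metrics(ranks, ns=(1, 3, 10)):
    """Compute MR, MRR and Hits@N from the rank of the correct entity per query."""
    mr = sum(ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {n: sum(r <= n for r in ranks) / len(ranks) for n in ns}
    return mr, mrr, hits

# e.g. ranks of the correct entity for four test queries
print(ranking_metrics([1, 4, 2, 15]))
```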

Table 3 shows the performance of the various models on the densely connected datasets; bold font marks the best results in each experiment. The following can be observed: (1) The MR of TSTR is similar to or even lower than that of KG-BERT, because TSTR also uses the semantic information of knowledge. (2) Because TSTR(Ensemble) can explicitly model triples through RotatE, its Hits@1, Hits@3 and MRR are significantly better than those of TSTR. (3) The advantage of TSTR over the baselines is not obvious, because the connections in FB15k-237, WN18RR and UMLS are too dense and even a simple model can achieve good results in evaluation; this is most evident on UMLS.

Table 3 Link prediction results on the FB15k-237, WN18RR and UMLS datasets; baseline results are taken from the original papers. Bold numbers indicate the best results

From Table 4 it can be observed that data sparsity has a huge impact on low-complexity models. The evaluation metrics of TSTR are higher than those of the baselines. A reasonable explanation is that ResNet does not suffer from model degradation as network depth increases and can effectively capture the complex dependencies between entities in different semantic spaces. Most existing KGs are sparse and incomplete, so deeper network architectures are valuable for research on KGC tasks.

Table 4 Link prediction results on the sparsely connected datasets. Bold numbers indicate the best results

The previous analysis noted that TSTR(Ensemble) improves the ability to model triple structures by incorporating structural information. To better evaluate the positive impact of RotatE on TSTR, this study compared the performance of TSTR(Ensemble) and TSTR across different relation patterns. Since FB15k-237 contains 237 relations while WN18RR contains only 11, the WN18RR dataset was selected for this relation-level analysis. TSTR and TSTR(Ensemble) were used to predict each relation in WN18RR, and the predictions were evaluated with MRR. The results, shown in Fig. 4, indicate that TSTR(Ensemble) outperforms TSTR on 8 of the 11 relations, demonstrating higher prediction accuracy and a better ability to model various relation patterns. These findings further validate the necessity of the design, suggesting that TSTR(Ensemble) can effectively model complex relational data and is a valuable tool for knowledge reasoning.

Fig. 4

The experimental results for link prediction under different relationship models, where the black line represents the trend of the number of relationships

In this paper, tail entities with different in-degrees in WN18RRs were selected for link prediction experiments with TSTR and KG-BERT. The results in Fig. 5 show that TSTR improves over KG-BERT more as the data become sparser, which again indicates that its deeper network allows TSTR to cope with complex and difficult data environments.

Fig. 5

The improvement of TSTR compared to KG-BERT at different sparsity levels

4.5 Triple Classification

Triple classification judges the correctness of a given triple [13]. In the experiments, Eq. (6) is used to compute the triple score under TSTR. Table 5 shows the triple classification accuracy of different approaches on FB13 and WN11: TSTR achieves the best result on FB13, and its accuracy on WN11 is 93.1%, slightly lower than KG-BERT but clearly better than approaches that do not use semantic information. Triple classification can be regarded as a simple text classification problem, so using BERT to obtain the semantic information of knowledge gives the model a clear advantage.

Table 5 Triple classification accuracy; the baseline results in the table are all taken from the original papers

4.6 Efficiency Comparison with KG-BERT Baseline

The computational efficiency of KGC models has long been a neglected topic, but as KGs continue to expand it is necessary to discuss model efficiency. To demonstrate the computational efficiency gain of TSTR over KG-BERT [32], both models are implemented in PyTorch and trained for five epochs on FB15k-237s. The experiments are conducted on a server with an Intel(R) Core(TM) i7-10875H CPU @ 2.30 GHz, an NVIDIA Tesla K80 GPU and 24 GB of memory. As shown in Fig. 6, under the same experimental conditions the training time of TSTR is almost half that of KG-BERT, so the dual-stream embedding design gives TSTR a clear efficiency advantage. During testing the advantage becomes even more obvious: the computation time is far less than that of KG-BERT while the predictions are better. The efficiency gain is larger on the test set because testing requires scoring all entities; compared with KG-BERT, which takes the whole triple as input, TSTR's dual-stream design avoids repeatedly embedding the same entities and relations, so the advantage of TSTR grows as the number of entities increases.

Fig. 6

Comparison with KG-BERT on FB15k-237s; per-epoch training time was collected

5 Conclusions

KGs serve as carriers of constantly expanding and enriched knowledge. However, the scope and quantity of knowledge keep growing, and both manual and algorithmic construction of knowledge graphs have limitations, so current KGs contain incomplete data. Incomplete data cannot adequately support knowledge-driven applications, which makes completing the KG crucial.

Currently, mainstream KGC methods perform poorly on sparsely connected knowledge graphs because their model structures are simple: their network depth and complexity are insufficient to extract the deep connections between entities. To address these problems, this paper makes the following contributions: (1) it proposes a KGC method, TSTR, which uses BERT and a deep convolutional architecture to capture the deep semantic connections of triples and effectively processes KGs containing sparse connections; (2) it proposes an adjustable ensemble scheme, TSTR(Ensemble), to reorder predicted entities; the combination of TSTR and RotatE applies both the semantic and the structural information of triples in the evaluation stage, offering a new idea for fusing the two kinds of method; (3) it verifies each part of the model experimentally. Experiments on three densely linked datasets and two self-built sparsely linked datasets achieve good results, showing that combining pre-trained language models with deep learning algorithms is very effective for KGC, especially on sparsely linked datasets, where the proposed model has a clear advantage over other models. In addition, in the efficiency comparison with KG-BERT, the proposed model trains roughly twice as fast and is even faster at test time, making it capable of handling large-scale KGs.

However, this study still has some shortcomings. For example, although the dual-stream embedding design speeds up TSTR during training, there is still a significant gap between TSTR's computation speed and that of graph embedding methods, and the weight parameter is pre-set rather than learned by the model. In future work, we will explore more intelligent ensemble schemes that apply the structural and semantic information of triples simultaneously, and more effective ways to improve computational efficiency, such as learning the weight parameter automatically.