
1 Introduction

Knowledge graphs (KGs) have become a key resource for artificial intelligence applications, including question answering, recommendation systems, and knowledge inference. In recent years, several large-scale KGs, such as Freebase [2], DBpedia [1], NELL [4], and Wikidata [25], have been built by automatically extracting structured information from text and manually adding structured information according to human experience. Although large-scale KGs contain billions of triples, the extracted knowledge is still only a small part of real-world knowledge and probably contains errors and contradictions. For example, 71% of people in Freebase have no known place of birth, and 75% have no known nationality [5]. Therefore, knowledge graph completion (KGC), which completes or predicts missing structured information based on existing KGs, is a crucial task.

A typical KG transforms real-world and abstract information into triples denoted as \((\text {head entity}, \text {relation}, \text {tail entity})\), (h, r, t) for short. To complete or predict the missing element of a triple, such as (h, r, ?), (h, ?, t), or (?, r, t), representation learning (RL) is widely deployed. RL embeds entities and relations into a vector space and has produced many successful translation models, including TransE [3], TransH [26], TransR [16], and TransG [29]. These models aim to generate precise vectors of entities and relations following the principle \(h + r \approx t\), which means t is translated from h by r.
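As a minimal, self-contained illustration of this principle (our own sketch, not any model's released code), the following scores a triple by the distance between \(h + r\) and t, the quantity that translation models drive toward zero for correct triples:

```python
import numpy as np

def translation_score(h_vec, r_vec, t_vec):
    """Distance between h + r and t; smaller means the triple better satisfies h + r ≈ t."""
    return np.linalg.norm(h_vec + r_vec - t_vec, ord=1)

# toy usage with random 50-dimensional embeddings (illustrative only)
h, r, t = np.random.rand(3, 50)
print(translation_score(h, r, t))
```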

Most RL-based models concentrate on the structured information in triples and neglect the rich semantic information of entities and relations that most KGs contain. Semantic information includes types, descriptions, lexical categories and other textual information. Although these models have significantly improved embedding representations and increased prediction accuracy, there is still room for improvement by exploiting semantic information in the following two aspects.

Representation of entities. One of the main obstacles of KGC is the polysemy of entities or relations, i.e., each entity or relation may have different semantics in different triples. For example, in the triple (Isaac_Newton, birthplace, Lincolnshire), Newton is a person, while in (Isaac_Newton, author_of, Opticks), Newton is a writer or physicist. This is a very common phenomenon in KGs and it causes difficulty in vector representations. Most works focus on entity polysemy and utilize linear transformations to model the different semantics of an entity in different triples to attain high accuracy. However, they represent each entity as a single vector, which cannot capture the uncertain semantics of entities. This is a critical limitation for modeling rich semantics.

Estimation of posterior probability. Another problem of most previous works is the neglect of the prior probability of known triples. Most previous works optimize the maximum likelihood (ML) estimation of vector representations. Few models discuss the posterior probability, which incorporates a prior distribution to augment optimization objectives. Specifically, previous ML models essentially maximize the probability \(p(h,r,t)\) that h, r, and t form a triple. When predicting the missing tail of (h, r, ?), however, h and r are already known and they may influence the possible choices of t. Thus, the posterior probability \(p(t \, | \, h,r)\) of predicting t is a more accurate expression of the optimization goal than \(p(h,r,t)\). In other words, we can prune the possible choices based on the prior probability of the missing element in a triple.

To address the two issues above, we propose a type-based multiple embedding model (TransT). TransT fully utilizes the entity type information which represents the categories of entities in most KGs. Compared with descriptions and other semantic information, types are simpler and more specific because types of an entity are unordered and contain less noise. Moreover, we can construct or extend entity types from other semantic information, if there is no explicit type information in a KG. For example, in Wordnet, we can construct types from the lexical categories of entities. Other semantic information does not have this advantage. In addition to entity types, we construct multiple types of relations from common types of related entities. We measure the semantic similarity of entities and relations based on entity types and relation types. With this type-based semantic similarity, we integrate type information into entity representations and prior estimation which are detailed below.

We model each entity as multiple semantic vectors with type information to represent entities more accurately. Different from using semantics-based linear transformations to separate the mixed representation [16, 26, 27, 31], TransT models the multiple semantics separately and utilizes the semantic similarity to distinguish entity semantics. In order to capture entity semantics accurately, we dynamically generate new semantic vectors for different contexts.

We utilize the type-based semantic similarity to incorporate prior probability into the optimization objective. It is inspired by the observation that the missing element of a triple semantically correlates with the other two elements. Specifically, all entities appearing in the head (or tail) with the same relation have some common types, or these entities have some common type owned by the entities appearing in the tail (or head). In the “Newton” example mentioned above, if the head of (Isaac_Newton, author_of, Opticks) is missing, we can predict that the head is an entity with the “author” or “physicist” type, since we know the relation is “author_of” and the tail is “Opticks”, a physics book. Therefore, we design a type-based semantic similarity based on the similarity of type sets. With this similarity, TransT captures the prior probability of missing elements in triples for accurate posterior estimation.

Our contributions are summarized as follows:

  • We propose a new approach for fusing structured information and type information. We construct multiple types of relations from entity types and design the type-based semantic similarity for multiple embedding representations and prior knowledge discovery.

  • We propose a multiple embedding model that represents each entity as multiple vectors with specific semantics.

  • We estimate prior probabilities for entity and relation predictions based on the semantic similarity between elements of triples in KGs.

The rest of this paper is organized as follows. Section 2 reviews recent studies of KGC. Section 3 introduces our approach, including the multiple embedding model, prior probability estimation, and objective function optimization. Section 4 presents the evaluation of our approach on FB15K and WN18. Section 5 concludes the paper.

2 Related Work

TransE [3] proposes the principle \(h+r \approx t\) to assign a single vector for each entity and relation by minimizing the energy function \(\Vert \varvec{h}+\varvec{r}-\varvec{t}\Vert \) of every triple. It is a simple and efficient model but unable to capture the rich semantics of entities and relations.

Some models revise function \(\Vert \cdot \Vert \) in the energy functions for the complex structures in KGs. TransA [13] adaptively finds the optimal loss function without changing the norm function. Tatec [7] utilizes canonical dot products to design different energy functions for different relations. HolE [20] designs the energy function based on a tensor product which captures the interaction in features of entities and relations. ComplEx [24] represents entities and relations as complex-number vectors and calculates Hermitian dot product in the energy function. ManifoldE [28] expands the position of triples from one point to a hyperplane or sphere and calculates energy function for the two manifolds. KG2E [9] models the uncertainty of entities and relations by Gaussian embedding and defines KL divergence of entity and relation distributions as the energy function. ProjE [22] proposes a neural network model to calculate the difference between \(h+r\) and t.

Some models design \(\Vert \varvec{h_r} + \varvec{r} -\varvec{t_r} \Vert \) to make an entity vector adaptive to different relations. They aim to find appropriate representations of \(\varvec{h_r}\) and \(\varvec{t_r}\). TransH [26] projects entity vectors onto hyperplanes of different relations. It represents \(\varvec{h_r}\) as the projection vector of \(\varvec{h}\) on the relation hyperplane. TransR [16] adjusts entity vectors by transform matrices instead of projections. It represents \(\varvec{h_r}\) as the result of a linear transformation of \(\varvec{h}\). TranSparse [12] considers that the transform matrix should reflect the heterogeneity and imbalance of entity pairs and replaces the transform matrix with two sparse matrices corresponding to the head entity and the tail entity, respectively. TransG [29] considers that relations, like entities, also have multiple semantics, and generates multiple vectors for each relation.

Semantic information, such as types, descriptions, and other textual information, is an important supplement to structured information in KGs. DKRL [30] represents entity descriptions as vectors for tuning the entity and relation vectors. SSP [27] modifies TransH by using the topic distribution of entity descriptions to construct semantic hyperplanes. Entity descriptions are also used to derive a better initialization for training models [17]. With type information, the type-constraint model [14] selects negative samples according to entity and relation types. TKRL [31] encodes type information into multiple representations in KGs with the help of hierarchical structures. It is a variant of TransR with semantic information and the first model to introduce type information. However, TKRL also neglects the two issues mentioned above.

There are several other approaches to modeling KGs as graphs. PRA [15] and SFE [8] predict missing relations from existing paths in KGs. These approaches consider that sequences of relations in paths between two entities can compose the relation between the two entities. RESCAL [21], PITF [6] and ARE [19] complete KGs by recovering their adjacency matrices. These approaches need to process large adjacency matrices of entities.

3 Methodology

3.1 Overview

The goal of our model is to obtain vector representations of entities and relations that maximize the prediction probability over all existing triples. The prediction probability is a conditional probability because, except for the missing element, the other two elements in a triple are known. Specifically, when predicting the tail entity of a triple (h, r, t), we expect to maximize the probability of t under the condition that the head entity and relation are h and r and that the triple satisfies the principle \(h+r \approx t\). We denote this conditional probability as \(p(t \, | \, h,r,true)\), where “true” means the triple satisfies the \(h+r \approx t\) principle; “true” triples are also called correct triples in this paper. Maximizing this probability is the aim of tail prediction. According to Bayes’ theorem [10], \(p(t \, | \, h,r,true)\) can be seen as a posterior probability, and its relation to the prior probability is derived as

$$\begin{aligned} p(t \, | \, h,r,true)= {\left\{ \begin{array}{ll} \frac{ p(true \, | \, h,r,t) \, p(t \, | \, h,r) }{ p(true \, | \, h,r) } &{} p(t \, | \, h,r) \ne 0 \\ 0 &{} p(t \, | \, h,r) = 0, \end{array}\right. } \end{aligned}$$
(1)

where \(p(true \, | \, h,r,t)\) is the probability that (h, r, t) is “true” and \(p(t \, | \, h,r)\) is the prior probability of t. To find the most probable entity, we only need to compare the probabilities of the triples \((h,r,*)\). All these probabilities share the same denominator \(p(true \, | \, h,r)\). Thus, we can omit \(p(true \, | \, h,r)\) in (1):

$$\begin{aligned} p(t \, | \, h,r,true) \propto p(true \, | \, h,r,t) \, p(t \, | \, h,r). \end{aligned}$$
(2)

Similarly, the objective of the head prediction is

$$\begin{aligned} p(h \, | \, r,t,true) \propto p(true \, | \, h,r,t) \, p(h \, | \, r,t), \end{aligned}$$
(3)

and the objective of the relation prediction is

$$\begin{aligned} p(r \, | \, h,t,true) \propto p(true \, | \, h,r,t) \, p(r \, | \, h,t) . \end{aligned}$$
(4)

All three formulas have two components: the likelihood and the prior probability. \(p(true \, | \, h,r,t)\) is the likelihood, estimated by the multiple embedding representations. The other component is the prior probability, estimated by the semantic similarity. TransT introduces a type-based semantic similarity to estimate the two components and optimizes the vector representations to maximize these posterior probabilities over the training set.

3.2 Type-Based Semantic Similarity

In order to estimate the likelihood and the prior probability, we introduce a semantic similarity that measures the distinctions among entity semantics using type information.

Fig. 1. The entities in the head or tail of a relation have some common types. In this example, all the head entities have the “person” type and all the tail entities have the “location” type. Therefore, “person” and “location” are the head type and the tail type of this relation, respectively. Moreover, if we relax this constraint, the “physicist” type is also a head type of the relation, since most head entities have this type.

All entities appearing in the head (or tail) with the same relation have some common types. These common types characterize the relation, as shown in Fig. 1. There are head and tail positions for each relation; thus, each relation r has two type sets, \(T_{r,head}\) for entities in the head and \(T_{r,tail}\) for entities in the tail. We construct the type sets of relations from these common types:

$$\begin{aligned} T_{r,head} = \bigcap _{\begin{array}{c} e \in {Head}_{r} \\ \rho \end{array}} T_e \qquad T_{r,tail} = \bigcap _{\begin{array}{c} e \in {Tail}_{r} \\ \rho \end{array}} T_e , \end{aligned}$$
(5)

where \(T_e\) is the type set of entity e, and \(Head_r\) and \(Tail_r\) are the sets of entities appearing in the head and tail with relation r, respectively. \(\bigcap _{\rho }\) is a special intersection containing the elements that belong to most of the type sets. This intersection can capture more type information of entities than the normal intersection. However, more information may also include more noise. Thus, we balance this influence with the parameter \(\rho \), the lowest frequency a type must reach across all \(T_e\).
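A minimal sketch of this relaxed intersection (our own illustration; interpreting \(\rho \) as a frequency fraction over the observed entities is an assumption):

```python
from collections import Counter

def relation_type_set(entity_type_sets, rho=0.5):
    """Relaxed intersection of Eq. (5): keep types whose frequency among the
    entities observed in this position (head or tail) reaches the threshold rho."""
    counts = Counter()
    for types in entity_type_sets:
        counts.update(set(types))
    n = len(entity_type_sets)
    return {t for t, c in counts.items() if c / n >= rho}

# toy usage: head entities of one relation and their type sets
heads = [{"person", "physicist"}, {"person", "physicist"}, {"person", "writer"}]
print(relation_type_set(heads, rho=0.6))   # -> {'person', 'physicist'}
```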

With the type information of entities and relations, we define the asymmetric semantic similarity between relations and entities as the following similarity of two sets, inspired by the Jaccard index [11]:

$$\begin{aligned} s(r_{head},h) = \frac{|T_{r,head} \cap T_h |}{| T_{r,head} |} \quad s(r_{tail},t) = \frac{|T_{r,tail} \cap T_t |}{| T_{r,tail} |} \quad s(h,t) = \frac{|T_h \cap T_t |}{| T_h |} , \end{aligned}$$
(6)

where \(s(r_{head},h)\) is the semantic similarity between the relation and the head, \(s(r_{tail},t)\) is the semantic similarity between the relation and the tail, and s(h, t) is the semantic similarity between the head and the tail.
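As an illustrative sketch (not the released implementation), these set similarities can be computed directly from the type sets:

```python
def type_similarity(ref_types, other_types):
    """Asymmetric similarity |ref ∩ other| / |ref|, as in Eq. (6)."""
    if not ref_types:
        return 0.0
    return len(ref_types & other_types) / len(ref_types)

# s(r_head, h) in the paper's notation, with toy type sets
T_r_head = {"person", "physicist"}
T_h = {"person", "physicist", "writer"}
print(type_similarity(T_r_head, T_h))   # 1.0: every head type of r is a type of h
```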

The type-based semantic similarity plays an important role in the following estimations especially in the prior probability estimation.

3.3 Multiple Embedding Representations

Entities with rich semantics are difficult to represent accurately in KGC, which makes it difficult to measure the likelihood \(p(true \, | \, h,r,t)\) accurately. In this section, we introduce the multiple embedding representations to capture entity semantics for an accurate likelihood.

Fig. 2. TransE represents each entity as a single vector that tries to describe all semantics of the entity, so the vector representation is not accurate for any single entity semantics. In TransT, separate representations of entity semantics describe the relationship within a triple more accurately.

As shown in Fig. 2, there is only one vector representation for each entity in previous work, e.g., TransE. To overcome this drawback, TransT represents each entity semantics as a vector and denotes each entity as a set of semantic vectors. We assume relations have single semantics and entities have multiple semantics; thus, each relation is represented as a single vector, while each entity is represented as a set of semantic vectors instead of a single vector. Therefore, an entity can be viewed as a random variable over its multiple semantic vectors. Furthermore, the likelihood \(p(true \, | \, h,r,t)\) depends on the expected probability over all possible semantic combinations of the random variables h and t. This defines the likelihood of the vector representations of the triple as below

$$\begin{aligned} p (true \, | \, h,r,t) = \sum _{i=1}^{n_h} \sum _{j=1}^{n_t} w_{h,i} w_{t,j} p_{true} ( v_{h,i},v_{r},v_{t,j} ) , \end{aligned}$$
(7)

where \(n_h\) and \(n_t\) are the number of entity semantics of h and t; \(w_{h}=( w_{h,1}, \ldots , w_{h,n_h} )\) and \(w_{t}=( w_{t,1},\ldots ,w_{t,n_t} )\) are the distributions of random variables h and t; \(v_{h,i}\), \(v_r\), \(v_{t,j}\) are the vectors of h, r, t; \(p_{true} (v_{h,i},v_r,v_{t,j} )\) is the likelihood of the component with i-th semantic vector \(v_{h,i}\) of h and j-th semantic vector \(v_{t,j}\) of t. According to the principle \(h+r \approx t\), this likelihood is determined by the difference between \(h+r\) and t:

$$\begin{aligned} p_{true} (v_{h,i},v_{r},v_{t,j} ) = \sigma ( d ( v_{h,i} + v_{r} , v_{t,j} ) ) , \end{aligned}$$
(8)

where the distance function d measures this difference, and the squashing function \(\sigma \) transforms values of d from \([0, +\infty )\) into probability values from 1 to 0, since the probability of a semantic combination should be larger when the distance between the corresponding vectors is smaller. To satisfy this property, we set \(d(x,y)=\Vert x-y \Vert _{1}\) (1-norm) and \(\sigma (x) = e^{-x}\).
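A self-contained sketch of Eqs. (7)–(8) under these choices of d and \(\sigma \) (array shapes and weight names are our own assumptions):

```python
import numpy as np

def component_likelihood(v_h, v_r, v_t):
    """p_true(v_h, v_r, v_t) = exp(-||v_h + v_r - v_t||_1), Eq. (8)."""
    return np.exp(-np.sum(np.abs(v_h + v_r - v_t)))

def triple_likelihood(h_vecs, w_h, v_r, t_vecs, w_t):
    """p(true | h, r, t): expectation over all semantic combinations, Eq. (7)."""
    return sum(
        w_h[i] * w_t[j] * component_likelihood(h_vecs[i], v_r, t_vecs[j])
        for i in range(len(h_vecs))
        for j in range(len(t_vecs))
    )

# toy usage: head entity with 2 semantic vectors, tail entity with 1
h_vecs = [np.random.rand(50), np.random.rand(50)]
t_vecs = [np.random.rand(50)]
v_r = np.random.rand(50)
print(triple_likelihood(h_vecs, [0.6, 0.4], v_r, t_vecs, [1.0]))
```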

In order to capture entity semantics more accurately, we do not assign the specific semantics of entities or the sizes of their vector sets in advance. We model the generation of semantic vectors as a random process derived from the Chinese restaurant process (CRP), a widely used form of the Dirichlet process [10]. This avoids setting \(n_h\) and \(n_t\) subjectively.

During training, the tail (or head) entity of each triple generates a new semantic vector with the following probability

$$\begin{aligned} p_{new,tail} (h,r,t) = \left( 1 - \max _{ t_i \in {Semantics}_t } s ( t_{i},r_{tail} ) \right) \frac{ \beta e^{- \Vert r \Vert _1 }}{ \beta e^{- \Vert r \Vert _1 } + p (true \, | \, h,r,t) } , \end{aligned}$$
(9)

where \(\beta \) is the scaling parameter of the CRP, which controls the generation probability. The bracketed factor means t is more likely to generate a new semantics when its existing semantics differ more from r; the fraction is similar to the CRP in TransG [29] and indicates that t is likely to generate a new semantics if its current semantic set cannot represent t accurately. Similarly, a new semantic vector of h is generated with probability \(p_{new,head} (h,r,t)\).
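A sketch of how Eq. (9) could be evaluated during training (the mapping from each existing semantics of t to a type set, and the direction of the similarity, are assumptions for illustration):

```python
import numpy as np

def asym_sim(ref_types, other_types):
    """|ref ∩ other| / |ref|, the type similarity of Eq. (6)."""
    return len(ref_types & other_types) / len(ref_types) if ref_types else 0.0

def p_new_tail(tail_semantic_type_sets, T_r_tail, v_r, p_true_hrt, beta=1e-4):
    """Probability that the tail entity spawns a new semantic vector, Eq. (9)."""
    # 1 - max similarity between existing tail semantics and the relation's tail types
    max_sim = max((asym_sim(T_r_tail, ts) for ts in tail_semantic_type_sets), default=0.0)
    crp_term = beta * np.exp(-np.sum(np.abs(v_r)))   # beta * e^{-||r||_1}
    return (1.0 - max_sim) * crp_term / (crp_term + p_true_hrt)
```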

3.4 Prior Probability Estimation

In our model, the prior probability reflects features of a KG from the perspective of semantics. We estimate the prior probabilities (2), (3) and (4) by the type-based semantic similarity.

Note that the type sets of the three elements in a triple are clearly related. We can therefore estimate the prior distribution of the missing element from the semantic similarity between the missing element and the other two.

When we predict t in a triple (h, r, t), entities sharing more common types with r and h have higher probability. Therefore, we use the semantic similarity between t and its context \((h,r,*)\) to estimate the prior probability of t:

$$\begin{aligned} p(t \, | \, h,r) \propto {s(r_{tail},t)}^{\lambda _{tail}}\,{s(h,t)}^{\lambda _{relation}}, \end{aligned}$$
(10)

where \(\lambda _{relation}, \lambda _{head}, \lambda _{tail} \in \{0,1\}\) are similarity weights, because h and r have different impacts on the prior probability of t. We use these weights to select different similarities for different situations. Similarly, the prior estimate of the head entity h is

$$\begin{aligned} p(h \, | \, r,t) \propto {s(r_{head},h)}^{\lambda _{head}}\,{s(t,h)}^{\lambda _{relation}}. \end{aligned}$$
(11)

By the similar derivation, the prior estimation of relation r is

$$\begin{aligned} p(r \, | \, h,t) \propto {s(r_{head},h)}^{\lambda _{head}} \, {s(r_{tail},t)}^{\lambda _{tail}} . \end{aligned}$$
(12)

To adapt to different datasets, the parameters \(\lambda _{relation}\), \(\lambda _{head}\) and \(\lambda _{tail}\) should be adjusted.
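A compact sketch of the three prior estimates (10)–(12), reusing the asymmetric type similarity; the function and parameter names mirror the \(\lambda \) weights and are otherwise our own:

```python
def asym_sim(ref_types, other_types):
    return len(ref_types & other_types) / len(ref_types) if ref_types else 0.0

def prior_tail(T_r_tail, T_h, T_t, lam_tail=1, lam_rel=0):
    """p(t | h, r) up to a constant, Eq. (10)."""
    return asym_sim(T_r_tail, T_t) ** lam_tail * asym_sim(T_h, T_t) ** lam_rel

def prior_head(T_r_head, T_h, T_t, lam_head=1, lam_rel=0):
    """p(h | r, t) up to a constant, Eq. (11)."""
    return asym_sim(T_r_head, T_h) ** lam_head * asym_sim(T_t, T_h) ** lam_rel

def prior_relation(T_r_head, T_r_tail, T_h, T_t, lam_head=1, lam_tail=1):
    """p(r | h, t) up to a constant, Eq. (12)."""
    return asym_sim(T_r_head, T_h) ** lam_head * asym_sim(T_r_tail, T_t) ** lam_tail
```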

3.5 Objective Function with Negative Sampling

To achieve the goal of maximizing posterior probabilities, we define the objective function as the sum of prediction errors with negative sampling [18].

For a triple (h, r, t) in the training set \(\varDelta \), we sample a negative triple \((h',r',t') \notin \varDelta \) by replacing one element with another entity or relation. When predicting different elements of a triple, we replace the corresponding element to obtain the negative triples. Therefore, the prediction error is denoted as a piecewise function:

$$\begin{aligned} l (h,r,t,h',r',t' ) = {\left\{ \begin{array}{ll} - \ln { p(h\,|\,r,t,true) } + \ln { p(h'\,|\,r,t,true) } &{} h' \ne h \\ - \ln { p(t\,|\,h,r,true) } + \ln { p(t'\,|\,h,r,true) } &{} t' \ne t \\ - \ln { p(r\,|\,h,t,true) } + \ln { p(r'\,|\,h,t,true) } &{} r' \ne r , \end{array}\right. } \end{aligned}$$
(13)

where we measure the performance of the probability estimation by the probability difference between the training triple and its negative sample. We define the objective function as the sum of prediction errors:

$$\begin{aligned} \sum _{(h,r,t) \in \varDelta } \sum _{ ( h', r', t' ) \in {\varDelta }_{(h,r,t)}' } \max \left\{ 0, \gamma + l (h,r,t,h',r',t' ) \right\} , \end{aligned}$$
(14)

where \({\varDelta }_{(h,r,t)}'\) is the negative triple set of (h, r, t).

The total posterior probability of predictions is maximized by minimizing the objective function. We apply stochastic gradient descent to optimize the objective function and normalize the semantic vectors of entities to avoid overfitting.
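A sketch of the hinge term that Eqs. (13)–(14) accumulate over the training set, taking the log posteriors of a training triple and its negative sample as inputs (the argument names are our own):

```python
def margin_loss(log_p_pos, log_p_neg, gamma=3.0):
    """Hinge term of Eq. (14) for one (positive, negative) pair,
    with l = -log p(pos) + log p(neg) as in Eq. (13)."""
    return max(0.0, gamma - log_p_pos + log_p_neg)

# toy usage: the pair is already separated by more than the margin, so the loss is 0
print(margin_loss(log_p_pos=-1.2, log_p_neg=-4.5, gamma=3.0))  # 0.0
```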

4 Experiments

In this paper, we adopt two public benchmark datasets, FB15K [3] and WN18 [3], which are subsets of Freebase and Wordnet, to evaluate our model on knowledge graph completion and triple classification [23]. We divide knowledge graph completion into two sub-tasks: entity prediction and relation prediction. Following [3], we split the datasets into training, validation and test sets. The statistics of the datasets are listed in Table 1.

Type information for entities in FB15K was collected in [31]. There are 4,064 types in FB15K and the average number of types per entity is approximately 12. There is no explicit type information in WN18, so we construct the type sets of entities from lexical categories. For example, the name “__trade_name_NN_1” contains its lexical category “NN” (noun), so we define the type of “__trade_name_NN_1” as “NN”. Because each entity in Wordnet represents exact semantics, the number of types per entity is 1. There are 4 types in WN18.
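A minimal sketch of how such a type could be derived from a Wordnet entity name (the parsing rule follows the naming pattern in the example above and is our assumption):

```python
def wordnet_type(entity_name):
    """Extract the lexical category (e.g. 'NN') from names like '__trade_name_NN_1'."""
    parts = entity_name.strip("_").split("_")
    return parts[-2]  # the token right before the sense number

print(wordnet_type("__trade_name_NN_1"))  # -> 'NN'
```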

The baselines include three semantics-based models: TKRL [31] utilizes entity types; DKRL [30] and SSP [27] take advantage of entity descriptions.

Table 1. Statistics of datasets

4.1 Entity Prediction

Entity prediction aims at predicting the missing entity when given an entity and a relation, i.e., we predict t given \((h,r,*)\), or predict h given \((*,r,t)\). FB15K and WN18 are the benchmark datasets for this task.

Evaluation Protocol. We adopt the same protocol used in previous studies. For each triple (h, r, t) in the test set, we replace the tail t (or the head h) with every entity in the dataset. We calculate the probabilities of all replacement triples and rank them in descending order. Two measures are used as evaluation metrics: Mean Rank, the mean rank of the original triples in the corresponding probability rankings, and HITS@N, the proportion of original triples whose rank is not larger than N. In this task we use HITS@10. This setting is called “Raw”. Some replacement triples exist in the training, validation, or test sets, so ranking them ahead of the original triple is acceptable. Therefore, we filter out these triples to eliminate this effect; this setting is called “Filter”. In both settings, a higher HITS@10 and a lower Mean Rank indicate better performance.
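An illustrative sketch of the ranking step behind both metrics (Mean Rank averages these ranks over the test set, and HITS@10 is the fraction of ranks not larger than 10; function and argument names are our own):

```python
def rank_of_correct(scores, correct_id, known_ids=None):
    """Rank of the correct entity among candidates scored by probability
    (higher is better). With `known_ids`, other known-correct candidates are
    removed first ("Filter" setting); otherwise this is the "Raw" setting."""
    if known_ids:
        scores = {e: s for e, s in scores.items() if e == correct_id or e not in known_ids}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(correct_id) + 1

# toy usage over one test triple's candidate scores
scores = {"Lincolnshire": 0.9, "Paris": 0.95, "London": 0.2}
print(rank_of_correct(scores, "Lincolnshire"))                       # Raw rank: 2
print(rank_of_correct(scores, "Lincolnshire", known_ids={"Paris"}))  # Filter rank: 1
```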

Experiment Settings. As the datasets are the same, we directly reuse the best results of several baselines from the literature [16, 26, 31]. We tried several settings on the validation set to obtain the best configuration. Under the “unif.” sampling strategy [26], the optimal configurations are: on WN18, learning rate \(\alpha = 0.001\), vector dimension \(k = 50\), margin \(\gamma = 3\), CRP factor \(\beta = 0.0001\), similarity weights \(\lambda _{head}=\lambda _{tail}=0\), and \(\lambda _{relation}\) set to 0 or 1 for different relations depending on statistics of the training set; on FB15K, \(\alpha = 0.00025\), \(k = 300\), \(\gamma = 3.5\), \(\beta = 0.0001\), \(\lambda _{head}=\lambda _{tail}=1\), \(\lambda _{relation}=0\). We train the model until convergence.

Results. Evaluation results on FB15K and WN18 are shown in Table 2. On FB15K, we compare the impacts of multiple vectors and type information. “Single” or “Multiple” indicates whether entities are represented as single or multiple vectors; “type” or “no type” indicates whether type information is used. From the results, we observe that:

Table 2. Evaluation results on entity prediction
  1. TransT significantly outperforms all baselines on WN18. On FB15K, TransT significantly outperforms all baselines under the Filter setting. This demonstrates that our approach successfully utilizes type information and that multiple entity vectors capture the different semantics of each entity more accurately than linear transformations of a single entity vector.

  2. Compared with the baselines, TransT shows the largest difference between the Raw and Filter results on FB15K. This indicates that TransT ranks more correct triples ahead of the original triple, which is caused by the prior estimation of TransT. Specifically, if the predicted element is the head of the original triple, these correct triples have the same relation and tail. Thus, when we learn the prior knowledge from the training set, the head entities of these correct triples have higher semantic similarity to the head entity of the original triple than those of other triples. TransT utilizes these similarities to estimate the prior probability, ranking similar entities higher. In fact, this phenomenon shows that the prior probability improves the prediction performance.

  3. The difference between the Raw and Filter results is smaller on WN18 than on FB15K. The reason is that the type-based prior knowledge in WN18 is more accurate than that in FB15K. Specifically, WN18 includes 4 types with simple meanings: noun, verb, adjective and adverb. In addition, an entity in WN18 can only have one type. Thus, types in WN18 are better at distinguishing different entities.

  4. Both approaches, multiple-vector representation and type information, have their own advantages: type information performs better under the Raw setting, while multiple-vector representation performs better under the Filter setting.

4.2 Relation Prediction

Relation prediction aims at predicting the missing relation when given two entities, i.e., we predict r given \((h,*,t)\). FB15K is the benchmark dataset for this task.

Evaluation Protocol. We adopt the same protocol used in entity prediction. For each triple (h, r, t) in the test set, we replace the relation r with every relation in the dataset. Mean Rank and HITS@1 are used as evaluation metrics for this task.

Experiment Settings. As the datasets are the same, we directly reuse the experimental results of several baselines from the literature. We have attempted several settings on the validation dataset to get the best configuration. Under the “unif.” sampling strategy, the optimal configurations are: learning rate \(\alpha = 0.0001\), vector dimension \(k=300\), margin \(\gamma = 3.0\), CRP factor \(\beta = 0.001\), similarity weights \(\lambda _{head}=\lambda _{tail}=1\), \(\lambda _{relation}=0\).

Table 3. Evaluation results on relation prediction

Results. Evaluation results on FB15K are shown in Table 3. From the result, we observe that:

  1. TransT significantly outperforms all baselines. Compared with TKRL, which also utilizes type information, TransT improves HITS@1 by 3.5% and Mean Rank by 0.88.

  2. Under the Raw setting, TransT also achieves the best performance. This result differs from the entity prediction task because relation prediction has more prior knowledge: in entity prediction, the prior knowledge is derived from the relation only, whereas in relation prediction it is derived from both the head and tail entities. The latter provides more sources for prior estimation, so TransT ranks more incorrect triples behind the original triple. This further supports the necessity of the prior probability.

4.3 Triple Classification

Triple classification aims at predicting whether a given triple is correct or incorrect, i.e., we predict the correctness of (h, r, t). FB15K is the benchmark dataset for this task.

Evaluation Protocol. We adopt the same protocol used in entity prediction. Since FB15K has no explicit negative samples, we construct negative triples following the same protocol used in [23]. For each triple (h, r, t) in the test set, if its probability of being correct is below a relation-specific threshold \(\sigma _r\), the triple is classified as incorrect; otherwise, it is classified as correct. The thresholds \(\{ \sigma _r \}\) are determined on the validation set with negative samples.
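An illustrative sketch of this decision rule (selecting each threshold by validation accuracy is our own assumption about the standard protocol):

```python
def classify(triple_score, sigma_r):
    """Label a triple correct iff its probability reaches the relation's threshold."""
    return triple_score >= sigma_r

def best_threshold(val_scores, val_labels):
    """Pick the threshold maximizing accuracy on validation triples of one relation."""
    candidates = sorted(set(val_scores))
    return max(
        candidates,
        key=lambda th: sum((s >= th) == lbl for s, lbl in zip(val_scores, val_labels)),
    )

sigma = best_threshold([0.9, 0.8, 0.3, 0.1], [True, True, False, False])  # -> 0.8
print(classify(0.85, sigma), classify(0.25, sigma))  # True False
```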

Experiment Settings. As the datasets are the same, we directly reuse the experimental results of several baselines from the literature. We have attempted several settings on the validation dataset to get the best configuration. Under the “unif.” sampling strategy, the optimal configurations are: learning rate \(\alpha = 0.001\), vector dimension \(k=300\), margin \(\gamma = 3.0\), CRP factor \(\beta = 0.01\), similarity weights \(\lambda _{head}=\lambda _{tail}=\lambda _{relation}=0\).

Table 4. Evaluation results on triple classification

Results. Evaluation results on FB15K are shown in Table 4. TransT significantly outperforms all baselines. Compared with the best baseline, TransT improves accuracy by 2.5% and is the only model whose accuracy exceeds 90%. This task demonstrates the model’s ability to discern which triples are correct.

4.4 Semantic Vector Analysis

We analyze the correlations between the number of semantic vectors and several statistical properties of entities. We use the vector representations obtained by TransT and TransE during the entity prediction task on FB15K.

Figure 3 shows the correlations between the number of semantic vectors and the average numbers of relations, types and triples for different entities. An entity represented by more semantic vectors has more types and appears with more different relations and in more triples in the training set. Thus, entities with more semantic vectors have more complex semantics, and the result of TransT conforms to our understanding of entity semantics.

Figure 4 shows the prediction probabilities of several selected entities. Our approach generates at most 11 semantic vectors for an entity. Entities with more semantic vectors correspond to broader concepts. Thus, popular places and people, such as “Paris” and “Alan Turing”, have more semantic vectors than awards like “Film Award” and events like “2007 NBA draft”. Compared with TransE, the multiple semantic vectors improve the prediction probability of most entities.

Fig. 3. The bar chart shows the number of entities with different numbers of semantic vectors. The left y-axis is the number of relations or types, the right y-axis is the number of triples, and the x-axis is the number of semantic vectors.

Fig. 4. HITS@10 of 11 entities with different numbers of semantic vectors. The number of semantic vectors is placed above the bars.

5 Conclusion

This paper proposes TransT, a new KGC approach that combines structured information and type information. With the type-based prior knowledge, TransT generates semantic vectors for entities in different contexts based on the CRP and optimizes the posterior probability estimation. This approach makes full use of type information and accurately captures the semantic features of entities. Extensive experiments show that TransT achieves remarkable improvements over the baselines.