Few-Shot Relation Prediction of Knowledge Graph via Convolutional Neural Network with Self-Attention

Knowledge graphs (KGs) have become a vital resource for various applications such as question answering and recommendation systems. However, some relations in a KG have only a few observed triples, which makes it necessary to develop methods for few-shot relation prediction. In this paper, we propose the Convolutional Neural Network with Self-Attention Relation Prediction (CARP) model to predict new facts with few observed triples. First, to learn the relation property features, we build a feature encoder that applies a convolutional neural network with self-attention to the few observed triples rather than to background knowledge. Then, by incorporating the learned features, we design an embedding network to learn the representations of incomplete triples. Finally, we give the loss function and training algorithm of our CARP model. Experimental results on three real-world datasets show that our proposed method improves Hits@10 by 48% on average over the state-of-the-art competitors.


Introduction
A fact in a knowledge graph (KG) is expressed as a triple (h, r, t), where r indicates the relation between the head entity h and the tail entity t. Large-scale KGs, such as WordNet [1], YAGO [2], Freebase [3] and Wikidata [4], have become a vital resource for many artificial intelligence tasks, such as question answering [5,6] and recommendation systems [7,8]. However, some relations have only a few observed triples. For example, about 10% of the relations in Wikidata have no more than 10 observed triples [9].
It is challenging to extract effective and representative features from only a few observed triples. To this end, few-shot relation prediction [10] has attracted broad attention; it aims at predicting whether an incomplete triple (h, ?, t) holds w.r.t. r by observing only a few triples about r.
Previous knowledge graph embedding (KGE) methods [11-13] require sufficient training triples to learn the representations of entities and relations, and thus cannot be adopted for few-shot relation prediction. Recent attempts [9, 14-16] introduce background information, such as the neighbors and contexts of entities, to learn more features about entities and relations in few-shot scenarios, but such background information might not always be available. From a practical point of view, the few observed triples themselves contain useful features that have not been fully exploited.
For example, suppose the relation Capital has three observed triples (China, Capital, Beijing), (Italy, Capital, Rome), and (France, Capital, Paris). Few-shot relation prediction aims at predicting whether the incomplete triple (UK, ?, London) or (UK, ?, Liverpool) holds w.r.t. Capital by observing only these three triples. Note that the head entities {China, Italy, France} imply the property Country, and the tail entities {Beijing, Rome, Paris} imply the property City. These head and tail properties help select the candidate triples whose head and tail have the properties Country and City, respectively, such as (UK, ?, London) and (UK, ?, Liverpool). Furthermore, the given triples share the same relation Capital, which helps determine the correct triple (UK, Capital, London), instead of (UK, Capital, Liverpool). Both the few observed triples and the correct triple involve the properties Country, City and Capital. The more similar the properties of an incomplete triple are to the properties common among the observed triples, the more likely the incomplete triple is a fact. This means that the properties shared by the few observed and incomplete triples help predict new facts in few-shot scenarios. Thus, it is valuable to develop a method to predict new facts by observing only a few triples.
In this paper, we investigate how to learn the features of relevant properties to improve the accuracy of few-shot relation prediction in KGs without introducing background knowledge. For this purpose, two key challenges remain: how to describe the correlations among the few observed triples, and how to learn the property features from them?
As shown in the example above, the few observed triples are frequently correlated with each other, which is useful for discerning the property features. Fortunately, the self-attention mechanism [17] allows the inputs to correlate with each other and determine to which inputs they should pay more attention. By using the self-attention mechanism, we assign different weights to different features of the observed triples to describe their correlations, so that the property features can be highlighted. Meanwhile, the convolutional neural network (CNN) [18] is known to be particularly useful for learning the property features of a digital image via a set of convolutional kernels, which makes the network tolerant to translation of the image's property features. Thus, by an analogy between the set of observed triples and a digital image in terms of indivisibility and translation invariance of features, we build a feature encoder that learns the property features by combining a CNN with self-attention-based correlations.
Next, we learn the probability distributions of the property features to enhance their representations as well as those of the relevant relations. We then give a matching function that incorporates the property features into the incomplete triples, improving the model's ability to match correct relations. Finally, we give a loss function that constrains the property feature space to ensure that the model can distinguish positive triples from negative ones. In summary, by focusing on learning relation property features from the few observed triples, we propose the Convolutional Neural Network with Self-Attention Relation Prediction (CARP) model, with the following contributions: • We propose a method to learn property features from the few observed triples, so that the relation representation can be enhanced.

Related Work
Knowledge graph embedding-based relation prediction.
Many knowledge graph embedding methods have been successfully used for relation prediction, including distance-based and neural network-based methods. Among the former, TransE [19] interprets a relation as a translation operation between head-tail entity pairs. TransH [20] models relations as hyperplanes and projects head and tail entities onto the relation-specific hyperplane to form embeddings. TransAt [21] learns the translation-based embedding, relation-related categories of entities, and relation-related attention simultaneously. Among the latter, ConvE [22] uses 2D convolution over embeddings to model the interactions between entities and relations. RESCAL [23] learns the inherent structure of dyadic relational data by tensor factorization. ComplEx [24] adopts complex-valued embeddings to effectively learn antisymmetric relations. GraIL [25] learns to predict relations over subgraph structures based on a graph neural network. TACT [26] categorizes all pairs of relations into several topological patterns and learns the importance of different patterns to facilitate link prediction. SAttLE [27] uses a large number of self-attention heads to capture the mutual information between entities and relations. However, these methods focus on learning the embeddings of entities and relations under the assumption that sufficient training examples are available, and ignore the latent features shared by the few training triples of the same relation. Learning useful features from few training triples remains challenging.
Few-shot relation prediction of KG. Several methods have been proposed for few-shot relation prediction of KG. For example, GMatching [9] proposes a neighbor encoder that enhances entity embeddings with their local graph neighbors and performs multistep matching to compare the incomplete triple with the few observed triples. FSRL [14] designs a recurrent autoencoder aggregation network to aggregate the representations of the few observed triples and employs a matching metric to discover new facts. FAAN [15] introduces an adaptive attention network to learn dynamic representations via the various impacts of neighbors, and adopts a stack of transformer blocks to differentiate the contributions of the few observed triples w.r.t. different incomplete triples. MetaR [28] focuses on transferring relation-specific meta information to quickly optimize the model parameters. GANA [16] proposes a global-local framework based on a gated and attentive neighbor aggregator together with TransH to accurately integrate the semantics of neighbors and match the incomplete triple with the few observed ones. Li et al.
[29] construct a Gaussian distribution for the relation of each triple in the few-shot scenario according to the distributions of its similar relations in background graphs. HiRe [30] learns and refines the representation of relations by learning three levels of relational information (entity-level, triple-level and context-level). RSCL [31] exploits the graph contexts of triples to learn global and local relation-specific representations in few-shot scenarios. These few-shot relation prediction methods rely on introduced background information, such as the neighbors and contexts of entities, to learn more useful representations of entities and relations. However, such background information might not be easily obtained in real-world KGs, while the correlations implied in the few observed triples remain underused. Differently, we build a feature encoder based on a CNN with self-attention to effectively learn the relation property features from the few observed triples without introducing background information.

Definitions and Problem Formalization
We first define some concepts as the basis of later discussion.

Definition 1 A KG is denoted as G = ⟨E, R, T⟩, where E, R, and T = {(h, r, t) | h ∈ E, t ∈ E, r ∈ R} denote the sets of entities, relations, and triples in the KG, respectively.

Definition 2 A reference, denoted as R r, is the set of the few observed triples associated with a relation r, where R r = {(h i , r, t i ) | h i ∈ E, t i ∈ E, r ∈ R} and |R r | = k (the few-shot size).

Definition 3 A query, denoted as Q r, is the set of incomplete triples to be predicted, where Q r = {(h q , r, t q ) | h q ∈ E, t q ∈ E, r ∈ R, (h q , r, t q ) ∉ R r }.

Definition 4 A few-shot relation prediction task, denoted as T r = {R r , Q r }, aims at predicting which triples in the query Q r hold for the relation r, given the reference R r.
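Under Definitions 2-4, a task is simply a relation together with its reference and query sets. The following sketch (illustrative Python, not part of the paper's implementation) shows the structure of T r = {R r , Q r } using the Capital example from the introduction:

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

@dataclass
class FewShotTask:
    """A few-shot relation prediction task T_r = {R_r, Q_r} for one relation r."""
    relation: str
    reference: List[Triple]   # R_r: the k observed triples for r
    query: List[Triple]       # Q_r: candidate triples to score w.r.t. r

    @property
    def k(self) -> int:       # few-shot size |R_r|
        return len(self.reference)

# Example task for the relation Capital from the introduction
task = FewShotTask(
    relation="Capital",
    reference=[("China", "Capital", "Beijing"),
               ("Italy", "Capital", "Rome"),
               ("France", "Capital", "Paris")],
    query=[("UK", "Capital", "London"),
           ("UK", "Capital", "Liverpool")],
)
```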
To fulfill the few-shot relation prediction of KG, we construct a set of prediction tasks T mtr = {T r } as the training set, where each task T r corresponds to an individual relation r.
Similarly, we construct a set of new prediction tasks T mte = {T r � } as the testing set, where the relations are unseen.Table 1 shows an example of the tasks of training and testing for few-shot relation prediction.
The problem to be solved in this paper is formulated as follows. Given a few-shot relation prediction task T r = {R r , Q r }, we build a feature encoder to learn the head property feature z h, the tail property feature z t, and the relation property feature z r from R r. Then, for each (h q , t q ) in Q r, we use an embedding network to obtain the feature-enhanced representation ẑ r of (h q , t q ) by incorporating z h and z t. Finally, we calculate the similarity score between ẑ r and z r to measure whether (h q , t q ) holds w.r.t. r.

Framework
As shown in Fig. 1, our CARP model consists of two major components: a feature encoder that learns property features, and a matching processor that matches the incomplete triples with the few observed ones.

Feature Encoder
In this component, we aim at mining the property features shared by the head and tail entities within the given few triples of the same relation, as well as the relation property features shared by the head-tail entity pairs, to facilitate the generation and selection of correct triples. For simplicity, we use a randomly initialized matrix X to denote the embeddings of the head entities, tail entities, and references.
To describe the correlations among the rows of X, we project X into a feature space by the linear transformation

V = X W f , (1)

where W f denotes the transformation matrix.
To assign different weights to different features of V, we calculate the scaled dot products between V and its transpose as the weights. Then, we use the softmax function to obtain the attention scores X attn on V:

X attn = softmax(V V⊤ / √d) V, (2)

where √d denotes the scaling factor and d is the feature dimension.
Thus, the importance of the property features in X is highlighted in X attn. Then, we feed X attn into an L-layer CNN to identify the property features. Each layer of the CNN applies a convolutional kernel to learn the property features of the current feature map, followed by an activation function to introduce nonlinearity. The l-th feature map is obtained by

X l = ReLU(LN(X l−1 ∗ W l−1 + b l−1 )), (3)

where X 0 = X attn; X l−1, W l−1 and b l−1 denote the feature map, convolution kernel and bias on the (l−1)-th layer, respectively; and ∗, LN(⋅) and ReLU(⋅) denote the convolution, layer normalization and activation, respectively. Since the mean-pooling function summarizes the features present in a region of the feature map generated by a convolution layer, we apply mean-pooling to the L-th feature map X L to obtain the property feature of X:

x = MeanPool(X L ). (4)
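The computation of Eqs. (1)-(4) can be sketched as follows in NumPy. This is a minimal illustration, not the authors' implementation: the kernel size, the use of a single convolution layer (L = 1), and the exact placement of layer normalization are assumptions.

```python
import numpy as np

def self_attention(X, Wf):
    """Eqs. (1)-(2): project X into a feature space, then apply scaled
    dot-product self-attention over the k rows (one per observed triple)."""
    V = X @ Wf                                      # Eq. (1): V = X W_f
    scores = V @ V.T / np.sqrt(V.shape[1])          # scaled dot products
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V                              # attention scores applied to V

def conv1d_layer(X, W, b):
    """Eq. (3): one valid 1-D convolution along each row, then layer
    normalization and ReLU (assumed kernel layout)."""
    k = W.shape[0]
    out = np.stack([(X[:, i:i + k] * W).sum(axis=1) + b
                    for i in range(X.shape[1] - k + 1)], axis=1)
    mu, sd = out.mean(axis=1, keepdims=True), out.std(axis=1, keepdims=True)
    return np.maximum((out - mu) / (sd + 1e-6), 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                  # k = 3 observed triples, d = 8
X_attn = self_attention(X, rng.normal(size=(8, 8)))
X1 = conv1d_layer(X_attn, rng.normal(size=4), 0.1)   # single layer, i.e. L = 1
x = X1.mean(axis=0)                          # Eq. (4): mean-pooling over rows
```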
Finally, to enhance the representation of x, we learn its probability distribution by mapping x to a Gaussian distribution p(z|x) = N(μ, σ²), where the mean μ and standard deviation σ constitute the output of a multilayer perceptron (MLP):

μ = W μ x + b μ , (5)
σ = W σ x + b σ , (6)

where {W μ , W σ } and {b μ , b σ } denote the weights and biases, respectively.
Note that the process of sampling from a distribution is not differentiable. To solve this problem, we use the following reparameterization strategy to sample z as the final representation of the property feature:

z = μ + σ ⊙ ε, (7)

where ε ∼ N(0, I) and ⊙ denotes the element-wise product.
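A minimal sketch of the reparameterization of Eq. (7); the function name and dimensions are illustrative:

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Eq. (7): z = mu + sigma ⊙ eps with eps ~ N(0, I). Since the randomness
    lives in eps, gradients can flow through mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(42)
z = reparameterize(np.zeros(4), np.ones(4), rng)
```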
To constrain the property feature space, we assume that the prior of z follows a standard normal distribution q(z) = N(0, I), since the sample distribution of a random variable approaches a normal distribution when the sample size is large enough, according to the central limit theorem [32]. We then make the posterior p(z|x) better approximate the prior q(z) by minimizing the Kullback-Leibler (KL) divergence between p(z|x) and q(z):

L kl = KL(p(z|x) ‖ q(z)) = (1/2) Σ (μ² + σ² − log σ² − 1). (8)

Fig. 1 Framework of CARP (⊕ and ⊖ denote the concatenation and subtraction, respectively. H, T, z h, z t, z r, h q, t q and ẑ r denote the embeddings of the head entities, the embeddings of the tail entities, the head property feature, the tail property feature, the relation property feature, the embedding of h q, the embedding of t q, and the embedding of the entity pair (h q , t q ), respectively)
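For a diagonal Gaussian p(z|x) = N(μ, σ²) against the standard normal prior q(z) = N(0, I), the KL divergence of Eq. (8) has the closed form sketched below (the helper name is ours):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), the regularizer
    of Eq. (8); it is zero exactly when mu = 0 and sigma = 1."""
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

kl = kl_to_standard_normal(np.zeros(3), np.ones(3))
```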

Matching Processor
Given a few-shot relation prediction task T r = {R r , Q r }, we first separate the reference R r into two parts: the set of head entities and the set of tail entities. We randomly initialize the matrices H and T to denote the embeddings of the head and tail entities, respectively. Then, we feed H and T into the feature encoder to obtain the head property feature z h and the tail property feature z t by Eq. (7). Thus, we can obtain the KL loss L h kl of the head property feature and L t kl of the tail property feature by Eq. (8). We also obtain the head feature map H L of H and the tail feature map T L of T by Eq. (3). Next, we concatenate H, H L, T and T L as the input of the feature encoder to obtain the relation property feature z r by Eq. (7) and the KL loss L r kl of the relation property feature by Eq. (8).
Note that there exist latent correlations between the reference R r and the query Q r within the same few-shot relation task T r. To incorporate the correlations between h q and t q into q h q ,t q , we build an MLP network by using two linear transformations with an activation function:

q h q ,t q = W 2 ReLU(W 1 (h q ⊕ t q ) + b 1 ) + b 2 , (9)

where h q and t q denote the embeddings of h q and t q, respectively; {W 1 , W 2 } and {b 1 , b 2 } denote the weights and biases, respectively; and ⊕ and ReLU(⋅) denote the concatenation and activation function, respectively.
Similarly, we incorporate the correlations between z h and z t into z h,t by Eq. (9).
To incorporate z h,t into q h q ,t q , we build an embedding network by applying a linear transformation with an activation function to z h,t and q h q ,t q :

ẑ r = W o tanh(W h z h,t + W i q h q ,t q ), (10)

where z h,t denotes the representation of (z h , z t ), q h q ,t q denotes the representation of (h q , t q ), {W o , W h , W i } denotes the weights, and tanh(⋅) denotes the activation function.
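The embedding network of Eqs. (9)-(10) can be sketched as follows. The two-layer MLP follows Eq. (9); the exact composition of the weights {W o , W h , W i } in Eq. (10) is our assumed reading, since the extracted text leaves it implicit. The dimension d and all weight values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative embedding dimension

def relu(x):
    return np.maximum(x, 0.0)

def query_embedding(h_q, t_q, W1, b1, W2, b2):
    """Eq. (9): q_{hq,tq} = W2 · ReLU(W1 · (h_q ⊕ t_q) + b1) + b2."""
    return W2 @ relu(W1 @ np.concatenate([h_q, t_q]) + b1) + b2

def feature_enhanced(z_ht, q, Wo, Wh, Wi):
    """One plausible reading of Eq. (10): ẑ_r = Wo · tanh(Wh z_{h,t} + Wi q)."""
    return Wo @ np.tanh(Wh @ z_ht + Wi @ q)

h_q, t_q, z_ht = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
W1, b1 = rng.normal(size=(d, 2 * d)), np.zeros(d)
W2, b2 = rng.normal(size=(d, d)), np.zeros(d)
Wo, Wh, Wi = (rng.normal(size=(d, d)) for _ in range(3))
q = query_embedding(h_q, t_q, W1, b1, W2, b2)
z_hat_r = feature_enhanced(z_ht, q, Wo, Wh, Wi)
```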
Next, to match (h q , t q ) with R r in the matching processor, we transform the matching problem into Euclidean distance-based clustering, since the independent few-shot relation prediction tasks can be viewed as clusters. To this end, we deem each few-shot relation prediction task a cluster and take z r as the cluster center. Given z r and ẑ r, we use the Euclidean distance to measure the distance between ẑ r and z r:

f r (h q , t q ) = ‖ẑ r − z r ‖²₂ , (11)

where ‖⋅‖²₂ denotes the squared L 2 norm. The smaller f r (h q , t q ) is, the more likely ẑ r belongs to the cluster of z r, that is, the more likely (h q , t q ) holds w.r.t. the relation r.
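The matching score of Eq. (11) is then just a squared Euclidean distance:

```python
import numpy as np

def f_r(z_hat_r, z_r):
    """Eq. (11): squared Euclidean distance between the feature-enhanced query
    representation and the cluster center z_r; smaller means a better match."""
    return float(np.sum((z_hat_r - z_r) ** 2))
```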

Training Algorithm
For each relation r, we randomly sample k triples as the reference R r = {(h i , r, t i ) | (h i , r, t i ) ∈ G}. The remaining triples Q r = {(h q , r, t q ) | (h q , r, t q ) ∈ G, (h q , r, t q ) ∉ R r } are regarded as positive triples. Moreover, we construct a set of negative triples N r = {(h q , r, t − q ) | (h q , r, t − q ) ∉ G} by corrupting the tail entities.
To distinguish positive triples from negative ones and ensure that the similarity score between a positive triple and R r is lower than that between the corresponding negative triple and R r by at least a margin, we minimize the following hinge loss [33] on Q r and N r:

L h = Σ (h q ,r,t q )∈Q r [γ + f r (h q , t q ) − f r (h q , t − q )] + , (12)

where [x] + = max[0, x], γ denotes the margin, f r (h q , t q ) denotes the similarity score between (h q , r, t q ) and R r, and f r (h q , t − q ) denotes the similarity score between (h q , r, t − q ) and R r.
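A sketch of the hinge loss of Eq. (12), where `pos_dists` and `neg_dists` hold the scores f r (h q , t q ) and f r (h q , t − q ) for the triples in Q r and N r (variable names are ours):

```python
import numpy as np

def hinge_loss(pos_dists, neg_dists, margin):
    """Eq. (12): sum of [γ + f_r(pos) − f_r(neg)]₊ over the query; pushes each
    positive triple at least γ closer to the cluster center than its negative."""
    return float(np.sum(np.maximum(0.0, margin + pos_dists - neg_dists)))
```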
Meanwhile, we minimize the KL losses L h kl of the head property feature, L t kl of the tail property feature, and L r kl of the relation property feature obtained by Eq. (8) to constrain the space of the property features, since the smaller the KL loss, the closer the probability distribution of the property features is to the standard normal distribution. Overall, the loss function of our CARP model is

L = L h + L h kl + L t kl + L r kl . (13)

The above idea is given in Algorithm 1.

Algorithm 1 CARP training
Input: training set T mtr = {T r }, number of epochs n, learning rate η, few-shot size k, margin γ, parameter set Θ
Output: updated Θ
1: while not converged do
2:   for each T r ∈ T mtr do
3:     Sample k entity pairs from T r as the reference R r
4:     Corrupt the tail entity of each (h q , t q ) to obtain the set of negative triples N r
5:     Obtain z h , z t and z r by Eq. (7)
6:     Obtain L h kl , L t kl and L r kl by Eq. (8)
7:     Calculate the total loss L by Eq. (13)
8:     Calculate the gradient ∂L/∂Θ and update Θ with learning rate η
9:   end for
10: end while
11: return Θ
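The episodic structure of Algorithm 1 (sample the reference, treat the remaining triples as positives, corrupt tails for negatives, then compute the loss and update the parameters) can be sketched as follows. Here `step_fn` stands in for the CARP loss computation and the Adam update, which are omitted; the task tuple layout is our own illustrative convention.

```python
import numpy as np

def train_epoch(tasks, step_fn, k, rng):
    """One epoch of the episodic loop in Algorithm 1 (sketch). Each task is
    (relation, triples, candidate_entities); for every task we sample k
    reference triples, keep the rest as positives, corrupt tails to build
    negatives, and delegate the loss/gradient step to step_fn."""
    for relation, triples, entities in tasks:
        triples = list(triples)
        rng.shuffle(triples)
        reference, positives = triples[:k], triples[k:]
        known = set(triples)
        negatives = []
        for h, r, t in positives:
            # corrupt the tail: pick an entity that does not form a known triple
            candidates = [e for e in entities if (h, r, e) not in known]
            negatives.append((h, r, candidates[rng.integers(len(candidates))]))
        step_fn(relation, reference, positives, negatives)

tasks = [("Capital",
          [("China", "Capital", "Beijing"), ("Italy", "Capital", "Rome"),
           ("France", "Capital", "Paris"), ("UK", "Capital", "London")],
          ["Beijing", "Rome", "Paris", "London", "Liverpool"])]
batches = []
train_epoch(tasks, lambda *b: batches.append(b), k=3, rng=np.random.default_rng(0))
```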

Experiments
In this section, we present experimental results on three real-world datasets to evaluate our CARP method. We first introduce the experimental settings and then conduct four sets of experiments to compare our method with existing ones: (1) MRR/Hits@1/Hits@5/Hits@10 on 3/5-shot relation prediction, (2) impacts of the few-shot size, (3) an ablation study, and (4) a case study.

Experiment Settings
Datasets and Evaluation Metrics. Our experiments were conducted on three KG datasets: NELL-One, FB-One, and Wiki-One, where NELL-One and Wiki-One were constructed by Xiong et al. [9]. NELL-One is based on NELL, a system that collects structured knowledge from the web via an intelligent agent. Wiki-One is based on Wikidata, a free general structured knowledge base consisting of encyclopedic knowledge. Furthermore, we followed a similar process to build another dataset, FB-One, from Freebase, a large collaborative knowledge base consisting of social knowledge. Specifically, we first removed the inverse relations and then selected the relations with more than 50 but fewer than 500 triples for few-shot relation prediction. Each few-shot relation prediction task consists of the triples corresponding to the same relation. There are 67, 131 and 183 few-shot relation prediction tasks on NELL-One, FB-One and Wiki-One, respectively. Following the original settings [9], we split the training/validation/test few-shot relation prediction tasks as 51/5/11, 98/11/22, and 133/16/34 on NELL-One, FB-One, and Wiki-One, respectively. The statistics of the datasets are shown in Table 2.
To evaluate the accuracy of our CARP model, we used two common ranking metrics: (1) Mean Reciprocal Rank (MRR), the average of the reciprocal ranks of the correct triples, and (2) Hits@N, the proportion of correct triples ranked within the top N.
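Given the ranks of the correct triples among all candidates, the two metrics can be computed as follows (a small sketch; `mrr_and_hits` is our helper name):

```python
import numpy as np

def mrr_and_hits(ranks, n):
    """MRR: mean of the reciprocal ranks of the correct triples.
    Hits@N: fraction of correct triples ranked within the top N."""
    ranks = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / ranks)), float(np.mean(ranks <= n))

mrr, hits5 = mrr_and_hits([1, 2, 10], n=5)
```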
Implementation. To implement few-shot relation prediction with KGE methods, we used all the triples in the training set and those in the references of both the validation and testing sets as training triples. For TransE/TransH/ComplEx/RESCAL, we used the open-source code released by [34]. For GraIL/TACT/GMatching/FAAN/FSRL/MetaR/GANA, we used the code released by the respective authors. For a fair comparison, the embedding dimension was set to 100, 100, and 50 for NELL-One, FB-One, and Wiki-One, respectively, following [9]. During the training of CARP, we used Adam [35] with a learning rate of 0.001 to update the parameters. In all experiments, the batch size was set to 64, and the number of training epochs was set to 200 for NELL-One, 300 for FB-One, and 400 for Wiki-One.

Table 3 MRR/Hits@1/Hits@5/Hits@10 on 3/5-shot relation prediction (bold numbers denote the best results)
In summary, our CARP model improves MRR, Hits@1, Hits@5, and Hits@10 by 90%, 124%, 70%, and 48%, respectively, on average over the second-best comparison model on the three real-world datasets. This demonstrates that: (1) our CARP model can adapt to different datasets, while the comparison methods perform unstably across datasets (for example, FSRL performs better on FB-One but worse on Wiki-One); (2) our CARP model can learn more useful representations of the entities by mining the property features rather than using background information in the few-shot scenario.

Exp-2: Impacts of Few-shot Size
To evaluate the impact of the few-shot size k, we set k = 1, 3, 5, 7 and tested the MRR for each k. The results are reported in Fig. 2, which tells us that: • Our CARP model outperforms the comparison models for every k on NELL-One/FB-One/Wiki-One, demonstrating the effectiveness of our model for few-shot relation prediction.
• MRR increases slightly as k increases, indicating that the larger the reference, the richer the information learned by our CARP model.

Exp-3: Ablation Study
To evaluate the contributions of the feature encoder and the matching processor, we conducted ablation studies with two settings. First, to test the effectiveness of the feature encoder, we replaced the feature encoder module with a mean-pooling layer over the reference, denoted as AS_1. Second, to test how much the property features learned by the feature encoder contribute to the query, we replaced the property features with random features as the input of the embedding network, denoted as AS_2. The results are reported in Table 4, which tell us that: • Our CARP model outperforms the variant AS_1, indicating that the feature encoder of our model learns more effective and representative features from the given reference. Specifically, for 3-shot relation prediction, our model improves MRR, Hits@1, Hits@5 and Hits@10 over AS_1.

Conclusion
In this paper, we propose the CARP model to predict new facts with few observed triples. CARP focuses on learning the relation property features from the few observed triples rather than introducing background information, which avoids introducing noise. CARP not only enhances the representation of relations, but also facilitates the prediction of new facts in few-shot scenarios.
In the future, we will consider learning more valuable features about relations in few-shot scenarios.Besides, we will consider shuffling the order of the triples in the reference as a data augmentation strategy to enhance the representations of entities and relations.

Fig. 2 Impacts of the few-shot size k

Table 1 Example of training and testing tasks
Reference: (America, Language, English), (France, Language, French)
Query: (UK, Language?, Chinese), (UK, Language?, English)