Introduction

The Knowledge Graph (KG) is a concept popularized by Google for organizing multi-relational data. In recent years, KGs have been widely used in artificial intelligence applications such as intelligent question answering [28, 43] and social network analysis [45]. However, because KGs are incomplete, the performance of tasks built on them suffers. Many KGs, such as WordNet [26], Freebase [1], and the Google Knowledge Graph [2], are static and carry no temporal information. In previous studies, researchers have proposed many models, such as TransE [3] and its extensions, to complete KGs. Most of these models embed entities and relations into a low-dimensional space and achieve good performance.

In the past few years, knowledge data have increasingly contained abundant temporal information. Accordingly, many researchers add timestamps to the traditional knowledge graph triples (s, r, o), yielding temporal knowledge graphs (TKGs) described by quadruples (s, r, o, t). ICEWS [4] and GDELT [21] are two well-known temporal knowledge graphs, but they are far from complete. An important task on a TKG is to complete quadruples that lack the correct subject entity, object entity, or relation. For example, an incomplete quadruple may take the form (?, r, o, t), (s, r, ?, t) or (s, ?, o, t), and we need to infer "?" from the quadruples we already have. Although static KG completion models have achieved remarkable results, they transfer poorly to TKG completion. TKG completion therefore still leaves plenty of room for research and also faces great difficulties. As research has developed, researchers have added temporal information processing to completion models, such as TTransE [19] and TA-TransE [9]. In addition, some models based on recurrent neural networks have also emerged, such as RE-Net [16].

In previously proposed methods, researchers assumed that each relation has a sufficient number of entity pairs for training. In fact, however, a large number of relations in TKGs have only a few entity pairs; these are called long-tail relations. For example, an "is the citizen of" relation may have thousands of entity pairs, whereas an "is the president of" relation may have only a few hundred. The number of entities corresponding to the two relations varies greatly, and long-tail relations cannot be ignored in the real world, so this situation deserves study. To address it, several few-shot models, such as GMatching [44], MetaR [5] and FAAN [34], were proposed successively. These models are developed for static knowledge graphs with few samples and cannot handle temporal knowledge graphs: the encoders they use cannot embed the temporal relations between entities, and they consider neither information sharing among the few reference entities nor the influence of heterogeneous neighborhoods. After adding temporal information, we propose a time-based relation-aware heterogeneous neighborhood encoder inspired by FSRL [49]. In addition, a one-shot learning setting cannot cover training with few samples, so a new module is needed to realize the interaction between the reference set and temporal information. Furthermore, we find that the meta-optimizer can be combined with an LSTM [12], which alleviates the vanishing and exploding gradient problems during training. Combining LSTM updates with gradient descent obtains near-optimal model parameters better and faster: the model is updated with a small number of gradient steps, learns new tasks quickly, and trains another neural network classifier through an optimization algorithm in the few-shot setting.

In this paper, we combine several modules and propose a new model to complete few-shot TKGs. The paper makes the following contributions:

  • We propose FTMO, a few-shot completion model for temporal knowledge graphs.

  • We use timestamp information to enhance the representation of task entities and entity pairs by constructing a time-based relationship-aware heterogeneous neighbor encoder.

  • We propose a cyclic automatic encoder aggregation network for TKG.

  • We conduct abundant experiments on two public datasets to demonstrate that FTMO outperforms existing state-of-the-art TKG embedding methods and few-shot completion methods.

The rest of the paper is organized as follows. In “Related work”, we describe related work. In “Our model”, we illustrate the relevant task definitions and the details of the proposed model. Experimental setups and comparative analysis of the experimental results are presented in “Experiments”. In “Conclusion”, we give a conclusion and possible directions for future improvement.

Related work

Static knowledge graph completion methods

Static knowledge graph completion models fall into two categories: translation-based models and other (mainly semantic matching) models.

On the one hand, translation-based models represent relations and entities as vectors and measure the dissimilarity between them. Bordes et al. propose the well-known TransE [3] model, which interprets the relation vector as a translational operation on the entity vectors in the embedding space: a triple is considered correct if s + r ≈ o. However, TransE only models 1–1 relations well and is a poor fit for 1–N, N–1 and N–N relations. To this end, several improved models have been proposed, such as TransH [42], TransR [23], and TransD [14]. TransH projects the subject and object vectors onto a hyperplane associated with the current relation before translating the subject vector toward the object vector. TransR uses a relation-specific mapping matrix to project entities into different relation semantic spaces and obtain different semantic representations. TransD uses entity-related and relation-related vectors to dynamically construct the projection matrix of each relation.

On the other hand, one of the most important non-translation families is the semantic matching model, which computes a similarity score from the latent semantics of entity and relation vectors and ranks completion results by this score. DistMult [46] treats entities and relations as low-dimensional vectors combined through bilinear and/or linear mapping functions. ANALOGY [24] optimizes the latent representations with respect to the analogical properties of the embedded entities and relations. RESCAL [30] adopts a relation weight matrix to model interactions between the latent features of entities, but its scoring function is too simple to yield efficient vector representations. NTN [36] and HolE [29] were proposed to obtain better vector representations. In addition, MMKRL [25] utilizes multi-modal knowledge effectively to achieve better link prediction and triple classification by summing different plausibility functions and using specific norm constraints. Wang et al. [41] model complex internal logic by integrating fused semantic information, which makes the model converge faster. Huang et al. [13] propose local information fusion that joins entities and their adjacencies to obtain multi-relational representations. However, these models cannot be applied to temporal knowledge graphs directly.
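As a brief illustration of the translation principle, the following minimal sketch (in PyTorch) scores a triple by the negative distance between s + r and o; the dimension and values are arbitrary and are not taken from any of the cited implementations.

```python
import torch

def transe_score(e_s, e_r, e_o, p=1):
    """Translation-based plausibility: higher when e_s + e_r is close to e_o."""
    return -torch.norm(e_s + e_r - e_o, p=p, dim=-1)

# toy embeddings; the dimension and values are arbitrary
d = 4
e_s, e_r, e_o = torch.randn(d), torch.randn(d), torch.randn(d)
print(transe_score(e_s, e_r, e_o))
```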

Temporal knowledge graph completion methods

With the growth of data, temporal information has been widely considered. In recent years, timestamps have been embedded into a low-dimensional space, extending triples (s, r, o) to quadruples (s, r, o, t) for TKG completion. Inspired by TransH [42], Dasgupta et al. propose the HyTE [15] model, which explicitly combines temporal information with the entity-relation space by associating each timestamp with a corresponding hyperplane. TTransE [19] upgrades TransE by incorporating temporal information into the scoring function, so that it can complete temporal knowledge graphs and has obtained good results. García-Durán et al. propose TA-TransE [9] and TA-DistMult [9], which add temporal embeddings learned from time tokens into their score functions. However, these models treat temporal information statically and ignore the relevance among related quadruples; time dependency also needs to be considered. To make good use of time dependency, Trivedi et al. present Know-Evolve [38], a deep evolutionary model of the temporal knowledge network structure. RE-Net [16] aggregates the neighborhoods of entities and applies a recurrent neural network to capture time dependence. Chrono-Translation [33] handles temporal information using rule mining and graph embedding operations. However, these models usually assume that enough training quadruples are provided for all relations and do not consider long-tail relations, which leads to poor performance in few-shot settings.
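To illustrate how a timestamp embedding can enter a translational score, the sketch below follows the common formulation s + r + t ≈ o; the exact scoring function of [19] may differ, and the embeddings are toy values.

```python
import torch

def ttranse_style_score(e_s, e_r, e_o, e_t, p=1):
    """Temporal translation: the timestamp embedding shifts the relation translation."""
    return -torch.norm(e_s + e_r + e_t - e_o, p=p, dim=-1)

# toy embeddings for a single quadruple (values are arbitrary)
d = 4
score = ttranse_style_score(torch.randn(d), torch.randn(d), torch.randn(d), torch.randn(d))
print(score)
```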

Few-shot knowledge graph completion methods

To obtain good performance, a large amount of data is usually required to train a model, but real knowledge graphs contain relations with only a few entity pairs. Meta-learning methods, which include metric-based, model-based, and optimization-based approaches, aim at fast learning from a small number of samples.

Because long-tail relations are common in KGs, GMatching [44] proposes a one-shot relational learning model that learns a matching metric from learned embeddings and one-hop graph structures by observing only a few associative triples. However, GMatching assumes that all local neighbors contribute equally to an entity embedding, whereas heterogeneous neighbors may have different influences, and it ignores the interaction among the few reference instances, which limits the representation ability of the reference set. MetaR [5] studies few-shot link prediction in KGs and enables the model to learn faster by transferring relation-specific meta information. Xiong et al. [44] propose a metric-based approach to link prediction for long-tail relations with few samples. However, the performance of MetaR is affected by the sparsity of entities and the number of tasks. FSRL [49], which aims at discovering facts of new relations with few-shot references, can effectively infer true entity pairs given a small set of reference entity pairs for each relation, but it does not consider the importance of timestamp information for temporal knowledge graph completion. REFORM [40] studies error-aware few-shot KG completion, accumulating meta-knowledge across meta-tasks through a neighbor encoder module, a cross-relation aggregation module, and an error mitigation module in each meta-task. MTransH [31] proposes a few-shot relational learning model with a global stage and a local stage. FAAN [34] proposes an adaptive attentional network for few-shot KG completion. However, these methods are mainly aimed at static knowledge graphs and cannot make good use of the timestamp information in temporal knowledge graphs.

Our model

In this section, we design a model called FTMO to complete the missing object entities in few-shot temporal knowledge graph datasets, as shown in Fig. 1. FTMO mainly consists of the following parts: entity embeddings are generated by a time-based heterogeneous neighbor encoder; the few reference entity pairs are aggregated by a time-based cyclic autoencoder to generate the reference set embedding; and a matching network computes the similarity score between each query pair and the reference set and ranks the candidate entities to obtain the highest-ranked entity. Throughout the paper, the main notations are summarized in Table 1.

Fig. 1 The framework of FTMO model

Table 1 The main notations in this paper

Few-shot completion task

A fact in a TKG is represented as a quadruple (s, r, o, t), where s and o denote entities, r denotes a relation, and t denotes a timestamp. TKG completion tasks mainly include three types: (1) given the subject entity s, the relation r and the timestamp t, predict the object entity o: (s, r, ?, t); (2) given the relation r, the object entity o, and the timestamp t, predict the subject entity s: (?, r, o, t); (3) given the subject entity s, the object entity o, and the timestamp t, predict the relation r: (s, ?, o, t). In this study, we consider the first case because we aim to complete the missing object entity of a relation.

Definition 1

Few-shot TKG completion. Given a few-shot TKG in which a relation r and its few-shot reference entity pairs are known, the few-shot TKG completion task is to design a machine learning model that, for each new subject entity s, ranks the candidate object entities according to the known information so that the true object entity of s receives the highest similarity score.
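To make the task setup concrete, the following toy example (all identifiers are hypothetical) sketches a 3-shot task for one relation: the reference set supplies the few known (s, o, t) pairs, and the model must rank the candidates for each query (s, t).

```python
# A 3-shot completion task for one relation r (all names are illustrative).
task = {
    "relation": "is_president_of",
    "reference_set": [            # D_r^train: the few known entity pairs (s, o, t)
        ("person_a", "country_x", "2014-05-01"),
        ("person_b", "country_y", "2015-03-12"),
        ("person_c", "country_z", "2016-07-30"),
    ],
    "queries": [                  # D_r^test: (s, t) whose true object must be ranked first
        ("person_d", "2017-01-20"),
    ],
    "candidates": ["country_x", "country_y", "country_z", "country_w"],
}
```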

Training task

Our goal is to design a machine learning model that predicts the missing object entities in a few-shot TKG. There are two main families of few-shot training methods. (1) The first is metric-based [18, 27, 35, 39], which learns effective metrics and the corresponding matching functions from a set of training examples. (2) The second is based on meta-optimization [7, 8, 20, 22, 32, 47], whose purpose is to quickly optimize model parameters given gradients computed on a small number of few-shot instances. Here, we use meta-optimization [32] and add an LSTM [12] on this basis, which can learn an accurate optimization procedure in the few-shot setting. The few-shot knowledge graph completion task is described as follows:

Given a training task, each relation r ∈ R in the temporal knowledge graph has a corresponding training dataset \({D}_{r}^{\mathrm{train}}\), which contains only the few-shot entity pairs of relation r, and a testing dataset \({D}_{r}^{\mathrm{test}}\), which contains all entity pairs of relation r. Therefore, given a test query (si, r, ti) and the small number of reference pairs in \({D}_{r}^{\mathrm{train}}\), we can rank all the candidate entities and evaluate our model on this basis. In summary, the loss function of r can be defined as \({\mathcal{L}}_{\ominus }\left({s}_{i},{o}_{i},{t}_{i}|{Q}_{{s}_{i},r,{t}_{i}},{D}_{r}^{\mathrm{train}}\right)\), where \(\ominus \) is the collection of all model parameters and \({Q}_{{s}_{i},r,{t}_{i}}\) represents the set of remaining candidate entities.

The meta-optimizer proceeds as follows: first, the meta-learning parameters are initialized, and then the model iterates over the learning tasks. Each task samples \({D}_{r}^{\mathrm{meta-train}}\) and \({D}_{r}^{\mathrm{meta-test}}\) from the \({D}_{r}^{\mathrm{train}}\) dataset, i.e., the support (reference) set and the query set. For each meta-learning task, T quadruples are extracted from \({D}_{r}^{\mathrm{meta-train}}\) as the reference set and one batch is extracted as the query set; a negative query set is then obtained by corrupting the query set, and the matching scores of both are calculated. After the loss function is calculated, the LSTM network performs the gradient computation and the meta-learning parameters are updated. After doing the same for \({D}_{r}^{\mathrm{meta-test}}\), the related meta-training parameters are updated.
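The loop above can be summarized by the following simplified sketch (in PyTorch). `model.match` and `corrupt` are illustrative placeholders, and a plain optimizer step stands in for the LSTM-based meta-optimizer update; this is not the actual FTMO training script.

```python
import random
import torch

def meta_train_step(model, optimizer, task_quads, T=3, batch_size=8, margin=1.0):
    """One meta-training step for a single relation r (simplified sketch).

    `model.match(reference, queries)` is assumed to return one similarity score per
    query; `corrupt(o)` is a placeholder that replaces the object entity. In FTMO
    the plain optimizer step below is replaced by an LSTM-based meta-optimizer update.
    """
    reference = random.sample(task_quads, T)                         # T reference quadruples
    queries = random.sample(task_quads, batch_size)                  # positive query quadruples
    negatives = [(s, r, corrupt(o), t) for (s, r, o, t) in queries]  # pollute the objects

    pos = model.match(reference, queries)                            # similarity to reference set
    neg = model.match(reference, negatives)
    loss = torch.clamp(margin + neg - pos, min=0).mean()             # hinge ranking loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # LSTM-based update in the full model
    return loss.item()
```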

Meta-testing means that after sufficient training on data, the learned model can be applied directly to predict facts for every relation r ∈ R in the TKG. It should be noted that the relations used in meta-testing are unseen during meta-training. In addition, following the same pattern as above, every relation \({r}^{{\prime}}\) in meta-testing also has few-shot training data \({P}_{{r}^{{\prime}}}^{\mathrm{train}}\) containing only the few-shot entity pairs of relation \({r}^{{\prime}}\) and testing data \({P}_{{r}^{{\prime}}}^{\mathrm{test}}\) containing all entity pairs of relation \({r}^{{\prime}}\). The training objective of the model is defined as Eq. (1):

$${\mathrm{min}}_{\ominus }{\mathbb{E}}_{\mathcal{T}}\left[\sum_{\left({s}_{i},{o}_{i},{t}_{i},{Q}_{{s}_{i},r,{t}_{i}}\right)\in {D}_{r}^{\mathrm{test}}}\frac{{\mathcal{L}}_{\ominus }\left({s}_{i},{o}_{i},{t}_{i}|{Q}_{{s}_{i},r,{t}_{i}},{D}_{r}^{\mathrm{train}}\right)}{|{D}_{r}^{\mathrm{test}}|}\right].$$
(1)

Note that the meta-optimization is performed over the model parameters \(\ominus \), whereas the objective is computed using the updated model parameters. Here, \(|{D}_{r}^{\mathrm{test}}|\) is the number of quadruples (s, r, o, t) in \({D}_{r}^{\mathrm{test}}\). Details on how each component is computed and how the model is optimized are discussed in the following sections.

Time-based relation-aware heterogeneous neighbor encoder

In this section, we propose a time-based relation-aware heterogeneous neighbor encoder. Explicitly encoding the local graph structure performs well in relation prediction [37]. Previous neighborhood encoders embed a given entity by averaging the encoded features of its neighbors. Although averaging feature vectors can achieve good performance, it ignores the different influences that heterogeneous neighbors may have on the feature vector, and these influences affect the final results [48]. Because temporal elements are present, we design a time-based relation-aware heterogeneous neighbor encoder building on previous work.

Different from FSRL [49], in this process we upgrade the matrices from three dimensions to four. During neighbor encoding, our model first combines the temporal information with the relation information and then jointly combines the temporal information with the entity information in the subsequent operation. Given a head entity s, the set of its time-based relational neighbors (relation, entity, time) can be represented as \({\mathcal{N}}_{h}=\left\{\left({r}_{i},{o}_{i},{t}_{i}\right)|\left({s,r}_{i},{o}_{i},{t}_{i}\right)\in G^{\prime}\right\}\), where \(G^{\prime}\) is the background TKG, and \({r}_{i}\), \({o}_{i}\), and \({t}_{i}\) are the \(i\)-th relation, the corresponding object entity, and the corresponding timestamp of \(s\). The time-based heterogeneous neighbor encoder can therefore consider the different influences of homogeneous and heterogeneous neighbors \(\left({r}_{i},{o}_{i},{t}_{i}\right)\in {\mathcal{N}}_{h}\) and combine entity and temporal information to compute the feature vector of a specific entity. On this basis, it encodes \({\mathcal{N}}_{h}\) and outputs a feature representation of \(s\). The attention module and the formula for embedding \(s\) are defined as follows:

$${ f}_{\theta }\left(s\right)=\sigma \left({\sum}_i{{z}_{i}}\right),$$
(2)
$${\alpha }_{i}=\frac{\mathrm{exp}\left\{{\mu }_{ro}^{T}\left({\mathcal{W}}_{\text{ro}}\left({e}_{{l}_{i}}\oplus {e}_{{o}_{i}}\right)+{b}_{ro}+{b}_{rt}\right)\right\}}{{\sum }_{i{^{\prime}}}\mathrm{exp}\left\{{\mu }_{ro}^{T}\left({\mathcal{W}}_{\text{ro}}\left({e}_{{l}_{i{^{\prime}}}}\oplus {e}_{{o}_{i{^{\prime}}}}\right)+{b}_{ro}+{b}_{rt}\right)\right\}},$$
(3)
$${e}_{{l}_{i}}={e}_{{r}_{i}}\oplus {e}_{{t}_{i}},$$
(4)
$${z}_{i}={\alpha }_{i}{e}_{{o}_{i}}{e}_{{t}_{i}},$$
(5)

where \(\sigma \) denotes the activation unit, \(\oplus \) represents the concatenation operator, \({e}_{{r}_{i}}\), \({e}_{{o}_{i}}\), and \({e}_{{t}_{i}}\in {\mathbb{R}}^{(d\times d\times 1)}\) are pretrained embeddings of \({r}_{i}\), \({o}_{i}\) and \({t}_{i}\), and \({e}_{{l}_{i}}\) is the variable obtained by concatenating \({e}_{{r}_{i}}\) and \({e}_{{t}_{i}}\). In addition, \({\mu }_{ro}\in {\mathbb{R}}^{(d\times d\times 1)}\), \({\mathcal{W}}_{\text{ro}}\in {\mathbb{R}}^{(d\times d\times 2d)}\), \({b}_{ro}\in {\mathbb{R}}^{(d\times d\times 1)}\), and \({b}_{rt}\in {\mathbb{R}}^{(d\times d\times 1)}\) (d: pretrained embedding dimension) are learnable parameters.

By leveraging the embeddings of entity \({o}_{i}\), relation \({r}_{i}\) and timestamp \({t}_{i}\) to compute the attention weight \({\alpha }_{i}\), the formulation of \({f}_{\theta }(s)\) can account for the different impacts of heterogeneous relational neighbors. The details of the time-based relation-aware heterogeneous neighbor encoder are shown in Fig. 2. First, the relation and temporal information of the quadruples sharing the same subject entity are embedded and joined; this new variable is then combined with the object entity and timestamp embeddings, and the attention weight is calculated from the intermediate quantity. Finally, the feature representation of the subject entity is computed.
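The following sketch illustrates Eqs. (2)–(5) with flat d-dimensional embeddings (the paper uses higher-order tensors); the layer names and exact parameterization are illustrative rather than the actual FTMO implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeAwareNeighborEncoder(nn.Module):
    """Sketch of the time-based relation-aware neighbor encoder (Eqs. (2)-(5)).

    Embeddings are flat d-dimensional vectors here; the exact shapes of
    W_ro, b_ro and b_rt in the paper may differ.
    """

    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(3 * d, d)           # plays the role of W_ro and b_ro
        self.b_rt = nn.Parameter(torch.zeros(d))  # time-related bias b_rt
        self.u_ro = nn.Parameter(torch.randn(d))  # attention vector mu_ro

    def forward(self, e_r, e_o, e_t):
        # e_r, e_o, e_t: (num_neighbors, d) pretrained embeddings of the neighbors of s
        e_l = torch.cat([e_r, e_t], dim=-1)                                  # Eq. (4)
        logits = (self.proj(torch.cat([e_l, e_o], dim=-1)) + self.b_rt) @ self.u_ro
        alpha = F.softmax(logits, dim=0)                                     # Eq. (3)
        z = alpha.unsqueeze(-1) * e_o * e_t                                  # Eq. (5)
        return torch.tanh(z.sum(dim=0))                                      # Eq. (2)

# toy usage with 5 neighbors and d = 8 (values are arbitrary)
enc = TimeAwareNeighborEncoder(d=8)
f_s = enc(torch.randn(5, 8), torch.randn(5, 8), torch.randn(5, 8))
```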

Fig. 2 Time-based relation-aware heterogeneous neighbor encoder

Aggregation network of cyclic automatic encoders

In this section, we design an aggregator network consisting of recurrent autoencoder aggregators. By applying the time-based neighbor encoder \({f}_{\theta }\) to each entity pair \(\left({s}_{k},{o}_{k},{t}_{k}\right)\in {R}_{r}\), the pair can be represented as \({\upepsilon }_{{s}_{k},{o}_{k},{t}_{k}}=\left[{f}_{\theta }\left({s}_{k}\right){\oplus f}_{\theta }\left({o}_{k}\right)\oplus {f}_{\theta }({t}_{k})\right]\). The embedding of \({R}_{r}\) can then be represented as follows [6, 35]:

$${f}_{\epsilon }\left({R}_{r}\right)={\mathcal{A}\mathcal{G}}_{\left({s}_{k},{o}_{k},{t}_{k}\right)\in {R}_{r}}\left\{{\upepsilon }_{{s}_{k},{o}_{k},{t}_{k}}\right\},$$
(6)

where \(\mathcal{A}\mathcal{G}\) is an aggregation function, e.g., a pooling operation or a feedforward neural network.

To apply recurrent neural network aggregators to graph embedding [10], we design a cyclic autoencoder aggregator over the few samples. Specifically, the entity pair embeddings \({\upepsilon }_{{s}_{k},{o}_{k},{t}_{k}}\in {R}_{r}\) are fed into a recurrent autoencoder sequentially, as in Eq. (7):

$${\upepsilon }_{{s}_{1},{o}_{1},{t}_{1}}\to {n}_{1}\to \cdots \to {n}_{k}\to {d}_{k}\to \cdots \to {d}_{1},$$
(7)

where k is the size of the reference set.

In Eq. (7), \({n}_{k}\) denotes the encoding and \({d}_{k-1}\) the decoding, which are hidden states of the encoder and decoder, respectively. \({n}_{k}\) and \({d}_{k-1}\) are calculated as Eqs. (8) and (9):

$${n}_{k}=RN{N}_{\mathrm{encoder}}\left({\upepsilon }_{{s}_{k},{o}_{k},{t}_{k}},{n}_{k-1}\right),$$
(8)
$${d}_{k-1}=RN{N}_{\mathrm{decoder}}\left({d}_{k}\right),$$
(9)

where RNNencoder represents the recurrent encoder and RNNdecoder describes the decoder.

The reconstruction loss for optimizing the autoencoder can be measured as Eq. (10):

$${\mathcal{L}}_{\text{re}}\left({R}_{r}\right)=\sum\limits_{k}{\Vert {d}_{k}-{\upepsilon }_{{s}_{k},{o}_{k},{t}_{k}}\Vert }_{2}^{2}.$$
(10)

The role of \({\mathcal{L}}_{\text{re}}\) is to merge with relationship-level losses to optimize the representation for each entity pair, thereby improving the model performance.

To embed the reference set, we aggregate the hidden states of the encoder with residual connections [11] and attention weights, defined as follows:

$${n}{^{\prime}}_{k}={n}_{k}+{\upepsilon }_{{s}_{k},{o}_{k},{t}_{k}},$$
(11)
$${\beta }_{k}=\frac{\mathrm{exp}\{{\mu }_{R}^{T}({\mathcal{W}}_{R}n{^{\prime}}_{k}+{b}_{R})\}}{\sum\limits_{k^{\prime}}\mathrm{exp}\{{\mu }_{R}^{T}({\mathcal{W}}_{R}n{^{\prime}}_{k^{\prime}}+{b}_{R})\}},$$
(12)
$${f}_{\epsilon }\left({R}_{r}\right)=\sum\limits_{k}{\beta }_{k}n{^{\prime}}_{k},$$
(13)

where \({\mu }_{R}\in {\mathbb{R}}^{\left(d\times d\times 1\right)}\), \({\mathcal{W}}_{R}\in {\mathbb{R}}^{\left(d\times d\times 2d\right)}\), and \({b}_{R}\in {\mathbb{R}}^{\left(d\times d\times 1\right)}\) (d: pretrained embedding dimension).

Compared with FSRL, our model not only improves the processing of temporal information and the matrix dimensions, but also combines LSTM-based updates with a small number of gradient steps, which yields better results. The cyclic automatic aggregation network for the reference set contains an encoder part and a decoder part, as shown in Fig. 3. The encoder uses the LSTM to aggregate the few reference entity pairs and their feature representation vectors to generate the few-shot relation embedding. The decoder uses the LSTM to aggregate the few reference pairs and the intermediate feature representations to compute the reconstruction loss.
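A simplified sketch of the recurrent autoencoder aggregation (Eqs. (7)–(13)) is given below; GRU cells stand in for the recurrent encoder and decoder, and the exact parameterization in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentAutoencoderAggregator(nn.Module):
    """Sketch of the cyclic (recurrent) autoencoder aggregator (Eqs. (7)-(13))."""

    def __init__(self, d):
        super().__init__()
        self.encoder = nn.GRUCell(d, d)
        self.decoder = nn.GRUCell(d, d)
        self.w_r = nn.Linear(d, d)               # W_R and b_R
        self.u_r = nn.Parameter(torch.randn(d))  # attention vector mu_R

    def forward(self, pair_embs):
        # pair_embs: (K, d) embeddings eps_{s_k, o_k, t_k} of the K reference pairs
        K, d = pair_embs.shape
        n = torch.zeros(d)
        hidden = []                                   # encoder states n_1..n_K, Eq. (8)
        for k in range(K):
            n = self.encoder(pair_embs[k].unsqueeze(0), n.unsqueeze(0)).squeeze(0)
            hidden.append(n)
        dec, decoded = n, []
        for _ in range(K):                            # decoder chain d_K..d_1, Eq. (9)
            dec = self.decoder(torch.zeros(1, d), dec.unsqueeze(0)).squeeze(0)
            decoded.append(dec)
        decoded = torch.stack(decoded[::-1])          # align d_k with eps_k
        loss_re = ((decoded - pair_embs) ** 2).sum()  # Eq. (10): reconstruction loss

        n_prime = torch.stack(hidden) + pair_embs                # Eq. (11): residual link
        beta = F.softmax(self.w_r(n_prime) @ self.u_r, dim=0)    # Eq. (12)
        f_R = (beta.unsqueeze(-1) * n_prime).sum(dim=0)          # Eq. (13)
        return f_R, loss_re
```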

Fig. 3 The cyclic automatic aggregation network for reference set

Matching query and reference set

To match \({R}_{r}\) with each query pair \(\left({s}_{l},{o}_{l},{t}_{l}\right)\) of relation r, temporal information is also considered in the matching network. From the previous steps, we have two types of embedding vectors, \({\upepsilon }_{{s}_{l},{o}_{l},{t}_{l}}=\left[{f}_{\theta }({s}_{l}){\oplus f}_{\theta }({o}_{l})\oplus {f}_{\theta }({t}_{l})\right]\) and \({f}_{\epsilon }\left({R}_{r}\right)\). For the recurrent processor, we adopt \({f}_{\mu }\) [11] for multiple matching steps to measure the similarity between these two vectors. The \(t\)-th processing step can be represented as follows:

$${g}_{t}^{{\prime}},{c}_{t}={RNN}_{\mathrm{match}}\left({\upepsilon }_{{s}_{l},{o}_{l},{t}_{l}},\left[{g}_{t-1}\oplus {f}_{\epsilon }\left({R}_{r}\right)\right],{c}_{t-1}\right),$$
(14)
$${g}_{t}={g}_{t}^{{\prime}}+{\upepsilon }_{{s}_{l},{o}_{l},{t}_{l}},$$
(15)

where RNNmatch is the LSTM cell [12], including the hidden state \({g}_{t}\) and the cell state \({c}_{t}\).

The inner product results between \({\upepsilon }_{{s}_{l},{o}_{l},{t}_{l}}\) and \({f}_{\epsilon }({R}_{r})\) are employed as the similarity score. The matching network for query pair and reference set is shown in Fig. 4.

Fig. 4 The matching network for query pair and reference set

First, the query pair is embedded in combination with the LSTM. Second, the reference set embedding is processed in combination with the LSTM. Finally, the similarity score is obtained.
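A minimal sketch of the matching processor (Eqs. (14)–(15)) is shown below. For simplicity, the reference embedding conditions the hidden state additively, whereas the paper concatenates \([g_{t-1}\oplus f_{\epsilon}(R_r)]\); the scoring idea is the same.

```python
import torch
import torch.nn as nn

class MatchingProcessor(nn.Module):
    """Sketch of the LSTM matching processor (Eqs. (14)-(15))."""

    def __init__(self, d, steps=2):
        super().__init__()
        self.cell = nn.LSTMCell(d, d)
        self.steps = steps

    def forward(self, eps_query, f_ref):
        # eps_query, f_ref: (d,) query-pair embedding and reference-set embedding
        g = torch.zeros_like(eps_query)
        c = torch.zeros_like(eps_query)
        for _ in range(self.steps):
            h, c = self.cell(eps_query.unsqueeze(0),
                             ((g + f_ref).unsqueeze(0), c.unsqueeze(0)))
            c = c.squeeze(0)
            g = h.squeeze(0) + eps_query      # Eq. (15): residual connection
        return torch.dot(g, f_ref)            # inner-product similarity score
```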

Objective and model training

To construct a reference set Rr for the query relation r, we randomly select a small set of positive (true) entity pairs \(\left\{\left({s}_{k},{o}_{k},{t}_{k}\right)|\left({s}_{k},r,{o}_{k},{t}_{k}\right)\in G\right\}\). The remaining positive (true) entity pairs are \({\mathcal{P}\upepsilon }_{r}=\left\{\left({s}_{l},{o}_{l},{t}_{l}\right)|\left({s}_{l},r,{o}_{l},{t}_{l}\right)\in G \bigcap \left({s}_{l},{o}_{l},{t}_{l}\right)\notin {R}_{r}\right\}\). On the other hand, the negative (false) entity pairs \({\mathcal{N}\upepsilon }_{r}=\left\{\left({s}_{l},{o}_{l}^{-},{t}_{l}\right)|\left({s}_{l},r,{o}_{l}^{-},{t}_{l}\right)\notin G\right\}\) are created by polluting the object (tail) entities. The ranking loss is computed by Eq. (16):

$$ \begin{aligned} {\mathcal{L}}_{\mathrm{rank}}=&\sum_{r}\sum_{\left({s}_{l},{o}_{l},{t}_{l}\right)\in {\mathcal{P}\upepsilon }_{r}}\\ &\sum_{\left({s}_{l},{o}_{l}^{-},{t}_{l}\right)\in {\mathcal{N}\upepsilon }_{r}}{\left[\xi +{\mathcal{S}}_{\left({s}_{l},{o}_{l}^{-},{t}_{l}\right)}-{\mathcal{S}}_{\left({s}_{l},{o}_{l},{t}_{l}\right)}\right]}_{+},\end{aligned}$$
(16)

where \({\left[x\right]}_{+}=\mathrm{max}\left[0,x\right]\) is the standard hinge loss, \(\xi \) is the safety margin distance in the model, and \({\mathcal{S}}_{\left({s}_{l},{o}_{l}^{-},{t}_{l}\right)}\) and \({\mathcal{S}}_{\left({s}_{l},{o}_{l},{t}_{l}\right)}\) are the similarity scores between query pairs \(\left({s}_{l},{o}_{l}/{o}_{l}^{-},{t}_{l}\right)\) and reference set Rr.

The final objective function can be formulated as Eq. (17):

$${\mathcal{L}}_{\mathrm{joint}}={\mathcal{L}}_{\mathrm{rank}}+\gamma {\mathcal{L}}_{\mathrm{re}},$$
(17)

where \(\gamma \) is the trade-off factor between \({\mathcal{L}}_{\mathrm{rank}}\) and \({\mathcal{L}}_{\mathrm{re}}\). \(\gamma \) is a hyperparameter because the final joint loss cannot simply be the direct sum of the two partial losses. In the experiments, we observe the magnitudes of the two partial losses, scale them to the same order of magnitude, and thereby determine the value of \(\gamma \).

To minimize \({\mathcal{L}}_{\mathrm{joint}}\) and optimize the parameters, we treat each relation as one task and design batch sampling on the basis of meta-training. In each training task, the few-shot entity pairs and a set of queries are first selected and extracted. Then, a set of negative entity pairs is created by polluting the object entities. The feature representations of the subject entities, the reconstruction loss of the autoencoder, the reference-set embedding and matching scores, the ranking loss and the joint loss are calculated successively according to the formulas proposed in this paper. The optimizer parameters are updated until the task is finished. Finally, the optimal parameters are returned.
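The ranking and joint losses (Eqs. (16)–(17)) can be sketched as follows; the scores and the value of γ are illustrative only.

```python
import torch

def ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge ranking loss of Eq. (16): pairs every positive query with every negative one."""
    diff = margin + neg_scores.unsqueeze(0) - pos_scores.unsqueeze(1)
    return torch.clamp(diff, min=0).sum()

def joint_loss(pos_scores, neg_scores, reconstruction_loss, gamma=0.1, margin=1.0):
    """Joint objective of Eq. (17); gamma balances the magnitudes of the two losses."""
    return ranking_loss(pos_scores, neg_scores, margin) + gamma * reconstruction_loss

# toy similarity scores for 3 positive and 4 negative queries (arbitrary values)
pos = torch.tensor([2.1, 1.7, 0.9])
neg = torch.tensor([0.3, 1.5, -0.2, 0.8])
print(joint_loss(pos, neg, reconstruction_loss=torch.tensor(4.2)))
```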

Experiments

Experiment setup

Datasets preprocessing

In previous knowledge graph completion models, every relation in the dataset contains a large number of entity pairs, which ensures training accuracy. However, in the real world there are many relations with only a small number of entity pairs. Therefore, to study such long-tail relations, each relation in a few-shot dataset should have only a small number of entity pairs. Accordingly, following the few-shot standard [18, 35] and inspired by GMatching [44], we adjust the number of entity pairs per relation in the existing datasets and keep it within a low range. For example, in a standard TKG dataset the relation "is the president of" may have approximately 10,000 entity pairs, but in a few-shot dataset it may have only 50–500 entity pairs. To test the performance of our model, we adjust the existing TKG datasets accordingly: we keep the number of entity pairs per relation within this range and reduce the number of relations to fewer than 100. The details of dataset processing are as follows:

In the experiments, ICEWS [4] and GDELT [21] are used for evaluation. The number of entity pairs per relation is kept between 50 and 500, and the number of relations is controlled under 100. We divide each dataset into training, test and validation sets with a ratio of 70:15:7. The statistics of ICEWS-Few and GDELT-Few are listed in Table 2.
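The dataset construction described above can be sketched as follows; the filtering thresholds and split fractions are parameters shown for illustration, and the exact preprocessing of ICEWS-Few and GDELT-Few may differ.

```python
from collections import defaultdict
import random

def build_few_shot_dataset(quadruples, min_pairs=50, max_pairs=500, max_relations=100,
                           split=(0.70, 0.15)):
    """Keep relations with 50-500 entity pairs, cap the number of relations, and split
    the relations into training/validation/test tasks (a simplified sketch)."""
    by_relation = defaultdict(list)
    for s, r, o, t in quadruples:
        by_relation[r].append((s, r, o, t))

    kept = [r for r, quads in by_relation.items() if min_pairs <= len(quads) <= max_pairs]
    kept = kept[:max_relations]
    random.shuffle(kept)

    n_train = int(len(kept) * split[0])
    n_valid = int(len(kept) * split[1])
    train_rels = kept[:n_train]
    valid_rels = kept[n_train:n_train + n_valid]
    test_rels = kept[n_train + n_valid:]
    return train_rels, valid_rels, test_rels
```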

Table 2 Statistics of ICEWS-Few and GDELT-Few

Baselines

We perform two kinds of baseline models for comparisons. One kind is the vector representation and relational embedding models such as TransE [3], DistMult [46], TTransE [19], TA-TransE [9] and TA-DistMult [9]. The other kind is neighborhood coding models such as RE-Net [16], GMatching [44], MetaR [5] and FSRL [49].

Parameter settings

For GMatching, MetaR, FSRL, and FTMO, the optimal hyper-parameters are listed in Table 3, where n is the embedding dimension, λ is the learning rate, x is the maximum size of ICEWS and GDELT, h is the hidden dimension, q is the maximum number of local neighbors in the heterogeneous neighbor encoder, p is the number of steps, a is the weight decay, m is the margin, and f is the trade-off factor. For the other baselines, the optimal hyper-parameters are listed in Table 4, where n is the embedding dimension, B is the batch size of the training data, and v is the dropout probability. The Adam optimizer [17] is used to update the parameters.

Table 3 The optimal hyper-parameters for baseline models on both datasets
Table 4 The optimal hyper-parameters for baseline models on each dataset

Evaluation index

We use the hit ratio (Hits@1, Hits@5, and Hits@10) and the mean reciprocal rank (MRR) to evaluate the performances.
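For completeness, both metrics can be computed from the rank of the true object entity in each query, as in the following sketch (the ranks shown are illustrative values).

```python
def hits_at_k(ranks, k):
    """Fraction of queries whose true entity is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# ranks of the true object entity for 5 hypothetical queries (1 = ranked first)
ranks = [1, 3, 2, 8, 15]
print(hits_at_k(ranks, 1), hits_at_k(ranks, 5), hits_at_k(ranks, 10),
      mean_reciprocal_rank(ranks))
```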

Experimental results

Comparisons with baselines

In this group of experiments, performance comparisons with the baselines on ICEWS-Few and GDELT-Few are presented in Table 5, where the scores before/after the slash are the results on the validation/test set. In Table 5, the best results are shown in bold, and the best results among the comparative baselines are underlined.

Table 5 Performance comparisons on ICEWS-Few and GDELT-Few

From Table 5, we can make the following conclusions.

  1. Between the two groups of baselines, we can clearly see that the models using graph neighborhood encoding outperform the relational embedding methods. This shows that neighborhood encoding handles heterogeneous relational entities better and that combining it with a matching network yields better entity representations and embeddings, thus enhancing entity completion performance.

  2. Our model achieves the best performance on all evaluation metrics, which demonstrates its effectiveness and indicates that preprocessing relational and temporal attributes and jointly using the heterogeneous neighbor encoder and the cyclic autoencoder aggregation network can accomplish few-shot entity completion well.

Comparisons over different relations

In this group of experiments, we perform comparative experiments over different relations to evaluate validity and stability. The comparisons are performed between FSRL and FTMO over ICEWS-Few and GDELT-Few. The experimental results are shown in Tables 6 and 7, where the scores before/after the slash are those of FSRL and FTMO, respectively.

Table 6 The results of FSRL and FTMO for each relation in dataset ICEWS-Few
Table 7 The results of our model and FSRL for each relation in dataset GDELT-Few

From Tables 6 and 7, we can make the following conclusions.

  1. The scores of both models vary considerably across relations. This is because the size of the candidate set differs for different relations during evaluation, so a large variance is normal. As seen in the tables, relations with smaller candidate sets score relatively higher.

  2. Our model is robust across different relations, which shows that it is stable and can cope with atypical cases. The results show that, in most cases, our model performs better for most relations.

Ablation experiments

In this group of experiments, we perform ablation experiments to evaluate the impact of each module in FTMO from three viewpoints: without the time-based relation-aware heterogeneous neighbor encoder (W1), without the cyclic autoencoder (W2), and without the matching network (W3). In W1, we replace the relation-aware heterogeneous neighbor encoder with an average pooling layer over the embeddings of all neighbors. In W2, the cyclic autoencoder aggregator network is replaced by an average pooling operation. In W3, the LSTM is removed and the inner product between the query embedding and the reference embedding is used as the similarity score. The evaluations are performed over ICEWS-Few and GDELT-Few. The experimental results are shown in Tables 8 and 9, where the scores before/after the slash are the results on the validation/test set. The performance differences observed in Tables 8 and 9 indicate the impact of the different modules in FTMO.

Table 8 The results of ablation Experiment for our model and FSRL in dataset ICEWS-Few
Table 9 The results of ablation Experiment for our model and FSRL in dataset GDELT-Few

Stability

Impact of few-shot size. The main task of this paper is to investigate TKG completion with small samples, so we study the influence of the few-shot size K, i.e., the number of reference entity pairs per relation. We conduct experiments on FTMO, FSRL, and GMatching. The evaluations are performed over ICEWS-Few and GDELT-Few, and the experimental results are shown in Figs. 5 and 6. It can be observed that the completion performance of FTMO, FSRL, and GMatching improves as K increases, and the performance of FTMO is always higher than that of FSRL and GMatching. This shows that FTMO has better stability and robustness when completing few-shot TKGs and is more suitable for few-shot TKG completion.

Fig. 5 Impact of few-shot size K in dataset ICEWS-Few

Fig. 6 Impact of few-shot size K in dataset GDELT-Few

Computational complexity analysis

The time cost of FTMO mainly comes from the neighbor encoder, aggregation, and matching modules. The computational complexity of neighbor encoding is O(|R||Ε|d), where |R| is the maximum number of neighbor relations of the task relation r, |Ε| is the number of neighbor entities of the task entities involved in training, and d is the embedding dimension used in the experiments. For the aggregation in FTMO, the time cost of updating the entity representations is O(|E|Ld), where L is the number of aggregation layers and d is the embedding dimension. The computational complexity of the matching processor is O(|R|(|E| + |T|)), where the embedding of the input entity pairs costs O(|R||Ε|) and |T| is the number of neighbor timestamps of the task entities involved in training; the complexity of processing the timestamp information between entities is O(|R||T|). The initial representation of entity, relation and timestamp information in the dataset costs O(|\(g\)|d), where |\(g\)| is the number of tuples in the training set of the temporal knowledge graph. Because the training process runs for N iterations, the total computational complexity of the model is O(N(|R||E|d + |E|Ld + |R|(|E| + |T|) + |\(g\)|d)). By analyzing this complexity, we find that, compared with the baselines, FTMO adds the processing of temporal information, which increases the computational cost. When the number of neighbors of the task relation increases, the time complexity grows further, i.e., training on large-scale datasets takes a lot of time, so the computational efficiency of the model decreases for large-scale datasets.

Defects analysis

Our model combines a time-based relational-aware heterogeneous neighbor encoder, cyclic automatic encoder aggregation network, and matching network to complete few-shot TKG. Although FTMO has better stability and is more suitable for completing few-shot TKG, there are several limitations:

  1. Dataset limitations: Although the model achieves good performance on two datasets, it is designed for few-shot temporal datasets. When applied to other datasets, the data need to be processed into a few-shot temporal dataset containing only a small number of relations and a small number of entity pairs per relation.

  2. Method limitations: In time-based relation-aware neighborhood encoding, the completion of the temporal knowledge graph is improved by processing relational and temporal information first. Because the data in a dataset do not necessarily contain temporal information, the ability to embed entities and relations without temporal information is unknown, which may have an uncertain impact on the final results.

Conclusion

In this paper, we first extracted few-shot datasets from two common datasets according to the few-shot sampling rule and proposed an innovative few-shot relational model to solve the problem of few-shot TKG completion. The proposed model combines a time-based relation-aware heterogeneous neighbor encoder, a cyclic autoencoder aggregator network, and a matching network, and obtains good results in the experiments. We performed experiments on two few-shot datasets, and the results are superior to those of the existing baseline methods: the completion performance of our model improved by up to 12% on the ICEWS-Few dataset and up to 18% on the GDELT-Few dataset in the comparative experiments. In addition, we conducted ablation experiments and analyses of the few-shot size K, and the results show the contribution of each module to model performance and the stability of our model for entity completion.

Because of the complex neural network structure of the FTMO model and the consideration of timestamp information, the tensor dimensionality in FTMO is higher than that in the baselines, which increases the computational complexity. Although the higher-dimensional tensors increase the complexity, they also make the entity embedding and entity-pair aggregation in FTMO more effective than in the baselines, so that the few-shot completion performance of FTMO is higher than that of the baselines. In the future, we will consider these issues and make optimizations. There are also other possible directions for few-shot TKG completion. For example, we can combine entity attributes or textual descriptions to improve the quality of entity embeddings, or consider the relationships among different timestamps of the same triple when processing temporal information. These improvements may further improve model performance.