1 Introduction

Knowledge Graphs are large graph stores that represent real-world facts by connecting entities through relations. These facts follow the triple form (\(e_s\), r, \(e_o\)) where \(e_s\) and \(e_o\) are respectively, the subject and object entities and r is the relation between them. Unfortunately, Knowledge Graphs have many missing facts and therefore are not complete. The amount of facts, entities, and relations is vast and thus, containing all of them in a Knowledge Graph is a complex task, e.g., Wikipedia’s task is to contain information on all branches of knowledge. Furthermore, most Knowledge Graphs model information areas that are continually evolving and changing, which causes the graphs to be incomplete. Different Machine Learning approaches have been proposed to tackle this issue.

Since Knowledge Graphs can be represented as a third-order binary tensor where each position of the tensor represents whether a fact (triple) is true or false, different algorithms based on Tensor Factorization have been proposed for inferring the missing knowledge. These factorization algorithms [2, 25, 44] produce distributed representations of the entities and relations within the Knowledge Graph, also known as Knowledge Graph Embeddings. Embeddings are then used to predict new facts from the known ones.

Embeddings, when distributed, are, however, difficult to interpret. They are real-valued vectors whose individual dimensions can not be understood, i.e., their latent dimensions have low semantic meaning. Several cognitive arguments based on the economy of storage maintain that a small set of features is not enough to describe every semantic concept and domain in our vocabularies [23, 24, 31]. From the cognitive point of view, some words require more characteristics to be described and others that require less. Besides, certain properties belong to specific semantic domains and sharing properties is not natural for humans. Furthermore, we do not describe concepts based on what they are not since it would be uneconomical; thus, we do not store that a desk is not a vegetable or a glass can not fly. Following these arguments, as authors in [23] presented, the feature set of a representation should only store positive facts, a wide range of feature types, and only a small quantity of these features should be used to describe a concept. To achieve such properties, the feature set of the concepts representations should be non-negative and sparse. In this work, we present a solution to get the sparsity property in training time for Knowledge Graph Embeddings. Nevertheless, while we do not strictly induce non-negativity in our method, it highly enforces embeddings to have high non-negativity values, as we demonstrate in Sect. 4.4.

Other approaches also have been proposed to increase interpretability for Knowledge Graph embeddings. Authors from [5] regularize embeddings incorporating additional entity co-occurrence statistics from text data, thus inducing interpretability in the embeddings. [12] extract weighted Horn rules from embeddings to interpret them. In the area of word embeddings, different authors proposed various models for modelling sparse word-embeddings [23, 34].

A common approach to introduce sparsity in machine learning models is to apply the \(l_1\) regularization to the training process. However, the online optimization with stochastic gradient descent (SGD) cannot produce sparsity for the \(l_1\) regularizer making it a non-trivial task. To overcome such an issue, we propose to use the generalized Regularized Dual Averaging (gRDA) algorithm [6], which can produce \(l_1\) regularized embeddings in an online learning setting. We are, to our knowledge, the first to propose a sparse Knowledge Graph Embedding (KGE) solution without any external knowledge.

To conduct our study, we propose the RESCAL model and introduce the \(l_1\) regularizer into the embeddings applying gRDA. We perform experiments to evaluate both the performance and interpretability of the embeddings. We test the embeddings on the link prediction task and compare the performance to the other models. Furthermore, we evaluate the interpretability based on the word intrusion task. Results demonstrate that our approach remains competitive on link prediction while improving the interpretability of the learned embeddings.

In summary, our main contributions are:

  • Presenting the first sparse linear KGE model, sRESCAL, that increases interpretability and remains competitive with the state-of-the-art;

  • Adapting the online optimization algorithm generalized Regularized Dual Averaging to an unknown setting and showing its efficiency;

  • an extensive evaluation and comparison between dense KGE models and sRESCAL.

2 Related work

The literature has widely studied the representation of KG’s entities and relations as continuous vectors in a low-dimensional space. We first review works that produce embeddings based on tensor decomposition methods which are of primary interest to our approach. Then, we present different techniques for interpreting KG embeddings.

2.1 Knowledge graph embedding models

Various tensor factorisation models for link prediction have been proposed in the literature:

RESCAL [25] is a bilinear model which associates entities with a vector that captures latent semantics. Relations are represented as matrices that model pairwise interactions between the latent factors.

DistMult [44] is a specific case of RESCAL where relations are diagonal matrices. Therefore, relations only capture pairwise interactions between the same dimensions. While this model requires fewer parameters, it cannot model asymmetric relations because of its diagonal property.

ComplEx [37] is an extension of DistMult which outputs complex-valued embeddings. This property makes the model asymmetric, improving the performance in non-symmetric relations.

TuckER [2] is a linear model based on the Tucker decomposition [38] and previous linear models can be interpreted as special cases of this. Its power relies on a core tensor which allows multi-task learning by sharing knowledge between entities and relations.

Other models or variations on tensor factorisation such as DUality-induced RegulArizer (DURA) [45], CP decomposition [17], ComplEx [37], KGFP [26], knowledge-driven regularizers for embedding learning [22] or SeAttE [20] are also worth mentioning. Additionally, the survey [21] provides a complete overview of the different models and techniques, and [29] offers a comparative analysis of the approaches.

All of these are based on tensor factorisation techniques and produce distributed representations. However, unlike our model, they do not introduce sparsity in the embeddings, and the semantic meaning of the representations’ dimensions is less understandable. Furthermore, our method could be adapted to different models since it relies on the optimisation process rather than the factorisation itself.

2.2 Interpreting knowledge graph embeddings

Authors in the literature have proposed different methods for interpreting KGEs. In [5] authors induce interpretability in the embeddings by adding a measure of coherence as a regularisation term of the overall loss function using additional entity co-occurrence statistics from the text. The coherence measure allows automated evaluation of the quality of topics learned by topic modelling methods by using additional Point-wise Mutual Information (PMI) for word pairs. In [12], authors adopt “pedagogical approaches” to interpret KGEs and extract weighted Horn rules to increase their interpretability. In the work [3], authors present a model that does predict links and decides whether it is a “topical” or a “social” link. A different work, [43], presents a translational model (ITransF) which learns associations between relations and concepts via sparse interpretable attention vectors. Authors in [9] use the topological properties of a network to explain the contribution of particular categories of features in link prediction. In the work [11] authors improve KGEs using background taxonomic information. Finally, in [1] authors analyse the latent structure and semantic meaning of KGEs based on theoretical interpretations of word embeddings. While all of these works try to increase the interpretability of link prediction in Knowledge Graphs, several properties differentiate our approach: it directly increases the interpretability in the embeddings, the process is done at training time, and it does not use external knowledge in the process.

Not only methods for interpreting KGEs, but different approaches for improving KG-related tasks and techniques were proposed in the literature. In [40], the authors propose an ensemble framework to enhance the robustness and trust of knowledge graph completion. Authors in [39] consider a Bayesian reinforcement learning paradigm to harness uncertainty into multi-hop reasoning; in this manner, the method improves interpretability and performance in the task. Authors in [46] incorporate external knowledge with explicitly syntactic and contextual information for the task of aspect-based sentiment analysis.

Techniques for interpreting word embeddings by sparsity mechanism have also been proposed. In [23], authors apply a sparse non-negative matrix factorisation model to produce word embeddings. In the work [10], they use sparse coding techniques to derive sparse word embeddings from dense word representations. Similar to our work, authors from [35] modify the Continuous Bag of Words model and add the \(l_1\) regulariser in online training by employing the Regularized Dual Averagaging method. Authors in [34] present a k-sparse denoising autoencoder to produce a sparse non-negative high dimensional projection of word embeddings. Finally, [27] produce word embeddings by an underlying LDA-based generative model, which helps to generate sparse vectors. These approaches are closer to the approach presented in this work since they produce sparse embeddings. However, they all focus on word embeddings which are developed differently. Our approach works on tensor factorisation techniques that can not be used to build word embeddings.

3 Sparse tensor factorization

This section presents our approach for applying the \(l_1\) regularization to the RESCAL model via the gRDA optimization algorithm. First, we introduce the background and the RESCAL model itself, then describe the details of the approach. The same technique can be applied to other KGE models such as TuckER or DistMult.

3.1 Background

Let \(\mathcal {E}\) denote the set of all entities and \(\mathcal {R}\) the set of all relations present in a knowledge graph. We denote the collection of triples in a KG as \(\mathcal {D}\) and each triple is represented as \((e_s, r, e_o) \in \mathcal {D}\), with \(e_s, e_o \in \mathcal {E}\) denoting subject and object entities respectively and \(r \in \mathcal {R}\) the relation between them.

In the link prediction task the objective is to learn a scoring function \(\phi\) which scores, \(s = \phi (e_s,r,e_o) \in \mathbb {R}\), whether a triple is true or false. To complete the task, we observe a subset of all true triples, aiming to score all the missing ones correctly. In this work, we only consider scoring functions given by a tensor factorization technique, e.g., RESCAL.

3.2 RESCAL model

The RESCAL model [25] is a relaxed version of the DEDICOM [13] matrix factorization method, which decomposes a matrix into two matrices that provide an asymmetric relation between entities.

In the case of KGs the tensor is three-mode, therefore the original tensor \(\mathcal {X} \in \mathbb {R}^{n_e \times n_e \times n_r}\), is decomposed by RESCAL into an entity matrix \({\textbf {A}} \in \mathbb {R}^{n_e \times d}\) and k relation matrices \({\textbf {R}}_k \in \mathbb {R}^{d \times d}\) where \(n_e\) represents the number of entities in the KG and d is a hyperpameter for the embedding dimensionality. Thus, the k-th slice of the tensor is factorized as

$$\begin{aligned} \mathcal {X}_k \approx A {\textbf {R}}_k {\textbf {A}}^T, \text {for } k = 1,..., n_r \end{aligned}$$
(1)

where \(n_r\) is the number of slices (relations) in the tensor.

RESCAL is a latent feature model which scores triples by the interaction of the latent features of the subject and object entities. The scoring is given by

$$\begin{aligned} \phi (e_s, r, e_o) = {\textbf {e}}_{s}^{T} {\textbf {R}}_k {\textbf {e}}_o, \end{aligned}$$
(2)

where \({\textbf {e}}_{s}, {\textbf {e}}_{o} \in \mathbb {R}^{d}\) are respectively the subject and object embeddings from the entity embedding matrix E (A in Eq. 1), and \({\textbf {R}}_k\) is the asymmetric relation matrix corresponding to the k-th relation in the KG.

3.3 Sparse RESCAL

We aim to introduce sparsity into the embeddings via \(l_1\) regularization. As we know, the \(l_1\) regularization applies a penalty in the learned weights, which makes them push towards 0, leading to finding sparse solutions for the embeddings.

The RESCAL model was originally [25] optimized using the Alternating Least Squares (ALS) method. Nevertheless, different works have pointed out that newer training techniques improve model performance [15, 18]. Furthermore, in a recent work [30] that provides an extensive analysis on KG models and their training processes, authors present the Adam [16] method to be the best for training RESCAL. However, as Adam is a stochastic gradient descent (SGD) extension, it can not produce sparse solutions directly applying \(l_1\) regularization to the loss function. The minor updates which SGD makes in the vectors’ values difficult the output of many zero entries [8], i.e., it is quite challenging to get two float numbers to add up and equal zero. To overcome this issue, we propose using generalized Regularized Dual Averaging (gRDA) for optimizing the RESCAL model with sparse constraints. gRDA is itself a generalization of the RDA algorithm for sparse neural networks [42], which can be very useful for sparse online learning with \(l_1\) regularization as it can explicitly exploit the regularization structure. In each iteration of the RDA algorithm, the weights of the model are updated, taking into account not only the loss function but also the whole regularization term introduced to achieve the sparsity.

To apply \(l_1\) regularization to our model, we first update the RESCAL model’s loss function:

$$\begin{aligned} \mathcal {L}_\mathrm{sRESCAL} = \mathcal {L}_\mathrm{RESCAL} + \lambda \sum _{w \in W}{\{|| w \}||_1} \end{aligned}$$
(3)

where \(\lambda\) is a hyperparameter that controls the importance of the regularization term.

We follow the same training procedure as [2]. We apply the data augmentation method used by [7] adding reciprocal relations for every triple in the dataset. We also use 1-N scoring, i.e., we score all entity-relation pairs \(\{( e_s,r \})\) and the corresponding inverse \(\{( e_o,r^{-1} \})\) with every entity \(e \in \mathcal {E}\). We use the Binary Cross Entropy (BCE) loss function:

$$\begin{aligned} \mathcal {L}_\mathrm{RESCAL} = - \frac{1}{n_e} \sum ^{n_e}_{i=1} \{( {\textbf {y}}^{i} \log \{({\textbf {p}}^{i} \}) + \{( 1 - {\textbf {y}}^{i} \}) \log \{( 1 - {\textbf {p}}^{i} \}) \}), \end{aligned}$$
(4)

where \({\textbf {p}} \in \mathbb {R}^{n_e}\) is the vector of predicted probabilities and \({\textbf {y}} \in \mathbb {R}^{n_e}\) is the binary label vector.

3.4 Optimization

In gRDA, at each iteration, the learning weights are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss function. Its update rule is the following:

$$\begin{aligned} {\mathbf{w}}_{{n + 1}} & = \mathop {\arg \min }\limits_{{{\mathbf{w}} \in \mathbb{R}^{d} }} \left\{ {{\text{w}}^{T} \left. {\left\{ {\left( { - \rm{w}_{0} + \gamma \sum\limits_{{1 = 0}}^{n} {\left. {\Delta f\left\{ {(w_{i} ;Z_{{i + 1}} } \right\}} \right)} } \right.} \right\}} \right)} \right. \\ & \quad + \left. {g\left. {\left\{ {(n,\gamma } \right\}} \right){\mathcal {P}}\left( {\text{w}} \right) + F\left( {\text{w}} \right)} \right\} \\ \end{aligned}$$
(5)

where \(\gamma\) is the step size, \(\mathcal {P}({\textbf {w}})\) is the penalty function, \(F({\textbf {w}})\) is a deterministic and convex regularizer which stabilizes the optimization process in the same manner as proposed in [32] and \(g \{(n, \gamma \})\) is a deterministic non-negative function of \(n, \gamma\). Notice that RDA is a special case of gRDA where \(g \{(n, \gamma \}) = n\gamma\) and \({\textbf {w}}_0 = 0\).

Since we want to apply the \(l_1\) regularization to the RESCAL model, we follow the same criteria given by [6] for gRDA-\(l_1\). We set the strongly convex function \(F({\textbf {w}}) = \frac{1}{2}\{|| {\textbf {w}} \}||_2^2\) and the penalty function to be \(\mathcal {P}({\textbf {w}}) = \{|| w \}||_1\). We also follow the function

$$\begin{aligned} g(n, \gamma ) = c\gamma ^{\frac{1}{2}}(n\gamma - t_0)_+^\mu , \end{aligned}$$
(6)

which is conjectured by authors to be a universal formula for applying gRDA-\(l_1\) in difficult tasks. In Eq. 6, \(c = \gamma\) and \(\mu\) are hyperparameters and \(t_0 \ge 0\) is the time mean dynamics. Furthermore, \(\mu\) is the trade-off hyperparameter between accuracy and sparsity.

Following the stochastic mirror descent representation of gRDA (see [6] Section 2) and given \(F({\textbf {w}}) = \frac{1}{2}\{|| {\textbf {w}} \}||_2^2\), we can use the closed-form proximal operator of \(g(n,\lambda )\mathcal {P}({\textbf {w}})\) [28], for \(j = 1,2,...,d\),

$$\begin{aligned} \Delta \Psi ^*_{n,\gamma ,j}({\textbf {v}}) = \text {sgn}(v_j) \cdot \{( \{| v_j \}| - g \{( n,\gamma \}) \})_+, \end{aligned}$$
(7)

which will serve as our penalty function \(\mathcal {P}\) for gRDA-\(l_1\). Thanks to the closed-form proximal operator, the computational cost per iteration in the stochastic mirror descent gRDA is as cheap as SGD.

We present the algorithm for optimizing the sparse RESCAL model (sRESCAL) in Algorithm 1. We first initialize the weights of the entity and relation matrices (for each slice) \({\textbf {E}}, {\textbf {R}}_k\). Then, for each triplet n in the training set \(\mathcal {D}\), we update the accumulator of gradients v by applying Eq. 6. Finally, we set the weights by following the update rule in Eq. 7.

figure a

4 Experiments

In this section, we analyze the performance in the link prediction task and the interpretability of the proposed algorithm. We compare our sparse model with several dense model baselines. First of all, we describe the datasets and the setup used for the experiments. Then, we study the link prediction task and compare our model with the state-of-the-art dense models. Finally, we analyze the interpretability of the model using the word intrusion task.

4.1 Datasets

The datasets used for the evaluation are the following (see Table 1):

  • FB15k [4] is a subset of the Freebase database which contains facts about the world.

  • FB15k-237 [36] is a better-suited version of FB15k where the inverse of many relations was removed to increase the difficulty.

  • WN18 [4] is a subset of WordNet, a hierarchical database containing lexical relations between words.

  • WN18RR [7] is a better-suited version of WN18 where the inverse of many relations was removed to increase the difficulty.

Both datasets were created for the link prediction task.

Table 1 Datasets for link prediction and their number of entities and relations

4.2 Implementation details

We open source the PyTorch implementation of the sRESCAL model on GitHubFootnote 1.

We select the hyper-parameters using random search by validation set performance. For FB15k and FB15k-237, we select the entity and relation embedding size from 128, 248, 512. Embedding size for WN18 and WN18RR is chosen from 50, 100, 200 due to the small number of relations in the dataset. Additionally, we apply batch normalization [14] and dropout [33] to ease the training procedure. We choose the dropout value in the range (0.1, 0.5) for every dataset. The learning rate is selected within the range (0.1, 0.9) also for both datasets as we observed high learning rate values necessary to overcome local minima. We set c to 0.00005 since we found it to be a suitable value for the starting sparsity in the embeddings. We choose \(\mu\) in the range (0.6, 0.8) where high values correspond to higher sparsity in embeddings. We analyze the impact of \(\mu\) in the following subsections. Finally, the batch size is selected from 248, 512, 1028. The best hyperparameters for each dataset are presented in Table 2.

Table 2 Best hyperparameters for sRESCAL for datasets where: LR denotes learning rate, \(\mu\) is the trade-off hyperparameter between accuracy and sparsity, c is the hyperparameter controlling the starting sparsity value, \(d_e\) and \(d_r\) correspond to the entity and relation embedding size, \(d\#k, k \in \{1, 2, 3\}\) are the dropout values applied on the subject entity embedding, relation matrix, and subject entity embedding after it has been transformed by the relation matrix respectively, and \(\mathrm{LS}\) is the label smoothing

4.3 Link prediction

For the evaluation of the link prediction task, we create all possible candidate triples by adding every entity in \(\mathcal {E}\) to the test entity-pair, then we use our model to score and rank each of the candidates. We apply the filtered setting where all known true triples in the Knowledge Graph \(\mathcal {D}\) are removed from the candidate set. We use the standard metrics for the task: mean reciprocal rank and hits@k, k \(\in\) 1, 3, 10. Mean reciprocal rank is the average of the inverse of the mean rank assigned to the true triple overall candidate triples. Hits@k measures the percentage of times a true triple is ranked within the top k candidate triples.

Table 3 Results in link prediction

We present the link prediction results in Table 3. Although sRESCAL does not outperform the state-of-the-art models, it does remain competitive while finding more interpretable and sparse solutions (see Sect. 4.4).

Regarding the simplest versions of the datasets, in WN18, sRESCAL achieves high results on every metric and gets close to the other models. For FB15k, sRESCAL suffers the same phenomena as the standard RESCAL model. The performance on the dataset is considerably lower when compared to other models. We argue that this happens due to not finding the optimal hyperparameters for the dataset. Results from [30, 41] also provide low values for RESCAL in such dataset. However, as we will see next, in the more challenging version of the dataset, FB15k-237, RESCAL, and sRESCAL provide significantly better results, being two of the strongest models overall.

Regarding the most challenging datasets, sRESCAL achieves fantastic results, being competitive in every metric. Our model equates DistMult and is close to ComplEx performance-wise in the WN18RR dataset and stays close to the state-of-the-art model, TuckER, by less than 4 points in the metrics. For FB15k-237, sRESCAL outperforms both DistMult and ComplEx by a significant margin in every metric. Furthermore, the performance almost equals RESCAL and TuckER providing fantastic results on every metric.

While the performance of sRESCAL remains competitive to other models, it does also achieve high sparsity and interpretability levels. Thus, we present sRESCAL as a valid and robust model for the link prediction task. Furthermore, as stated before, the induction of sparsity in the resulting embeddings will increase their interpretability by raising semanticity in their latent dimension. The increase of interpretability is further analyzed in Sect. 4.4 and the outcome is observable in Table 6. However, the introduction of sparsity to the embeddings does also have a penalization on performance. Thus, we analyze the tradeoff between sparsity, thus interpretability, and performance.

4.3.1 Sparsity and performance trade-off

The trade-off between the sparsity level and performance in the embeddings is a crucial aspect to study. sRESCAL is a flexible model that can be tuned to increase or decrease its sparsity levels and, thus, its performance.

Since sRESCAL uses the gRDA-\(l_1\) optimization algorithm, sparsity is defined by hyperparameters c (initial sparsity level) and \(\mu\), which defines the penalization applied to the learning weights. By tuning those hyperparameters, we can achieve the sparsity level we want at the cost of lowering the final performance of the model. To demonstrate the variation on those hyperparameters, we present different metrics: (a) mean rank value, (b) DistRatio (an interpretability metric based on the word intrusion task) value and (c) sparsity percentage for dataset FB15k-237 in Fig. 1. We provide results on different values of \(\mu\) while fixing every other hyperparameter (we also fix c since we found the value 0.0005 to be the most optimal for sRESCAL in every case).

Results show a clear trade-off between \(\mu\), thus the sparsity level of the solution, and the final results on both performance and interpretability. Figure 1a presents a difference between \(\mu\) values where high values, 0.85 and 0.65, achieve worst results on mean rank when compared to the lower, 0.25, 0.45 values. However, we find that \(\mu =0.65\) remains competitive to lower values as the difference in the mean rank is quite small. Furthermore, when revising the interpretability metric DistRatio (further explained in next subsection) in (b), the difference between the most interpretable models with high \(\mu\) values is evidenced. As expected, solutions with a high \(\mu\) value achieve good results on DistRatio while those with low \(\mu\) perform worse. This fact is directly related to the sparsity level of the solution, which is shown in (c), where high \(\mu\) valued solutions have higher sparsity levels. As we stated, gRDA allows sRESCAL to be a model that obtains high sparsity values and improves its interpretabilityFootnote 2. Nevertheless, this comes with a trade-off on the standard link prediction performance metrics such as mean rank, lowering the model’s performance on those but raising the interpretability of the outcome embeddings.

Fig. 1
figure 1

Results on FB15k-237 for fixed hyperparameters except \(\mu\)

4.4 Interpretability

In this section, we analyze the interpretability of our sparse model compared to the dense state-of-the-art models. In the same manner as [10, 23], we perform experiments on the word intrusion task to evaluate the interpretability of our model. For this section, we only consider the datasets FB15k-237 and WN18RR due to their difficulty and the fact that they are the standard datasets in the current literature.

The word intrusion task evaluates coherence regarding the semantic meaning in each dimension belonging to the representations. Since sparse models create embeddings with only a few dimensions activated, i.e., the value is not 0, each dimension should correspond to a meaningful semantic concept. The task works in the following way: for each dimension i of the embeddings, it finds the k (\(k \in \mathbb {N}\)) most relevant Knowledge Graph entities, which are those that have the highest values in the corresponding dimension. Then, a non-relevant Knowledge Graph entity (one that has a low value in the corresponding dimension) is chosen, and it is combined with the top-k entities to create a set of k + 1 entities for dimension i. We refer to the non-relevant Knowledge Graph entity as the intruder, and we seek to identify it. An example set of entities for \(k = 5\): {Screenwriter, Film director, Film producer, Television producer, Actor, Erie} where Erie is the intruder entity since it is a city and does not belong to the TV and film industry. Human annotators usually perform the word intrusion task; however, we adopt the automatic version of the task presented in [35]. In this version, the DistRatio evaluation metric is applied. The idea is that to find the intruder entity automatically, its distance to the top-k entities should be high. We measure the distance ratio between the intruder entity and the top-k entities to the distance between the top-k entities themselves. High ratio values mean better interpretability levels because the intruder entity is far (in the embedding space) from the top words (which are close due to their semantic meaning). The metric can formally be presented as:

$$\begin{aligned} {\rm DistRatio}= {} \frac{1}{d}\sum _{i=1}^d \frac{{\rm InterDist}_i}{{\rm IntraDist}_i} \end{aligned}$$
(8)
$$\begin{aligned} {\rm IntraDist}_i= {} \sum _{w_j\in {\rm top}_k(i)} \sum _{\begin{array}{c} {w_j\in {\rm top}_k(i)} \\ {w_k \ne w_j} \end{array}} \frac{\rm dist(w_j, w_k)}{k(k-1)} \end{aligned}$$
(9)
$$\begin{aligned} {\rm InterDist}_i= \sum _{w_j \in {\rm top}_k(i)} \frac{{\rm dist}(w_j, {w_b}_i)}{k} \end{aligned}$$
(10)

where \(\mathrm{top}_k(i)\) denotes the top-k entities corresponding to dimension i, \({w_b}_i\) denotes the intruder entity for dimension i, \(\mathrm{dist}(w_j,w_k)\) is the distance between entities \(w_j\) and \(w_k\), \(\mathrm{IntraDist}_i\) is the average distance between the top-k entities on dimension i and \(\mathrm{InterDist}_i\) denotes the average distance between the intruder entity and top-k words on dimension i. We set \(k=5\) and the distance function to be the Euclidean distance.

We present the interpretability results on FB15k-237 and WN18RR in Tables 4 and 5 respectively. We provide sparsity percentages, DistRatio and negativity percentages for entities and relations for sRESCAL, RESCAl and TuckER models. sRESCAL achieves the best results on DistRatio for both datasets, except in relations for FB15k-237, while maintaining high sparsity values. Both RESCAL and TuckER do not have any sparsity level since they are optimized using the Adam method. Furthermore, our experiments demonstrate the effectiveness of sparsity for developing more interpretable embeddings. Moreover, we present results on negativity since gRDA-\(l_1\), while not constraining to positive values, enforces non-negativity on embeddings. Positivity is an important characteristic for the interpretability of the embeddings since human people do not describe a concept by what is not. Besides, positivity allows to perform additive combinations and create representations from differents parts [19, 23]. Results present sRESCAL as a highly non-negative method when compared to RESCAL and TuckER, improving interpretability in the embeddings.

Table 4 Sparsity, DistRatio, and negativity results for FB15k-237
Table 5 Sparsity, DistRatio, and negativity results for WN18RR

We also provide a qualitative evaluation of the interpretability results. We select the top 5 words of a few dimensions in RESCAL and sRESCAL and present them in Table 6. While the top 5 words provided by the standard RESCAL model do not have any coherence or semantic meaning, the sRESCAL model gets clear topics. From the first row to the fifth: dog breed, films, actors and directors, music topics (instruments), and basketball teams.

Table 6 Top 5 words of some dimensions in RESCAL and sRESCAL

5 Conclusion

In this work, we present a novel technique for learning sparse Knowledge Graph Representations. The approach uses the generalized Regularized Dual Averaging online optimization algorithm and applies the \(l_1\) regularization into the RESCAL model. The experiments demonstrate that the sparse RESCAL remains competitive with the state-of-the-art while achieving high sparsity and non-negativity in the embeddings. We prove that gRDA-\(l_1\) can effectively be applied to RESCAL and other tensor factorization models. This fact provides the opportunity to increase the interpretability in the link prediction task from the core elements used to perform it, the embeddings. Furthermore, the results in the word intrusion task show that the model’s interpretability is effectively improved. The developed embeddings contain a higher semantic coherence and are more understandable for people. In the future, we consider to extend gRDA-\(l_1\) to other tensor factorisation models.