Deep embeddings and Graph Neural Networks: using context to improve domain-independent predictions

Graph neural networks (GNNs) are deep learning architectures that apply graph convolutions through message-passing processes between nodes, represented as embeddings. GNNs have recently become popular because of their ability to obtain a contextual representation of each node that takes into account information from its surroundings. However, existing work has focused on the development of GNN architectures, using basic domain-specific information about the nodes to compute embeddings. Meanwhile, in the closely related area of knowledge graphs, much effort has been put towards developing deep learning techniques to obtain node embeddings that preserve information about relationships and structure without relying on domain-specific data. The potential application of deep embeddings of knowledge graphs in GNNs remains largely unexplored. In this paper, we carry out a number of experiments to answer open research questions about the impact on the performance of GNNs when they are combined with deep embeddings. We test seven different deep embeddings across several attribute prediction tasks in two state-of-the-art attribute-rich datasets. We conclude that, while there is a significant performance improvement, its magnitude varies heavily depending on the specific task and deep embedding technique considered.


Introduction
Graph Neural Network (GNN) architectures seek to leverage the connections in a graph when making predictions about the elements of the graph [44]. To achieve this, the nodes of the graph are represented as numeric vectors called embeddings that capture and summarise the implicit information present in them. For example, graph convolutional layers [27] aggregate the embeddings of each node with those of its neighbours to endow them with contextual information. These architectures give the networks a more complete understanding of the graph, which helps them make node predictions for a variety of tasks, such as image and text classification or natural language processing, among others. Research in this area has so far mainly focused on developing new GNN architectures or applying existing ones to new domains. However, the type of embeddings being used has received significantly less attention. In most cases, node embeddings are defined manually and created using any domain-specific information available; for example, when representing data from the academic research domain in which the nodes are scientific papers, embeddings can be bag-of-words representations of the textual contexts of a paper [21]. Another example can be found in the genetic information domain, using genetic positional sequences as embeddings [18]. Only in a few cases have knowledge graph embeddings been considered when dealing with GNNs, and these works just employ popular embedding approaches such as TransE [4] and DistMult [45] as baselines in tasks like link prediction [31,38].
1 University of Seville, ETSII, Avda. Reina Mercedes, s/n., Seville, Spain
At the same time, Knowledge Graphs (KGs) have become a popular research topic as companies such as Google, Facebook, and Amazon [28] are increasingly relying on them to integrate and curate information to support their business processes. A considerable amount of research effort focuses on developing deep learning techniques that are able to obtain embeddings in a domain-independent way [40]. This is usually done by training a neural network to generate task-specific embeddings, and then using said neural network to carry out some prediction task using the generated embeddings. In addition, it is also possible to leverage these embeddings by using them as input for other deep learning architectures such as GNN, and to tackle prediction tasks different from those they were initially designed for. This is feasible because a well-produced embedding space contains latent information about the elements represented in it [40], and therefore, they can be fed to architectures for a variety of tasks, such as question answering [51], KG completion [5,19], scientific fact-checking [6], clustering [20], or plagiarism detection [12], to mention a few.
Additionally, deep embedding techniques can automatise the generation of embeddings from heterogeneous datasets, such as those containing nodes with different types and attributes. This can be a significant improvement over handmade embeddings, usually tied to certain entity types and attributes, e.g. the aforementioned bag-of-words representation of nodes symbolising textual documents. More recent approaches have introduced attributive embeddings [14] to be able to leverage the available information about the attributes of a node, which are usually numerical or textual values. This may be beneficial for prediction tasks in which the value of a property or the similarity between node properties can be exploited.
It has caught our attention that, while GNNs and deep KG embeddings are clearly related, there are almost no insights in the existing literature on how well they perform when combined, because most efforts have been put towards the application of GNNs to different fields and the development of new GNN models and architectural variants [1]. While domain-specific embeddings benefit from GNNs, we believe that it is interesting to study the effect of using deep KG embeddings with GNNs because of their different nature and aforementioned advantages. This motivated us to carry out an experiment to shed some light on the feasibility of such a combination.
In this paper, we present the results of an experimental study to analyse the benefits of combining deep KG embeddings with a baseline GNN architecture. Our focus is not on the specific GNN architectures and deep KG embeddings or their features, which have been thoroughly researched in previous works; instead, we want to analyse the performance improvement that can be expected from a state-of-the-art combination of both approaches. The improvement resulting from GNNs has been thoroughly studied for manually defined embeddings, but remains unexplored in the field of deep embeddings. These may be more challenging to exploit given their unsupervised nature, but can be computed from any graph without homogeneity restrictions or the need to manually define suitable representations. Specifically, our work focuses on comparing the results obtained by a baseline feedforward network and a standard GNN when trying to predict the value of different attributes. We test seven different deep embedding techniques on seven attribute prediction tasks. We focus mainly on testing attributive embeddings, since they contain additional information that could increase the benefits of applying a GNN; therefore, we limited ourselves to datasets that are standard in the deep embedding literature and rich in attributes, namely YAGO [35] and FB15K237 [37]. The results obtained by these configurations contribute towards the state-of-the-art by answering a series of open research questions about how much different types of neural networks can benefit from deep embeddings when performing prediction tasks.
The rest of this paper is structured as follows: Sect. 2 details the state-of-the-art in the fields of GNNs and KG deep embeddings. Section 3 describes the specific research questions we have identified and the neural network architectures used in our experiments. Section 4 describes the experimental setup and discusses the obtained results. Finally, Sect. 5 summarises our contributions and discusses potential future work.

Related work
In the following subsections, we summarise the current state-of-the-art both in the fields of GNNs and deep node embedding techniques.

Graph neural networks
Back in the 1990s, neural networks were first applied to graphs by propagating states from one node to the others iteratively until a stable point was reached, using recurrent graph neural networks (RGNNs) [44]. Their main drawbacks were that they were computationally costly and lacked representation capabilities and extendability. Later, several approaches emerged that tried to leverage the progress in convolutional techniques and redefined the concept of convolution on graphs by using not only the features of a node, but also those of its neighbours [7]. This type of procedure is common in image processing, in which pixels are updated with the features of adjacent pixels. The resulting networks are known as convolutional graph neural networks (CGNNs), and are further divided into two groups: spectral-based approaches and spatial-based ones [49]. RGNNs and CGNNs are closely related, as both are based on the same principle of updating node representations with neighbouring information. Their main difference is that RGNNs always use the same recurrent layer, with contractive constraints to ensure convergence, whereas CGNNs use several convolutional layers with different weights in each of them. These characteristics make CGNNs more flexible, more powerful and less costly than RGNNs, which are nowadays mostly considered pioneering works on GNNs [44] that inspired later research on convolutional networks. CGNNs have therefore emerged as the dominant architecture for graph-related tasks, and for that reason this work focuses on these widespread, state-of-the-art approaches.
In other matters, the performance of GNNs when applied on a KG might be influenced by the size and type of the graph, so these aspects should be taken into consideration. KGs can be classified as follows [52]:
• Directed vs. undirected: directed edges provide more information than undirected ones, which can also be seen as double-directed edges.
• Homogeneous vs. heterogeneous: heterogeneous graphs provide a type for each node and edge, adding additional information to them.
• Spatiotemporal vs. static: on dynamic graphs, also known as spatial-temporal ones, topology or features change over time, a characteristic that needs to be properly addressed.
• Large vs. small: there are not clearly defined criteria to distinguish between a small or large graph; the boundaries are ever changing due to computation capabilities improvements on devices like GPUs.
There are different kinds of tasks that can be carried out using GNNs: node attribute prediction, node classification [21], link or edge strength prediction [17,26,33], and graph-level tasks such as graph classification [29,46,48]. Nonetheless, some challenges regarding GNNs are still to be addressed. The literature especially reports scalability issues, as these techniques usually require having the graph loaded in memory in order to perform the convolutions, since sampling or clustering may end up losing neighbourhood information [44]. Other challenges that remain unsolved are defining a method to systematically select the optimal GNN architecture for a given network or task, and finding the most suitable knowledge graph embedding technique to maximise the performance of the network.

Deep node embedding techniques
Knowledge graph deep node embeddings have been widely studied recently because of their numerous possibilities [16]. Such embeddings aim to represent nodes as multi-dimensional vectors while retaining information about the structure of the graph and the attributes of its nodes, so that they can be used as input for other algorithms in subsequent tasks. Consequently, the performance of GNNs, as deep learning architectures, can be influenced by the type of node embedding they are provided with.
Typical deep embedding approaches use distance-based scoring functions to produce embeddings and to maintain information about the relations between nodes. This way, with a triple <s,r,t> where s and t are source and target nodes and r the relation between them, the embedding of s plus the embedding of r should be near the embedding of t in the corresponding multi-dimensional space. In this regard, these approaches only take into consideration the topology of the graph, and so they are called structure-based embeddings.
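As an illustration of the distance-based scoring described above, the following minimal sketch computes the distance between s + r and t for a triple. The function name, the use of the Euclidean norm and the toy three-dimensional vectors are assumptions made for the example; they are not taken from any of the cited implementations.

```python
import math

def transe_score(s, r, t):
    """Euclidean distance between (s + r) and t; lower means more plausible."""
    return math.sqrt(sum((si + ri - ti) ** 2 for si, ri, ti in zip(s, r, t)))

# Toy 3-dimensional embeddings (made-up values, for illustration only).
seville = [0.2, 0.1, 0.5]
located_in = [0.3, 0.4, -0.1]
spain = [0.5, 0.5, 0.4]
france = [-0.6, 0.9, 0.0]

# A well-trained embedding space should place s + r close to the true target.
true_distance = transe_score(seville, located_in, spain)
corrupted_distance = transe_score(seville, located_in, france)
```

In this toy setting the true triple obtains a distance close to zero, while the corrupted triple, in which the target is replaced, obtains a clearly larger one.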
When using these techniques, literal information contained in nodes such as textual, numeric or even image properties is not leveraged. Since these literals provide useful information, the challenge lies in learning embeddings taking them into account, which can be addressed in two ways [14]. The first option is to handle literals separately, i.e., training the classical structure-based embedding and a node feature-learner one at the same time so that the network uses both data sources in each step to learn the node embeddings. The second option is to combine the classical structure-based embedding with the node literals by adding, multiplying, concatenating, etc. these features in the form of additional embeddings. Intuitively, these attributive embeddings contain much richer information that can be helpful for GNNs and their message-passing nature.
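The second option can be sketched as follows for the simplest case, concatenation of a numeric literal to a structural embedding. This is only an illustrative sketch: the helper names, the min-max scaling step and the toy values are assumptions, and the techniques surveyed in [14] use more elaborate, learned combinations.

```python
def min_max_scale(values):
    """Scale a {node: literal} mapping to the [0, 1] range."""
    low, high = min(values.values()), max(values.values())
    if high == low:
        return {node: 0.0 for node in values}
    return {node: (v - low) / (high - low) for node, v in values.items()}

def concat_literals(structural, literals):
    """Append the scaled literal of each node to its structural embedding."""
    scaled = min_max_scale(literals)
    return {node: vector + [scaled[node]] for node, vector in structural.items()}

# Hypothetical structural embeddings and one numeric attribute per node.
structural = {"a": [0.1, 0.2], "b": [0.3, 0.4]}
population = {"a": 100.0, "b": 300.0}

attributive = concat_literals(structural, population)
```

Addition or multiplication would require the literal vector to have the same dimensionality as the structural embedding, which is why concatenation is used in this sketch.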
Some of the classical knowledge graph embedding generation techniques are TransE [4] and TransR [24]. TransE trains an embedding vector for each entity and relation, so that the sum of the embedding of the source node and the relation embedding approximates the embedding of the target node. TransR works in a similar way, but performs the addition after projecting the entity embeddings into a relation-specific space, separate from the entity space.
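The difference between both techniques can be sketched with a small example of the TransR projection step. The function names, the dimensionalities and the hand-picked projection matrix are assumptions for illustration; in practice the projection matrix of each relation is learned during training.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def transr_score(s, r, t, projection):
    """Distance between the projected source plus relation and the projected target."""
    s_proj = matvec(projection, s)
    t_proj = matvec(projection, t)
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(s_proj, r, t_proj)))

# Entities live in a 3-dimensional space, the relation in a 2-dimensional one.
projection = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]
source = [1.0, 2.0, 5.0]
target = [1.5, 1.0, -3.0]
relation = [0.5, -1.0]

score = transr_score(source, relation, target, projection)
```

Here the third entity dimension is discarded by the projection, so the triple is scored as plausible even though the two entities differ strongly in that dimension.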
In terms of attributive embedding techniques, ASNE [23] trains a network that predicts the node connections from the concatenation of an embeddings vector associated to the node itself (structural embedding) and another one associated to the attributes. The vector of attribute embeddings is created from the concatenation of numeric attributes (including their value) and categorical attributes (by one-hot encoding). It does not take into account non-categorical textual attributes, nor does it consider the existence of different types of relations.
LiteralE [22] integrates the embedding of each node with a vector of the node's attribute values. This integration is done by a modular function that enriches the embeddings before providing them to an existing embedding training network; the authors propose and provide implementations based on DistMult and ComplEx. The integration function uses dense layers to learn the way in which information about literals is integrated.
MTKGNN [36] incorporates literals by introducing a learning task in addition to the typical separation of positive and negative triples. This task attempts to predict the value of a numerical attribute, so that the embeddings associated with the entities are indirectly affected by them. TransEA [43], similar to MTKGNN, incorporates the attributes in an additional learning task, added to TransE. It only considers numeric attributes.

State-of-the-art approaches
With all previous considerations in mind, we summarise in Table 1 the most popular GNN architectures, as well as the datasets on which they are applied, the embedding techniques that they use, and the training, testing, and validation splits that are used in their experimental validation. Among them, the most referenced proposals, like GCN [21] or GraphSage [18], represent the classical formulation of GNNs with iterative updates of node features through an aggregation function over the features of neighbouring nodes. Besides these architectures, attention-based GNNs like GAT [39] and GAAN [47] employ attention mechanisms to allow the model to learn the weights of neighbouring nodes according to their importance. These GNN architectures offer advantages such as their ability to handle graph-structured data, capture local and global structures, and reach state-of-the-art performance. However, GNNs also face several challenges such as scalability, interpretability, and robustness [44,52]. Recent studies have proposed solutions to address these issues, including the use of hierarchical GNNs [8] or adversarial training [42]. Despite these efforts, further research is required to make GNNs more efficient, interpretable, and robust. Another issue to be addressed is the lack of standardisation in the design and evaluation of GNN models, which makes it difficult to compare results across studies and benchmark datasets, although some efforts have been made in this direction [11]. This is evidenced by the fact that, as shown in Table 1, for all proposals the technique used to compute node embeddings is usually not specifically designed for the task, or is not described at all, as it is not considered a relevant aspect of the proposal. In terms of datasets, we find that Reddit [18], PPI [18], Cora [32], and Citeseer [15], which are domain-specific ones, are commonly used.
Other general-purpose KGs that are heterogeneous and rich in node attributes, such as YAGO [30] or Freebase [3], are not usually considered. Additionally, there is great variability in the train/test/validation splits, since they are very dependent on the characteristics of each dataset. These reasons led us to carry out the current study, in which we focus on the combination of deep embeddings and GNNs, abstracting ourselves from the particularities of state-of-the-art architectures, as explained later in Sect. 3.2.

Aim and scope
In this section we describe our contributions: in Sect. 3.1 we define the goals of our work and our research questions, while Sect. 3.2 describes the architecture of the neural networks that we used to answer the previous research questions.

Goals
The previous study of the state-of-the-art clearly shows that there has been little consideration for the combination of GNNs and KG deep embeddings. More specifically, to the best of our knowledge, there is no report on how this type of

Architecture of the neural networks
In order to answer the previous questions and, since the specific architectures being used are not the focus of our work, we used a baseline feedforward neural network and a standard graph neural network. The feedforward one, shown in Fig. 1, is composed of a series of five dense blocks, all of them with skip connections except the first one, and a final dense layer which provides the output. Each dense block is composed of two groups of batch normalisation, dropout and dense layers. The computational modules employed in this architecture are well-known, standard and commonly used together [2,13,41], so they form a sufficiently representative network for our experiments.
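The structure of the feedforward baseline can be sketched in plain Python as follows. This is a forward-pass-only illustration under several assumptions that are not stated in the text: ReLU activations, batch normalisation without learnable scale and shift, a skip connection that wraps each whole block, random untrained weights and an input dimensionality of 8. The 32 units per layer and the dropout rate of 0.5 are the values reported for our networks.

```python
import math
import random

UNITS = 32  # units per dense layer, as reported for both networks

def dense(batch, weights, bias, relu=True):
    """Fully connected layer; weights holds one weight vector per output unit."""
    out = []
    for row in batch:
        values = [sum(x * w for x, w in zip(row, unit)) + b
                  for unit, b in zip(weights, bias)]
        out.append([max(0.0, v) for v in values] if relu else values)
    return out

def batch_norm(batch, eps=1e-5):
    """Normalise every feature to zero mean and unit variance over the batch."""
    columns = list(zip(*batch))
    means = [sum(c) / len(c) for c in columns]
    variances = [sum((v - m) ** 2 for v in c) / len(c)
                 for c, m in zip(columns, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(row, means, variances)] for row in batch]

def dropout(batch, rate, training, rng):
    """Inverted dropout; acts as the identity outside training."""
    if not training:
        return batch
    keep = 1.0 - rate
    return [[v / keep if rng.random() < keep else 0.0 for v in row]
            for row in batch]

def dense_block(batch, layers, skip, rate=0.5, training=False, rng=None):
    """Two groups of batch normalisation, dropout and dense layers."""
    hidden = batch
    for weights, bias in layers:
        hidden = batch_norm(hidden)
        hidden = dropout(hidden, rate, training, rng)
        hidden = dense(hidden, weights, bias)
    if skip:  # add the block input back to its output
        hidden = [[h + x for h, x in zip(h_row, x_row)]
                  for h_row, x_row in zip(hidden, batch)]
    return hidden

def init_dense(in_dim, units, rng):
    """Random (untrained) parameters, only meant to make the sketch runnable."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(units)]
    return weights, [0.0] * units

rng = random.Random(0)
input_dim = 8  # hypothetical embedding size
blocks = []
for index in range(5):
    first_in = input_dim if index == 0 else UNITS
    blocks.append([init_dense(first_in, UNITS, rng),
                   init_dense(UNITS, UNITS, rng)])
output_layer = init_dense(UNITS, 1, rng)

def forward(batch):
    hidden = batch
    for index, layers in enumerate(blocks):
        # The first block has no skip connection (its input size differs).
        hidden = dense_block(hidden, layers, skip=index > 0)
    return dense(hidden, output_layer[0], output_layer[1], relu=False)

embeddings = [[rng.gauss(0, 1) for _ in range(input_dim)] for _ in range(4)]
predictions = forward(embeddings)  # one predicted attribute value per node
```

The final layer is left linear in this sketch because the FB15K237 and YAGO tasks are regressions; a classification task such as Cora would need a different output activation.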
The standard GNN, shown in Fig. 2, is based on GraphSage [18], which is a well-known, representative and mature state-of-the-art GNN approach [50] and, compared to other well-established GNN approaches like GCN [21] and GAT [39], offers a more flexible convolution configuration because of its variable aggregation function. Our representative architecture is composed of two dense blocks that perform preprocessing and postprocessing functions, between which we have placed two graph convolution layers. These are made up of a setup dense block, a message aggregation layer and a node embedding updating layer, as well as a skip connection over them. Finally, the output is provided by a dense layer. It is worth noting that all layers in the networks, essentially the dense layers, are configured with 32 nodes or units. The initial hyperparameter setup was the one proposed in GraphSage [18], and it was tuned through a set of small, empirical and informal tests until a suitable, typical state-of-the-art configuration was found. This configuration includes a learning rate of 0.01, RMSprop optimization, a dropout rate of 0.5, a batch size of 256 and 300 epochs with early stopping.
In terms of input, the feedforward network only receives a matrix containing the embeddings of the graph nodes and outputs the predicted value for the specific attribute prediction task performed. The input of the GNN architecture is a matrix of the embeddings of the nodes and the list of graph edges, that is, the relations between nodes in a (source, target) format. The GNN architecture also outputs the predicted attribute value for each node. Graph node embeddings are computed with the knowledge graph embedding techniques specified later in Sect. 4, using the node attributes available in each dataset.
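To make the GNN input format concrete, the following sketch performs a single message aggregation and update step over an embedding matrix and a list of (source, target) edges. It is a simplified illustration and not the network of Fig. 2: it assumes mean aggregation, messages flowing from source to target, and an update that simply concatenates each node's own embedding with the aggregated message, omitting the dense layers and the skip connection.

```python
from collections import defaultdict

def mean_aggregate(embeddings, edges):
    """For every node, average the embeddings of the nodes that point to it."""
    incoming = defaultdict(list)
    for source, target in edges:
        incoming[target].append(source)
    dim = len(embeddings[0])
    aggregated = []
    for node in range(len(embeddings)):
        neighbours = incoming[node]
        if not neighbours:
            aggregated.append([0.0] * dim)  # isolated node: empty message
            continue
        aggregated.append([
            sum(embeddings[n][d] for n in neighbours) / len(neighbours)
            for d in range(dim)
        ])
    return aggregated

def update(embeddings, aggregated):
    """Concatenate the own embedding of each node with its aggregated message."""
    return [own + message for own, message in zip(embeddings, aggregated)]

# Toy graph: three nodes with 2-dimensional embeddings (made-up values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
edges = [(0, 2), (1, 2), (2, 0)]

messages = mean_aggregate(embeddings, edges)
contextual = update(embeddings, messages)
```

After this step, node 2 carries the average of nodes 0 and 1 alongside its own embedding, which is the contextual information that the feedforward baseline never sees.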

Evaluation
In this section, we discuss our evaluation setup: the datasets that we have used, the different embedding techniques under evaluation and the tasks that were performed. Subsequently, we display and comment on our experimental results.

Experimental setup
The attribute-rich datasets we took into consideration for our experiments were FB15K237 [37] and a reduced version of YAGO [35] which only contains nodes that have at least one attribute and at least ten connected edges. It is worth noting that they were chosen because they are standard attribute-rich datasets in the deep embedding state-of-the-art, as can be seen in Table 3. Other datasets such as WordNet are also typically used, but they do not contain node attributes, making it impossible to apply attribute-based techniques to them. Additionally, we also included Cora [32], a citation dataset that is very commonly used in experiments involving GNNs, as shown in Table 1. The first two datasets are rich in attributes, which allows us to compute attributive embeddings and perform attribute prediction tasks, while the Cora dataset allows us to compare the improvement in performance with that obtained by domain-specific embeddings. More detailed statistics of each dataset, such as the number of nodes, edges and attributes, are provided in Table 2.
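The filtering criterion used for the reduced version of YAGO can be sketched as follows. The function name and data layout are hypothetical, and the sketch assumes that both incoming and outgoing edges count towards the threshold and that the degree is measured on the original graph; these details are not specified in the text.

```python
from collections import Counter

def filter_nodes(edges, attributes, min_attributes=1, min_edges=10):
    """Keep nodes with enough attributes and enough connected edges."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return {
        node for node, count in degree.items()
        if count >= min_edges
        and len(attributes.get(node, {})) >= min_attributes
    }

# Toy graph: one hub with ten edges and ten leaves with a single edge each.
edges = [("hub", f"leaf{i}") for i in range(10)]
attributes = {"hub": {"area": 140.8}, "leaf0": {"area": 3.0}}

kept = filter_nodes(edges, attributes)
```

Only the hub survives in this example: leaf0 has an attribute but a single edge, and the remaining leaves have neither enough edges nor any attribute.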
The prediction tasks we selected, as shown in Table 4, include a variety of node attributes to predict. It is also worth noting that tasks involving FB15K237 and YAGO consist of the regression of numerical attributes, while the Cora task consists of node classification. While the node type can be treated as an attribute, the Cora dataset does not include any actual attributes that can be used to compute attributive embeddings. We therefore limit the deep embedding experiments to the FB15K237 and YAGO tasks, and leave the Cora dataset to the domain-specific embedding experiment, which serves as a reference of how GNNs improve performance when using said embeddings. We tested each task with three different train/test split proportions: 80%/20%, 50%/50% and 20%/80%. Additionally, we executed every task 10 times in order to better assess the overall obtained performance.
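The evaluation protocol described above (three split proportions and ten executions per configuration) can be sketched as a simple grid. This is not the script we used: the function names are hypothetical, `evaluate` stands for any user-supplied routine that trains a model and returns its error, and drawing a fresh random split for every execution is an assumption.

```python
import itertools
import random
import statistics

TRAIN_FRACTIONS = (0.8, 0.5, 0.2)  # 80%/20%, 50%/50% and 20%/80% splits
RUNS = 10                          # executions per configuration

def split_nodes(nodes, train_fraction, rng):
    """Randomly split the nodes into train and test sets."""
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def run_grid(tasks, embeddings, models, nodes, evaluate, seed=0):
    """Average the score of every (task, embedding, split, model) combination."""
    results = {}
    grid = itertools.product(tasks, embeddings, TRAIN_FRACTIONS, models)
    for task, embedding, fraction, model in grid:
        scores = []
        for run in range(RUNS):
            rng = random.Random(seed + run)  # same splits for both models
            train, test = split_nodes(nodes, fraction, rng)
            scores.append(evaluate(model, task, embedding, train, test))
        results[(task, embedding, fraction, model)] = statistics.mean(scores)
    return results

def dummy_evaluate(model, task, embedding, train, test):
    """Placeholder that returns the test share instead of a real error."""
    return len(test) / (len(train) + len(test))

results = run_grid(["task"], ["TransE", "ASNE"], ["NN", "GNN"],
                   range(100), dummy_evaluate)
```

Seeding the split by the run index gives the feedforward network and the GNN the same train and test nodes in each execution, so their errors can be compared pairwise.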
To perform the experiments, we implemented a script, shown in Algorithm 1, to execute all the combinations of train split proportions and embedding techniques for every defined task in the datasets, for both the baseline feedforward neural network and the standard GNN.
Tables 5 and 6 show the results of the experiments that we conducted (note that a more comprehensive account of the embedding experimental results can be found in Table 8). Table 7 offers a summarised view of the results. Table 5 collects the reference results of the Cora dataset, in terms of accuracy. Figure 4 shows in a more visual way the MAE difference between GNNs and NNs depending on the performed task and the train set size, for the deep embedding technique LiteralE-DistMult, in which the effect of using different train set sizes is particularly significant. Table 6 contains the mean absolute error (MAE) difference obtained after applying each embedding technique in combination with the standard neural network and the GNN, to perform different tasks and considering different training set sizes, on two of the datasets (FB15K237 and YAGO). Each execution was repeated ten times to compute average values. We have boldfaced those results where the MAE difference is a negative value, that is, where the GNN outperforms the regular neural network. Figure 3 displays a representative picture of the convergence of the networks and the early stopping policy on two representative tasks, employing ASNE embeddings and an 80% training set size. Table 9 shows a summary of mean training times for both architectures on every performed task.

Experimental results and analysis
Next, we provide the answers to the questions posed in Sect. 3.1, according to the former experimental results.
Q1: When using a GNN, does the kind of deep embedding that is used have a significant effect on performance? To what extent is the performance of GNNs affected by the use of attributive embeddings?
There are clear differences in GNN performance depending on the embedding technique, and they are more noticeable in certain tasks, e.g. between TransEA and ASNE in the hasLatitude task, as shown in Table 8. However, we have not identified an embedding technique that consistently obtains better results in most tasks. The same applies to non-attributive embeddings (TransE and TransR) when compared to the rest. We conclude that the kind of embedding has a significant effect on a per-task basis, without a clear overall winner.
Q2: When using deep embeddings for attribute prediction tasks, does the use of a GNN instead of a regular neural network result in significant performance differences?
Generally, it does, as can be seen in the MAE% columns in Table 6, which show the percentage difference between GNNs and NNs (negative values, in bold, indicate that the GNN performs better). The results indicate that the GNN outperforms the regular NN in 65.7% of cases (Table 7), with the GNN taking fewer but longer epochs to converge (Fig. 3). However, some tasks tend to leverage context information, and thus the improvement is greater in them, as in locationArea or populationNumber (see Fig. 4). Other tasks may only need the information of the node itself to make a prediction, so the use of convolutions and contextual data may blur the characteristics of the current node; in such tasks, for example hasArea, the GNN shows no clear advantage over the NN. It may therefore be of interest to study whether the neighbour-aggregation process of a GNN is beneficial for a specific task, and whether the improvement over a common neural network justifies the higher complexity and computational cost of a GNN for that task, which on average means about double the training time, as shown in Table 9.
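The comparison between both networks can be illustrated with the following sketch. The exact formula behind the MAE% columns is not spelled out in the text, so the relative difference used here, (MAE of the GNN minus MAE of the NN) divided by the MAE of the NN, is an assumption chosen so that negative values mean that the GNN performs better; the function names and toy predictions are also hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between real and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mae_percent_difference(mae_gnn, mae_nn):
    """Relative MAE change of the GNN with respect to the NN, in percent."""
    return 100.0 * (mae_gnn - mae_nn) / mae_nn

def win_rate(differences):
    """Percentage of cases in which the GNN obtains a lower error."""
    return 100.0 * sum(1 for d in differences if d < 0) / len(differences)

# Toy attribute values and predictions (made-up numbers).
y_true = [10.0, 20.0, 30.0]
nn_pred = [14.0, 16.0, 34.0]   # MAE = 4.0
gnn_pred = [13.0, 17.0, 33.0]  # MAE = 3.0

difference = mae_percent_difference(
    mean_absolute_error(y_true, gnn_pred),
    mean_absolute_error(y_true, nn_pred),
)
share = win_rate([difference, 3.0, -1.0, 0.5])
```

In this toy case the GNN reduces the error by a quarter, and it wins in half of the four hypothetical cases passed to `win_rate`.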
Q3: Are the aforementioned differences affected by the kind of deep embedding being used?
Some embeddings seem to benefit more from the application of a GNN, with higher, more consistent reductions of the MAE. Overall, there is a higher chance of improvement for attributive embeddings than for non-attributive ones. As can be seen in Table 7, GNNs lead to an improvement in roughly 43% of cases when using the non-attributive embeddings TransR or TransE, but when using the attributive ones (ASNE, LiteralE-ComplEx, LiteralE-DistMult, MTKGNN and TransEA) there is an improvement in 65% of cases. For these reasons, the deep embedding technique used as GNN input should be carefully selected, as it may heavily influence the obtained results.
Q4: Is the improvement as noticeable as that reported in the state-of-the-art for tasks that use domain-specific embeddings? Does it vary with the prediction task?
The results for the Cora dataset shown in Table 5 showcase an accuracy improvement between 3% and 8%. However, when talking about more generic and attribute-rich datasets, results vary significantly between tasks, as shown in Table 6.
It is clear that some tasks, like populationNumber, seem to benefit more from the contextual information provided by the GNNs, reaching improvements of more than 25% in some cases, see Table 8. In other tasks, such as hasArea or personHeight, there is little information in the context of a node that could help improve the prediction while, for example, the rating of a film could be easier to predict based on its contextual information. In this case, GNNs lead to  Table 1, and reach better performances.
Q5: To what extent is the improvement achieved by GNNs affected by the amount of training data?
As can be seen in Fig. 4, there is no clear trend when the training size is altered. In some tasks, a reduced training size leads to a larger improvement, perhaps because a GNN can use additional contextual information to compensate for the lack of examples. In others, the opposite happens, which may be caused by the higher complexity of a GNN, which needs a larger number of examples to reach its potential. Therefore, training set size does not seem to be a determining factor in the performance improvement of GNNs over regular neural networks.

Conclusions and future work
In this paper we have presented a comprehensive study of how deep embeddings perform when applied together with Graph Neural Networks (GNNs). Deep embeddings have the appeal of being domain-independent, potentially able to capture latent information about the content of the graph, and of offering universal, automated embedding generation that can deal with heterogeneous datasets. This has led to their widespread use in a variety of tasks, including the prediction of graph elements by feeding them to neural networks. GNNs, which intend to endow these networks with contextual information, seem to be a perfect fit, but so far they have only been tested with domain-specific embeddings, which motivated our study. The novelty and value of our work reside in how we answer several open research questions about the performance of GNNs under several sets of circumstances, including seven attribute prediction tasks, seven types of deep embeddings, and three different training sizes. We conclude that the application of GNNs to improve the performance obtained by deep embeddings has significant potential, as can be seen in several tasks in which there is a reduction of error of more than 25%. However, research so far has focused too much on proposing novel network architectures and too little on determining under what circumstances they work best. As our experiments have shown, the same GNN can obtain completely different results depending on the task and embeddings being used.
Future work should focus on collecting a large set of attribute-rich datasets for the evaluation of GNNs and deep embeddings in an automated way. It would be particularly useful to catalogue said datasets according to their topology and other characteristics that should affect how useful the information-passing mechanisms in GNNs are.