1 Introduction

Graph Neural Networks (hereafter, GNNs) have received significant attention in recent years for their ability to work with graph-structured data. They are designed to learn node and arc representations in a graph that capture important features of the underlying data [1,2,3]. In particular, GNNs are very successful in tasks where graph structure plays a key role in determining the output [4, 5], for example identifying communities within a social network, predicting links between nodes, and classifying nodes or graphs [6,7,8,9]. These tasks have important applications in several domains, such as Social Network Analysis, bioinformatics and recommender systems [6, 10, 11], and materials science and chemistry [12]. Studies on challenging tasks, such as protein-protein interaction [13] and causal inference in brain networks [14], have demonstrated the expressive power and potential of GNNs and stimulated further investigation by the academic community.

The growing popularity of GNNs has led to the development of various architectures and algorithms related to them [15,16,17,18], as well as of approaches to improve their discriminative power [19, 20]. However, despite their success, there are still many open questions about how GNNs learn and represent information. One of the challenging issues is to understand the dynamics underlying learned representations and how these representations relate to the underlying graph structure. For example, it is unclear how GNNs process different types of information (e.g., node attributes and arc features) in their learned representations, and how they capture different types of structural patterns in graph data. Another challenge concerns the development of methods to evaluate the quality and generalization of learned representations in GNNs. This is particularly important in real-world applications, where GNNs must be robust and generalize accurately to unknown data. Finally, understanding the limitations and biases of GNNs can help identify potential improvement areas and provide insights into the design of better models capable of solving complex graph-based problems.

In this scenario, a compelling and interesting aspect regards the study of learned representations, especially node embeddings, as it provides insights into how GNNs capture and process information in a graph. Node embeddings model each node in a graph as a vector representing the corresponding structural and semantic information. By examining how GNNs handle the information present in input graphs, we can obtain valuable insights into their encoding and processing mechanisms. This analysis can also help us understand how GNNs generalize this information to new data. Moreover, node embeddings can be used in a wide range of specific tasks (think, for instance, of node classification, link prediction and graph clustering), which represent fundamental problems in areas like Network Analysis (NA, for short) and Machine Learning.

This paper falls right in this context and proposes a comprehensive framework that uses (Social) Network Analysis theory and techniques to investigate learned representations of GNNs. NA theory provides a powerful and intuitive means of analyzing the structure and properties of graph-structured data. When applied to learned representations of GNNs, this theory and the corresponding techniques can support the extraction of important insights enabling a deeper understanding of learned information. Furthermore, NA can support us in assessing the quality of learned representations and identifying improvement areas.

Our framework operates as follows: it receives a graph and passes it to the underlying GNN to be investigated. The GNN returns a set of node embeddings corresponding to the graph received as input. At this point, our framework creates a new network-based representation by combining the node embeddings returned by the GNN and the graph received as input. After that, it uses NA theory and techniques to obtain insights regarding the GNN performance. Then, it uses these insights to define a new loss function in order to enhance the training of the GNN. For this purpose, it considers the differences between the graph received as input and the one reconstructed from the node embeddings returned by the underlying GNN.

To the best of our knowledge, this paper represents one of the first attempts to study the learned representations of a GNN and to improve its learning process through a comprehensive framework based on NA. Specifically, the main contributions of the framework presented in this paper are as follows:

  • It proposes a method for mapping the learned representations returned by a GNN, after it has processed an input graph, into a network;

  • It employs NA theory and techniques to extract insights about the structure and behavior of the underlying GNN;

  • It defines a new training loss function to both assess and enhance the quality of the GNN’s learning process, thereby increasing the likelihood that the GNN will subsequently provide more accurate results.

The outline of this paper is as follows: Sect. 2 describes related literature. Section 3 presents the proposed framework. Section 4 illustrates the experimental campaign we conducted to test it. Finally, Sect. 5 reports our conclusion and outlines some possible future developments of our research efforts.

2 Related literature

GNNs have emerged as a powerful framework for learning representations of graph-structured data [1, 6]. Indeed, they are capable of encoding both node attributes and input network topology into node embeddings, which are vector representations of nodes capable of capturing their structural and/or semantic properties. A very active research area concerns the analysis of such embeddings to answer questions such as: “Do they capture meaningful relationships?”, “Can they be measured quantitatively?”. Questions like these can be addressed by studying the representational power of GNNs [21,22,23]. A large number of papers in the literature have shown that GNNs are an effective tool for learning graph representations in many applications. However, there is still little understanding about their properties, limitations and learned representations [21, 24]. This paper aims to make a contribution to filling this gap by investigating and providing insights into the dynamics underlying the learned representations returned by GNNs. To this end, it proposes a conceptual framework that employs the theory and techniques of Network Analysis to evaluate these learned representations.

As far as the scientific literature is concerned, the study proposed in this paper belongs to the broader field of representation learning through GNNs. In the following we will focus on two aspects of our paper, namely: (i) the study and analysis of learned representations in GNNs, and (ii) the use of (Social) Network Analysis models and techniques to investigate the dynamics of GNNs and assess the quality of their learned representations. With regard to these aspects, we highlight what is already in the literature and how our framework relates to existing ones. In addition, in order to highlight the features of our framework, at the end of our overview we briefly discuss some application contexts that can benefit from the exploitation of GNNs and learned representations. We believe it is worth pointing out that our framework represents one of the first attempts to develop an approach that uses NA not only to evaluate the quality of learned representations in GNNs but also to enhance their training.

As for the study and analysis of GNNs learned representations, several surveys have recently been proposed [1, 6, 25]. The study in [21] is widely recognized as one of the most comprehensive studies on the expressive power of GNNs. In it, the authors present a theoretical framework for analyzing the expressive power of GNNs and apply it to characterize popular variants of GNNs, such as Graph Convolutional Networks. They also develop a simple neural architecture, called Graph Isomorphism Network, which has been shown to be as powerful as other GNNs. The work in [21] and ours are related in terms of motivations. Indeed, the work in [21] suggests that a maximally powerful GNN should be able to distinguish different graph structures by mapping them to different representations in the embedding space, but this is a hard problem. Our approach addresses this issue by analyzing the learned representations of a GNN, using the theory and techniques of NA, which intrinsically encompasses properties of graph structures.

In [26], the authors use spectral analysis to investigate the expressive power of GNNs. They argue that this perspective provides a complementary viewpoint for understanding GNNs and demonstrate the equivalence of convolution processes in spatial and spectral GNNs. Moreover, through several experiments, they show that graph convolutions in GNNs are problem-specific rather than problem-agnostic. The approach in [26] is orthogonal to ours since it focuses on characterizing the expressive power of GNNs from a spectral perspective while our approach focuses on the analysis of learned representations. In [27], the authors propose an interpretable embedding procedure based on a knowledge distillation method that leverages the learned representation of a GNN. The authors show that the graph structure caught in the learned representation captures relational information better than classical representations, such as those obtained from attention networks. The approach in [27] and ours share the use of GNNs learned representation. However, they have different goals. In fact, the approach in [27] focuses on knowledge distillation and wants to generate an interpretable embedding procedure. By contrast, our approach focuses on the investigation of embeddings through the theory and techniques of NA.

In [28], the authors focus on the graph comparison problem and propose an approach to produce a fast-to-compute feature map that represents a graph through the distribution of its node embeddings. The authors employ the proposed approach in a graph classification task and compare it with other supervised and unsupervised approaches. This paper is interesting in that the authors explore embedding techniques and construct particular feature maps on top of the results, thus effectively exploiting learned representations. However, unlike our approach, the one of [28] does not consider learned representations obtained by GNNs. In [29], the authors investigate the empirical robustness of embeddings, produced by different models, to random and adversarial poisoning attacks. In their evaluation, they include matrix factorization-based models, skip-gram-based models, and deep neural network-based models. Although both our approach and that of [29] analyze learned representations, the approach in [29] is devoted to a different task, namely assessing a specific property of node embeddings.

In [30], the authors present ROLAND, a graph representation learning framework for real-world dynamic graphs. ROLAND allows researchers to easily repurpose any static GNN for dynamic graphs. The idea is to view node embeddings at different GNN layers as hierarchical node states and then recurrently update them over time. While both the approach in [30] and ours leverage learned representations, they have very different goals. Nevertheless, in the future, it would be interesting to extend our study to dynamic graphs, also exploiting models built with ROLAND.

As for the analysis of GNN dynamics and the evaluation of the quality of their learned representations, in [31] the authors discuss the representation power of Graph Convolutional Networks (GCNs) in learning graph moments, which encode paths of various lengths in graph topology. The approach in [31] and ours share the idea of analyzing learned representations through graph properties and structures. However, the approach in [31] aims to prove the limitations of a GCN in learning graph topology, particularly graph moments, while our approach studies learned representations directly through the proposed framework. Moreover, the approach in [31] targets generation models while ours focuses on learned representations of any GNN. In [32], the authors discuss the limitations in using random walk-based sampling strategies for network embeddings. They also present a new method that combines neighbor information and local-subgraph similarity to learn node embeddings. This method also uses structural information (graphlets) to enhance the quality of embeddings. It aims to enhance learned representations by using topological information for network data. However, it does not perform a subsequent analysis of the enhanced learned representation.

In [33], the authors propose NCA-GE (Network Centrality Approximation using Graph Embeddings), a fast and efficient approach to approximate node centralities in large networks using neural networks and graph embedding techniques. NCA-GE represents a direct application of learned representations for addressing a task that cannot be performed easily on very large graphs. NCA-GE and our framework can be considered complementary. In fact, our framework could exploit the results of NCA-GE to further measure the quality of learned representations. In [34], the authors propose a new approach called DEMO-Net. It performs multi-task graph convolution, where each task carries out node representation learning for nodes with a specific degree value. The authors also introduce a new graph-level pooling/readout scheme for learning graph representations, and show that this scheme is efficient and effective in many cases. DEMO-Net also focuses on modifying the learning model to explicitly capture the graph topology through a variant of GNN.

In [35], the authors investigate structural node embeddings. These embeddings are based on the principle that nodes having similar functions, ties or interactions should be close in the embedding space, regardless of their distance in the network. In the paper, the authors want to understand what types of equivalence are captured by structural embeddings and provide an in-depth empirical analysis of them using a variety of datasets and tasks. The approach in [35] and ours share some similarities. In fact, both of them propose an intrinsic evaluation of node embeddings and in both cases this evaluation is based on some network parameters. However, the approach in [35] focuses on structural node embeddings while our framework can be applied to any type of node embeddings. In addition, our framework also wants to improve the learning process of GNNs. In [36], the authors aim to interpret vector embeddings of social network data and propose concrete interpretations in terms of preserved network properties. Their approach relates embeddings with network centralities. As a result, they obtain that different embedding methods learn different network properties. Both the approach in [36] and our framework use (Social) Network Analysis to provide an explanation of embedding relatedness. However, their goals are different. In fact, the approach in [36] wants to predict the centrality values of a particular node based on its embeddings. In contrast, our framework focuses on analyzing the structures within the embeddings and improving the learning process. Finally, the approach in [36] does not consider the learned representations resulting from GNNs.

In [37], the authors analyze the effectiveness of vector embeddings of nodes in encoding the elementary properties of the nodes themselves. They also evaluate three state-of-the-art node representation models (i.e., DeepWalk, node2vec and LINE) on different tasks and different graphs, and show that node2vec and LINE best encode network properties for sparse and dense graphs, respectively. The approach in [37] and our framework share the use of NA to evaluate node embeddings. However, the goals of the two approaches are different. In fact, the approach in [37] focuses on the effectiveness of various node representations in predicting the properties of graphs and builds a model to do this. In contrast, our approach wants to evaluate node representations and use that evaluation to improve GNN learning. Furthermore, the approach in [37] does not consider learned representations resulting from GNNs, but uses classical models, such as DeepWalk. In [38], the authors propose a framework for unsupervised graph embedding comparison. Although the goals of this approach differ from those of our framework, some concepts and insights present in [38] (e.g., the concept of divergence score) may be useful in our context.

To conclude, we point out that the expressive power of GNNs enabled a series of contexts and tasks to be successfully managed by them. In particular, researchers have investigated the usage of GNNs to solve problems in complex and social networks [39,40,41,42]. The ability of extracting node-level structural features from graphs is one of the key aspects of GNNs. For instance, in [39], the authors propose a GCN-based framework for the estimation of communication network reliability. This framework employs several graph convolution layers to extract node-level structural features from input information. The crucial task of identifying critical nodes and links in graphs is the focus of the approach described in [40]. Here, the authors propose a scalable and generic GNN for identifying critical nodes and links in large complex networks. The idea is to learn the node and link criticality score on a small representative subset of nodes and links, and then predict the scores of nodes and links on a larger scale network. An approach tackling general time-evolving social network problems is presented in [41]. Here, the authors propose a GNN-based framework, called Spatial-Temporal Graph Social Network (STGSN), whose aim is to model a social network taking both spatial and temporal perspectives into account. STGSN belongs to the general field of GNNs supporting Social Network Analysis; therefore, it can be considered orthogonal to our approach. In fact, while STGSN employs GNNs for modeling the dynamics of a time-evolving Social Network, our framework employs (Social) Network Analysis models and techniques to investigate the dynamics of GNNs. To address the social recommendation problem, the authors of [42] propose a GNN-based framework able to coherently model graph data with the aim to learn better user and item representations. The framework takes advantage of an attention mechanism to discern the heterogeneous strengths of social relationships between users. 
The study of functional brain networks has also been addressed through GNNs. For instance, in [43] the authors propose BrainTGL, a temporal graph representation learning framework for brain networks. The objective is to capture the potentially complex spatial and temporal correlations in the human brain through a combination of approaches, such as temporal graph pooling and dual temporal graph learning.

3 Description of our framework

In this section, we present our NA-based framework conceived to investigate the node embeddings returned by a GNN and to improve its learning process. Specifically, in Sect. 3.1 we define the network model adopted by our framework. In Sect. 3.2, we describe a set of analysis measures used by it to achieve its goals. Finally, in Sect. 3.3 we describe the GNN evaluation and enhancement process performed by it.

3.1 Description of the model underlying our framework

Let G be a graph to be processed and let \({\mathcal {GNN}}\) be a Graph Neural Network that performs a machine learning task (e.g., node classification, graph classification, edge prediction, etc.) on G. The ultimate goal of our framework is to evaluate the application of \({\mathcal {GNN}}\) to G and improve its performance.

G can be modeled as \(G = \langle V, E, W \rangle\). Here, V is the set of nodes; each node \(v_i \in V\) is associated with a vector \(X_i \in {\mathbb {R}}^h\), \(h \ge 1\), of features. \(E \subseteq V \times V\) is the set of edges. An edge \(e_{ij} \in E\) exists between the nodes \(v_i\) and \(v_j\) if there exists a relationship between them. There exists a weight \(w_{ij} \in W\) for each edge \(e_{ij} \in E\); it indicates the strength of the relationship between \(v_i\) and \(v_j\).

This modeling is as general as possible. Therefore, depending on the specific application, \(v_i\), \(X_i\) and \(e_{ij}\) have different meanings. For example, in the well-known Cora dataset, each node \(v_i\) represents a paper. \(X_i\) is a binary vector such that each of its elements is associated with a word of interest and is set to 1 if the associated word is present in the paper corresponding to \(v_i\), otherwise it is set to 0. The edge \(e_{ij}\) indicates a relationship between \(v_i\) and \(v_j\) such that the paper corresponding to \(v_i\) cites the one associated with \(v_j\). The corresponding weight \(w_{ij}\) is always equal to 1. In this example, G is a directed graph, but the model we are defining is generic and admits both directed and undirected graphs.

Suppose we train \({\mathcal {GNN}}\) for \(\eta\) epochs. During its training, \({\mathcal {GNN}}\) learns the node representations in a latent space, i.e., node embeddings. We can define a function \(f: {\mathbb {R}}^h \rightarrow {\mathbb {R}}^{emb}\) representing this learning task. Here, h is the number of features of a node, emb is the dimension of the embedding; generally, \(emb \ll h\). In particular, given a node \(v_i\) of G and the feature vector \(X_i\) of \(v_i\), the embedding \({{\mathcal {E}}}_i\) of \(v_i\) is a vector in \({\mathbb {R}}^{emb}\). It captures both structural and feature-based properties of \(v_i\) because it derives directly from the learning activity of \({\mathcal {GNN}}\) performed on G and the vector \(X_i\) of the features of \(v_i\).

At the end of each training epoch t, we can use f to obtain the corresponding node embeddings at that time in order to analyze the evolution of learned representations and to enhance the training of \({\mathcal {GNN}}\). To do so, we need a way to map the embeddings back to the original graph. For this purpose, we introduce a new graph \(G^e(t) = \langle V, E, W^e(t) \rangle\), which represents a variant of G. Indeed, \(G^e(t)\) has the same set V of nodes and the same set E of edges as G, while the vector \(W^e(t)\) of the edge weights is different. Since the nodes of \(G^e(t)\) are described by their embeddings rather than by the feature vectors of G, it is necessary to take this additional information into account, and this is done precisely by redefining \(W^e(t)\). One way to do this is to consider the distance and/or similarity between the embeddings of the two nodes connected by the edge whose weight is being computed. Several measures capable of doing this have been proposed in the literature. One of the most widely used is cosine similarity, which computes the cosine of the angle between the two vectors. It ranges between -1 and 1; in particular, it is equal to 1 if the two vectors point in the same direction, to 0 if they are orthogonal and to -1 if they point in opposite directions. Taking this similarity into account, the weight \(w^e_{ij}(t)\) associated with the edge \(e_{ij}\) in \(G^e(t)\) can be computed as follows:

$$\begin{aligned} w^e_{ij}(t) = w_{ij} \ \frac{{{\mathcal {E}}}_i \cdot {{\mathcal {E}}}_j}{||{\mathcal {E}}_i|| \ ||{{\mathcal {E}}}_j||} = w_{ij} \ \frac{\sum _{k=1}^{emb} {\mathcal {E}}_{i_k} \ {{\mathcal {E}}}_{j_k}}{\sqrt{\sum _{k=1}^{emb} {{\mathcal {E}}}^2_{i_k}} \ {\sqrt{\sum _{k=1}^{emb} {{\mathcal {E}}}^2_{j_k}}}} \end{aligned}$$
(3.1)

Here, \({{\mathcal {E}}}_i\) and \({{\mathcal {E}}}_j\) are the embedding vectors of \(v_i\) and \(v_j\) in \(G^e(t)\), while \(w_{ij}\) is the weight of the edge \(e_{ij}\) in the original graph G. The reasoning behind this formula is as follows: the weight \(w^e_{ij}(t)\) is obtained by multiplying the weight \(w_{ij}\) of the original edge by the cosine similarity of the embedding vectors associated with \(v_i\) and \(v_j\) in \(G^e(t)\). In this way, the initial weight \(w_{ij}\) varies by a factor defined by the embeddings extracted by \({\mathcal {GNN}}\). Thus, the strength of the connections between the nodes in \(G^e(t)\) changes according to the training of \({\mathcal {GNN}}\). We point out that this formula also allows the edges of \(G^e(t)\) to maintain the same semantics as the edges of G, since the weights are only scaled by a dimensionless factor.
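As an illustration of Eq. (3.1), the following minimal Python sketch computes the weights of \(G^e(t)\) from the weights of G and a set of node embeddings. The function names, the dictionary-based graph encoding and the toy data are our own assumptions for illustration, as is the choice to return 0 when an embedding is the zero vector (a case in which Eq. (3.1) is undefined).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero embedding yields similarity 0 (Eq. 3.1 is undefined here).
        return 0.0
    return dot / (norm_a * norm_b)

def embedding_weights(edges, embeddings):
    """Return W^e(t) as {(i, j): w_ij * cos(E_i, E_j)} for every edge of G."""
    return {
        (i, j): w * cosine_similarity(embeddings[i], embeddings[j])
        for (i, j), w in edges.items()
    }

# Toy data (hypothetical): edges maps (i, j) -> w_ij, embeddings maps node -> E_i at epoch t.
edges = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 1.0}
embeddings = {0: [1.0, 0.0], 1: [2.0, 0.0], 2: [0.0, 3.0]}
weights_e = embedding_weights(edges, embeddings)
```

Since the cosine factor can be negative, \(w^e_{ij}(t)\) can be negative as well; any downstream measure that requires non-negative weights would need an explicit convention for such edges.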

3.2 Analysis measures used by our framework

In this section, we present the analysis measures employed by our framework to reach its goals. Specifically, in Sect. 3.2.1 we describe two measures to compare communities in G and in \(G^e(t)\). In Sect. 3.2.2, we present a measure to analyze the clustering coefficient of \(G^e(t)\) during the training of \({\mathcal {GNN}}\). Finally, in Sect. 3.2.3, we illustrate a measure to evaluate the differences between G and \(G^e(t)\) in terms of centrality measures.

3.2.1 Two measures for comparing communities in G and \(G^e(t)\)

In this section, we want to study the structure of the communities that can be derived from G and \(G^e(t)\). In particular, we propose two ways to perform such a task. Preliminarily, we need an algorithm \({{\mathcal {A}}}_{CD}\) for community detection on graphs with weighted edges, such as Louvain [44], FastGreedy [45], Label Propagation Algorithm [46], and/or another among those proposed in the past literature [47]. The application of \({{\mathcal {A}}}_{CD}\) to G and \(G^e(t)\) returns two sets of communities \({\mathcal{C}\mathcal{S}}\) and \({\mathcal{C}\mathcal{S}}^e(t)\). Once we have these two sets, we need a way to identify which among them best splits the corresponding graph in communities characterized by many strong intra-community edges and few weak inter-community edges. A popular metric addressing this issue is modularity [48], defined as:

$$\begin{aligned} Q = \frac{1}{2w_{tot}} \sum _{i,j=1}^n\left[ A[i,j] - \frac{d_i d_j}{2w_{tot}}\right] \delta ({{\mathcal {C}}}_i, {{\mathcal {C}}}_j) \end{aligned}$$
(3.2)

Here: (i) \(w_{tot}\) is the sum of the weights of the edges in the network under consideration; (ii) A[i,j] is the element at position (i,j) of the adjacency matrix corresponding to the graph; (iii) \(d_i\) (resp., \(d_j\)) is the degree of the node \(v_i\) (resp., \(v_j\)); (iv) \({{\mathcal {C}}}_i\) (resp., \({{\mathcal {C}}}_j)\) is the community to which the node \(v_i\) (resp., \(v_j\)) belongs; (v) \(\delta ({{\mathcal {C}}}_i, {{\mathcal {C}}}_j)\) is the Kronecker delta function, which returns 1 if \(v_i\) and \(v_j\) belong to the same community and 0 otherwise. The greater Q is, the better the partition of nodes into communities. Our expectation is that the communities extracted from \(G^e(t)\) have a higher value of Q than those extracted from G, since \(G^e(t)\) “contains” more knowledge than G, namely the knowledge provided by \({\mathcal {GNN}}\).
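To make Eq. (3.2) concrete, the sketch below evaluates Q for an undirected weighted graph and a given partition. It is an illustrative implementation under our own assumptions: \(d_i\) is interpreted as the weighted degree (the sum of the weights of the edges incident to \(v_i\)), the graph has no self-loops, and the function name and data layout are hypothetical.

```python
def modularity(weights, community_of):
    """Weighted modularity Q (Eq. 3.2) of an undirected graph without self-loops.

    weights: {(i, j): w_ij}, each undirected edge listed once.
    community_of: {node: community label}, covering every node.
    """
    w_tot = sum(weights.values())
    # Assumption: d_i is the weighted degree of v_i.
    degree = {v: 0.0 for v in community_of}
    adj = {}  # symmetric lookup for A[i, j]
    for (i, j), w in weights.items():
        degree[i] += w
        degree[j] += w
        adj[(i, j)] = w
        adj[(j, i)] = w
    q = 0.0
    for i in community_of:
        for j in community_of:
            # The Kronecker delta keeps only pairs in the same community.
            if community_of[i] == community_of[j]:
                q += adj.get((i, j), 0.0) - degree[i] * degree[j] / (2.0 * w_tot)
    return q / (2.0 * w_tot)

# Toy graph (hypothetical): two unit-weight triangles joined by a single edge.
toy_edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0,
             (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0,
             (2, 3): 1.0}
two_communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_community = {v: "a" for v in range(6)}
q_two = modularity(toy_edges, two_communities)
q_one = modularity(toy_edges, one_community)
```

The double loop is quadratic in the number of nodes and is meant only to mirror the formula; in practice the same function would be applied once to G with \({\mathcal{C}\mathcal{S}}\) and once to \(G^e(t)\) with \({\mathcal{C}\mathcal{S}}^e(t)\).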

The modularity Q defined above allows us to analyze the differences in the quality of communities. However, Q does not allow us to know whether the composition of the communities of G and \(G^e(t)\) is similar or not. Since \({\mathcal{C}\mathcal{S}}\) and \({\mathcal{C}\mathcal{S}}^e(t)\) are sets of communities and each community consists of a set of nodes, the comparison of community compositions is not straightforward. To solve this problem we proceed as follows. For each community \({{\mathcal {C}}}_i \in {\mathcal{C}\mathcal{S}}\), we compute the Jaccard coefficient [49] \(J_{ij}\) between its nodes and those of each community \({{\mathcal {C}}}_j \in {\mathcal{C}\mathcal{S}}^e(t)\). Recall that the value of the Jaccard coefficient ranges in the real interval [0, 1]; the greater the value of the Jaccard coefficient \(J_{ij}\) of \({{\mathcal {C}}}_i\) and \({{\mathcal {C}}}_j\), the greater the overlap between these two communities. After computing the Jaccard coefficient between \({{\mathcal {C}}}_i\) and any community \({{\mathcal {C}}}_j \in {\mathcal{C}\mathcal{S}}^e(t)\), we take the maximum value \(J_i\) of the Jaccard coefficients thus obtained; it represents the maximum possible overlap between \({{\mathcal {C}}}_i\) and a community of \({\mathcal{C}\mathcal{S}}^e(t)\). Proceeding in this way, we obtain a value \(J_i\) for each community \({{\mathcal {C}}}_i \in {\mathcal{C}\mathcal{S}}\). Afterward, we calculate the mean \(J_M\) of all the values \(J_i\) thus obtained. \(J_M\) is an indicator of the structural similarity between the communities of \({\mathcal{C}\mathcal{S}}\) and \({\mathcal{C}\mathcal{S}}^e(t)\). Its value ranges between 0 and 1; the greater \(J_M\), the greater the structural similarity between the communities of G and \(G^e(t)\). The pseudocode describing this behavior is reported in Algorithm 1.

Algorithm 1: Function COMMUNITY_STRUCTURAL_SIMILARITY [pseudocode figure]
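The procedure described above for computing \(J_M\) can be sketched in Python as follows. This is a reconstruction from the textual description, not the original pseudocode; the representation of communities as Python sets and the handling of empty inputs are our own assumptions.

```python
def jaccard(a, b):
    """Jaccard coefficient between two sets of nodes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def community_structural_similarity(cs, cs_e):
    """Return J_M: the mean, over communities C_i in cs, of the maximum
    Jaccard coefficient between C_i and any community C_j in cs_e."""
    if not cs or not cs_e:
        # Assumption: with no communities to compare, the similarity is 0.
        return 0.0
    best = [max(jaccard(c_i, c_j) for c_j in cs_e) for c_i in cs]
    return sum(best) / len(best)

# Toy example (hypothetical): communities of G and of G^e(t).
cs = [{0, 1, 2}, {3, 4, 5}]
cs_e = [{0, 1}, {2, 3}, {4, 5}]
j_m = community_structural_similarity(cs, cs_e)
```

Note that, as described, the measure is not symmetric: it averages over the communities of \({\mathcal{C}\mathcal{S}}\) only, so swapping the two arguments can give a different value.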

At the end of this analysis, we have two metrics to compare the communities of G and \(G^e(t)\), namely the modularity Q and the average Jaccard coefficient \(J_M\). In Sect. 4.3, we test whether these measures are useful to train \({\mathcal {GNN}}\), and thus whether they can help improve its performance.

3.2.2 A measure to compare clustering coefficients in G and \(G^e(t)\)

Another interesting analysis to compare G and \(G^e(t)\) is based on clustering coefficients. Recall that the clustering coefficient of a node measures its tendency to cluster with other nodes. In fact, the higher the average clustering coefficient in a graph, the greater the number of closed triads compared to the number of open triads [50]. Since G and \(G^e(t)\) have weighted arcs, we use the following formula to compute the weighted clustering coefficient \(c_i\) relative to a node \(v_i\) [51]:

$$\begin{aligned} c_i = \frac{1}{d_i \ (d_i - 1)} \ \sum _{v_j \in neigh(v_i)} \ \sum _{v_k \in neigh(v_i), v_k \ne v_j} ({\hat{w}}_{ij} \ {\hat{w}}_{ik} \ {\hat{w}}_{jk})^{\frac{1}{3}} \end{aligned}$$
(3.3)

Here: (i) \(d_i\) is the degree of the node \(v_i\); (ii) \(neigh(v_i)\) represents the neighbors of \(v_i\), i.e., the nodes connected to \(v_i\) through an edge; (iii) \({\hat{w}}_{ij}\) (resp., \({\hat{w}}_{ik}\), \({\hat{w}}_{jk}\)) is the normalized weight of the edge \(e_{ij}\) (resp., \(e_{ik}\), \(e_{jk}\)); if we indicate with \(w_{max}\) the maximum edge weight in the network, then \({\hat{w}}_{ij} = \frac{w_{ij}}{w_{max}}\) (resp., \({\hat{w}}_{ik} = \frac{w_{ik}}{w_{max}}\), \({\hat{w}}_{jk} = \frac{w_{jk}}{w_{max}}\)). Following this definition, the higher the weight of a triad is, the higher its importance.

Having the weighted clustering coefficient of each node in the network, we can compute the mean weighted clustering coefficient \(c_M\) of the network as:

$$\begin{aligned} c_M = \frac{\sum _{v_i \in V} c_i}{|V|} \end{aligned}$$
(3.4)
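Equations (3.3) and (3.4) can be illustrated with the following sketch for an undirected graph. It rests on several assumptions of ours: all weights are non-negative (otherwise the cube root is not real-valued), a missing edge \(e_{jk}\) contributes 0 to the sum, nodes with fewer than two neighbors receive \(c_i = 0\), and every node appears in at least one edge. The function name and data layout are hypothetical.

```python
def weighted_clustering(weights):
    """Return ({node: c_i}, c_M) following Eqs. 3.3 and 3.4.

    weights: {(i, j): w_ij >= 0}, each undirected edge listed once.
    """
    w_max = max(weights.values())
    w_hat = {}   # normalized, symmetric weights
    neigh = {}   # neighbor sets
    for (i, j), w in weights.items():
        w_hat[(i, j)] = w_hat[(j, i)] = w / w_max
        neigh.setdefault(i, set()).add(j)
        neigh.setdefault(j, set()).add(i)
    c = {}
    for i, nbrs in neigh.items():
        d_i = len(nbrs)
        if d_i < 2:
            # Assumption: no triad can be centered on this node.
            c[i] = 0.0
            continue
        total = 0.0
        for j in nbrs:
            for k in nbrs:
                # Only closed triads (edge e_jk present) contribute.
                if k != j and (j, k) in w_hat:
                    product = w_hat[(i, j)] * w_hat[(i, k)] * w_hat[(j, k)]
                    total += product ** (1.0 / 3.0)
        c[i] = total / (d_i * (d_i - 1))
    c_m = sum(c.values()) / len(c)
    return c, c_m

# Toy graph (hypothetical): a unit-weight triangle with one pendant node.
tri_pendant = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0, (2, 3): 1.0}
c_values, c_mean = weighted_clustering(tri_pendant)
```

In this toy graph, nodes 0 and 1 sit in a fully closed triad, node 2 has one closed triad out of three possible ones, and node 3 has a single neighbor.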

The weighted clustering coefficient could be useful to test whether the embeddings of \({\mathcal {GNN}}\) tend to increase or decrease the strength of the connections of triads during its training. We might expect that the edges and structures of \(G^e(t)\) would have different weights than those of G precisely because of the embeddings learned by \({\mathcal {GNN}}\). Similarly to the measures Q and \(J_M\) seen in the previous section, \(c_M\) might have an impact on the performance of \({\mathcal {GNN}}\). Therefore, in Sect. 4.4, we present some tests we performed to analyze the differences between the mean weighted clustering coefficients of G and \(G^e(t)\) and to understand whether \(c_M\) can really support the training of \({\mathcal {GNN}}\).

3.2.3 A measure to compare centrality measures in G and \(G^e(t)\)

Centrality measures represent another interesting viewpoint to investigate for our framework. They indicate the importance of a node in the network in terms of the number of connections, number of shortest paths passing through it, relevance of its neighbors, and so on [52]. Let \(\gamma\) be a centrality measure in a weighted graph. Based on the values of \(\gamma\), a ranking of the graph nodes can be built. We can compute \(\gamma\) on G and \(G^e(t)\) to obtain the rankings \(R_\gamma\) and \(R_\gamma ^e(t)\) of the nodes of G and \(G^e(t)\) with respect to the values of \(\gamma\). Following this reasoning, we could verify whether the most important nodes in \(R_\gamma\) are the same as, or different from, the most important nodes in \(R_\gamma ^e(t)\). Such a check would allow us to understand whether \({\mathcal {GNN}}\) has modified the relevance of the nodes of G based on what it has learned during its training process. Since most centrality measures follow a power law distribution, for each ranking we can focus only on the nodes with the highest values. For instance, we could focus only on the top 20% of the nodes in each ranking.

To check the differences between the two rankings \(R_\gamma\) and \(R_\gamma ^e(t)\), we rely on Kendall’s tau coefficient. It is a measure of the correspondence between two rankings. The higher its value, the greater the correspondence between the two rankings under examination [53]. Thus, within our framework, if the value of Kendall’s tau coefficient \(\tau _\gamma\), calculated on the rankings \(R_\gamma\) and \(R_\gamma ^e(t)\), tends to 1, it means that \(R_\gamma\) and \(R_\gamma ^e(t)\) are close, and thus that the embeddings returned by \({\mathcal {GNN}}\) have preserved the importance of the nodes of G during the training activity. Otherwise, if \(\tau _\gamma\) tends to 0, it means that the rankings are different, and therefore the training of \({\mathcal {GNN}}\) has changed the relevance of the nodes of G. As a consequence, investigating \(\tau _\gamma\) is interesting to observe whether or not the training of \({\mathcal {GNN}}\) has led to changes in the importance of the nodes in G. The experiments related to this investigation are explained in detail in Sect. 4.5.
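As an illustration of how \(\tau _\gamma\) could be obtained, the following sketch computes Kendall's tau from two lists of centrality scores aligned by node. The text does not state which variant of the coefficient is used; the tau-b variant, which corrects for ties, is an assumption of this sketch, and the function name is hypothetical. Restricting the lists to the most important nodes is left to the caller.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long lists of scores.

    x[n] and y[n] are the centrality values of the same node in G and
    in G^e(t), respectively (hypothetical calling convention).
    """
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair tied in the first ranking
        if dy == 0:
            ties_y += 1  # pair tied in the second ranking
        if dx * dy > 0:
            concordant += 1  # same relative order in both rankings
        elif dx * dy < 0:
            discordant += 1  # opposite relative order
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # Return 0.0 when the coefficient is undefined (all values tied).
    return (concordant - discordant) / denom if denom else 0.0


scores_g = [4.0, 3.0, 2.0, 1.0]    # gamma on G
scores_ge = [3.5, 3.9, 1.0, 0.5]   # gamma on G^e(t), same node order
print(kendall_tau_b(scores_g, scores_ge))
```

This quadratic-time version is only meant to make the definition concrete; for large node sets a library routine would be preferable.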

3.3 Evaluation and enhancement of a GNN performed by our framework

In the previous sections, we introduced three different perspectives to investigate embeddings learned through a GNN, namely community structure, clustering coefficient and centrality measures. For each of these perspectives, we identified measures allowing us to carry out a quantitative study. These measures are Q and \(J_M\) for communities, \(c_M\) for clustering coefficient and \(\tau _\gamma\) for centrality measures. Thanks to them, we can study the evolution of the embeddings returned by \(G^e(t)\) during the various epochs of the training of \({\mathcal {GNN}}\). Such a study could reveal the presence of trends that, in addition to being valuable for the analysis and evaluation of the phenomenon as such, could provide support for improving the performance of \({\mathcal {GNN}}\). In fact, these measures could add information and knowledge capable of supporting the training of this network. One way for our measures to play a role in improving the performance of \({\mathcal {GNN}}\) is to leverage the loss function \({{\mathcal {L}}}\) used to train this network. In our framework, we do not need to identify a specific loss function but can use any loss function available in the literature, such as binary cross-entropy, categorical cross-entropy and mean squared error-based loss [54].

Therefore, let \({{\mathcal {L}}}\) be the starting loss function that we chose. To add the information that can be derived through our measures, we can introduce a new loss function derived from \({\mathcal {L}}\), which we call \({{\mathcal {L}}}^e\). It is defined as follows (see Footnote 3):

$$\begin{aligned} {{\mathcal {L}}}^e = \lambda _1 \ \mathcal {L} - \lambda _2 \ Q - \lambda _3 \ J_M - \lambda _4 \ c_M - \lambda _5 \ \tau _\gamma \end{aligned}$$
(3.5)

Each component of Eq. 3.5 is weighted with a factor \(\lambda _i\) defining its importance. The value of \(\lambda _i\) belongs to the real interval [0, 1]; moreover, the sum of all weights is equal to 1, i.e., \(\sum _{i=1}^5 \lambda _i = 1\). \({\mathcal {L}}^e\) contains both the information carried by \({{\mathcal {L}}}\) and the information derived through our framework. In defining \({{\mathcal {L}}}^e\) we start from \({{\mathcal {L}}}\) (and, thus, we do not give up its contribution) because \({{\mathcal {L}}}\) is tailored to the Machine Learning task that \({\mathcal {GNN}}\) is supposed to solve, such as the binary cross-entropy for binary classification, the mean squared error for regression, etc. Moreover, we add to it the information carried by the four measures defined above.
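The combination in Eq. 3.5 can be written as a small function. This is only a sketch on plain numbers: the function name and argument order are hypothetical, and how the four measures enter gradient computation in an actual training framework is not addressed here.

```python
import math


def enhanced_loss(base_loss, q, j_m, c_m, tau, lambdas):
    """Evaluate L^e (Eq. 3.5) from the task loss and the four measures.

    lambdas = (lambda_1, ..., lambda_5); each must lie in [0, 1] and
    together they must sum to 1, as required in the text.
    """
    if len(lambdas) != 5:
        raise ValueError("exactly five weights are expected")
    if any(lam < 0.0 or lam > 1.0 for lam in lambdas):
        raise ValueError("each weight must belong to [0, 1]")
    if not math.isclose(sum(lambdas), 1.0, abs_tol=1e-9):
        raise ValueError("the weights must sum to 1")
    l1, l2, l3, l4, l5 = lambdas
    # The task loss is added; the four measures are subtracted.
    return l1 * base_loss - l2 * q - l3 * j_m - l4 * c_m - l5 * tau


# With lambda_1 = 1, L^e reduces to the original loss L.
print(enhanced_loss(0.7, 0.4, 0.6, 0.3, 0.5, (1.0, 0.0, 0.0, 0.0, 0.0)))
```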

It is worth noting that all the measures we identified were added with a negative sign. Therefore, the greater their value, the smaller the value of \({{\mathcal {L}}}^e\). In this way, we are forcing \({\mathcal {GNN}}\) to consider the structural information resulting from the current \(G^e(t)\) and the differences between \(G^e(t)\) and G. Clearly, each measure has its own specific impact in the training of \({\mathcal {GNN}}\). In particular, maximizing Q implies that \(G^e(t)\) should have the highest possible modularity, which could be extremely important in those tasks requiring communities to be as partitioned as possible. The maximization of \(J_M\) implies that we want to have a good overlap between the communities in G and \(G^e(t)\), and therefore that \({\mathcal {GNN}}\) should not disrupt the initial network structure. Maximizing \(c_M\) could lead \(G^e(t)\) to have a higher mean weighted clustering coefficient, and thus stronger triads. Finally, maximizing \(\tau _\gamma\) implies that the ranking of nodes with respect to the centrality measure \(\gamma\) should be preserved when moving from G to \(G^e(t)\). The details of the proposed loss function and how it changes the training and validation processes of the model \({\mathcal {GNN}}\) are provided in Algorithm 2.

Algorithm 2 Model evaluation and enhancement performed by our framework

As reported in this algorithm, after splitting G into three subgraphs \(G_{train}\), \(G_{val}\) and \(G_{test}\), our framework trains \({\mathcal {GNN}}\) for \(\eta\) epochs. During the current epoch t, it calculates \({{\mathcal {L}}}_{train}\) by applying the loss function chosen for the task performed by the \({\mathcal {GNN}}\). Then, it applies \({\mathcal {GNN}}\) on \(G_{train}\) to compute node embeddings. Afterward, it computes \(G^e_{train}(t)\) starting from \(G_{train}\) and node embeddings computed previously. After that, it proceeds to calculate the analysis measures Q, \(J_M\), \(c_M\) and \(\tau _\gamma\) from \(G^e(t)\) and, starting from them and \({{\mathcal {L}}}_{train}\), calculates \({{\mathcal {L}}}^e_{train}\). At this point, it begins the validation process during which it first computes \({{\mathcal {L}}}_{val}\) and, then, \(G^e_{val}(t)\) starting from \(G_{val}\) and the node embeddings returned by \({\mathcal {GNN}}\). Afterward, it computes Q, \(J_M\), \(c_M\), \(\tau _\gamma\) and \({{\mathcal {L}}}^e_{val}\). If \({\mathcal {L}}^e_{val}\) is less than the current minimum value of the loss function, it means that, during the current epoch t, \({\mathcal {GNN}}\) has been improved. In this case, our framework saves the new weights of \({\mathcal {GNN}}\) and considers \({{\mathcal {L}}}^e_{val}\) as the new current minimum value of the loss function.
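The control flow just described, in particular the rule that saves the weights whenever \({{\mathcal {L}}}^e_{val}\) reaches a new minimum, can be sketched as follows. The two callables `run_epoch` and `validate` are hypothetical placeholders for the training and validation steps of the algorithm; the sketch does not reproduce how the embeddings, \(G^e(t)\) or the measures are computed.

```python
import copy


def train_with_checkpointing(model_state, epochs, run_epoch, validate):
    """Keep the model state with the lowest validation value of L^e.

    run_epoch(state, t) -> state after training epoch t on G_train
                           (assumed to optimize L^e_train internally).
    validate(state, t)  -> L^e_val computed on G_val at epoch t.
    Both callables are placeholders introduced for this sketch.
    """
    best_loss = float("inf")
    best_state = copy.deepcopy(model_state)
    history = []
    for t in range(1, epochs + 1):
        model_state = run_epoch(model_state, t)
        val_loss = validate(model_state, t)
        history.append(val_loss)
        if val_loss < best_loss:
            # The model improved during epoch t: save its weights and
            # record the new minimum of the loss function.
            best_loss = val_loss
            best_state = copy.deepcopy(model_state)
    return best_state, best_loss, history


# Toy usage: the "state" is just an epoch counter.
fake_val = {1: 3.0, 2: 1.0, 3: 2.0}
state, loss, hist = train_with_checkpointing(
    0, 3, lambda s, t: s + 1, lambda s, t: fake_val[t]
)
print(state, loss, hist)
```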

Finally, we would like to point out that the formula of \({{\mathcal {L}}}^e\) specified in Eq. 3.5 is generic, does not depend on the starting loss function adopted and can be applied on any GNN, regardless of the Machine Learning task it must perform. Clearly, it is also possible to use only some, or even just one, of the analysis measures by setting to 0 the weights \(\lambda _i\) corresponding to the measures that we do not want to employ. Indeed, starting from \({{\mathcal {L}}}^e\), by setting some of its weights to 0, it is possible to obtain various specific functions, which can be used to study the contribution of each measure on the scenario under consideration.

In particular, by setting \(\lambda _3=\lambda _4=\lambda _5=0\), we obtain a particular version of \({{\mathcal {L}}}^e\), which we call \({\mathcal {L}}^e_Q\), defined as:

$$\begin{aligned} {{\mathcal {L}}}^e_Q = \lambda _1 \ \mathcal {L} - \lambda _2 \ Q \end{aligned}$$
(3.6)

Now, since the sum of the weights of \({{\mathcal {L}}}^e\) must be equal to 1, we have that \(\lambda _2 = 1 - \lambda _1\). Furthermore, by setting \(\lambda = 1 - \lambda _1\), we have that \({{\mathcal {L}}}^e_Q\) can be defined as:

$$\begin{aligned} {{\mathcal {L}}}^e_Q = (1 - \lambda ) \ \mathcal {L} - \lambda \ Q \end{aligned}$$
(3.7)

This specialization of \({{\mathcal {L}}}^e\) emphasizes the role of modularity in the training of a GNN. Therefore, it can be very useful when we want to analyze this role. In fact, we employ it in Sect. 4.3, devoted to this task.

Instead, by setting \(\lambda _2=\lambda _4=\lambda _5=0\) and proceeding similarly to what we have seen for \({{\mathcal {L}}}^e_Q\), we obtain a particular version of \({{\mathcal {L}}}^e\), which we call \({{\mathcal {L}}}^e_J\), defined as:

$$\begin{aligned} {{\mathcal {L}}}^e_J = (1 - \lambda ) \ \mathcal {L} - \lambda \ J_M \end{aligned}$$
(3.8)

\({{\mathcal {L}}}^e_J\) emphasizes the role of the Jaccard coefficient \(J_M\) in the training of a GNN. We employ it in Sect. 4.3, where this role is studied in detail.

Similarly, by setting \(\lambda _2 = \lambda _3 = \lambda _5 = 0\) and performing the same operations as in the previous two cases, we obtain a particular version of \({{\mathcal {L}}}^e\), which we call \({\mathcal {L}}^e_{c_M}\), defined as:

$$\begin{aligned} {{\mathcal {L}}}^e_{c_M} = (1 - \lambda ) \ \mathcal {L} - \lambda \ c_M \end{aligned}$$
(3.9)

\({{\mathcal {L}}}^e_{c_M}\) emphasizes the role of the clustering coefficient in the training of a GNN. We adopt it in Sect. 4.4, devoted to studying this role.

We end this presentation of the specializations of \({{\mathcal {L}}}^e\) (albeit many more could be defined by appropriately setting the weights \(\lambda _i\)) by setting \(\lambda _2 = \lambda _3 = \lambda _4 = 0\) and proceeding as in the previous cases. We obtain a particular version of \({{\mathcal {L}}}^e\), which we call \({{\mathcal {L}}}^e_\gamma\), defined as:

$$\begin{aligned} {{\mathcal {L}}}^e_\gamma = (1 - \lambda ) \ \mathcal {L} - \lambda \ \tau _\gamma \end{aligned}$$
(3.10)

\({{\mathcal {L}}}^e_\gamma\) emphasizes the role of the centrality measure \(\gamma\) in the training of a GNN. It will be employed in Sect. 4.5 where we investigate the role of the degree centrality in the training of a GNN.

In Eq. 3.7 (resp., 3.8, 3.9, 3.10), the higher the value of \(\lambda\), the higher the weight assigned to Q (resp., \(J_M\), \(c_M\), \(\tau _\gamma\)) and the lower the weight assigned to \({{\mathcal {L}}}\). Clearly, when \(\lambda =0\), the GNN is trained only with \({{\mathcal {L}}}\), while when \(\lambda =1\), the information carried by Q (resp., \(J_M\), \(c_M\), \(\tau _\gamma\)) is the only one that contributes to the training of the GNN.
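The single-measure specializations in Eqs. 3.7 to 3.10 share the same form, so one helper covers all of them. The sketch below uses a hypothetical function name and an illustrative grid of \(\lambda\) values, and shows the two limit cases \(\lambda =0\) and \(\lambda =1\).

```python
def specialized_loss(base_loss, measure, lam):
    """Evaluate (1 - lambda) * L - lambda * M, where M is one of
    Q, J_M, c_M or tau_gamma (Eqs. 3.7 to 3.10)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must belong to [0, 1]")
    return (1.0 - lam) * base_loss - lam * measure


# Illustrative grid of lambda values from 0 to 1 in steps of 0.1
# (an assumption made for this example).
grid = [round(0.1 * step, 1) for step in range(11)]

base_loss, q = 0.8, 0.4  # made-up values, for illustration only
for lam in grid:
    print(lam, specialized_loss(base_loss, q, lam))
```

At \(\lambda =0\) the helper returns \({{\mathcal {L}}}\) unchanged, while at \(\lambda =1\) it returns the negated measure, so the task loss no longer contributes.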

4 Experiments

In this section, we present the experiments we conducted to test our framework. Specifically, in Sect. 4.1, we provide an overview of the datasets used. In Sect. 4.2, we describe the GNN employed. In Sects. 4.3, 4.4 and 4.5, we illustrate the experiments regarding community structures, clustering coefficient and centrality measures, respectively. Finally, in Sect. 4.6, we evaluate the ability of our framework to enhance the training of the underlying GNN thanks to the information extracted through it.

4.1 Datasets

To ensure a robust validation of our framework, in our experiments we employed six different datasets widely adopted in the fields of GNN and Network Analysis. By leveraging these datasets we could thoroughly evaluate the performance and effectiveness of our framework under various conditions and scenarios. The first dataset is Cora [55]; it consists of a citation network in which each node represents a scholarly paper and an arc connecting two nodes indicates that one paper cites the other. The second dataset is Chameleon [56], which contains page-page networks focusing on particular topics, such as chameleons. In this dataset, nodes correspond to articles while edges indicate reciprocal links between articles. The third dataset is Actor [57]; it is a network of actor co-occurrences; in it, each node represents an actor while an edge between two nodes denotes the co-occurrence of the corresponding actors on the same Wikipedia page. The fourth, fifth, and sixth datasets are Cornell, Texas and Wisconsin [58], which are all of the same type with nodes representing web pages and arcs denoting hyperlinks from one page to another. The statistics of the six datasets used are shown in Table 1.

Table 1 Some statistics of the adopted datasets

For the sake of space, we decided to work only with one Machine Learning task typically performed by GNNs, namely node classification. However, we point out that our framework can handle any Machine Learning task as long as node embeddings can be extracted through it.

4.2 Reference GNN model

The GNN model we adopted for our experiments is the Feature Selection Graph Neural Network (FSGNN) introduced in [59]. We decided to use this model because of its high performance in the node classification task, which makes it a good baseline to evaluate the effectiveness of our framework. FSGNN is a two-layered GNN model designed to handle node classification tasks. Because of the significant differences between the datasets of interest (see Table 1), we had to provide different values to its hyperparameters, depending on the dataset on which it was applied. The hyperparameters of FSGNN are: (i) the decay of the weights for the first and second fully connected layers (\(WD_{fc1}\) and \(WD_{fc2}\)), (ii) the learning rate of the fully connected layers (\(LR_{fc}\)), and (iii) the dropout rate. The corresponding values are reported in Table 2.

Table 2 Hyperparameters of FSGNN for each dataset

4.3 Experiments on community structures

In this section, we describe the experiments performed to analyze the communities in G and \(G^e(t)\) for the datasets selected.

First, we tested whether the modularity Q of \(G^e(t)\) changed during the training epochs, which could show that FSGNN was creating stronger or weaker communities during its learning. To this end, we trained FSGNN for 100 epochs. At the end of each epoch, we extracted the communities from \(G^e(t)\) through the Louvain and Clauset-Newman-Moore algorithms and, after that, calculated the modularity of \(G^e(t)\). We used two different community extraction algorithms to better verify the stability of the results obtained. The corresponding results for each dataset are shown in Fig. 1.

Fig. 1 Modularity Q of \(G^e(t)\) against training epochs for the datasets of interest

From the analysis of this figure, we can observe that in Cora, Chameleon, Actor, Cornell and Wisconsin the modularity of \(G^e(t)\) increased as the number of epochs increased. This means that FSGNN created more cohesive communities than those in the original graph. Specifically, across the six datasets, the increase in Q ranges from 1.8%, obtained for the Texas dataset, to 12.70%, obtained for the Cornell dataset. Only in the Texas dataset do we not observe a stable growth of Q, due to some fluctuations. However, even in this less favorable case, the modularity of \(G^e(t)\) at the end of training is greater than that at the beginning. This is consistent with our hypothesis that the modularity grows as the number of epochs increases.
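For readers who wish to reproduce the per-epoch computation of Q, the sketch below evaluates the modularity of a weighted undirected graph for a given partition. It assumes the standard Newman formulation of weighted modularity, which may differ in detail from the definition adopted earlier in the paper, and it takes the partition as input, i.e., it does not implement the Louvain or Clauset-Newman-Moore algorithms. The function name and graph representation are hypothetical.

```python
def modularity(adj, communities):
    """Weighted modularity of a partition (standard Newman formulation).

    adj is a symmetric dict of dicts with adj[u][v] == adj[v][u] == w_uv;
    communities is an iterable of disjoint node sets covering the graph.
    """
    # Summing over both directions counts every edge twice: this is 2m.
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    q = 0.0
    for community in communities:
        # Internal weight, also counted twice (once per direction).
        internal = sum(adj[u].get(v, 0.0) for u in community for v in community)
        # Total weighted degree of the nodes in the community.
        degree = sum(sum(adj[u].values()) for u in community)
        q += internal / two_m - (degree / two_m) ** 2
    return q


# Two disjoint unit-weight triangles, each taken as one community.
triangles = {
    1: {2: 1.0, 3: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {1: 1.0, 2: 1.0},
    4: {5: 1.0, 6: 1.0}, 5: {4: 1.0, 6: 1.0}, 6: {4: 1.0, 5: 1.0},
}
print(modularity(triangles, [{1, 2, 3}, {4, 5, 6}]))
```

Placing all six nodes in a single community yields a modularity of 0 under this formulation, which is why the two-community partition scores higher in the toy example.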

To test whether modularity can improve the performance of FSGNN, we added the information associated with Q into the model training. For this purpose, we adopted the specialization of \({{\mathcal {L}}}^e\) called \({{\mathcal {L}}}^e_Q\) defined in Eq. 3.7. We repeated the experiments five times with different training and testing splits. In Tables 3 and 4, we report the average values of the classification performance metrics we obtained. In these tables, as well as in the next ones, we use bold to indicate the maximum values in the corresponding column. For these tables, as well as for Tables 5 and 6 below, we report only the results obtained with the communities returned by the Louvain algorithm. However, the results obtained with the communities returned by the Clauset-Newman-Moore algorithm are similar.

Table 3 Average values of classification metrics obtained by FSGNN using Q in the training loss function—Cora, Chameleon, and Actor datasets
Table 4 Average values of classification metrics obtained by FSGNN using Q in the training loss function—Cornell, Texas and Wisconsin datasets

From the analysis of these tables, we can observe that the use of Q in the training loss function of FSGNN increases the classification performance of this model. In the Cora and Chameleon datasets, the best results are obtained for \(\lambda =0.1\); in this case, the values of the classification metrics increase only slightly. Instead, in the Actor dataset, we achieve notable improvements when \(\lambda = 0.9\), in which case Accuracy increases by 3.2%, Precision by 23.6% and Recall by 8.7%. Similarly, in the Cornell, Texas and Wisconsin datasets, we obtain notable improvements when \(\lambda =0.8\) or \(\lambda =0.9\). More specifically, for the Cornell dataset, Accuracy increases by 7.1%, Precision by 6.7% and Recall by 9.5%. As for the Texas dataset, Accuracy increases by 10.7%, Precision by 15.5% and Recall by 19.6%. Finally, for the Wisconsin dataset, Accuracy increases by 3.3%, Precision by 9.9% and Recall by 4.7%. Hence, we can conclude that the adoption of \(G^e(t)\) and Q during the training of FSGNN results in significant improvements in classification performance in four datasets (i.e., Actor, Cornell, Texas, and Wisconsin), while it leads to marginal improvements in the remaining two (i.e., Cora and Chameleon).

Still for the community investigation task, we next tested the role of the Jaccard coefficient \(J_M\) to see whether and to what extent the overlapping level between G and \(G^e(t)\) varies against training epochs. Again, we trained FSGNN for 100 epochs. At the end of each epoch, we extracted communities from G and \(G^e(t)\) using the Louvain and Clauset-Newman-Moore algorithms. Afterward, we calculated \(J_M\) by applying the procedure described in Algorithm 1. In Fig. 2, we show the values of \(J_M\) during the training of FSGNN.

Fig. 2 Jaccard coefficient \(J_M\) of \(G^e(t)\) against training epochs for the datasets of interest

From the analysis of this figure, we can see that \(J_M\) tends to decrease as the number of epochs increases. In fact, at epoch 0, when training has not yet begun, \(J_M\) is equal to 1, which implies that the communities in G and \(G^e(t)\) are the same. Then \(J_M\) decreases as the number of epochs increases and, at the end of the 100th epoch, it reaches a value that is always less than its original one. This occurs regardless of which algorithm we use to detect communities. In particular, for the Cora and Chameleon datasets, \(J_M\) decreases rapidly and reaches a value close to 0.65, which means that there is a 65% overlap between the communities of G and \(G^e(t)\). A similar reasoning applies to the Texas and Wisconsin datasets, for which \(J_M\) decreases rapidly and reaches a value close to 0.60. As for the Actor dataset, \(J_M\) decreases very rapidly and settles at 0.40 when the Clauset-Newman-Moore algorithm is adopted, while it settles at 0.10 when the Louvain algorithm is employed. Finally, for the Cornell dataset, \(J_M\) fluctuates in a range between 0.86 and 0.92.

Next, we tested whether \(J_M\) could improve the performance of FSGNN in classifying nodes. To this end, we added the information brought by \(J_M\) into the training process of FSGNN. To do this, we employed the specialization \({{\mathcal {L}}}^e_J\) of the function \({{\mathcal {L}}}^e\) defined in Eq. 3.8. We repeated the training of FSGNN five times with different training and testing splits. We report the average values of the classification metrics in Tables 5 and 6.

Table 5 Average values of classification metrics obtained by FSGNN using \(J_M\) in the training loss function—Cora, Chameleon, and Actor datasets
Table 6 Average values of classification metrics obtained by FSGNN using \(J_M\) in the training loss function—Cornell, Texas and Wisconsin datasets

From the analysis of these tables, we can see that, in many cases, \(J_M\) improves the performance of FSGNN although there are differences among the datasets. In fact, in the Cora and Chameleon datasets, we observe slight improvements in Accuracy, Precision, and Recall. In the Actor dataset, Precision increases by 16.7% when \(\lambda =0.1\), while Accuracy and Recall do not show significant changes. Furthermore, in the Cornell dataset, we get an increase in Accuracy of 7.8%, an increase in Precision of 10.0%, and an increase in Recall of 12.9% for \(\lambda =0.8\). Similarly, in the Texas dataset, we obtained a significant increase. In contrast, in the Wisconsin dataset, we can see that Accuracy increased by 2.4%, while Precision and Recall decreased by 1.7% and 3.2%, respectively.

In conclusion, we observe that the knowledge on community structures extracted during the training of FSGNN can be used to improve its performance. For this purpose, it is possible to employ the two analysis measures Q and \(J_M\) that summarize the knowledge on the variations of community structures extracted during the training of FSGNN.

4.4 Experiments on clustering coefficient

The first experiment on the role of the clustering coefficient in the training of FSGNN aimed to analyze the evolution of the values of the mean weighted clustering coefficient \(c_M\) in \(G^e(t)\) after each training epoch. Specifically, we trained FSGNN for 100 epochs and, at the end of each epoch, we created \(G^e(t)\), calculated the weighted clustering coefficients for all nodes, and finally computed \(c_M\). The results obtained are shown in Fig. 3.

Fig. 3 Average weighted clustering coefficient \(c_M\) of \(G^e(t)\) against training epochs for the datasets of interest

From the analysis of this figure we can observe that, in all cases, the value of \(c_M\) decreases as the number of epochs increases. This means that the triads present in G lose power during the training of FSGNN. In particular, we observe a decrease in \(c_M\) ranging from 6.2% in the Cora dataset to 62.5% in the Wisconsin dataset. This is an interesting result in that it tells us that the training of FSGNN leads to a decrease in the power exerted by the weighted triads of G.

To check whether and how much \(c_M\) exerts an influence on the training of FSGNN, we employed the specialization \({{\mathcal {L}}}^e_{c_M}\) of the function \({{\mathcal {L}}}^e\), defined in Eq. 3.9. Afterward, we tested \(\mathcal {L}^e_{c_M}\) for different values of \(\lambda\). Again, we repeated this experiment five times with different training and testing splits. In Tables 7 and 8, we report the results obtained.

Table 7 Average values of classification metrics obtained by FSGNN using \(c_M\) in the training loss function—Cora, Chameleon, and Actor datasets
Table 8 Average values of classification metrics obtained by FSGNN using \(c_M\) in the training loss function—Cornell, Texas and Wisconsin datasets

From the analysis of these tables, we can observe trends similar to those we have seen for community structures. In particular, we note that the benefits of using \(c_M\) in the Cora and Chameleon datasets are small, while they become significant for the Actor, Cornell, Texas and Wisconsin datasets. In fact, as for the Actor dataset, we achieved an increase in Precision of 15.6%, in Accuracy of 1.5% and in Recall of 1.7%. As for the Cornell dataset, we obtained an increase in Accuracy of 7.1%, in Precision of 6.8% and in Recall of 9.5%. In the case of Texas, we achieved a higher Accuracy of 10.7%, a higher Precision of 15.7% and a higher Recall of 19.6%. Finally, in the Wisconsin dataset, Accuracy increased by 3.3%, Precision by 9.9% and Recall by 4.7%.

The results obtained lead us to conclude that the mean weighted clustering coefficient \(c_M\) of \(G^e(t)\) changes against training epochs. Therefore, it could be a relevant factor to be investigated during the learning of a GNN model. As evidence of this, the introduction of \(c_M\) in the loss function allowed us to obtain higher values of the performance metrics than those of the baseline.

4.5 Experiments on centrality measures

Our experiment on centrality measures involved the analysis of the trend of Kendall’s tau coefficient \(\tau _\gamma\) against training epochs. Recall that \(\tau _\gamma\) is an indicator of the agreement of the rankings of the main nodes of G and \(G^e(t)\) with respect to the centrality measure \(\gamma\). The first decision we had to make was the choice of \(\gamma\) and of the way to select the main nodes. Regarding centrality measures, we know from Network Analysis theory that there are four main centrality measures, namely Degree Centrality, Closeness Centrality, Betweenness Centrality and Eigenvector Centrality [50]. However, only the first is fast to compute, so that it has a low impact on the training time of the GNN; instead, the others have a long computation time. Therefore, we chose Weighted Degree Centrality as the benchmark centrality measure. Regarding the choice of the main nodes, we know that Degree Centrality follows a power law distribution [50]. Therefore, we thought of using the Pareto principle underlying such a distribution and selected the 20% of nodes with the highest Weighted Degree Centrality values as the most important nodes.
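The selection of the main nodes can be sketched as follows. The sketch assumes that the Weighted Degree Centrality of a node is the sum of the weights of its incident edges, without normalization (a constant normalization would not change the ranking). The function names, the rounding rule for the 20% cut and the tie-breaking by node label are choices made only for this illustration.

```python
def weighted_degree_centrality(adj):
    """Sum of incident edge weights for every node.

    adj is a symmetric dict of dicts: adj[u][v] == adj[v][u] == w_uv.
    """
    return {node: sum(nbrs.values()) for node, nbrs in adj.items()}


def top_nodes(scores, fraction=0.2):
    """Return the given fraction of nodes with the highest scores.

    At least one node is always returned; ties are broken by node
    label so that the result is deterministic (assumed convention).
    """
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(scores, key=lambda node: (-scores[node], str(node)))
    return ranked[:k]


graph = {
    "a": {"b": 2.0, "c": 1.0},
    "b": {"a": 2.0},
    "c": {"a": 1.0, "d": 0.5},
    "d": {"c": 0.5},
    "e": {},
}
scores = weighted_degree_centrality(graph)
print(top_nodes(scores))  # the top 20% of five nodes is a single node
```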

At this point, similar to what we did for the previous measures, we trained FSGNN for 100 epochs. At the end of each epoch, we calculated \(\tau _{wd}\) (i.e., the Kendall’s tau coefficient specialized to Weighted Degree Centrality) for each node in the training split and then averaged the values thus obtained. The corresponding results are reported in Fig. 4.

Fig. 4 Kendall’s tau coefficient \(\tau _{wd}\), computed on the Weighted Degree Centrality, against training epochs for the datasets of interest

From the analysis of this figure, we observe that \(\tau _{wd}\) tends to decrease rapidly during the training of FSGNN. In fact, in all cases, we start with \(\tau _{wd}=1\) and, in a few epochs, reach a value of \(\tau _{wd}\) between 0 (for Cora, Chameleon, Actor and Cornell) and 0.4 (for Texas and Wisconsin). This result shows that the rankings of the most central nodes of G and \(G^e(t)\) are different. This tells us that, during the training of FSGNN, the weighted degree centrality of the nodes of \(G^e(t)\), and consequently the weights of the edges incident on them, substantially change.

After that, we wanted to test whether these changes had a positive impact on the training of FSGNN. To this end, similar to what we have done for the previous measures, we employed the specialization \({{\mathcal {L}}}^e_\gamma\) of the function \({{\mathcal {L}}}^e\) defined in Eq. 3.10. Since, in this experiment, we chose the Weighted Degree Centrality as centrality measure, for the sake of clarity, we prefer to write \({\mathcal {L}}^e_{wd}\) and \(\tau _{wd}\), instead of \({{\mathcal {L}}}^e_\gamma\) and \(\tau _\gamma\). \({{\mathcal {L}}}^e_{wd}\) is defined as: \({{\mathcal {L}}}^e_{wd} = (1 - \lambda ) \ \mathcal {L} - \lambda \ \tau _{wd}\). We want to point out again that this is only a change of notation to increase the clarity of presentation. We repeated this experiment five times with different training and testing splits. The average results thus obtained are reported in Tables 9 and 10.

Table 9 Average values of classification metrics obtained by FSGNN using \(\tau _{wd}\) in the training loss function—Cora, Chameleon, and Actor datasets
Table 10 Average values of classification metrics obtained by FSGNN using \(\tau _{wd}\) in the training loss function—Cornell, Texas and Wisconsin datasets

From the analysis of these tables, we can see that \(\tau _{wd}\) has a positive impact on the training of FSGNN on average. In particular, as for Cora, there is a small increase in Precision and Recall. In the case of Chameleon, improvement is negligible. As for Actor, Accuracy increased by 2.1% (for \(\lambda =0.9\)), Precision by 15.6% (for \(\lambda =0.4\)) and Recall by 2.3% (for \(\lambda =0.9\)). Instead, as for Cornell, Accuracy increased by 7.2%, Precision by 9.1% and Recall by 11.4% when \(\lambda =0.8\). In Texas, Accuracy grew by 10.2%, Precision by 15.3% and Recall by 21.1% when \(\lambda =0.8\). Finally, in Wisconsin, Accuracy increased by 3.7% (for \(\lambda =0.8\)), Precision by 7.4% (for \(\lambda =0.9\)) and Recall by 5.7% (for \(\lambda =0.8\)). As a consequence, we can conclude that, in our setting, this measure also has an impact on the training of FSGNN because in almost all cases its introduction into the loss function resulted in an improvement in classification performance compared with the baseline.

4.6 Experiments on the structural information in the GNN training

Having tested the improvements in the training of a GNN made by each analysis measure separately from the others, we now want to verify the effectiveness of \(\mathcal {L}^e\) (see Eq. 3.5), which takes into account the contributions of all the analysis measures at once. Recall that, in the definition of \(\mathcal {L}^e\), each of the weights \(\lambda _i\), \(1 \le i \le 5\), associated with \(\mathcal {L}\) and the four analysis measures under consideration, ranges in the real interval [0, 1], and \(\sum _{i=1}^5 \lambda _i = 1\). Therefore, by adjusting these weights, we could define many versions of \(\mathcal {L}^e\) according to our needs (see, for instance, Eqs. 3.7 to 3.10). In all the previous tests, we considered scenarios in which only two values of \(\lambda _i\) were different from 0; one of them was always \(\lambda _1\) while the other depended on the analysis measure we wanted to test. It would now be interesting to consider 3, 4 and 5 weights \(\lambda _i\) different from 0 in \(\mathcal {L}^e\). Unfortunately, exhaustively testing all possible combinations of values of \(\lambda _i\) is computationally wasteful and impractical. Therefore, in conducting the experiments described in this section, we had to adopt an approach different from the one of the previous experiments.

To this end, we decided to add the weights \(\lambda _i\) as learnable parameters of FSGNN. In this way, their values are learned during the training of the model. As a result, we do not have to check all the values but, hopefully, we obtain the best ones by solving the node classification task. Moreover, in all the previous experiments, we have seen that excluding the contribution of the initial loss function \({{\mathcal {L}}}\) often returned very low results (see the last rows of Tables 3, 4, 5, 6, 7, 8, 9, 10). This highlights that \({{\mathcal {L}}}\) is critical to the Machine Learning task performed by FSGNN. Consequently, we decided to keep \(\lambda _1\) (i.e., the weight associated with \({{\mathcal {L}}}\) in \({\mathcal {L}}^e\)) in all tests. The results of the experiments for the various combinations of \(\lambda _i\) kept in the test and the corresponding performances obtained by FSGNN are shown in Tables 11 and 12.

In these tables, the row “Baseline” corresponds to the case in which we train FSGNN with only the loss function \({{\mathcal {L}}}\) and, therefore, to the case in which \(\lambda _1=1\) and \(\lambda _i=0\), \(2 \le i \le 5\). Each of the other rows corresponds to a choice of the parameters \(\lambda _i\) that we decide to keep and train. The values of the parameters not present in that row are set to 0. For example, in the second row of the two tables, we decided to keep only the weights \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\); consequently, \(\lambda _4 =0\) and \(\lambda _5=0\). The values of the three weights \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\) are initially set equal to each other and, consequently, equal to \(\frac{1}{3}\). At each iteration of the training, the values of the weights vary. To give an example, let us consider the last row of Tables 11 and 12. It represents a scenario in which all the weights \(\lambda _i\), \(1 \le i \le 5\), are kept and their values are tuned during training. At the first iteration of training, \(\lambda _i=0.2\), \(1 \le i \le 5\). Such a scenario returns the maximum values of Accuracy, Precision and Recall for the Cornell and Wisconsin datasets, the maximum value of Precision for the Actor dataset and the maximum values of Accuracy and Recall for the Texas dataset. Regarding the learned values of the weights \(\lambda _i\), \(1 \le i \le 5\), in this row we obtained: (i) for the Cornell dataset: \(\lambda _1=0.12\), \(\lambda _2=0.22\), \(\lambda _3=0.22\), \(\lambda _4=0.22\), \(\lambda _5=0.22\); (ii) for the Texas dataset: \(\lambda _1=0.12\), \(\lambda _2=0.25\), \(\lambda _3=0.25\), \(\lambda _4=0.23\), \(\lambda _5=0.15\); (iii) for the Wisconsin dataset: \(\lambda _1=0.12\), \(\lambda _2=0.23\), \(\lambda _3=0.22\), \(\lambda _4=0.22\), \(\lambda _5=0.20\); (iv) for the Actor dataset: \(\lambda _1=0.12\), \(\lambda _2=0.24\), \(\lambda _3=0.24\), \(\lambda _4=0.23\), \(\lambda _5=0.18\).
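The configurations described above can be enumerated mechanically. The sketch below lists every subset of weights that contains \(\lambda _1\) and has 3, 4 or 5 non-zero weights, together with its equal initial values. The function name `weight_configurations` is hypothetical, and the enumeration only reflects the constraints stated in the text; it is not meant to reproduce the exact rows of Tables 11 and 12.

```python
from itertools import combinations


def weight_configurations(n_terms=5, min_kept=3):
    """Yield (kept_indices, initial_weights), with lambda_1 always kept.

    Indices are 1-based, so kept == (1, 2, 3) means that lambda_1, lambda_2
    and lambda_3 are trained, while lambda_4 and lambda_5 are fixed to 0.
    """
    others = range(2, n_terms + 1)  # indices of lambda_2 .. lambda_n
    for size in range(min_kept, n_terms + 1):
        for extra in combinations(others, size - 1):
            kept = (1,) + extra
            # Kept weights start equal to each other and sum to 1.
            init = [1.0 / size if i in kept else 0.0
                    for i in range(1, n_terms + 1)]
            yield kept, init


configs = list(weight_configurations())
for kept, init in configs:
    print(kept, [round(v, 3) for v in init])
```

The first configuration produced keeps \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\) with initial values \(\frac{1}{3}\), and the last one keeps all five weights with initial values 0.2, matching the two examples discussed above.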

Table 11 Average values of classification metrics obtained by FSGNN using different combinations of \(\mathcal {L}^e\) as training loss function—Cora, Chameleon and Actor datasets
Table 12 Average values of classification metrics obtained by FSGNN using different combinations of \(\mathcal {L}^e\) as training loss function—Cornell, Texas and Wisconsin datasets

From the analysis of Tables 11 and 12, we observe some interesting results. First of all, they are in line with the ones seen in the previous tests, although we employed a completely different workaround in this experiment. In particular, similar to the previous tests, the improvements achieved by our framework when applied to Cora and Chameleon are slight. As a matter of fact, the performances achieved are similar to those of the baseline, which means that the framework did not alter the training of FSGNN too much. As for Actor, the performances achieved are similar to those of the baseline with the exception of Precision, which, with the use of all five weights, achieves a substantial increase. We observe that FSGNN has the most significant improvements for Cornell, Texas, and Wisconsin. These are smaller datasets than the previous ones and all refer to the same context. These observations might prompt some preliminary considerations about the role that dataset size can play in the behavior of our framework. However, before coming to more definitive conclusions, we feel it appropriate to explore this aspect further in the future.

More specifically, the results obtained by our framework using Actor, Cornell, Texas and Wisconsin are promising. In fact, in almost all cases, the values of the classification metrics are greater than those of the baselines. In particular, as for Actor, while Accuracy remains the same, Precision increases by 13.77% using all five weights. Recall increases by 1.33% when we use the weights \(\lambda _1\), \(\lambda _2\), \(\lambda _3\) and \(\lambda _5\); if we use all five weights, Recall still increases, but by a smaller amount, namely 0.67%. As for Cornell, the highest values of Accuracy, Precision and Recall are obtained using all five weights, and thus all the analysis measures. In this case, compared to the baseline, the values of Accuracy, Precision and Recall increase by 5.68%, 10.00% and 7.86%, respectively. As for Texas, the highest values of Accuracy and Recall are obtained using all five weights. In this case, compared to the baseline, Accuracy increases by 6.99%, Precision by 5.60% and Recall by 17.00%. For this dataset, however, other configurations provide an even greater increase in Precision, equal to 10.81%. Finally, as for Wisconsin, the highest values of the classification metrics are again obtained using all five weights. In this case, compared to the baseline, Accuracy increases by 3.23%, Precision by 9.45% and Recall by 5.22%.

In conclusion, this experiment shows that our framework is capable of ensuring substantially higher classification performance than the baseline for the three smallest datasets, namely Cornell, Texas and Wisconsin. At present, it seems reasonable to hypothesize that dataset size plays a role in the performance of our approach. However, we think it is premature to draw firm conclusions about this. Certainly, the results obtained can be a significant starting point for further investigations in this direction.

5 Conclusion

In this paper, we have proposed a framework that employs the theory and techniques of Network Analysis to investigate the dynamics underlying the learned representations of a GNN. Our framework receives a graph as input and passes it to the GNN to be analyzed. The GNN returns the node embeddings corresponding to the graph received as input. Afterward, our framework uses the original graph and the corresponding node embeddings to derive insights concerning the behavior of the GNN. Then, it employs these insights to define a new loss function that accounts for the differences between the graph received as input and the one reconstructed from the embeddings returned by the GNN. Finally, it uses this loss function to enhance the training of the GNN in such a way as to improve its performance. We have also described a large set of experiments that confirmed the effectiveness of our framework.

The main contributions of this paper with respect to the existing literature are as follows: (i) it proposes a method to map the learned representations returned by a GNN onto the graph from which they were obtained; (ii) it uses the theory and techniques of Network Analysis to define a framework that, on many datasets of different size and nature, has been shown capable of extracting insights regarding the structure and behavior of the GNN; (iii) it defines a new loss function that, on various datasets of different size and nature, made it possible to assess and enhance the quality of the learning process of the GNN so that it can subsequently return better results.

This paper should not be considered as an ending point but rather as a starting point for further research on this topic. In particular, in the future, we would like to improve the representation of \(G^e(t)\) using a multilayer network [60] in order to be able to handle more details about the embeddings returned by the GNN during its training. The multilayer network representation could have a temporal component corresponding to the training epochs. Each layer could be associated with \(G^e(t)\) at a given epoch t, and layers could be connected to each other according to the training progress. As a further development, we would like to improve our framework so that it can handle other GNN architectures, such as Graph Autoencoders and Spatio-Temporal Graph Neural Networks. These two types of models bring new challenges, and it would be interesting to see if, how, and with what modifications our framework could address them. Last but not least, we would like to test our framework on other machine learning tasks where GNNs are used, such as graph classification, edge prediction and unsupervised scenarios.