struc2gauss: Structural role preserving network embedding via Gaussian embedding

Network embedding (NE) plays a principal role in network mining because it maps nodes into efficient low-dimensional embedding vectors. However, state-of-the-art NE methods have two major limitations: role preservation and uncertainty modeling. Almost all previous methods represent a node as a point in space and focus on local structural information, i.e., neighborhood information. Neighborhood information, however, does not capture global structural information, and point vector representations fail to model the uncertainty of node representations. In this paper, we propose a new NE framework, struc2gauss, which learns node representations in the space of Gaussian distributions and performs network embedding based on global structural information. struc2gauss first employs a given node similarity metric to measure the global structural information, then generates structural context for nodes, and finally learns node representations via Gaussian embedding. Different structural similarity measures of networks and energy functions of Gaussian embedding are investigated. Experiments conducted on real-world networks demonstrate that struc2gauss effectively captures global structural information where state-of-the-art network embedding methods fail to, outperforms other methods on structure-based clustering and classification tasks, and provides more information on the uncertainties of node representations.


Introduction
Network analysis consists of numerous tasks including community detection (Fortunato 2010), role discovery (Rossi and Ahmed 2015), link prediction (Liben-Nowell and Kleinberg 2007), etc. As relations exist between nodes that violate the i.i.d. assumption, it is non-trivial to apply traditional data mining techniques to networks directly. Network embedding (NE) fills the gap by mapping nodes in a network into a low-dimensional space according to their structural information in the network. It has been reported that using embedded node representations can achieve promising performance on many network analysis tasks (Cao et al. 2015; Grover and Leskovec 2016; Perozzi et al. 2014; Ribeiro et al. 2017).
Previous NE techniques mainly relied on eigendecomposition (Shaw and Jebara 2009; Tenenbaum et al. 2000), but the high computational complexity of eigendecomposition makes it difficult to apply to real-world networks. With the fast development of neural network techniques, unsupervised embedding algorithms have been widely used in natural language processing (NLP), where words or phrases from the vocabulary are mapped to vectors in the learned embedding space, e.g., word2vec (Mikolov et al. 2013a, b) and GloVe (Pennington et al. 2014). By drawing an analogy between paths consisting of several nodes on networks and word sequences in text, DeepWalk (Perozzi et al. 2014) learns node representations based on random walks using the same mechanism as word2vec. Afterwards, a series of studies has been conducted to improve DeepWalk either by extending the definition of neighborhood to higher-order proximity (Cao et al. 2015; Grover and Leskovec 2016; Perozzi et al. 2016; Tang et al. 2015b) or by incorporating more information for node representations such as attributes (Li et al. 2017; Wang et al. 2017) and heterogeneity (Chang et al. 2015; Tang et al. 2015a).
Although a variety of NE methods have been proposed, two major limitations exist in previous NE studies: role preservation and uncertainty modeling. Previous methods focused on only one of these two limitations while neglecting the other. In particular, for role preservation, most studies applied random walks to learn representations. However, random walk based embedding strategies and their higher-order extensions can only capture local structural information, i.e., first-order and higher-order proximity within the neighborhood of the target node (Lyu et al. 2017). Local structural information is reflected in the community structures of networks. But these methods may fail to capture global structural information, i.e., structural roles (Rossi and Ahmed 2015). Global structural information represents roles of nodes in networks, where two nodes have the same role if they are structurally similar from a global perspective. An example of global structural information (roles) and local structural information (communities) is shown in Fig. 1. In summary, nodes that belong to the same community require dense local connections while nodes that have the same role may have no common neighbors at all (Tu et al. 2018). Empirical evidence based on this example for illustrating this limitation will be shown in Sect. 5.2. For uncertainty modeling, most previous methods represented a node as a point vector in the learned embedding space. However, real-world networks may be noisy and imbalanced. For example, node degree distributions in real-world networks are often skewed, where some low-degree nodes may contain less discriminative information (Tu et al. 2018). Point vector representations learned by these methods are deterministic (Dos Santos et al. 2016) and are not capable of modeling the uncertainties of node representations.
Fig. 1 An example of ten nodes belonging to (1) three groups (different colors indicate different groups) based on global structural information, i.e., the structural roles and (2) two groups, Community 1 and Community 2 (groups are shown by the dashed ellipses), based on local structural information, i.e., the communities. For example, nodes 0, 1, 4, 5 and 8 belong to the same group Community 1 from the local structural perspective because they have more internal connections. Nodes 0 and 2 are far from each other, but they are in the same group from the global structural perspective (Color figure online)
A few studies in the literature have tried to address these limitations. For instance, struc2vec (Ribeiro et al. 2017) builds a hierarchy to measure similarity at different scales, and constructs a multilayer graph to encode the structural similarities. SNS (Lyu et al. 2017) discovers graphlets as a pre-processing step to obtain the structurally similar nodes. DRNE (Tu et al. 2018) learns network embedding by modeling regular equivalence (Wasserman and Faust 1994). However, these studies aim only to solve the problem of role preservation to some extent. Thus the limitation of uncertainty modeling remains a challenge. Dos Santos et al. (2016) and Bojchevski and Günnemann (2017) aim to improve classification tasks by embedding nodes into Gaussian distributions, but both methods only capture the neighborhood information based on random walk techniques. DVNE learns Gaussian embeddings for nodes in the Wasserstein space as the latent representations to capture uncertainties of nodes, but, like previous methods, it focuses only on first- and second-order proximity of networks. Therefore, the problem of role preservation has not been solved in these studies.
In this paper, we propose struc2gauss, a new structural role preserving network embedding framework. struc2gauss learns node representations in the space of Gaussian distributions and performs NE based on global structural information so that it can address both limitations simultaneously. On the one hand, struc2gauss generates node context based on a global structural similarity measure to learn node representations so that global structural information can be taken into consideration. On the other hand, struc2gauss learns node representations via Gaussian embedding and each node is represented as a Gaussian distribution where the mean indicates the position of this node in the embedding space and the covariance represents its uncertainty. Furthermore, we analyze and compare two different energy functions for Gaussian embedding to calculate the closeness of two embedded Gaussian distributions, i.e., expected likelihood and KL divergence. To investigate the influence of structural information, we also compare struc2gauss to two other structural similarity measures for networks, i.e., MatchSim and SimRank.
We summarize the contributions of this paper as follows:
- We propose a flexible structure preserving network embedding framework, struc2gauss, which learns node representations in the space of Gaussian distributions. struc2gauss is capable of preserving structural roles and modeling uncertainties.
- We investigate the influence of different energy functions in Gaussian embedding and compare to different structural similarity measures in preserving global structural information of networks.
- We conduct extensive experiments in node clustering and classification tasks which demonstrate the effectiveness of struc2gauss in capturing the global structural role information of networks and modeling the uncertainty of learned node representations.
The rest of the paper is organized as follows. Section 2 provides an overview of the related work. We present the problem statement in Sect. 3. Section 4 explains the technical details of struc2gauss. In Sect. 5 we then discuss our experimental study. The possible extensions of struc2gauss are discussed in Sect. 6. Finally, in Sect. 7 we draw conclusions and outline directions for future work.

Network embedding
Network embedding methods map nodes in a network into a low-dimensional space according to their structural information in the network. The learned node representations can boost performance in many network analysis tasks, e.g., community detection and link prediction. Previous methods mainly viewed NE as part of dimensionality reduction techniques (Goyal and Ferrara 2018). They first construct a pairwise similarity graph based on neighborhood and then embed the nodes of the graph into a lower dimensional vector space. Locally Linear Embedding (LLE) (Tenenbaum et al. 2000) and Laplacian Eigenmaps (Belkin and Niyogi 2001) are two representative methods in this category. SPE (Shaw and Jebara 2009) learns a low-rank kernel matrix to capture the structures of input graph via a set of linear inequalities as constraints. But the high computational complexity makes these methods difficult to apply in real-world networks.
With increasing attention attracted by neural network research, unsupervised neural network techniques have opened up a new world for embedding. word2vec, with its Skip-Gram and CBOW models (Mikolov et al. 2013a, b), learns low-rank representations of words in text based on word context and shows promising results on different NLP tasks. Based on word2vec, DeepWalk (Perozzi et al. 2014) was the first to introduce such an embedding mechanism to networks by treating nodes as words and random walks as sentences. Afterwards, a series of studies has been conducted to improve DeepWalk either by extending the definition of neighborhood to higher-order proximity (Cao et al. 2015; Grover and Leskovec 2016; Perozzi et al. 2016; Tang et al. 2015b) or by incorporating more information for node representations such as attributes (Li et al. 2017; Wang et al. 2017) and heterogeneity (Chang et al. 2015; Tang et al. 2015a). Recently, deeper neural networks have also been introduced to the NE problem to capture the nonlinear characteristics of networks, such as SDNE (Wang et al. 2016). However, these approaches represent a node as a point vector in the learned embedding space and are not capable of modeling the uncertainties of node representations. To solve this problem, inspired by Vilnis and McCallum (2014), Gaussian embedding has been used in NE. Bojchevski and Günnemann (2017) learn node embeddings by leveraging Gaussian embedding to capture uncertainties. Dos Santos et al. (2016) combine Gaussian embedding and a classification loss function for multi-label network classification. DVNE learns a Gaussian embedding for each node in the Wasserstein space as the latent representation so that the uncertainties can be modeled. We refer the reader to Hamilton et al. (2017b), Cui et al. (2018) and Cai et al. (2018) for more details.
Recent years have witnessed increasing interest in neural networks on graphs. Graph neural networks (Scarselli et al. 2008) can also learn node representations, but they use more complicated operations such as convolution. Kipf and Welling (2016) propose a GCN model using an efficient layer-wise propagation rule based on a first-order approximation of spectral convolutions on graphs. Gilmer et al. (2017) introduce a general message passing neural network framework to interpret different previous neural models for graphs. GraphSAGE (Hamilton et al. 2017a) learns node representations in an inductive manner by sampling a fixed-size neighborhood of each node and then applying a specific aggregator over it. Embedding Propagation (EP) (Duran and Niepert 2017) learns representations of graphs by passing messages forward and backward in an unsupervised setting. Graph Attention Networks (GATs) (Velickovic et al. 2017) extend graph convolutions by utilizing masked self-attention layers to assign different importances to different nodes with different sized neighborhoods.
However, most NE methods as well as graph neural networks only concern the local structural information represented by paths consisting of linked nodes, i.e., the community structures of networks. They fail to capture global structural information, i.e., structural roles. SNS (Lyu et al. 2017), struc2vec (Ribeiro et al. 2017) and DRNE (Tu et al. 2018) are exceptions which take global structural information into consideration. SNS uses graphlet information for structural similarity calculation as a pre-processing step. struc2vec applies dynamic time warping to measure the similarity between two nodes' degree sequences and builds a new multilayer graph based on the similarity. A mechanism similar to that of DeepWalk is then used to learn node representations. DRNE explicitly models regular equivalence, which is one way to define the structural role, and leverages the layer normalized LSTM (Ba et al. 2016) to learn the representations for nodes. Another related work focusing on global structural information is REGAL (Heimann et al. 2018). REGAL aims at matching nodes across different graphs, so the global structural patterns should be considered. However, its target is network alignment rather than representation learning. A brief summary of these NE methods is listed in Table 1.

Structural similarity
Structure based network analysis tasks can be categorized into two types: structural similarity calculation and network clustering. Calculating structural similarities between nodes has been a hot topic in recent years and different methods have been proposed. SimRank (Jeh and Widom 2002) is one of the most representative notions to calculate structural similarity. It implements a recursive definition of node similarity based on the assumption that two objects are similar if they relate to similar objects. SimRank++ (Antonellis et al. 2008) adds an evidence weight which partially compensates for the neighbor matching cardinality problem. P-Rank (Zhao et al. 2009) extends SimRank by jointly encoding both in- and out-link relationships into structural similarity computation. MatchSim (Lin et al. 2009) uses maximal matching of neighbors to calculate the structural similarity. RoleSim (Jin et al. 2011) is the only similarity measure which can satisfy the automorphic equivalence properties.
Network clusters can be based on either global or local structural information. Graph clustering based on global structural information is the problem of role discovery (Rossi and Ahmed 2015). In social science research, roles are represented as concepts of equivalence (Wasserman and Faust 1994). Graph-based methods and feature-based methods have been proposed for this task. Graph-based methods take nodes and edges as input and directly partition nodes into groups based on their structural patterns. For example, Mixed Membership Stochastic Blockmodel (Airoldi et al. 2008) infers the role distribution of each node using the Bayesian generative model. Feature-based methods first transfer the original network into feature vectors and then use clustering methods to group nodes. For example, RolX (Henderson et al. 2012) employs ReFeX (Henderson et al. 2011) to extract features of networks and then uses non-negative matrix factorization to cluster nodes. Local structural information based clustering corresponds to the problem of community detection (Fortunato 2010). A community is a group of nodes that interact with each other more frequently than with those outside the group. Thus, it captures only local connections between nodes.

Problem statement
We illustrated local community structure and global role structure in Sect. 1 using the example in Fig. 1. In this section, definitions of community and role will be presented and then we formally define the problem of structural role preserving network embedding.
The notion of structural role comes from social science and is used to describe nodes in a network from a global perspective. Formally,
Definition 1 (Structural role) In a network, a set of nodes have the same role if they share similar structural properties (such as degree, clustering coefficient, and betweenness), and structural roles can often be associated with various functions in a network.
For example, hub nodes with high degree in a social network are more likely to be opinion leaders, whereas bridge nodes with high betweenness are gatekeepers to connect different groups. Structural roles can reflect the global structural information because two nodes which have the same role could be far from each other and have no direct links or shared neighbors. In contrast to roles, community structures focus on local connections between nodes.
Definition 2 (Community structure) In a network, communities can represent the local structures of nodes, i.e., the organization of nodes in communities, with many edges joining nodes of the same community and comparatively few edges joining nodes of different communities (Fortunato 2010). A community is a set of nodes where nodes in this set are densely connected internally.
It can be seen that the focus of community structure is the internal and local connections, so it aims to capture the local structural information of networks. In this study, we only consider the global structural information, i.e., structural role information. Thus, unless stated otherwise, structural information refers to the global one, and the key phrases "structural role information" and "global structural information" are used interchangeably.

Definition 3 (Structural Role Preserving Network Embedding) Given a network $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges between the nodes, the problem of Structural Role Preserving Network Embedding aims to represent each node $v \in V$ as a Gaussian distribution with mean $\mu$ and covariance $\Sigma$ in a low-dimensional space $\mathbb{R}^d$, i.e., learning a function $f: V \to \mathcal{N}(x; \mu, \Sigma)$, where $\mu \in \mathbb{R}^d$ is the mean, $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance and $d \ll |V|$. In the space $\mathbb{R}^d$, the global structural role information of nodes introduced in Definition 1 can be preserved, i.e., if two nodes have the same role their means should be similar, and the uncertainty of node representations can be captured, i.e., the values of variances indicate the levels of uncertainties of learned representations.

struc2gauss
An overview of our proposed struc2gauss framework is shown in Fig. 2. Given a network, a similarity measure is employed to calculate the similarity matrix, and then the training set, which consists of positive and negative pairs, is sampled based on the similarity matrix. Finally, Gaussian embedding techniques are applied to the training set to generate the embedded Gaussian distributions as the node representations together with the uncertainties of the representations. In addition, we analyze the computational complexity and the flexibility of our struc2gauss framework.

Structural similarity calculation
It has been theoretically proved that random walk sampling based NE methods are not capable of capturing structural equivalence (Lyu et al. 2017) which is one way to model the structural roles in networks (Wasserman and Faust 1994). Thus, to capture the global structural information, we calculate the pairwise structural similarity as a pre-processing step similar to Lyu et al. (2017) and Ribeiro et al. (2017).
In the literature, a variety of structural similarity measures have been proposed to calculate node similarity based on the structures of networks, e.g., SimRank (Jeh and Widom 2002), MatchSim (Lin et al. 2009) and RoleSim (Jin et al. 2011, 2014). However, not all of these measures can capture the global structural role information, and we will show the empirical evidence in the experiments in Sect. 5. Therefore, in this paper we leverage RoleSim for the structural similarity since it satisfies all the requirements of the Axiomatic Role Similarity Properties for modeling the equivalence (Jin et al. 2011), i.e., the structural roles. RoleSim also generalizes the Jaccard coefficient and corresponds linearly to the maximal weighted matching. The RoleSim similarity between two nodes $u$ and $v$ is defined as:
$$\mathrm{RoleSim}(u, v) = (1 - \beta) \max_{M(u,v)} \frac{\sum_{(x,y) \in M(u,v)} \mathrm{RoleSim}(x, y)}{|N(u)| + |N(v)| - |M(u,v)|} + \beta, \qquad (1)$$
where $|N(u)|$ and $|N(v)|$ are the numbers of neighbors of nodes $u$ and $v$, respectively, and $M(u, v)$ is a matching between the neighbor sets $N(u)$ and $N(v)$.
The parameter β is a decay factor where 0 < β < 1. The intuition of RoleSim is that two nodes are structurally similar if their corresponding neighbors are also structurally similar. This intuition is consistent with the notion of automorphic and regular equivalence (Wasserman and Faust 1994).
In practice, RoleSim values can be computed iteratively and are guaranteed to converge. The procedure of computing RoleSim consists of three steps:
- Step 1: Initialize the matrix of RoleSim scores $R^0$;
- Step 2: Compute the $k$th iteration scores $R^k$ from the $(k-1)$th iteration's values $R^{k-1}$ using:
$$R^k(u, v) = (1 - \beta) \max_{M(u,v)} \frac{\sum_{(x,y) \in M(u,v)} R^{k-1}(x, y)}{|N(u)| + |N(v)| - |M(u,v)|} + \beta; \qquad (2)$$
- Step 3: Repeat Step 2 until the $R$ values converge for each pair of nodes.
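As an illustration of the three steps, the following Python sketch iterates RoleSim scores on a small adjacency-list graph. It assumes the update of Jin et al. (2011), in which the matched neighbor similarities are divided by $|N(u)| + |N(v)| - |M(u,v)|$ (equal to $\max(|N(u)|, |N(v)|)$ for a maximal matching), an all-ones initialization, and a greedy approximation of the maximal weighted matching. The function names, the default $\beta$, the stopping tolerance and the treatment of isolated nodes are illustrative assumptions and are not taken from the released implementation.

```python
def greedy_matching_weight(nu, nv, sim):
    """Greedy approximation of the maximal weighted matching between two neighbor lists."""
    pairs = sorted(((sim[x][y], x, y) for x in nu for y in nv),
                   key=lambda t: t[0], reverse=True)
    used_x, used_y, total = set(), set(), 0.0
    for weight, x, y in pairs:
        if x not in used_x and y not in used_y:
            used_x.add(x)
            used_y.add(y)
            total += weight
    return total


def rolesim(adj, beta=0.15, max_iter=100, tol=1e-6):
    """Iterate RoleSim scores until the largest change falls below tol."""
    nodes = list(adj)
    # Step 1: initialize all scores to 1 (assumed initialization).
    sim = {u: {v: 1.0 for v in nodes} for u in nodes}
    for _ in range(max_iter):
        new = {u: {} for u in nodes}
        delta = 0.0
        # Step 2: compute the new scores from the previous iteration's scores.
        for i, u in enumerate(nodes):
            for v in nodes[i:]:
                # |N(u)| + |N(v)| - |M| with |M| = min(|N(u)|, |N(v)|).
                denom = max(len(adj[u]), len(adj[v]))
                if denom:
                    ratio = greedy_matching_weight(adj[u], adj[v], sim) / denom
                else:
                    ratio = 1.0  # assumption: two isolated nodes are treated as equivalent
                score = (1.0 - beta) * ratio + beta
                new[u][v] = new[v][u] = score
                delta = max(delta, abs(score - sim[u][v]))
        sim = new
        # Step 3: stop once the scores have converged.
        if delta < tol:
            break
    return sim


# Usage: a path 0 - 1 - 2, where the two end nodes play the same role.
scores = rolesim({0: [1], 1: [0, 2], 2: [1]})
print(scores[0][2], scores[0][1])
```

On this three-node path the two end nodes are automorphically equivalent, so their score stays at 1, whereas the score between an end node and the middle node decays towards $2\beta / (1 + \beta)$.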
Note that strategies other than structural similarity can also be used to capture the global structural role information; these possible strategies will be discussed in Sect. 6. The advantage of RoleSim over other structural measures in capturing structural roles will also be discussed empirically in Sect. 5.6.

Training set sampling
The target of structural role preserving network embedding is to map nodes in the network to a latent space where the learned latent representations of two nodes are (1) more similar if these two nodes are structurally similar, and (2) more dissimilar if these two nodes are not structurally similar. Hence, we need to generate structurally similar and dissimilar node pairs as the training set based on the similarity we learned in Sect. 4.1. We name the structurally similar pairs of nodes the positive set and the structurally dissimilar pairs the negative set.
In detail, for node $v$, we rank its similarity values towards other nodes and then select the top-$k$ most similar nodes $u_i$, $i = 1, \ldots, k$, to form its positive set $\Gamma_+ = \{(v, u_i) \mid i = 1, \ldots, k\}$. For the negative set, we randomly select the same number of nodes $\{u_i, i = 1, \ldots, k\}$, as in Vilnis and McCallum (2014) and other random walk sampling based methods (Grover and Leskovec 2016; Tang et al. 2015b; Perozzi et al. 2014), i.e., $\Gamma_- = \{(v, u_i) \mid i = 1, \ldots, k\}$. Therefore, $k$ is a parameter indicating the number of positive/negative nodes per node. We generate $r$ positive and negative sets for each node, where $r$ is a parameter indicating the number of samples per node. The influence of these parameters will be analyzed empirically in Sect. 5.7. Note that the selection of the positive set is similar to that in DeepWalk; the difference is that we follow the similarity rank to select the positive nodes instead of random walks.
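The sampling step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the similarity matrix is a nested dictionary, negatives are drawn uniformly from all other nodes, and the repetition over $r$ samples per node is omitted. The function name and the explicit random generator argument are hypothetical.

```python
import heapq
import random


def sample_training_pairs(sim, k, rng):
    """Build positive (top-k most similar) and negative (random) pairs for every node."""
    nodes = list(sim)
    positives, negatives = [], []
    for v in nodes:
        others = [u for u in nodes if u != v]
        # Positive set: the k nodes with the highest similarity to v.
        top_k = heapq.nlargest(k, others, key=lambda u: sim[v][u])
        positives.extend((v, u) for u in top_k)
        # Negative set: the same number of nodes, chosen uniformly at random.
        drawn = rng.sample(others, min(k, len(others)))
        negatives.extend((v, u) for u in drawn)
    return positives, negatives


# Usage on a toy similarity matrix with two pairs of similar nodes.
toy_sim = {
    "a": {"a": 1.0, "b": 0.9, "c": 0.2, "d": 0.1},
    "b": {"a": 0.9, "b": 1.0, "c": 0.3, "d": 0.2},
    "c": {"a": 0.2, "b": 0.3, "c": 1.0, "d": 0.8},
    "d": {"a": 0.1, "b": 0.2, "c": 0.8, "d": 1.0},
}
pos, neg = sample_training_pairs(toy_sim, k=1, rng=random.Random(0))
print(pos)
```

Passing the random generator explicitly keeps the negative sampling reproducible, which is a design choice of this sketch rather than something prescribed by the framework.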

Overview
Recently, language modeling techniques such as word2vec have been extensively used to learn word representations, and almost all NE studies are based on these word embedding techniques. However, these NE studies map each entity to a fixed point vector in a low-dimensional space, so the uncertainties of the learned embeddings are ignored. Gaussian embedding aims to solve this problem by learning density-based distributed embeddings in the space of Gaussian distributions (Vilnis and McCallum 2014). Gaussian embedding has been utilized in different graph mining tasks including triplet classification on knowledge graphs (He et al. 2015), multi-label classification on heterogeneous graphs (Dos Santos et al. 2016), and link prediction and node classification on attributed graphs (Bojchevski and Günnemann 2017).
Gaussian embedding is trained with a ranking-based loss based on the ranks of positive and negative samples. Following Vilnis and McCallum (2014), we choose the max-margin ranking objective, which pushes the scores of positive pairs above those of negative pairs by a margin, defined as:
$$\mathcal{L} = \sum_{(v,u) \in \Gamma_+} \sum_{(v',u') \in \Gamma_-} \max\left(0,\; m - E(z_v, z_u) + E(z_{v'}, z_{u'})\right), \qquad (3)$$
where $\Gamma_+$ and $\Gamma_-$ are the positive and negative pairs, respectively. $E(\cdot, \cdot)$ is the energy function which is used to measure the similarity of two distributions, $z_v$ and $z_u$ are the learned Gaussian distributions for nodes $v$ and $u$, and $m$ is the margin separating positive and negative pairs. In this paper, we present two different energy functions to measure the similarity of two distributions for node representation learning, i.e., expected likelihood and KL divergence based energy functions. For the learned Gaussian distribution $z_i \sim \mathcal{N}(\mu_i, \Sigma_i)$ for node $i$, to reduce the computational complexity, we restrict the covariance matrix $\Sigma_i$ to be either diagonal or spherical in this work.
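To make the ranking objective concrete, the following Python sketch evaluates a max-margin loss of the form $\sum_{(v,u) \in \Gamma_+} \sum_{(v',u') \in \Gamma_-} \max(0, m - E(z_v, z_u) + E(z_{v'}, z_{u'}))$. The double sum over all positive and negative pairs is an assumption of this sketch, and the energy is passed in as an arbitrary callable; the toy energy table below is hypothetical and only serves the example.

```python
def max_margin_loss(pos_pairs, neg_pairs, energy, margin=1.0):
    """Hinge loss that is zero once every positive energy exceeds every negative one by `margin`."""
    neg_energies = [energy(v, u) for v, u in neg_pairs]
    loss = 0.0
    for v, u in pos_pairs:
        e_pos = energy(v, u)
        loss += sum(max(0.0, margin - e_pos + e_neg) for e_neg in neg_energies)
    return loss


# Usage with a hypothetical table of energies (higher means more similar).
table = {("a", "b"): 2.0, ("a", "c"): 0.0, ("a", "d"): 1.5}
loss = max_margin_loss([("a", "b")], [("a", "c"), ("a", "d")],
                       energy=lambda v, u: table[(v, u)], margin=1.0)
print(loss)  # 0.5: only the negative pair ("a", "d") violates the margin
```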

Expected likelihood based energy
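Although both the dot product and the inner product can be used to measure the similarity between two distributions, the dot product considers only the means and does not incorporate the covariances. Thus, we use the inner product to measure the similarity. Formally, the integral of the inner product between two Gaussian distributions $z_i$ and $z_j$ (the learned Gaussian embeddings for nodes $i$ and $j$, respectively), a.k.a. the expected likelihood, is defined as:
$$E(z_i, z_j) = \int_{x \in \mathbb{R}^d} \mathcal{N}(x; \mu_i, \Sigma_i)\, \mathcal{N}(x; \mu_j, \Sigma_j)\, dx = \mathcal{N}(0; \mu_i - \mu_j, \Sigma_i + \Sigma_j). \qquad (4)$$
For simplicity in computation and comparison, we use the logarithm of Eq. (4) as the final energy function:
$$E_{EL}(z_i, z_j) = \log E(z_i, z_j) = -\frac{1}{2} \log \det(\Sigma_i + \Sigma_j) - \frac{1}{2} (\mu_i - \mu_j)^\top (\Sigma_i + \Sigma_j)^{-1} (\mu_i - \mu_j) - \frac{d}{2} \log(2\pi), \qquad (5)$$
where $d$ is the number of dimensions. The gradient of this energy function with respect to the means $\mu$ and covariances $\Sigma$ can be calculated in a closed form as:
$$\frac{\partial E_{EL}(z_i, z_j)}{\partial \mu_i} = -\frac{\partial E_{EL}(z_i, z_j)}{\partial \mu_j} = -\Delta_{ij}, \qquad \frac{\partial E_{EL}(z_i, z_j)}{\partial \Sigma_i} = \frac{\partial E_{EL}(z_i, z_j)}{\partial \Sigma_j} = \frac{1}{2}\left(\Delta_{ij} \Delta_{ij}^\top - (\Sigma_i + \Sigma_j)^{-1}\right), \qquad (6)$$
where $\Delta_{ij} = (\Sigma_i + \Sigma_j)^{-1} (\mu_i - \mu_j)$ (He et al. 2015; Vilnis and McCallum 2014). Note that expected likelihood is a symmetric similarity measure, i.e., $E_{EL}(z_i, z_j) = E_{EL}(z_j, z_i)$.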
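For diagonal covariances the log expected likelihood reduces to a sum over dimensions, which the following Python sketch evaluates. It assumes the standard closed form $\log \mathcal{N}(0; \mu_i - \mu_j, \Sigma_i + \Sigma_j)$ from Vilnis and McCallum (2014) and represents each covariance by the list of its diagonal entries; the function name is illustrative.

```python
import math


def expected_likelihood_energy(mu_i, var_i, mu_j, var_j):
    """log N(0; mu_i - mu_j, Sigma_i + Sigma_j) for diagonal covariances."""
    d = len(mu_i)
    energy = -0.5 * d * math.log(2.0 * math.pi)
    for mi, vi, mj, vj in zip(mu_i, var_i, mu_j, var_j):
        s = vi + vj  # diagonal entry of Sigma_i + Sigma_j
        energy -= 0.5 * math.log(s) + 0.5 * (mi - mj) ** 2 / s
    return energy


# Usage: the energy is symmetric and decreases as the means move apart.
near = expected_likelihood_energy([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
far = expected_likelihood_energy([3.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
print(near, far)
```

A spherical covariance corresponds to the special case in which all diagonal entries of a node share one value.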

KL divergence based energy
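KL divergence is another straightforward way to measure the similarity between two distributions, so we utilize the energy function $E_{KL}(z_i, z_j)$ based on the KL divergence to measure the similarity between Gaussian distributions $z_i$ and $z_j$ (the learned Gaussian embeddings for nodes $i$ and $j$, respectively). Since a smaller divergence corresponds to a higher similarity, the energy is the negative divergence:
$$E_{KL}(z_i, z_j) = -D_{KL}(z_j \,\|\, z_i) = -\frac{1}{2}\left[\mathrm{tr}(\Sigma_i^{-1} \Sigma_j) + (\mu_i - \mu_j)^\top \Sigma_i^{-1} (\mu_i - \mu_j) - d - \log \frac{\det \Sigma_j}{\det \Sigma_i}\right], \qquad (7)$$
where $d$ is the number of dimensions. Similarly, we can compute the gradients of this energy function with respect to the means $\mu$ and covariances $\Sigma$:
$$\frac{\partial E_{KL}(z_i, z_j)}{\partial \mu_i} = -\frac{\partial E_{KL}(z_i, z_j)}{\partial \mu_j} = -\Delta_{ij}, \qquad \frac{\partial E_{KL}(z_i, z_j)}{\partial \Sigma_i} = \frac{1}{2}\left(\Sigma_i^{-1} \Sigma_j \Sigma_i^{-1} + \Delta_{ij} \Delta_{ij}^\top - \Sigma_i^{-1}\right), \qquad \frac{\partial E_{KL}(z_i, z_j)}{\partial \Sigma_j} = \frac{1}{2}\left(\Sigma_j^{-1} - \Sigma_i^{-1}\right), \qquad (8)$$
where $\Delta_{ij} = \Sigma_i^{-1} (\mu_i - \mu_j)$. Note that the KL divergence based energy is asymmetric, but we can easily extend it to a symmetric similarity measure as follows:
$$E(z_i, z_j) = \frac{1}{2}\left(E_{KL}(z_i, z_j) + E_{KL}(z_j, z_i)\right). \qquad (9)$$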
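The KL divergence based energy can be sketched in Python for diagonal covariances as follows. The sketch assumes the convention of Vilnis and McCallum (2014), where the energy is the negative divergence $-D_{KL}(z_j \| z_i)$ so that a larger energy means a higher similarity, and it symmetrizes the energy by averaging both directions. Both the sign convention and the function names are assumptions of this illustration.

```python
import math


def kl_energy(mu_i, var_i, mu_j, var_j):
    """Negative KL divergence -KL(N_j || N_i) for diagonal covariances."""
    kl = 0.0
    for mi, vi, mj, vj in zip(mu_i, var_i, mu_j, var_j):
        # Per-dimension terms: trace, Mahalanobis distance, -1, and log-determinant ratio.
        kl += 0.5 * (vj / vi + (mi - mj) ** 2 / vi - 1.0 - math.log(vj / vi))
    return -kl


def symmetric_kl_energy(mu_i, var_i, mu_j, var_j):
    """Average of the two directed energies, which is symmetric in i and j."""
    return 0.5 * (kl_energy(mu_i, var_i, mu_j, var_j)
                  + kl_energy(mu_j, var_j, mu_i, var_i))


# Usage: the directed energy depends on the order of its arguments.
print(kl_energy([0.0], [1.0], [0.0], [2.0]))
print(kl_energy([0.0], [2.0], [0.0], [1.0]))
print(symmetric_kl_energy([0.0], [1.0], [0.0], [2.0]))
```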

Learning
To prevent the means from growing too large and to ensure that the covariances are positive definite as well as reasonably sized, we regularize the means and covariances when learning the embedding (Vilnis and McCallum 2014). Due to their different geometric characteristics, two different hard constraint strategies are used for means and covariances, respectively. Note that we only consider diagonal and spherical covariances. In particular, we have
$$\|\mu_i\|_2 \le C, \quad \forall i, \qquad (10)$$
$$c_{min} I \prec \Sigma_i \prec c_{max} I, \quad \forall i. \qquad (11)$$
The constraint on means guarantees that they are sufficiently small, and the constraint on covariances ensures that they are positive definite and of appropriate size. For example, $\Sigma_{ii} \leftarrow \max(c_{min}, \min(c_{max}, \Sigma_{ii}))$ can be used to regularize diagonal covariances. We use AdaGrad (Duchi et al. 2011) to optimize the parameters. The learning procedure is described in Algorithm 1. The initialization phase is from line 1 to 4, context generation is shown in line 7, and Gaussian embeddings are learned from line 8 to 14.
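The two hard constraints can be sketched in Python as follows. The covariance rule is the element-wise clipping $\Sigma_{ii} \leftarrow \max(c_{min}, \min(c_{max}, \Sigma_{ii}))$ given in the text; for the means, the sketch assumes a projection onto the ball of radius $C$ by rescaling, which is one common way to enforce a norm bound and is not confirmed by the text. The function names are illustrative.

```python
import math


def clip_mean(mu, c):
    """Rescale mu so that its Euclidean norm does not exceed c (assumed projection)."""
    norm = math.sqrt(sum(x * x for x in mu))
    if norm <= c:
        return list(mu)
    return [x * c / norm for x in mu]


def clip_variances(var, c_min, c_max):
    """Clip each diagonal covariance entry to the interval [c_min, c_max]."""
    return [max(c_min, min(c_max, v)) for v in var]


# Usage with the constraint values of the case study (C = 2, c_min = 0.5, c_max = 2).
print(clip_mean([3.0, 4.0], 2.0))
print(clip_variances([0.1, 1.0, 5.0], 0.5, 2.0))  # [0.5, 1.0, 2.0]
```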

Algorithm 1 The Learning Algorithm of struc2gauss
Input: An energy function E(z_i, z_j), a graph G = (V, E), embedding dimension d, constraint values C for mean and c_max and c_min for covariance, learning rate α, and maximum epochs n.
Output: Gaussian embeddings (mean vector μ and covariance matrix Σ) for nodes v ∈ V
1: for all v ∈ V do
2:   Initialize mean μ for v
3:   Initialize covariance Σ for v
4:   Regularize μ and Σ with constraint in Eq. (10) and (11)
5: end for
6: while not reach the maximum epochs n do
7:   Generate positive and negative sets Γ+ and Γ− for each node
8:   if use expected likelihood based energy then
9:     Update means and covariances based on Eq. (6)
10:   end if
11:   if use KL divergence based energy then
12:     Update means and covariances based on Eq. (8)
13:   end if
14:   Regularize μ and Σ with constraint in Eq. (10) and (11)
15: end while

Computational complexity
The complexity of the different components of struc2gauss is analyzed as follows:
1. For structural similarity calculation using RoleSim, the computational complexity is $O(kn^2 d)$, where $n$ is the number of nodes, $k$ is the number of iterations and $d$ is the average of $y \log y$ over all node-pair bipartite graphs in $G$ (Jin et al. 2011), with $y = |N(u)| \times |N(v)|$ for each pair of nodes $u$ and $v$. The $O(y \log y)$ term comes from the fast greedy algorithm, which offers a 1/2-approximation of the globally optimal matching.
2. To generate the training set based on the similarity matrix, we need to sample from the most similar nodes for each node, i.e., to select the $k$ largest numbers from an unsorted array, where $k$ here denotes the number of selected nodes. Using a heap, the complexity is $O(n \log k)$.
3. For Gaussian embedding, the operations include matrix addition, multiplication and inversion. In practice, as stated above, we only consider two types of covariance matrices, i.e., diagonal and spherical, so all these operations have the complexity of $O(n)$.
Overall, the similarity calculation is the bottleneck of the framework. One possible and effective way to optimize this part is to set the similarity to 0 if two nodes have a large difference in degrees. The reasons are: (1) we generate the context based only on the most similar nodes; and (2) two nodes are less likely to be structurally similar if their degrees are very different.
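The degree-based shortcut can be sketched in Python as a pre-filter that selects the node pairs worth scoring; all remaining pairs would simply receive similarity 0. The text does not specify what counts as a large difference in degrees, so the ratio threshold `max_ratio` and the function name below are purely hypothetical.

```python
def candidate_pairs(adj, max_ratio=2.0):
    """Return node pairs whose degree ratio is at most max_ratio (hypothetical criterion)."""
    nodes = list(adj)
    keep = []
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            low, high = sorted((len(adj[u]), len(adj[v])))
            # Pairs with very different degrees are skipped and treated as similarity 0.
            if low > 0 and high / low <= max_ratio:
                keep.append((u, v))
    return keep


# Usage: in a star graph the hub (degree 4) is never compared with a leaf (degree 1).
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(candidate_pairs(star))
```

In this star graph only the six leaf-leaf pairs remain, so the expensive matching step is skipped for every pair that involves the hub.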

Experiments
We evaluate struc2gauss in different tasks in order to understand its effectiveness in capturing structural information, its capability in modeling uncertainties of embeddings and the stability of the model with respect to its parameters. We also study the influence of different similarity measures empirically. The source code of struc2gauss is available online.

Datasets
We conduct experiments on two types of network datasets: networks with and without ground-truth labels, where these labels can represent the global structural role information of nodes in the networks. For networks with labels, to compare with the state of the art, we use air-traffic networks from Ribeiro et al. (2017), where the networks are undirected, nodes are airports, edges indicate the existence of commercial flights and labels correspond to their levels of activity. For networks without labels, we select five real-world networks in different domains from Network Repository. A brief introduction to these datasets is shown in Table 2. Note that the numbers of groups for networks

Baselines
We compare struc2gauss with several state-of-the-art NE methods.
- DeepWalk (Perozzi et al. 2014): It learns node representations based on random walks using the same mechanism as word2vec, by drawing an analogy between paths consisting of several nodes on networks and word sequences in text. The structural information is captured by the paths of nodes generated by random walks.
- node2vec (Grover and Leskovec 2016): It extends DeepWalk to learn latent representations from the node paths generated by biased random walks. Two hyperparameters p and q are used to control whether the random walk is breadth-first or depth-first. In this way, node2vec can capture the structural information in networks. Note that when p = q = 1, node2vec degrades to DeepWalk.
- LINE (Tang et al. 2015b): It learns node embeddings via preserving both the local and global network structures. By extending DeepWalk, LINE aims to capture both the first-order proximities, i.e., the neighbors of nodes, and the second-order proximities, i.e., the shared neighborhood structures of nodes.
- Embedding Propagation (EP) (Duran and Niepert 2017): EP is an unsupervised learning framework for network embedding and learns vector representations of graphs by passing two types of messages between neighboring nodes. EP, as a graph neural network, is similar to graph convolutional networks (GCN) (Kipf and Welling 2016). The difference is that EP is unsupervised whereas GCN is designed for semi-supervised learning.
For all baselines, we use the implementation released by the original authors. For our framework struc2gauss, we test four variants: struc2gauss with expected likelihood and diagonal covariance (s2g_el_d), expected likelihood and spherical covariance (s2g_el_s), KL divergence and diagonal covariance (s2g_kl_d), and KL divergence and spherical covariance (s2g_kl_s). Note that we only use the means of the Gaussian distributions as the node embeddings in role clustering and classification tasks. The covariances are left for uncertainty modeling.
For other settings including parameters and evaluation metrics, different settings will be discussed in each task.
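The four variants differ only in the energy function and the covariance form. As a rough illustration, the two energies can be written in closed form for Gaussians with diagonal covariances; a spherical covariance is then the special case where all diagonal entries are equal. The sketch below uses the standard closed forms for the KL divergence and for the log expected likelihood (the log of the integral of the product of two Gaussian densities). The function names and the argument order of the KL divergence are assumptions made for illustration, not the authors' implementation.

```python
import math

def kl_divergence(mu_p, var_p, mu_q, var_q):
    """KL(N_p || N_q) for two Gaussians with diagonal covariances.

    Means and variances are given as equal-length lists of floats.
    The direction (p relative to q) is an illustrative choice.
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
    return 0.5 * total

def log_expected_likelihood(mu_p, var_p, mu_q, var_q):
    """log of the integral of N_p(x) * N_q(x) dx.

    For Gaussians this equals log N(0; mu_p - mu_q, var_p + var_q),
    which is symmetric in p and q.
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        s = vp + vq
        total += math.log(2.0 * math.pi * s) + (mp - mq) ** 2 / s
    return -0.5 * total

def spherical(variance, dim):
    """A spherical covariance is a diagonal one with a single shared variance."""
    return [variance] * dim

# Toy 2-D example with made-up parameters.
mu_a, mu_b = [0.0, 0.0], [1.0, -1.0]
diag_a, diag_b = [0.5, 2.0], [1.0, 1.0]
print(kl_divergence(mu_a, diag_a, mu_b, diag_b))
print(log_expected_likelihood(mu_a, spherical(1.0, 2), mu_b, spherical(1.0, 2)))
```

The KL divergence is asymmetric, whereas the expected likelihood is symmetric, which is one way the two energy functions can behave differently during training.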

Case study: visualization in 2-D space
We use the toy example shown in Fig. 1 to demonstrate the effectiveness of struc2gauss in capturing the global structural information and the failure of other state-of-the-art techniques in this task. The toy network consists of ten nodes, which can be clustered from two different perspectives:
- from the perspective of the global role structure, they belong to three groups, i.e., {0, 1, 2, 3} (yellow color), {4, 5, 6, 7} (blue color) and {8, 9} (red color), because different groups have different structural functions in this network;
- from the perspective of the local community structure, they belong to two groups, i.e., {0, 1, 4, 5, 6, 8} and {2, 3, 6, 7, 9}, because there are denser connections/more edges inside each community than outside the community.
Note that from the perspective of role discovery, these three groups of nodes can be interpreted as playing the roles of periphery, star and bridge, respectively. In this study, we aim to preserve the global structural information in network embedding. Figure 3 shows the node representations learned by different methods. For the parameters shared by all methods, we use the same default settings: representation dimension: 2, number of walks per node: 20, walk length: 80, skip-gram window size: 5. For node2vec, we set p = 1 and q = 2. For graph2gauss and struc2gauss, the number of walks per node is 20 and the number of positive/negative nodes per node is 5. The constraint for means C is 2 and the constraints for covariances c_min and c_max are 0.5 and 2, respectively.
Fig. 3 Node representations learned by different methods: ..., e struc2vec, f struc2gauss using KL divergence with diagonal covariance, g struc2gauss using KL divergence with spherical covariance, h struc2gauss using expected likelihood with diagonal covariance, and i struc2gauss using expected likelihood with spherical covariance
From the visualization results, it can be observed that:
- struc2gauss with spherical covariances performs better than with diagonal covariances, since it recognizes star and bridge nodes better.
- Methods that aim to capture the global structural information perform better than methods based on random walk sampling. For example, struc2vec can solve this problem to some extent; however, nodes 6 and 9 overlap. It has been stated that node2vec can capture structural equivalence, but the visualization shows that it still captures local structural information, similar to DeepWalk.
- DeepWalk, LINE and graph2gauss fail to capture the global structural information because these methods are based on random walks, which capture only the local community structures. DeepWalk does capture the local structural information, since nodes are separated into two parts corresponding to the two communities shown in Fig. 1.

Structural role clustering
The most common network mining application based on global structural information is role discovery, which is essentially a clustering task. Thus, we use this task to illustrate the potential of the node representations learned by struc2gauss. We use the latent representations learned by different methods (for struc2gauss, the means of the learned Gaussian distributions) as features and K-means as the clustering algorithm.
Parameters. For the baselines, we use the same settings as in the literature: representation dimension: 128, number of walks per node: 20, walk length: 80, skip-gram window size: 10. For node2vec, we set p = 1 and q = 2. For graph2gauss and struc2gauss, we set the constraint for means C to 2 and the constraints for covariances c_min and c_max to 0.5 and 2, respectively. The number of walks per node is 10, the number of positive/negative nodes per node is 120 and the representation dimension is also 128.
Evaluation metrics. To quantitatively evaluate clustering performance on labeled networks, we use Normalized Mutual Information (NMI) as the evaluation metric. NMI evaluates clustering quality from an information-theoretic perspective: it normalizes the mutual information between the obtained cluster assignment C and the ground-truth cluster assignment D by the arithmetic average of their entropies:

$$\mathrm{NMI}(C, D) = \frac{2\, I(C, D)}{H(C) + H(D)}$$

where the mutual information I(C, D) is defined as I(C, D) = H(C) − H(C|D) and H(·) is the entropy.
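As a small worked example, NMI with arithmetic-mean normalization can be computed directly from two label lists. The sketch below is a plain-Python illustration rather than the evaluation code used in the experiments; the function names and the convention of returning 1.0 when both labelings have zero entropy are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a labeling (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(c, d):
    """Mutual information I(C, D) between two labelings of the same nodes."""
    n = len(c)
    joint = Counter(zip(c, d))
    count_c, count_d = Counter(c), Counter(d)
    mi = 0.0
    for (ci, di), nij in joint.items():
        mi += (nij / n) * math.log(nij * n / (count_c[ci] * count_d[di]))
    return mi

def nmi(c, d):
    """NMI(C, D) = 2 I(C, D) / (H(C) + H(D))."""
    denom = entropy(c) + entropy(d)
    if denom == 0.0:
        # Assumed convention: two single-cluster labelings agree perfectly.
        return 1.0
    return 2.0 * mutual_information(c, d) / denom

predicted = [0, 0, 1, 1, 2, 2]
truth = [1, 1, 0, 0, 2, 2]
print(nmi(predicted, truth))
```

NMI is invariant to renaming the clusters, so the relabeled but otherwise identical partition in the example scores as a perfect match.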
For unlabeled networks, we use normalized goodness-of-fit as the evaluation metric. Goodness-of-fit measures how well the representation of roles, and of the relations among these roles, fits a given network (Wasserman and Faust 1994). It assumes that the output of a role discovery method is an optimal model, in which nodes belonging to the same role are predicted to be perfectly structurally equivalent. In real-world social networks, nodes belonging to the same role are only approximately structurally equivalent. The essence of goodness-of-fit indices is to measure just how approximate these approximate structural equivalences are. If the optimal model holds, then all nodes belonging to the same role are exactly structurally equivalent.
In detail, given a social network with n vertices V = {v_1, v_2, ..., v_n} and m roles, we have the adjacency matrix A = {A_ij ∈ {0, 1} | 1 ≤ i, j ≤ n} and the role set R = {R_1, R_2, ..., R_m}, where v_i ∈ R_j indicates that node v_i belongs to the jth role, as obtained using DyNMF. Note that R partitions V, in the sense that each v ∈ V belongs to exactly one R_i ∈ R. Then the density matrix Δ is defined as:

$$\Delta_{ij} = \begin{cases} \dfrac{\sum_{v_k \in R_i,\, v_l \in R_j} A_{kl}}{|R_i|\,|R_j|}, & i \neq j \\[2ex] \dfrac{\sum_{v_k, v_l \in R_i} A_{kl}}{|R_i|\,(|R_i| - 1)}, & i = j \end{cases}$$

We also define the block matrix B based on the discovered roles. Several criteria can be used to build the block matrix, including perfect fit, zeroblock, oneblock and the α density criterion (Wasserman and Faust 1994). Since real social network data rarely contain perfectly structurally equivalent nodes (Faust and Wasserman 1992), the perfect fit, zeroblock and oneblock criteria would not work well on real-world data, and we use the α density criterion to construct the block matrix B:

$$B_{ij} = \begin{cases} 0, & \Delta_{ij} < \alpha \\ 1, & \text{otherwise} \end{cases}$$

where α is the threshold that determines the values in the blocks. The α density criterion is based on the density of edges between nodes belonging to the same role. Based on the density matrix Δ and the block matrix B, the goodness-of-fit index e measures the discrepancy between the two. To bring the evaluation metric into the range [0, 1], we normalize goodness-of-fit by dividing it by r^2, where r is the number of groups/roles. For more details about goodness-of-fit indices, please refer to Wasserman and Faust (1994).
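The goodness-of-fit computation can be sketched end to end on a toy network. Two specifics in the sketch are assumptions rather than definitions taken from the text: the threshold α is set to the overall edge density of the network, and the index e is taken to be the sum of absolute differences between the block matrix B and the density matrix Δ. A role with a single node is given a within-role density of 0. The function name is hypothetical.

```python
def normalized_goodness_of_fit(adj, roles):
    """Normalized goodness-of-fit for a role assignment (lower is better).

    adj   : n x n list of lists with 0/1 entries and a zero diagonal
    roles : list of length n giving the role id of each node
    """
    n = len(adj)
    role_ids = sorted(set(roles))
    members = {r: [v for v in range(n) if roles[v] == r] for r in role_ids}

    # Assumption: alpha is the edge density of the whole network.
    alpha = sum(map(sum, adj)) / (n * (n - 1))

    e = 0.0
    for ri in role_ids:
        for rj in role_ids:
            edges = sum(adj[k][l] for k in members[ri] for l in members[rj] if k != l)
            if ri == rj:
                pairs = len(members[ri]) * (len(members[ri]) - 1)
            else:
                pairs = len(members[ri]) * len(members[rj])
            density = edges / pairs if pairs else 0.0   # entry of Delta
            block = 1.0 if density >= alpha else 0.0    # entry of B
            # Assumption: e accumulates |B_ij - Delta_ij|.
            e += abs(block - density)
    return e / len(role_ids) ** 2

# Star network: node 0 is the hub, nodes 1-3 are leaves.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
print(normalized_goodness_of_fit(star, [0, 1, 1, 1]))  # hub vs. leaves
print(normalized_goodness_of_fit(star, [0, 0, 1, 1]))  # hub grouped with a leaf
```

Under these assumptions, the hub-versus-leaves assignment fits the star exactly, while grouping the hub with a leaf produces a nonzero discrepancy.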

Results
The NMI values for node clustering on networks with labels are shown in Table 3 and the normalized goodness-of-fit values for networks without labels are shown in Fig. 4. Note that random walk and neighbor based embedding methods, including DeepWalk, LINE, node2vec, EP and graph2gauss, aim at capturing local structural information and so are incapable of preserving structural roles. Hence, for simplicity, we do not compare them with the role preserving methods on networks without clustering labels.
Fig. 4 Goodness-of-fit of global structure preserving embedding baselines and struc2gauss with different strategies on three real-world networks. Lower value means better performance
From these results, some conclusions can be drawn:
- For both types of networks, with and without clustering labels, struc2gauss outperforms all other methods on the respective evaluation metrics. This indicates the effectiveness of struc2gauss in capturing the global structural information.
- Comparing struc2gauss with diagonal and spherical covariances, it can be observed that spherical covariance achieves better performance in node clustering. This finding is similar to the results for word embedding in Vilnis and McCallum (2014). A possible explanation is that spherical covariance requires the diagonal elements to be the same, which limits the representation power of the covariance matrices but, in contrast, enhances the representation power of the learned means. Since we use only the means to represent nodes, the method with spherical covariance matrices could learn more relaxed means, which leads to better performance.
- Among the baselines, struc2vec, GraphWave and DRNE can capture the structural role information to some extent, since their performance is better than that of the random walk based methods, i.e., DeepWalk and node2vec, and the neighbor-based methods, i.e., EP and graph2gauss, while all of the latter fail to capture the global structural information for node clustering.

Structural role classification
Node classification is another widely used task for embedding evaluation. Unlike previous studies, which focused on community structures, our approach aims to preserve the global role structures. Thus, we evaluate the effectiveness of struc2gauss on the role classification task. As in the node clustering task in Sect. 5.3, we use the latent representations learned by different methods as features. Each dataset is split into a training set and a test set (we explore the classification performance with different percentages of training data). To focus on the learned representations, we use logistic regression as the classifier. Because structural role classification is a supervised task, ground-truth labels are required; thus we use only the two air-traffic networks for evaluation. We compare our approach with the same state-of-the-art NE algorithms used as baselines in Sect. 5.3, i.e., DeepWalk, LINE, node2vec, EP, graph2gauss, struc2vec, GraphWave and DRNE. Following Tu et al. (2018), we also compare with four centrality measures, i.e., closeness centrality, betweenness centrality, eigenvector centrality and k-core. Since the combination of these four measures performs best (Tu et al. 2018), we compare only the classification performance of the combination as features in this task. For the parameters of the baselines and struc2gauss, we use the same settings as in Sect. 5.3.
The average accuracies for structural role classification on Europe-air and USA-air are shown in Figs. 5 and 6. From the results, we can observe that:
- struc2gauss outperforms almost all other methods on both networks, except DRNE on the Europe-air network. On the Europe-air network, struc2gauss with expected likelihood and spherical covariances, i.e., s2g_el_s, performs best. struc2gauss with KL divergence and spherical covariances, i.e., s2g_kl_s, achieves the second best performance, especially when the training ratio is larger than 0.7. struc2gauss with diagonal covariances, i.e., s2g_el_d and s2g_kl_d, is on par with GraphWave, DRNE and struc2vec and outperforms the other methods. On the USA-air network, struc2gauss with different settings outperforms all baselines. This indicates the effectiveness of struc2gauss in modeling the structural role information. Although the best-performing combination of energy function and covariance form differs between the two networks, variants of struc2gauss are always the best.
- Among the baselines, only struc2vec, GraphWave and DRNE can capture the structural role information, so they achieve better classification accuracy than the other baselines. DRNE performs best among these baselines since it captures regular equivalence. GraphWave and struc2vec are the second best baselines because they also aim to capture structural roles.
- Random walk and neighbor based NE methods capture only local community structures, so they perform worse than struc2vec, GraphWave, DRNE and our proposed struc2gauss. Note that methods such as DeepWalk, LINE and node2vec, although they consider the first-, second- and/or higher-order proximity, are still not capable of modeling structural role information.

Uncertainty modeling
Mapping a node in a network into a distribution rather than a point vector allows us to model the uncertainty of the learned representation which is another advantage of struc2gauss. Different factors can lead to uncertainties of data. It is intuitive that the more noisy edges a node has, the less discriminative information it contains, thus making its embedding more uncertain. Similarly, incompleteness of information in the network can also bring uncertainties to the representation learning. Therefore, in this section, we study two factors: noisy information and incomplete information.
To verify these hypotheses, we conduct the following experiment using the Brazil-air and Europe-air networks. For noisy information, we randomly insert a certain number of edges into the network and then learn the latent representations and covariances. The average variance is used to measure the uncertainty. For the Brazil-air network, the number of noisy edges ranges from 50 to 300, and for Europe-air it ranges from 500 to 3000. For incomplete information, we randomly delete a certain number of edges from the network to make it incomplete and then learn the latent representations and covariances. Similarly, for the Brazil-air network, the number of removed edges ranges from 50 to 300, and for Europe-air it ranges from 500 to 3000. The other parameter settings are the same as in Sect. 5.3. The results are shown in Figs. 7 and 8. It can be observed that (1) as more noisy edges are added to the networks and (2) as more edges are removed from the networks, the average variance values become larger. struc2gauss with different energy functions and covariance forms shows the same trend. This demonstrates that our proposed struc2gauss is able to model the uncertainties of the learned node representations. Interestingly, struc2gauss with expected likelihood and diagonal covariance (s2g_el_d) always has the lowest average variance, while struc2gauss with KL divergence and diagonal covariance (s2g_kl_d) always has the largest. This may result from the learning mechanisms of the different energy functions when measuring the distance between two distributions. To clarify the results, we also list the NMI for the clustering task in Tables 4 and 5. Compared with the original Gaussian embedding method, these results again show the effectiveness of our method in preserving structural roles and modeling uncertainties.
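The perturbation protocol above can be sketched as follows. The helper names are hypothetical, edges are treated as undirected pairs without self-loops, and "average variance" is assumed here to be the mean over all nodes and all dimensions of the learned variances; the text does not spell out the exact averaging. The embedding step itself is left out.

```python
import random

def add_noisy_edges(edges, n_nodes, k, rng):
    """Return the undirected edge set with k random, previously absent edges added."""
    present = {frozenset(e) for e in edges}
    if len(present) + k > n_nodes * (n_nodes - 1) // 2:
        raise ValueError("not enough absent node pairs to add k edges")
    while k > 0:
        u, v = rng.sample(range(n_nodes), 2)  # two distinct nodes
        pair = frozenset((u, v))
        if pair not in present:
            present.add(pair)
            k -= 1
    return {tuple(sorted(e)) for e in present}

def remove_edges(edges, k, rng):
    """Return the undirected edge set with k randomly chosen edges removed."""
    unique = sorted({tuple(sorted(e)) for e in edges})
    return set(unique) - set(rng.sample(unique, k))

def average_variance(variances):
    """Mean of all per-node, per-dimension variances (assumed definition)."""
    flat = [v for row in variances for v in row]
    return sum(flat) / len(flat)

rng = random.Random(0)
edges = {(0, 1), (1, 2), (2, 3)}
noisy = add_noisy_edges(edges, n_nodes=5, k=2, rng=rng)
sparse = remove_edges(edges, k=1, rng=rng)
print(len(noisy), len(sparse))
print(average_variance([[0.5, 1.5], [1.0, 1.0]]))
```

In the experiment, each perturbed edge set would be passed to the embedding method, and the average variance of the resulting covariances would be recorded for each perturbation level.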

Influence of similarity measures
As mentioned, not all structural similarity measures can capture the global structural role information. To validate the choice of RoleSim as the similarity measure for structural role information, we investigate the influence of different similarity measures on the learned node representations. Specifically, we select two other widely used structural similarity measures, i.e., SimRank (Jeh and Widom 2002) and MatchSim (Lin et al. 2009), and incorporate them by replacing RoleSim in our framework. The datasets and evaluation metrics used in this experiment are [...] Fig. 9. We can come to the following conclusions:
- RoleSim outperforms the other two similarity measures on both types of networks, with and without clustering labels. This indicates that RoleSim can better capture the global structural information. The performance of MatchSim varies across networks and is similar to that of struc2vec. Thus, it can capture the global structural information to some extent.
- SimRank performs worse than the other similarity measures as well as struc2vec (Table 3). Under the basic assumption of SimRank that "two objects are similar if they relate to similar objects", similarity is also computed via relations between nodes, so the mechanism is similar to that of random walk based methods, which have been shown to be incapable of capturing the global structural information (Lyu et al. 2017).
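The SimRank behavior described above can be seen on a toy graph. The sketch below is a naive iterative SimRank for undirected graphs, using neighbors in place of in-neighbors; the decay factor 0.8 and the iteration count are illustrative choices, and the function name is hypothetical. The graph consists of two disconnected stars whose hubs play the same structural role.

```python
def simrank(neighbors, decay=0.8, iterations=10):
    """Naive SimRank on an undirected graph given as {node: [neighbors]}."""
    nodes = list(neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not neighbors[a] or not neighbors[b]:
                    new[a][b] = 0.0
                else:
                    # Average similarity over all pairs of neighbors, then decay.
                    total = sum(sim[i][j] for i in neighbors[a] for j in neighbors[b])
                    new[a][b] = decay * total / (len(neighbors[a]) * len(neighbors[b]))
        sim = new
    return sim

# Two disjoint stars: hubs c1 and c2, each with two leaves.
graph = {
    "c1": ["a1", "a2"], "a1": ["c1"], "a2": ["c1"],
    "c2": ["b1", "b2"], "b1": ["c2"], "b2": ["c2"],
}
scores = simrank(graph)
print(scores["a1"]["a2"])  # leaves of the same star
print(scores["c1"]["c2"])  # hubs of different stars
```

Leaves that share a hub receive a high score, while the two hubs, which are structurally identical but never reachable from one another, receive a score of zero. This is consistent with the observation that SimRank reflects proximity rather than global role.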

Parameter sensitivity
We consider two types of parameters in struc2gauss: (1) parameters also used in other NE methods, including the number of latent dimensions, the number of samples per node and the number of positive/negative nodes per node; and (2) parameters used only in Gaussian embedding, including the mean constraint C and the covariance constraint c_max (note that we fix the minimal covariance c_min to 0.5 for simplicity). To evaluate how changes to these parameters affect performance, we conducted the same node clustering experiment on the labeled USA-air network introduced in Sect. 5.3. In the interest of brevity, we tune one parameter at a time while fixing all the others. Specifically, the number of latent dimensions varies from 10 to 200, the number of samples varies from 5 to 15 and the number of positive/negative nodes varies from 40 to 190. The mean constraint C ranges from 1 to 10, and the covariance constraint c_max ranges from 1 to 10. The results of the parameter sensitivity study are shown in Figs. 10 and 11. It can be observed from Fig. 10a, b that the trends are relatively stable, i.e., the performance is insensitive to changes in the representation dimension and the number of samples. Clustering performance improves as the number of positive/negative nodes increases, as shown in Fig. 10c. Therefore, we can conclude that struc2gauss is more stable than other methods. It has been reported that other methods, e.g., DeepWalk (Perozzi et al. 2014), LINE (Tang et al. 2015b) and node2vec (Grover and Leskovec 2016), are sensitive to many parameters. In general, more dimensions, more walks and more context can achieve better performance. However, it is difficult to search for the best combination of parameters in practice, and doing so may also lead to overfitting. For the Gaussian embedding specific parameters C and c_max, both trends are stable, i.e., the choice of these constraints has little effect on performance. The NMI decreases with a larger mean constraint C, but the difference is small.

Efficiency and effectiveness study
As discussed in Sect. 4.5, high computational complexity is one of the major issues of our method. In this experiment, we empirically study this issue by comparing the run-time and performance of different global structure preserving baselines and a heuristic method that accelerates the RoleSim computation. The heuristic method, named Fast struc2gauss, is introduced in Sect. 4.5: we set the similarity to 0 if two nodes have a large difference in degrees, which avoids further computation for dissimilar node pairs. For simplicity, we test only struc2gauss with KL divergence and spherical covariance. Also, we consider as baselines only embedding methods that can preserve the structural role information, i.e., GraphWave, struc2vec and DRNE.
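The degree-based pruning idea can be sketched as a pre-filter that decides which node pairs are worth passing to RoleSim at all. The text does not state how "a large difference in degrees" is measured, so the ratio test and the `max_ratio` threshold below are assumptions, as is the function name. Pairs that are filtered out would simply be assigned similarity 0.

```python
def candidate_pairs(degrees, max_ratio=2.0):
    """Return node pairs whose degrees are close enough to compare.

    degrees   : {node: degree}
    max_ratio : hypothetical threshold; a pair (u, v) with deg(u) <= deg(v)
                is kept only if deg(v) <= max_ratio * max(deg(u), 1)
    """
    nodes = sorted(degrees, key=degrees.get)
    pairs = []
    for i, u in enumerate(nodes):
        limit = max_ratio * max(degrees[u], 1)
        for v in nodes[i + 1:]:
            if degrees[v] > limit:
                # Nodes are sorted by degree, so every later v is even further away.
                break
            pairs.append((u, v))
    return pairs

degrees = {"a": 1, "b": 2, "c": 10, "d": 12}
print(candidate_pairs(degrees))
```

Sorting by degree lets the inner loop stop early, so the number of similarity computations depends on how many nodes have comparable degrees rather than on all pairs.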
We conduct the experiments on the larger networks without ground-truth labels, because on smaller networks the run-time differences are not significant. The run-time comparison is shown in Table 7 and the performance comparison is shown in Table 8. Note that NA in these tables indicates that a method reported a memory error and did not obtain any results. To make a fair comparison, all methods are run on the same machine with 128GB memory, and no GPU is used for DRNE. From these results, it can be observed that:
(1) Although the computational issue still exists, our method achieves good performance compared with state-of-the-art structural role preserving network embedding methods such as GraphWave and struc2vec.
(2) Although DRNE is much faster, its performance is worse than that of our method and the other baselines. Moreover, it is incapable of modeling uncertainties.
(3) Fast struc2gauss can effectively accelerate the RoleSim computation and achieve comparable performance in role clustering.

Discussion
The proposed struc2gauss is a flexible framework for node representations. As shown in Fig. 2, different similarity measures can be incorporated into this framework and empirical studies will be presented in Sect. 5.6. Furthermore, other types of methods which model structural information can be utilized in struc2gauss as well.
To illustrate the potential to incorporate different methods, we categorize methods for capturing structural information into three types:
- Similarity-based methods. Similarity-based methods calculate pairwise similarity based on the structural information of a given network. Related work has been reviewed in Sect. 2.2.
- Ranking-based methods. PageRank (Page et al. 1999) and HITS (Kleinberg 1999) are the two most representative ranking-based methods that learn the structural information. PageRank has been used for NE in (Ma et al. 2017).
- Partition-based methods. These methods partition the nodes of a network into groups according to structural information.
In this paper, we focus on similarity-based methods. For ranking-based methods, we can slide a fixed-size window over the ranking list; given a node, the nodes within its window can be viewed as its context. In fact, this mechanism is similar to DeepWalk. For partition-based methods, we can consider the nodes in the same group as the context for each other.
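Both alternative ways of building a structural context are simple to sketch. The code below assumes that ranking scores (for example from PageRank or HITS) and a node partition have already been computed elsewhere; the function names and the `window` parameter are hypothetical.

```python
def ranking_context(scores, window=2):
    """Context of a node = nodes within `window` positions of it in the ranking."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    context = {}
    for pos, node in enumerate(ranked):
        lo, hi = max(0, pos - window), pos + window + 1
        context[node] = [other for other in ranked[lo:hi] if other != node]
    return context

def partition_context(groups):
    """Context of a node = all other nodes in the same group."""
    context = {}
    for members in groups:
        for node in members:
            context[node] = [other for other in members if other != node]
    return context

# Made-up ranking scores and partition for four nodes.
scores = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
print(ranking_context(scores, window=1))
print(partition_context([["a", "b"], ["c", "d"]]))
```

Either context dictionary could then stand in for the similarity-based structural context when sampling positive nodes for Gaussian embedding.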

Conclusions and future work
Two major limitations exist in previous NE studies: structure preservation and uncertainty modeling. Random-walk based NE methods fail to capture global structural information, and methods that represent a node as a point vector are not capable of modeling the uncertainty of node representations. We proposed a flexible structure preserving network embedding framework, struc2gauss, to tackle these limitations. On the one hand, struc2gauss learns node representations based on structural similarity measures, so that global structural information can be taken into consideration. On the other hand, struc2gauss utilizes Gaussian embedding to represent each node as a Gaussian distribution, where the mean indicates the position of the node in the embedding space and the covariance represents its uncertainty.
We experimentally compared three different structural similarity measures for networks and two different energy functions for Gaussian embedding. By conducting experiments from different perspectives, we demonstrated that struc2gauss excels in capturing global structural information compared with state-of-the-art NE techniques such as DeepWalk, node2vec and struc2vec. It outperforms competing methods on the role discovery task and on structural role classification on several real-world networks. It also overcomes the limitation of uncertainty modeling and is capable of capturing different levels of uncertainty. Additionally, struc2gauss is less sensitive to its parameters, which makes it more stable in practice and reduces the effort needed for parameter tuning.
In the future, we will explore faster RoleSim measures for more scalable NE methods, for example, a fast method to select the k most similar nodes for a given node. It is also a promising research direction to investigate strategies other than structural similarity for modeling global structural information in NE tasks. Other future investigations in this area include learning node representations in dynamic and temporal networks.