8.1 Introduction

As a natural way to represent objects and their relationships, the network is ubiquitous in our daily lives. The rapid development of social networks like Facebook and Twitter encourages researchers to design effective and efficient algorithms on network structure. A key problem of network study is how to represent the network information properly. Traditional representations of networks are usually high dimensional and sparse, which becomes a weakness when people apply statistical learning to networks. With the development of machine learning, feature learning of vertices in a network is becoming an emerging task. Network representation learning algorithms turn network information into low-dimensional dense real-valued vectors, which can be used as input for existing machine learning algorithms. For example, the representations of vertices can be fed to a classifier like Support Vector Machine (SVM) for the vertex classification task. The representations can also be used for visualization by treating them as points in Euclidean space. In this section, we will formalize the network representation learning problem.

Denote a network as \(G=(V,E)\) where V is the vertex set and E is the edge set. An edge \(e=(v_i,v_j)\in E\) where \(v_i,v_j\in V\) is a directed edge from vertex \(v_i\) to \(v_j\). The outdegree of vertex \(v_i\) is defined as \(\deg _O(v_i)=|\{v_j|(v_i,v_j)\in E\}|\). Similarly, the indegree of vertex \(v_i\) is \(\deg _I(v_i)=|\{v_j|(v_j,v_i)\in E\}|\). For an undirected network, we have \(\deg (v_i)=\deg _O(v_i)=\deg _I(v_i)\). Taking a social network as an example, a vertex represents a user and an edge represents the friendship between two users. The indegree and outdegree represent the number of followers and followees of a user, respectively.

Adjacency matrix \(A\in \mathbb {R}^{|V|\times |V|}\) is a matrix where \(A_{ij}=1\) if \((v_i,v_j)\in E\) and \(A_{ij}=0\) otherwise. We can easily generalize adjacency matrix to weighted network by setting \(A_{ij}\) to the weight of edge \((v_i,v_j)\). The adjacency matrix is a simple and straightforward representation of the network. Each row of adjacency matrix A denotes the relationship between a vertex and other vertices and can be seen as the representation of the corresponding vertex.
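As a small illustration of these definitions, the following sketch builds the adjacency matrix of a toy directed network and reads the degrees off its rows and columns. The edge list and variable names are made up for the example.

```python
# Toy directed network: an edge (i, j) points from vertex i to vertex j.
# The edge list is purely illustrative.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
num_vertices = 4

# Adjacency matrix A with A[i][j] = 1 if (v_i, v_j) is in E, and 0 otherwise.
A = [[0] * num_vertices for _ in range(num_vertices)]
for i, j in edges:
    A[i][j] = 1

# The outdegree of v_i is the sum of row i; the indegree is the sum of column i.
out_degree = [sum(row) for row in A]
in_degree = [sum(A[i][j] for i in range(num_vertices)) for j in range(num_vertices)]

print(out_degree)  # [2, 1, 1, 1]
print(in_degree)   # [1, 1, 3, 0]
```

Row i of `A` is exactly the (sparse, |V|-dimensional) representation of vertex \(v_i\) discussed above.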

Though convenient and straightforward, the adjacency matrix representation suffers from a scalability problem. Adjacency matrix A takes \(|V|\times |V|\) space to store, which is usually unacceptable when |V| grows large. Also, the adjacency matrix is very sparse, which means most of its entries are zeros. The data sparsity makes discrete algorithms applicable, but it is still hard to develop efficient algorithms for statistical learning [93].

Fig. 8.1 A visualization of vertex embeddings learned by DeepWalk model [93]

Therefore, researchers proposed learning low-dimensional dense representations for vertices in a network. Formally, the goal of network representation learning is to learn a real-valued vector \(\textbf{v}\in \mathbb {R}^d\) for each vertex \(v\in V\) where dimension d is much smaller than the number of vertices |V|. The idea is that similar vertices should have close representations as shown in Fig. 8.1. Network representation learning can be unsupervised or semi-supervised. The representations are automatically learned without feature engineering and, once learned, can be further used for specific tasks like classification. These representations are low dimensional, which enables efficient algorithms to be designed over the representations without considering the network structure itself. We will discuss more details about the evaluation of network representations later in this chapter.

8.2 Network Representation

In this section, we will introduce several kinds of network representation learning algorithms in detail.

8.2.1 Spectral Clustering Based Methods

Spectral clustering based methods are a group of algorithms that compute the first k eigenvectors or singular vectors of an affinity matrix, such as the adjacency or Laplacian matrix of the network. These methods depend heavily on the construction of the affinity matrix, and the evaluation results of different affinity matrices vary a lot. Generally speaking, spectral clustering based methods have a high complexity because the computation of eigenvectors and singular vectors has superlinear time complexity.

On the other hand, spectral clustering based methods need to save an affinity matrix in the memory during the computation. Thus the space complexity cannot be ignored, either. These disadvantages limit the large-scale and online generalization of these methods. Now we will present several algorithms based on spectral clustering.

Locally Linear Embedding (LLE) [98] assumes that the representations of vertices are sampled from a manifold. More specifically, LLE supposes that the representations of a vertex and its neighbors lie in a locally linear patch of the manifold. That is to say, a vertex’s representation can be approximated by a linear combination of the representation of its neighbors. LLE uses the linear combination of neighbors to reconstruct the center vertex. Formally, the reconstruction error of all vertices can be expressed as

$$\begin{aligned} \mathscr {L}(\textbf{W},\textbf{V})=\sum _{i=1}^{|V|}\left\| \textbf{v}_i-\sum _{j=1}^{|V|} \textbf{W}_{ij}\textbf{v}_j \right\| ^2, \end{aligned}$$
(8.1)

where \(\textbf{V} \in \mathbb {R}^{|V|\times d}\) is the vertex embedding matrix and \(\textbf{W}_{ij}\) is the contribution coefficient of vertex \(v_j\) to \(v_i\). LLE enforces \(\textbf{W}_{ij}=0\) if \(v_i\) and \(v_j\) are not connected, i.e., \((v_i,v_j)\not \in E\). Further, the summation of a row of matrix \(\textbf{W}\) is set to 1, i.e., \(\sum _{j=1}^{|V|} \textbf{W}_{ij}=1\).

Equation 8.1 is solved by alternately optimizing weight matrix \(\textbf{W}\) and representation \(\textbf{V}\). The optimization over \(\textbf{W}\) can be solved as a least-squares problem. The optimization over representation \(\textbf{V}\) leads to the following optimization problem:

$$\begin{aligned} \mathscr {L}(\textbf{W},\textbf{V})=\sum _{i=1}^{|V|}\left\| \textbf{v}_i-\sum _{j=1}^{|V|} \textbf{W}_{ij}\textbf{v}_j\right\| ^2, \end{aligned}$$
(8.2)
$$\begin{aligned} s.t. \ \sum _{i=1}^{|V|} \textbf{v}_i=\textbf{0}, \end{aligned}$$
(8.3)
$$\begin{aligned} \text {and} \ \ |V|^{-1} \sum _{i=1}^{|V|} \textbf{v}_i^{\top }\textbf{v}_i=\textbf{I}_d, \end{aligned}$$
(8.4)

where \(\textbf{I}_d\) denotes the \(d\times d\) identity matrix. The conditions in Eqs. 8.3 and 8.4 ensure the uniqueness of the solution. The first condition centers all vertex embeddings at the origin, and the second condition guarantees that different coordinates have the same scale, i.e., equal contribution to the reconstruction error.

The optimization problem can be formulated as the computation of eigenvectors of matrix \((\textbf{I}_{|V|}-\textbf{W}^{\top })(\textbf{I}_{|V|}-\textbf{W})\), which is an easily solvable eigenvalue problem. More details can be found in the note [22].

Laplacian Eigenmap [8] algorithm simply follows the idea that the representations of two connected vertices should be close. Specifically, the “closeness” is measured by the square of Euclidean distance. We use D to denote diagonal degree matrix where D is a \(|V|\times |V|\) diagonal matrix and the ith diagonal entry \(D_{ii}\) is the degree of vertex \(v_i\). The Laplacian matrix L of a graph is defined as the difference of diagonal matrix D and adjacency matrix A, i.e., \(L=D-A\).

Laplacian Eigenmap algorithm wants to minimize the following cost function:

$$\begin{aligned} \mathscr {L}(\textbf{V})=\sum _{\{i,j|(v_i,v_j)\in E\}}\Vert \textbf{v}_i-\textbf{v}_j\Vert ^2, \end{aligned}$$
(8.5)
$$\begin{aligned} s.t.\ \textbf{V}^{\top }D\textbf{V}=\textbf{I}_d. \end{aligned}$$
(8.6)

The cost function is the summation of square loss of all connected vertex pairs and the condition prevents the trivial all-zero solution caused by arbitrary scale. Equation 8.5 can be reformulated in matrix form as

$$\begin{aligned} \textbf{V}^*=\arg \min _{\textbf{V}^{\top }D\textbf{V}=\textbf{I}_d} \text {tr}(\textbf{V}^{\top }L\textbf{V}). \end{aligned}$$
(8.7)

Algebraic knowledge tells us that the optimal solution \(\textbf{V}^*\) of Eq. 8.7 consists of the eigenvectors corresponding to the d smallest nonzero eigenvalues of the generalized eigenvalue problem \(L\textbf{x}=\lambda D\textbf{x}\), where the matrix D enters through the constraint in Eq. 8.6. Note that the Laplacian Eigenmap algorithm can be easily generalized to the weighted graph.
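The following NumPy sketch shows one way to carry out this computation on a small connected undirected graph. It assumes the generalized problem \(L\textbf{x}=\lambda D\textbf{x}\) is solved through the symmetric matrix \(D^{-1/2}LD^{-1/2}\); the function name and the toy graph (two triangles joined by one edge) are illustrative only.

```python
import numpy as np

def laplacian_eigenmap(A, d):
    """Embed the vertices of a connected undirected graph into R^d."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    D = np.diag(deg)
    L = D - A                                  # graph Laplacian L = D - A
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # L x = lambda D x is equivalent to (D^{-1/2} L D^{-1/2}) u = lambda u
    # with u = D^{1/2} x, which is an ordinary symmetric eigenproblem.
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, U = np.linalg.eigh(L_sym)         # eigenvalues in ascending order
    U = U[:, 1:d + 1]                          # skip the eigenvalue-0 solution
    V = d_inv_sqrt[:, None] * U                # map back: x = D^{-1/2} u
    return V, L, D

# Two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

V, L, D = laplacian_eigenmap(A, d=2)
print(np.round(V.T @ D @ V, 6))   # identity matrix: the constraint of Eq. 8.6
print(np.sign(V[:, 0]))           # the first coordinate separates the two triangles
```

A dense eigendecomposition like this one is only practical for small graphs, which is the scalability concern raised at the start of this subsection.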

Both LLE and Laplacian Eigenmap have a symmetric cost function which indicates that both algorithms cannot be applied to the directed graph. Directed Graph Embedding (DGE) [17] was proposed to generalize Laplacian Eigenmap.

For both directed and undirected graph, we can define a transition probability matrix \(\textbf{P}\in \mathbb {R}^{|V|\times |V|}\), where \(\textbf{P}_{ij}\) denotes the probability that vertex \(v_i\) walks to \(v_j\). The transition matrix defines a Markov random walk through the graph. We denote the stationary value of vertex \(v_i\) as \(\pi _i\) where \(\sum _i \pi _i=1\). The stationary distribution of random walk is commonly used in many ranking algorithms such as PageRank. DGE designs a new cost function which emphasizes the important vertices, which have a higher stationary value:

$$\begin{aligned} \mathscr {L}(\textbf{V})=\sum _{i=1}^{|V|}\pi _i \sum _{j=1}^{|V|}\textbf{P}_{ij}\Vert \textbf{v}_i-\textbf{v}_j\Vert ^2. \end{aligned}$$
(8.8)

By denoting \(\textbf{M} = \text {diag}(\pi _1,\pi _2,\dots ,\pi _{|V|})\), the cost function Eq. 8.8 can be reformulated as

$$\begin{aligned} \mathscr {L}(\textbf{V})=2\text {tr}(\textbf{V}^{\top }\textbf{B}\textbf{V}), \end{aligned}$$
(8.9)
$$\begin{aligned} s.t. \ \textbf{V}^{\top }\textbf{M} \textbf{V}=\textbf{I}_d, \end{aligned}$$
(8.10)

where

$$\begin{aligned} \textbf{B}=\textbf{M}-\frac{\textbf{M} \textbf{P}+\textbf{P}^{\top }\textbf{M}}{2}. \end{aligned}$$
(8.11)

The condition Eq. 8.10 is added to remove an arbitrary scaling factor. Similar to Laplacian Eigenmap, the optimization problem can also be solved as a generalized eigenvector problem.
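The reformulation of Eq. 8.8 as a trace can be checked numerically. The sketch below assumes the symmetrized form \(\textbf{B}=\textbf{M}-(\textbf{M}\textbf{P}+\textbf{P}^{\top }\textbf{M})/2\), which is what expanding the squared norm and using the stationarity of \(\pi \) yields, and compares both sides on a toy strongly connected directed graph. The graph and the random embedding are made up for the check.

```python
import numpy as np

# Toy strongly connected, aperiodic directed graph (illustrative only).
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)       # row-stochastic transition matrix

# Stationary distribution pi (pi P = pi), obtained by power iteration.
pi = np.full(len(A), 1.0 / len(A))
for _ in range(1000):
    pi = pi @ P
pi /= pi.sum()

M = np.diag(pi)
B = M - (M @ P + P.T @ M) / 2

# Arbitrary embedding, used only to compare the two expressions.
rng = np.random.default_rng(0)
V = rng.normal(size=(4, 2))

direct = sum(pi[i] * P[i, j] * np.sum((V[i] - V[j]) ** 2)
             for i in range(4) for j in range(4))      # Eq. 8.8
trace_form = 2 * np.trace(V.T @ B @ V)                 # Eq. 8.9

print(direct, trace_form)   # the two values agree up to rounding
```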

To compare the above three network embedding learning algorithms, we summarize their applicability in Table 8.1.

Table 8.1 Applicability of LLE, Laplacian Eigenmap, and DGE algorithms on undirected, weighted, and directed graph

Unlike previous works which minimize the distance between vertex representations, Tang and Liu [112] introduce modularity [85] into the cost function instead. Modularity is a measurement which characterizes how far the graph is away from a uniform random graph. Given graph \(G=(V,E)\), we assume that vertices V are divided into k nonoverlapping communities. By “uniform random graph”, we mean that vertices connect to each other based on a uniform distribution given their degrees. Then the expected number of edges between \(v_i\) and \(v_j\) is \(\frac{\text {deg}(v_i)\text {deg}(v_j)}{2|E|}\). The modularity Q of a graph is then defined as

$$\begin{aligned} Q=\frac{1}{2|E|}\sum _{i,j}\left[ A_{ij}-\frac{{\text {deg}}(v_i){\text {deg}}(v_j)}{2|E|}\right] \delta (v_i,v_j), \end{aligned}$$
(8.12)

where \(\delta (v_i,v_j)=1\) if \(v_i\) and \(v_j\) belong to the same community and \(\delta (v_i,v_j)=0\) otherwise. A larger modularity indicates that the subgraphs inside communities are denser, which follows the intuition that a community is a dense well-connected cluster. Then the problem is to find a partition that maximizes the modularity Q.
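To make Eq. 8.12 concrete, the sketch below evaluates Q for two candidate partitions of a toy undirected graph (two triangles joined by one edge). The function name and the graph are illustrative.

```python
def modularity(edges, community):
    """Modularity Q (Eq. 8.12) of an undirected, unweighted graph.

    community[i] is the community label of vertex i.
    """
    n = len(community)
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    deg = [sum(row) for row in A]
    two_m = sum(deg)                      # equals 2|E|
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:          # delta(v_i, v_j) = 1
                q += A[i][j] - deg[i] * deg[j] / two_m
    return q / two_m

# Triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

good = modularity(edges, [0, 0, 0, 1, 1, 1])   # one community per triangle
bad = modularity(edges, [0, 1, 0, 1, 0, 1])    # communities cut across triangles

print(good)   # 5/14, about 0.357
print(bad)    # -3/14, about -0.214
```

The partition that follows the two dense triangles receives the larger modularity, in line with the intuition stated above.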

However, hard clustering by modularity maximization has been proved to be NP-hard. Therefore, they relax the problem to a soft case. Let \(\textbf{d}\in \mathbb {Z}_+^{|V|}\) denote the degrees of all vertices and \(\textbf{1}\in \{0,1\}^{|V|\times k}\) denote the community indicator matrix where

$$\begin{aligned} \textbf{1}_{ij}=\left\{ \begin{aligned} 1&\qquad \text {if vertex}\, i \, \text {belongs to community}\, j, \\ 0&\qquad \text {otherwise.} \end{aligned} \right. \end{aligned}$$
(8.13)

Then we define modularity matrix \(\textbf{B}\) as

$$\begin{aligned} \textbf{B}=A-\frac{\textbf{d}\textbf{d}^T}{2|E|}, \end{aligned}$$
(8.14)

and modularity Q can be reformulated as

$$\begin{aligned} Q=\frac{1}{2|E|}\text {tr}(\textbf{1}^{\top }\textbf{B}\textbf{1}). \end{aligned}$$
(8.15)

By relaxing \(\textbf{1}\) to a continuous matrix, it has been proved that the optimal solution \(\textbf{1}\) is the top-k eigenvectors of modularity matrix \(\textbf{B}\) [84].

As an alternative cost function, Tang and Liu also proposed another algorithm [113] that optimizes the normalized cut of the graph. Similarly, the algorithm reduces to the computation of the top-k eigenvectors of the normalized graph Laplacian \(\widetilde{L}\):

$$\begin{aligned} \widetilde{L}=D^{-\frac{1}{2}}LD^{-\frac{1}{2}}=I-D^{-\frac{1}{2}}AD^{-\frac{1}{2}}. \end{aligned}$$
(8.16)

Then the community indicator matrix \(\textbf{1}\) is taken as a k-dimensional vertex representation.

To conclude spectral clustering methods for network representation learning, these methods often define a cost function that is linear or quadratic to the vertex embedding. Then they reformulate the cost function as a matrix form and figure out that the optimal solutions are eigenvectors of a particular matrix according to algebra knowledge. The major drawback of spectral clustering methods is the complexity: the computation of eigenvectors for large-scale matrices is both time consuming and space consuming.

8.2.2 DeepWalk

As shown in the previous subsection, exact computation of the optimal solution, such as eigenvector computation, is not very efficient for large-scale problems. Meanwhile, neural network approaches have proved their effectiveness in many areas such as natural language and image processing. Though gradient descent cannot always guarantee an optimal solution for neural network models, the implementation and learning of neural networks are relatively fast, and they usually perform well. In addition, neural network models free people from feature engineering and are mostly data driven. Thus, the exploration of neural network approaches to representation learning is becoming an emerging task.

DeepWalk [93] proposes a novel approach that introduces deep learning techniques into network representation learning for the first time. Instead of the adjacency matrix, DeepWalk models truncated random walks, and the benefits are twofold: first, random walks need only local information and thus enable discrete and online algorithms, while modeling the adjacency matrix may require storing everything in memory and thus be space consuming; second, modeling random walks can alleviate the variance and uncertainty of modeling the original binary adjacency matrix. We take a closer look at DeepWalk below.

Unsupervised representation learning algorithms have been widely studied and applied in the natural language processing area. The authors show that the vertex frequency in short random walks follows a power law, as word frequency in documents does. Having shown the correspondence between vertices and words and between random walks and sentences, the authors adapted word2vec [80], a well-known word representation learning algorithm, to vertex representation learning. Now, we will introduce the DeepWalk algorithm in detail.

Given graph \(G=(V,E)\), we denote a random walk started at vertex \(v_i\) as \(\ell _{v_i}\). We use \(\ell _{v_i}^k\) to represent the kth vertex in the random walk \(\ell _{v_i}\). The next vertex \(\ell _{v_i}^{k+1}\) is generated by uniformly random selection from neighbors of vertex \(\ell _{v_i}^k\). Random walk sequences have been used for many network analysis tasks, such as similarity measurement and community detection [2, 32].
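A truncated random walk of this kind can be generated in a few lines. The sketch below is one possible implementation, assuming the graph is stored as an adjacency list; `random_walk` and the toy graph are illustrative names, not part of the DeepWalk reference code.

```python
import random

def random_walk(adj, start, length, rng):
    """Return a walk of at most `length` vertices rooted at `start`.

    adj maps each vertex to the list of its neighbors; the next vertex is
    drawn uniformly at random from the neighbors of the current one.
    """
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:          # dead end (possible in directed graphs)
            break
        walk.append(rng.choice(neighbors))
    return walk

# Toy undirected graph as an adjacency list.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)             # fixed seed for reproducibility
walk = random_walk(adj, start=0, length=10, rng=rng)
print(walk)
```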

DeepWalk follows the idea of language modeling to model short random walk sequences. That is to estimate the likelihood of observing vertex \(v_i\) given all previous vertices in the random walk:

$$\begin{aligned} P(v_i|(v_1,v_2,\dots , v_{i-1})). \end{aligned}$$
(8.17)

For the purpose of vertex representation learning, we instead predict vertex \(v_i\) given the representations of all previous vertices:

$$\begin{aligned} P(v_i|(\textbf{v}_1,\textbf{v}_2,\dots , \textbf{v}_{i-1})). \end{aligned}$$
(8.18)

A relaxation of this formula, borrowed from language modeling, uses vertex \(v_i\) to predict its neighboring vertices \(v_{i-w},\dots ,v_{i-1},v_{i+1},\dots ,v_{i+w}\) where w is the window size. This part of the model is known as the Skip-gram model in word embedding learning. The neighboring vertices are also called context vertices of the center vertex. As another simplification, DeepWalk ignores the order and offset of the vertices and thus predicts \(v_{i-w}\) and \(v_{i-1}\) in the same way. The optimization function of a single vertex of a random walk can be formulated as

$$\begin{aligned} \min _{\textbf{v}} -\log P(\{v_{i-w},\dots ,v_{i-1},v_{i+1},\dots ,v_{i+w}\}|\textbf{v}_i). \end{aligned}$$
(8.19)

Under the assumption that context vertices are predicted independently, the loss function can be rewritten as

$$\begin{aligned} \min _{\textbf{v}} \sum _{k=-w,k\ne 0}^w-\log P(v_{i+k}|\textbf{v}_i). \end{aligned}$$
(8.20)

The overall loss function can be obtained by adding up over every vertex in every random walk.

Now we talk about how to predict a single vertex \(v_j\) given center vertex \(v_i\). In DeepWalk, each vertex \(v_i\) has two representations with the same dimension: vertex representation \(\textbf{v}_i\in \mathbb {R}^d\) and context representation \(\textbf{c}_i\in \mathbb {R}^d\). The probability of prediction \(P(v_j|\textbf{v}_i)\) is defined by a softmax function over all vertices:

$$\begin{aligned} P(v_j|\textbf{v}_i)=\frac{\exp (\textbf{v}_i\textbf{c}_j^{\top })}{\sum _{k=1}^{|V|}\exp (\textbf{v}_i\textbf{c}_k^{\top })}. \end{aligned}$$
(8.21)
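The softmax of Eq. 8.21 and the window loss of Eq. 8.20 can be written down directly. The sketch below uses plain Python lists for the vertex matrix `V` and the context matrix `C`; the helper names and the toy walk are illustrative.

```python
import math

def softmax_prob(v_i, C):
    """P(v_j | v_i) for every vertex j (Eq. 8.21)."""
    scores = [sum(a * b for a, b in zip(v_i, c_j)) for c_j in C]
    m = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)                            # the denominator has |V| terms
    return [e / z for e in exps]

def window_loss(walk, pos, w, V, C):
    """Negative log-likelihood of the context of walk[pos] (Eq. 8.20)."""
    probs = softmax_prob(V[walk[pos]], C)
    loss = 0.0
    for k in range(-w, w + 1):
        if k == 0 or not 0 <= pos + k < len(walk):
            continue                         # skip the center and out-of-range offsets
        loss -= math.log(probs[walk[pos + k]])
    return loss

# With all-zero embeddings every vertex is equally likely, so each of the
# two context vertices below contributes log|V| = log 3 to the loss.
V = [[0.0, 0.0] for _ in range(3)]
C = [[0.0, 0.0] for _ in range(3)]
walk = [0, 1, 2, 1]
loss = window_loss(walk, pos=1, w=1, V=V, C=C)
print(loss, 2 * math.log(3))
```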

Here we come to the parameter learning phase of DeepWalk. We first present the pseudocode of the DeepWalk framework in Algorithm 8.1.

[Algorithm 8.1: pseudocode of the DeepWalk framework]

where RandomWalk\((G,v_i,l)\) generates a random walk rooted at \(v_i\) with length l and Skip-gram\((\textbf{V},\ell _{v_i},w)\) function is defined in Algorithm 8.2, where \(\alpha _l\) is the learning rate of stochastic gradient descent.

[Algorithm 8.2: pseudocode of the Skip-gram function]
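The two procedures can be sketched in NumPy as follows. This is a minimal illustration built from the description in the text, not the authors' implementation: it uses the full softmax of Eq. 8.21, a constant learning rate `lr` in place of \(\alpha _l\), and hypothetical function and parameter names.

```python
import numpy as np

def skipgram_step(V, C, center, context, lr):
    """One SGD step on J = -log P(context | center); returns J before the step."""
    scores = C @ V[center]
    scores -= scores.max()                 # numerical stability
    p = np.exp(scores)
    p /= p.sum()                           # full softmax over all |V| vertices
    loss = -np.log(p[context])
    err = p.copy()
    err[context] -= 1.0                    # gradient of J w.r.t. the scores
    grad_v = C.T @ err                     # gradient w.r.t. V[center]
    grad_C = np.outer(err, V[center])      # gradient w.r.t. C
    V[center] -= lr * grad_v
    C -= lr * grad_C
    return loss

def skipgram(V, C, walk, w, lr):
    """Update V and C from every (center, context) pair within window w."""
    for pos, center in enumerate(walk):
        for k in range(max(0, pos - w), min(len(walk), pos + w + 1)):
            if k != pos:
                skipgram_step(V, C, center, walk[k], lr)

def deepwalk(adj, d, walks_per_vertex, walk_length, w, lr, seed=0):
    """adj[i] is the list of neighbors of vertex i (every vertex needs one)."""
    rng = np.random.default_rng(seed)
    n = len(adj)
    V = rng.normal(scale=0.1, size=(n, d))     # vertex representations
    C = rng.normal(scale=0.1, size=(n, d))     # context representations
    for _ in range(walks_per_vertex):
        for start in rng.permutation(n):       # shuffled pass over the vertices
            walk = [int(start)]
            while len(walk) < walk_length:
                walk.append(int(rng.choice(adj[walk[-1]])))
            skipgram(V, C, walk, w, lr)
    return V, C

adj = [[1, 2], [0, 2], [0, 1, 3], [2]]         # toy undirected graph
V, C = deepwalk(adj, d=8, walks_per_vertex=5, walk_length=10, w=2, lr=0.05)
print(V.shape)   # (4, 8)
```

Each call to `skipgram_step` touches every row of `C`, so its cost grows linearly with the number of vertices.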

Note that the parameter updating rule \(\textbf{V}=\textbf{V}-\alpha _l\frac{\partial J}{\partial \textbf{V}}\) in Skip-gram has a complexity of O(|V|) because in the computation of the gradient of \(P(v_{k}|\textbf{v}_j)\) (as shown in Eq. 8.21), the denominator has |V| terms to compute. This complexity is unacceptable for large-scale networks.

To address this problem, people proposed Hierarchical Softmax as a variant of original softmax function. The core idea is to map the vertices to a balanced binary tree, where each vertex corresponds to a leaf of the tree. Then the prediction of a vertex turns to the prediction of the path from the root to the corresponding leaf. Assume that the path from root to vertex \(v_k\) is denoted by a sequence of tree nodes \(b_1,b_2\dots ,b_{\lceil \log |V|\rceil }\) and then we have

$$\begin{aligned} \log P(v_k|\textbf{v}_j)=\sum _{i=1}^{\lceil \log |V|\rceil } \log P(b_i|\textbf{v}_j). \end{aligned}$$
(8.22)

A logistic function can easily implement a binary decision on a tree node. Hence, the time complexity reduces to \(O(\log |V|)\) from O(|V|). We can accelerate the algorithm by using Huffman coding to map frequent vertices to the tree nodes that are close to the root. We can also use negative sampling which is used in word2vec to replace hierarchical softmax for speeding up.
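The path decomposition of Eq. 8.22 can be sketched as follows. For simplicity the example assumes that |V| is a power of two and stores the balanced tree in heap order (internal nodes 1 to |V|-1, leaves |V| to 2|V|-1); the node vectors are random placeholders and all names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_log_prob(leaf, v, node_vecs, num_leaves):
    """log P(leaf | v): a sum of log2(|V|) binary decisions (Eq. 8.22)."""
    node = num_leaves + leaf               # heap index of the leaf
    log_p = 0.0
    while node > 1:
        parent = node // 2
        score = sum(a * b for a, b in zip(v, node_vecs[parent]))
        # Right children have odd heap indices and get sigmoid(score);
        # left children get 1 - sigmoid(score) = sigmoid(-score).
        log_p += math.log(sigmoid(score if node % 2 else -score))
        node = parent
    return log_p

rng = random.Random(0)
num_leaves, dim = 8, 4
v = [rng.uniform(-1, 1) for _ in range(dim)]
# One vector per internal tree node; index 0 is unused in heap order.
node_vecs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_leaves)]

total = sum(math.exp(hs_log_prob(k, v, node_vecs, num_leaves))
            for k in range(num_leaves))
print(total)   # the leaf probabilities sum to 1 up to rounding
```

Because the two branch probabilities at every internal node sum to one, the leaf probabilities form a valid distribution without ever computing a |V|-term normalizer.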

So far, we have finished the introduction of the DeepWalk algorithm. DeepWalk introduces efficient deep learning techniques into network embedding learning. Table 8.2 gives an analogy between DeepWalk and word2vec. DeepWalk outperforms traditional network representation learning methods on network classification tasks and is also efficient for large-scale networks. Besides, the random walks can be replaced by nonrandom vertex sequences, such as information propagation streams. In the next subsection, we will give a detailed proof to demonstrate the connection between DeepWalk and matrix factorization.

Table 8.2 Analogy of DeepWalk and word2vec

8.2.2.1 Matrix Factorization Comprehension of DeepWalk

Perozzi et al. introduced the Skip-gram model into the study of social networks for the first time, and designed an algorithm named DeepWalk [93] for learning vertex representations on a graph. In this subsection, we prove that the DeepWalk algorithm with the Skip-gram and softmax model is actually factorizing a matrix \(\textbf{M}\) where each entry \(\textbf{M}_{ij}\) is the logarithm of the average probability that vertex \(v_i\) randomly walks to vertex \(v_j\) in a fixed number of steps. We will explain it later.

Since the Skip-gram model does not consider the offset of context vertices and predicts context vertices independently, we can regard the random walks as a set of vertex-context pairs. The useful information in random walks is the co-occurrence of vertex pairs inside a window. Given network \(G=(V, E)\), we suppose that the vertex-context set D is generated from random walks, where each element of D is a vertex-context pair \((v,c)\). Let V be the set of nodes, and \(V_C\) be the set of context nodes. In most cases, \(V=V_C\).

Consider a vertex-context pair \((v,c)\):

\(N_{(v,c)}\) denotes the number of times \((v,c)\) appears in D. \(N_{v}=\sum _{c'\in V_C}N_{(v,c')}\) and \(N_{c}=\sum _{v'\in V}N_{(v',c)}\) denote the number of times v and c appear in D, respectively. Note that \(|D|=\sum _{v'\in V}\sum _{c'\in V_C}N_{(v',c')}\).

A context vertex \(c\in V_C\) is represented by a d-dimensional vector \(\textbf{c}\in \mathbb {R}^d\) and \(\textbf{C}\) is a \(|V_C|\times d\) matrix, where row j is vector \(\textbf{c}_j\). Our goal is to figure out a matrix \(\textbf{M}=\textbf{V}\textbf{C}^{\top }\).

Perozzi et al. implemented the DeepWalk algorithm with the Skip-gram and Hierarchical Softmax model. Note that Hierarchical Softmax is a variant of softmax for speeding up training. In this subsection, we give proofs for both negative sampling and softmax with the Skip-gram model.

Negative sampling approximately maximizes the probability of softmax function by randomly choosing k negative samples from the context set. Levy and Goldberg showed that Skip-gram with the Negative Sampling model (SGNS) is implicitly factorizing a word-context matrix [69] by assuming that dimensionality d is sufficiently large. In other words, we can assign each product \(\textbf{v} \cdot \textbf{c}\) a value independent of the others.

In SGNS model, we have

$$\begin{aligned} P((v,c)\in D)={\text {Sigmoid}}(\textbf{v} \cdot \textbf{c})=\frac{1}{1+e^{-\textbf{v} \cdot \textbf{c}}}. \end{aligned}$$
(8.23)

Suppose we choose k negative samples for each vertex-context pair \((v,c)\) according to the distribution \(P_D(c_N)=\frac{N_{c_N}}{|D|}\). Then, the objective function for SGNS can be written as

$$\begin{aligned} \begin{aligned} \mathscr {O}&=\sum _{v\in V}\sum _{c\in V_C}N_{(v,c)}(\log {\text {Sigmoid}}(\textbf{v} \cdot \textbf{c})+k \mathbb {E}_{c_N\sim P_D}[\log {\text {Sigmoid}}(-\textbf{v} \cdot \textbf{c}_N)])\\&=\sum _{v\in V}\sum _{c\in V_C}N_{(v,c)}\log {\text {Sigmoid}}(\textbf{v} \cdot \textbf{c})+k\sum _{v\in V}N_{v} \sum _{c_N\in V_C}\frac{N_{c_N}}{|D|}\log {\text {Sigmoid}}(-\textbf{v} \cdot \textbf{c}_N)\\&=\sum _{v\in V}\sum _{c\in V_C} \left[ N_{(v,c)}\log {\text {Sigmoid}}(\textbf{v} \cdot \textbf{c})+k N_{v}\frac{N_{c}}{|D|} \log {\text {Sigmoid}}(-\textbf{v} \cdot \textbf{c})\right] . \end{aligned} \end{aligned}$$
(8.24)

Denote \(x=\textbf{v} \cdot \textbf{c}\). By solving \(\frac{\partial \mathscr {O}}{\partial x}=0\), we have

$$\begin{aligned} \textbf{v} \cdot \textbf{c} = x = \log \frac{N_{(v,c)} |D|}{N_v N_c} -\log k. \end{aligned}$$
(8.25)

Thus we have \(\textbf{M}_{ij}=\log \frac{\frac{N_{(v_i,c_j)}}{ |D|}}{\frac{N_{v_i}}{|D|} \frac{N_{c_j}}{|D|}} -\log k\). \(\textbf{M}_{ij}\) can be interpreted as the Pointwise Mutual Information (PMI) of vertex-context pair \((v_i,c_j)\) shifted by \(\log k\).
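The shifted PMI entries can be computed directly from co-occurrence counts. The sketch below also checks numerically that the value in Eq. 8.25 is a stationary point of the per-pair term of Eq. 8.24, using the derivatives of \(\log {\text {Sigmoid}}(x)\) and \(\log {\text {Sigmoid}}(-x)\). The counts and function names are invented for the example.

```python
import math
from collections import Counter

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def shifted_pmi(pair_counts, k):
    """M[(v, c)] = PMI(v, c) - log k for every observed pair (Eq. 8.25)."""
    total = sum(pair_counts.values())              # |D|
    n_v, n_c = Counter(), Counter()
    for (v, c), n in pair_counts.items():
        n_v[v] += n
        n_c[c] += n
    return {(v, c): math.log(n * total / (n_v[v] * n_c[c])) - math.log(k)
            for (v, c), n in pair_counts.items()}

def sgns_grad(x, n_vc, n_v, n_c, total, k):
    """Derivative w.r.t. x = v.c of the (v, c) term in the last line of Eq. 8.24."""
    return n_vc * sigmoid(-x) - k * n_v * n_c / total * sigmoid(x)

# Toy co-occurrence counts N_(v,c); |D| = 16.
counts = {("a", "x"): 4, ("a", "y"): 4, ("b", "x"): 8}

M = shifted_pmi(counts, k=1)
print(M[("a", "y")])      # log 2: the pair occurs more often than by chance

M5 = shifted_pmi(counts, k=5)
# For ("a", "y"): N_(v,c) = 4, N_v = 8, N_c = 4, |D| = 16.
print(sgns_grad(M5[("a", "y")], 4, 8, 4, 16, 5))   # approximately 0
```

Pairs that never co-occur have \(N_{(v,c)}=0\) and therefore a PMI of \(-\infty \); the sketch simply leaves them out of the dictionary.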

Since both negative sampling and hierarchical softmax are variants of softmax, we pay more attention to the softmax model and give a further discussion on it. We also assume that the values of \(\textbf{v} \cdot \textbf{c}\) are independent.

In softmax model,

$$\begin{aligned} P((v,c)\in D)=\frac{e^{\textbf{v} \cdot \textbf{c}}}{\sum _{c'\in V_C} e^{\textbf{v} \cdot \mathbf {c'}}}. \end{aligned}$$
(8.26)

And the objective function is

$$\begin{aligned} \mathscr {O}=\sum _{v\in V}\sum _{c\in V_C} N_{(v,c)} \log \frac{e^{\textbf{v} \cdot \textbf{c}}}{\sum _{c'\in V_C} e^{\textbf{v} \cdot \mathbf {c'}}}. \end{aligned}$$
(8.27)

After extracting all terms associated to \(\textbf{v} \cdot \textbf{c}\) as \(\mathscr {O}(v,c)\), we have

$$\begin{aligned} \mathscr {O}(v,c)=N_{(v,c)}\log \frac{e^{\textbf{v} \cdot \textbf{c}}}{\sum _{c'\in V_C, c'\ne c} e^{\textbf{v} \cdot \textbf{c}'}+e^{\textbf{v}\cdot \textbf{c}}} +\sum _{\tilde{c}\in V_C, \tilde{c}\ne c}N_{(v,\tilde{c})} \log \frac{e^{\textbf{v} \cdot \tilde{\textbf{c}}}}{\sum _{c'\in V_C, c'\ne c} e^{\textbf{v} \cdot \mathbf {c'}}+e^{\textbf{v} \cdot \textbf{c}}}. \end{aligned}$$
(8.28)

Note that \(\mathscr {O}=\frac{1}{|V_C|}\sum _{v\in V}\sum _{c\in V_C} \mathscr {O}(v,c)\). Denote \(x=\textbf{v} \cdot \textbf{c}\). By solving \(\frac{\partial \mathscr {O}}{\partial x}=0\) for all such x, we have

$$\begin{aligned} \textbf{v} \cdot \textbf{c} = x = \log \frac{N_{(v,c)}}{N_{v}} + b_v, \end{aligned}$$
(8.29)

where \(b_v\) can be any real constant since it will be canceled when we compute \(P((v,c)\in D)\). Thus, we have \(\textbf{M}_{ij}= \log \frac{N_{(v_i,c_j)}}{N_{v_i}} + b_{v_i}\). We will discuss what \(\textbf{M}_{ij}\) represents below.

It is clear that the method of sampling vertex-context pairs, i.e., random walks generation, will affect matrix \(\textbf{M}\). In this section, we will discuss \(\frac{N_{v}}{|D|}\), \(\frac{N_{c}}{|D|}\) and \(\frac{N_{(v,c)}}{N_{v}}\) based on an ideal sampling method for DeepWalk algorithm.

Assume the graph is connected and undirected, and the window size is w. The sampling algorithm is illustrated in Algorithm 8.3. We can easily generalize this sampling method to the directed graph by only adding \((RW_i, RW_j)\) into D.

[Algorithm 8.3: pseudocode of the vertex-context pair sampling procedure]
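One possible reading of this sampling procedure is sketched below: a single long random walk is generated, and every pair of vertices at most w steps apart is added to D in both orders for an undirected graph, or in walk order only for a directed graph. The walk length, function name, and toy graph are assumptions made for the illustration.

```python
import random
from collections import Counter

def sample_pairs(adj, walk_length, w, rng, directed=False):
    """Return a Counter D mapping (vertex, context) pairs to their counts."""
    walk = [rng.choice(sorted(adj))]             # random start vertex
    while len(walk) < walk_length:
        walk.append(rng.choice(adj[walk[-1]]))
    D = Counter()
    for i in range(len(walk)):
        for j in range(i + 1, min(i + w, len(walk) - 1) + 1):
            D[(walk[i], walk[j])] += 1           # (RW_i, RW_j)
            if not directed:
                D[(walk[j], walk[i])] += 1       # (RW_j, RW_i)
    return D

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}   # toy undirected graph
D = sample_pairs(adj, walk_length=1000, w=3, rng=random.Random(0))

size = sum(D.values())                           # |D|
n_v = Counter()
for (v, c), n in D.items():
    n_v[v] += n
# Empirical N_v / |D| for each vertex; for long walks on an undirected graph
# this frequency is expected to be roughly proportional to the vertex degree.
print({v: round(n_v[v] / size, 3) for v in sorted(adj)})
```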

Each appearance of vertex i will be recorded 2w times in D for an undirected graph and w times for a directed graph. Thus, \(\frac{N_{v_i}}{|D|}\) is the frequency with which \(v_i\) appears in the random walk, which is exactly the PageRank value of \(v_i\). Also note that \(\frac{N_{(v_i,v_j)}}{N_{v_i}/2w}\) is the expected number of times that \(v_j\) is observed among the left/right w neighbors of \(v_i\).

Denote the transition matrix in the PageRank algorithm by \(\textbf{P}\). More formally, let \(\text {deg}(v_i)\) be the degree of vertex i. \(\textbf{P}_{ij}=\frac{1}{\text {deg}(v_i)}\) if \((i,j)\in E\) and \(\textbf{P}_{ij}=0\) otherwise. We use \(\textbf{e}_i\) to denote a |V|-dimensional row vector, where all entries are zero except the ith entry, which is 1.

Suppose that we start a random walk from vertex i and use \(\textbf{e}_i\) to denote the initial state. Then \(\textbf{e}_i\textbf{P}\) is the distribution over all the vertices where the jth entry is the probability that vertex \(v_i\) walks to vertex \(v_j\) in one step. Hence, the jth entry of \(\textbf{e}_i\textbf{P}^w\) is the probability that vertex \(v_i\) walks to vertex \(v_j\) in exactly w steps. Thus \([\textbf{e}_i(\textbf{P}+\textbf{P}^2+\dots +\textbf{P}^w)]_j\) is the expected number of times that \(v_j\) appears among the right w neighbors of \(v_i\).

Hence

$$\begin{aligned} \begin{aligned} \frac{N_{(v_i,v_j)}}{N_{v_i}/2w}&=2[\textbf{e}_i(\textbf{P}+\textbf{P}^2+\dots +\textbf{P}^w)]_j,\\ \frac{N_{(v_i,v_j)}}{N_{v_i}}&=\frac{[\textbf{e}_i(\textbf{P}+\textbf{P}^2+\dots +\textbf{P}^w)]_j}{w}. \end{aligned} \end{aligned}$$
(8.30)

This equality also holds for a directed graph.

By setting \(b_{v_i}=\log 2w\) for all i, \(\textbf{M}_{ij}=\log \frac{N_{(v_i,v_j)}}{N_{v_i}/2w}\) is the logarithm of the expected number of times that \(v_j\) appears among the left/right w neighbors of \(v_i\).

By setting \(b_{v_i}=0\) for all i, \(\textbf{M}_{ij}=\log \frac{N_{(v_i,v_j)}}{N_{v_i}}=\log \frac{[\textbf{e}_i(\textbf{P}+\textbf{P}^2+\dots +\textbf{P}^w)]_j}{w}\) is the logarithm of the average probability that vertex \(v_i\) randomly walks to vertex \(v_j\) within w steps.
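For a small graph the matrix with \(b_{v_i}=0\) can be computed exactly from the transition matrix \(\textbf{P}\) of Eq. 8.30. The NumPy sketch below does so for a three-vertex path graph; the function name is illustrative, and entries that cannot be reached within w steps come out as \(-\infty \).

```python
import numpy as np

def deepwalk_matrix(A, w):
    """Return ((P + P^2 + ... + P^w) / w, its elementwise logarithm)."""
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)     # P_ij = 1/deg(v_i) if (i, j) in E
    power = np.eye(len(A))
    total = np.zeros_like(A)
    for _ in range(w):
        power = power @ P                    # P^1, P^2, ..., P^w in turn
        total += power
    avg = total / w                          # average transition probability
    with np.errstate(divide="ignore"):       # log(0) = -inf for unreachable pairs
        M = np.log(avg)
    return avg, M

# Path graph 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

avg1, M1 = deepwalk_matrix(A, w=1)
avg2, M2 = deepwalk_matrix(A, w=2)
print(avg2)   # every row equals [0.25, 0.5, 0.25]
print(M1)     # contains -inf wherever no edge exists
```

Going from w = 1 to w = 2 already fills in every zero entry of this tiny graph, which illustrates how the window size controls the density of the modeled matrix.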

8.2.2.2 Discussion

So far we have seen many different network representation learning algorithms, and we can identify some patterns that these methods share. We will then move forward and see how these patterns match some recent network embedding algorithms.

Most network representation algorithms try to reconstruct a data matrix generated from the graph with vertex embeddings. The simplest matrix would be the adjacency matrix. However, recovering the adjacency matrix may not be the best choice. First, real-world networks are mostly very sparse which means \(O(|E|)=O(|V|)\). Therefore, the adjacency matrix will be very sparse as well. Though the sparseness enables an efficient algorithm, it can harm the performance of vertex representation learning because of the deficiency of useful information. Second, the adjacency matrix may be noisy and sensitive. A single missing link can completely change the correlation between two vertices.

Hence people seek an alternative matrix to replace the adjacency matrix, though often implicitly. Taking DeepWalk as an example, according to the matrix factorization view of DeepWalk, it models the following matrix:

$$\begin{aligned} \textbf{M}=\textbf{P}+\textbf{P}^2+\dots +\textbf{P}^w, \end{aligned}$$
(8.31)

where

$$\begin{aligned} \textbf{P}_{ij}=\left\{ \begin{aligned} 1/\text {deg}(v_i)&\qquad \text {if}\, (v_i,v_j)\in E, \\ 0\qquad&\qquad \text {otherwise.} \end{aligned} \right. \end{aligned}$$
(8.32)

Compared with the adjacency matrix A, the matrix \(\textbf{M}\) modeled by DeepWalk is much denser. Furthermore, the window size parameter w can adjust the density: a larger window size models a denser matrix but will slow down the algorithm. Hence, the window size w works as a trade-off factor to balance efficiency and effectiveness. On the other hand, the matrix \(\textbf{M}\) can alleviate the noise in the adjacency matrix. Consider two similar vertices \(v_i\) and \(v_j\): even if the edge between them is missing, they can still have many co-occurrences by appearing inside the same window of the same random walks.

In a real-world application, direct computation of \(\textbf{M}\) may have a high time complexity when window size w grows. Thus, it is essential to choose a proper w. However, window size w is a discrete parameter, and thus the matrix \(\textbf{M}\) may change from too sparse to too dense when w changes by 1. Here, we can see another benefit of random walks. Random walks used by DeepWalk serve as Monte Carlo simulations for approximating matrix \(\textbf{M}\). The more random walks are sampled, the better the approximation of the matrix.

After we choose a matrix to model, we need to relate each matrix entry to a pair of vertex representations. There are two widely used measurements for vertex pairs: Euclidean distance and inner product. Assume that we want to model the entry \(\textbf{M}_{ij}\) given vertex representations \(\textbf{v}_i\) and \(\textbf{v}_j\); we can employ

$$\begin{aligned} \begin{aligned} \textbf{M}_{ij}&=f(\Vert \textbf{v}_i-\textbf{v}_j\Vert _2),\\ \textbf{M}_{ij}&=f(\textbf{v}_i\cdot \textbf{v}_j), \end{aligned} \end{aligned}$$
(8.33)

where function f can be any reasonable matching function, such as the sigmoid function or a linear function, for our purpose. In practice, the inner product \(\textbf{v}_i\cdot \textbf{v}_j\) is used more widely and corresponds to equivalent matrix factorization methods.

The next phase is to design a proper loss function between \(\textbf{M}_{ij}\) and \(f(\textbf{v}_i \cdot \textbf{v}_j)\). Several loss functions such as square loss and hinge loss can be employed. You can also design a generative model and maximize the likelihood of matrix \(\textbf{M}\).

The final step of a network representation learning algorithm would be parameter learning. The most frequently used parameter learning method would be Stochastic Gradient Descent (SGD). Other variants of SGD such as AdaGrad and AdaDelta can make the learning phase converge faster. In the next subsection, we will see some recent network representation learning algorithms which follow DeepWalk. We will find that their models can match all these phases above and have some innovations on building matrix \(\textbf{M}\), modifying function f, and changing loss function.
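The four phases above (choose a matrix, choose f, choose a loss, learn the parameters) can be put together in a minimal sketch. It assumes a linear f, the square loss, and plain SGD over single matrix entries; the separate context vectors, the toy matrix, and all hyperparameters are assumptions made for illustration and do not correspond to any specific published algorithm.

```python
import numpy as np

def sgd_factorize(M, dim=2, lr=0.05, steps=5000, seed=0):
    """Fit M[i, j] ~ v_i . c_j with the square loss and SGD over single entries."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    V = 0.1 * rng.standard_normal((n, dim))  # vertex representations
    C = 0.1 * rng.standard_normal((n, dim))  # context representations (an assumption)
    for _ in range(steps):
        i, j = rng.integers(n), rng.integers(n)
        err = M[i, j] - V[i] @ C[j]          # f is the identity (linear) here
        grad_v, grad_c = -2 * err * C[j], -2 * err * V[i]
        V[i] -= lr * grad_v
        C[j] -= lr * grad_c
    return V, C

def square_loss(M, V, C):
    """Sum of squared differences between M and the inner products."""
    return float(((M - V @ C.T) ** 2).sum())

# Hypothetical 4 x 4 target matrix.
M = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.5, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.5, 0.0]])

V0, C0 = sgd_factorize(M, steps=0)   # untrained initialization
V, C = sgd_factorize(M)              # after SGD
```

Comparing `square_loss(M, V0, C0)` with `square_loss(M, V, C)` shows how much of the matrix the low-dimensional inner products capture after training.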

8.2.3 Matrix Factorization Based Methods

We will focus on two network representation learning algorithms LINE and GraRep [13, 111] in this subsection. They both follow the framework introduced in the last subsection.

8.2.3.1 LINE

Tang et al. [111] proposed a network embedding model named LINE. The LINE algorithm can handle large-scale networks of arbitrary types: directed or undirected, and possibly weighted. To model the interaction between vertices, LINE models first-order proximity, which is represented by observed links, and second-order proximity, which is determined by shared neighbors rather than by direct links between vertices.

Before we introduce the details of the algorithm, we can move one step back and see how the idea works. The modeling of first-order proximity, i.e., observed links, is the modeling of the adjacency matrix. As we said in the last subsection, the adjacency matrix is usually too sparse. Hence the modeling of second-order proximity, i.e., vertices with shared neighbors, can serve as complement information to enrich the adjacency matrix and make it denser. The enumeration of all vertex pairs which have common neighbors is time consuming. Thus, it is necessary to design a sampling phase to handle large-scale networks. The sampling phase works like Monte Carlo simulation to approximate the ideal matrix.

Now only two questions remain: how to define first-order and second-order proximity, and how to define the loss function. In other words, we need to define \(\textbf{M}\) and the loss function.

First-order proximity between vertices u and v is defined as the weight \(w_{uv}\) on edge (u, v). If there is no edge between vertices u and v, then the first-order proximity between them is 0.

Second-order proximity between vertices u and v is defined as the similarity between their neighborhood structures. Let \(p_u=(w_{u,1},\dots ,w_{u,|V|})\) denote the first-order proximity between vertex u and all other vertices. Then the second-order proximity between u and v is defined as the similarity of \(p_u\) and \(p_v\). If they have no shared neighbors, then the second-order proximity is zero.

Then we can introduce LINE model more specifically. The joint probability between \(v_i\) and \(v_j\) is

$$\begin{aligned} p_1(v_i,v_j)=\frac{1}{1+\exp (-\textbf{v}_i\cdot \textbf{v}_j)}, \end{aligned}$$
(8.34)

where \(\textbf{v}_i\) and \(\textbf{v}_j\) are d-dimensional row vectors which indicate the representations of vertex \(v_i\) and \(v_j\).

To supervise the probabilities, the empirical probability is defined as \(\hat{p}_1(i,j)=\frac{w_{ij}}{W}\), where \(W=\sum _{(v_i,v_j)\in E}w_{ij}\). Thus our goal is to find vertex embeddings that approximate \(\frac{w_{ij}}{W}\) with \(\frac{1}{1+\exp (-\textbf{v}_i\cdot \textbf{v}_j)}\). Following the idea in the last subsection, setting the two quantities equal and solving for the inner product gives \(\textbf{v}_i\cdot \textbf{v}_j=\textbf{M}_{ij}=-\log (\frac{W}{w_{ij}}-1)\).
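As a quick numerical check of this equivalence, the following sketch computes the target inner product \(-\log (\frac{W}{w_{ij}}-1)\) for a few edges and plugs it back into the sigmoid. The edge weights are hypothetical, and the check only covers observed edges with \(w_{ij}<W\).

```python
import math

# Hypothetical weighted edges (i, j) -> w_ij.
weights = {(0, 1): 3.0, (0, 2): 1.0, (1, 2): 2.0}
W = sum(weights.values())

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Target inner product for each observed edge, obtained by inverting the sigmoid.
target = {edge: -math.log(W / w - 1.0) for edge, w in weights.items()}

# Plugging the target back into p_1 recovers the empirical probability w_ij / W.
recovered = {edge: sigmoid(m) for edge, m in target.items()}
```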

The loss function between joint probability \(p_1\) and its empirical probability \(\hat{p}_1\) is

$$\begin{aligned} \mathscr {L}_1=D_{\text {KL}}(\hat{p}_1\left| \right| p_1), \end{aligned}$$
(8.35)

where \(D_{\text {KL}}(\cdot \left| \right| \cdot )\) is KL-divergence of two probability distributions.

On the other hand, we define the probability that vertex \(v_j\) appears in \(v_i\)’s context:

$$\begin{aligned} p_2(v_j|v_i)=\frac{\exp (\textbf{c}_j\cdot \textbf{v}_i)}{\sum _{k=1}^{|V|}\exp (\textbf{c}_k\cdot \textbf{v}_i)}. \end{aligned}$$
(8.36)

Similarly, the empirical probability is defined as \(\hat{p}_2(v_j|v_i)=\frac{w_{ij}}{d_i}\) where \(d_i=\sum _k w_{ik}\) and the loss function is

$$\begin{aligned} \mathscr {L}_2=\sum _i d_i D_{\text {KL}}(\hat{p}_2(\cdot |v_i)\left| \right| p_2(\cdot |v_i)). \end{aligned}$$
(8.37)
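The second-order loss can be evaluated directly for a small weighted network. The sketch below assumes a dense weight matrix in which every vertex has at least one neighbor, uses the convention \(0\log 0=0\), and draws random vertex and context embeddings purely for illustration; actual implementations rely on sampling rather than on the full softmax.

```python
import numpy as np

def second_order_loss(weights, V, C):
    """sum_i d_i * KL(p_hat_2(.|v_i) || p_2(.|v_i)) for a dense weight matrix."""
    logits = V @ C.T                              # logits[i, j] = c_j . v_i
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the softmax
    log_p2 = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    d = weights.sum(axis=1, keepdims=True)        # d_i = sum_k w_ik
    p_hat = weights / d                           # empirical conditional distribution
    mask = weights > 0                            # convention: 0 * log 0 = 0
    kl_terms = np.zeros_like(weights)
    kl_terms[mask] = p_hat[mask] * (np.log(p_hat[mask]) - log_p2[mask])
    return float((d * kl_terms).sum()), log_p2

# Hypothetical weighted network with three vertices.
weights = np.array([[0.0, 2.0, 1.0],
                    [2.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0]])
rng = np.random.default_rng(0)
V, C = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))
loss, log_p2 = second_order_loss(weights, V, C)
```

Up to a constant that does not depend on the embeddings, this loss equals \(-\sum _{(v_i,v_j)\in E}w_{ij}\log p_2(v_j|v_i)\), which is the form usually optimized in practice.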

The first-order and second-order proximity embeddings are trained separately, and we concatenate the embeddings together after the training phase as vertex representations.

8.2.3.2 GraRep

Now we turn to another network representation learning algorithm, GraRep, which directly follows the proof of the matrix factorization form of DeepWalk. Recall that we proved that DeepWalk actually factorizes a matrix \(\textbf{M}\) where \(\textbf{M}=\log \frac{A+A^2+\dots +A^w}{w}\). The GraRep algorithm can be divided into three steps:

  • Get k-step transition probability matrix \(A^k\) for each \(k=1,2,\dots ,K\).

  • Get each k-step representation.

  • Concatenate all k-step representations.

GraRep uses a simple idea, i.e., SVD on \(A^k\), in the second step to get embeddings. As K gets large, the matrix \(\textbf{M}\) gets denser and thus yields a better representation. However, this algorithm is not very efficient, especially when K gets large.
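The three steps can be sketched as follows. This is a simplified illustration of the description above, applying SVD directly to the k-step transition matrices; the published GraRep algorithm applies further transformations to each matrix before the SVD, which are omitted here. The function name, the toy graph, and the dimensions are hypothetical.

```python
import numpy as np

def grarep_sketch(adj, K=3, dim=2):
    """Concatenate rank-`dim` SVD embeddings of the k-step transition matrices, k = 1..K."""
    P = adj / adj.sum(axis=1, keepdims=True)         # one-step transition matrix
    power, parts = np.eye(len(P)), []
    for _ in range(K):
        power = power @ P                            # step 1: k-step transition matrix
        U, S, Vt = np.linalg.svd(power)
        parts.append(U[:, :dim] * np.sqrt(S[:dim]))  # step 2: k-step representation
    return np.hstack(parts)                          # step 3: concatenation

# Hypothetical toy graph: an undirected path 0 - 1 - 2 - 3 - 4.
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
embeddings = grarep_sketch(adj, K=3, dim=2)          # one row of length K * dim per vertex
```

The repeated dense matrix products and full SVDs in this sketch also make the efficiency concern for large K concrete.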

8.2.4 Structural Deep Network Methods

Unlike previous methods that use a shallow neural network model to learn the network representations, Structural Deep Network Embedding (SDNE) [125] employs a deeper neural model to capture the nonlinearity between vertex embeddings. As shown in Fig. 8.2, the whole model can be divided into two parts: (1) the first part is supervised by Laplacian Eigenmaps and models the first-order proximity; (2) the second part is an unsupervised deep neural autoencoder that characterizes the second-order proximity. Finally, the algorithm takes the intermediate layer, which is used for the supervised part, as the network representation.

First, we give a brief introduction to the deep neural autoencoder. A neural autoencoder requires that the output vector be as similar as possible to the input vector. Generally speaking, the output cannot be identical to the input vector because the dimension of the intermediate layers of the autoencoder is much smaller than that of the input and output layers. That is to say, a deep autoencoder first compresses the input into a low-dimensional intermediate vector and then tries to reconstruct the original input vector from the low-dimensional intermediate vector. Once the deep autoencoder is trained, the intermediate layer can be regarded as a good low-dimensional representation of the original inputs, since the input vector can be approximately recovered from it.

More formally, we assume the input vector is \(\textbf{x}_i\). Then the hidden representation of each layer is defined as

$$\begin{aligned} \begin{aligned} \textbf{y}_i^{(1)}&={\text {Sigmoid}}(\textbf{W}^{(1)}\textbf{x}_i+\textbf{b}^{(1)}),\\ \textbf{y}_i^{(k)}&={\text {Sigmoid}}(\textbf{W}^{(k)}\textbf{y}_i^{(k-1)}+\textbf{b}^{(k)}),k=2,3\dots , \end{aligned} \end{aligned}$$
(8.38)

where \(\textbf{W}^{(k)}\) and \(\textbf{b}^{(k)}\) are the weight matrix and bias vector of the kth layer. We assume that the hidden representation of the Kth layer has the minimum dimension. After obtaining \(\textbf{y}_i^{(K)}\), we can get the output \(\hat{\textbf{x}}_i\) by reversing the calculation process. Then the optimization objective of the autoencoder is to minimize the difference between the input vector \(\textbf{x}_i\) and the output vector \(\hat{\textbf{x}}_i\):

$$\begin{aligned} \mathscr {L}(\textbf{W},\textbf{b})=\sum _{i=1}^{n}\Vert \hat{\textbf{x}}_i-\textbf{x}_i\Vert ^2, \end{aligned}$$
(8.39)

where n is the number of input instances.

Fig. 8.2 The architecture of structural deep network embedding model

Back to the network representation problem, SDNE applies the autoencoder to every vertex. The input vector \(\textbf{x}_i\) of each vertex \(v_i\) is defined as follows: if vertices \(v_i\) and \(v_j\) are connected, then the jth entry \(\textbf{x}_{ij}>0\); otherwise \(\textbf{x}_{ij}=0\). For an unweighted graph, \(\textbf{x}_{ij}=1\) if \((v_i,v_j)\in E\). Then the intermediate layer \(\textbf{y}_i^{(K)}\) can be seen as the low-dimensional representation of vertex \(v_i\). Also note that input vectors contain many more zero entries than positive entries due to the sparsity of real-world networks. The loss on positive entries should therefore be emphasized, and the final optimization objective for second-order proximity modeling can be written as

$$\begin{aligned} \mathscr {L}_{2nd}=\sum _{i=1}^{|V|}\Vert (\hat{\textbf{x}}_i-\textbf{x}_i)\odot \textbf{b}_i\Vert ^2, \end{aligned}$$
(8.40)

where \(\odot \) denotes element-wise multiplication and \(\textbf{b}_{ij}=1\) if \(\textbf{x}_{ij}=0\) while \(\textbf{b}_{ij}=\beta >1\) if \(\textbf{x}_{ij}>0\).

We have introduced the unsupervised part modeled by deep autoencoder. Now we turn to the supervised part. The supervised part simply requires that the representation of connected vertices should be close to each other. Thus, the loss function of this part is

$$\begin{aligned} \mathscr {L}_{1st}=\sum _{i,j=1}^{|V|}\textbf{x}_{ij}\Vert \textbf{y}_i^{(K)}-\textbf{y}_j^{(K)}\Vert ^2. \end{aligned}$$
(8.41)

Finally, the overall loss function, including the regularization term, is

$$\begin{aligned} \mathscr {L}=\mathscr {L}_{2nd}+\alpha \mathscr {L}_{1st}+\lambda \mathscr {L}_{reg}, \end{aligned}$$
(8.42)

where \(\alpha \) and \(\lambda \) are hyperparameters that balance the loss terms, and the regularization loss \(\mathscr {L}_{reg}\) is the sum of the squares of all parameters. The model can be optimized by back-propagation, as in a standard neural network. After the training process, \(\textbf{y}_i^{(K)}\) is taken as the representation of vertex \(v_i\).
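The three loss terms can be computed for a toy network as in the sketch below. It assumes the smallest possible autoencoder, with one encoding and one decoding layer (K = 1), randomly initialized parameters, and hypothetical values for \(\alpha \), \(\beta \), and \(\lambda \); it evaluates the objective only and leaves out the back-propagation step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sdne_losses(X, W_enc, b_enc, W_dec, b_dec, beta=5.0):
    """Return (L_2nd, L_1st, L_reg) for a one-hidden-layer autoencoder (K = 1)."""
    Y = sigmoid(X @ W_enc.T + b_enc)            # rows are the hidden vectors y_i^{(K)}
    X_hat = sigmoid(Y @ W_dec.T + b_dec)        # reconstruction of each input x_i
    B = np.where(X > 0, beta, 1.0)              # b_ij = beta on observed links, 1 elsewhere
    loss_2nd = float((((X_hat - X) * B) ** 2).sum())
    diff = Y[:, None, :] - Y[None, :, :]        # all pairwise differences y_i - y_j
    loss_1st = float((X * (diff ** 2).sum(axis=2)).sum())
    loss_reg = float(sum((p ** 2).sum() for p in (W_enc, b_enc, W_dec, b_dec)))
    return loss_2nd, loss_1st, loss_reg

# Hypothetical unweighted network with four vertices; row i is the input vector x_i.
X = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
n, d = 4, 2
W_enc, b_enc = rng.standard_normal((d, n)), np.zeros(d)
W_dec, b_dec = rng.standard_normal((n, d)), np.zeros(n)

l2nd, l1st, lreg = sdne_losses(X, W_enc, b_enc, W_dec, b_dec, beta=5.0)
alpha, lam = 0.1, 0.01                           # hypothetical hyperparameter values
total = l2nd + alpha * l1st + lam * lreg
```

Setting `beta=1.0` reduces the second-order term to the plain reconstruction error of Eq. 8.39, which makes the effect of the extra weight on observed links easy to inspect.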

8.2.5 Extensions

8.2.5.1 Network Representation with Internal Information

Asymmetric Transitivity Preserving Network Representation. Existing network representation learning algorithms mostly focus on undirected graphs. Most of these methods cannot handle directed graphs well because they do not accurately characterize the asymmetric property. High-Order Proximity preserved Embedding (HOPE) [89] is proposed to preserve high-order proximities of large-scale graphs and capture the asymmetric transitivity. The algorithm further derives a general formulation that covers multiple popular high-order proximity measurements and provides an approximate algorithm with an upper bound on the RMSE (Root Mean Squared Error).

The method assumes that the more and the shorter the paths from \(v_i\) to \(v_j\), the more similar their representation vectors should be. In particular, the algorithm assigns two vectors, i.e., a source vector and a target vector, to each vertex. We denote the adjacency matrix as A and the vertex representations as \(\textbf{U}=[\textbf{U}^s,\textbf{U}^t]\), where \(\textbf{U}^s\in \mathbb {R}^{|V|\times d}\) and \(\textbf{U}^t\in \mathbb {R}^{|V|\times d}\) are source and target vertex embeddings, respectively. We define a high-order proximity matrix as \(\textbf{S}\), where \(\textbf{S}_{ij}\) is the proximity between \(v_i\) and \(v_j\). Then our goal is to approximate the matrix \(\textbf{S}\) with the product of \(\textbf{U}^s\) and \({\textbf{U}^t}^{\top }\). The optimization objective can be written as

$$\begin{aligned} \min _{\textbf{U}^s,\textbf{U}^t} \Vert \textbf{S}-\textbf{U}^s{\textbf{U}^t}^{\top }\Vert _F^2. \end{aligned}$$
(8.43)

Many high-order proximity measurements which characterize the asymmetric transitivity share a general formulation which can be used for the approximation of the proximities:

$$\begin{aligned} \textbf{S}=\textbf{M}_g^{-1}\textbf{M}_l, \end{aligned}$$
(8.44)

where \(\textbf{M}_g\) and \(\textbf{M}_l\) are both polynomials of matrices. Now we will take three commonly used high-order proximity measurements to illustrate the formula.

  • Katz Index The Katz Index is a weighted summation over the set of paths between two vertices. It can be computed recursively:

    $$\begin{aligned} \textbf{S}:=\beta A \textbf{S}+\beta A, \end{aligned}$$
    (8.45)

    where the decay parameter \(\beta \) represents how fast the weight decreases when the length of paths grows.

  • Rooted PageRank For rooted PageRank, \(\textbf{S}_{ij}\) is the probability that a random walk starting from vertex \(v_i\) is located at \(v_j\) in the stationary state. The formula can be written as

    $$\begin{aligned} \textbf{S}:=\alpha \textbf{S} \textbf{P}+(1-\alpha )\textbf{I}, \end{aligned}$$
    (8.46)

    where \(\alpha \) is the probability that a random walk moves on to a neighbor, \(1-\alpha \) is the probability that it returns to its start point, and \(\textbf{P}\) is the transition matrix.

  • Common Neighbors \(\textbf{S}_{ij}\) is the number of vertices that are both the target of an edge from \(v_i\) and the source of an edge to \(v_j\). The matrix \(\textbf{S}\) can be expressed as

    $$\begin{aligned} \textbf{S}=A^2. \end{aligned}$$
    (8.47)

For the three high-order proximity measurements introduced above, we summarize their equivalent form \(\textbf{S}=\textbf{M}_g^{-1}\textbf{M}_l\) in the following table (Table 8.3).

Table 8.3 General formula for high-order proximity measurements
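
The entries of such a table follow from Eqs. 8.45 to 8.47. As a short derivation (reconstructed here from those equations, not copied from the original table, and with \(\textbf{I}\) denoting the identity matrix), collecting the terms that contain \(\textbf{S}\) on the left-hand side gives

$$\begin{aligned} \begin{aligned} \text {Katz Index:}\quad&(\textbf{I}-\beta A)\textbf{S}=\beta A \ \Rightarrow \ \textbf{M}_g=\textbf{I}-\beta A,\ \textbf{M}_l=\beta A,\\ \text {Rooted PageRank:}\quad&\textbf{S}(\textbf{I}-\alpha \textbf{P})=(1-\alpha )\textbf{I} \ \Rightarrow \ \textbf{M}_g=\textbf{I}-\alpha \textbf{P},\ \textbf{M}_l=(1-\alpha )\textbf{I},\\ \text {Common Neighbors:}\quad&\textbf{S}=A^2 \ \Rightarrow \ \textbf{M}_g=\textbf{I},\ \textbf{M}_l=A^2. \end{aligned} \end{aligned}$$

For rooted PageRank, the order of the two factors can be exchanged because \((1-\alpha )\textbf{I}\) commutes with every matrix, so \(\textbf{S}=(1-\alpha )(\textbf{I}-\alpha \textbf{P})^{-1}=\textbf{M}_g^{-1}\textbf{M}_l\).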

A simple way to approximate \(\textbf{S}\) with a product of matrices is SVD. However, the direct computation of the SVD of matrix \(\textbf{S}\) has a complexity of \(O(|V|^3)\). By writing matrix \(\textbf{S}\) as \(\textbf{M}_g^{-1}\textbf{M}_l\), we do not need to compute matrix \(\textbf{S}\) directly. Instead, we can apply JDGSVD to \(\textbf{M}_g\) and \(\textbf{M}_l\) independently and then use the results to derive the decomposition of \(\textbf{S}\). The complexity reduces to \(O(|E|d^2)\) for each iteration of JDGSVD.
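For a small graph, the simple SVD route can be written out directly. The sketch below computes the Katz proximity in closed form and obtains source and target embeddings from a truncated SVD; it deliberately uses the direct \(O(|V|^3)\) computation, not JDGSVD, and the toy graph, \(\beta \), and the embedding dimension are hypothetical. The decay parameter must be smaller than the inverse of the spectral radius of A for the Katz series to converge.

```python
import numpy as np

def katz_proximity(A, beta):
    """Closed form of S = beta * A @ S + beta * A, i.e. S = (I - beta * A)^{-1} (beta * A)."""
    identity = np.eye(len(A))
    return np.linalg.solve(identity - beta * A, beta * A)

def svd_embeddings(S, dim):
    """Source/target embeddings whose product is the rank-`dim` truncated SVD of S."""
    U, sigma, Vt = np.linalg.svd(S)
    root = np.sqrt(sigma[:dim])
    return U[:, :dim] * root, Vt[:dim].T * root, sigma

# Hypothetical directed graph with four vertices.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
beta = 0.1
S = katz_proximity(A, beta)
U_s, U_t, sigma = svd_embeddings(S, dim=2)
approx_error = np.linalg.norm(S - U_s @ U_t.T) ** 2   # squared Frobenius error
```

Because the graph is directed, \(\textbf{S}\) is not symmetric, and the source and target embeddings of a vertex differ.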

Community Preserving Network Representation. While previous methods aim at preserving the microscopic structure of a network, such as first- and second-order proximities, Wang et al. [127] proposed Modularized Nonnegative Matrix Factorization (M-NMF), which encodes the mesoscopic community structure information into the network representations. The basic idea is to include the modularity in the optimization objective. Recall that the modularity is formulated in Eq. 8.15 and \(\textbf{S}\) is the community indicator matrix. Then the loss function of the modularity part is \(-\text {tr}(\textbf{S}^{\top }\textbf{B}\textbf{S})\), which is to be minimized.

Similar to previous methods, M-NMF also factorizes an affinity matrix which encodes first-order and second-order proximities. Specifically, M-NMF takes adjacency matrix A as the first-order proximity matrix \(A_1\) and computes the cosine similarity of corresponding rows of adjacency matrix A as the second-order proximity matrix \(A_2\). M-NMF uses a mixture of \(A_1\) and \(A_2\) as the similarity matrix. To conclude, the overall optimization function of M-NMF is

$$\begin{aligned} \min _{\textbf{M},\textbf{U},\textbf{S},\textbf{C}} \left\| A_1+\eta A_2-\textbf{M}\textbf{U}^{\top }\right\| _F^2+\alpha \left\| \textbf{S}-\textbf{U}\textbf{C}^{\top }\right\| _F^2-\beta \text {tr}(\textbf{S}^{\top }\textbf{B}\textbf{S}), \end{aligned}$$
(8.48)

where \(\textbf{S}\in \mathbb {R}^{|V|\times k},\textbf{M},\textbf{U}{\in } \mathbb {R}^{|V|\times m},\textbf{C}{\in } \mathbb {R}^{k\times m}, \textbf{M}_{ij},\textbf{U}_{ij},\textbf{S}_{ij},\textbf{C}_{ij}\ge 0, \forall i\forall j,\text {tr}(\textbf{S}^{\top }\textbf{S})=|V|\) and \(\alpha ,\beta ,\eta >0\) are hyperparameters that balance the terms. The subscript F denotes the Frobenius norm. Here the similarity matrix \(A_1+\eta A_2\) is factorized into two nonnegative matrices \(\textbf{M}\) and \(\textbf{U}\). Then the community representation matrix \(\textbf{C}\) in the second term bridges the matrix factorization part and the modularity part.

A concurrent algorithm, Community-enhanced NRL (CNRL) [116, 117], is a pipeline algorithm that first learns the node-community assignment and then modifies the DeepWalk algorithm to incorporate community information. Specifically, in the first phase, CNRL draws an analogy between community detection and topic modeling. It generates random walks and feeds these vertex sequences into the Latent Dirichlet Allocation (LDA) algorithm. By treating a vertex as a word and a community as a topic, CNRL obtains a soft assignment of vertex-community membership. Then, in the second phase, both the embedding of a center node and the embedding of its community are used to predict the neighboring vertices in the random walk sequences. The model is illustrated in Fig. 8.3.

Fig. 8.3 The architecture of community preserving network embedding model

8.2.5.2 Network Representation with External Information

Network Representation with Text Information. We now present Text-Associated DeepWalk (TADW) [136], which generalizes the matrix factorization framework to incorporate text features of vertices into network representation learning. The matrix factorization view of DeepWalk makes it possible to introduce text information directly into the factorization. Figure 8.4 shows the main idea of TADW: factorize the vertex affinity matrix \(\textbf{M}\in \mathbb {R}^{|V|\times |V|}\) into the product of three matrices: \(\textbf{W}\in \mathbb {R}^{k\times |V|}\), \(\textbf{H}\in \mathbb {R}^{k\times f_t}\), and text features \(\textbf{T} \in \mathbb {R}^{f_t\times |V|}\). Then TADW concatenates \(\textbf{W}\) and \(\textbf{H}\textbf{T}\) as 2k-dimensional representations of vertices.

Fig. 8.4 The architecture of text-associated DeepWalk model

Then the question is how to build the vertex affinity matrix \(\textbf{M}\) and how to extract the text feature matrix \(\textbf{T}\) from the text information. Following the proof of the matrix factorization form of DeepWalk, TADW chooses the vertex affinity matrix as a tradeoff between speed and accuracy: it factorizes the matrix \(\textbf{M} = (A+A^2)/2\), where A is the row-normalized adjacency matrix. For the text feature matrix \(\textbf{T}\), TADW first constructs the TF-IDF matrix from the text and then reduces the dimension of the TF-IDF matrix to 200 via SVD.

Formally, the model of TADW minimizes the following optimization function:

$$\begin{aligned} \min _{\textbf{W},\textbf{H}} \Vert \textbf{M}-\textbf{W}^{\top }\textbf{H}\textbf{T}\Vert _F^2+\frac{\lambda }{2}(\Vert \textbf{W}\Vert _F^2+\Vert \textbf{H}\Vert _F^2), \end{aligned}$$
(8.49)

where \(\lambda \) is the regularization factor. The parameters are optimized by iteratively updating \(\textbf{W}\) and \(\textbf{H}\) via conjugate gradient descent.
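To make the matrix shapes concrete, the following sketch builds \(\textbf{M}=(A+A^2)/2\) for a toy graph, evaluates the objective of Eq. 8.49, and forms the concatenated 2k-dimensional representations. The text feature matrix is random here (the TF-IDF and SVD preprocessing is skipped), the parameters are not trained, and all sizes and names are hypothetical.

```python
import numpy as np

def tadw_objective(M, W, H, T, lam):
    """Value of Eq. 8.49 for given parameters W (k x |V|) and H (k x f_t)."""
    residual = M - W.T @ H @ T
    return float((residual ** 2).sum() + 0.5 * lam * ((W ** 2).sum() + (H ** 2).sum()))

def tadw_representation(W, H, T):
    """Concatenate W and HT per vertex: one 2k-dimensional row for each vertex."""
    return np.hstack([W.T, (H @ T).T])

# Hypothetical graph with four vertices.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
A = adj / adj.sum(axis=1, keepdims=True)   # row-normalized adjacency matrix
M = (A + A @ A) / 2

rng = np.random.default_rng(0)
n, k, f_t = 4, 2, 3
T = rng.standard_normal((f_t, n))          # stand-in for the reduced TF-IDF features
W = 0.1 * rng.standard_normal((k, n))
H = 0.1 * rng.standard_normal((k, f_t))

value = tadw_objective(M, W, H, T, lam=0.2)
representation = tadw_representation(W, H, T)
```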

TransNet. Most existing NRL methods neglect the semantic information of edges and simplify the edge as a binary or continuous value. TransNet algorithm [119] considers the label information on the edges instead of nodes. In particular, TransNet is based on translation mechanism shown in Fig. 8.5.

Fig. 8.5 The architecture of TransNet model

In the setting of TransNet, each edge carries a number of binary labels. The loss function of TransNet then consists of two parts: one part is the translation loss, which measures the distance between \(\textbf{u}+\textbf{e}\) and \(\textbf{v}\), where \(\textbf{u},\textbf{e},\textbf{v}\) stand for the embeddings of the head vertex, the edge, and the tail vertex; the other part is the reconstruction loss of the autoencoder, which encodes the labels of an edge into its embedding \(\textbf{e}\) and restores the labels from the embedding. After the learning phase, we can compute the embedding of an unobserved edge by subtracting the embeddings of its two vertices and use the decoder part of the autoencoder to predict its labels.

Semi-supervised Network Representation. In this part, we introduce several semi-supervised network representation learning methods that are applied to heterogeneous networks. All methods learn vertex embeddings and their classification labels simultaneously.

(1) LSHM The first algorithm, LSHM (Latent Space Heterogeneous Model) [52], follows the manifold assumption, which assumes that two connected nodes tend to have similar node embeddings. Thus, the regularization loss which forces connected nodes to have similar representations can be formulated as

$$\begin{aligned} \sum _{i,j}w_{ij}\Vert \textbf{v}_i-\textbf{v}_j\Vert ^2, \end{aligned}$$
(8.50)

where \(w_{ij}\) is the weight of edge \((v_i,v_j)\).

As a semi-supervised representation learning algorithm, LSHM also needs to predict the classification labels for unlabeled vertices. To train the classifiers, LSHM computes the loss of observed labels as

$$\begin{aligned} \sum _{i}\varDelta (f_\theta (\textbf{v}_i),y_i), \end{aligned}$$
(8.51)

where \(f_\theta (\textbf{v}_i)\) is the predicted label for vertex \(v_i\), \(y_i\) is the observed label for \(v_i\) and \(\varDelta (\cdot ,\cdot )\) is the loss function between predicted label and ground truth label. Specifically, \(f_\theta (\cdot )\) is a linear function and \(\varDelta (\cdot ,\cdot )\) is set to hinge loss.

Finally, the objective function is

$$\begin{aligned} \mathscr {L}(\textbf{V},\theta )=\sum _{i}\varDelta (f_\theta (\textbf{v}_i),y_i)+\lambda \sum _{i,j}w_{ij}\Vert \textbf{v}_i-\textbf{v}_j\Vert ^2, \end{aligned}$$
(8.52)

where \(\lambda \) is a hyperparameter that balances the two terms. The algorithm is optimized via stochastic gradient descent.

(2) node2vec Node2vec [38] modifies DeepWalk by changing the generation of random walks. As shown in previous subsections, DeepWalk generates rooted random walks by choosing the next vertex according to a uniform distribution, which could be improved by using a well-designed random walk generation strategy.

Node2vec first considers two extreme cases of vertex visiting sequences: Breadth-First Search (BFS) and Depth-First Search (DFS). By restricting the search to nearby nodes, BFS characterizes the nearby neighborhoods of center vertices and obtains a microscopic view of the neighborhood of every node. Vertices in the sampled neighborhoods of BFS tend to repeat many times, which can reduce the variance in characterizing the distribution of neighboring vertices of the source node. In contrast, the sampled nodes in DFS reflect a macro-view of the neighborhood which is essential in inferring communities based on homophily.

Node2vec designs a neighborhood sampling strategy which can smoothly interpolate between BFS and DFS. More specifically, consider a random walk that has just traversed edge (t, v) and now stays at vertex v. The walk evaluates the transition probabilities of each edge (v, x) to decide the next step. Node2vec sets the unnormalized transition probability to \(\pi _{vx}=\alpha _{pq}(t,x)\cdot w_{vx}\), where

$$\begin{aligned} \alpha _{pq}(t,x) = \left\{ \begin{array}{lcl} {\frac{1}{p}} &{}\text {if} &{}d_{tx}=0, \\ {1} &{}\text {if} &{}d_{tx}=1, \\ {\frac{1}{q}} &{}\text {if} &{}d_{tx}=2,\\ \end{array} \right. \end{aligned}$$
(8.53)

and \(d_{tx}\) denotes the shortest path distance between vertices t and x. The parameters p and q guide the random walk and control how fast the walk explores and leaves the neighborhood of the starting vertex. A low p increases the probability of revisiting a vertex and makes the random walk focus on local neighborhoods, while a low q encourages the random walk to explore vertices farther away. After the generation of the random walks, the rest of the algorithm is almost the same as that of DeepWalk.
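The biased transition rule can be sketched in a few lines. The graph below, its dictionary representation, and the values of p and q are hypothetical; the sketch only computes the normalized probabilities for one step and does not generate full walks.

```python
def alpha(p, q, graph, t, x):
    """Search bias alpha_pq(t, x) for a walk that moved t -> v and now considers v -> x."""
    if x == t:              # d_tx = 0: step back to the previous vertex
        return 1.0 / p
    if x in graph[t]:       # d_tx = 1: x is also a neighbor of t
        return 1.0
    return 1.0 / q          # d_tx = 2: x moves away from t

def transition_probs(p, q, graph, t, v):
    """Normalized probabilities of the next vertex x, given the last edge (t, v)."""
    scores = {x: alpha(p, q, graph, t, x) * w for x, w in graph[v].items()}
    total = sum(scores.values())
    return {x: s / total for x, s in scores.items()}

# Hypothetical unweighted graph stored as {vertex: {neighbor: weight}}.
graph = {
    "t": {"v": 1.0, "a": 1.0},
    "v": {"t": 1.0, "a": 1.0, "b": 1.0},
    "a": {"t": 1.0, "v": 1.0},
    "b": {"v": 1.0},
}
probs = transition_probs(p=2.0, q=0.5, graph=graph, t="t", v="v")
```

With p = 2 and q = 0.5, the unnormalized scores for returning to t, moving to the shared neighbor a, and moving outward to b are 0.5, 1, and 2, so the walk is most likely to move away from t.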

(3) MMDW Max-Margin DeepWalk (MMDW) [118] utilizes the max-margin strategy of SVM to generalize the DeepWalk algorithm to semi-supervised learning. Specifically, MMDW employs the matrix factorization form of DeepWalk proved in TADW [136] and further adds the max-margin constraint, which requires that the embeddings of nodes with different labels be far from each other. The optimization function can be written as

$$\begin{aligned} \begin{aligned} \min _{\textbf{X},\textbf{Y},\textbf{W},\xi }\mathscr {L}= \min _{\textbf{X},\textbf{Y},\textbf{W},\xi } \mathscr {L}_{DW}&+\frac{1}{2}\Vert \textbf{W}\Vert ^2+C\sum _{i=1}^T \xi _i,\\ s.t.\ w_{l_i}^{\top } x_i-w_j^{\top }x_i&\ge e_i^j-\xi _i, \forall i,j, \end{aligned} \end{aligned}$$
(8.54)

where \(\textbf{W}=[w_1,w_2,\dots ,w_m]^{\top }\) is the weight matrix of the SVM, \(\xi _i\) are the slack variables, \(e_i^j=1\) if \(l_i\ne j\) and \(e_i^j=0\) otherwise, and \(\mathscr {L}_{DW}\) is the matrix factorization form of the DeepWalk loss function:

$$\begin{aligned} \mathscr {L}_{DW}=\Vert \textbf{M}-\textbf{X}^{\top }\textbf{Y}\Vert _2^2+\frac{\lambda }{2}(\Vert \textbf{X}\Vert _2^2+\Vert \textbf{Y}\Vert _2^2), \end{aligned}$$
(8.55)

which is introduced in previous sections.

Fig. 8.6 A visualization of t-SNE 2D representations on Wiki dataset (left: DeepWalk, right: MMDW) [118]

Figure 8.6 shows the visualization result of the DeepWalk and MMDW algorithm on the Wiki dataset [103]. We can see that the embeddings of nodes from different classes are more separable with the help of semi-supervised max-margin representation learning.

(4) PTE Another algorithm called PTE (Predictive Text Embedding) [110] focuses on text networks such as bibliography networks, where a paper is a vertex and the citation relationships between papers form the edges. PTE considers network structure together with plain text and observed vertex labels. PTE proposes a semi-supervised framework to learn vertex representations and predict unobserved vertex labels.

A text network is divided into three bipartite networks: word-word, word-document, and word-label networks. We will introduce the definition of the three networks in more detail.

For the word-word network, the weight \(w_{ij}\) of the edge between word \(v_i\) and \(v_j\) is defined as the number of times that the two words co-occur in the same context windows. For word-document network, the weight \(w_{ij}\) between word \(v_i\) and document \(d_j\) is defined as the number of times \(v_i\) appears in document \(d_j\). For the word-label network, the weight \(w_{ij}\) of the edge between word \(v_i\) and class \(c_j\) is defined as: \(w_{ij} =\sum _{d:l_d=j}n_{di}\), where \(n_{di}\) is the term frequency of word \(v_i\) in document d, and \(l_d\) is the class label of document d.
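The word-label weights defined above amount to one matrix product. The sketch below assumes a small, hypothetical term-frequency matrix and document labels; the variable names are chosen only for illustration.

```python
import numpy as np

# Hypothetical term-frequency matrix n[d, i]: 3 documents (rows), 4 words (columns).
term_freq = np.array([[2, 0, 1, 0],
                      [0, 3, 1, 0],
                      [1, 0, 0, 4]])
doc_labels = np.array([0, 1, 0])   # class label l_d of each document
num_classes = 2

# one_hot[d, j] = 1 if l_d = j, so (n^T @ one_hot)[i, j] = sum over {d : l_d = j} of n[d, i].
one_hot = np.eye(num_classes, dtype=int)[doc_labels]
word_label = term_freq.T @ one_hot

# The word-document weights are simply the transposed term-frequency matrix.
word_document = term_freq.T
```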

Then following previous work LINE, given bipartite network \(G=(V_A\cup V_B, E)\), the conditional probability of generating \(v_i\in V_A\) from \(v_j\in V_B\) is defined as

$$\begin{aligned} P(v_i|v_j)=\frac{\exp (\textbf{v}_i\cdot \textbf{v}_j)}{\sum _{v_k\in V_A}\exp (\textbf{v}_k\cdot \textbf{v}_j)}. \end{aligned}$$
(8.56)

Similar to LINE model, the loss function is defined as the KL-divergence between empirical distribution and conditional distribution. The optimization objective can be further formulated as

$$\begin{aligned} \mathscr {L}=-\sum _{(v_i,v_j)\in E}w_{ij}\log P(v_i|v_j). \end{aligned}$$
(8.57)

Then the final objective can be obtained by summing all three bipartite networks:

$$\begin{aligned} \mathscr {L}_{pte}=\mathscr {L}_{ww}+\mathscr {L}_{wd}+\mathscr {L}_{wl}, \end{aligned}$$
(8.58)

where

$$\begin{aligned} \mathscr {L}_{ww}=-\sum _{(v_i,v_j)\in E_{ww}}w_{ij}\log P(v_i|v_j), \end{aligned}$$
(8.59)
$$\begin{aligned} \mathscr {L}_{wd}=-\sum _{(v_i,d_j)\in E_{wd}}w_{ij}\log P(v_i|d_j), \end{aligned}$$
(8.60)
$$\begin{aligned} \mathscr {L}_{wl}=-\sum _{(v_i,c_j)\in E_{wl}}w_{ij}\log P(v_i|c_j). \end{aligned}$$
(8.61)

Then the optimization can be done by stochastic gradient descent.

8.2.5.3 Task-Specific Network Representation

Network Representation for Community Detection. As shown in spectral clustering methods, a community indicator matrix can be learned based on modularity or normalized graph cut. The continuous community indicator matrix can be seen as a k-dimensional vertex representation, where k is the number of communities. Note that modularity and graph cut are defined for nonoverlapping communities. By altering the cost function to suit overlapping communities, the idea can also work for overlapping community detection. In this subsection, we will introduce several community detection algorithms. These community detection algorithms start by learning a k-dimensional nonnegative vertex-community affinity matrix and then derive a hard community assignment for vertices based on the matrix. Therefore, the key procedure of these algorithms can be regarded as unsupervised learning of k-dimensional nonnegative vertex embeddings.

BIGCLAM [140] is an overlapping community detection method. It assumes that matrix \(\textbf{F}\in \mathbb {R}^{|V|\times k}\) is the vertex-community affinity matrix, where \(\textbf{F}_{vc}\) is the strength of the affiliation between vertex v and community c. Matrix \(\textbf{F}\) is nonnegative and \(\textbf{F}_{vc}=0\) indicates no affiliation. BIGCLAM builds a generative model by modeling the probability that vertex \(v_i\) connects to \(v_j\) given the vertex-community affinity matrix \(\textbf{F}\). More specifically, given matrix \(\textbf{F}\), BIGCLAM generates an edge between vertices \(v_i\) and \(v_j\) with probability

$$\begin{aligned} P(v_i,v_j)=1-\exp (-\textbf{F}_{v_i}\cdot \textbf{F}_{v_j}), \end{aligned}$$
(8.62)

where \(\textbf{F}_{v_i}\) is the row of matrix \(\textbf{F}\) corresponding to vertex \(v_i\) and can be seen as the representation of \(v_i\). Note that the probability \(P(v_i,v_j)\) increases with \(\textbf{F}_{v_i}\cdot \textbf{F}_{v_j}=\sum _c \textbf{F}_{v_i,c}\textbf{F}_{v_j,c}\), which indicates that the more communities a pair of nodes share, the more likely they are to be connected.

For the case that \(\textbf{F}_{v_i}\cdot \textbf{F}_{v_j}=0\), BIGCLAM adds a background probability \(\epsilon =\frac{2|E|}{|V|(|V|-1)}\) to the pair of nodes to avoid a zero probability.

Then BIGCLAM tries to maximize the log-likelihood of the graph \(G=(V,E)\):

$$\begin{aligned} \mathscr {O}(\textbf{F})=\sum _{i,j:(v_i,v_j)\in E} \log P(v_i,v_j) + \sum _{i,j:(v_i,v_j)\not \in E} \log (1-P(v_i,v_j)), \end{aligned}$$
(8.63)

which can be reformulated as

$$\begin{aligned} \mathscr {O}(\textbf{F})=\sum _{i,j:(v_i,v_j)\in E} \log (1-\exp (-\textbf{F}_{v_i}\cdot \textbf{F}_{v_j})) - \sum _{i,j:(v_i,v_j)\not \in E} \textbf{F}_{v_i}\cdot \textbf{F}_{v_j}. \end{aligned}$$
(8.64)

The parameters \(\textbf{F}\) are learned by projected gradient descent. Note that the training objective can be regarded as a variant of nonnegative matrix factorization: maximizing the log-likelihood function approximates the adjacency matrix A by \(\textbf{F}\textbf{F}^{\top }\). Compared with an L2-norm loss function, the gradient of Eq. 8.64 can be computed more efficiently for a sparse matrix A, which is the case for most real-world datasets.
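The step from Eq. 8.63 to Eq. 8.64 can be checked numerically. The sketch below assumes an undirected toy graph, sums over unordered vertex pairs, and uses a strictly positive random \(\textbf{F}\) so that no edge has zero probability; the background probability \(\epsilon \) and the projected gradient updates are left out.

```python
import numpy as np

def loglik_full(F, adj):
    """Eq. 8.63: log P over edges plus log(1 - P) over non-edges (unordered pairs)."""
    n, total = len(F), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p = 1.0 - np.exp(-F[i] @ F[j])
            total += np.log(p) if adj[i, j] else np.log(1.0 - p)
    return total

def loglik_simplified(F, adj):
    """Eq. 8.64: the non-edge term reduces to minus the inner product."""
    n, total = len(F), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            inner = F[i] @ F[j]
            total += np.log(1.0 - np.exp(-inner)) if adj[i, j] else -inner
    return total

# Hypothetical undirected graph with four vertices and k = 2 communities.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
rng = np.random.default_rng(0)
F = rng.random((4, 2)) + 0.1        # strictly positive affinities
full, simplified = loglik_full(F, adj), loglik_simplified(F, adj)
```

The simplified form needs no exponential for non-edges, which is where most vertex pairs of a sparse network fall.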

The model can also be generalized to asymmetric case [141]. That is to replace Eq. 8.62 by

$$\begin{aligned} P(v_i,v_j)=1-\exp (-\textbf{F}_{v_i}\cdot \textbf{H}_{v_j}), \end{aligned}$$
(8.65)

where \(\textbf{H}\) is another matrix that has the same size as the matrix \(\textbf{F}\). The generative model can also consider attributes of vertices by adding attribute terms to Eq. 8.62 [79].

8.2.5.4 Network Representation for Visualization

Different from previous algorithms that focus on machine learning tasks, the algorithms introduced in this subsection are designed for visualization. Because networks are a commonly used data structure, their visualization is an important task. The representations of vertices are usually two- or three-dimensional so that the graph can be drawn.

Representation learning for network visualization generally adheres to the following aesthetic criteria [30]:

  • Distribute the vertices evenly in the frame.

  • Minimize edge crossings.

  • Make edge lengths uniform.

  • Reflect inherent symmetry.

  • Conform to the frame.

Following these criteria, graph visualization algorithms build a force-directed graph drawing framework. The basic assumption is that there is a spring between each pair of vertices. Then the optimization objective is to minimize the energy of the graph according to Hooke’s law:

$$\begin{aligned} \mathscr {E}=\sum _{i,j}\frac{1}{2}k_{ij}(\Vert \textbf{v}_i-\textbf{v}_j\Vert -l_{ij})^2, \end{aligned}$$
(8.66)

where \(k_{ij}\) is the spring constant, \(\textbf{v}_i\) is the position of vertex \(v_i\), and \(l_{ij}\) is the length of the shortest path between vertices \(v_i\) and \(v_j\). The intuition is straightforward: vertices that are close in the graph should have close positions in the drawing. Several algorithms [34, 54, 60] improve this framework by changing the setting of the spring constant \(k_{ij}\) or the energy function. The parameters, i.e., the vertex positions, can be easily learned via gradient descent.
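The energy in Eq. 8.66 can be sketched as follows for a small connected graph. The helper names are hypothetical, the sum is taken over unordered pairs \(i<j\), and the default spring constant \(k_{ij}=1/l_{ij}^2\) is only one possible choice assumed for illustration.

```python
import numpy as np

def shortest_path_lengths(A):
    """All-pairs hop distances via Floyd-Warshall (suitable for small graphs)."""
    dist = np.where(np.asarray(A) > 0, 1.0, np.inf)
    np.fill_diagonal(dist, 0.0)
    for k in range(dist.shape[0]):
        # Relax every pair through intermediate vertex k.
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist

def spring_energy(pos, dist, spring_k=None):
    """Energy of Eq. 8.66 over unordered pairs i < j of a connected graph."""
    i, j = np.triu_indices(pos.shape[0], k=1)
    l = dist[i, j]
    # Assumed default: springs between distant vertices are weaker.
    k_ij = 1.0 / l**2 if spring_k is None else spring_k[i, j]
    d = np.linalg.norm(pos[i] - pos[j], axis=1)
    return np.sum(0.5 * k_ij * (d - l) ** 2)
```

A layout whose pairwise Euclidean distances equal the shortest-path lengths has zero energy; minimizing the energy over `pos` by gradient descent moves a random layout toward such a configuration.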

8.2.5.5 Embedding Enhancement via High-Order Proximity Approximation

Yang et al. [137] summarize several existing NRL methods into a unified two-step framework consisting of proximity matrix construction and dimension reduction. They conclude that an NRL method can be improved by exploring higher order proximities when building the proximity matrix. They then propose the Network Embedding Update (NEU) algorithm, which implicitly approximates higher order proximities with a theoretical approximation bound and can be applied to any NRL method to enhance its performance. NEU makes a consistent and significant improvement over some NRL methods at almost negligible running time.

The two-step framework is summarized as follows:

Step 1: Proximity Matrix Construction. Compute a proximity matrix \(\textbf{M}\in \mathbb {R}^{|V|\times |V|}\), which encodes the information of the k-order proximity matrices for \(k=1,2,\dots ,K\). For example, \(\textbf{M}=\frac{1}{K}A+\frac{1}{K}A^2+\dots +\frac{1}{K}A^K\) stands for an average combination of the k-order proximity matrices for \(k=1,2,\dots ,K\). The proximity matrix \(\textbf{M}\) is usually represented by a polynomial of degree K in the normalized adjacency matrix A, and we denote the polynomial as \(f(A)\in \mathbb {R}^{|V|\times |V|}\). Here the degree K of the polynomial f(A) corresponds to the maximum order of proximities encoded in the proximity matrix. Note that storing and computing the proximity matrix \(\textbf{M}\) does not necessarily take \(O(|V|^2)\) time because we only need to save and compute the nonzero entries.

Step 2: Dimension Reduction. Find network embedding matrix \(\textbf{V}\in \mathbb {R}^{|V|\times d}\) and context embedding \(\textbf{C}\in \mathbb {R}^{|V|\times d}\) so that the product \(\textbf{V}\textbf{C}^{\top }\) approximates proximity matrix \(\textbf{M}\). Here different algorithms may employ different distance functions to minimize the distance between \(\textbf{M}\) and \(\textbf{V}\textbf{C}^{\top }\). For example, we can naturally use the norm of matrix \(\textbf{M}-\textbf{V}\textbf{C}^{\top }\) to measure the distance and minimize it.
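The two steps above can be sketched in NumPy as follows. This is an illustrative dense implementation for small graphs only: the function names are hypothetical, the proximity matrix is the average combination from Step 1, and truncated SVD is assumed as the dimension reduction, which is one of several possible choices.

```python
import numpy as np

def row_normalize(adj):
    """Row-normalized adjacency matrix; rows without edges stay zero."""
    deg = adj.sum(axis=1, keepdims=True)
    return adj / np.maximum(deg, 1)

def average_proximity(A, K):
    """Step 1: M = (A + A^2 + ... + A^K) / K."""
    M = np.zeros(A.shape, dtype=float)
    power = np.eye(A.shape[0])
    for _ in range(K):
        power = power @ A
        M += power
    return M / K

def factorize(M, d):
    """Step 2: rank-d factors V, C with V @ C.T approximating M (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M)
    root = np.sqrt(s[:d])
    return U[:, :d] * root, Vt[:d].T * root
```

With this choice of Step 2, the distance being minimized is the Frobenius norm of \(\textbf{M}-\textbf{V}\textbf{C}^{\top }\).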

Spectral Clustering, DeepWalk, and GraRep can be formalized into the two-step framework. Now we focus on the first step and study how to define the right proximity matrix for NRL.

Table 8.4 Comparisons among three NRL methods

We summarize the comparisons among Spectral Clustering (SC), DeepWalk, and GraRep in Table 8.4 and conclude the following observations.

Observation 8.1

Modeling higher order and accurate proximity matrix can improve the quality of network representation. In other words, NRL can benefit from exploring a polynomial proximity matrix f(A) of a higher degree.

From the development of NRL methods, it can be seen that DeepWalk outperforms Spectral Clustering because DeepWalk considers higher order proximity matrices, and the higher order proximity matrices can provide complementary information for lower order proximity matrices. GraRep outperforms DeepWalk because GraRep accurately calculates the k-order proximity matrix rather than approximating it by Monte Carlo simulation as DeepWalk does.

Observation 8.2

Accurate computation of high-order proximity matrix is not feasible for large-scale networks.

The major drawback of GraRep is the computational complexity of calculating the accurate k-order proximity matrix. In fact, computing a high-order proximity matrix takes \(O(|V|^2)\) time, and the time complexity of SVD also increases because the k-order proximity matrix gets denser as k grows. In summary, a time complexity of \(O(|V|^2)\) is too expensive for large-scale networks.

The first observation provides the motivation to explore higher order proximity matrices in NRL models, but the second observation indicates that an accurate computation of higher order proximity matrices is not feasible. Therefore, how to learn network embeddings from approximate higher order proximity matrices efficiently becomes important. To be more efficient, the network representations that encode the information of lower order proximity matrices can be used as a basis to avoid repeated computations. The problem is formalized below.

Problem Formalization. Assume that we have the normalized adjacency matrix A as the first-order proximity matrix, a network embedding \(\textbf{V}\), and a context embedding \(\textbf{C}\), where \(\textbf{V},\textbf{C}\in \mathbb {R}^{|V|\times d}\). Suppose that the embeddings \(\textbf{V}\) and \(\textbf{C}\) are learned by the above NRL framework, which means that the product \(\textbf{V} \textbf{C}^{\top }\) approximates a polynomial proximity matrix f(A) of degree K. The goal is to learn better representations \(\textbf{V}'\) and \(\textbf{C}'\), whose product approximates a polynomial proximity matrix g(A) of higher degree than f(A). Also, the algorithm should run in time linear in |V|. Note that the lower bound of the time complexity is O(|V|d), which is the size of the embedding matrix \(\textbf{V}\).

There is a simple, efficient, and effective iterative updating algorithm to solve the above problem.

Method. Given hyperparameter \(\lambda \in (0,\frac{1}{2}]\), normalized adjacency matrix A, we update \(\textbf{V}\) and \(\textbf{C}\) as follows:

$$\begin{aligned} \begin{aligned} \textbf{V}'&=\textbf{V}+\lambda A \textbf{V},\\ \textbf{C}'&=\textbf{C}+\lambda A^{\top } \textbf{C}. \end{aligned} \end{aligned}$$
(8.67)

The time complexity of computing \(A \textbf{V}\) and \(A^{\top } \textbf{C}\) is O(|V|d) because matrix A is sparse and has O(|V|) nonzero entries. Thus the overall time complexity of one iteration of operation (Eq. 8.67) is O(|V|d).

Recall that the product of the previous embeddings \(\textbf{V}\) and \(\textbf{C}\) approximates a polynomial proximity matrix f(A) of degree K. It can be proved that the algorithm learns better embeddings \(\textbf{V}'\) and \(\textbf{C}'\), where the product \(\textbf{V}'\textbf{C}'^{\top }\) approximates a polynomial proximity matrix g(A) of degree \(K+2\) with an error bounded in the matrix infinity norm.

Theorem

Denote the network and context embedding by \(\textbf{V}\) and \(\textbf{C}\), and suppose that the approximation between \(\textbf{V} \textbf{C}^{\top }\) and proximity matrix \(\textbf{M}=f(A)\) is bounded by \(r=\Vert f(A)-\textbf{V} \textbf{C}^{\top }\Vert _\infty \) and \(f(\cdot )\) is a polynomial of degree K. Then the product of updated embeddings \(\textbf{V}'\) and \(\textbf{C}'\) from Eq. 8.67 approximates a polynomial \(g(A)=f(A)+2\lambda Af(A)+\lambda ^2 A^2 f(A)\) of degree \(K+2\) with approximation bound \(r'=(1+2\lambda +\lambda ^2) r\le \frac{9}{4}r\).

Proof

Assume that \(\textbf{S}=f(A)-\textbf{V}\textbf{C}^{\top }\) and thus \(r=\Vert \textbf{S}\Vert _\infty \).

$$\begin{aligned} \begin{aligned}&\Vert g(A)-\textbf{V}' \textbf{C}'^{\top }\Vert _\infty = \Vert g(A)-(\textbf{V}+\lambda A\textbf{V})(\textbf{C}^{\top }+\lambda \textbf{C}^{\top }A)\Vert _\infty \\&=\Vert g(A)-\textbf{V}\textbf{C}^{\top }-\lambda A\textbf{V}\textbf{C}^{\top }-\lambda \textbf{V}\textbf{C}^{\top }A-\lambda ^2 A\textbf{V}\textbf{C}^{\top }A\Vert _\infty \\&=\Vert \textbf{S}+\lambda A\textbf{S}+\lambda \textbf{S}A+\lambda ^2 A\textbf{S}A\Vert _\infty \\&\le \Vert \textbf{S}\Vert _\infty +\lambda \Vert A\Vert _\infty \Vert \textbf{S}\Vert _\infty +\lambda \Vert \textbf{S}\Vert _\infty \Vert A\Vert _\infty +\lambda ^2\Vert \textbf{S}\Vert _\infty \Vert A\Vert _\infty ^2\\&= r+2\lambda r+\lambda ^2 r, \end{aligned} \end{aligned}$$
(8.68)

where the second-to-last equality substitutes the definitions of g(A) and \(\textbf{S}\), using the fact that f(A) commutes with A because it is a polynomial in A, and the last equality uses the fact that \(\Vert A\Vert _\infty =\max _i \sum _j |A_{ij}|=1\) because each row of A sums to 1.

In the experimental settings, it is assumed that the weight of lower order proximities should be larger than that of higher order proximities because they are more directly related to the original network. Therefore, given \(g(A)=f(A)+2\lambda Af(A)+\lambda ^2 A^2 f(A)\), we have \(1\ge 2\lambda \ge \lambda ^2>0\), which indicates that \(\lambda \in (0,\frac{1}{2}]\). The proof indicates that the updated embeddings implicitly approximate a polynomial g(A) whose degree is higher by 2, with an approximation error in the matrix infinity norm of at most \(\frac{9}{4}\) times that of the previous embeddings.    \(\blacksquare \)
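The bound of the theorem can be checked numerically on a toy example. The sketch below is only a sanity check under assumed inputs: a six-vertex ring with one chord, \(f(A)=\frac{1}{2}(A+A^2)\), and rank-2 embeddings obtained from a truncated SVD. None of these choices come from the text.

```python
import numpy as np

def inf_norm(X):
    """Matrix infinity norm: maximum absolute row sum."""
    return np.abs(X).sum(axis=1).max()

# Small ring graph with one chord; A is the row-normalized adjacency matrix.
n, d, lam = 6, 2, 0.5
adj = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
adj[0, 3] = adj[3, 0] = 1.0
A = adj / adj.sum(axis=1, keepdims=True)

f_A = (A + A @ A) / 2                        # example f(A) of degree K = 2
U, s, Vt = np.linalg.svd(f_A)
V = U[:, :d] * np.sqrt(s[:d])                # rank-d factors, V C^T ~ f(A)
C = Vt[:d].T * np.sqrt(s[:d])
r = inf_norm(f_A - V @ C.T)

V_new = V + lam * (A @ V)                    # Eq. 8.67
C_new = C + lam * (A.T @ C)
g_A = f_A + 2 * lam * (A @ f_A) + lam**2 * (A @ A @ f_A)
r_new = inf_norm(g_A - V_new @ C_new.T)

# The theorem states r_new <= (1 + lam)^2 * r.
bound_holds = r_new <= (1 + lam) ** 2 * r + 1e-12
```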

Algorithm. The update Eq. 8.67 can be further generalized in two directions. First we can update embeddings \(\textbf{V}\) and \(\textbf{C}\) according to Eq. 8.69:

$$\begin{aligned} \begin{aligned} \textbf{V}'&=\textbf{V}+\lambda _1 A\ \textbf{V}+\lambda _2 A\ (A\ \textbf{V}),\\ \textbf{C}'&=\textbf{C}+\lambda _1 A^{\top }\ \textbf{C}+\lambda _2 A^{\top }\ (A^{\top }\ \textbf{C}). \end{aligned} \end{aligned}$$
(8.69)

The time complexity is still O(|V|d), but Eq. 8.69 obtains a higher order proximity matrix approximation than Eq. 8.67 in one iteration. More complex update formulas that explore even higher order proximities than Eq. 8.69 can also be applied, but Eq. 8.69 is used in current experiments as a cost-effective choice.

Another direction is to apply the update equation for T rounds to obtain a higher order proximity approximation. However, the approximation bound grows exponentially with the number of rounds T, so the update cannot be repeated indefinitely. Note that the update operations on \(\textbf{V}\) and \(\textbf{C}\) are completely independent. Therefore, updating only the network embedding \(\textbf{V}\) is enough for NRL. The above algorithm (NEU) avoids an accurate computation of the high-order proximity matrix but can yield network embeddings that actually approximate high-order proximities. Hence, this algorithm can improve the quality of network embeddings efficiently. Intuitively, Eqs. 8.67 and 8.69 allow the learned embeddings to further propagate to their neighbors. Hence, the proximities of longer distances between vertices will be embedded.
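A minimal sketch of the update in Eq. 8.69, applied only to the network embedding \(\textbf{V}\) for a given number of rounds, might look as follows. The function name `neu` and the default values of `lam1`, `lam2`, and `rounds` are illustrative assumptions, not settings taken from the text.

```python
import numpy as np

def neu(A, V, lam1=0.5, lam2=0.25, rounds=1):
    """Apply the update of Eq. 8.69 to V for the given number of rounds.

    lam1, lam2 and rounds are illustrative defaults, not values from the text.
    """
    for _ in range(rounds):
        AV = A @ V                              # one-hop propagation
        V = V + lam1 * AV + lam2 * (A @ AV)     # reuse AV for the two-hop term
    return V
```

Computing `A @ (A @ V)` rather than `(A @ A) @ V` keeps each round to two products of A with a \(|V|\times d\) matrix, so the dense matrix \(A^2\) is never formed. Setting `lam2=0` recovers Eq. 8.67.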

8.2.6 Applications

In this part, we will introduce common applications for network representation learning and their evaluation metrics.

8.2.6.1 Multi-label Classification

Multi-label classification is the most widely used evaluation task for network representation learning. The representations of vertices are treated as vertex features and fed to classifiers to predict vertex labels. More formally, we assume that there are K labels in total. The vertex-label relationship can be expressed as a binary matrix \(\textbf{M}\in \{0,1\}^{|V|\times K}\) where \(\textbf{M}_{ij}=1\) indicates that vertex \(v_i\) has the jth label and \(\textbf{M}_{ij}=0\) otherwise. In the special case of multiclass classification, each vertex has exactly one label, which means there is exactly one 1 in each row of matrix \(\textbf{M}\). For the evaluation task, we set a training ratio, which indicates the percentage of vertices whose labels are observed. Then our goal is to predict the labels of the vertices in the test set.

For unsupervised network representation learning algorithms, the labels of the training set are not used for embedding learning. The network representations are fed to classifiers such as SVM or logistic regression, with a separate classifier for each label. Semi-supervised learning methods take the observed vertex labels into account during representation learning and have their own specific classifiers for label prediction.

Once the label prediction is done, we can move to compute the evaluation metrics. For multiclass classification, we assume that the number of correctly predicted vertices is \(|V_r|\). Then the classification accuracy is defined as the ratio of correctly predicted vertices which can be formulated as \(|V_r|/|V|\). For multi-label classification, the precision, recall, and F1 are the most popular metrics, which are computed as follows:

$$\begin{aligned} \begin{aligned} \text {Precision}&=\frac{N_{ \text {correctly predicted labels}}}{N_{ \text {predicted labels}}},\\ \text {Recall}&=\frac{N_{ \text {correctly predicted labels}}}{N_{ \text {unobserved labels}}},\\ \text {F1-Score}&=\frac{2\text {Precision}\times \text {Recall}}{\text {Precision}+\text {Recall}}. \end{aligned} \end{aligned}$$
(8.70)
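The metrics in Eq. 8.70 can be sketched as below, assuming the counts are pooled over all test vertices and labels (micro-averaging). Both inputs are binary vertex-by-label matrices restricted to the test vertices, and the function name `micro_prf` is hypothetical.

```python
import numpy as np

def micro_prf(Y_true, Y_pred):
    """Precision, recall and F1 of Eq. 8.70 from binary label matrices.

    Rows are test vertices and columns are labels; counts are pooled over
    all entries, which is an assumption of this sketch.
    """
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)
    correct = np.sum(Y_true & Y_pred)           # correctly predicted labels
    n_pred = Y_pred.sum()                       # predicted labels
    n_true = Y_true.sum()                       # unobserved (true test) labels
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```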

8.2.6.2 Link Prediction

Link prediction is another important evaluation task for network representation learning because a good network embedding should have the ability to model the affinity between vertices. For evaluation, we randomly select a subset of edges as the training set and leave the rest as the test set. Cross-validation can also be employed for training and testing.

To predict links from the vertex representations, we first need to evaluate the strength of the connection between a pair of vertices, which is measured by the similarity between their representations. This similarity is usually computed by cosine similarity, inner product, or square loss, depending on the algorithm. For example, if an algorithm uses \(\Vert \textbf{V}_i-\textbf{C}_j\Vert _2^2\) in its objective function, then square loss should be used to measure the similarity between vertex representations. Once we have the similarity of all unobserved links, we can rank them for link prediction. There are two major metrics for link prediction: area under the receiver operating characteristic curve (AUC) and precision.

AUC. The AUC value is the probability that a randomly chosen missing link has a higher score than a randomly chosen nonexistent link. In practice, we randomly select a missing link and a nonexistent link and compare their similarity scores. Assume that among n independent comparisons, the missing link has a higher score \(n_1\) times and the two links have the same score \(n_2\) times. Then the AUC value is

$$\begin{aligned} {\text {AUC}}=\frac{n_1+0.5n_2}{n}. \end{aligned}$$
(8.71)

Note that for a random network representation, the AUC value should be 0.5.

Precision. Given the ranking of all the non-observed links, we take the L links with the highest scores as the predicted ones. If \(L_{r}\) of them are missing links, then the precision is defined as \(L_{r}/L\).
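Both link prediction metrics can be sketched as follows. The function names, the default number of comparisons `n`, and the fixed random seed are assumptions made for illustration; the sampling follows the procedure described for Eq. 8.71.

```python
import numpy as np

def sampled_auc(missing_scores, nonexistent_scores, n=10000, seed=0):
    """Estimate the AUC of Eq. 8.71 from n random pairwise comparisons."""
    rng = np.random.default_rng(seed)
    pos = rng.choice(np.asarray(missing_scores, dtype=float), size=n)
    neg = rng.choice(np.asarray(nonexistent_scores, dtype=float), size=n)
    n1 = np.sum(pos > neg)                  # missing link scored higher
    n2 = np.sum(pos == neg)                 # ties
    return (n1 + 0.5 * n2) / n

def precision_at_L(scores, is_missing, L):
    """Fraction of the L highest-scored non-observed links that are missing links."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    return np.asarray(is_missing, dtype=bool)[order[:L]].sum() / L
```

Here `scores` holds the similarity of every non-observed link, and `is_missing` marks which of them are held-out test edges.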

8.2.6.3 Community Detection

For network representation based community detection algorithms, we first need to convert the nonnegative vertex representations into a hard assignment of communities. Assume that we have a network representation matrix \(\textbf{V}\in \mathbb {R}_+^{|V|\times k}\) where row i of \(\textbf{V}\) is the nonnegative embedding of vertex \(v_i\). For community detection, we regard each dimension of the embeddings as a community. That is to say, \(\textbf{V}_{ij}\) denotes the affinity between vertex \(v_i\) and community \(c_j\). For each column of matrix \(\textbf{V}\), we set a threshold \(\varDelta \), and the vertices with an affinity score higher than the threshold are considered members of the corresponding community. The threshold can be set in various ways. For example, we can set \(\varDelta \) so that a vertex belongs to a community c if the node is connected to other members of c with an edge probability higher than 1/N [140]:

$$\begin{aligned} \frac{1}{N}\le 1-\exp (-\varDelta ^2), \end{aligned}$$
(8.72)

which indicates that \(\varDelta =\sqrt{-\log (1-1/N)}\).
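The thresholding step can be sketched as below. The function names are hypothetical, and N is passed in as a parameter because its value depends on the setting in [140].

```python
import numpy as np

def community_threshold(N):
    """Threshold from Eq. 8.72 taken with equality: sqrt(-log(1 - 1/N))."""
    return np.sqrt(-np.log(1.0 - 1.0 / N))

def assign_communities(V, N):
    """Boolean membership matrix: entry (i, j) is True if v_i joins community c_j."""
    return np.asarray(V, dtype=float) > community_threshold(N)
```

A vertex may exceed the threshold in several columns, so the resulting communities can overlap.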

For evaluation metrics, we have two choices: modularity and matching score.

Modularity. Recall that the modularity Q of a graph is defined as

$$\begin{aligned} Q=\frac{1}{2|E|}\sum _{i,j}\left[ A_{ij}-\frac{\text {deg}(v_i)\text {deg}(v_j)}{2|E|}\right] \delta (v_i,v_j), \end{aligned}$$
(8.73)

where \(\delta (v_i,v_j)=1\) if \(v_i\) and \(v_j\) belong to the same community and \(\delta (v_i,v_j)=0\) otherwise. A larger modularity indicates a better community detection algorithm.
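Eq. 8.73 translates directly into a few lines of NumPy. The sketch below assumes an undirected graph with a symmetric adjacency matrix and a hard assignment with one community label per vertex; the function name is hypothetical.

```python
import numpy as np

def modularity(A, labels):
    """Modularity Q of Eq. 8.73 for a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    deg = A.sum(axis=1)
    two_m = deg.sum()                                # equals 2|E|
    same = labels[:, None] == labels[None, :]        # delta(v_i, v_j)
    return ((A - np.outer(deg, deg) / two_m) * same).sum() / two_m
```

For example, for two disjoint triangles with each triangle as one community, Q is 0.5, while assigning all vertices to a single community gives Q equal to 0.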

Matching Score. This is a more sophisticated evaluation metric for community detection. To compare a set of ground truth communities \(C^*\) to a set of detected communities C, we first need to match each detected community to the most similar ground truth community. On the other side, we also find the most similar detected community for each ground truth community. Then the final performance is evaluated by the average of both sides:

$$\begin{aligned} \frac{1}{2|C^*|}\sum _{c_i^*\in C^*}\max _{c_j\in C}\delta (c_i^*,c_j)+\frac{1}{2|C|}\sum _{c_j\in C}\max _{c_i^*\in C^*}\delta (c_i^*,c_j), \end{aligned}$$
(8.74)

where \(\delta (c_i^*,c_j)\) is a similarity measurement of ground truth community \(c_i^*\) and detected community \(c_j\), such as Jaccard similarity. The score is between 0 and 1, where 1 indicates a perfect matching of ground truth communities.
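A minimal sketch of Eq. 8.74, assuming Jaccard similarity for \(\delta \) and communities given as collections of vertex identifiers (the function names are hypothetical):

```python
def jaccard(a, b):
    """Jaccard similarity of two vertex collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def matching_score(truth, detected):
    """Eq. 8.74: average best-match similarity in both directions."""
    truth_side = sum(max(jaccard(c, d) for d in detected) for c in truth)
    detected_side = sum(max(jaccard(c, d) for c in truth) for d in detected)
    return truth_side / (2 * len(truth)) + detected_side / (2 * len(detected))
```

Both lists are assumed to be nonempty. Matching in both directions penalizes detecting too many small communities as well as too few large ones.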

8.2.6.4 Recommender System

Recommender systems aim at recommending items (e.g., products, movies, or locations) for users and cover a wide range of applications. In many cases, an application comes with an associated social network between users. Now we will present an example to show how to use the idea of network representation for building recommender systems in location-based social networks.

Fig. 8.7 An illustrative example for the data in LBSNs: (a) link connections represent the friendship between users; (b) a trajectory generated by a user is a sequence of chronologically ordered check-in records [138]

The accelerated growth of mobile trajectories in location-based services brings valuable data resources for understanding users’ moving behaviors. Apart from recording trajectory data, another major characteristic of these location-based services is that they allow users to connect with whomever they like or are interested in. As shown in Fig. 8.7, such a combination of social networking and location-based services is called a Location-Based Social Network (LBSN). As shown in [21], locations that are frequently visited by socially related persons tend to be correlated, which indicates a close association between the social connections and trajectory behaviors of users in LBSNs. To better analyze and mine LBSN data, we need a comprehensive view that covers both aspects, i.e., the social network and the mobile trajectory data.

Specifically, JNTM [138] is proposed to model both social networks and mobile trajectories jointly. The model consists of two components: the construction of social networks and the generation of mobile trajectories. First, JNTM adopts a network embedding method for the construction of social networks, from which a network representation can be derived for each user. Second, JNTM considers four factors that influence the generation process of mobile trajectories, namely, user visit preference, influence of friends, short-term sequential contexts, and long-term sequential contexts. JNTM then uses real-valued representations to encode the four factors and sets two different user representations to model the first two factors: a visit interest representation and a network representation. To characterize the last two contexts, JNTM employs the RNN and GRU models to capture the sequential relatedness in mobile trajectories at different levels, i.e., short term or long term. Finally, the two components are tied by sharing user network representations. The overall model is illustrated in Fig. 8.8.

8.2.6.5 Information Diffusion Prediction

Information diffusion prediction is an important task which studies how information items spread among users. The prediction of information diffusion, also known as cascade prediction, has been studied over a wide range of applications, such as product adoption [67], epidemiology [124], social networks [63], and the spread of news and opinions [68].

Fig. 8.8 The architecture of the JNTM model

Fig. 8.9 Illustrative examples for microscopic next infected user prediction (left) and macroscopic cascade size prediction (right) [139]

As shown in Fig. 8.9, microscopic diffusion prediction aims at guessing the next infected user, while macroscopic diffusion prediction estimates the total number of infected users during the diffusion process. Also, an underlying social graph among users is available when information diffusion occurs on a social network service. The social graph is then considered as an additional structural input for diffusion prediction.

Fig. 8.10 An illustrative example of structural context extraction of the orange node by neighbor sampling and feature aggregation [139]

FOREST [139] is the first work to address both microscopic and macroscopic predictions. As shown in Fig. 8.10, FOREST adopts a structural context extraction algorithm, originally introduced for accelerating graph convolutional networks [41], to build an RNN-based microscopic cascade model. For each user v, we first sample Z users \(\{u_1,u_2\dots ,u_{Z}\}\) from v and its neighbors \(\mathscr {N}(v)\). Then we update the feature vector of v by aggregating the features of the sampled users. The updated user feature vector encodes structural information by aggregating features from v’s first-order neighbors. The operation can also be applied recursively to explore a larger neighborhood of user v. Empirically, a two-step neighborhood exploration is time efficient and enough to give promising results.

FOREST further incorporates the ability of macroscopic prediction, i.e., estimating the eventual size of a cascade, into the model by reinforcement learning. The method can be divided into four steps: (a) encode the K observed users by a microscopic cascade model; (b) enable the microscopic cascade model to predict the size of a cascade by cascade simulations; (c) use Mean-Square Log-Transformed Error (MSLE) as the supervision signal for macroscopic predictions; and (d) employ a reinforcement learning framework to update the parameters through a policy gradient algorithm. The overall workflow is illustrated in Fig. 8.11.

8.3 Graph Neural Networks

We now give a short introduction to Graph Neural Networks for NRL, partially based on our review [161] and tutorial [162] whose publishing agreement allows the authors to reuse these contents.

8.3.1 Motivations

Graph Neural Networks (GNNs) are deep learning based methods that operate on the graph domain. Owing to their convincing performance and high interpretability, GNNs have recently become a widely applied graph analysis method. In this subsection, we will illustrate the fundamental motivations of graph neural networks.

Fig. 8.11 The workflow of adopting microscopic cascade model for macroscopic size prediction by reinforcement learning

In recent years, CNNs [65] have made breakthroughs in various machine learning areas, especially in the area of computer vision, and started the revolution of deep learning [64]. CNNs are capable of extracting multiscale localized features, and these features are used to generate more expressive representations. Looking deeper into CNNs and graphs, we find that the keys of CNNs are local connections, shared weights, and the use of multiple layers [64]. These are also of great importance in solving problems in the graph domain, because (1) graphs are the most typical locally connected structure, (2) shared weights reduce the computational cost compared with traditional spectral graph theory [23], and (3) the multilayer structure is the key to dealing with hierarchical patterns, which captures features of various sizes. However, CNNs can only operate on regular Euclidean data like images (2D grids) and text (1D sequences), although these data structures can be regarded as instances of graphs. Therefore, it is natural to look for a generalization of CNNs to graphs. As shown in Fig. 8.12, however, it is hard to define localized convolutional filters and pooling operators on graphs, which hinders the transfer of CNNs from the Euclidean domain to the non-Euclidean domain.

Fig. 8.12 Left: image in Euclidean space. Right: graph in non-Euclidean space [155]

The other motivation comes from network embedding [12, 24, 37, 42, 149]. In the field of graph analysis, traditional machine learning approaches usually rely on hand-engineered features and are limited by their inflexibility and high cost. Following the idea of representation learning and the success of word embedding [81], DeepWalk [93], which is regarded as the first graph embedding method based on representation learning, applies the Skip-gram model [81] to generated random walks. Similar approaches such as node2vec [38], LINE [111], and TADW [136] also achieved breakthroughs. However, these methods suffer from two severe drawbacks [42]. First, no parameters are shared between nodes in the encoder, which leads to computational inefficiency, since the number of parameters grows linearly with the number of nodes. Second, the direct embedding methods lack the ability to generalize, which means they cannot deal with dynamic graphs or generalize to new graphs.

Based on CNNs and network embedding, Graph Neural Networks (GNNs) are proposed to collectively aggregate information from graph structure. Thus, they can model input and/or output consisting of elements and their dependency. Further, the graph neural networks can simultaneously model the diffusion process on the graph with the RNN kernel.

In the rest of this section, we will first introduce several typical variants of graph neural networks such as Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), and Graph Recurrent Networks (GRNs). Then we will introduce several extensions to the original model and finally, we will give some examples of applications that utilize graph neural networks.

8.3.2 Graph Convolutional Networks

Graph Convolutional Networks (GCNs) aim to generalize convolutions to the graph domain. Advances in this direction are often categorized as spectral approaches and spatial (nonspectral) approaches.

8.3.2.1 Spectral Approaches

Spectral approaches work with a spectral representation of the graphs.

Spectral Network. Bruna et al. [11] propose the spectral network. The convolution operation is defined in the Fourier domain by computing the eigendecomposition of the graph Laplacian. The operation can be defined as the multiplication of a signal \({\textbf{x}} \in \mathbb {R}^N\) (a scalar for each node) with a filter \({g}_\theta = \text {diag}(\boldsymbol{\theta })\) parameterized by \(\boldsymbol{\theta } \in \mathbb {R}^N\):

$$\begin{aligned} {g}_{\theta } \star {\textbf{x}} = {U}{g}_{\theta }({\Lambda }){U}^T {\textbf{x}} , \end{aligned}$$
(8.75)

where U is the matrix of eigenvectors of the normalized graph Laplacian \({L} = \textbf{I}_N - {D}^{-\frac{1}{2}} {A} {D}^{-\frac{1}{2}} = {U}{\Lambda }{U}^T\) (D is the degree matrix and A is the adjacency matrix of the graph), with a diagonal matrix of its eigenvalues \({\Lambda }\).
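Eq. 8.75 can be sketched directly with a dense eigendecomposition. The function names are hypothetical, the graph is assumed to be undirected with no isolated nodes, and `theta` holds one free filter coefficient per eigenvalue.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)               # assumes no isolated nodes
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def spectral_conv(A, x, theta):
    """Eq. 8.75 with g_theta = diag(theta): U diag(theta) U^T x."""
    lam, U = np.linalg.eigh(normalized_laplacian(A))  # L = U diag(lam) U^T
    return U @ (theta * (U.T @ x))                    # filter in the Fourier domain
```

Setting every entry of `theta` to 1 returns the signal unchanged, and setting `theta` to the eigenvalues themselves reproduces multiplication by the Laplacian. The eigendecomposition costs \(O(N^3)\) time for a dense matrix, which is the intense computation mentioned in the text.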

This operation results in potentially intense computations and non-spatially localized filters. Henaff et al. [47] attempt to make the spectral filters spatially localized by introducing a parameterization with smooth coefficients.

ChebNet. Hammond et al. [43] suggest that \({g}_{\theta }({\Lambda })\) can be approximated by a truncated expansion in terms of Chebyshev polynomials \({T}_k(x)\) up to Kth order. Thus, the operation is

$$\begin{aligned} {g}_\theta \star \textbf{x} \approx \sum _{k=0}^K {\theta }_k {T}_k(\tilde{{L}})\textbf{x}, \end{aligned}$$
(8.76)

with \(\tilde{{L}}=\frac{2}{\lambda _{max}}{L} - \textbf{I}_N\), where \(\lambda _{max}\) denotes the largest eigenvalue of L. \(\theta \in \mathbb {R}^{K+1}\) is now a vector of Chebyshev coefficients \(\theta _0,\dots ,\theta _K\). The Chebyshev polynomials are defined as \( {T}_k({x}) = 2{x}{T}_{k-1}({x}) - {T}_{k-2}({x})\), with \({T}_0({x})=1\) and \({T}_1({x})={x}\). It can be observed that the operation is K-localized since it is a Kth-order polynomial in the Laplacian. Defferrard et al. [28] propose ChebNet, which uses this K-localized convolution to define a convolutional neural network and thus removes the need to compute the eigenvectors of the Laplacian.

GCN. Kipf and Welling [59] limit the layer-wise convolution operation to \(K=1\) to alleviate the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions. They further approximate \(\lambda _{max} \approx 2\), and the equation simplifies to
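The Chebyshev recurrence makes Eq. 8.76 computable with matrix-vector products only. The following sketch uses a hypothetical function `cheb_conv`; `theta` holds the coefficients \(\theta _0,\dots ,\theta _K\), and \(\lambda _{max}\) is computed exactly here for simplicity, although an estimate could be passed in instead.

```python
import numpy as np

def cheb_conv(L, x, theta, lam_max=None):
    """Eq. 8.76: sum_k theta[k] * T_k(L_tilde) x via the Chebyshev recurrence."""
    if lam_max is None:
        lam_max = np.linalg.eigvalsh(L)[-1]       # largest eigenvalue of L
    L_t = (2.0 / lam_max) * L - np.eye(L.shape[0])
    t_prev, t_curr = x, L_t @ x                   # T_0(L_t) x and T_1(L_t) x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        # T_k = 2 L_t T_{k-1} - T_{k-2}, applied to x without forming T_k itself
        t_prev, t_curr = t_curr, 2.0 * (L_t @ t_curr) - t_prev
        out = out + theta[k] * t_curr
    return out
```

Each term needs one additional product of \(\tilde{L}\) with a vector, so with a sparse Laplacian the cost grows linearly with K and with the number of edges.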

$$\begin{aligned} {g}_{\theta '} \star \textbf{x} \approx \theta _0' \textbf{x} + \theta _1' \left( {L}-\textbf{I}_N\right) \textbf{x} = \theta _0' \textbf{x} - \theta _1' {D}^{-\frac{1}{2}}{A}{D}^{-\frac{1}{2}} \textbf{x}, \end{aligned}$$
(8.77)

with two free parameters \(\theta _0'\) and \(\theta _1'\). After constraining the number of parameters with \(\theta = \theta _0' = -\theta _1'\), we can obtain the following expression:

$$\begin{aligned} {g}_{\theta } \star \textbf{x} \approx \theta \left( \textbf{I}_N + {D}^{-\frac{1}{2}}{A}{D}^{-\frac{1}{2}}\right) \textbf{x} . \end{aligned}$$
(8.78)

Because stacking this operator could lead to numerical instabilities and exploding/vanishing gradients, [59] introduces the renormalization trick: \(\textbf{I}_N + {D}^{-\frac{1}{2}}{A}{D}^{-\frac{1}{2}}\rightarrow \tilde{{D}}^{-\frac{1}{2}}\tilde{{A}}\tilde{{D}}^{-\frac{1}{2}}\), with \(\tilde{{A}} = {A} + \textbf{I}_N\) and \(\tilde{{D}}_{ii} = \sum _j \tilde{{A}}_{ij}\). Finally, [59] generalizes the definition to a signal \(\textbf{X} \in \mathbb {R}^{N \times C}\) with C input channels and F filters for feature maps as follows:

$$\begin{aligned} \textbf{H} = f(\tilde{{D}}^{-\frac{1}{2}}\tilde{{A}}\tilde{{D}}^{-\frac{1}{2}}\textbf{X}\textbf{W}), \end{aligned}$$
(8.79)

where \(\textbf{W} \in \mathbb {R}^{C \times F}\) is a matrix of filter parameters, \(\textbf{H} \in \mathbb {R}^{N \times F}\) is the convolved signal matrix and \(f(\cdot )\) is the activation function.

The GCN layer can be stacked for multiple times so that we have the equation:

$$\begin{aligned} \textbf{H}^{(t)} = f(\tilde{{D}}^{-\frac{1}{2}}\tilde{{A}}\tilde{{D}}^{-\frac{1}{2}}\textbf{H}^{(t-1)}\textbf{W}^{(t-1)}), \end{aligned}$$
(8.80)

where the superscripts t and \(t-1\) denote the layers of the matrices, the initial matrix \(\textbf{H}^{(0)}\) could be \(\textbf{X}\). After L layers, we can use the final embedding matrix \(\textbf{H}^{(L)}\) and a readout function to get the final output matrix \(\textbf{Z}\):

$$\begin{aligned} \textbf{Z} = \text {Readout}(\textbf{H}^{(L)}), \end{aligned}$$
(8.81)

where the readout function can be any machine learning method, such as an MLP.

Finally, as a semi-supervised algorithm, GCN uses the top-layer feature matrix \(\textbf{Z}\), whose column dimension equals the total number of labels, to predict the labels of the nodes. The loss is computed over the nodes with observed labels and can be written as

$$\begin{aligned} \mathscr {L}=-\sum _{l\in y_L}\sum _{f}Y_{lf}\ln \textbf{Z}_{lf}, \end{aligned}$$
(8.82)

where \(y_L\) is the set of node indices that have observed labels. Figure 8.13 shows the algorithm of GCN.
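The loss in Eq. 8.82 can be sketched as a cross-entropy restricted to the labeled rows. The snippet assumes that \(\textbf{Z}\) has already been turned into row-wise probabilities by a softmax, which the text does not state explicitly; the names `softmax`, `masked_cross_entropy`, and `labeled_idx` are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def masked_cross_entropy(Z, Y, labeled_idx):
    """Eq. 8.82: -sum_{l in y_L} sum_f Y_lf ln Z_lf, over labeled nodes only."""
    return -np.sum(Y[labeled_idx] * np.log(Z[labeled_idx]))

# Four nodes, two classes; only nodes 0 and 1 have observed labels.
logits = np.array([[2.0, 0.0],
                   [0.0, 1.0],
                   [0.5, 0.5],
                   [1.0, 3.0]])
Z = softmax(logits)
Y = np.array([[1., 0.],
              [0., 1.],
              [0., 0.],   # unlabeled rows are never read by the loss
              [0., 0.]])
labeled_idx = np.array([0, 1])
loss = masked_cross_entropy(Z, Y, labeled_idx)
```

Unlabeled nodes still influence the loss indirectly, because their features are propagated to labeled nodes through the graph convolution.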

Fig. 8.13 The architecture of graph convolutional network model

8.3.2.2 Spatial Approaches

In all of the spectral approaches mentioned above, the learned filters depend on the Laplacian eigenbasis, which in turn depends on the graph structure. That is, a model trained on a specific structure cannot be directly applied to a graph with a different structure.

Spatial approaches define convolutions directly on the graph, operating on spatially close neighbors. The major challenge of spatial approaches is defining the convolution operation with differently sized neighborhoods and maintaining the local invariance of CNNs.

Neural FPs. Duvenaud et al. [31] uses different weight matrices for nodes with different degrees:

$$\begin{aligned} \begin{aligned} \textbf{x}^{(t)}&= \textbf{h}_v^{(t-1)} + \sum _{i=1}^{|{N}_v|} \textbf{h}_i^{(t-1)} ,\\ \textbf{h}^{(t)}_v&= f ( \textbf{W}_{|{N}_v|}^{(t)}\textbf{x}^{(t)}), \end{aligned} \end{aligned}$$
(8.83)

where \(\textbf{W}_{|{N}_v|}^{(t)}\) is the weight matrix for nodes with degree \(|{N}_v|\) at layer t. The main drawback of the method is that it cannot be applied to large-scale graphs, where nodes take many more distinct degrees and each degree needs its own weight matrix.
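The update in Eq. 8.83 can be illustrated with a small sketch in which the weight matrices are stored in a dictionary keyed by degree. The dictionary `W_by_degree`, the tanh activation, and the star graph are assumptions made for the example only.

```python
import numpy as np

def neural_fp_layer(A, H, W_by_degree, f=np.tanh):
    """Eq. 8.83: x = h_v + sum of neighbor states, then h_v = f(W_{|N_v|} x)."""
    X = H + A @ H                           # self state plus neighbor sum
    degrees = A.sum(axis=1).astype(int)     # |N_v| for every node
    return np.stack([f(W_by_degree[int(d)] @ X[v])
                     for v, d in enumerate(degrees)])

# Star graph: node 0 is connected to nodes 1, 2, and 3.
A = np.zeros((4, 4))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
H[2] = H[1]                                 # two leaves share the same features
# One weight matrix per degree that occurs in the graph (here 1 and 3).
W_by_degree = {1: rng.normal(size=(5, 3)), 3: rng.normal(size=(5, 3))}
H_new = neural_fp_layer(A, H, W_by_degree)
```

The dictionary makes the scaling issue visible: every distinct degree in the graph requires a separate entry.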

In the following description of other models, we use \(h_v^{(t)}\) to denote the hidden state of node v at layer t. \({N}_v\) denotes the neighbor set of node v and \(|N_v|\) denotes the size of the set.

DCNN. Atwood and Towsley [4] proposes the Diffusion-Convolutional Neural Networks (DCNNs). Transition matrices are used to define the neighborhood for nodes in DCNN. For node classification, it has

$$\begin{aligned} \textbf{H} = f \left( {\textbf{W}}^c \odot \overrightarrow{\textbf{P}} \textbf{X} \right) , \end{aligned}$$
(8.84)

where \(\odot \) is the element-wise multiplication and \(\textbf{X}\) is an \(N \times F\) matrix of input features. \(\overrightarrow{\textbf{P}}\) is an \(N \times K \times N\) tensor which contains the power series \(\{\textbf{P},\textbf{P}^2,\ldots ,\textbf{P}^K\}\) of matrix \(\textbf{P}\), and \(\textbf{P}\) is the degree-normalized transition matrix obtained from the graph's adjacency matrix A. Each entity is transformed to a diffusion-convolutional representation, which is a \(K \times F\) matrix defined by K hops of graph diffusion over F features. This representation is then weighted element-wise by the \(K \times F\) weight matrix \({\textbf{W}}^c\) and passed through a nonlinear activation function f. Finally, \(\textbf{H}\) (which is \(N \times K \times F\)) denotes the diffusion representations of each node in the graph.
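The tensor shapes in Eq. 8.84 are easier to follow in code. The sketch below builds the power-series tensor explicitly and applies the element-wise weights; it assumes a graph without isolated nodes (so that every row of A can be normalized) and uses tanh for f. The function name is hypothetical.

```python
import numpy as np

def dcnn_node_representations(A, X, Wc, K, f=np.tanh):
    """Eq. 8.84: H = f(Wc ⊙ P* X), with P* stacking P, P^2, ..., P^K."""
    P = A / A.sum(axis=1, keepdims=True)    # degree-normalized transition matrix
    powers, Pk = [], np.eye(A.shape[0])
    for _ in range(K):
        Pk = Pk @ P
        powers.append(Pk)
    P_star = np.stack(powers, axis=1)       # shape (N, K, N)
    PX = np.einsum('nkm,mf->nkf', P_star, X)  # shape (N, K, F)
    return f(Wc[None, :, :] * PX)           # Wc (K, F) is broadcast over nodes

# Path graph 0 - 1 - 2 with F = 2 features and K = 2 hops.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))
Wc = rng.normal(size=(2, 2))
H_dcnn = dcnn_node_representations(A, X, Wc, K=2)
print(H_dcnn.shape)  # (3, 2, 2)
```

Materializing all K powers of P is only practical for small graphs, since the tensor has N × K × N entries.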

DGCN. Zhuang and Ma [158] proposes the Dual Graph Convolutional Network (DGCN) to consider the local consistency and global consistency of graphs jointly. It uses two convolutional networks to capture the local and global consistency, respectively, and adopts an unsupervised loss to ensemble them. The first convolutional network is the same as Eq. 8.80. The second network replaces the adjacency matrix with the Positive Point-wise Mutual Information (PPMI) matrix:

$$\begin{aligned} \textbf{H}^{(t)} = f ({D}^{-\frac{1}{2}}_P {X}_P {D}^{-\frac{1}{2}}_P {H^{(t-1)}} \textbf{W}), \end{aligned}$$
(8.85)

where \({X}_P\) is the PPMI matrix and \({D}_P\) is the diagonal degree matrix of \({X}_P\).

GraphSAGE. Hamilton et al. [41] proposes the GraphSAGE, a general inductive framework. The framework generates embeddings by sampling and aggregating features from a node’s local neighborhood.

$$\begin{aligned} \begin{aligned} \textbf{h}_{{N}_v}^{(t)}&= {{\text {AGGREGATE}}}^{(t)} (\{\textbf{h}_u^{(t-1)}, \forall u\in {N}_v\} ), \\ \textbf{h}_{v}^{(t)}&= f ({\textbf{W}}^{(t)} [ \textbf{h}_{v}^{(t-1)} ; \textbf{h}_{{N}_v}^{(t)} ] ). \end{aligned} \end{aligned}$$
(8.86)

However, [41] does not utilize the full set of neighbors in Eq. 8.86 but a fixed-size set of neighbors obtained by uniform sampling. [41] suggests three aggregator functions.

  • Mean aggregator. It could be viewed as an approximation of the convolutional operation from the transductive GCN framework [59], so that the inductive version of the GCN variant could be derived by

    $$\begin{aligned} \textbf{h}^{(t)}_v = f\left( \textbf{W}\cdot {{\text {MEAN}}}\left( \{\textbf{h}^{(t-1)}_v\} \cup \{\textbf{h}_u^{(t-1)}| \forall u \in {{N}_v}\}\right) \right) . \end{aligned}$$
    (8.87)

    The mean aggregator is different from other aggregators because it does not perform the concatenation operation which concatenates \(\textbf{h}_v^{(t-1)}\) and \(\textbf{h}_{N_v}^{(t)}\) in Eq. 8.86. That concatenation can be viewed as a form of “skip connection” [46] and can achieve better performance.

  • LSTM aggregator. Hamilton et al. [41] also uses an LSTM-based aggregator which has a larger expressive capability. However, LSTMs process inputs in a sequential manner so that they are not permutation invariant. Hamilton et al. [41] adapts LSTMs to operate on an unordered set by applying them to a random permutation of the node’s neighbors.

  • Pooling aggregator. In the pooling aggregator, each neighbor’s hidden state is fed through a fully connected layer and then a max-pooling operation is applied to the set of the node’s neighbors.

    $$\begin{aligned} \textbf{h}_{{N}_v}^{(t)} = \max (\{f(\textbf{W}_{pool}\textbf{h}^{(t-1)}_{u} + \textbf{b}), \forall u \in {N}_v\}). \end{aligned}$$
    (8.88)

    Note that any symmetric function could be used in place of the max-pooling operation here.
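The sampling step and two of the aggregators above can be sketched as follows. This is an illustrative reading of Eqs. 8.86 to 8.88, not the reference implementation of [41]: the helper names, the ReLU activation, and the choice to sample with replacement when a node has too few neighbors are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sample_neighbors(neighbors, v, size, rng):
    """Uniformly sample a fixed-size neighbor set (with replacement if too few)."""
    nbrs = np.asarray(neighbors[v])
    return rng.choice(nbrs, size=size, replace=len(nbrs) < size)

def mean_update(H, v, sampled, W):
    """Eq. 8.87: mean over {h_v} and the sampled neighbors, no concatenation."""
    stacked = np.vstack([H[v], H[sampled]])
    return relu(W @ stacked.mean(axis=0))

def pooling_aggregate(H, sampled, W_pool, b):
    """Eq. 8.88: element-wise max over transformed neighbor states."""
    return relu(H[sampled] @ W_pool.T + b).max(axis=0)

def concat_update(H, v, h_neigh, W):
    """Eq. 8.86, second line: h_v = f(W [h_v ; h_N(v)])."""
    return relu(W @ np.concatenate([H[v], h_neigh]))

rng = np.random.default_rng(0)
neighbors = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}
H = rng.normal(size=(4, 3))
sampled = sample_neighbors(neighbors, 0, 2, rng)

W_mean = rng.normal(size=(5, 3))
W_pool, b = rng.normal(size=(4, 3)), np.zeros(4)
W_cat = rng.normal(size=(5, 3 + 4))

h_mean = mean_update(H, 0, sampled, W_mean)
h_pool = pooling_aggregate(H, sampled, W_pool, b)
h_cat = concat_update(H, 0, h_pool, W_cat)
```

Because the max is taken over the set of neighbors, reordering `sampled` leaves `h_pool` unchanged, which is the permutation invariance that the LSTM aggregator lacks.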

Other methods. There are still many other spatial methods. The PATCHY-SAN model [86] first extracts exactly k nodes for each node and normalizes them. Then the convolutional operation is applied to the normalized neighborhood. LGCN [35] leverages CNNs as aggregators. It performs max-pooling on nodes’ neighborhood matrices to get top-k feature elements and then applies 1-D CNN to compute hidden representations. Monti et al. [82] proposes a spatial-domain model (MoNet) on non-Euclidean domains which could generalize several previous techniques. The Geodesic CNN (GCNN) [78] and Anisotropic CNN (ACNN) [10] on manifolds or GCN [59] and DCNN [4] on graphs could be formulated as particular instances of MoNet. Our readers can refer to their papers for more details.

8.3.3 Graph Attention Networks

The attention mechanism has been successfully used in many sequence-based tasks such as machine translation [5, 36, 121], machine reading [19], etc. Many works focus on generalizing the attention mechanism to the graph domain.

GAT. Velickovic et al. [122] proposes a Graph Attention Network (GAT) which incorporates the attention mechanism into the propagation step. Specifically, it uses the self-attention strategy and each node’s hidden state is computed by attending over its neighbors.

Velickovic et al. [122] defines a single graph attentional layer and constructs arbitrary graph attention networks by stacking this layer. The layer computes the coefficients in the attention mechanism of the node pair \((i, j)\) by:

$$\begin{aligned} \alpha _{ij} = \frac{\exp \left( \text {LeakyReLU}\left( \textbf{a}^{\top }[{\textbf{W}}\textbf{h}_i^{(t-1)};\textbf{W}\textbf{h}_j^{(t-1)}]\right) \right) }{\sum _{k\in {N}_i} \exp \left( \text {LeakyReLU}\left( \textbf{a}^{\top }[\textbf{W}\textbf{h}_i^{(t-1)};\textbf{W}\textbf{h}_k^{(t-1)}]\right) \right) }, \end{aligned}$$
(8.89)

where \(\alpha _{ij}\) is the attention coefficient of node j to i, \(\textbf{W} \in \mathbb {R}^{F' \times F}\) is the weight matrix of a shared linear transformation which is applied to every node, and \(\textbf{a} \in \mathbb {R}^{2F'}\) is the weight vector. The coefficient is normalized by a softmax function, and the LeakyReLU nonlinearity (with negative input slope 0.2) is applied.

Then the final output features of each node can be obtained by (after applying a nonlinearity f):

$$\begin{aligned} \textbf{h}^{(t)}_i = f\left( \sum _{j\in {N}_i} \alpha _{ij} \textbf{W}\textbf{h}_j^{(t-1)}\right) . \end{aligned}$$
(8.90)

Moreover, the layer utilizes multi-head attention similarly to [121] to stabilize the learning process. It applies K independent attention mechanisms to compute the hidden states and then concatenates their features (or computes the average), resulting in the following two output representations:

$$\begin{aligned} \textbf{h}^{(t)}_i = \Vert _{k=1}^K f\left( \sum _{j\in {N}_i}\alpha _{ij}^k\textbf{W}^k \textbf{h}_j^{(t-1)}\right) , \end{aligned}$$
(8.91)
$$\begin{aligned} \textbf{h}^{(t)}_i = f\left( \frac{1}{K}\sum _{k=1}^K \sum _{j\in {N}_i}\alpha _{ij}^k\textbf{W}^k \textbf{h}_j^{(t-1)}\right) , \end{aligned}$$
(8.92)

where \(\alpha _{ij}^k\) is the normalized attention coefficient computed by the kth attention mechanism and \(\Vert \) is the concatenation operation.

The attention architecture in [122] has several properties: (1) the computation of the node-neighbor pairs is parallelizable thus the operation is efficient; (2) it can deal with nodes that have different degrees by assigning reasonable weights to their neighbors; (3) it can be applied to the inductive learning problems easily.
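A single attention head (Eqs. 8.89 and 8.90) and the concatenation of Eq. 8.91 can be sketched with explicit loops. The code favors readability over speed and makes a few assumptions: each neighbor list includes the node itself, f is tanh, and the names `gat_layer` and `heads` are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, neighbors, W, a, f=np.tanh):
    """One head: attention coefficients (Eq. 8.89) and node update (Eq. 8.90)."""
    Wh = H @ W.T                                    # shared transform, (N, F')
    H_new = np.zeros_like(Wh)
    alphas = {}
    for i, nbrs in neighbors.items():
        scores = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))
                           for j in nbrs])
        scores = scores - scores.max()              # stabilize the softmax
        alpha = np.exp(scores) / np.exp(scores).sum()
        alphas[i] = alpha
        H_new[i] = f(alpha @ Wh[nbrs])              # weighted neighbor sum
    return H_new, alphas

rng = np.random.default_rng(0)
N, F, F_out, K = 4, 3, 2, 3
H = rng.normal(size=(N, F))
# Neighbor lists with self-loops (assumption).
neighbors = {0: [0, 1, 2], 1: [1, 0], 2: [2, 0, 3], 3: [3, 2]}

# Multi-head attention by concatenation (Eq. 8.91): K independent (W, a) pairs.
heads = [gat_layer(H, neighbors,
                   rng.normal(size=(F_out, F)),
                   rng.normal(size=2 * F_out)) for _ in range(K)]
H_multi = np.concatenate([h for h, _ in heads], axis=1)   # (N, K * F')
```

Averaging the heads as in Eq. 8.92 would instead average the weighted sums before applying f, which keeps the output dimension at F'.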

GAAN. Besides GAT, Gated Attention Network (GAAN) [150] also uses the multi-head attention mechanism. However, it uses a self-attention mechanism to gather information from different heads to replace the average operation of GAT.

8.3.4 Graph Recurrent Networks

Several works attempt to use gate mechanisms like GRU [20] or LSTM [48] in the propagation step to relax the limitations induced by the vanilla GNN architecture and improve the effectiveness of the long-term information propagation across the graph. We call these methods Graph Recurrent Networks (GRNs) and we will introduce some variants of GRNs in this subsection.

GGNN. Li et al. [72] proposes the gated graph neural network (GGNN) which uses Gated Recurrent Units (GRUs) in the propagation step. It follows the computation steps from recurrent neural networks for a fixed number of L steps, and then back-propagates through time to compute gradients.

Specifically, the basic recurrence of the propagation model is

$$\begin{aligned} {\textbf{a}}_v^{(t)} =&{} {A}_v^{\top }[{\textbf{h}}_1^{(t-1)} \dots {\textbf{h}}_{N}^{(t-1)}]^{\top }+ {\textbf{b}} \nonumber ,\\ {\textbf{z}}_v^{(t)} =&{} {\text {Sigmoid}}\left( {\textbf{W}}^z{{\textbf{a}}_v^{(t)}}+{\textbf{U}}^z{{\textbf{h}}_v^{(t-1)}}\right) \nonumber ,\\ {\textbf{r}}_v^{(t)} =&{} {\text {Sigmoid}}\left( {\textbf{W}}^r{{\textbf{a}}_v^{(t)}}+{\textbf{U}}^r{{\textbf{h}}_v^{(t-1)}}\right) ,\\ \widetilde{{\textbf{h}}}_v^{(t)} =&{} \tanh \left( {\textbf{W}}{{\textbf{a}}_v^{(t)}}+{\textbf{U}}\left( {{\textbf{r}}_v^{(t)}}\odot {{\textbf{h}}_v^{(t-1)}}\right) \right) \nonumber ,\\ {\textbf{h}}_v^{(t)} =&{} \left( 1-{{\textbf{z}}_v^{(t)}}\right) \odot {{\textbf{h}}_v^{(t-1)}}+{{\textbf{z}}_v^{(t)}}\odot {\widetilde{{\textbf{h}}}_v^{(t)}} \nonumber . \end{aligned}$$
(8.93)

The node v first aggregates message from its neighbors, where \({A}_v\) is the sub-matrix of the graph adjacency matrix A and denotes the connection of node v with its neighbors. Then the hidden state of the node is updated by the GRU-like function using the information from its neighbors and the hidden state from the previous timestep. \({\textbf{a}}\) gathers the neighborhood information of node v, \({\textbf{z}}\) and \({\textbf{r}}\) are the update and reset gates.
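One propagation step of Eq. 8.93 can be sketched for all nodes at once. The sketch simplifies the original model by assuming a single edge type and a plain \(N \times N\) adjacency matrix, so the aggregated message has the same dimension as the hidden state; the parameter dictionary `p` and its key names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(A, H, p):
    """One GRU-style propagation step of Eq. 8.93 (rows of H are node states)."""
    a = A @ H + p["b"]                                   # aggregate neighbor messages
    z = sigmoid(a @ p["Wz"].T + H @ p["Uz"].T)           # update gate
    r = sigmoid(a @ p["Wr"].T + H @ p["Ur"].T)           # reset gate
    h_tilde = np.tanh(a @ p["W"].T + (r * H) @ p["U"].T) # candidate state
    return (1.0 - z) * H + z * h_tilde

rng = np.random.default_rng(0)
d = 4
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H0 = rng.normal(size=(3, d))
p = {name: 0.1 * rng.normal(size=(d, d))
     for name in ["Wz", "Uz", "Wr", "Ur", "W", "U"]}
p["b"] = np.zeros(d)

# Unroll the recurrence for a fixed number of steps, as described in the text.
H = H0
for _ in range(3):
    H = ggnn_step(A, H, p)
```

The same parameters are reused at every step, which is what distinguishes this recurrence from stacking distinct layers.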

LSTMs are also used in a similar way to GRUs in the propagation process based on a tree or a graph.

Tree-LSTM. Tai et al. [109] proposes two extensions to the basic LSTM architecture: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM. Like in standard LSTM units, each Tree-LSTM unit (indexed by v) contains input and output gates \(\textbf{i}_v\) and \({\textbf{o}}_v\), a memory cell \({\textbf{c}}_v\) and hidden state \({\textbf{h}}_v\). The Tree-LSTM unit replaces the single forget gate by a forget gate \(\textbf{f}_{vk}\) for each child k, allowing node v to select information from its children accordingly. The equations of the Child-Sum Tree-LSTM are

$$\begin{aligned} \widetilde{\textbf{h}}_v^{t-1}&= \sum _{k \in {N}_v} {\textbf{h}}_k^{t-1} \nonumber ,\\ \textbf{i}_v^{t}&= {\text {Sigmoid}} \Big ( {\textbf{W}}^i \textbf{x}_v^t + {\textbf{U}}^i \widetilde{\textbf{h}}_v^{t-1} + {\textbf{b}}^i \Big ) \nonumber ,\\ \textbf{f}_{vk}^t&= {\text {Sigmoid}} \Big ( {\textbf{W}}^f \textbf{x}_v^t + {\textbf{U}}^f {\textbf{h}}_k^{t-1} + {\textbf{b}}^f \Big ) \nonumber ,\\ \textbf{o}_v^{t}&= {\text {Sigmoid}} \Big ( {\textbf{W}}^o \textbf{x}_v^t + {\textbf{U}}^o \widetilde{{\textbf{h}}}_v^{t-1} + {\textbf{b}}^o \Big ) ,\\ \textbf{u}_v^t&= \tanh \Big ({\textbf{W}}^u \textbf{x}_v^t + {\textbf{U}}^u \widetilde{{\textbf{h}}}_v^{t-1} + {\textbf{b}}^u \Big ) \nonumber ,\\ {\textbf{c}}_v^t&= \textbf{i}_v^t \odot \textbf{u}_v^t + \sum _{k \in {N}_v} \textbf{f}_{vk}^t \odot {\textbf{c}}_k^{t-1} \nonumber ,\\ {\textbf{h}}_v^t&= \textbf{o}_v^t \odot \tanh ({\textbf{c}}_v^t) \nonumber , \end{aligned}$$
(8.94)

where \(\textbf{x}_v^t\) is the input vector at time t in the standard LSTM setting.
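The Child-Sum update in Eq. 8.94 for a single node can be sketched as below. Children are passed as stacked arrays, and a leaf is represented by arrays with zero rows; the parameter names in `p` and the helper `init_params` are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def child_sum_tree_lstm(x, child_h, child_c, p):
    """Eq. 8.94 for one node; child_h and child_c have shape (num_children, d)."""
    h_sum = child_h.sum(axis=0)                                # summed child states
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_sum + p["bi"])
    f = sigmoid(p["Wf"] @ x + child_h @ p["Uf"].T + p["bf"])   # one gate per child
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_sum + p["bo"])
    u = np.tanh(p["Wu"] @ x + p["Uu"] @ h_sum + p["bu"])
    c = i * u + (f * child_c).sum(axis=0)
    h = o * np.tanh(c)
    return h, c

def init_params(d_in, d, rng):
    p = {}
    for g in "ifou":
        p["W" + g] = 0.1 * rng.normal(size=(d, d_in))
        p["U" + g] = 0.1 * rng.normal(size=(d, d))
        p["b" + g] = np.zeros(d)
    return p

rng = np.random.default_rng(0)
d_in, d = 3, 4
p = init_params(d_in, d, rng)

# Two leaves (no children), then their parent.
no_children = np.zeros((0, d))
h1, c1 = child_sum_tree_lstm(rng.normal(size=d_in), no_children, no_children, p)
h2, c2 = child_sum_tree_lstm(rng.normal(size=d_in), no_children, no_children, p)
h_root, c_root = child_sum_tree_lstm(rng.normal(size=d_in),
                                     np.stack([h1, h2]), np.stack([c1, c2]), p)
```

Because the children enter only through a sum and through per-child forget gates with shared weights, the unit is insensitive to the order of the children.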

In a specific case, if each node’s number of children is at most K and these children can be ordered from 1 to K, then the N-ary Tree-LSTM can be applied. For node v, \({\textbf{h}}_{vk}^t\) and \({\textbf{c}}_{vk}^t\) denote the hidden state and memory cell of its kth child at time t respectively. The transition equations are the following:

$$\begin{aligned} \textbf{i}_v^t&= {\text {Sigmoid}} \Big ( {\textbf{W}}^i \textbf{x}_v^t + \sum _{l=1}^K {\textbf{U}}_l^i {\textbf{h}}_{vl}^{t-1} + {\textbf{b}}^i \Big ) \nonumber , \\ \textbf{f}_{vk}^t&= {\text {Sigmoid}} \Big ( {\textbf{W}}^f \textbf{x}_v^t + \sum _{l=1}^K {\textbf{U}}_{kl}^f {\textbf{h}}_{vl}^{t-1} + {\textbf{b}}^f \Big ) \nonumber ,\\ \textbf{o}_v^t&= {\text {Sigmoid}} \Big ( {\textbf{W}}^o \textbf{x}_v^t + \sum _{l=1}^K {\textbf{U}}_l^o {\textbf{h}}_{vl}^{t-1} + {\textbf{b}}^o \Big ), \\ \textbf{u}_v^t&= \tanh \Big ( {\textbf{W}}^u \textbf{x}_v^t + \sum _{l=1}^K {\textbf{U}}_l^u {\textbf{h}}_{vl}^{t-1} +{\textbf{b}}^u \Big ) \nonumber ,\\ {\textbf{c}}_v^t&= \textbf{i}_v^t \odot \textbf{u}_v^t + \sum _{l=1}^K \textbf{f}_{vl}^t \odot {\textbf{c}}_{vl}^{t-1} \nonumber ,\\ {\textbf{h}}_v^t&= \textbf{o}_v^t \odot \tanh ({\textbf{c}}_v^t) \nonumber . \end{aligned}$$
(8.95)

Compared to the Child-Sum Tree-LSTM, the N-ary Tree-LSTM introduces separate parameters for each child k. These parameters allow the model to learn more fine-grained representations conditioning on each node’s children.

Graph LSTM. The two types of Tree-LSTMs can be easily adapted to the graph. The graph-structured LSTM in [148] is an example of the N-ary Tree-LSTM applied to the graph. However, it is a simplified version since each node in the graph has at most 2 incoming edges (from its parent and sibling predecessor). Peng et al. [92] proposes another variant of the Graph LSTM based on the relation extraction task. The main difference between graphs and trees is that edges of graphs have their labels, and [92] utilizes different weight matrices to represent different labels.

$$\begin{aligned} \textbf{i}_v^t&= {\text {Sigmoid}} \Big ({\textbf{W}}^i \textbf{x}_v^t + \sum _{k \in {N}_v} {\textbf{U}}_{m(v,k)}^i {\textbf{h}}_k^{t-1} + {\textbf{b}}^i \Big ) \nonumber ,\\ \textbf{f}_{vk}^t&= {\text {Sigmoid}} \Big ({\textbf{W}}^f \textbf{x}_v^t + {\textbf{U}}_{m(v,k)}^f {\textbf{h}}_k^{t-1} + {\textbf{b}}^f \Big ) \nonumber ,\\ \textbf{o}_v^t&= {\text {Sigmoid}} \Big ({\textbf{W}}^o \textbf{x}_v^t + \sum _{k \in {N}_v} {\textbf{U}}_{m(v,k)}^o {\textbf{h}}_{k}^{t-1} + {\textbf{b}}^o \Big ) ,\\ \textbf{u}_v^t&= \tanh \Big ({\textbf{W}}^u \textbf{x}_v^t + \sum _{k \in {N}_v} {\textbf{U}}_{m(v,k)}^u {\textbf{h}}_{k}^{t-1} + {\textbf{b}}^u \Big ) \nonumber ,\\ \mathbf {\textbf{c}}_v^t&= \textbf{i}_v^t \odot \textbf{u}_v^t + \sum _{k \in {N}_v} \textbf{f}_{vk}^t \odot {\textbf{c}}_{k}^{t-1} \nonumber ,\\ \mathbf {\textbf{h}}_v^t&= \textbf{o}_v^t \odot \tanh ({\textbf{c}}_v^t) \nonumber , \end{aligned}$$
(8.96)

where \(m(v,k)\) denotes the edge label between node v and k.

Besides, [74] proposes a Graph LSTM network to address the semantic object parsing task. It uses the confidence-driven scheme to adaptively select the starting node and determine the node updating sequence. It follows the same idea of generalizing the existing LSTMs into the graph-structured data but has a specific updating sequence while the methods we mentioned above are agnostic to the order of nodes.

Sentence LSTM. Zhang et al. [152] proposes the Sentence LSTM (S-LSTM) for improving text encoding. It converts text into a graph and utilizes the Graph LSTM to learn the representation. The S-LSTM shows strong representation power in many NLP problems.

8.3.5 Extensions

In this subsection, we will talk about some extensions of graph neural networks.

8.3.5.1 Skip Connection

Many applications unroll or stack the graph neural network layer aiming to achieve better results as more layers (i.e., k layers) make each node aggregate more information from neighbors k hops away. However, it has been observed in many experiments that deeper models could not improve the performance and deeper models could even perform worse [59]. This is mainly because more layers could also propagate the noisy information from an exponentially increasing number of expanded neighborhood members.

A straightforward method to address the problem, the residual network [45], can be found from the computer vision community. Nevertheless, even with residual connections, GCNs with more layers do not perform as well as the 2-layer GCN on many datasets [59].

Highway Network. Rahimi et al. [96] borrows ideas from the highway network [159] and uses layer-wise gates to build a Highway GCN. The input of each layer is multiplied by the gating weights and then summed with the output:

$$\begin{aligned} \begin{aligned} {T}({\textbf{h}}^{(t)})&= {\text {Sigmoid}} \left( \textbf{W}^{(t)} \textbf{h}^{(t)} + \textbf{b}^{(t)} \right) ,\\ \textbf{h}^{(t+1)}&= \textbf{h}^{(t+1)} \odot {T}(\textbf{h}^{(t)}) + \textbf{h}^{(t)} \odot (1 - {T}(\textbf{h}^{(t)})). \end{aligned} \end{aligned}$$
(8.97)

By adding the highway gates, the performance peaks at four layers in a specific problem discussed in [96]. The Column Network (CLN) proposed in [94] also utilizes the highway network. However, it has a different function to compute the gating weights.
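The gating in Eq. 8.97 can be sketched as a convex combination of a layer's input and its output. Here `h_layer_out` stands for the output of the graph convolution applied to `h_in` (the term written as \(\textbf{h}^{(t+1)}\) on the right-hand side of Eq. 8.97), and both must share the same dimension; the function name and toy values are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_gate(h_in, h_layer_out, W, b):
    """Eq. 8.97: mix the layer output and the layer input with gate T(h_in)."""
    T = sigmoid(h_in @ W.T + b)             # gating weights in (0, 1)
    return h_layer_out * T + h_in * (1.0 - T)

rng = np.random.default_rng(0)
h_in = rng.normal(size=(3, 4))              # node states entering the layer
h_layer_out = rng.normal(size=(3, 4))       # stand-in for the GCN layer output
W = 0.1 * rng.normal(size=(4, 4))

h_mixed = highway_gate(h_in, h_layer_out, W, np.zeros(4))
# A strongly negative bias closes the gate, so the input is carried through.
h_carry = highway_gate(h_in, h_layer_out, W, np.full(4, -50.0))
```

Initializing the bias to a negative value is a common way to let deep stacks start close to the identity mapping, although the text does not say whether [96] does so.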

Jumping Knowledge Network. Xu et al. [134] studies properties and resulting limitations of neighborhood aggregation schemes. It proposes the Jumping Knowledge Network which could learn adaptive, structure-aware representations. The Jumping Knowledge Network selects from all of the intermediate representations (which "jump" to the last layer) for each node at the last layer, which enables the model to select effective neighborhood information for each node. Xu et al. [134] uses three approaches of concatenation, max-pooling, and LSTM-attention in the experiments to aggregate information. The Jumping Knowledge Network performs well on the experiments in social, bioinformatics, and citation networks. It can also be combined with models like Graph Convolutional Networks, GraphSAGE, and Graph Attention Networks to improve their performance.

8.3.5.2 Hierarchical Pooling

In the area of computer vision, a convolutional layer is usually followed by a pooling layer to get more general features. Similar to these pooling layers, much work focuses on designing hierarchical pooling layers on graphs. Complicated and large-scale graphs usually carry rich hierarchical structures that are of great importance for node-level and graph-level classification tasks.

To explore such inner features, Edge-Conditioned Convolution (ECC) [106] designs its pooling module with a recursive downsampling operation. The downsampling method is based on splitting the graph into two components by the sign of the largest eigenvector of the Laplacian.

DIFFPOOL [144] proposes a learnable hierarchical clustering module by training an assignment matrix in each layer:

$$\begin{aligned} \textbf{S}^{(l)} = {\text {Softmax}}({{\text {GNN}}_{l,pool}}(A^{(l)}, \textbf{V}^{(l)})), \end{aligned}$$
(8.98)

where \(\textbf{V}^{(l)}\) is the node feature matrix and \(A^{(l)}\) is the coarsened adjacency matrix of layer l.
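A minimal sketch of one pooling step follows. Eq. 8.98 only defines the assignment matrix; the coarsening rules used here (\(\textbf{S}^{\top }\textbf{Z}\) for features and \(\textbf{S}^{\top }A\textbf{S}\) for the adjacency matrix) are taken from the DIFFPOOL paper [144] rather than from the text above, and a single linear propagation step stands in for the pooling GNN. All names are hypothetical.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def diffpool_step(A, V, W_pool, W_embed):
    """Soft-assign n nodes to c clusters, then coarsen features and adjacency."""
    A_self = A + np.eye(A.shape[0])         # simple stand-in for a GNN layer
    S = softmax(A_self @ V @ W_pool)        # Eq. 8.98, shape (n, c)
    Z = np.tanh(A_self @ V @ W_embed)       # node embeddings, shape (n, d)
    V_next = S.T @ Z                        # cluster features, shape (c, d)
    A_next = S.T @ A @ S                    # coarsened adjacency, shape (c, c)
    return S, V_next, A_next

rng = np.random.default_rng(0)
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
V = rng.normal(size=(4, 3))
S, V_next, A_next = diffpool_step(A, V,
                                  W_pool=rng.normal(size=(3, 2)),
                                  W_embed=rng.normal(size=(3, 5)))
```

Since each row of S sums to one, the total edge weight is preserved by the coarsening, which gives a quick sanity check on an implementation.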

8.3.5.3 Neighborhood Sampling

The original graph convolutional neural network has several drawbacks. Specifically, GCN requires the full graph Laplacian, which is computationally expensive for large graphs. Furthermore, the embedding of a node at layer L is computed recursively from the embeddings of all its neighbors at layer \(L-1\). Therefore, the receptive field of a single node grows exponentially with respect to the number of layers, so computing the gradient for a single node is costly. Finally, GCN is trained independently for a fixed graph, which lacks the ability for inductive learning.

GraphSAGE [41] is a comprehensive improvement of the original GCN. To solve the problems mentioned above, GraphSAGE replaces the full graph Laplacian with learnable aggregation functions, which are crucial to perform message passing and generalize to unseen nodes. As shown in Eq. 8.86, it first aggregates neighborhood embeddings, concatenates the result with the target node’s embedding, and then propagates to the next layer. With learned aggregation and propagation functions, GraphSAGE could generate embeddings for unseen nodes. Also, GraphSAGE uses neighbor sampling to alleviate the receptive field expansion.

PinSage [143] proposes an importance-based sampling method. By simulating random walks starting from target nodes, this approach chooses the top T nodes with the highest normalized visit counts.

FastGCN [16] further improves the sampling algorithm. Instead of sampling neighbors for each node, FastGCN directly samples the receptive field for each layer. FastGCN uses importance sampling, in which the importance factor is calculated as below:

$$\begin{aligned} q(v)\propto \frac{1}{|{N}_v|}\sum _{u\in {N}_v}\frac{1}{|{N}_u|}. \end{aligned}$$
(8.99)
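The importance factor in Eq. 8.99 can be computed for all nodes with two matrix-vector operations. The sketch assumes an unweighted graph without isolated nodes; the function name and the layer size are illustrative.

```python
import numpy as np

def importance_distribution(A):
    """Eq. 8.99: q(v) ∝ (1 / |N_v|) * sum_{u in N_v} 1 / |N_u|, normalized."""
    inv_deg = 1.0 / A.sum(axis=1)           # 1 / |N_v| for every node
    q = inv_deg * (A @ inv_deg)             # neighbor sum of inverse degrees
    return q / q.sum()

# Path graph 0 - 1 - 2: the middle node has the largest importance.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
q = importance_distribution(A)
print(q)  # [0.25 0.5  0.25]

# Layer-wise sampling: draw one shared set of nodes for the whole layer.
rng = np.random.default_rng(0)
layer_nodes = rng.choice(A.shape[0], size=2, replace=True, p=q)
```

In contrast to per-node neighbor sampling, the number of sampled nodes per layer is fixed here, so the cost no longer grows exponentially with depth.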

Adapt. In contrast to fixed sampling methods above, [51] introduces a parameterized and trainable sampler to perform layer-wise sampling conditioned on the former layer. Furthermore, this adaptive sampler could find optimal sampling importance and reduce variance simultaneously.

8.3.5.4 Various Graph Types

In the original GNN [101], the input graph consists of nodes with label information and undirected edges, which is the simplest graph format. However, there are many variants of graphs in the world. In the following, we will introduce some methods designed to model different kinds of graphs.

Directed Graphs. The first variant of the graph is the directed graph. An undirected edge, which can be treated as two directed edges, shows that there is a relation between two nodes. However, directed edges can bring more information than undirected edges. For example, in a knowledge graph where the edge starts from the head entity and ends at the tail entity, the head entity is the parent class of the tail entity, which suggests we should treat the information propagation process from parent classes and child classes differently. DGP [55] uses two kinds of weight matrices, \(\textbf{W}_p\) and \(\textbf{W}_c\), to incorporate more precise structural information. The propagation rule is shown as follows:

$$\begin{aligned} \textbf{H}^{(t)} = f({D}^{-1}_p{A}_p f({D}^{-1}_c{A}_c\textbf{H}^{(t-1)}\textbf{W}_c)\textbf{W}_p), \end{aligned}$$
(8.100)

where \({D}^{-1}_p{A}_p\) and \({D}^{-1}_c{A}_c\) are the normalized adjacency matrices for parents and children, respectively.
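The two-phase propagation of Eq. 8.100 can be sketched as follows. The convention that row i of `A_c` marks the children of node i, the added self-loops, and the ReLU activation are assumptions made for this example and are not specified in the text.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def row_normalize(A):
    """D^{-1} A, leaving all-zero rows as zeros."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1, keepdims=True)
    return np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)

def dgp_layer(A_p, A_c, H, W_c, W_p, f=relu):
    """Eq. 8.100: propagate along child links first, then along parent links."""
    children_pass = f(row_normalize(A_c) @ H @ W_c)
    return f(row_normalize(A_p) @ children_pass @ W_p)

# Small hierarchy 0 -> 1 -> 2 (parent -> child), with self-loops.
A_c = np.array([[1., 1., 0.],
                [0., 1., 1.],
                [0., 0., 1.]])
A_p = A_c.T                                 # parent links reverse the child links
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
W_c = rng.normal(size=(4, 4))
W_p = rng.normal(size=(4, 2))
H_next = dgp_layer(A_p, A_c, H, W_c, W_p)
```

Keeping separate weight matrices for the two directions lets the model treat information arriving from ancestors differently from information arriving from descendants.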

Heterogeneous Graphs. The second variant of the graph is a heterogeneous graph, where there are several kinds of nodes. The simplest way to process the heterogeneous graph is to convert the type of each node to a one-hot feature vector which is concatenated with the original feature.

Moreover, GraphInception [151] introduces the concept of metapath into the propagation on the heterogeneous graph. With metapath, we can group the neighbors according to their node types and distances. For each neighbor group, GraphInception treats it as a subgraph in a homogeneous graph to do propagation and concatenates the propagation results from different homogeneous graphs to form a collective node representation. Recently, [128] proposes the Heterogeneous graph Attention Network (HAN) which utilizes node-level and semantic-level attention. The model has the ability to consider node importance and metapaths simultaneously.

Graphs with Edge Information. In another variant of graph, each edge has additional information like the weight or the type of the edge. We list two ways to handle this kind of graph:

Firstly, we can convert the graph to a bipartite graph where the original edges also become nodes and each original edge is split into two new edges, one between the edge node and the begin node and one between the edge node and the end node. The encoder of G2S [7] uses the following aggregation function for neighbors:

$$\begin{aligned} \textbf{h}_v^{(t)} = f \left( \frac{1}{|{N}_v|}\sum _{u \in {N}_v} \textbf{W}_{r}\left( \textbf{r}_v^{(t)} \odot \textbf{h}_u^{(t-1)}\right) + \textbf{b}_{r}\right) , \end{aligned}$$
(8.101)

where \(\textbf{W}_{r}\) and \(\textbf{b}_{r}\) are the propagation parameters for different types of edges (relations).

Secondly, we can adopt different weight matrices for the propagation of different kinds of edges. When the number of relations is huge, r-GCN [102] introduces two kinds of regularization to reduce the number of parameters for modeling a large number of relations: basis- and block-diagonal-decomposition. With the basis decomposition, each \({W}_r\) is defined as follows:

$$\begin{aligned} \textbf{W}_r = \sum _{b=1}^B \alpha _{rb} \textbf{M}_b. \end{aligned}$$
(8.102)

Here each \(\textbf{W}_r\) is a linear combination of basis transformations \(\textbf{M}_b \in \mathbb {R}^{d_{in} \times d_{out}}\) with coefficients \(\alpha _{rb}\). In the block-diagonal decomposition, r-GCN defines each \(\textbf{W}_r\) through the direct sum over a set of low-dimensional matrices, which needs more parameters than the first one.
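The basis decomposition in Eq. 8.102 and its effect on the parameter count can be illustrated with a short sketch. The array names `alpha` and `M` follow the symbols in the text, while the sizes R, B, \(d_{in}\), and \(d_{out}\) are arbitrary example values.

```python
import numpy as np

def basis_weights(alpha, M):
    """Eq. 8.102: W_r = sum_b alpha_rb M_b for every relation r.

    alpha has shape (R, B) and M has shape (B, d_in, d_out).
    """
    return np.einsum('rb,bio->rio', alpha, M)

R, B, d_in, d_out = 50, 2, 4, 3
rng = np.random.default_rng(0)
alpha = rng.normal(size=(R, B))             # per-relation coefficients
M = rng.normal(size=(B, d_in, d_out))       # shared basis transformations
W = basis_weights(alpha, M)                 # shape (R, d_in, d_out)

# Parameters with and without the decomposition.
params_full = R * d_in * d_out              # 600
params_basis = B * d_in * d_out + R * B     # 124
```

Only the coefficients depend on the relation, so adding a new relation costs B extra parameters instead of \(d_{in} \times d_{out}\).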

Dynamic Graphs. Another variant of the graph is the dynamic graph, which has a static graph structure and dynamic input signals. To capture both kinds of information, DCRNN [71] and STGCN [147] first collect spatial information by GNNs, then feed the outputs into a sequence model such as a sequence-to-sequence model or CNNs. In contrast, Structural-RNN [53] and ST-GCN [135] collect spatial and temporal messages at the same time. They extend the static graph structure with temporal connections so they can apply traditional GNNs on the extended graphs.

8.3.6 Applications

Graph neural networks have been explored in a wide range of problem domains across supervised, semi-supervised, unsupervised, and reinforcement learning settings. In this section, we simply divide the applications into three scenarios: (1) Structural scenarios where the data has explicit relational structure, such as physical systems, molecular structures, and knowledge graphs; (2) Nonstructural scenarios where the relational structure is not explicit, such as images and text; (3) Other application scenarios such as generative models and combinatorial optimization problems. Note that we only list several representative applications instead of providing an exhaustive list. We further give some examples of GNNs in the task of fact verification and relation extraction. Figure 8.14 illustrates some application scenarios of graph neural networks.

Fig. 8.14 Application scenarios of graph neural network [155]

8.3.6.1 Structural Scenarios

In the following, we will introduce GNN’s applications in structural scenarios, where the data are naturally organized in a graph structure. For example, GNNs are widely used in social network prediction [41, 59], traffic prediction [25, 96], recommender systems [120, 143], and graph representation [144]. Specifically, we discuss how to model real-world physical systems with object-relationship graphs, how to predict the chemical properties of molecules and biological interaction properties of proteins, and the applications of GNNs on knowledge graphs.

Physics. Modeling real-world physical systems is one of the most fundamental aspects of understanding human intelligence. By representing objects as nodes and relations as edges, we can perform GNN-based reasoning about objects, relations, and physics in a simplified but effective way.

Battaglia et al. [6] proposes Interaction Networks to make predictions and inferences about various physical systems. Objects and relations are first fed into the model as input. Then the model considers the interactions and physical dynamics to predict new states. They separately model relation-centric and object-centric models, making it easier to generalize across different systems. In CommNet [107], interactions are not modeled explicitly. Instead, an interaction vector is obtained by averaging all other agents’ hidden vectors. VAIN [49] further introduces attentional methods into the agent interaction process, which preserves both the complexity advantages and the computational efficiency.

Visual Interaction Networks [132] can make predictions from pixels. It learns a state code from two consecutive input frames for each object. Then, after adding their interaction effect by an Interaction Net block, the state decoder converts state codes to the next step’s state.

Sanchez-Gonzalez et al. [99] proposes a Graph Network based model which could either perform state prediction or inductive inference. The inference model takes partially observed information as input and constructs a hidden graph for implicit system classification.

Molecular Fingerprints. Molecular fingerprints are feature vectors representing molecules, which are important in computer-aided drug design. Traditional molecular fingerprint discovery relies on hand-crafted heuristic methods. GNNs can provide more flexible approaches for better fingerprints.

Duvenaud et al. [31] propose neural graph fingerprints (Neural FPs) that calculate substructure feature vectors via GCN and sum to get overall representation. The aggregation function is introduced in Eq. 8.83.

Kearnes et al. [56] further explicitly models atom and atom pairs independently to emphasize atom interactions. It introduces edge representation \(\textbf{e}^{(t)}_{uv}\) instead of aggregation function, i.e., \(\textbf{h}_{{N}_v}^{(t)} = \sum _{u\in {N_v}}\textbf{e}^{(t)}_{uv}\). The node update function is

$$\begin{aligned} \textbf{h}_{v}^{(t+1)} = \text {ReLU}(\textbf{W}_1[\text {ReLU}(\textbf{W}_0\textbf{h}^{(t)}_u); \textbf{h}_{{N}_v}^{(t)}]), \end{aligned}$$
(8.103)

while the edge update function is

$$\begin{aligned} \textbf{e}_{uv}^{(t+1)} = \text {ReLU}(\textbf{W}_4[\text {ReLU}(\textbf{W}_2\textbf{e}_{uv}^{(t)}); \text {ReLU}(\textbf{W}_3[\textbf{h}_{v}^{(t)};\textbf{h}_{u}^{(t)}])]). \end{aligned}$$
(8.104)

Protein Interface Prediction. Fout et al. [33] focuses on the task named protein interface prediction, which is a challenging problem with critical applications in drug discovery and design. The proposed GCN-based method learns the residue representations of the ligand and receptor proteins separately and merges them for pair-wise classification.

GNN can also be used in biomedical engineering. With Protein-Protein Interaction Network, [97] leverages graph convolution and relation network for breast cancer subtype classification. Zitnik et al. [160] also suggest a GCN-based model for polypharmacy side effects prediction. Their work models the drug and protein interaction network and separately deals with edges in different types.

Knowledge Graph. Hamaguchi et al. [40] utilizes GNNs to solve the Out-Of-Knowledge-Base (OOKB) entity problem in Knowledge Base Completion (KBC). The OOKB entities in [40] are directly connected to the existing entities thus the embeddings of OOKB entities can be aggregated from the existing entities. The method achieves satisfying performance both in the standard KBC setting and the OOKB setting.

Wang et al. [130] utilize GCNs to solve the cross-lingual knowledge graph alignment problem. The model embeds entities from different languages into a unified embedding space and aligns them based on the embedding similarity.
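As a rough illustration of alignment in a unified embedding space, the sketch below matches each source-language entity to its most similar target-language entity. The entity names and vectors are made up for illustration, the function names are hypothetical, and cosine similarity is an assumption of this sketch; the text above only states that alignment is based on embedding similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def align(source, target):
    """Map every source entity to the target entity with the most similar embedding."""
    return {
        s: max(target, key=lambda t: cosine(vec, target[t]))
        for s, vec in source.items()
    }


# Hypothetical entity embeddings that already live in one shared space.
source = {"en:Paris": [0.9, 0.1], "en:Berlin": [0.1, 0.9]}
target = {"fr:Paris": [0.8, 0.2], "fr:Berlin": [0.2, 0.8]}
alignment = align(source, target)
print(alignment)
```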

8.3.6.2 Nonstructural Scenarios

In this section, we discuss applications in nonstructural scenarios such as image, text, programming source code [1, 72], and multi-agent systems [49, 58, 107]. Due to the length limit, we only give a detailed introduction to the first two scenarios. Roughly, there are two ways to apply graph neural networks to nonstructural scenarios: (1) Incorporate structural information from other domains to improve the performance, for example, using information from knowledge graphs to alleviate the zero-shot problems in image tasks; (2) Infer or assume the relational structure in the scenario and then apply the model to solve the problems defined on graphs, such as the method in [152], which models text as graphs.

Image classification. Image classification is a fundamental and essential task in the field of computer vision, which attracts much attention and has many famous datasets like ImageNet [62]. Recent progress in image classification benefits from big data and the strong power of GPU computation, which allows us to train a classifier without extracting structural information from images. However, zero-shot and few-shot learning become more and more popular in the field of image classification, because most models can achieve similar performance with enough data. There are several works leveraging graph neural networks to incorporate structural information in image classification.

First, knowledge graphs can be used as extra information to guide zero-shot recognition [55, 129]. Wang et al. [129] build a knowledge graph where each node corresponds to an object category and take the word embeddings of nodes as input for predicting the classifiers of different categories. As over-smoothing occurs when the convolution architecture becomes deep, the 6-layer GCN used in [129] washes out much useful information in the representation. To solve the smoothing problem in the propagation of GCN, [55] uses a single-layer GCN with a larger neighborhood, which includes both one-hop and multi-hop nodes in the graph. This approach proves effective in building a zero-shot classifier from existing ones.

Besides the knowledge graph, the similarity between images in the dataset is also helpful for few-shot learning [100]. Satorras and Estrach [100] propose to build a weighted fully connected image network based on the similarity and perform message passing in the graph for few-shot recognition. As most knowledge graphs are too large for reasoning, [77] selects some related entities to build a subgraph based on the result of object detection and applies GGNN to the extracted graph for prediction. Besides, [66] proposes to construct a new knowledge graph where the entities are all the categories. It defines three types of label relations (super-subordinate, positive correlation, and negative correlation) and propagates the confidence of labels in the graph directly.

Visual reasoning. Computer vision systems usually need to perform reasoning by incorporating both spatial and semantic information. So it is natural to generate graphs for reasoning tasks.

A typical visual reasoning task is Visual Question Answering (VQA). [114] constructs an image scene graph and a question syntactic graph, respectively, and then applies GGNN to train the embeddings for predicting the final answer. Besides spatial connections among objects, [87] builds relational graphs conditioned on the questions. With knowledge graphs, [83, 131] can perform finer relation exploration and a more interpretable reasoning process.

Other applications of visual reasoning include object detection, interaction detection, and region classification. In object detection [39, 50], GNNs are used to calculate RoI features; in interaction detection [53, 95], GNNs are message-passing tools between humans and objects; in region classification [18], GNNs perform reasoning on graphs that connect regions and classes.

Text Classification. Text classification is an essential and classical problem in natural language processing. The classical GCN models [4, 28, 41, 47, 59, 82] and GAT model [122] are applied to solve the problem, but they only use the structural information between the documents and they do not use much text information.

Peng et al. [91] propose a graph-CNN-based deep learning model. It first converts texts into graphs of words, and then the graph convolution operations in [347] are applied to the word graph. Zhang et al. [152] propose the Sentence LSTM to encode text. The whole sentence is represented by a single state that contains a global sentence-level state and several substates for individual words. The global sentence-level representation is used for classification tasks.

These methods either view a document or a sentence as a graph of word nodes or rely on the document citation relation to construct the graph. Yao et al. [142] regard the documents and words as nodes to construct the corpus graph (hence a heterogeneous graph) and use the Text GCN to learn embeddings of words and documents. Sentiment classification can also be regarded as a text classification problem, and a Tree-LSTM approach is proposed by [109].

Sequence Labeling. As each node in GNNs has its hidden state, we can utilize the hidden state to address the sequence labeling problem if we consider every word in the sentence as a node. Zhang et al. [152] utilize the Sentence LSTM to label the sequence. They conduct experiments on POS-tagging and NER tasks and achieve promising performance.

Semantic role labeling is another sequence labeling task. Marcheggiani and Titov [76] propose a Syntactic GCN to solve the problem. The Syntactic GCN, which operates on a directed graph with labeled edges, is a special variant of the GCN [59]. It uses edge-wise gates that enable the model to regulate the contribution of each dependency edge. Syntactic GCNs over syntactic dependency trees are used as sentence encoders to learn latent feature representations of words in the sentence. Marcheggiani and Titov [76] also reveal that GCNs and LSTMs are functionally complementary in the task.

8.3.6.3 Other Scenarios

Besides structural and nonstructural scenarios, there are some other scenarios where graph neural networks play an important role. In this subsection, we will introduce generative graph models and combinatorial optimization with GNNs.

Generative Models. Generative models for real-world graphs have drawn significant attention for their essential applications, including modeling social interactions, discovering new chemical structures, and constructing knowledge graphs. As deep learning methods have a powerful ability to learn the implicit distribution of graphs, there has recently been a surge in neural graph generative models.

NetGAN [104] is one of the first works to build a neural graph generative model, which generates graphs via random walks. It transforms the problem of graph generation into the problem of walk generation: it takes the random walks from a specific graph as input and trains a walk generative model using a GAN architecture. While the generated graph preserves essential topological properties of the original graph, the number of nodes cannot change during generation and stays the same as in the original graph. GraphRNN [146] generates the adjacency matrix of a graph by generating the adjacency vector of each node step by step, and can therefore output networks with different numbers of nodes.

Instead of generating the adjacency matrix sequentially, MolGAN [27] predicts a discrete graph structure (the adjacency matrix) at once and utilizes a permutation-invariant discriminator to solve the node variant problem in the adjacency matrix. Besides, it applies a reward network for RL-based optimization towards desired chemical properties. In addition, [75] proposes constrained variational autoencoders to ensure the semantic validity of generated graphs. Moreover, GCPN [145] incorporates domain-specific rules through reinforcement learning.

Li et al. [73] propose a model that generates edges and nodes sequentially and utilizes a graph neural network to extract the hidden state of the current graph, which is used to decide the action in the next step of the sequential generative process.

Combinatorial optimization. Combinatorial optimization problems over graphs are a set of NP-hard problems that attract much attention from scientists in many fields. Some specific problems, such as the Traveling Salesman Problem (TSP), have various heuristic solutions. Recently, using deep neural networks to solve such problems has become a hot topic, and some of the solutions further leverage graph neural networks because of the graph structure of these problems.

Bello et al. [9] first propose a deep learning approach to tackle TSP. Their method consists of two parts: a Pointer Network [123] for parameterizing rewards and a policy gradient [108] module for training. This work has been shown to be comparable with traditional approaches. However, Pointer Networks are designed for sequential data like texts, while order-invariant encoders are more appropriate for such work.

Khalil et al. [57] and Kool and Welling [61] improve the above method by including graph neural networks. The former work first obtains the node embeddings from structure2vec [26] and then feeds them into a Q-learning module for making decisions. The latter builds an attention-based encoder-decoder system. By replacing the reinforcement learning module with an attention-based decoder, it is more efficient for training. These works achieve better performance than previous algorithms, which demonstrates the representation power of graph neural networks.

Nowak et al. [88] focus on the Quadratic Assignment Problem, i.e., measuring the similarity of two graphs. The GNN-based model learns node embeddings for each graph independently and matches them using an attention mechanism. Even in situations where traditional relaxation-based methods may not perform well, this model still shows satisfactory performance.

8.3.6.4 Example: GNNs for Fact Verification

Due to the rapid development of Information Extraction (IE), huge volumes of data have been extracted. How to automatically verify the data becomes a vital problem for various data-driven applications, e.g., knowledge graph completion [126] and open domain question answering [15]. Hence, many recent research efforts have been devoted to Fact Verification (FV), which aims to verify given claims with the evidence retrieved from plain text. More specifically, given a claim, an FV system is asked to label it as “SUPPORTED”, “REFUTED”, or “NOT ENOUGH INFO”, which indicates that the evidence can support, refute, or is not sufficient for the claim. An example of the FV task is shown in Table 8.5.

Table 8.5 A case of the claim that requires integrating multiple evidence to verify. The representation for evidence “{DocName, LineNum}” means the evidence is extracted from the document “DocName” and of which the line number is LineNum

Existing FV methods formulate FV as a Natural Language Inference (NLI) [3] task. However, they utilize simple evidence combination methods such as concatenating the evidence or just dealing with each evidence-claim pair. These methods are unable to grasp sufficient relational and logical information among the evidence. In fact, many claims require simultaneously integrating and reasoning over several pieces of evidence for verification. As shown in Table 8.5, for this particular example, we cannot verify the given claim by checking any piece of evidence in isolation. The claim can be verified only by understanding and reasoning over multiple pieces of evidence.

To integrate and reason over information from multiple pieces of evidence, [156] proposes a graph-based evidence aggregating and reasoning (GEAR) framework. Specifically, [156] first builds a fully connected evidence graph and encourages information propagation among the evidence. Then, GEAR aggregates the pieces of evidence and adopts a classifier to decide whether the evidence can support, refute, or is not sufficient for the claim. Intuitively, by sufficiently exchanging and reasoning over evidence information on the evidence graph, the proposed model can make the best of the information for verifying claims. For example, by delivering the information “Los Angeles County is the most populous county in the USA” to “the Rodney King riots occurred in Los Angeles County” through the evidence graph, the synthetic information can support “The Rodney King riots took place in the most populous county in the USA”. Furthermore, [156] adopts the pretrained language representation model BERT [29] to better grasp both evidence and claim semantics.

Fig. 8.15 The pipeline used in [156]. The GEAR framework is illustrated in the claim verification section

Zhou et al. [156] employ a three-step pipeline with components for document retrieval, sentence selection, and claim verification to solve the task. In the document retrieval and sentence selection stages, they simply follow the method from [44], since it has the highest score on evidence recall in the earlier FEVER shared task. They propose the GEAR framework for the final claim verification stage. The full pipeline is illustrated in Fig. 8.15.

Given a claim and the retrieved evidence, GEAR first utilizes a sentence encoder to obtain representations for the claim and the evidence. Then it builds a fully connected evidence graph and uses an Evidence Reasoning Network (ERNet) to propagate information among evidence and reason over the graph. Finally, it utilizes an evidence aggregator to infer the final results.

Sentence Encoder. Given an input sentence, GEAR employs BERT [29] as the sentence encoder by extracting the final hidden state of the \([{\text {CLS}}]\) token as the representation, where \([{\text {CLS}}]\) is the special classification token in BERT.

$$\begin{aligned} \begin{aligned} \textbf{e}_i&= \text {BERT}\left( e_i, c\right) , \\ \textbf{c}&= \text {BERT}\left( c\right) . \end{aligned} \end{aligned}$$
(8.105)

Evidence Reasoning Network. To encourage the information propagation among evidence, GEAR builds a fully connected evidence graph where each node indicates a piece of evidence. It also adds a self-loop to every node because each node needs the information from itself in the message propagation process. We use \(\textbf{h}^{(t)} = \{\textbf{h}_1^{(t)}, \textbf{h}_2^{(t)},\ldots , \textbf{h}_N^{(t)}\}\) to represent the hidden states of nodes at layer t. The initial hidden state of each evidence node \(\textbf{h}_i^{(0)}\) is initialized by the evidence representation: \(\textbf{h}_i^{(0)} = \textbf{e}_i\).

Inspired by recent work on semi-supervised graph learning and relational reasoning [59, 90, 122], Zhou et al. [156] propose an Evidence Reasoning Network (ERNet) to propagate information among the evidence nodes. It first uses an MLP to compute the attention coefficients between a node i and its neighbor j (\(j \in \mathscr {N}_i\)),

$$\begin{aligned} y_{ij} = \textbf{W}_1^{(t-1)} ( \text {ReLU} ( \textbf{W}_0^{(t-1)} [\textbf{h}_i^{(t-1)} ;\textbf{h}_j^{(t-1)} ] ) ), \end{aligned}$$
(8.106)

where \(\mathscr {N}_i\) denotes the set of neighbors of node i, \(\textbf{W}_0^{(t-1)}\) and \(\textbf{W}_1^{(t-1)}\) are weight matrices, and \([\cdot ;\cdot ]\) denotes concatenation operation.

Then, it normalizes the coefficients using the softmax function

$$\begin{aligned} \alpha _{ij} = {\text {Softmax}}_j(y_{ij}) = \frac{\text {exp}(y_{ij})}{\sum _{k \in \mathscr {N}_i} \text {exp}(y_{ik})}. \end{aligned}$$
(8.107)

Finally, the normalized attention coefficients are used to compute a linear combination of the neighbor features and thus we obtain the features for node i at layer t,

$$\begin{aligned} \textbf{h}_i^{(t)} = \sum _{j \in \mathscr {N}_i} \alpha _{ij} \textbf{h}_j^{(t-1)}. \end{aligned}$$
(8.108)

By stacking T layers of ERNet, [156] assumes that each piece of evidence can grasp enough information by communicating with the other pieces of evidence.
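Eqs. 8.106 to 8.108 and the stacking of layers can be sketched in a few lines. This is a dependency-free illustration under stated assumptions, not the implementation from [156]: the helper names (`ernet_layer`, `ernet`) are hypothetical, the weights are arbitrary placeholders, \(\textbf{W}_1\) is taken to be a single row so that \(y_{ij}\) is a scalar, and both layers reuse the same weights for brevity although the equations allow layer-specific ones.

```python
import math


def relu(x):
    return [max(0.0, xi) for xi in x]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def ernet_layer(h, W0, w1):
    """One ERNet layer on a fully connected graph with self-loops (N_i = all nodes)."""
    n, dim = len(h), len(h[0])
    new_h = []
    for i in range(n):
        scores = []
        for j in range(n):
            # Eq. 8.106: MLP on the concatenation [h_i; h_j]
            hidden = relu(matvec(W0, h[i] + h[j]))
            scores.append(sum(a * b for a, b in zip(w1, hidden)))
        alpha = softmax(scores)  # Eq. 8.107
        # Eq. 8.108: attention-weighted combination of neighbor states
        new_h.append([sum(alpha[j] * h[j][d] for j in range(n)) for d in range(dim)])
    return new_h


def ernet(h, layers):
    """Stack T layers; `layers` is a list of (W0, w1) pairs."""
    for W0, w1 in layers:
        h = ernet_layer(h, W0, w1)
    return h


# Three toy evidence vectors standing in for the encoder outputs e_i.
evidence = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
W0 = [[0.5, 0.0, 0.0, 0.5], [0.0, -0.5, 0.5, 0.0]]
w1 = [1.0, -1.0]
hidden = ernet(evidence, [(W0, w1), (W0, w1)])
print(hidden)
```

Because each new state is a convex combination of the previous states, every coordinate of the output stays within the range spanned by the input evidence vectors.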

Evidence Aggregator. Zhou et al. [156] employ an evidence aggregator to gather information from different evidence nodes and obtain the final hidden state \(\textbf{o}\). The aggregator may utilize different aggregating strategies, and [156] suggests three aggregators:

Attention Aggregator. Zhou et al. [156] use the representation of the claim \(\textbf{c}\) to attend to the hidden states of evidence and get the final aggregated state \(\textbf{o}\).

$$\begin{aligned} \begin{aligned} y_{j}&= \textbf{W}_1^{\prime } ( \text {ReLU} ( \textbf{W}_0^{\prime } [ \textbf{c} ; \textbf{h}_j^{(T)} ] ) ), \\ \alpha _{j}&= \text {Softmax}(y_j) = \frac{\text {exp}(y_{j})}{\sum _{k=1}^N \text {exp}(y_{k})}, \\ \textbf{o}&= \sum _{k=1}^N \alpha _{k} \textbf{h}_k^{(T)}. \end{aligned} \end{aligned}$$
(8.109)
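A minimal sketch of the attention aggregator in Eq. 8.109, assuming plain Python lists, a hypothetical function name `attention_aggregate`, and hand-picked weights chosen so that the first coordinate of each evidence state dominates the score. The final-layer evidence states are passed in as `h`.

```python
import math


def relu(x):
    return [max(0.0, xi) for xi in x]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention_aggregate(c, h, W0, w1):
    """Attend from the claim vector c to the final-layer evidence states h."""
    # y_j = W1' ReLU(W0' [c; h_j]); list "+" is concatenation
    scores = [
        sum(a * b for a, b in zip(w1, relu(matvec(W0, c + h_j)))) for h_j in h
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [x / total for x in exps]
    # o = sum_k alpha_k h_k
    o = [sum(a * h_k[d] for a, h_k in zip(alpha, h)) for d in range(len(h[0]))]
    return o, alpha


claim = [1.0, 0.0]
states = [[1.0, 0.0], [0.0, 1.0]]
W0 = [[0.0, 0.0, 10.0, 0.0]]  # placeholder weights: score depends on h_j[0] only
w1 = [1.0]
o, alpha = attention_aggregate(claim, states, W0, w1)
print(o, alpha)
```

With these placeholder weights almost all attention mass falls on the first evidence state, so the aggregated state is close to that state.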

Max Aggregator. The max aggregator performs the element-wise max operation among hidden states.

$$\begin{aligned} \textbf{o} = \text {Max} (\textbf{h}_1^{(T)}, \textbf{h}_2^{(T)},\ldots , \textbf{h}_N^{(T)}). \end{aligned}$$
(8.110)

Mean Aggregator. The mean aggregator performs the element-wise mean operation among hidden states.

$$\begin{aligned} \textbf{o} = \text {Mean} (\textbf{h}_1^{(T)}, \textbf{h}_2^{(T)},\ldots , \textbf{h}_N^{(T)}). \end{aligned}$$
(8.111)
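The max and mean aggregators in Eqs. 8.110 and 8.111 reduce the evidence states coordinate by coordinate. A small sketch with hypothetical function names and made-up states:

```python
def max_aggregate(h):
    """Element-wise maximum over the final-layer evidence states."""
    return [max(column) for column in zip(*h)]


def mean_aggregate(h):
    """Element-wise mean over the final-layer evidence states."""
    return [sum(column) / len(column) for column in zip(*h)]


states = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]
print(max_aggregate(states))   # [3.0, 5.0]
print(mean_aggregate(states))  # [2.0, 1.0]
```

Unlike the attention aggregator, neither of these looks at the claim representation, so the claim only influences the result through the sentence encoder.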

Once the final state \(\textbf{o}\) is obtained, GEAR employs a one-layer MLP to get the final prediction \(\textbf{l}\).

$$\begin{aligned} \textbf{l} = {\text {Softmax}} (\text {ReLU}( \textbf{W} \textbf{o} +\textbf{b})), \end{aligned}$$
(8.112)

where \(\textbf{W}\) and \(\textbf{b}\) are parameters.
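Eq. 8.112 can be sketched as follows, applying ReLU before the softmax exactly as the equation is written. The function name `predict`, the weights, and the ordering of the three labels are assumptions made for illustration only.

```python
import math

# The label order is an assumption of this sketch.
LABELS = ("SUPPORTED", "REFUTED", "NOT ENOUGH INFO")


def predict(o, W, b):
    """l = Softmax(ReLU(W o + b)) for an aggregated state o."""
    pre = [sum(w * x for w, x in zip(row, o)) + b_i for row, b_i in zip(W, b)]
    act = [max(0.0, p) for p in pre]
    m = max(act)
    exps = [math.exp(a - m) for a in act]
    total = sum(exps)
    return [x / total for x in exps]


o = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # placeholder parameters
b = [0.0, 0.0, 0.0]
probs = predict(o, W, b)
label = LABELS[probs.index(max(probs))]
print(probs, label)
```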

Zhou et al. [156] conduct experiments on the large-scale benchmark dataset for Fact Extraction and VERification (FEVER) [115]. Experimental results show that the proposed framework outperforms recent state-of-the-art baseline systems. The further case study indicates that the framework could better leverage multi-evidence information and reason over the evidence for FV.

8.4 Summary

In this chapter, we have introduced network representation learning, which maps network structure information into a continuous vector space and makes deep learning techniques applicable to network data.

Unsupervised network representation learning comes first during the development of NRL. Spectral Clustering, DeepWalk, LINE, GraRep, and other methods utilize the network structure for vertex embedding learning. Afterward, TADW incorporates text information into NRL under the framework of matrix factorization. The NEU algorithm then moves one step forward and proposes a general method to improve the quality of any learned network embeddings. Other unsupervised methods also consider preserving specific properties of the network topology, e.g., community and asymmetry.

Recently, semi-supervised NRL algorithms have attracted much attention. Methods of this kind focus on a specific task such as classification and use the labels of the training set to improve the quality of network embeddings. Node2vec, MMDW, and many other methods, including the family of Graph Neural Networks (GNNs), are proposed to this end. Semi-supervised algorithms can achieve better results as they can take advantage of more information from the specific task.

For a further understanding of network representation learning, readers can find more related papers in the paper list at https://github.com/thunlp/GNNPapers. There are also some recommended surveys and books, including the following:

  • Cui et al. A survey on network embedding [24].

  • Goyal and Ferrara. Graph embedding techniques, applications, and performance: A survey [37].

  • Zhang et al. Network representation learning: A survey [149].

  • Wu et al. A comprehensive survey on graph neural networks [133].

  • Zhou et al. Graph neural networks: A review of methods and applications [155].

  • Zhang et al. Deep learning on graphs: A survey [154].

For better network representation learning in the future, several directions require further effort:

(1) More Complex and Realistic Networks. An intriguing direction is representation learning on heterogeneous and dynamic networks, a category into which most real-world network data fall. The vertices and edges in a heterogeneous network may belong to different types. Networks in real life are also highly dynamic, e.g., friendships between Facebook users may be established and dissolved. These characteristics require researchers to design specific algorithms for them. Network embedding learning on dynamic network structures is, therefore, an important task. Some works [14, 105] have been proposed for such complex and realistic settings.

(2) Deeper Model Architectures. Conventional deep neural networks can stack hundreds of layers to get better performance because the deeper structure has more parameters and may improve the expressive power significantly. However, NRL and GNN models are usually shallow. In fact, most of them have no more than three layers. Taking GCN as an example, as experiments in [70] show, stacking multiple GCN layers results in over-smoothing: the representations of all vertices converge to the same value. Although some researchers have managed to tackle this problem [70, 125] to some extent, it remains a limitation of NRL. Designing deeper model architectures is an exciting challenge for future research and will be a considerable contribution to the understanding of NRL.

(3) Scalability. Scalability determines whether an algorithm can be applied in practice. How to apply NRL methods in real-world web-scale scenarios such as social networks or recommendation systems has been an essential problem for most network embedding algorithms. Scaling up NRL methods, especially GNNs, is difficult because many core steps are computationally expensive in a big data environment. For example, network data are not regular Euclidean data, and each node has its own neighborhood structure. Therefore, batch tricks cannot be easily applied. Moreover, computing the graph Laplacian is also infeasible when there are millions or even billions of nodes and edges. Several works have proposed solutions to this problem [143, 153, 157], and we are paying close attention to the progress.
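The over-smoothing effect mentioned in direction (2) can be illustrated numerically. The sketch below is a simplification rather than a full GCN: it assumes a parameter-free propagation step in which every node averages its own features with those of its neighbors (no weight matrices, no nonlinearity), on a hypothetical 4-node path graph with made-up features. The function names `propagate` and `spread` are introduced only for this illustration.

```python
def propagate(adj, x):
    """One propagation step: each node averages itself and its neighbors."""
    dim = len(x[0])
    out = []
    for i, neighbors in enumerate(adj):
        group = [i] + neighbors  # self-loop plus neighbors
        out.append([sum(x[j][d] for j in group) / len(group) for d in range(dim)])
    return out


def spread(x):
    """Largest per-dimension gap between any two node representations."""
    return max(max(column) - min(column) for column in zip(*x))


adj = [[1], [0, 2], [1, 3], [2]]  # path graph 0 - 1 - 2 - 3
x = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [-1.0, 2.0]]

gaps = [spread(x)]
for _ in range(50):
    x = propagate(adj, x)
    gaps.append(spread(x))

# The gap between node representations shrinks toward zero as layers are stacked.
print(gaps[0], gaps[2], gaps[-1])
```

Under this simplified propagation rule the node representations become numerically indistinguishable after a few dozen steps, which is the behavior that makes naively deep stacks uninformative.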